DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Pucheng Dang; Xing Hu; Dong Li; Rui Zhang; Qi Guo; Kaidi Xu

doi:10.48550/arxiv.2408.11071

Back

DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Preprint

Open access

DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo and Kaidi Xu

arXiv.org

18 Aug 2024

DOI: https://doi.org/10.48550/arxiv.2408.11071

Files and links (1)

url

https://arxiv.org/abs/2408.11071View

Preprint (Author's original)arXiv.org - Non-exclusive license to distribute, Open

Abstract

Computer Science - Artificial Intelligence

Computer Science - Computer Vision and Pattern Recognition

Computer Science - Cryptography and Security

Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of \textit{purely black-box} attacks without prior knowledge of the T2l model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2l models.

Metrics

9 Record Views

Details

Title: DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
Creators: Pucheng Dang
Xing Hu
Dong Li
Rui Zhang
Qi Guo
Kaidi Xu
Publication Details: arXiv.org
Resource Type: Preprint
Language: English
Academic Unit: Computer Science
Other Identifier: 991021899309104721

DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Files and links (1)

Abstract

Metrics

Details

Drexel University Social media