Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model

Hao Cheng; Erjia Xiao; Jindong Gu; Le Yang; Jinhao Duan; Jize Zhang; Jiahang Cao; Kaidi Xu; Renjing Xu

doi:10.48550/arxiv.2402.19150

Preprint

Open access

arXiv.org

21 Mar 2024

DOI: https://doi.org/10.48550/arxiv.2402.19150

Preprint (Author's original); arXiv.org - Non-exclusive license to distribute; Open

Abstract

Computer Science - Computer Vision and Pattern Recognition

Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, the Typographic Attack, which disrupts vision-language models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), is also expected to pose a security threat to LVLMs. First, we verify typographic attacks on current well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only evaluates typographic attacks across various multi-modal tasks but also evaluates how their effects vary with text generated under diverse factors. Based on the evaluation results, we investigate why typographic attacks may impact VLMs and LVLMs, leading to three highly insightful discoveries. Through examination of our discoveries and experimental validation on the Typographic Dataset, we reduce the performance degradation from $42.07\%$ to $13.90\%$ when LVLMs confront typographic attacks.
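
To put the two figures quoted in the abstract in perspective, the following is a minimal sketch of the arithmetic, not anything taken from the paper itself. It assumes that both values are performance-degradation percentages measured on the same scale; the function name `degradation_reduction` is hypothetical and introduced only for illustration.

```python
def degradation_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Compare two degradation percentages.

    Returns the absolute reduction in percentage points and the
    relative reduction as a percentage of the original degradation.
    Assumes both inputs are measured on the same scale.
    """
    absolute_pp = before_pct - after_pct
    relative_pct = absolute_pp / before_pct * 100.0
    return absolute_pp, relative_pct


# Figures as reported in the abstract.
absolute_pp, relative_pct = degradation_reduction(42.07, 13.90)
print(f"absolute reduction: {absolute_pp:.2f} percentage points")
print(f"relative reduction: {relative_pct:.1f}% of the original degradation")
```

Under that assumption, the reported change corresponds to a drop of 28.17 percentage points, or roughly two thirds of the original degradation.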

Details

Title: Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
Creators: Hao Cheng; Erjia Xiao; Jindong Gu; Le Yang; Jinhao Duan; Jize Zhang; Jiahang Cao; Kaidi Xu; Renjing Xu
Publication Details: arXiv.org
Resource Type: Preprint
Language: English
Academic Unit: Computer Science (Computing)
Other Identifier: 991021871351704721
