Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models

Hao Cheng; Erjia Xiao; Jindong Gu; Le Yang; Jinhao Duan; Jize Zhang; Jiahang Cao; Kaidi Xu; Renjing Xu

doi:10.1007/978-3-031-73202-7_11

Book chapter

Computer Vision – ECCV 2024, v 15117, pp 179-196

21 Nov 2024

Featured in Collection: UN Sustainable Development Goals @ Drexel

Keywords: Attention; Typographic Attack; Vision-Language Model

Abstract

Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which are known to disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), are also expected to pose a security threat to LVLMs. First, we verify typographic attacks on well-known commercial and open-source LVLMs and find that the threat is widespread. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The dataset evaluates typographic attacks across various multi-modal tasks and also measures how their effect varies with the factors used to generate the typographic text. Based on the evaluation results, we investigate why typographic attacks impact VLMs and LVLMs, which leads to three insightful discoveries. While further validating these discoveries, we reduce the performance degradation caused by typographic attacks from 42.07% to 13.90%. Code and dataset are available at https://github.com/ChaduCheng/TypoDeceptions.

Details

Title: Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
Creators: Hao Cheng; Erjia Xiao; Jindong Gu; Le Yang; Jinhao Duan; Jize Zhang; Jiahang Cao; Kaidi Xu; Renjing Xu
Contributors: Aleš Leonardis (Editor); Elisa Ricci (Editor); Stefan Roth (Editor); Olga Russakovsky (Editor); Torsten Sattler (Editor); Gül Varol (Editor)
Publication Details: Computer Vision – ECCV 2024, v 15117, pp 179-196
Series: Lecture Notes in Computer Science
Publisher: Springer Nature Switzerland; Cham
Number of pages: 18
Resource Type: Book chapter
Language: English
Academic Unit: Computer Science (Computing)
Web of Science ID: WOS:001401048900011
Scopus ID: 2-s2.0-85210866393
Other Identifier: 991021965471404721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types: Domestic collaboration; International collaboration
Web of Science research areas: Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods