Computer Science - Computer Vision and Pattern Recognition
Large Vision-Language Models (LVLMs) rely on vision encoders and Large
Language Models (LLMs) to exhibit remarkable capabilities on various
multi-modal tasks in the joint space of vision and language. However,
typographic attacks, which are known to disrupt vision-language models (VLMs)
such as Contrastive Language-Image Pretraining (CLIP), are also expected to
pose a security threat to LVLMs. Firstly, we verify typographic attacks on current
well-known commercial and open-source LVLMs and uncover the widespread
existence of this threat. Secondly, to better assess this vulnerability, we
propose the most comprehensive and largest-scale Typographic Dataset to date.
The Typographic Dataset not only evaluates typographic attacks across various
multi-modal tasks but also examines how their effects vary with typographic
texts generated under diverse factors. Based on the evaluation results, we
investigate why typographic attacks may impact VLMs and LVLMs, leading to
three highly insightful discoveries. By examining these discoveries and
validating them experimentally on the Typographic Dataset, we reduce the
performance degradation from $42.07\%$ to $13.90\%$ when LVLMs confront
typographic attacks.
Details
Title
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
Creators
Hao Cheng
Erjia Xiao
Jindong Gu
Le Yang
Jinhao Duan
Jize Zhang
Jiahang Cao
Kaidi Xu
Renjing Xu
Publication Details
arXiv.org
Resource Type
Preprint
Language
English
Academic Unit
Computer Science (Computing)
Other Identifier
991021871351704721