As Large Language Models (LLMs) are widely deployed, ensuring their safety and
preventing harmful responses has become a significant concern. While current
safety-alignment methods based on instruction
fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can
effectively reduce harmful responses from LLMs, they often require high-quality
datasets and incur heavy computational overhead during model training. Another
way to align language models is to modify the token logits in model outputs
without heavy training. Recent studies have shown that contrastive decoding can enhance
the performance of language models by reducing the likelihood of confused
tokens. However, these methods require the manual selection of contrastive
models or instruction templates. To this end, we propose Adversarial
Contrastive Decoding (ACD), an optimization-based framework to generate two
opposite system prompts for prompt-based contrastive decoding. ACD requires only
lightweight prompt tuning on a small anchor dataset (under 3 minutes per model)
and no training of the target model. Experiments on a wide range of models and
benchmarks demonstrate that the proposed method achieves much better safety
performance than previous training-free decoding methods without sacrificing
the models' original generation ability.
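
To make the decoding step concrete, the following is a minimal sketch of prompt-based contrastive decoding on a single next-token distribution. It is not the authors' implementation: the abstract does not give the rule for combining the two distributions, so the subtraction used here, the weight `alpha`, the function names, and the toy logit values are all assumptions chosen for illustration. In practice the two logit vectors would come from running the same target model under the two opposite system prompts.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_logits(safe_logits, adversarial_logits, alpha=0.5):
    """Score tokens by contrasting two prompts (hypothetical combining rule).

    Tokens that the adversarial prompt makes likely are penalized, so the
    result favors tokens preferred under the safe prompt only.
    """
    if len(safe_logits) != len(adversarial_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    safe_lp = log_softmax(safe_logits)
    adv_lp = log_softmax(adversarial_logits)
    return [s - alpha * a for s, a in zip(safe_lp, adv_lp)]


def greedy_token(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary: index 0 = refusal token, 1 = harmful token, 2 = neutral token.
safe = [2.0, 2.1, 0.5]         # made-up logits under the safe system prompt
adversarial = [0.0, 3.0, 0.5]  # made-up logits under the opposite system prompt

print(greedy_token(safe))                                   # 1
print(greedy_token(contrastive_logits(safe, adversarial)))  # 0
```

In this toy case the safe prompt alone still narrowly prefers the harmful token, while the contrast with the adversarial prompt shifts the greedy choice to the refusal token. Only the logits are modified, so the target model's weights are untouched.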
Details
Title
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Creators
Zhengyue Zhao
Xiaoyun Zhang
Kaidi Xu
Xing Hu
Rui Zhang
Zidong Du
Qi Guo
Yunji Chen
Publication Details
arXiv.org
Resource Type
Preprint
Language
English
Academic Unit
Computer Science (Computing)
Other Identifier
991021889464404721