It is widely known that convolutional neural networks (CNNs) are vulnerable
to adversarial examples: images with imperceptible perturbations crafted to
fool classifiers. However, interpretability of these perturbations is less
explored in the literature. This work aims to better understand the roles of
adversarial perturbations and provide visual explanations from pixel, image and
network perspectives. We show that adversaries have a promotion-suppression
effect (PSE) on neurons' activations and can be primarily categorized into
three types: i) suppression-dominated perturbations that mainly reduce the
classification score of the true label, ii) promotion-dominated perturbations
that focus on boosting the confidence of the target label, and iii) balanced
perturbations that play a dual role in suppression and promotion. We also
provide image-level interpretability of adversarial examples, linking the PSE
of pixel-level perturbations to class-specific discriminative image regions
localized by class activation mapping (Zhou et al. 2016). Further, we examine
the adversarial effect through network dissection (Bau et al. 2017), which
offers concept-level interpretability of hidden units. We show that there
exists a tight connection between the units' sensitivity to adversarial attacks
and their interpretability on semantic concepts. Lastly, we provide some new
insights from our interpretation to improve the adversarial robustness of
networks.
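
The three perturbation types named in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `categorize_perturbation`, the use of raw class scores, and the `ratio` threshold are all assumptions made here for illustration. The sketch compares how far the true-label score drops with how far the target-label score rises, and labels the perturbation by whichever effect dominates.

```python
def categorize_perturbation(clean_scores, adv_scores, true_label, target_label, ratio=2.0):
    """Label a perturbation by its promotion-suppression balance.

    Hypothetical helper: `ratio` is an illustrative threshold, not a value
    taken from the paper. Scores are per-class classifier outputs (e.g. logits).
    """
    # Suppression: how much the true-label score falls (clamped at zero).
    suppression = max(clean_scores[true_label] - adv_scores[true_label], 0.0)
    # Promotion: how much the target-label score rises (clamped at zero).
    promotion = max(adv_scores[target_label] - clean_scores[target_label], 0.0)

    if suppression > ratio * promotion:
        return "suppression-dominated"
    if promotion > ratio * suppression:
        return "promotion-dominated"
    return "balanced"


# True label 0 loses 4.0 while target label 1 gains only 0.5.
label = categorize_perturbation([5.0, 1.0, 0.0], [1.0, 1.5, 0.0], 0, 1)
print(label)  # suppression-dominated
```

A ratio of 2.0 means one effect must be more than twice the other to count as dominant; any other cutoff would serve equally well for illustration.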
Details
Title
Interpreting Adversarial Examples by Activation Promotion and Suppression
Creators
Kaidi Xu
Sijia Liu
Gaoyuan Zhang
Mengshu Sun
Pu Zhao
Quanfu Fan
Chuang Gan
Xue Lin
Publication Details
arXiv.org
Resource Type
Preprint
Language
English
Academic Unit
Computer Science (Computing)
Other Identifier
991021871454604721