Fooling Network Interpretation in Image Classification
Author/Creator
Subramanya, Akshayvarun
Pillai, Vipin
Pirsiavash, Hamed
Date
2019-09-24
Citation of Original Publication
Subramanya, Akshayvarun; Pillai, Vipin; Pirsiavash, Hamed; Fooling Network Interpretation in Image Classification; Computer Vision and Pattern Recognition (2019); https://arxiv.org/abs/1812.02843
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Subjects
deep neural networks
algorithms
adversarial patches
misclassification
Abstract
Deep neural networks have been shown to be fooled rather easily by adversarial attack algorithms. Practical methods such as adversarial patches are extremely effective at causing misclassification. However, these patches are highlighted by standard network interpretation algorithms, thus revealing the identity of the adversary. We show that it is possible to create adversarial patches that not only fool the prediction but also change the interpretation of what caused that prediction. Moreover, we introduce our attack as a controlled setting for measuring the accuracy of interpretation algorithms. We demonstrate this with extensive experiments on Grad-CAM interpretation, and the attack transfers to occluding-patch interpretation as well. We believe our algorithms can facilitate the development of more robust network interpretation tools that truly explain the network's underlying decision-making process.
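The core idea described above can be sketched in code. The following is a minimal, hypothetical PyTorch illustration (not the authors' implementation): an adversarial patch is optimized with a two-term loss that forces a chosen target class while also suppressing Grad-CAM energy over the patch region, so the interpretation points away from the patch. The tiny network, image size, patch placement, and loss weighting are all illustrative assumptions.

```python
# Hypothetical sketch of a patch attack that also fools Grad-CAM.
# TinyNet, the 16x16 image, and the top-left 6x6 patch are illustrative
# stand-ins, not the paper's actual models or settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Stand-in classifier: conv features -> global average pool -> linear head."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(8, n_classes)

    def forward(self, x):
        f = self.features(x)                      # (B, 8, H, W) feature maps
        return self.head(f.mean(dim=(2, 3))), f

def grad_cam(logits, feats, cls):
    """Standard Grad-CAM: weight feature maps by pooled class-score gradients."""
    grads = torch.autograd.grad(logits[0, cls], feats, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)
    return F.relu((weights * feats).sum(dim=1))   # (B, H, W) heatmap

model = TinyNet().eval()
for p in model.parameters():                      # only the patch is optimized
    p.requires_grad_(False)

image = torch.rand(1, 3, 16, 16)
patch = torch.full((1, 3, 6, 6), 0.5, requires_grad=True)
target = 2
mask = torch.zeros(1, 16, 16)
mask[:, :6, :6] = 1.0                             # patch placed at the top-left

opt = torch.optim.Adam([patch], lr=0.1)
for _ in range(200):
    x = image.clone()
    x[:, :, :6, :6] = patch.clamp(0, 1)           # paste patch into the image
    logits, feats = model(x)
    cam = grad_cam(logits, feats, target)
    cam = cam / (cam.max() + 1e-8)
    # Term 1: fool the prediction toward the target class.
    # Term 2: keep normalized Grad-CAM energy off the patch region.
    loss = F.cross_entropy(logits, torch.tensor([target])) + (cam * mask).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

x = image.clone()
x[:, :, :6, :6] = patch.detach().clamp(0, 1)
logits, _ = model(x)
final_ce = F.cross_entropy(logits, torch.tensor([target])).item()
```

A robust interpretation method would still highlight the patch region after this optimization; the paper's point is that Grad-CAM-style explanations can be steered away from it, which is why such an attack doubles as a controlled benchmark for interpretation accuracy.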