Attention correction mechanism of visual contexts in visual question answering

Date

2018-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may be obtained via Interlibrary Loan through a local library, pending the author/copyright holder's permission.
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

Current visual question answering (VQA) systems are augmented with attention mechanisms to answer questions about an image or to describe the objects in it. Before the advent of attention, VQA systems were trained on a combination of image feature vectors and question and answer embeddings. Attention mechanisms such as stacked attention networks and hierarchical co-attention help determine which parts of the image to attend to, but they place little emphasis on correcting that attention. We propose a mechanism for correcting visual attention using the saliency of the image regions being attended to. Specifically, we study how human gaze shifts over an image and use this signal to improve the generated attention by introducing an auxiliary loss into a standard stacked attention network pipeline. For this mechanism, we use the VQA-HAT dataset, a large-scale collection of images annotated with the regions humans explored while answering questions, and we use this dataset to supervise and further augment the attention.
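The abstract describes the mechanism at a high level: an auxiliary loss that pulls the stacked attention network's predicted attention distribution toward the human gaze map from VQA-HAT. The thesis's exact formulation is not reproduced in this record, so the following is a minimal PyTorch sketch of one plausible version. The KL-divergence choice, the weighting factor `lam`, and the names `attention_correction_loss` and `total_loss` are illustrative assumptions, not the author's verified implementation; it also assumes the HAT heat map has been resized and flattened to match the model's image-region grid.

```python
import torch
import torch.nn.functional as F

def attention_correction_loss(model_attn, human_attn, eps=1e-8):
    """KL divergence between predicted attention and a human gaze map.

    model_attn: (batch, regions) softmax attention from the stacked
        attention network.
    human_attn: (batch, regions) VQA-HAT heat map, assumed already
        aligned to the same region grid; normalized here to sum to 1.
    """
    human_attn = human_attn / (human_attn.sum(dim=1, keepdim=True) + eps)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div((model_attn + eps).log(), human_attn, reduction="batchmean")

def total_loss(answer_logits, answer_targets, model_attn, human_attn, lam=0.5):
    # Standard VQA answer-classification loss plus the auxiliary
    # attention-correction term; lam is an assumed trade-off weight.
    ce = F.cross_entropy(answer_logits, answer_targets)
    aux = attention_correction_loss(model_attn, human_attn)
    return ce + lam * aux

if __name__ == "__main__":
    B, R, A = 4, 196, 1000  # batch size, 14x14 region grid, answer vocab
    attn = torch.softmax(torch.randn(B, R), dim=1)
    hat = torch.rand(B, R)  # stand-in for a flattened VQA-HAT heat map
    logits = torch.randn(B, A)
    targets = torch.randint(0, A, (B,))
    print(total_loss(logits, targets, attn, hat).item())
```

Since both distributions live over the same region grid, the auxiliary term backpropagates directly through the attention softmax, nudging the network toward the regions humans actually looked at without changing the answer-classification head.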