VQA-LOL: Visual Question Answering Under the Lens of Logic

dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Banerjee, Pratyay
dc.contributor.author: Baral, Chitta
dc.contributor.author: Yang, Yezhou
dc.date.accessioned: 2025-06-05T14:03:19Z
dc.date.available: 2025-06-05T14:03:19Z
dc.date.issued: 2020-11-12
dc.description: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI
dc.description.abstract: Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty in correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding. (See the illustrative sketch after the metadata fields below.)
dc.description.sponsorship: Support from NSF Robust Intelligence Program (1816039 and 1750082), DARPA (W911NF2020006) and ONR (N00014-20-1-2332) is gratefully acknowledged.
dc.description.uri: https://link.springer.com/chapter/10.1007/978-3-030-58589-1_23
dc.format.extent: 17 pages
dc.genre: conference papers and proceedings
dc.genre: book chapters
dc.genre: postprints
dc.identifier: doi:10.13016/m2kjdw-v7vz
dc.identifier.citation: Gokhale, Tejas, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. “VQA-LOL: Visual Question Answering Under the Lens of Logic.” Edited by Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm. Computer Vision – ECCV 2020, 2020, 379–96. https://doi.org/10.1007/978-3-030-58589-1_23.
dc.identifier.uri: https://doi.org/10.1007/978-3-030-58589-1_23
dc.identifier.uri: http://hdl.handle.net/11603/38689
dc.language.iso: en_US
dc.publisher: Springer Nature
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: Visual question answering
dc.subject: Logical robustness
dc.title: VQA-LOL: Visual Question Answering Under the Lens of Logic
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804
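
The abstract describes two concrete mechanisms: ground-truth answers for logically composed closed (yes/no) questions, and a Fréchet-Compatibility Loss that keeps the composed answer consistent with the component answers. The following Python is a minimal illustrative sketch, not the authors' implementation: the function names and the hinge form of the penalty are assumptions, while the Fréchet inequalities themselves are the standard probability bounds the loss is named after.

# Minimal sketch, not the authors' code: ground-truth answers for
# logically composed yes/no questions, and Frechet bounds used to
# penalize predictions inconsistent with the component answers.

def compose_answer(op, a1, a2=None):
    """Ground-truth answer of a composed question from its components."""
    if op == "not":          # negation: "Is the man NOT wearing a hat?"
        return not a1
    if op == "and":          # conjunction of two component questions
        return a1 and a2
    if op == "or":           # disjunction of two component questions
        return a1 or a2
    raise ValueError(f"unknown connective: {op}")

def frechet_bounds(op, p1, p2):
    """Frechet inequalities bounding P(Q1 op Q2) by the marginals."""
    if op == "and":
        return max(0.0, p1 + p2 - 1.0), min(p1, p2)
    if op == "or":
        return max(p1, p2), min(1.0, p1 + p2)
    raise ValueError(f"unknown connective: {op}")

def compatibility_penalty(op, p1, p2, p_composed):
    """Hinge-style penalty: zero when the predicted probability of the
    composed question lies inside the Frechet interval, linear outside.
    (Illustrative form only; the exact loss is defined in the paper.)"""
    lo, hi = frechet_bounds(op, p1, p2)
    return max(0.0, lo - p_composed) + max(0.0, p_composed - hi)

# Example: with p("wearing a hat") = 0.9 and p("raining") = 0.2, the
# conjunction must score in [0.1, 0.2]; predicting 0.6 costs 0.4.
assert compose_answer("and", True, False) is False
print(compatibility_penalty("and", 0.9, 0.2, 0.6))  # 0.4

Under these assumptions, a composed question such as "Is the man wearing a hat and is it raining?" gets its label from compose_answer, and the penalty term pushes the model's probability for the composed question into the interval implied by its answers to the component questions.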

Files

Original bundle

Name: 123660375.pdf
Size: 1.82 MB
Format: Adobe Portable Document Format