Analyzing the Sensitivity of Vision Language Models in Visual Question Answering

dc.contributor.author: Shah, Monika
dc.contributor.author: Balaji, Sudarshan
dc.contributor.author: Sarkhel, Somdeb
dc.contributor.author: Dey, Sanorita
dc.contributor.author: Venugopal, Deepak
dc.date.accessioned: 2025-08-13T20:14:40Z
dc.date.issued: 2025-07-28
dc.description.abstract: We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of the cooperative principles of conversation proposed by Grice. Even when Grice's maxims of conversation are flouted, humans typically have little difficulty understanding the conversation, though it requires more cognitive effort. We study whether VLMs can handle violations of Grice's maxims in a manner similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the responses of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash, on questions from the VQA v2.0 dataset. Our initial results indicate that the performance of VLMs consistently diminishes with the addition of modifiers, suggesting that our approach is a promising direction for understanding the limitations of VLMs.
dc.description.sponsorship: This research was supported by NSF award #2008812, and awards from the Gates Foundation and Adobe. The opinions, findings, and results are solely the authors’ and do not reflect those of the funding agencies.
dc.description.uri: http://arxiv.org/abs/2507.21335
dc.format.extent: 8 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2dela-wzva
dc.identifier.uri: https://doi.org/10.48550/arXiv.2507.21335
dc.identifier.uri: http://hdl.handle.net/11603/39803
dc.language.iso: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: UMBC Ebiquity Research Group
dc.subject: Computer Science - Computer Vision and Pattern Recognition
dc.title: Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
dc.type: Text
dcterms.creator: https://orcid.org/0000-0003-3346-5886

Files

Original bundle

Name:
2507.21335v1.pdf
Size:
1.04 MB
Format:
Adobe Portable Document Format