Deep Comprehension of Visual Stories through Summarization and Question Answering
Author/Creator
Author/Creator ORCID
Date
Type of Work
Department
Computer Science and Electrical Engineering
Program
Computer Science
Citation of Original Publication
Rights
Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Abstract
Reasoning that requires joint modeling of images and text is gaining importance, with applications in multiple research areas such as image captioning, visual concept learning, and question answering. In this thesis, we propose reasoning across a sequence of coherent images. This is conceptually different from reasoning over a single image, as it requires assessing and linking information across different images. This linking goes beyond general descriptions or captions of the images: it relies on complex narratives that describe the situations more like scenes in a story. We therefore propose a novel task of Visual Comprehension, which reasons across multiple related images using narratives written to broadly describe what is occurring in them. We focus on different reasoning aspects, from identifying the core concepts of the image sequences and stories in the form of concise summaries, to gaining detailed information about different facets of the image sequences and stories through complex question answering. We develop a new dataset for this purpose by crowdsourcing one-line summaries and question-answer pairs based on sequences of five images and their corresponding visual stories. Summarization is evaluated as a neural machine translation task, producing generations driven mostly by the stories rather than the images, whereas question answering is evaluated as K-class classification, producing predictions driven more by the images, although the stories do not hurt. Overall, visual stories prove helpful for reasoning across multiple images. Thus, we propose a new task involving reasoning across a sequence of images and a short accompanying story through summarization and question answering.
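
The abstract frames summarization as generation (in the style of neural machine translation) and question answering as K-class classification. As a rough illustrative sketch, and not the thesis code, the snippet below shows one way the question-answering side could be posed as K-class classification over fused image and story features; the feature dimensions, the fusion architecture, and the number of answer classes are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StoryQAClassifier(nn.Module):
    """Toy K-class answer classifier: fuse pooled image-sequence features with a
    story-text encoding and score K candidate answers (all sizes are illustrative)."""
    def __init__(self, img_dim=2048, story_dim=768, hidden=512, num_classes=5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + story_dim, hidden),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, img_feats, story_feats):
        # img_feats: (batch, img_dim) pooled over the 5-image sequence
        # story_feats: (batch, story_dim) encoding of the visual story text
        fused = self.fuse(torch.cat([img_feats, story_feats], dim=-1))
        return self.classifier(fused)  # logits over the K candidate answers

# Usage: score a batch of 2 examples and pick the highest-scoring answer class.
model = StoryQAClassifier()
logits = model(torch.randn(2, 2048), torch.randn(2, 768))
pred = logits.argmax(dim=-1)
```

Framing the task this way makes it easy to compare image-only, story-only, and combined inputs, which is the kind of comparison the abstract reports (image features dominate QA, while stories still help).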
