Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

dc.contributor.author: Fang, Zhiyuan
dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Banerjee, Pratyay
dc.contributor.author: Baral, Chitta
dc.contributor.author: Yang, Yezhou
dc.date.accessioned: 2025-06-05T14:03:18Z
dc.date.available: 2025-06-05T14:03:18Z
dc.date.issued: 2020-11
dc.description: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
dc.description.abstract: Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes, such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding tasks such as captioning videos or answering questions about them, one must understand these commonsense aspects. We present the first work on generating commonsense captions directly from videos, describing latent aspects such as intentions, effects, and attributes. We present a new dataset, “Video-to-Commonsense (V2C)”, that contains ~9k videos of human agents performing various actions, annotated with three types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
dc.description.sponsorship: The authors acknowledge support from the NSF Robust Intelligence Program project #1816039, the DARPA KAIROS program (LESTAT project), the DARPA SAIL-ON program, and ONR award N00014-20-1-2332. ZF, TG, and YY thank the organizers and the participants of the Telluride Neuromorphic Cognition Workshop, especially the Machine Common Sense (MCS) group.
dc.description.uri: https://aclanthology.org/2020.emnlp-main.61/
dc.format.extent: 21 pages
dc.genre: conference papers and proceedings
dc.identifier: doi:10.13016/m2tful-rmhc
dc.identifier.citation: Fang, Zhiyuan, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. “Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning.” Edited by Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), November 2020, 840–60. https://doi.org/10.18653/v1/2020.emnlp-main.61.
dc.identifier.uri: https://doi.org/10.18653/v1/2020.emnlp-main.61
dc.identifier.uri: http://hdl.handle.net/11603/38685
dc.language.iso: en_US
dc.publisher: ACL
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/deed.en
dc.title: Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804

Files

Original bundle

Name: 2020.emnlp-main.61.pdf
Size: 4.78 MB
Format: Adobe Portable Document Format