COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

dc.contributor.author: Ging, Simon
dc.contributor.author: Zolfaghari, Mohammadreza
dc.contributor.author: Pirsiavash, Hamed
dc.contributor.author: Brox, Thomas
dc.date.accessioned: 2021-03-30T18:19:37Z
dc.date.available: 2021-03-30T18:19:37Z
dc.date.issued: 2020-11-01
dc.description: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
dc.description.abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
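The cross-modal cycle-consistency loss named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a soft-nearest-neighbor formulation in which each clip embedding is mapped to its soft nearest neighbor among the sentence embeddings and then cycled back to the clip sequence, penalizing how far the expected landing index drifts from the start. All names here are illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cycle_consistency_loss(clips, sents):
    """Sketch of a cross-modal cycle-consistency penalty (illustrative only).

    clips: (n, d) clip embeddings; sents: (m, d) sentence embeddings.
    Each clip is softly matched to the sentence sequence, the match is
    cycled back to the clip sequence, and the squared distance between
    the expected landing index and the starting index is penalized.
    """
    n = len(clips)
    loss = 0.0
    for i, v in enumerate(clips):
        # Soft nearest neighbor of clip i in the sentence sequence.
        alpha = softmax(-((sents - v) ** 2).sum(axis=1))
        t = alpha @ sents
        # Cycle back: soft location of that neighbor in the clip sequence.
        beta = softmax(-((clips - t) ** 2).sum(axis=1))
        mu = beta @ np.arange(n)  # expected landing index
        loss += (mu - i) ** 2
    return loss / n
```

With well-aligned clip and sentence embeddings the cycle returns to its starting index and the penalty is near zero; misaligned embeddings yield a positive penalty.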
dc.description.sponsorship: We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us with a GPU server with Tesla P100 processors for this research work.
dc.description.uri: https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf
dc.format.extent: 27 pages
dc.genre: conference papers and proceedings; preprints
dc.identifier: doi:10.13016/m2iyh2-4ozw
dc.identifier.citation: Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas; COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning; Advances in Neural Information Processing Systems 33 (NeurIPS 2020); https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf
dc.identifier.uri: http://hdl.handle.net/11603/21259
dc.language.iso: en_US
dc.publisher: NeurIPS Proceedings
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
dc.type: Text

Files

Original bundle
Name: 2011.00597.pdf
Size: 20.39 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission