COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
dc.contributor.author | Ging, Simon | |
dc.contributor.author | Zolfaghari, Mohammadreza | |
dc.contributor.author | Pirsiavash, Hamed | |
dc.contributor.author | Brox, Thomas | |
dc.date.accessioned | 2021-03-30T18:19:37Z | |
dc.date.available | 2021-03-30T18:19:37Z | |
dc.date.issued | 2020-11-01 | |
dc.description | 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada | en_US |
dc.description.abstract | Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext | en_US |
dc.description.sponsorship | We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us a GPU server with Tesla P100 processors for this research work. | en_US |
dc.description.uri | https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf | en_US |
dc.format.extent | 27 pages | en_US |
dc.genre | conference papers and proceedings preprints | en_US |
dc.identifier | doi:10.13016/m2iyh2-4ozw | |
dc.identifier.citation | Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas; COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning; Advances in Neural Information Processing Systems 33 (NeurIPS 2020); https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf | en_US |
dc.identifier.uri | http://hdl.handle.net/11603/21259 | |
dc.language.iso | en_US | en_US |
dc.publisher | NeurIPS Proceedings | en_US |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
dc.rights | Attribution 4.0 International (CC BY 4.0) | * |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | * |
dc.title | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | en_US |
dc.type | Text | en_US |