COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

dc.contributor.author: Ging, Simon
dc.contributor.author: Zolfaghari, Mohammadreza
dc.contributor.author: Pirsiavash, Hamed
dc.contributor.author: Brox, Thomas
dc.date.accessioned: 2021-03-30T18:19:37Z
dc.date.available: 2021-03-30T18:19:37Z
dc.date.issued: 2020-11-01
dc.description: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
dc.description.abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
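The cross-modal cycle-consistency loss named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a soft-nearest-neighbor formulation in which each clip embedding is mapped to its soft nearest neighbor among the sentence embeddings and then cycled back to the clip sequence, penalizing how far the expected landing index drifts from the start. All names here are illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cycle_consistency_loss(clips, sents):
    """Sketch of a cross-modal cycle-consistency penalty (illustrative only).

    clips: (n, d) clip embeddings; sents: (m, d) sentence embeddings.
    Each clip is softly matched to the sentence sequence, the match is
    cycled back to the clip sequence, and the squared distance between
    the expected landing index and the starting index is penalized.
    """
    n = len(clips)
    loss = 0.0
    for i, v in enumerate(clips):
        # Soft nearest neighbor of clip i in the sentence sequence.
        alpha = softmax(-((sents - v) ** 2).sum(axis=1))
        t = alpha @ sents
        # Cycle back: soft location of that neighbor in the clip sequence.
        beta = softmax(-((clips - t) ** 2).sum(axis=1))
        mu = beta @ np.arange(n)  # expected landing index
        loss += (mu - i) ** 2
    return loss / n
```

With well-aligned clip and sentence embeddings the cycle returns to its starting index and the penalty is near zero; misaligned embeddings yield a positive penalty.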
dc.description.sponsorship: We thank Ehsan Adeli for helpful comments, Antoine Miech for providing details on their retrieval evaluation, and Facebook for providing us with a GPU server with Tesla P100 processors for this research work.
dc.description.uri: https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf
dc.format.extent: 27 pages
dc.genre: conference papers and proceedings; preprints
dc.identifier: doi:10.13016/m2iyh2-4ozw
dc.identifier.citation: Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas; COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning; Advances in Neural Information Processing Systems 33 (NeurIPS 2020); https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf
dc.identifier.uri: http://hdl.handle.net/11603/21259
dc.language.iso: en_US
dc.publisher: NeurIPS Proceedings
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
dc.type: Text

Files

Original bundle
Name: 2011.00597.pdf
Size: 20.39 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission