COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Date

2020-11-01

Citation of Original Publication

Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
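
To give a rough sense of the third component, the sketch below shows one plausible PyTorch-style formulation of a soft nearest-neighbor cycle-consistency loss between clip and sentence embeddings: each clip is mapped to a soft sentence, cycled back to the clip sequence, and penalized if the cycle does not land near the original clip. The function name, temperature parameter, and index-based penalty are illustrative assumptions, not necessarily the exact formulation used in the paper or repository.

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(clip_emb, sent_emb, temperature=1.0):
        """Illustrative soft nearest-neighbor cycle-consistency loss (sketch).

        clip_emb: (N, D) clip embeddings from the video branch.
        sent_emb: (M, D) sentence embeddings from the text branch.
        """
        # Clip -> sentence: soft nearest neighbor of each clip among sentences.
        sim_cs = clip_emb @ sent_emb.t() / temperature   # (N, M) similarities
        alpha = F.softmax(sim_cs, dim=1)                 # soft assignment weights
        soft_sent = alpha @ sent_emb                     # (N, D) soft sentence per clip

        # Sentence -> clip: cycle the soft sentence back to the clip sequence.
        sim_sc = soft_sent @ clip_emb.t() / temperature  # (N, N)
        beta = F.softmax(sim_sc, dim=1)

        # Penalize deviation of the expected landing index from the original clip index.
        idx = torch.arange(clip_emb.size(0), dtype=clip_emb.dtype,
                           device=clip_emb.device)
        cycled_idx = beta @ idx
        return F.mse_loss(cycled_idx, idx)

    # Example usage: 4 clips and 3 sentences with 8-dimensional embeddings.
    # loss = cycle_consistency_loss(torch.randn(4, 8), torch.randn(3, 8))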