COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Author/Creator
Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas
Date
2020-11-01
Citation of Original Publication
Ging, Simon; Zolfaghari, Mohammadreza; Pirsiavash, Hamed; Brox, Thomas; COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning; Advances in Neural Information Processing Systems 33 (NeurIPS 2020); https://proceedings.neurips.cc/paper/2020/file/ff0abbcc0227c9124a804b084d161a2d-Paper.pdf
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)
Abstract
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
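
To make the cross-modal cycle-consistency idea concrete, the following is a minimal PyTorch sketch of a soft-nearest-neighbor cycle: each clip embedding is softly matched to the sentence embeddings, the soft match is cycled back to the clips, and the loss penalizes how far the cycled-back position lands from the starting index. The function name cycle_consistency_loss, the squared-Euclidean distances, and the mean-squared-error penalty are illustrative assumptions, not the repository's exact implementation.

    # Hedged sketch of a cross-modal cycle-consistency loss. Embeddings of one
    # modality are softly matched to the other and cycled back; the cycled-back
    # position should land on the starting index. Names, distance choice, and
    # penalty form are assumptions, not COOT's exact code.
    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(clips: torch.Tensor, sents: torch.Tensor) -> torch.Tensor:
        """clips: (n, d) clip embeddings; sents: (m, d) sentence embeddings."""
        # Squared Euclidean distances between every clip and every sentence.
        d_cs = torch.cdist(clips, sents).pow(2)          # (n, m)
        # Soft nearest sentence for each clip (softmax over negative distances).
        alpha = F.softmax(-d_cs, dim=1)                  # (n, m)
        soft_sent = alpha @ sents                        # (n, d)
        # Cycle back: soft location of each soft sentence among the clips.
        d_sc = torch.cdist(soft_sent, clips).pow(2)      # (n, n)
        beta = F.softmax(-d_sc, dim=1)                   # (n, n)
        positions = torch.arange(clips.size(0), dtype=clips.dtype, device=clips.device)
        soft_pos = beta @ positions                      # (n,)
        # Penalize deviation of the cycled-back position from the start index.
        return F.mse_loss(soft_pos, positions)

    # Usage with random embeddings (5 clips, 4 sentences, 64-dim features).
    loss = cycle_consistency_loss(torch.randn(5, 64), torch.randn(4, 64))

Because the soft nearest neighbor is differentiable, this objective can be trained end-to-end and, applied in both directions, encourages semantically aligned clips and sentences to occupy nearby positions in the shared embedding space.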