MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
Date
2023-02-06
Citation of Original Publication
Liu, Xin, Huanle Zhang, Hamed Pirsiavash, and Xin Liu. “MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-Shot Video Classification.” In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2507–16, 2023. https://doi.org/10.1109/WACV56688.2023.00254.
Rights
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
We propose MASTAF, a Model-Agnostic Spatio-Temporal Attention Fusion network for few-shot video classification. MASTAF takes input from a general video spatial and temporal representation, e.g., produced by a 2D CNN, 3D CNN, or Video Transformer. To make the most of such representations, it uses self- and cross-attention models to highlight the critical spatio-temporal regions, increasing inter-class variation and decreasing intra-class variation. Finally, MASTAF applies a lightweight fusion network and a nearest-neighbor classifier to classify each query video. We demonstrate that MASTAF improves the state-of-the-art performance on three few-shot video classification benchmarks (UCF101, HMDB51, and Something-Something-V2), achieving 91.6%, 69.5%, and 60.7% for five-way one-shot video classification, respectively.
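The abstract outlines a pipeline: backbone features from any video model, self-attention to emphasize discriminative spatio-temporal regions, cross-attention between query and support videos, and a nearest-neighbor decision over fused embeddings. A minimal NumPy sketch of that flow is given below. It is an illustration only, not the paper's implementation: the function names, the mean-pooling fusion, and the cosine-similarity nearest-neighbor rule are all simplifying assumptions, and the real MASTAF uses learned attention and fusion modules.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats):
    """Reweight a video's own features. feats: (T, D) array of
    T spatio-temporal feature vectors from any backbone."""
    scores = softmax(feats @ feats.T / np.sqrt(feats.shape[1]))
    return scores @ feats

def cross_attention(query, support):
    """Let query features attend over support features (both (T, D)),
    aligning the query with each candidate class."""
    scores = softmax(query @ support.T / np.sqrt(query.shape[1]))
    return scores @ support

def classify(query_feats, support_sets):
    """Hypothetical fusion + nearest-neighbor step: mean-pool attended
    features into one embedding per video, then pick the support class
    with the highest cosine similarity to the query."""
    q = self_attention(query_feats).mean(axis=0)
    best_label, best_sim = None, -np.inf
    for label, s in support_sets.items():
        v = cross_attention(query_feats, self_attention(s)).mean(axis=0)
        sim = q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

In a real few-shot episode, `support_sets` would hold one (or a few) backbone-encoded videos per class, and the attention and fusion weights would be learned end-to-end rather than fixed dot-product attention with mean pooling as sketched here.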