VIVAR: learning view-invariant embedding for video action recognition
Date
2025-03-10
Citation of Original Publication
Hasan, Zahid, Masud Ahmed, Abu Zaher Md Faridee, Sanjay Purushotham, Hyungtae Lee, Heesung Kwon, and Nirmalya Roy. “VIVAR: Learning View-Invariant Embedding for Video Action Recognition.” In Eighth International Conference on Video and Image Processing (ICVIP 2024), 13558:94–105. Kuala Lumpur, Malaysia: SPIE, 2025. https://doi.org/10.1117/12.3059138.
Rights
This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
Public Domain
Abstract
Deep learning has achieved state-of-the-art video action recognition (VAR) performance by learning action-related features from raw video. However, these models often jointly encode auxiliary view information (viewpoints and sensor properties) with the primary action features, which degrades performance under novel views and raises security concerns by revealing sensor types and locations. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, that learns view-invariant spatiotemporal action features by removing view information. In particular, we use contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions, removing auxiliary view information from the deep embedding space. Training uses unlabeled synchronous multiview (MV) video to learn a view-invariant VAR system. We evaluate VIVAR on our in-house large-scale, time-synchronous MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms state-of-the-art (SoTA) models with 8% higher accuracy. We additionally perform extensive studies across our datasets, model architectures, multiple contrastive learning approaches, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research in view-invariant systems.
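The abstract describes a joint objective: a contrastive term that pulls together time-synchronized clips across views, and an adversarial term that discourages the embedding from revealing the view. The sketch below is only an illustration of that general idea in plain Python, not the authors' implementation. It assumes an InfoNCE-style contrastive loss and a view discriminator whose cross-entropy the encoder tries to increase; the function names (`info_nce`, `view_cross_entropy`, `encoder_objective`), the `temperature` and `adv_weight` parameters, and their default values are all hypothetical. The paper's actual losses and alignment method may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss (assumed form): clip i seen from view A should
    match the time-synchronized clip i seen from view B, and not clip j != i."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def view_cross_entropy(view_probs, view_labels):
    """Mean cross-entropy of a hypothetical view discriminator's predictions."""
    losses = [-math.log(p[y]) for p, y in zip(view_probs, view_labels)]
    return sum(losses) / len(losses)


def encoder_objective(view_a, view_b, view_probs, view_labels, adv_weight=0.5):
    """Value the encoder would minimize in this sketch: separate actions
    (contrastive term) while making the view hard to predict, i.e. keeping
    the discriminator's cross-entropy high (adversarial term)."""
    contrastive = info_nce(view_a, view_b)
    adversarial = view_cross_entropy(view_probs, view_labels)
    return contrastive - adv_weight * adversarial


# Two synchronized clips, each embedded from two views (toy 2-D embeddings).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(view_a, view_b)
shuffled_loss = info_nce(view_a, list(reversed(view_b)))
```

In this toy setup, `aligned_loss` is smaller than `shuffled_loss`, because matching clips across views have the highest similarity. The sketch only evaluates loss values; a real training loop would also need gradients and an alternating or gradient-reversal update for the discriminator, which are omitted here.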