Learning View-invariant and Novel Spatio-temporal Features Under Uncertainty from Video

Author/Creator

Author/Creator ORCID

Date

2024-01-01

Department

Information Systems

Program

Information Systems

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu.
Distribution Rights granted to UMBC by the author.

Abstract

Video understanding research enables machines to interpret activities, objects, and contextual situations, and to measure physiological signals from humans using video data. It has applications in human-computer interaction, robotics, contactless health monitoring, security, computer vision, search and rescue, autonomous navigation, and surveillance. Deep learning (DL) has emerged as the standard method in video understanding research, achieving state-of-the-art results in closed label spaces owing to its data-driven learning capability on large-scale datasets. However, traditional supervised DL models perform suboptimally when trained on noisily labeled data and fail to incorporate open-world unlabeled and unknown novel data during model development. Moreover, the performance of such models degrades when they encounter novel data patterns in open-set scenarios. In this thesis, we address these challenges by introducing robust DL algorithms that learn under data uncertainty in two real-world video understanding applications: contactless physiological health sensing for remote heart rate monitoring, known as remote photoplethysmography (rPPG), and video action recognition (VAR).

First, we propose generalized DL approaches that utilize large-scale rPPG data containing inherent aleatoric uncertainty in the labels to learn to extract micro-level PPG signals from skin videos for remote heart rate monitoring. We make three key algorithmic contributions to the design of DL-based rPPG systems: (i) introducing multi-task learning for noise separation, (ii) leveraging self-supervised learning to reduce reliance on labeled data, and (iii) designing a self-supervised, adversarial framework that refines rPPG estimation using large-scale unlabeled data. Further, we develop an rPPG model pruning technique that reduces DL model size for real-time edge deployment, and we release an open-source large-scale rPPG dataset.

Next, to broaden the scope of contactless health sensing from physiological signals to human action and activity recognition, we propose uncertainty-aware DL-based VAR models. These models learn spatiotemporal action patterns amid epistemic data uncertainty caused by knowledge gaps in the data, namely unlabeled novel angular viewpoints and a partially annotated open-world action space. In particular, we propose self-supervision and adversarial optimization to learn view-invariant VAR models that handle unlabeled viewpoints. We present two novel algorithms in novel category discovery (NCD) research: (i) negative learning with variance and entropy constraints, and (ii) uncertainty-aware generalized statistical constraints that facilitate learning novel action categories. We demonstrate the scalability of our NCD algorithms on image and 1-D time-series classification tasks. Overall, this thesis presents a video understanding framework for learning under data uncertainty, introducing DL algorithms for learning pixel-level information in rPPG and for discovering novel visual classes.
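The abstract does not spell out the NCD constraint formulations. As a rough illustration only, the sketch below (Python/PyTorch) shows one common way entropy-based constraints are imposed on unlabeled predictions in NCD-style training; the function name ncd_regularizer and all weights are hypothetical and not taken from the thesis.

    # Hypothetical sketch, not the thesis's exact formulation: an entropy-
    # constrained objective of the kind used in novel category discovery,
    # nudging unlabeled logits toward confident, class-balanced clusters.
    import torch
    import torch.nn.functional as F

    def ncd_regularizer(unlabeled_logits, confidence_weight=1.0, balance_weight=1.0):
        probs = F.softmax(unlabeled_logits, dim=1)  # (batch, num_classes)

        # Per-sample entropy: minimizing it pushes each prediction
        # toward a single (possibly novel) cluster.
        per_sample_entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()

        # Entropy of the batch-mean prediction: maximizing it keeps
        # cluster usage balanced, a variance-style constraint.
        mean_probs = probs.mean(dim=0)
        batch_entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum()

        # Minimize sample entropy, maximize batch entropy.
        return confidence_weight * per_sample_entropy - balance_weight * batch_entropy

    # Usage (hypothetical): add the term to the supervised loss on labeled classes.
    # logits = model(unlabeled_clips)
    # loss = supervised_loss + 0.5 * ncd_regularizer(logits)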