Pixel-Level Scene Recognition Under Diverse Constraints
Department
Information Systems
Program
Information Systems
Rights
Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Abstract
Recent advances in computer vision have significantly improved tasks such as object recognition and semantic segmentation, enabling a wide range of applications in smart cities, autonomous driving, medical diagnostics, and robotics. Convolutional neural networks (CNNs) have achieved remarkable success through supervised learning; however, their performance often deteriorates when confronted with substantial domain shifts between training and real-world deployment environments. Unsupervised domain adaptation (UDA) seeks to bridge this gap by exploiting labeled source data together with unlabeled target data, yet these methods typically reach a performance plateau when the domain discrepancy is too large. Fine-tuning with a small, carefully selected subset of target data is a promising strategy to overcome these limitations while reducing the burden of extensive manual annotation. In this work, we first address the fine-tuning challenge within a CNN-based framework by actively sampling high-uncertainty regions from target images and employing continual learning techniques to adapt the model incrementally. Recognizing the limitations of CNNs in capturing complex, nuanced variations in real-world data, we then propose a novel transformer-based semantic segmentation approach that operates in a continuous embedding space. Unlike conventional vector quantization methods that depend on discrete embeddings, our framework leverages continuous embeddings through an autoregressive (AR) generative model guided by a diffusion loss. The approach combines a CNN-based encoder for local feature extraction, a diffusion-based AR transformer that captures long-range dependencies, and a CNN-based decoder that reconstructs detailed pixel-level segmentation masks. Extensive experiments on public datasets, including GTAV, Cityscapes, SemanticKITTI, and ACDC, as well as on our own CADEdgeTune dataset of low-angle, real-world imagery, show that our model attains strong zero-shot domain adaptation performance. It achieves robust segmentation under adverse weather conditions and varied viewpoints while also exhibiting strong resilience to noise. Future work will extend these concepts to LiDAR-based semantic segmentation and explore the design of large vision models that fully exploit continuous embedding representations.
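The abstract describes the active-sampling step only at a high level. As one hedged illustration of how high-uncertainty regions could be scored, the minimal PyTorch-style sketch below ranks image blocks by mean per-pixel predictive entropy; the region size, class count, and image resolution are placeholder assumptions, not details taken from this work.

```python
# Illustrative sketch only: generic entropy-based region scoring for active
# sampling. Model, shapes, and region size are assumptions, not the thesis's
# actual implementation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def region_uncertainty(logits: torch.Tensor, region: int = 64) -> torch.Tensor:
    """Score square regions of an image by mean per-pixel predictive entropy.

    logits: (C, H, W) raw class scores from a segmentation model.
    Returns an (H // region, W // region) grid of uncertainty scores.
    """
    probs = F.softmax(logits, dim=0)                      # (C, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(0)   # (H, W)
    # Mean entropy inside each region x region block.
    return F.avg_pool2d(entropy[None, None], kernel_size=region)[0, 0]


# Hypothetical usage: pick the k most uncertain regions of one target image.
logits = torch.randn(19, 512, 1024)   # placeholder: 19 urban-scene classes
scores = region_uncertainty(logits, region=64)
top_regions = torch.topk(scores.flatten(), k=5).indices
```

In the workflow outlined in the abstract, such selected regions would be annotated and then used to fine-tune the segmentation model incrementally with continual learning techniques.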