Pixel-Level Scene Recognition Under Diverse Constraints
Department
Information Systems
Program
Information Systems
Rights
Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Abstract
Recent advances in computer vision have significantly improved tasks such as object recognition and semantic segmentation, enabling a wide range of applications in smart cities, autonomous driving, medical diagnostics, and robotics. Convolutional neural networks (CNNs) have achieved remarkable success through supervised learning; however, their performance often deteriorates when confronted with substantial domain shifts between training and real-world deployment environments. Unsupervised domain adaptation (UDA) seeks to bridge this gap by exploiting labeled source data together with unlabeled target data, yet these methods typically reach a performance plateau when the domain discrepancy is too large. Fine-tuning with a small, carefully selected subset of target data is a promising strategy to overcome these limitations while reducing the burden of extensive manual annotation. In this work, we first address the fine-tuning challenge within a CNN-based framework by actively sampling high-uncertainty regions from target images and employing continual learning techniques to adapt the model incrementally. Recognizing the limitations of CNNs in capturing complex, nuanced variations in real-world data, we then propose a novel transformer-based semantic segmentation approach that operates in a continuous embedding space. Unlike conventional vector quantization methods that depend on discrete embeddings, our framework leverages continuous embeddings through an autoregressive (AR) generative model guided by a diffusion loss. The approach combines a CNN-based encoder for local feature extraction, a diffusion-based AR transformer that captures long-range dependencies, and a CNN-based decoder that reconstructs detailed pixel-level segmentation masks. Extensive experiments on public datasets, including GTAV, Cityscapes, SemanticKITTI, and ACDC, as well as on our own CADEdgeTune dataset of low-angle, real-world imagery, show that our model attains strong zero-shot domain adaptation performance. It achieves robust segmentation under adverse weather conditions and varied viewpoints while also exhibiting strong resilience to noise. Future work will extend these concepts to LiDAR-based semantic segmentation and explore the design of large vision models that fully exploit continuous embedding representations.
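The abstract describes the active-sampling step only at a high level. As one hedged illustration of how high-uncertainty regions could be scored, the minimal PyTorch-style sketch below ranks image blocks by mean per-pixel predictive entropy; the region size, class count, and image resolution are placeholder assumptions, not details taken from this work.

```python
# Illustrative sketch only: generic entropy-based region scoring for active
# sampling. Model, shapes, and region size are assumptions, not the thesis's
# actual implementation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def region_uncertainty(logits: torch.Tensor, region: int = 64) -> torch.Tensor:
    """Score square regions of an image by mean per-pixel predictive entropy.

    logits: (C, H, W) raw class scores from a segmentation model.
    Returns an (H // region, W // region) grid of uncertainty scores.
    """
    probs = F.softmax(logits, dim=0)                      # (C, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(0)   # (H, W)
    # Mean entropy inside each region x region block.
    return F.avg_pool2d(entropy[None, None], kernel_size=region)[0, 0]


# Hypothetical usage: pick the k most uncertain regions of one target image.
logits = torch.randn(19, 512, 1024)   # placeholder: 19 urban-scene classes
scores = region_uncertainty(logits, region=64)
top_regions = torch.topk(scores.flatten(), k=5).indices
```

In the workflow outlined in the abstract, such selected regions would be annotated and then used to fine-tune the segmentation model incrementally with continual learning techniques.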