Pixel-Level Scene Recognition Under Diverse Constraints
dc.contributor.advisor | Roy, Nirmalya | |
dc.contributor.author | Ahmed, Masud | |
dc.contributor.department | Information Systems | |
dc.contributor.program | Information Systems | |
dc.date.accessioned | 2025-07-18T17:08:41Z | |
dc.date.issued | 2025-01-01 | |
dc.description.abstract | Recent advances in computer vision have significantly enhanced tasks such as object recognition and semantic segmentation, enabling a wide range of applications in smart cities, autonomous driving, medical diagnostics, and robotics. Convolutional neural networks (CNNs) have achieved remarkable success through supervised learning; however, their performance often deteriorates when confronted with substantial domain shifts between training and real-world deployment environments. Unsupervised domain adaptation (UDA) seeks to bridge this gap by exploiting labeled source data along with unlabeled target data, yet these methods typically reach a performance plateau when the domain discrepancy is too large. Fine-tuning with a small, carefully selected subset of target data emerges as a promising strategy to overcome these limitations while reducing the burden of extensive manual annotation. In this work, we first address the fine-tuning challenge within a CNN-based framework by actively sampling high-uncertainty regions from target images and employing continual learning techniques to adapt the model incrementally. Recognizing the inherent limitations of CNNs in capturing complex and nuanced variations in real-world data, we propose a novel transformer-based semantic segmentation approach that operates in a continuous embedding space. Unlike conventional vector quantization methods that depend on discrete embeddings, our framework leverages continuous embeddings using an autoregressive (AR) generative model guided by a diffusion loss. This approach synergistically combines a CNN-based encoder for local feature extraction, a diffusion-based AR transformer to capture long-range dependencies, and a CNN-based decoder to reconstruct detailed pixel-level segmentation masks. Extensive experiments on public datasets, including GTAV, Cityscapes, SemanticKITTI, and ACDC, as well as our own CADEdgeTune dataset of low-angle, real-world imagery, demonstrate that our model attains strong zero-shot domain adaptation performance. It achieves robust segmentation under adverse weather conditions and varied viewpoints, while also exhibiting strong resilience against noise. Future work will extend these concepts to LiDAR-based semantic segmentation and explore the design of large vision models that fully exploit continuous embedding representations. | |
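The abstract mentions actively sampling high-uncertainty regions from target images to drive fine-tuning. As a rough illustration only, and not the dissertation's actual implementation, the sketch below scores regions of an unlabeled target image by per-pixel predictive entropy and picks the highest-uncertainty patches for annotation; all function names, patch sizes, and class counts here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_map(logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel predictive entropy from segmentation logits of shape (C, H, W)."""
    probs = F.softmax(logits, dim=0)
    return -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=0)  # (H, W)

def select_uncertain_patches(logits: torch.Tensor, patch: int = 64, k: int = 8):
    """Return (row, col) corners of the k non-overlapping patches with highest mean entropy."""
    ent = entropy_map(logits)
    # Average entropy over non-overlapping patch x patch regions.
    pooled = F.avg_pool2d(ent[None, None], kernel_size=patch, stride=patch)[0, 0]
    flat = pooled.flatten()
    top = torch.topk(flat, k=min(k, flat.numel())).indices
    rows = (top // pooled.shape[1]) * patch
    cols = (top % pooled.shape[1]) * patch
    return list(zip(rows.tolist(), cols.tolist()))

# Hypothetical usage: logits produced by any segmentation model on a target image.
logits = torch.randn(19, 512, 1024)            # e.g. 19 Cityscapes classes
regions = select_uncertain_patches(logits)     # patches to send for labeling / fine-tuning
```

In an active-learning loop of this kind, only the selected patches would be annotated and used to fine-tune the model incrementally, which is consistent with the abstract's goal of reducing manual annotation effort.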
dc.format | application/pdf | |
dc.genre | dissertation | |
dc.identifier | doi:10.13016/m2b5gv-hapx | |
dc.identifier.other | 13035 | |
dc.identifier.uri | http://hdl.handle.net/11603/39424 | |
dc.language | en | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Information Systems Department Collection | |
dc.relation.ispartof | UMBC Theses and Dissertations Collection | |
dc.relation.ispartof | UMBC Graduate School Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.source | Original File Name: Ahmed_umbc_0434D_13035.pdf | |
dc.subject | Active Learning | |
dc.subject | Continual Learning | |
dc.subject | Continuous Autoregressive Model | |
dc.subject | Domain Adaptation | |
dc.subject | Semantic Segmentation | |
dc.title | Pixel-Level Scene Recognition Under Diverse Constraints | |
dc.type | Text | |
dcterms.accessRights | Distribution Rights granted to UMBC by the author. | |
dcterms.accessRights | This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu |