CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

dc.contributor.authorAhmed, Masud
dc.contributor.authorHasan, Zahid
dc.contributor.authorHaque, Syed Arefinul
dc.contributor.authorFaridee, Abu Zaher Md
dc.contributor.authorPurushotham, Sanjay
dc.contributor.authorYou, Suya
dc.contributor.authorRoy, Nirmalya
dc.date.accessioned2025-04-23T20:31:09Z
dc.date.available2025-04-23T20:31:09Z
dc.date.issued2025-03-19
dc.description.abstractTraditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework is a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. The continuity of the embedding space also enables zero-shot domain adaptation. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (≈ 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (≈ 90% AP compared to baseline) from 50% salt-and-pepper noise, saturation shifts, and hue shifts. Code available: this https URL
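As a reading aid, here is a minimal PyTorch-style sketch of the pipeline the abstract describes (VAE encoder, diffusion-guided transformer, VAE decoder). All class names, signatures, and shapes below are illustrative assumptions, not the authors' released implementation (see the code link in the abstract).

import torch
import torch.nn as nn

class CAMSegPipeline(nn.Module):
    """Hypothetical sketch: continuous image-to-embedding semantic mask generation."""

    def __init__(self, vae_encoder: nn.Module,
                 diffusion_transformer: nn.Module,
                 vae_decoder: nn.Module):
        super().__init__()
        self.encoder = vae_encoder                 # continuous (KL-VAE-style) feature extractor
        self.transformer = diffusion_transformer   # diffusion-guided autoregressive transformer
        self.decoder = vae_decoder                 # reconstructs the semantic mask from embeddings

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        z_image = self.encoder(image)       # continuous-valued embedding, no vector quantization
        z_mask = self.transformer(z_image)  # semantic embedding conditioned on image features
        return self.decoder(z_mask)         # decoded semantic segmentation mask

Instantiated with placeholder modules (e.g., nn.Identity()) this runs end to end; per the abstract, the encoder/decoder would come from a continuous-valued VAE and the middle stage from the diffusion-guided transformer.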
dc.description.sponsorshipThis work has been partially supported by U.S. Army Grant #W911NF2120076, U.S. Army Grant #W911NF2410367, ONR Grant #N00014-23-1-2119, NSF CAREER Award #1750936, NSF REU Site Grant #2050999, and NSF CNS EAGER Grant #2233879.
dc.description.urihttps://arxiv.org/abs/2503.15617
dc.format.extent10 pages
dc.genrejournal articles
dc.genrepreprints
dc.identifierdoi:10.13016/m2mfur-6ihc
dc.identifier.urihttps://doi.org/10.48550/arXiv.2503.15617
dc.identifier.urihttp://hdl.handle.net/11603/38023
dc.language.isoen_US
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Information Systems Department
dc.relation.ispartofUMBC Center for Real-time Distributed Sensing and Autonomy
dc.relation.ispartofUMBC Student Collection
dc.relation.ispartofUMBC Faculty Collection
dc.rightsThis work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.rightsPublic Domain
dc.rights.urihttps://creativecommons.org/publicdomain/mark/1.0/
dc.subjectUMBC Mobile, Pervasive and Sensor Computing Lab (MPSC Lab)
dc.titleCAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0002-8495-0948
dcterms.creatorhttps://orcid.org/0000-0002-8324-1197

Files

Original bundle

Name: 250315617v1.pdf
Size: 7.58 MB
Format: Adobe Portable Document Format