CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Authors: Ahmed, Masud; Hasan, Zahid; Haque, Syed Arefinul; Faridee, Abu Zaher Md; Purushotham, Sanjay; You, Suya; Roy, Nirmalya
Affiliation: UMBC Mobile, Pervasive and Sensor Computing Lab (MPSC Lab)
Dates: 2025-03-19; 2025-04-23
Language: en
Type: Text
DOI: https://doi.org/10.48550/arXiv.2503.15617
Handle: http://hdl.handle.net/11603/38023
Extent: 10 pages
Rights: This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
License: Public Domain, https://creativecommons.org/publicdomain/mark/1.0/

Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework is a unified architecture that combines a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. The continuity of the embedding space enables zero-shot domain adaptation.
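
The gap between quantized and continuous embeddings can be illustrated with a minimal sketch. It snaps continuous latent vectors to the nearest entry of a small codebook, which is the core step of VQ-VAE-style quantization, and measures the error this introduces. The codebook size, latent dimension, random data, and the function name `quantize` are illustrative assumptions; none of them come from the paper, and the sketch does not reproduce the reported 8% figure.

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance)."""
    # Pairwise squared distances, shape (num_latents, num_codes).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return codebook[indices], indices

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 4))   # hypothetical continuous embeddings
codebook = rng.normal(size=(16, 4))   # hypothetical 16-entry codebook

quantized, indices = quantize(latents, codebook)

# Error introduced purely by discretization; a continuous pipeline would
# pass `latents` to the decoder unchanged and incur none of it.
quantization_mse = float(((latents - quantized) ** 2).mean())
print(f"quantization MSE: {quantization_mse:.4f}")
```

A larger codebook reduces this error but never removes it for generic continuous inputs, which is one plausible reading of why a continuous latent space preserves finer mask detail.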
Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (≈ 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (≈ 90% AP compared to baseline) from 50% salt-and-pepper noise and from saturation and hue shifts. Code available: this https URL
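
The noise perturbations named in the abstract can be sketched as follows. This is a minimal illustration of Gaussian and salt-and-pepper corruption on a float image in [0, 1], not the authors' evaluation code. The function names, the Gaussian standard deviation, and the test image are assumptions; only the 50% salt-and-pepper level is taken from the abstract.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a float image in [0, 1], then clip."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper(image, amount, rng):
    """Replace roughly a fraction `amount` of pixels with pure white or black."""
    noisy = image.copy()
    h, w = image.shape[:2]
    corrupt = rng.random((h, w)) < amount   # pixels selected for corruption
    salt = rng.random((h, w)) < 0.5         # about half white, half black
    noisy[corrupt & salt] = 1.0
    noisy[corrupt & ~salt] = 0.0
    return noisy

rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 0.5)           # hypothetical mid-gray RGB image

gauss = add_gaussian_noise(image, sigma=0.1, rng=rng)   # sigma is an assumption
sp = add_salt_and_pepper(image, amount=0.5, rng=rng)    # 50% level from the abstract

# Fraction of pixels whose value actually changed.
changed = float((sp != image).any(axis=-1).mean())
print(f"fraction of pixels corrupted: {changed:.2f}")
```

Feeding such perturbed inputs to a segmentation model and dividing the resulting AP by the clean-input AP gives a relative score of the kind quoted above.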