InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Rastegar, Sarah; Chatalbasheva, Violeta; Falkena, Sieger; Singh, Anuj; Wang, Yanbo; Gokhale, Tejas; Palangi, Hamid; Jamali-Rad, Hadi

dc.contributor.author	Rastegar, Sarah
dc.contributor.author	Chatalbasheva, Violeta
dc.contributor.author	Falkena, Sieger
dc.contributor.author	Singh, Anuj
dc.contributor.author	Wang, Yanbo
dc.contributor.author	Gokhale, Tejas
dc.contributor.author	Palangi, Hamid
dc.contributor.author	Jamali-Rad, Hadi
dc.date.accessioned	2026-02-03T18:14:44Z
dc.date.issued	2025-12-27
dc.description.abstract	Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in the training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
dc.description.uri	http://arxiv.org/abs/2512.17851
dc.format.extent	25 pages
dc.genre	journal articles
dc.genre	preprints
dc.identifier	doi:10.13016/m2bdjk-f37j
dc.identifier.uri	https://doi.org/10.48550/arXiv.2512.17851
dc.identifier.uri	http://hdl.handle.net/11603/41658
dc.language.iso	en
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject	Computer Science - Computer Vision and Pattern Recognition
dc.subject	Computer Science - Artificial Intelligence
dc.title	InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
dc.type	Text
dcterms.creator	https://orcid.org/0000-0002-5593-2804

Files

Original bundle

Name: 2512.17851v2.pdf
Size: 14.38 MB
Format: Adobe Portable Document Format

Collections

UMBC Faculty Collection
UMBC Computer Science and Electrical Engineering Department