InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
| dc.contributor.author | Rastegar, Sarah | |
| dc.contributor.author | Chatalbasheva, Violeta | |
| dc.contributor.author | Falkena, Sieger | |
| dc.contributor.author | Singh, Anuj | |
| dc.contributor.author | Wang, Yanbo | |
| dc.contributor.author | Gokhale, Tejas | |
| dc.contributor.author | Palangi, Hamid | |
| dc.contributor.author | Jamali-Rad, Hadi | |
| dc.date.accessioned | 2026-02-03T18:14:44Z | |
| dc.date.issued | 2025-12-27 | |
| dc.description.abstract | Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. The codebase is available on GitHub. | |
| dc.description.uri | http://arxiv.org/abs/2512.17851 | |
| dc.format.extent | 25 pages | |
| dc.genre | journal articles | |
| dc.genre | preprints | |
| dc.identifier | doi:10.13016/m2bdjk-f37j | |
| dc.identifier.uri | https://doi.org/10.48550/arXiv.2512.17851 | |
| dc.identifier.uri | http://hdl.handle.net/11603/41658 | |
| dc.language.iso | en | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Faculty Collection | |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department | |
| dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
| dc.subject | Computer Science - Computer Vision and Pattern Recognition | |
| dc.subject | Computer Science - Artificial Intelligence | |
| dc.title | InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models | |
| dc.type | Text | |
| dcterms.creator | https://orcid.org/0000-0002-5593-2804 |
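
The abstract describes a compound loss over cross-attention maps that encourages correct object placement and balanced object presence. The sketch below is an illustration of that general idea only, not the authors' implementation: the centroid-based placement term, the peak-difference presence term, the `margin` value, the weights, and all function names are assumptions made for this example. In an actual sampler such a loss would presumably be computed in an autodiff framework so that its gradient can adjust the predicted noise at each denoising step; plain Python lists are used here only to keep the arithmetic visible.

```python
# Hypothetical sketch of a placement + presence loss over two attention maps.
# Nothing here is taken from the InfSplign codebase; names and formulas are assumed.
from typing import List, Tuple

Grid = List[List[float]]  # attention map as rows of non-negative weights


def centroid(attn: Grid) -> Tuple[float, float]:
    """Attention-weighted mean (x, y) position, normalized to [0, 1]."""
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    if total <= 0:
        return 0.5, 0.5  # no attention mass: fall back to the image center
    cx = sum(v * (x + 0.5) / w for row in attn for x, v in enumerate(row)) / total
    cy = sum(v * (y + 0.5) / h for y, row in enumerate(attn) for v in row) / total
    return cx, cy


def spatial_loss(attn_a: Grid, attn_b: Grid, relation: str, margin: float = 0.25) -> float:
    """Hinge penalty when object A is not `relation` of object B by at least `margin`."""
    ax, ay = centroid(attn_a)
    bx, by = centroid(attn_b)
    # Image coordinates: x grows to the right, y grows downward.
    gaps = {"left": bx - ax, "right": ax - bx, "above": by - ay, "below": ay - by}
    return max(0.0, margin - gaps[relation])


def presence_loss(attn_a: Grid, attn_b: Grid) -> float:
    """Penalize one object dominating: difference of the peak attention values."""
    peak_a = max(max(row) for row in attn_a)
    peak_b = max(max(row) for row in attn_b)
    return abs(peak_a - peak_b)


def compound_loss(attn_a: Grid, attn_b: Grid, relation: str,
                  w_spatial: float = 1.0, w_presence: float = 0.5) -> float:
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (w_spatial * spatial_loss(attn_a, attn_b, relation)
            + w_presence * presence_loss(attn_a, attn_b))


# Toy 2x2 maps: object A attends to the left column, object B to the right.
left_map = [[1.0, 0.0], [1.0, 0.0]]
right_map = [[0.0, 1.0], [0.0, 1.0]]
loss_satisfied = compound_loss(left_map, right_map, "left")
loss_violated = compound_loss(left_map, right_map, "right")
```

With these toy maps the loss is zero when the prompt relation matches the layout ("A left of B") and positive when it does not, which is the signal a guidance step would try to reduce.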
