InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

dc.contributor.author: Rastegar, Sarah
dc.contributor.author: Chatalbasheva, Violeta
dc.contributor.author: Falkena, Sieger
dc.contributor.author: Singh, Anuj
dc.contributor.author: Wang, Yanbo
dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Palangi, Hamid
dc.contributor.author: Jamali-Rad, Hadi
dc.date.accessioned: 2026-02-03T18:14:44Z
dc.date.issued: 2025-12-27
dc.description.abstract: Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
dc.description.uri: http://arxiv.org/abs/2512.17851
dc.format.extent: 25 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2bdjk-f37j
dc.identifier.uri: https://doi.org/10.48550/arXiv.2512.17851
dc.identifier.uri: http://hdl.handle.net/11603/41658
dc.language.iso: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: Computer Science - Computer Vision and Pattern Recognition
dc.subject: Computer Science - Artificial Intelligence
dc.title: InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804
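
Illustrative note: the abstract describes steering sampling with a loss computed on cross-attention maps. The sketch below is a toy illustration of that general idea only, not the authors' implementation. It assumes two small logit grids standing in for the attention maps of two objects, a hypothetical hinge loss that asks object A's horizontal centroid to sit at least `margin` to the left of object B's, and plain gradient steps on the logits in place of the per-step noise adjustment. The function names, the loss form, the margin, and the step size are all assumptions made for illustration, and the object-presence term mentioned in the abstract is omitted.

```python
import numpy as np

def softmax2d(logits):
    """Normalize a (H, W) logit grid into a spatial attention map summing to 1."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def centroid_x(att):
    """Expected horizontal position of an attention map, in [0, 1]."""
    xs = np.linspace(0.0, 1.0, att.shape[1])
    return float((att.sum(axis=0) * xs).sum())

def left_of_loss_and_grad(logits_a, logits_b, margin=0.3):
    """Hypothetical hinge loss: zero once A's centroid is `margin` left of B's.

    Returns the loss and its gradient with respect to both logit grids.
    """
    att_a, att_b = softmax2d(logits_a), softmax2d(logits_b)
    cx_a, cx_b = centroid_x(att_a), centroid_x(att_b)
    loss = max(0.0, margin - (cx_b - cx_a))
    if loss == 0.0:
        return loss, np.zeros_like(logits_a), np.zeros_like(logits_b)
    xs = np.linspace(0.0, 1.0, logits_a.shape[1])[None, :]
    # Softmax derivative: d(centroid)/d(logit_ij) = att_ij * (x_j - centroid).
    grad_a = att_a * (xs - cx_a)
    grad_b = -att_b * (xs - cx_b)
    return loss, grad_a, grad_b

def guide(logits_a, logits_b, steps=200, step_size=50.0, margin=0.3):
    """Repeated gradient steps, standing in for one update per denoising step."""
    for _ in range(steps):
        loss, grad_a, grad_b = left_of_loss_and_grad(logits_a, logits_b, margin)
        if loss == 0.0:
            break
        logits_a = logits_a - step_size * grad_a
        logits_b = logits_b - step_size * grad_b
    return logits_a, logits_b

# Toy start that violates "A left of B": A peaks on the right, B on the left.
a = np.zeros((8, 8)); a[4, 6] = 3.0
b = np.zeros((8, 8)); b[4, 1] = 3.0
a_guided, b_guided = guide(a, b)
print(centroid_x(softmax2d(a_guided)), centroid_x(softmax2d(b_guided)))
```

In a real pipeline the maps would come from the diffusion backbone and the gradient would flow back to the latent or noise prediction through automatic differentiation; the closed-form gradient here works only because the toy maps are a direct softmax of the variables being updated.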

Files

Original bundle

Name: 2512.17851v2.pdf
Size: 14.38 MB
Format: Adobe Portable Document Format