Benchmarking spatial relationships in text-to-image generation

dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Palangi, Hamid
dc.contributor.author: Nushi, Besmira
dc.contributor.author: Vineet, Vibhav
dc.contributor.author: Horvitz, Eric
dc.contributor.author: Kamar, Ece
dc.contributor.author: Baral, Chitta
dc.contributor.author: Yang, Yezhou
dc.date.accessioned: 2024-02-27T22:51:10Z
dc.date.available: 2024-02-27T22:51:10Z
dc.date.issued: 2022-12-20
dc.description.abstract: Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, SR2D, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I reasoning research.
dc.description.uri: https://arxiv.org/abs/2212.10015
dc.format.extent: 18 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2c9zh-c6hl
dc.identifier.uri: https://doi.org/10.48550/arXiv.2212.10015
dc.identifier.uri: http://hdl.handle.net/11603/31725
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution-NonCommercial-NoDerivs 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Benchmarking spatial relationships in text-to-image generation
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804
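
Note on the evaluation described in the abstract: the automated pipeline recognizes the objects named in each prompt and checks whether their detected positions satisfy the stated spatial relationship. Below is a minimal, hypothetical Python sketch of such a check using bounding-box centroids; the function names, relation strings, and decision rule are illustrative assumptions, not the paper's released code.

from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def centroid(box: BBox) -> Tuple[float, float]:
    # Center point of a detected bounding box.
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def relation_holds(box_a: BBox, box_b: BBox, relation: str) -> bool:
    # True if object A stands in `relation` to object B, judged by centroids.
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "to the left of":
        return ax < bx
    if relation == "to the right of":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y is higher in image coordinates
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")

# Example prompt: "a dog to the left of a bicycle". The generated image
# counts as correct only if both objects are detected and the check passes.
dog, bicycle = (10.0, 40.0, 60.0, 90.0), (70.0, 30.0, 120.0, 95.0)
print(relation_holds(dog, bicycle, "to the left of"))  # True

A centroid comparison is only one simple decision rule; the actual pipeline's detector and relation criteria may differ.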

Files

Original bundle

Name: 2212.10015.pdf
Size: 5.56 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon at submission