Benchmarking spatial relationships in text-to-image generation

dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Palangi, Hamid
dc.contributor.author: Nushi, Besmira
dc.contributor.author: Vineet, Vibhav
dc.contributor.author: Horvitz, Eric
dc.contributor.author: Kamar, Ece
dc.contributor.author: Baral, Chitta
dc.contributor.author: Yang, Yezhou
dc.date.accessioned: 2024-02-27T22:51:10Z
dc.date.available: 2024-02-27T22:51:10Z
dc.date.issued: 2022-12-20
dc.description.abstract: Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, SR2D, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I reasoning research.
dc.description.uri: https://arxiv.org/abs/2212.10015
dc.format.extent: 18 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2c9zh-c6hl
dc.identifier.uri: https://doi.org/10.48550/arXiv.2212.10015
dc.identifier.uri: http://hdl.handle.net/11603/31725
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution-NonCommercial-NoDerivs 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.title: Benchmarking spatial relationships in text-to-image generation
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804
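
Note on the evaluation described in the abstract: the automated pipeline recognizes the objects named in each prompt and checks whether their detected positions satisfy the stated spatial relationship. Below is a minimal, hypothetical Python sketch of such a check using bounding-box centroids; the function names, relation strings, and decision rule are illustrative assumptions, not the paper's released code.

from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def centroid(box: BBox) -> Tuple[float, float]:
    # Center point of a detected bounding box.
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def relation_holds(box_a: BBox, box_b: BBox, relation: str) -> bool:
    # True if object A stands in `relation` to object B, judged by centroids.
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "to the left of":
        return ax < bx
    if relation == "to the right of":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y is higher in image coordinates
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")

# Example prompt: "a dog to the left of a bicycle". The generated image
# counts as correct only if both objects are detected and the check passes.
dog, bicycle = (10.0, 40.0, 60.0, 90.0), (70.0, 30.0, 120.0, 95.0)
print(relation_holds(dog, bicycle, "to the left of"))  # True

A centroid comparison is only one simple decision rule; the actual pipeline's detector and relation criteria may differ.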

Files

Original bundle

Name: 2212.10015.pdf
Size: 5.56 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon at submission