REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
| Field | Value |
| --- | --- |
| dc.contributor.author | Chatterjee, Agneet |
| dc.contributor.author | Luo, Yiran |
| dc.contributor.author | Gokhale, Tejas |
| dc.contributor.author | Yang, Yezhou |
| dc.contributor.author | Baral, Chitta |
| dc.date.accessioned | 2024-08-27T20:37:56Z |
| dc.date.available | 2024-08-27T20:37:56Z |
| dc.date.issued | 2024-10-30 |
| dc.description | Computer Vision – ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024 |
| dc.description.abstract | Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D-rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially aware generative models. |
| dc.description.sponsorship | The authors acknowledge resources provided by Research Computing at Arizona State University. The authors also acknowledge technical access and support from ASU Enterprise Technology. This work was supported by NSF Robust Intelligence program grants #1750082 and #2132724. TG was supported by Microsoft’s Accelerating Foundation Model Research (AFMR) program and UMBC’s Strategic Award for Research Transitions (START). The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers. |
| dc.description.uri | https://link.springer.com/chapter/10.1007/978-3-031-73404-5_20 |
| dc.format.extent | 19 pages |
| dc.genre | conference papers and proceedings |
| dc.genre | postprints |
| dc.identifier | doi:10.1007/978-3-031-73404-5_20 |
| dc.identifier.citation | Chatterjee, Agneet, Yiran Luo, Tejas Gokhale, Yezhou Yang, and Chitta Baral. “REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models.” Edited by Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol. Computer Vision – ECCV 2024, 2025, 339–57. https://doi.org/10.1007/978-3-031-73404-5_20. |
| dc.identifier.uri | https://doi.org/10.1007/978-3-031-73404-5_20 |
| dc.identifier.uri | http://hdl.handle.net/11603/35795 |
| dc.language.iso | en |
| dc.publisher | Springer Nature |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department |
| dc.relation.ispartof | UMBC Faculty Collection |
| dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. |
| dc.subject | Computer Science - Computer Vision and Pattern Recognition |
| dc.title | REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models |
| dc.type | Text |
| dcterms.creator | https://orcid.org/0000-0002-5593-2804 |