REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

dc.contributor.author: Chatterjee, Agneet
dc.contributor.author: Luo, Yiran
dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Yang, Yezhou
dc.contributor.author: Baral, Chitta
dc.date.accessioned: 2024-08-27T20:37:56Z
dc.date.available: 2024-08-27T20:37:56Z
dc.date.issued: 2024-10-30
dc.description: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024
dc.description.abstract: Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework, which currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware generative models.
dc.description.sponsorship: The authors acknowledge resources provided by Research Computing at Arizona State University. The authors also acknowledge technical access and support from ASU Enterprise Technology. This work was supported by NSF Robust Intelligence program grants #1750082 and #2132724. TG was supported by Microsoft’s Accelerating Foundation Model Research (AFMR) program and UMBC’s Strategic Award for Research Transitions (START). The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers.
dc.description.uri: https://link.springer.com/chapter/10.1007/978-3-031-73404-5_20
dc.format.extent: 19 pages
dc.genre: conference papers and proceedings
dc.genre: postprints
dc.identifier: doi:10.1007/978-3-031-73404-5_20
dc.identifier.citation: Chatterjee, Agneet, Yiran Luo, Tejas Gokhale, Yezhou Yang, and Chitta Baral. “REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models.” Edited by Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol. Computer Vision – ECCV 2024, 2025, 339–57. https://doi.org/10.1007/978-3-031-73404-5_20.
dc.identifier.uri: https://doi.org/10.1007/978-3-031-73404-5_20
dc.identifier.uri: http://hdl.handle.net/11603/35795
dc.language.iso: en
dc.publisher: Springer Nature
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.subject: Computer Science - Computer Vision and Pattern Recognition
dc.title: REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804

Files

Original bundle

Name: 04416.pdf
Size: 3.58 MB
Format: Adobe Portable Document Format