REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
| Field | Value |
| --- | --- |
| dc.contributor.author | Chatterjee, Agneet |
| dc.contributor.author | Luo, Yiran |
| dc.contributor.author | Gokhale, Tejas |
| dc.contributor.author | Yang, Yezhou |
| dc.contributor.author | Baral, Chitta |
| dc.date.accessioned | 2024-08-27T20:37:56Z |
| dc.date.available | 2024-08-27T20:37:56Z |
| dc.date.issued | 2024-10-30 |
| dc.description | Computer Vision – ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024 |
| dc.description.abstract | Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D-rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially aware generative models. |
| dc.description.sponsorship | The authors acknowledge resources provided by Research Computing at Arizona State University. The authors also acknowledge technical access and support from ASU Enterprise Technology. This work was supported by NSF Robust Intelligence program grants #1750082 and #2132724. TG was supported by Microsoft’s Accelerating Foundation Model Research (AFMR) program and UMBC’s Strategic Award for Research Transitions (START). The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers. |
| dc.description.uri | https://link.springer.com/chapter/10.1007/978-3-031-73404-5_20 |
| dc.format.extent | 19 pages |
| dc.genre | conference papers and proceedings |
| dc.genre | postprints |
| dc.identifier | doi:10.1007/978-3-031-73404-5_20 |
| dc.identifier.citation | Chatterjee, Agneet, Yiran Luo, Tejas Gokhale, Yezhou Yang, and Chitta Baral. “REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models.” Edited by Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol. Computer Vision – ECCV 2024, 2025, 339–57. https://doi.org/10.1007/978-3-031-73404-5_20. |
| dc.identifier.uri | https://doi.org/10.1007/978-3-031-73404-5_20 |
| dc.identifier.uri | http://hdl.handle.net/11603/35795 |
| dc.language.iso | en |
| dc.publisher | Springer Nature |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department |
| dc.relation.ispartof | UMBC Faculty Collection |
| dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. |
| dc.subject | Computer Science - Computer Vision and Pattern Recognition |
| dc.title | REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models |
| dc.type | Text |
| dcterms.creator | https://orcid.org/0000-0002-5593-2804 |