Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding
| dc.contributor.author | Bowser, Shawn | |
| dc.contributor.author | Matuszek, Cynthia | |
| dc.contributor.author | Lukin, Stephanie | |
| dc.date.accessioned | 2025-07-30T19:21:51Z | |
| dc.date.issued | 2025-06-30 | |
| dc.description | ICMR '25: International Conference on Multimedia Retrieval, Chicago IL, USA, 30 June 2025 - 3 July 2025 | |
| dc.description.abstract | We propose a novel multimodal interactive system for 3D scene understanding and question-answering for disaster scenarios and related tasks. Our approach creates open-domain annotations for arbitrary RGB image sequences, enabling natural language-based retrieval of 3D scenes. We incorporate an automated evaluation strategy using a vision-language model (VLM) to identify temporal differences between two scenes, significantly increasing scene understanding and same-place recognition accuracy. We demonstrate the robustness of our method on dynamic scenes, including indoor environments and real-world disasters. Finally, we test our method within human-agent collaboration by designing a novel interface for users to ask questions and retrieve visual evidence from a 3D scene rendered with 3D Gaussian Splatting (3DGS), as well as navigate through it on a desktop. Users were highly engaged with the interface and succeeded in providing visual evidence using natural language-based queries and navigation for tasks with properties that may appear in emergency response. Both the method and task-based interface lay foundations for more resilient emergency management technologies that can adapt to rapidly changing environments. | |
| dc.description.sponsorship | Cynthia Matuszek’s work was supported in part by NSF grants IIS-2024878 and IIS-2145642, and this material is based in part on research supported by the Army Research Laboratory, Grant No. W911NF2120076. | |
| dc.description.uri | https://dl.acm.org/doi/10.1145/3733566.3734430 | |
| dc.format.extent | 6 pages | |
| dc.genre | conference papers and proceedings | |
| dc.identifier | doi:10.13016/m2kqlj-fese | |
| dc.identifier.citation | Bowser, Shawn, Cynthia Matuszek, and Stephanie Lukin. “Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding.” Proceedings of the 6th Workshop on Intelligent Cross-Data Analysis and Retrieval, ICDAR ’25, June 30, 2025, 32–37. https://doi.org/10.1145/3733566.3734430. | |
| dc.identifier.uri | https://doi.org/10.1145/3733566.3734430 | |
| dc.identifier.uri | http://hdl.handle.net/11603/39464 | |
| dc.language.iso | en_US | |
| dc.publisher | ACM | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Faculty Collection | |
| dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department | |
| dc.rights | This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law. | |
| dc.rights | Public Domain | |
| dc.rights.uri | https://creativecommons.org/publicdomain/mark/1.0/ | |
| dc.subject | UMBC Interactive Robotics and Language Lab | |
| dc.title | Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding | |
| dc.type | Text | |
| dcterms.creator | https://orcid.org/0000-0003-1383-8120 |