Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding

dc.contributor.authorBowser, Shawn
dc.contributor.authorMatuszek, Cynthia
dc.contributor.authorLukin, Stephanie
dc.date.accessioned2025-07-30T19:21:51Z
dc.date.issued2025-06-30
dc.descriptionICMR '25: International Conference on Multimedia Retrieval, Chicago, IL, USA, 30 June - 3 July 2025
dc.description.abstractWe propose a novel multimodal interactive system for 3D scene understanding and question-answering for disaster scenarios and related tasks. Our approach creates open-domain annotations for arbitrary RGB image sequences, enabling natural language-based retrieval of 3D scenes. We incorporate an automated evaluation strategy using a vision-language model (VLM) to identify temporal differences between two scenes, significantly increasing scene understanding and same-place recognition accuracy. We demonstrate the robustness of our method on dynamic scenes, including indoor environments and real-world disasters. Finally, we test our method within human-agent collaboration by designing a novel interface for users to ask questions and retrieve visual evidence from a 3D scene rendered with 3D Gaussian Splatting (3DGS), as well as navigate through it on a desktop. Users were highly engaged with the interface and succeeded in providing visual evidence using natural language-based queries and navigation for tasks with properties that may appear in emergency response. Both the method and task-based interface lay foundations for more resilient emergency management technologies that can adapt to rapidly changing environments.
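The record gives no implementation details for the retrieval step, so the following is a minimal sketch of language-based retrieval over per-frame scene annotations, assuming an off-the-shelf sentence-embedding model (sentence-transformers); the frame ids, annotation texts, and query are invented for illustration and the paper's actual pipeline may differ.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical per-frame annotations; in the paper these would be
# generated automatically (e.g., by a VLM) for an RGB image sequence.
annotations = {
    "frame_012": "a collapsed bookshelf blocking the hallway",
    "frame_047": "an open doorway leading to a stairwell",
    "frame_103": "standing water covering the kitchen floor",
}

frame_ids = list(annotations.keys())
ann_embs = model.encode(list(annotations.values()), convert_to_tensor=True)

def retrieve(query: str, top_k: int = 1):
    """Return the top-k frame ids whose annotations best match the query."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, ann_embs)[0]  # cosine similarity per frame
    top = scores.topk(k=min(top_k, len(frame_ids)))
    return [(frame_ids[i], float(s)) for s, i in zip(top.values, top.indices)]

print(retrieve("is anything blocking the corridor?"))

Retrieved frame ids could then index into the 3DGS reconstruction to surface the matching viewpoint as visual evidence, which is the interaction pattern the abstract describes.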
dc.description.sponsorshipCynthia Matuszek's work was supported in part by NSF grants IIS-2024878 and IIS-2145642, and this material is also based on research that is in part supported by the Army Research Laboratory, Grant No. W911NF2120076.
dc.description.urihttps://dl.acm.org/doi/10.1145/3733566.3734430
dc.format.extent6 pages
dc.genreconference papers and proceedings
dc.identifierdoi:10.13016/m2kqlj-fese
dc.identifier.citationBowser, Shawn, Cynthia Matuszek, and Stephanie Lukin. “Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding.” Proceedings of the 6th Workshop on Intelligent Cross-Data Analysis and Retrieval, ICDAR ’25, June 30, 2025, 32–37. https://doi.org/10.1145/3733566.3734430.
dc.identifier.urihttps://doi.org/10.1145/3733566.3734430
dc.identifier.urihttp://hdl.handle.net/11603/39464
dc.language.isoen_US
dc.publisherACM
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department
dc.rightsThis work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.rightsPublic Domain
dc.rights.urihttps://creativecommons.org/publicdomain/mark/1.0/
dc.subjectUMBC Interactive Robotics and Language Lab
dc.titleTowards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0003-1383-8120

Files

Original bundle

Name: 3733566.3734430.pdf
Size: 5.32 MB
Format: Adobe Portable Document Format