Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding

dc.contributor.authorBowser, Shawn
dc.contributor.authorMatuszek, Cynthia
dc.contributor.authorLukin, Stephanie
dc.date.accessioned2025-07-30T19:21:51Z
dc.date.issued2025-06-30
dc.descriptionICMR '25: International Conference on Multimedia Retrieval, Chicago, IL, USA, 30 June - 3 July 2025
dc.description.abstractWe propose a novel multimodal interactive system for 3D scene understanding and question-answering for disaster scenarios and related tasks. Our approach creates open-domain annotations for arbitrary RGB image sequences, enabling natural language-based retrieval of 3D scenes. We incorporate an automated evaluation strategy using a vision-language model (VLM) to identify temporal differences between two scenes, significantly increasing scene understanding and same-place recognition accuracy. We demonstrate the robustness of our method on dynamic scenes, including indoor environments and real-world disasters. Finally, we test our method within human-agent collaboration by designing a novel interface for users to ask questions and retrieve visual evidence from a 3D scene rendered with 3D Gaussian Splatting (3DGS), as well as navigate through it on a desktop. Users were highly engaged with the interface and succeeded in providing visual evidence using natural language-based queries and navigation for tasks with properties that may appear in emergency response. Both the method and task-based interface lay foundations for more resilient emergency management technologies that can adapt to rapidly changing environments.
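The record gives no implementation details for the retrieval step, so the following is a minimal sketch of language-based retrieval over per-frame scene annotations, assuming an off-the-shelf sentence-embedding model (sentence-transformers); the frame ids, annotation texts, and query are invented for illustration and the paper's actual pipeline may differ.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical per-frame annotations; in the paper these would be
# generated automatically (e.g., by a VLM) for an RGB image sequence.
annotations = {
    "frame_012": "a collapsed bookshelf blocking the hallway",
    "frame_047": "an open doorway leading to a stairwell",
    "frame_103": "standing water covering the kitchen floor",
}

frame_ids = list(annotations.keys())
ann_embs = model.encode(list(annotations.values()), convert_to_tensor=True)

def retrieve(query: str, top_k: int = 1):
    """Return the top-k frame ids whose annotations best match the query."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, ann_embs)[0]  # cosine similarity per frame
    top = scores.topk(k=min(top_k, len(frame_ids)))
    return [(frame_ids[i], float(s)) for s, i in zip(top.values, top.indices)]

print(retrieve("is anything blocking the corridor?"))

Retrieved frame ids could then index into the 3DGS reconstruction to surface the matching viewpoint as visual evidence, which is the interaction pattern the abstract describes.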
dc.description.sponsorshipCynthia Matuszek's work was supported in part by NSF grants IIS-2024878 and IIS-2145642, and this material is also based on research that is in part supported by the Army Research Laboratory, Grant No. W911NF2120076.
dc.description.urihttps://dl.acm.org/doi/10.1145/3733566.3734430
dc.format.extent6 pages
dc.genreconference papers and proceedings
dc.identifierdoi:10.13016/m2kqlj-fese
dc.identifier.citationBowser, Shawn, Cynthia Matuszek, and Stephanie Lukin. “Towards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding.” Proceedings of the 6th Workshop on Intelligent Cross-Data Analysis and Retrieval, ICDAR ’25, June 30, 2025, 32–37. https://doi.org/10.1145/3733566.3734430.
dc.identifier.urihttps://doi.org/10.1145/3733566.3734430
dc.identifier.urihttp://hdl.handle.net/11603/39464
dc.language.isoen_US
dc.publisherACM
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department
dc.rightsThis work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.rightsPublic Domain
dc.rights.urihttps://creativecommons.org/publicdomain/mark/1.0/
dc.subjectUMBC Interactive Robotics and Language Lab
dc.titleTowards Integrated Multimodal Interaction: Merging Immersive 3D Worlds with Language Based Retrieval for 3D Scene Understanding
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0003-1383-8120

Files

Original bundle

Name: 3733566.3734430.pdf
Size: 5.32 MB
Format: Adobe Portable Document Format