Grounded Instruction Understanding with Large Language Models: Toward Trustworthy Human-Robot Interaction

dc.contributor.author: Ogbadu, Ekele Aga
dc.contributor.author: Lukin, Stephanie
dc.contributor.author: Matuszek, Cynthia
dc.date.accessioned: 2026-01-06T20:51:55Z
dc.date.issued: 2025-11-23
dc.description: 2025 AAAI Fall Symposium, November 6-8, 2025, Arlington, VA, USA
dc.description.abstract: Understanding natural language as a representational bridge between perception and action is critical for deploying autonomous robots in complex, high-risk environments. This work investigates how large language models (LLMs) can support this bridge by interpreting unconstrained human instructions in urban disaster response scenarios. Leveraging the SCOUT corpus, a multimodal dataset capturing human-robot dialogue through Wizard-of-Oz experiments, we construct SCOUT++, aligning over 11,000 visual frames with language commands and robot actions. We evaluate three instruction classification approaches: a neural network trained on tokenized text, GPT-4 using text alone, and GPT-4 with synchronized visual input. Results show that while GPT-4 (text-only) outperforms traditional models in accuracy, its multimodal variant exhibits degraded performance, often producing vague or hallucinated outputs. These findings expose the challenges of reliably grounding language in visual context and raise questions about the trustworthiness of foundation models in safety-critical settings. We contribute SCOUT++, a reproducible multimodal pipeline, and benchmark results that shed light on the capabilities and current limitations of vision-language models for risk-sensitive human-robot interaction.
dc.description.uri: https://ojs.aaai.org/index.php/AAAI-SS/article/view/36890
dc.format.extent: 9 pages
dc.genre: conference papers and proceedings
dc.identifier: doi:10.13016/m2nmvg-fpo9
dc.identifier.citation: Ogbadu, Ekele, Stephanie Lukin, and Cynthia Matuszek. “Grounded Instruction Understanding with Large Language Models: Toward Trustworthy Human-Robot Interaction.” Proceedings of the AAAI Symposium Series 7, no. 1 (2025): 223–31. https://doi.org/10.1609/aaaiss.v7i1.36890.
dc.identifier.uri: https://doi.org/10.1609/aaaiss.v7i1.36890
dc.identifier.uri: http://hdl.handle.net/11603/41392
dc.language.iso: en
dc.publisher: AAAI
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.rights: Public Domain
dc.rights.uri: https://creativecommons.org/publicdomain/mark/1.0/
dc.subject: UMBC Interactive Robotics and Language Lab
dc.title: Grounded Instruction Understanding with Large Language Models: Toward Trustworthy Human-Robot Interaction
dc.type: Text
dcterms.creator: https://orcid.org/0000-0003-1383-8120
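The abstract above contrasts two GPT-4 evaluation conditions: classifying an instruction from its text alone, and classifying it with a synchronized camera frame as additional context. A minimal sketch of that setup follows, assuming the OpenAI chat completions API; the model names, label set, and prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the text-only vs. multimodal classification conditions
# described in the abstract. Assumes the OpenAI Python SDK (v1+); the model
# names, label set, and prompts are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

LABELS = ["move", "turn", "stop", "describe", "other"]  # hypothetical label set

SYSTEM_PROMPT = (
    f"Classify the robot instruction into one of: {LABELS}. "
    "Reply with the label only."
)

def classify_text(instruction: str) -> str:
    """Text-only condition: classify an instruction from its words alone."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    return resp.choices[0].message.content.strip()

def classify_multimodal(instruction: str, frame_path: str) -> str:
    """Multimodal condition: add the synchronized camera frame as context."""
    with open(frame_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # a vision-capable model; stands in for the paper's variant
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content.strip()
```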

Files

Original bundle

Name: 36890ArticleText409671220251123(1).pdf
Size: 1.96 MB
Format: Adobe Portable Document Format