Speaker-Based Variability in Robotic Spoken Language Grounding
Author/Creator
Author/Creator ORCID
Date
2022-01-01
Type of Work
Department
Computer Science and Electrical Engineering
Program
Computer Science
Citation of Original Publication
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
Abstract
Robots in human spaces need to be able to understand human-provided natural language instructions in the context of their physical environment. Learning to understand grounded language, which connects natural language to percepts, is a critical research area. However, the majority of existing efforts rely on highly curated text and ignore the noise and variance present in end-user speech. Existing work on speech-based grounded language learning requires extensive amounts of speech data. Additionally, variation in speech characteristics can pose challenges for grounding models, and prior work does not investigate performance differences across demographic groups. In this thesis, I train and evaluate language grounding models on collected spoken and textual descriptions of common household objects. I leverage recent work in self-supervised speech representation models to learn groundings without transcriptions as an intermediate representation. The goal is to eliminate the effects of off-the-shelf speech-to-text models as a potential source of bias. The experimental results suggest that this approach can make language grounding systems more inclusive of accented speakers and increase overall performance.
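The core idea of transcription-free grounding described in the abstract can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the thesis's actual model: the embedding dimensions, the identity projection matrices, the cosine-similarity matching rule, and the function names (`ground`, `cosine_similarity`) are all hypothetical stand-ins for learned self-supervised speech features and visual object features mapped into a shared space.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground(speech_embedding, object_embeddings, speech_proj, vision_proj):
    """Project a spoken description and candidate objects into a shared
    space; return the index of the best-matching object and all scores.

    In a real system, speech_proj/vision_proj would be learned, and the
    speech embedding would come from a self-supervised model rather than
    from a speech-to-text transcription.
    """
    s = speech_proj @ speech_embedding
    sims = [cosine_similarity(s, vision_proj @ v) for v in object_embeddings]
    return int(np.argmax(sims)), sims

# Toy example with identity projections: a speech embedding for
# "the red mug" lies closest to the mug's visual embedding.
speech = np.array([0.9, 0.1, 0.0])      # stand-in speech-model embedding
objects = [np.array([1.0, 0.0, 0.0]),   # mug
           np.array([0.0, 1.0, 0.0]),   # plate
           np.array([0.0, 0.0, 1.0])]   # spoon
I = np.eye(3)
best, sims = ground(speech, objects, I, I)
print(best)  # → 0 (the mug)
```

Because grounding happens directly between speech and visual embeddings, no transcription step intervenes, which is the property the thesis exploits to avoid bias introduced by off-the-shelf speech-to-text models.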