SPEECH VS TEXTUAL DATA FOR GROUNDED LANGUAGE LEARNING

Date

2020-01-20

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

In this thesis, we investigate whether spoken (audio) data is compatible with a grounded language learning system originally developed for text-only input. This work lies at the intersection of natural language processing, speech, and robotics. First, we conduct in-person user studies to collect audio descriptions of household objects in a controlled environment. Throughout, we use the category-based grounded language learning system [pillai2018], which learns the meanings of words used in crowd-sourced descriptions by grounding them in the physical representations of the objects the workers describe. We compare the performance of the category-based model on the in-lab speech data against crowd-sourced text data, and find that the system learns color, object, and shape words with comparable performance in both settings. To expand the analysis, we collect natural language descriptions in both text and speech form for a variety of kitchen, office, and household items on Amazon Mechanical Turk (AMT), and perform an in-depth comparative and qualitative analysis of the two modalities. We compare the F1 scores of the tokens learned by the category-based model from the AMT speech and text data, and find that the F1 scores averaged over all individual learned tokens are comparable in the two cases, with no significant difference.
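The sketch below illustrates, at a high level, the evaluation the abstract describes; it is not the thesis code. It stands in for the category-based model with one binary classifier per token over object feature vectors, then averages per-token F1 scores for a "speech" and a "text" condition and compares them with a paired test. The tokens, feature dimensionality, synthetic data, classifier, and the use of a paired t-test are all illustrative assumptions.

```python
# Minimal sketch (not the thesis code): per-token grounding classifiers,
# per-token F1 for two training conditions, and a paired comparison.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
TOKENS = ["red", "blue", "yellow", "mug", "bowl", "cylindrical"]  # hypothetical
N_OBJECTS, N_FEATURES = 200, 16                                   # synthetic percepts

def per_token_f1(noise):
    """Train one binary classifier per token; return its held-out F1.
    `noise` stands in for modality-specific label noise (e.g., ASR errors)."""
    X = rng.normal(size=(N_OBJECTS, N_FEATURES))
    f1s = {}
    for i, token in enumerate(TOKENS):
        y = (X[:, i] > 0).astype(int)            # ground-truth grounding
        flip = rng.random(N_OBJECTS) < noise     # noisy training labels
        y_train = np.where(flip, 1 - y, y)
        clf = LogisticRegression(max_iter=1000).fit(X[:100], y_train[:100])
        f1s[token] = f1_score(y[100:], clf.predict(X[100:]))
    return f1s

f1_speech = per_token_f1(noise=0.10)  # speech: extra noise from transcription
f1_text   = per_token_f1(noise=0.05)

speech = [f1_speech[t] for t in TOKENS]
text   = [f1_text[t] for t in TOKENS]
print(f"mean F1 (speech): {np.mean(speech):.3f}")
print(f"mean F1 (text):   {np.mean(text):.3f}")

# Paired test over matched tokens; a large p-value is consistent with the
# "no significant difference" finding reported above.
stat, p = ttest_rel(speech, text)
print(f"paired t-test: t={stat:.3f}, p={p:.3f}")
```

Pairing the test on matched tokens (rather than comparing the two means directly) controls for per-token difficulty, since the same token set is learned in both conditions; the thesis may use a different statistical test.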