Multimodal Language Learning for Object Retrieval in Low Data Regimes in the Face of Missing Modalities

Darvish, KasraRaff, EdwardFerraro, FrancisMatuszek, CynthiaMultimodal Language Learning for Object Retrieval in Low Data Regimes in the Face of Missing ModalitiesOpenReview2023UMBC Discovery, Research, and Experimental Analysis of Malware Lab (DREAM Lab)UMBC Interactive Robotics and Language Lab (IRAL Lab)UMBC Interactive Robotics and Language LabUMBC Ebiquity Research GroupMy UniversityMy University2024-09-242024-09-242023-08-11enTextDarvish, Kasra, Edward Raff, Francis Ferraro, and Cynthia Matuszek. “Multimodal Language Learning for Object Retrieval in Low Data Regimes in the Face of Missing Modalities.” Transactions on Machine Learning Research, August 11, 2023. https://openreview.net/forum?id=cXa6Xdm0v7.http://hdl.handle.net/11603/3636917 pagesAttribution 4.0 Internationalhttps://creativecommons.org/licenses/by/4.0/Our study is motivated by robotics, where when dealing with robots or other physical systems, we often need to balance competing concerns of relying on complex, multimodal data coming from a variety of sensors with a general lack of large representative datasets. Despite the complexity of modern robotic platforms and the need for multimodal interaction, there has been little research on integrating more than two modalities in a low data regime with the real-world constraint that sensors fail due to obstructions or adverse conditions. In this work, we consider a case in which natural language is used as a retrieval query against objects, represented across multiple modalities, in a physical environment. We introduce extended multimodal alignment (EMMA), a method that learns to select the appropriate object while jointly refining modality-specific embeddings through a geometric (distance-based) loss. In contrast to prior work, our approach is able to incorporate an arbitrary number of views (modalities) of a particular piece of data. We demonstrate the efficacy of our model on a grounded language object retrieval scenario. We show that our model outperforms state-of-the-art baselines when little training data is available. Our code is available at https://github.com/kasraprime/EMMA.