From Perceptions to Meaning: Multimodal and Contrastive Machine Learning to Ground Human Intentions and Language

dc.contributor.advisor: Matuszek, Cynthia
dc.contributor.advisor: Ferraro, Francis
dc.contributor.author: Darvish, Kasra
dc.contributor.department: Computer Science and Electrical Engineering
dc.contributor.program: Computer Science
dc.date.accessioned: 2025-07-18T17:08:21Z
dc.date.issued: 2025-01-01
dc.description.abstract: The overarching theme of this thesis is making sense of senses—a journey from perception to meaning by teaching machines to uncover the latent connections between natural language and the physical world as perceived through multiple sensor modalities. Learning these connections is referred to as multimodal grounded language learning, and it allows AI agents and robots to interact more intuitively with their environments and communicate naturally with humans. By understanding how language aligns with the physical world, AI systems can go beyond recognizing objects and begin to infer high-level human intentions behind tasks—an essential step for building truly intelligent, interactive systems. The growing demand for personal robots as caretakers is just one example that illustrates the real-world importance of this goal. Training AI to interact effectively in such situations requires an understanding of objects through multiple sensor modalities and the language humans use to describe them. To facilitate this, I contributed to the development of a multimodal dataset of everyday objects and proposed approaches to multimodal grounding of language, including Extended Multimodal Alignment (EMMA), a model capable of integrating any number of modalities while remaining robust to sensor failures. EMMA not only sets a new state of the art by learning effectively from small amounts of data but also converges to optimal performance twice as fast. However, understanding objects is only part of the challenge—humans often communicate abstractly, omitting steps needed to perform a task or using indirect speech acts. For instance, when someone tells a robot, ‘the bathroom is dirty,’ the unspoken goal is ‘clean the bathroom.’ Inferring both the desired outcome and the sequence of actions required to achieve it is second nature to humans but a significant challenge for AI. To address this, I introduce Intentionality—a novel framework that grounds human intentions in tasks by learning to infer the underlying goal and the sequence of steps required to accomplish it, given natural language instructions and visual context. This thesis advances multimodal grounded language learning by enabling AI to process multiple modalities, making it more capable of interpreting human intentions and interacting intuitively with its environment.
dc.format: application/pdf
dc.genre: dissertation
dc.identifier: doi:10.13016/m2bwgi-sedl
dc.identifier.other: 13050
dc.identifier.uri: http://hdl.handle.net/11603/39387
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.source: Original File Name: Darvish_umbc_0434D_13050.pdf
dc.subject: Computer Vision
dc.subject: Deep Neural Networks
dc.subject: Human AI Interaction
dc.subject: Machine Learning
dc.subject: Multimodal Machine Learning
dc.subject: Natural Language Processing
dc.title: From Perceptions to Meaning: Multimodal and Contrastive Machine Learning to Ground Human Intentions and Language
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.
dcterms.accessRights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
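
The abstract above describes EMMA as contrastively aligning natural language with any number of sensor modalities while remaining robust to sensor failures. As a rough, self-contained sketch of that general idea only (not the thesis's actual EMMA model; the class name, dimensions, loss, and masking scheme below are all illustrative assumptions), the following PyTorch snippet projects each modality into a shared embedding space, applies an InfoNCE-style contrastive loss between modality pairs, and simply skips pairs involving a failed sensor:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultimodalAlignment(nn.Module):
    """Illustrative contrastive alignment of several modalities.

    Each modality is linearly projected into a shared embedding space;
    an InfoNCE-style loss pulls embeddings of the same object together
    across modalities and pushes mismatched pairs apart. Modalities
    reported as missing (e.g., a failed sensor) are masked out.
    """

    def __init__(self, input_dims, shared_dim=128, temperature=0.07):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, shared_dim) for d in input_dims)
        self.temperature = temperature

    def forward(self, features, present):
        # features: list of (batch, dim_m) tensors, one per modality
        # present: list of bools, False for a failed sensor
        embs = [F.normalize(enc(x), dim=-1) for enc, x in zip(self.encoders, features)]
        targets = torch.arange(features[0].shape[0])  # matched pairs lie on the diagonal
        loss, pairs = 0.0, 0
        for i in range(len(embs)):
            for j in range(i + 1, len(embs)):
                if not (present[i] and present[j]):
                    continue  # skip any pair involving a missing modality
                logits = embs[i] @ embs[j].T / self.temperature
                loss = loss + F.cross_entropy(logits, targets)
                pairs += 1
        return loss / max(pairs, 1)

# Usage: language, vision, and depth features for a batch of 4 objects,
# with the depth sensor treated as failed for this batch.
model = ToyMultimodalAlignment(input_dims=[300, 512, 64])
feats = [torch.randn(4, 300), torch.randn(4, 512), torch.randn(4, 64)]
print(model(feats, present=[True, True, False]))

Pairwise masking is just one plausible way to tolerate a missing modality; the dissertation itself describes EMMA's actual formulation.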

Files

Original bundle

Name: Darvish_umbc_0434D_13050.pdf
Size: 2.87 MB
Format: Adobe Portable Document Format

License bundle

Name: Darvish-Kasra_Open.pdf
Size: 260.85 KB
Format: Adobe Portable Document Format