Learning Natural Language from Probabilistic Perceptual Representations with Limited Resources

Computer Science and Electrical Engineering


Computer Science

Distribution Rights granted to UMBC by the author.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu


The advent of artificially intelligent technologies has created a pressing need to study how machines semantically comprehend perceptual experiences of the world. My thesis focuses on designing an integrated grounded language acquisition system that links linguistic and visual symbols by learning meaningful perceptual representations from a physically grounded world. Specifically, my research presents semantic models that holistically enhance language acquisition by enabling learning systems to construct concise, category-free language from visual content.

Definitive knowledge of a visual concept requires a precise understanding not only of its positive information (information that supports a valid inference about what a subject is) but also of its negative information (information about what a subject is not). Obtaining negative examples of language referents is a challenging problem: people tend to describe what is true of a particular situation (i.e., use positive information) rather than what is false of it, so negative data is rarely produced without explicit prompting. To address this problem in information acquisition, in the first work I employ semantically inferred linguistic information to overcome the difficulty of naturally finding negative perceptual data. More specifically, I devise mathematical models that draw the association between visual concepts such as "blue" and "not blue" by applying document similarity metrics to natural-language descriptions. My experiments show that such semantic measures are effective in choosing positive and negative samples for perceptual learning, thus reducing the need for explicit data collection. My research also explores the complexities involved in multimodal language–visual grounding tasks.
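The document-similarity selection described above might be sketched as follows. This is a minimal, hypothetical illustration using bag-of-words cosine similarity with a small stopword filter; the thesis's actual models, thresholds, and similarity metrics are not specified here, so every name and value in this sketch is an assumption.

```python
from collections import Counter
from math import sqrt

# Small stopword list (an assumption) so function-word overlap does not
# dominate the similarity score.
STOPWORDS = {"a", "an", "the", "on", "of", "in"}

def tokens(description):
    """Lowercased content-word counts of a description."""
    return Counter(w for w in description.lower().split() if w not in STOPWORDS)

def cosine_similarity(desc_a, desc_b):
    """Cosine similarity between the bag-of-words vectors of two descriptions."""
    a, b = tokens(desc_a), tokens(desc_b)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_negatives(target_desc, candidate_descs, threshold=0.2):
    """Treat candidates whose descriptions are sufficiently dissimilar from a
    target concept's description as negative perceptual examples."""
    return [c for c in candidate_descs if cosine_similarity(target_desc, c) < threshold]

candidates = [
    "a blue ceramic mug on the table",
    "a bright blue plastic cup",
    "a red wooden block",
    "a yellow banana on a plate",
]
# Descriptions that never mention "blue" fall below the threshold and are
# selected as negative examples for the concept.
negatives = select_negatives("a blue object", candidates)
```

The point of the sketch is only that linguistic similarity alone can separate likely positives from likely negatives, avoiding explicit collection of negative data.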
In the second work presented in this thesis, I quantify the complexity of the linguistic and visual observations associated with multimodal language acquisition, helping researchers make informed design decisions that affect grounded language learning performance. I employ entropy-based and compression-error-based metrics to quantify the diversity of visuo-linguistic grounding inputs. The results formalize the linguistic and visual complexity present in language acquisition tasks and provide insight into cross-modal grounding performance, keeping task success consistent across the subsequent works.

Subsequently, in the third work, I show how presenting visual content in a well-chosen order accelerates language acquisition and makes it more efficient. I demonstrate the benefits of carefully selecting representative and diverse samples from a pool of unlabeled visual representations, using active learning techniques to advance language acquisition. For this purpose, I utilize probabilistic clustering characteristics and point process modeling as active learning strategies. My research also explores the user-experience side of interactive learning in grounded language acquisition using a joint model of vision and language.

Finally, this research presents a unified generative method that infers meaningful, representational, latent visual embeddings for generalizing language acquisition. Such a generative approach helps grounded language acquisition move away from learning predefined categories and toward category-free learning. I tackle the problem of category-free visual language learning using unsupervised approaches. Experimental results indicate that the proposed methods are effective in building semantic, linguistic, and visual models and make grounded language acquisition more efficient.
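The kind of entropy-based diversity measure used in the second work can be illustrated with a minimal sketch. This is an assumed, simplified stand-in (Shannon entropy over a word distribution), not the thesis's exact metric; the example corpora and function name are hypothetical.

```python
from collections import Counter
from math import log2

def token_entropy(descriptions):
    """Shannon entropy (in bits) of the word distribution across a set of
    descriptions: a rough proxy for the linguistic diversity of the
    grounding input."""
    counts = Counter(w for d in descriptions for w in d.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

diverse = ["red ceramic mug", "blue plastic cup", "green rubber ball"]
repetitive = ["red mug", "red mug", "red mug"]
# A corpus that repeats the same few words carries less entropy (less
# linguistic diversity) than one with varied vocabulary.
```

Under such a measure, a higher score signals more varied linguistic input, which the thesis argues should inform how a grounded language learning task is designed.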