A Large Model’s Ability to Identify 3D Objects as a Function of Viewing Angle
Author/Creator
Jacob Rubinstein, Francis Ferraro, Cynthia Matuszek, Don Engel
Date
2024-01-01
Citation of Original Publication
Rubinstein, Jacob, Francis Ferraro, Cynthia Matuszek, and Don Engel. “A Large Model’s Ability to Identify 3D Objects as a Function of Viewing Angle.” In 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), 281–88. IEEE Computer Society, 2024. https://doi.org/10.1109/AIxVR59861.2024.00047.
Rights
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
Virtual reality is increasingly used to support embodied AI agents, such as robots, which frequently engage in ‘sim-to-real’ learning approaches. At the same time, tools such as large vision-and-language models offer new capabilities that tie into a wide variety of tasks. In order to understand how such agents can learn from simulated environments, we explore a language model’s ability to recover the type of object represented by a photorealistic 3D model as a function of the 3D perspective from which the model is viewed. We used photogrammetry to create 3D models of commonplace objects and rendered 2D images of these models from a fixed set of 420 virtual camera perspectives. A well-studied image-and-language model (CLIP) was used to match these images against text descriptions (i.e., prompts) of candidate object classes. Using multiple instances of various object classes, we studied which camera perspectives were most likely to return accurate text categorizations for each class of object.
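The core measurement described above is zero-shot image-text matching with CLIP. The paper does not publish its code here, so the following is only a minimal sketch of that step using the Hugging Face transformers API; the checkpoint choice, the rendered-image filename, and the candidate label prompts are all illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (not the authors' code): score one rendered view of a
# 3D model against candidate object-class prompts with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical 2D render of a photogrammetry model from one of the
# virtual camera perspectives.
image = Image.open("render_azimuth045_elevation30.png")
labels = ["a photo of a mug", "a photo of a shoe", "a photo of a chair"]

inputs = processor(text=labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
best = labels[probs.argmax().item()]
print(f"Predicted: {best} (p={probs.max().item():.3f})")
```

Repeating this scoring over every rendered perspective of every object instance would yield a per-viewpoint accuracy profile of the kind the abstract describes.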