Human and AI Interpretations of Photogrammetrically Captured Scenes

Date

2024-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

3D technologies are increasingly prevalent and powerful, fundamentally reshaping how we interpret and comprehend information. This additional modality changes the way both humans and AI perceive and interact with the scenes they are shown. This work explores this shift from multiple angles. Chapter 2 examines the ramifications of three-dimensional space for AI agents, while Chapter 3 explores how humans can use 3D techniques to improve collaboration on the preservation of cultural heritage.

Virtual reality is increasingly used to support embodied AI agents, such as robots, engaged in "sim-to-real" learning approaches. At the same time, tools such as large vision-and-language models offer new capabilities that apply to a wide variety of tasks. To understand how such agents can learn from simulated environments, Chapter 2 explores a language model's ability to recover the type of object represented by a photorealistic 3D model as a function of the 3D perspective from which the model is viewed. We used photogrammetry to create 3D models of commonplace objects and rendered 2D images of these models from a fixed set of 420 virtual camera perspectives. A well-studied image and language model (CLIP) was used to generate text (i.e., prompts) corresponding to these images. Using multiple instances of various object classes, we studied which camera perspectives were most likely to return accurate text categorizations for each class of object.

Affordable drones and geotagged photos have created many new opportunities for geospatial analysis, with application domains as diverse as historical preservation, national defense, and disaster response. In Chapter 3, we analyze a series of group work tasks comprising a project to index a cemetery with incomplete records of its older sections, while noting that many of these tasks are agnostic to the application domain. To prepare for the group work, hundreds of images are captured by a pre-programmed, low-altitude flight of a consumer-grade quadcopter. These images are then orthorectified to create a web-based map layer of sufficiently high resolution for group members to visually identify and annotate individual gravestones. Group members then visit the site in person and capture close-up and contextual geotagged photos using mobile phones. Contextual photos are framed such that their positions can be determined using the web-based map layer and visual landmarks. As on-site photos are captured, group members can work off-site to annotate the web-based map and link these annotations to a third-party website, findagrave.com, where they upload photos and enter metadata (e.g., names, dates, notes). Gravestones and other positions of interest that require further on-site action are marked as such on the map, and group members return to the site to take these actions. Notably, group members can participate in any number of tasks within the workflow, and different phases of work can happen in parallel for different parts of the cemetery.

Throughout this work, the focus is on understanding how a 2D image from a single perspective enables an agent (human or AI) to understand the 3D context of that image. The presence of key visual indicators, whether the stem of an apple or a tree behind a grave, is important for both humans and AI to comprehend the meaning afforded to them by their visual vantage point.
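
The per-perspective analysis summarized for Chapter 2 can be illustrated with a small sketch. It is not the thesis's implementation. It assumes, hypothetically, that each rendered image has already been reduced to a record of (object class, perspective identifier, predicted label). The function names, the record layout, and the sample data are all invented for illustration, and no model is called.

```python
from collections import defaultdict


def accuracy_by_perspective(records):
    """Return {(object_class, perspective_id): fraction of correct labels}.

    `records` is a hypothetical iterable of
    (object_class, perspective_id, predicted_label) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for object_class, perspective_id, predicted_label in records:
        key = (object_class, perspective_id)
        totals[key] += 1
        if predicted_label == object_class:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def best_perspectives(accuracy, object_class, top_n=3):
    """List the perspective ids with the highest accuracy for one class."""
    ranked = sorted(
        ((acc, pid) for (cls, pid), acc in accuracy.items() if cls == object_class),
        key=lambda pair: (-pair[0], pair[1]),  # accuracy descending, then id
    )
    return [pid for _, pid in ranked[:top_n]]


# Invented sample data: two apple instances seen from three perspectives.
sample_records = [
    ("apple", 0, "apple"),
    ("apple", 0, "apple"),
    ("apple", 1, "tomato"),
    ("apple", 1, "apple"),
    ("apple", 2, "tomato"),
    ("mug", 0, "mug"),
]
sample_accuracy = accuracy_by_perspective(sample_records)
print(best_perspectives(sample_accuracy, "apple", top_n=2))
```

In this sketch, ties in accuracy are broken by the lower perspective identifier. With 420 perspectives and several instances per class, the same tally would show which viewpoints most often yield a correct label for each object class.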