Using Text to Improve Classification of Man-Made Objects
Links to Files
Permanent Link
Author/Creator
Author/Creator ORCID
Date
2022-01-01
Type of Work
Department
Computer Science and Electrical Engineering
Program
Computer Science
Citation of Original Publication
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
Abstract
People identify man-made objects by their visual appearance and by the text on them, e.g., does a bottle say water or shampoo? We use text as an important visual cue to help distinguish between similar-looking objects. This thesis explores a novel joint model of visual appearance and textual cues for image classification. We perform this in three steps: (a) isolating an object in an input image; (b) extracting text from the image; (c) training a joint vision/text model. We simplify the task by extracting text separately and presenting it to the model in machine-readable form. Such a joint model has utility in many real-world challenges where language is interpreted through a sensory perception like vision or sound.
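As a concrete illustration (not the thesis's actual architecture), the three-step pipeline might be sketched as follows. The feature extractors and the nearest-centroid classifier below are hypothetical stand-ins: a real system would use a CNN backbone for visual features, an OCR engine to obtain the machine-readable text, and a trained joint model in place of the centroid rule.

```python
import numpy as np

def image_features(image: np.ndarray) -> np.ndarray:
    # Placeholder visual feature extractor: a coarse intensity histogram.
    # A real pipeline would use a learned backbone (e.g., a CNN) here.
    hist, _ = np.histogram(image, bins=8, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def text_features(text: str, vocab: list) -> np.ndarray:
    # Bag-of-words over a fixed vocabulary. The text is assumed to arrive
    # already machine-readable, as in the simplified task above.
    words = set(text.lower().split())
    return np.array([float(w in words) for w in vocab])

def joint_features(image: np.ndarray, text: str, vocab: list) -> np.ndarray:
    # Joint representation: a simple concatenation of the two modalities.
    return np.concatenate([image_features(image), text_features(text, vocab)])

class NearestCentroidJointClassifier:
    """Toy stand-in for the trained joint vision/text model."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = sorted(set(y))
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        return self

    def predict(self, x):
        # Assign the class whose centroid is closest in the joint space.
        return min(self.classes_,
                   key=lambda c: np.linalg.norm(x - self.centroids_[c]))
```

With visually identical bottles, the text features alone can separate the classes, which is exactly the cue the joint model is meant to exploit:

```python
vocab = ["water", "shampoo"]
img = np.full((4, 4), 0.5)  # same appearance for both products
clf = NearestCentroidJointClassifier().fit(
    [joint_features(img, "spring water", vocab),
     joint_features(img, "herbal shampoo", vocab)],
    ["water_bottle", "shampoo_bottle"])
clf.predict(joint_features(img, "pure water", vocab))  # → "water_bottle"
```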
The aim of the research is to understand whether visual percepts, when interpreted in the context of extracted language, yield better classification of image objects than pure vision alone. In conclusion, we show that joint classifier models can successfully make use of text present in images to classify objects, provided that the text extracted from the images is of high quality and the number of training images is proportional to the number of classification classes.