Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
dc.contributor.author | Khan, Md Azim | |
dc.contributor.author | Gangopadhyay, Aryya | |
dc.contributor.author | Wang, Jianwu | |
dc.contributor.author | Erbacher, Robert F. | |
dc.date.accessioned | 2025-04-23T20:31:54Z | |
dc.date.available | 2025-04-23T20:31:54Z | |
dc.date.issued | 2025-03-08 | |
dc.description.abstract | Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision-language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in real environments. This research presents a novel VLM framework that leverages frequency-domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT)-based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV). | |
dc.description.sponsorship | This work is supported by U.S. Army Grant No. W911NF2120076 | |
dc.description.uri | https://arxiv.org/abs/2503.06003 | |
dc.format.extent | 8 pages | |
dc.genre | journal articles | |
dc.genre | preprints | |
dc.identifier | doi:10.13016/m2ppw2-prf3 | |
dc.identifier.uri | https://doi.org/10.48550/arXiv.2503.06003 | |
dc.identifier.uri | http://hdl.handle.net/11603/38085 | |
dc.language.iso | en_US | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department | |
dc.relation.ispartof | UMBC Joint Center for Earth Systems Technology (JCET) | |
dc.relation.ispartof | UMBC Information Systems Department | |
dc.relation.ispartof | UMBC Student Collection | |
dc.relation.ispartof | UMBC Center for Accelerated Real Time Analysis | |
dc.relation.ispartof | UMBC Center for Real-time Distributed Sensing and Autonomy | |
dc.relation.ispartof | UMBC GESTAR II | |
dc.relation.ispartof | UMBC College of Engineering and Information Technology Dean's Office | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.rights | This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law. | |
dc.rights | Public Domain | |
dc.rights.uri | https://creativecommons.org/publicdomain/mark/1.0/ | |
dc.subject | UMBC Accelerated Cognitive Cybersecurity Laboratory | |
dc.subject | UMBC Center for Cybersecurity | |
dc.subject | UMBC Big Data Analytics Lab | |
dc.title | Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models | |
dc.type | Text | |
dcterms.creator | https://orcid.org/0000-0002-7553-7932 | |
dcterms.creator | https://orcid.org/0000-0002-9933-1170 | |
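The abstract describes adding DFT-based low-rank features while keeping the pretrained spatial weights frozen. A minimal NumPy sketch of that idea is below; the layer layout, dimensions, and the choice of combining the spatial path with a low-rank correction on the real part of the input's DFT are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # feature dimension and LoRA rank (illustrative values)

# Frozen pretrained spatial weight, retained intact as the abstract describes.
W0 = rng.standard_normal((d, d))

# Trainable low-rank factors (rank r) acting on frequency-domain features.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))  # zero-init B so the adapted layer starts identical to the pretrained one

def forward(x):
    """Spatial path plus a hypothetical DFT-based low-rank correction."""
    x_freq = np.fft.fft(x).real  # real part of the DFT of the input features
    return W0 @ x + B @ (A @ x_freq)

x = rng.standard_normal(d)
# With B zero-initialized, the adapted output equals the frozen spatial path.
assert np.allclose(forward(x), W0 @ x)
```

During training only `A` and `B` would be updated, so the parameter cost is 2·d·r per adapted layer rather than d², which is the usual LoRA efficiency argument carried over to the frequency domain.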