Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

dc.contributor.author: Khan, Md Azim
dc.contributor.author: Gangopadhyay, Aryya
dc.contributor.author: Wang, Jianwu
dc.contributor.author: Erbacher, Robert F.
dc.date.accessioned: 2025-04-23T20:31:54Z
dc.date.available: 2025-04-23T20:31:54Z
dc.date.issued: 2025-03-08
dc.description.abstract: Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision-language models (VLMs) have become essential tools for interpreting complex environments because they connect visual inputs with natural-language descriptions. However, these models often face computational challenges, especially when they must run efficiently in real-world environments. This research presents a novel VLM framework that leverages frequency-domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT)-based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model produces more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
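The abstract's core idea, keeping pretrained spatial weights frozen while learning only a low-rank (LoRA-style) update computed on DFT features of the input, can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the zero-initialization of the LoRA factor, and the use of DFT magnitudes as the frequency representation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for a single projection layer
d_in, d_out, rank = 64, 32, 4

# Frozen pretrained spatial weights (never updated during adaptation)
W = rng.standard_normal((d_out, d_in)) * 0.02

# LoRA factors: only these small matrices would be trained.
# B is zero-initialized so the adapted layer starts identical to the frozen one.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def frequency_features(x):
    """Map a spatial feature vector to real-valued DFT magnitude features."""
    return np.abs(np.fft.fft(x))

def forward(x, alpha=1.0):
    """Frozen spatial path plus a low-rank update applied to frequency features."""
    spatial = W @ x
    lora = (alpha / rank) * (B @ (A @ frequency_features(x)))
    return spatial + lora

x = rng.standard_normal(d_in)
y = forward(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen pretrained output exactly; training then moves only the `rank * (d_in + d_out)` LoRA parameters, which is what makes the adaptation lightweight.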
dc.description.sponsorship: This work is supported by U.S. Army Grant No. W911NF2120076
dc.description.uri: https://arxiv.org/abs/2503.06003
dc.format.extent: 8 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2ppw2-prf3
dc.identifier.uri: https://doi.org/10.48550/arXiv.2503.06003
dc.identifier.uri: http://hdl.handle.net/11603/38085
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof: UMBC Joint Center for Earth Systems Technology (JCET)
dc.relation.ispartof: UMBC Information Systems Department
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Center for Accelerated Real Time Analytics
dc.relation.ispartof: UMBC Center for Real-time Distributed Sensing and Autonomy
dc.relation.ispartof: UMBC GESTAR II
dc.relation.ispartof: UMBC College of Engineering and Information Technology Dean's Office
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This work was written as part of one of the author's official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. Law.
dc.rights: Public Domain
dc.rights.uri: https://creativecommons.org/publicdomain/mark/1.0/
dc.subject: UMBC Accelerated Cognitive Cybersecurity Laboratory
dc.subject: UMBC Center for Cybersecurity
dc.subject: UMBC Big Data Analytics Lab
dc.title: Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-7553-7932
dcterms.creator: https://orcid.org/0000-0002-9933-1170

Files

Original bundle

Name: 2503.06003v1.pdf
Size: 2.71 MB
Format: Adobe Portable Document Format