A Survey on Efficient Vision-Language Models

dc.contributor.author: Shinde, Gaurav
dc.contributor.author: Ravi, Anuradha
dc.contributor.author: Dey, Emon
dc.contributor.author: Sakib, Shadman
dc.contributor.author: Rampure, Milind
dc.contributor.author: Roy, Nirmalya
dc.date.accessioned: 2025-06-05T14:02:46Z
dc.date.available: 2025-06-05T14:02:46Z
dc.date.issued: 2025-04-13
dc.description.abstract: Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
dc.description.sponsorship: This work has been partially supported by NSF CAREER Award 1750936, NSF REU Site Grant 2050999, NSF CNS EAGER Grant 2233879, and ONR Grant N000142312119.
dc.description.uri: http://arxiv.org/abs/2504.09724
dc.format.extent: 35 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier: doi:10.13016/m2vhh3-8yac
dc.identifier.uri: https://doi.org/10.48550/arXiv.2504.09724
dc.identifier.uri: http://hdl.handle.net/11603/38594
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Information Systems Department
dc.relation.ispartof: UMBC Center for Real-time Distributed Sensing and Autonomy
dc.relation.ispartof: UMBC Student Collection
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: UMBC Mobile, Pervasive and Sensor Computing Lab (MPSC Lab)
dc.subject: Computer Science - Computer Vision and Pattern Recognition
dc.title: A Survey on Efficient Vision-Language Models
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-1290-0378
dcterms.creator: https://orcid.org/0009-0007-5268-4562

Files

Original bundle

Name: 2504.09724v1.pdf
Size: 1.34 MB
Format: Adobe Portable Document Format