Authors: Shinde, Gaurav; Ravi, Anuradha; Dey, Emon; Sakib, Shadman; Rampure, Milind; Roy, Nirmalya
Date Accessioned: 2025-06-05
Date Available: 2025-06-05
Date Issued: 2025-04-13
DOI: https://doi.org/10.48550/arXiv.2504.09724
Handle: http://hdl.handle.net/11603/38594
Abstract: Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, which makes them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications, motivating a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we maintain a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
Extent: 35 pages
Language: en-US
License: Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/)
Lab: UMBC Mobile, Pervasive and Sensor Computing Lab (MPSC Lab)
Subject: Computer Science - Computer Vision and Pattern Recognition
Title: A Survey on Efficient Vision-Language Models
Type: Text