Towards Comprehensive Benchmarking of Medical Vision Language Models

Khatri, Dimple; Gupta, Sanjan TP

dc.contributor.author	Khatri, Dimple
dc.contributor.author	Gupta, Sanjan TP
dc.date.accessioned	2026-01-22T16:19:02Z
dc.date.issued	2025-12-01
dc.description.abstract	Medical imaging workflows integrate radiology images with their corresponding free-text reports. Large language models (LLMs) and large vision–language models (LVLMs) achieve strong results but face deployment barriers in hospitals due to computational demands, privacy risks and infrastructure needs. Small language models (SLMs) and small vision–language models (SVLMs), typically under 10 billion parameters, provide a more efficient and auditable alternative for on-premise, privacy-preserving applications in radiology. Recent advancements, including CheXzero, MedCLIP, XrayGPT, LLaVA-Med, MedFILIP and MedBridge, show that smaller multimodal models support classification, retrieval and report generation. Complementary baselines from lightweight SLMs such as DistilBERT, TinyBERT, BioClinicalBERT and T5-Small highlight opportunities for radiology report understanding. Building on these efforts, we propose a reproducible evaluation framework anchored on IU-CXR (the Indiana University Chest X-ray dataset), with potential extensions to CT, MRI and ophthalmology datasets. Our framework integrates task metrics such as ROUGE, F1-score and AUROC with efficiency measures including VRAM usage, latency and model size, alongside trust dimensions such as factuality, bias and robustness. We also conduct ablation studies on model architecture, tokenizers and parameter-efficient fine-tuning (e.g. QLoRA), and analyze trade-offs between accuracy, efficiency and stability. This work establishes reproducible baselines and guidance for deploying radiology AI, while also advancing open-source research (available at https://github.com/dimplek0424/MedVLMBenchPhase1).
dc.description.uri	https://academic.oup.com/bib/article/26/Supplement_1/i44/8378055
dc.format.extent	2 pages
dc.genre	journal articles
dc.identifier	doi:10.13016/m2s6gm-0kh6
dc.identifier.citation	Khatri, Dimple, and Sanjan TP Gupta. “Towards Comprehensive Benchmarking of Medical Vision Language Models.” Briefings in Bioinformatics 26, no. Supplement_1 (2025): i44-45. https://doi.org/10.1093/bib/bbaf631.077.
dc.identifier.uri	https://doi.org/10.1093/bib/bbaf631.077
dc.identifier.uri	http://hdl.handle.net/11603/41534
dc.language.iso	en
dc.publisher	Oxford Academic
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof	UMBC Student Collection
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.title	Towards Comprehensive Benchmarking of Medical Vision Language Models
dc.type	Text
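
The abstract names F1-score and AUROC among its task metrics and latency among its efficiency measures. The following is a minimal sketch of how such measurements might be computed for a binary finding classifier, using only the Python standard library. It is not taken from the authors' repository: the function names `f1_score`, `auroc` and `mean_latency_ms` and the toy labels are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a sketch but slow for large test sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_latency_ms(
    predict: Callable[[object], object],
    inputs: Sequence[object],
    repeats: int = 3,
) -> float:
    """Average wall-clock time per call to `predict`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(inputs))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration only.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    preds = [1 if s >= 0.5 else 0 for s in scores]
    print("F1:", f1_score(labels, preds))
    print("AUROC:", auroc(labels, scores))
    print("latency (ms):", mean_latency_ms(lambda x: x, scores))
```

Text-generation metrics such as ROUGE and GPU measurements such as VRAM usage depend on the model and runtime in use, so they are left out of this sketch.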

Files

Original bundle

Name: bbaf631.035.pdf
Size: 267.92 KB
Format: Adobe Portable Document Format
