Enhancing vision-language models for medical imaging: bridging the 3D gap with innovative slice selection

Wang, Yuli; Jian, Peng; Dai, Yuwei; Jones, Craig; Sair, Haris I.; Shen, Jinglai; Loizou, Nicolas; Wu, Jing; Hsu, Wen-Chi; Imami, Maliha Rubaiyat; Jiao, Zhicheng; Zhang, Paul J.; Bai, Harrison

dc.contributor.author	Wang, Yuli
dc.contributor.author	Jian, Peng
dc.contributor.author	Dai, Yuwei
dc.contributor.author	Jones, Craig
dc.contributor.author	Sair, Haris I.
dc.contributor.author	Shen, Jinglai
dc.contributor.author	Loizou, Nicolas
dc.contributor.author	Wu, Jing
dc.contributor.author	Hsu, Wen-Chi
dc.contributor.author	Imami, Maliha Rubaiyat
dc.contributor.author	Jiao, Zhicheng
dc.contributor.author	Zhang, Paul J.
dc.contributor.author	Bai, Harrison
dc.date.accessioned	2024-12-11T17:01:58Z
dc.date.available	2024-12-11T17:01:58Z
dc.date.issued	2024-11-13
dc.description	38th Conference on Neural Information Processing Systems (NeurIPS 2024)
dc.description.abstract	Recent approaches to vision-language tasks are built on the remarkable capabilities of large vision-language models (VLMs). These models excel in zero-shot and few-shot learning, enabling them to learn new tasks without parameter updates. However, their main limitation lies in their design, which primarily accommodates 2D input, thus limiting their effectiveness for medical images, particularly radiological images like MRI and CT, which are typically 3D. To bridge the gap between state-of-the-art 2D VLMs and 3D medical image data, we developed an innovative, one-pass, unsupervised representative slice selection method called Vote-MI, which selects representative 2D slices from 3D medical images. To evaluate the effectiveness of Vote-MI when implemented with VLMs, we introduce BrainMD, a robust, multimodal dataset comprising 2,453 annotated 3D MRI brain scans with corresponding textual radiology reports and electronic health records. Based on BrainMD, we further develop two benchmarks, BrainMD-select (including the most representative 2D slice of each 3D image) and BrainBench (including various vision-language downstream tasks). Extensive experiments on the BrainMD dataset and its two corresponding benchmarks demonstrate that our representative selection method significantly improves performance in zero-shot and few-shot learning tasks. On average, Vote-MI achieves a 14.6% and 16.6% absolute gain for zero-shot and few-shot learning, respectively, compared to randomly selecting examples. Our studies represent a significant step toward integrating AI in medical imaging to enhance patient care and facilitate medical research. We hope this work will serve as a foundation for data selection as vision-language models are increasingly applied to new tasks.
dc.description.sponsorship	This publication was made possible by the Johns Hopkins Institute for Clinical and Translational Research (ICTR), which is funded in part by Grant Number 1UM1TR004926-01 from the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of the Johns Hopkins ICTR, NCATS or NIH.
dc.description.uri	https://openreview.net/forum?id=JrJW21IP9p#discussion
dc.format.extent	18 pages
dc.genre	conference papers and proceedings
dc.genre	preprints
dc.identifier	doi:10.13016/m2bgpf-o4y8
dc.identifier.citation	Wang, Yuli, Peng Jian, Yuwei Dai, Craig Jones, Haris I. Sair, Jinglai Shen, Nicolas Loizou, et al. “Enhancing Vision-Language Models for Medical Imaging: Bridging the 3D Gap with Innovative Slice Selection,” 2024. https://openreview.net/forum?id=JrJW21IP9p#discussion.
dc.identifier.uri	http://hdl.handle.net/11603/37012
dc.language.iso	en
dc.publisher	OpenReview
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Mathematics and Statistics Department
dc.relation.ispartof	UMBC Faculty Collection
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.title	Enhancing vision-language models for medical imaging: bridging the 3D gap with innovative slice selection
dc.type	Text
dcterms.creator	https://orcid.org/0000-0003-2172-4182

Files

Original bundle

Name: 1905_Enhancing_vision_language.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format

Name: 1905_Enhancing_vision_language_SupplementaryMaterial.pdf
Size: 5.87 MB
Format: Adobe Portable Document Format

Collections

UMBC Mathematics and Statistics Department
UMBC Faculty Collection