DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation
dc.contributor.author | Saha, Shaswati | |
dc.contributor.author | Purushotham, Sanjay | |
dc.date.accessioned | 2023-12-12T17:06:11Z | |
dc.date.available | 2023-12-12T17:06:11Z | |
dc.date.issued | 2023 | |
dc.description.abstract | This paper describes the participation of the UMBCVQA team in the Medical Instructional Question Generation (MIQG) task of the MedVidQA challenge at the TREC Video Retrieval Evaluation (TRECVID 2023). The goal of the MIQG task is to generate instructional questions for which the given medical video segment serves as the visual answer. We propose DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and its corresponding subtitle text as input and generates a natural language question as output. DEEP-CAM first extracts visual features from the videos and textual embeddings from the subtitles corresponding to the video frames, simultaneously learning the attention for both the text and the video frames. These jointly attended features are then passed through an LSTM-based decoder to generate instructional questions based on the provided video frames. • Training data: We used 800 videos with 2,710 questions from the MedVidQA dataset [8]. In addition, we extracted and used time-stamped subtitles for either the entire video or video segments. • Our approach: We proposed DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and subtitle text to generate an instructional question. • Runs: We submitted two runs to the challenge. The key difference between them is that in Run 1 we utilized the timed subtitles, while in Run 2 we provided the entire subtitle of a video to our model. • Results: We found that Run 1 outperforms Run 2 on all metrics, including ROUGE-2 [16], ROUGE-L [16], and BERTScore [24]. | |
dc.description.uri | https://www-nlpir.nist.gov/projects/tvpubs/tv23.papers/umbcvqa.pdf | |
dc.format.extent | 5 pages | |
dc.genre | journal articles | |
dc.genre | preprints | |
dc.identifier.uri | http://hdl.handle.net/11603/31049 | |
dc.language.iso | en_US | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Information Systems Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
dc.title | DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation | |
dc.type | Text |