DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation

dc.contributor.author: Saha, Shaswati
dc.contributor.author: Purushotham, Sanjay
dc.date.accessioned: 2023-12-12T17:06:11Z
dc.date.available: 2023-12-12T17:06:11Z
dc.date.issued: 2023
dc.description.abstract: This paper describes the participation of the UMBCVQA team in the Medical Instructional Question Generation (MIQG) task of the MedVidQA challenge at TREC Video Retrieval Evaluation (TRECVID 2023). The goal of the MIQG task is to generate instructional questions for which the given medical video segment serves as the visual answer. We propose DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and its corresponding subtitle text as input and generates a natural language question as output. DEEP-CAM first extracts visual features from the videos and textual embeddings from the subtitles corresponding to the video frames, simultaneously learning attention over both the text and the video frames. These jointly attended features are then passed through an LSTM-based decoder to generate instructional questions grounded in the provided video frames.
• Training data: We used 800 videos with 2710 questions from the MedVidQA dataset [8]. In addition, we extracted and used time-stamped subtitles for either the entire video or individual video segments.
• Our approach: We proposed DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and its subtitle text and generates an instructional question (see the illustrative sketch after this metadata record).
• Runs: We submitted two runs to the challenge. The key difference between our submitted runs is that in Run 1 we used the time-stamped subtitles, while in Run 2 we provided the entire subtitle of a video to our model.
• Results: We found that Run 1 outperforms Run 2 on all metrics, including ROUGE-2 [16], ROUGE-L [16], and BERTScore [24].
dc.description.uri: https://www-nlpir.nist.gov/projects/tvpubs/tv23.papers/umbcvqa.pdf
dc.format.extent: 5 pages
dc.genre: journal articles
dc.genre: preprints
dc.identifier.uri: http://hdl.handle.net/11603/31049
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Information Systems Department Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.title: DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation
dc.type: Text
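
Illustrative sketch. The following is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: pre-extracted frame features and subtitle embeddings cross-attend to each other in both directions, and the jointly attended features seed an LSTM decoder that generates the question token by token. All module choices, dimensions, and the timed-subtitle helper are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

def timed_subtitles_for_segment(subs, start, end):
    # Run 1-style input (assumed subtitle representation): keep only the
    # subtitle entries whose time span overlaps the answer segment [start, end].
    return [s for s in subs if s["end"] > start and s["start"] < end]

class CrossAttentionQG(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, hid_dim=256, vocab_size=10000):
        super().__init__()
        # Project both modalities into a shared hidden space.
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        # Cross-attention in both directions: subtitles attend to frames,
        # and frames attend to subtitles.
        self.txt2vis = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        self.vis2txt = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        # LSTM decoder that emits the question one token at a time.
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats, subtitle_embs, question_ids):
        # frame_feats:   (B, num_frames, vis_dim)  pre-extracted visual features
        # subtitle_embs: (B, num_tokens, txt_dim)  subtitle token embeddings
        # question_ids:  (B, q_len)                target question (teacher forcing)
        v = self.vis_proj(frame_feats)
        t = self.txt_proj(subtitle_embs)
        # Jointly attended features: each modality queries the other.
        t_att, _ = self.txt2vis(query=t, key=v, value=v)
        v_att, _ = self.vis2txt(query=v, key=t, value=t)
        # Fuse along the sequence axis; mean-pool to seed the decoder state.
        fused = torch.cat([t_att, v_att], dim=1)
        h0 = fused.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(question_ids), (h0, c0))
        return self.out(dec_out)  # (B, q_len, vocab_size) token logits

# Toy usage with random tensors standing in for real features.
model = CrossAttentionQG()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 40, 300),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

Seeding the decoder from a pooled summary of the fused features is one simple design choice; a decoder that re-attends over the fused sequence at every step would be equally consistent with the abstract's description.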

Files

Original bundle
Name: umbcvqa.pdf
Size: 6.39 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.56 KB
Format: Item-specific license agreed upon to submission