DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation
dc.contributor.author | Saha, Shaswati | |
dc.contributor.author | Purushotham, Sanjay | |
dc.date.accessioned | 2023-12-12T17:06:11Z | |
dc.date.available | 2023-12-12T17:06:11Z | |
dc.date.issued | 2023 | |
dc.description.abstract | This paper describes the participation of the UMBCVQA team in the Medical Instructional Question Generation (MIQG) task of the MedVidQA challenge at the TREC Video Retrieval Evaluation (TRECVID 2023). The goal of the MIQG task is to generate instructional questions for which the given medical video segment serves as the visual answer. We propose DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and its corresponding subtitle text as input and generates a natural language question as output. DEEP-CAM first extracts visual features from the videos and textual embeddings from the subtitles corresponding to the video frames, simultaneously learning the attention for both the text and the video frames. These jointly attended features are then passed through an LSTM-based decoder to generate instructional questions based on the provided video frames. • Training data: We used 800 videos with 2,710 questions from the MedVidQA dataset [8]. In addition, we extracted and used time-stamped subtitles for either the entire video or video segments. • Our approach: We proposed DEEP-CAM, a deep spatio-temporal, cross-modality, and cross-attention encoder-decoder model that takes a medical video segment and subtitle text to generate an instructional question. • Runs: We submitted two runs to the challenge. The key difference between them is that in Run 1 we utilized the timed subtitles, while in Run 2 we provided the entire subtitle of a video to our model. • Results: We found that Run 1 outperforms Run 2 on all metrics, including ROUGE-2 [16], ROUGE-L [16], and BERTScore [24]. | |
dc.description.uri | https://www-nlpir.nist.gov/projects/tvpubs/tv23.papers/umbcvqa.pdf | |
dc.format.extent | 5 pages | |
dc.genre | journal articles | |
dc.genre | preprints | |
dc.identifier.uri | http://hdl.handle.net/11603/31049 | |
dc.language.iso | en_US | |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Information Systems Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.relation.ispartof | UMBC Student Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | |
dc.title | DEEP-CAM: Attention based Multi-modal Deep Learning Models for Medical Instructional Question Generation | |
dc.type | Text |