MedFuseNet: Attention-based Multimodal deep learning model for Visual Question Answering in the Medical Domain
Date
2021
Citation of Original Publication
Sharma, Dhruv; Purushotham, Sanjay; Reddy, Chandan K.; MedFuseNet: Attention-based Multimodal deep learning model for Visual Question Answering in the Medical Domain; Virginia Polytechnic Institute and State University, 2021; https://people.cs.vt.edu/~reddy/papers/NSR21.pdf
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Abstract
Medical images are difficult to comprehend for a person without expertise. The limited number of practitioners across the globe often face physical and mental fatigue due to high caseloads, which induces human error during diagnosis. In such scenarios, an additional opinion can help boost the confidence of the decision-maker. It is therefore crucial to have a reliable Visual Question Answering (VQA) system that can provide a 'second opinion' on medical cases. However, most existing VQA systems target general real-world images and are not specifically tailored to medical images. Moreover, a VQA system for medical images must cope with the limited amount of training data available in this domain. In this paper, we develop MedFuseNet, an attention-based multimodal deep learning model, for VQA on medical images that takes these challenges into account. MedFuseNet aims to maximize learning with minimal complexity by breaking the problem into simpler sub-tasks and then predicting the answer. We tackle two types of answer prediction: answer categorization and answer generation. We conducted an extensive set of quantitative and qualitative analyses to evaluate the performance of MedFuseNet. Our experiments demonstrate that MedFuseNet outperforms state-of-the-art VQA methods, and visualizations of the captured attention showcase the interpretability of our model's predicted results.