Fusion of Vision Transformer and Convolutional Neural Network for Explainable and Efficient Histopathological Image Classification in Cyber-Physical Healthcare Systems
Author/Creator
Rahman, Mohammad Ishtiaque
Date
2025
Citation of Original Publication
Rahman, Mohammad Ishtiaque. “Fusion of Vision Transformer and Convolutional Neural Network for Explainable and Efficient Histopathological Image Classification in Cyber-Physical Healthcare Systems.” Journal of Transformative Technologies and Sustainable Development 9, no. 1 (2025): 8. https://doi.org/10.1007/s41314-025-00079-0.
Rights
Attribution 4.0 International (CC BY 4.0)
Abstract
Accurate and interpretable classification of breast cancer histopathology images is critical for early diagnosis and treatment planning. This study proposes a hybrid deep learning model that integrates convolutional neural networks (CNNs) with a Vision Transformer (ViT) to jointly capture local texture patterns and global contextual features. The fusion architecture is evaluated on two publicly available datasets: BreakHis and the invasive ductal carcinoma (IDC) dataset. Results demonstrate that the ViT+CNN model consistently outperforms standalone CNN and ViT models, achieving state-of-the-art accuracy while maintaining robustness across datasets. To assess the feasibility of deployment in real-world clinical scenarios, we benchmark inference latency and memory usage under both standard and edge-constrained environments. Although the fusion model has higher computational cost, its latency remains within acceptable thresholds for real-time diagnostic workflows. Furthermore, we enhance interpretability by combining Grad-CAM with attention rollout, allowing for transparent visual explanation of the model’s decisions. The findings support the clinical potential of hybrid transformer-convolutional models for scalable, reliable, and explainable medical image analysis.
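As a rough illustration of the fusion idea summarized above, the sketch below concatenates a CNN's globally pooled feature vector (local texture cues) with a ViT's class-token embedding (global context) before a joint classifier. This is a minimal PyTorch sketch under assumed choices, not the published architecture: the ResNet-18 and ViT-B/16 backbones, the 256-unit fusion head, and the binary benign/malignant output are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNViTFusion(nn.Module):
    """Illustrative feature-level fusion of a CNN and a ViT backbone.

    A sketch of the general ViT+CNN fusion idea, not the authors'
    architecture; backbones and dimensions are assumptions.
    """

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # CNN branch: ResNet-18 trunk, global-average-pooled to a 512-d vector.
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the fc head
        # ViT branch: ViT-B/16; replacing the classification head with Identity
        # exposes the 768-d class-token embedding.
        self.vit = models.vit_b_16(weights=None)
        self.vit.heads = nn.Identity()
        # Joint classifier over the concatenated local + global features.
        self.classifier = nn.Sequential(
            nn.Linear(512 + 768, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.cnn(x).flatten(1)   # (B, 512) local texture features
        global_feat = self.vit(x)             # (B, 768) global context features
        return self.classifier(torch.cat([local_feat, global_feat], dim=1))

if __name__ == "__main__":
    model = CNNViTFusion(num_classes=2)       # e.g., benign vs. malignant
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)                       # torch.Size([1, 2])
```

Feature-level concatenation is only one fusion strategy; the same skeleton accommodates weighted or attention-based fusion by replacing the torch.cat step, and the Grad-CAM and attention-rollout explanations mentioned in the abstract would attach to the CNN and ViT branches, respectively.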
