Fusion of Vision Transformer and Convolutional Neural Network for Explainable and Efficient Histopathological Image Classification in Cyber-Physical Healthcare Systems

Citation of Original Publication

Rahman, Mohammad Ishtiaque. “Fusion of Vision Transformer and Convolutional Neural Network for Explainable and Efficient Histopathological Image Classification in Cyber-Physical Healthcare Systems.” Journal of Transformative Technologies and Sustainable Development 9, no. 1 (2025): 8. https://doi.org/10.1007/s41314-025-00079-0.

Rights

Attribution 4.0 International (CC BY 4.0)

Abstract

Accurate and interpretable classification of breast cancer histopathology images is critical for early diagnosis and treatment planning. This study proposes a hybrid deep learning model that integrates convolutional neural networks (CNNs) with a Vision Transformer (ViT) to jointly capture local texture patterns and global contextual features. The fusion architecture is evaluated on two publicly available datasets: BreakHis and the invasive ductal carcinoma (IDC) dataset. Results show that the ViT+CNN fusion model consistently outperforms standalone CNN and ViT baselines, achieving state-of-the-art accuracy while remaining robust across datasets. To assess feasibility for real-world clinical deployment, we benchmark inference latency and memory usage in both standard and edge-constrained environments. Although the fusion model incurs a higher computational cost, its latency remains within acceptable thresholds for real-time diagnostic workflows. Furthermore, we enhance interpretability by combining Grad-CAM with attention rollout, enabling transparent visual explanations of the model's decisions. The findings support the clinical potential of hybrid transformer-convolutional models for scalable, reliable, and explainable medical image analysis.
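
The abstract describes the hybrid design at a high level but does not pin down the backbones or the fusion mechanism. Below is a minimal late-fusion sketch in PyTorch, assuming ResNet-18 and torchvision's ViT-B/16 as stand-in branches and simple feature concatenation; all of these are illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, vit_b_16  # torchvision >= 0.13 API

class FusionClassifier(nn.Module):
    """Illustrative late-fusion ViT+CNN classifier (a sketch, not the paper's exact model)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # CNN branch: ResNet-18 trunk without its FC head -> 512-d pooled features.
        cnn = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])
        # ViT branch: ViT-B/16 with its classification head replaced by Identity,
        # exposing the 768-d CLS-token embedding.
        vit = vit_b_16(weights=None)
        vit.heads = nn.Identity()
        self.vit = vit
        # Fusion head over concatenated local (CNN) and global (ViT) features.
        self.head = nn.Linear(512 + 768, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feats = self.cnn(x).flatten(1)   # (B, 512) texture-oriented features
        global_feats = self.vit(x)             # (B, 768) context-oriented features
        return self.head(torch.cat([local_feats, global_feats], dim=1))

# Smoke test on 224x224 RGB patches (the input size ViT-B/16 expects).
model = FusionClassifier(num_classes=2)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])
```

Concatenation followed by a linear head is the simplest fusion strategy; attention-based or gated fusion would be drop-in alternatives at the same point in the network.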
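
The latency and memory benchmarking mentioned in the abstract can be approximated with a simple timing loop. The sketch below uses `time.perf_counter` and PyTorch's CUDA memory counters; the warm-up count, iteration count, and batch shape are illustrative assumptions, not the paper's measurement protocol.

```python
import time
import torch
from torchvision.models import resnet18

@torch.no_grad()
def benchmark(model: torch.nn.Module, batch: torch.Tensor,
              warmup: int = 5, iters: int = 20):
    """Rough latency/peak-memory probe (illustrative; not the paper's protocol)."""
    model.eval()
    for _ in range(warmup):          # warm-up passes to stabilize caches
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / iters * 1000
    peak_mb = (torch.cuda.max_memory_allocated() / 2**20
               if batch.is_cuda else float("nan"))  # CPU peak memory not tracked here
    return latency_ms, peak_mb

# Example: single-image latency for a stand-in backbone (any nn.Module works,
# including the FusionClassifier sketch above); CPU stands in for an edge device.
latency_ms, _ = benchmark(resnet18(weights=None), torch.randn(1, 3, 224, 224))
print(f"{latency_ms:.1f} ms/inference")
```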
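
For the interpretability step, attention rollout (Abnar and Zuidema, 2020) composes head-averaged attention matrices across transformer layers to estimate CLS-to-patch relevance, which can then be combined with a Grad-CAM heatmap from the CNN branch (e.g., upsampled and multiplied element-wise). A sketch of the rollout computation follows, assuming a 197-token ViT-B/16 layout and a 0.5 residual weight; both are illustrative assumptions, not values from the paper.

```python
import torch

def attention_rollout(attn_maps, residual_alpha: float = 0.5):
    """Attention rollout: compose per-layer, head-averaged attention matrices,
    mixing in an identity term to account for residual connections.

    attn_maps: list of (B, heads, tokens, tokens) tensors, one per layer.
    Returns a (B, tokens, tokens) rollout matrix; row 0 is CLS-token relevance.
    """
    B, _, T, _ = attn_maps[0].shape
    rollout = torch.eye(T).expand(B, T, T).clone()
    for attn in attn_maps:
        a = attn.mean(dim=1)                                          # average heads
        a = residual_alpha * a + (1 - residual_alpha) * torch.eye(T)  # residual flow
        a = a / a.sum(dim=-1, keepdim=True)                           # renormalize rows
        rollout = torch.bmm(a, rollout)                               # compose layers
    return rollout

# CLS-to-patch relevance for a hypothetical 12-layer, 12-head ViT-B/16
# (197 tokens = 1 CLS + 14x14 patches), here on random attention maps.
maps = [torch.softmax(torch.randn(1, 12, 197, 197), dim=-1) for _ in range(12)]
cls_relevance = attention_rollout(maps)[:, 0, 1:].reshape(1, 14, 14)
```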