The CNN and ViT Fusion Model Based on Hierarchical Adaptive Token Refinement Method in Pneumonia X-ray Image Classification

Peiqi Zhang

2025

Abstract

Convolutional Neural Networks (CNN) and Vision Transformers (ViT) each have their own advantages in medical image analysis, particularly in the automatic classification of X-ray images, and many studies have explored how to combine the two effectively. This paper proposes a CNN and ViT fusion method, Hierarchical Adaptive Token Refinement (HATR), which combines the local feature extraction capability of CNN with the global modeling ability of ViT. The experimental results show that the fusion model based on ResNet (HATR-ResNet) reaches an accuracy of 91.4%, which is 4.1 percentage points higher than ResNet alone (87.3%). The fusion model based on Conv2D (HATR-Conv2D) reaches an accuracy of 88.2%, which is 5.5 percentage points higher than Conv2D alone (82.7%). The superiority of HATR-ResNet stems from the deep residual structure of ResNet, which can better extract complex features and capture details, whereas the shallower Conv2D network is less capable of handling complex patterns. This study proposes a new fusion method for CNN and ViT and compares the performance of fusion models built on different CNN backbones. It contributes to subsequent research on new model structures and the exploration of new fusion methods.
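The accuracy figures reported in the abstract can be compared both in absolute terms (percentage points) and in relative terms (percent of the baseline). The following is a minimal sketch of that arithmetic only; the dictionary layout and the helper name `accuracy_gain` are illustrative assumptions and are not taken from the paper.

```python
# Accuracy values (in percent) as reported in the abstract.
# The data layout and function name are hypothetical, for illustration only.
reported = {
    "ResNet": {"baseline": 87.3, "fused": 91.4},
    "Conv2D": {"baseline": 82.7, "fused": 88.2},
}


def accuracy_gain(baseline, fused):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = fused - baseline
    relative = 100.0 * absolute / baseline
    # Round to one decimal place, matching the precision of the reported figures.
    return round(absolute, 1), round(relative, 1)


gains = {
    name: accuracy_gain(values["baseline"], values["fused"])
    for name, values in reported.items()
}

for name, (abs_points, rel_percent) in gains.items():
    print(f"HATR-{name}: +{abs_points} points ({rel_percent}% relative to {name} alone)")
```

Under these figures, the Conv2D backbone gains more from fusion than ResNet does, even though HATR-ResNet has the higher final accuracy.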


Paper Citation


in Harvard Style

Zhang P. (2025). The CNN and ViT Fusion Model Based on Hierarchical Adaptive Token Refinement Method in Pneumonia X-ray Image Classification. In Proceedings of the 2nd International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI; ISBN 978-989-758-792-4, SciTePress, pages 504-510. DOI: 10.5220/0014361900004718


in Bibtex Style

@conference{emiti25,
author={Peiqi Zhang},
title={The CNN and ViT Fusion Model Based on Hierarchical Adaptive Token Refinement Method in Pneumonia X-ray Image Classification},
booktitle={Proceedings of the 2nd International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI},
year={2025},
pages={504-510},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0014361900004718},
isbn={978-989-758-792-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 2nd International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI
TI - The CNN and ViT Fusion Model Based on Hierarchical Adaptive Token Refinement Method in Pneumonia X-ray Image Classification
SN - 978-989-758-792-4
AU - Zhang P.
PY - 2025
SP - 504
EP - 510
DO - 10.5220/0014361900004718
PB - SciTePress