while also leveraging the Transformer's strength in global modelling to improve the accuracy of the overall prediction.
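To make this fusion concrete, the following PyTorch code is a minimal sketch of this kind of CNN-Transformer hybrid, assuming a ResNet-18 feature extractor and a small Transformer encoder; the dimensions, depth, and class count are illustrative, not the exact configuration of this study.

```python
# Sketch of a CNN-Transformer fusion classifier (illustrative settings):
# a ResNet extracts local feature maps, which are flattened into tokens
# for a small Transformer encoder that models global context.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNTransformerFusion(nn.Module):
    def __init__(self, num_classes=2, embed_dim=512, depth=2, num_heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to the final spatial feature map: (B, 512, H/32, W/32).
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.cnn(x)                        # (B, 512, h, w) local features
        tokens = feats.flatten(2).transpose(1, 2)  # (B, h*w, 512) token sequence
        tokens = self.transformer(tokens)          # global self-attention over tokens
        return self.head(tokens.mean(dim=1))       # mean-pool tokens, then classify

model = CNNTransformerFusion()
logits = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 2)
```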
Although this method performs well on small and medium-scale datasets, several limitations remain, along with directions that merit further exploration:
Firstly, the choice of CNN backbone is still relatively fixed and lacks structural adaptability. Under different tasks or data distributions, the ResNet and plain convolutional backbones used here may not maintain stable performance. Future work could explore more flexible, adjustable backbone structures, such as those discovered through neural architecture search (NAS), to improve generalization.
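As a hypothetical illustration of a more adaptable backbone, the sketch below swaps backbones through a model zoo such as timm, which also exposes NAS-derived architectures; the model names are examples from timm, not choices made in this study.

```python
# Illustrative only: making the CNN backbone configurable rather than fixed.
import timm
import torch

def build_backbone(name: str = "resnet18"):
    # num_classes=0 strips the classifier; the model returns pooled features.
    model = timm.create_model(name, pretrained=False, num_classes=0)
    return model, model.num_features

for name in ["resnet18", "efficientnet_b0", "mobilenetv3_small_100"]:
    backbone, dim = build_backbone(name)
    feats = backbone(torch.randn(1, 3, 224, 224))
    print(name, tuple(feats.shape), dim)  # feature dimension varies per backbone
```

Because each backbone exposes its feature dimension, the Transformer stage can be sized to match whichever architecture a search procedure selects.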
Secondly, because ViT relies heavily on large-scale data, the fusion model remains prone to overfitting and performance fluctuations when data are scarce. In addition, to keep the configuration lightweight, this study adopted a shallow ViT structure; while this reduces computational cost, it may limit expressive power in more complex data environments.
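This depth-versus-cost trade-off can be made concrete by counting encoder parameters; the embedding dimension and head count below are assumed for illustration and are not the settings of this study.

```python
# Rough illustration of how Transformer depth drives parameter count:
# a shallow encoder is cheap but has fewer attention layers available
# to model complex data.
import torch.nn as nn

def encoder_params(depth, embed_dim=384, num_heads=6):
    layer = nn.TransformerEncoderLayer(
        d_model=embed_dim, nhead=num_heads, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=depth)
    return sum(p.numel() for p in encoder.parameters())

for depth in (2, 6, 12):
    print(f"depth={depth:2d}: {encoder_params(depth)/1e6:.1f}M parameters")
```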
Finally, the robustness of the current model still needs improvement when handling large-scale, multi-category, or cross-device data. Future work could strengthen the model's stability and practical value by incorporating transfer learning, domain adaptation, or multimodal information (such as clinical text data).
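As one sketch of the transfer-learning direction (not a method used in this study), the example below fine-tunes only a new classification head on top of a frozen ImageNet-pretrained backbone; the batch, class count, and hyperparameters are placeholders.

```python
# Minimal transfer-learning sketch: freeze a pretrained backbone,
# train only a new classification head.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)  # downloads ImageNet weights
for p in model.parameters():
    p.requires_grad = False                         # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 2)       # new head (trainable)

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch:
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```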