and segmentation prediction to enhance feature
representation. By focusing on these important
regions, the model can better understand the key
features and patterns associated with DR. When
evaluated on publicly available datasets across
multiple platforms, the model maintains good
performance and accuracy, demonstrating its ability
to adapt to different datasets and platforms.
However, challenges related to data requirements,
background complexity, initialization sensitivity,
generalization capability, and the need for
domain-specific knowledge must be addressed to
enhance its practical application in clinical settings.
The Residual Vision Transformer (ResViT) model
uses a generative adversarial framework that
combines convolution operators with vision
transformers (Dalmaz, 2022). This hybrid design
employs aggregated residual transformer blocks to
enhance feature representation while maintaining
computational efficiency. ResViT excels in medical
image synthesis, and its multi-modal capability
enables it to process medical images of different
modalities, making it more versatile than SSiT
across image-processing tasks. However, its
architectural complexity and heavy demand for
computing resources make it difficult to apply to
diabetic retinopathy detection.
Another notable architecture is the Compact
Convolutional Transformer (CCT), which is
specifically designed for efficient DR detection on
low-resolution images (Khan, 2023). By combining
convolutional tokenization with a transformer
backbone, CCT achieved 84.52% accuracy while
consuming fewer computational resources than
ResNet and other traditional models, reducing
training time without compromising diagnostic
accuracy.
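The convolutional tokenization that CCT pairs with its transformer backbone can be illustrated with a minimal numpy sketch: a single strided convolution with ReLU and max-pooling turns a low-resolution image into a short token sequence. The kernel size, stride, and embedding dimension below are illustrative assumptions, not the configuration reported by Khan (2023).

```python
import numpy as np

def conv_tokenize(img, weights, stride=2, pool=2):
    """Slide a conv kernel over the image, apply ReLU, max-pool,
    then flatten the spatial grid into a token sequence.
    img: (H, W, C); weights: (k, k, C, d)."""
    k, _, _, d = weights.shape
    H, W, _ = img.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    fmap = np.zeros((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k, :]
            # contract the (k, k, C) patch against each of the d filters
            fmap[i, j] = np.maximum(0.0, np.tensordot(patch, weights, axes=3))
    # non-overlapping max-pool over the spatial grid
    ph, pw = out_h // pool, out_w // pool
    fmap = fmap[:ph*pool, :pw*pool].reshape(ph, pool, pw, pool, d).max(axis=(1, 3))
    return fmap.reshape(-1, d)  # (n_tokens, d) sequence for the transformer

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64, 3))        # a low-resolution fundus crop
w = rng.standard_normal((3, 3, 3, 16)) * 0.1  # hypothetical 3x3 conv, 16 channels
tokens = conv_tokenize(img, w)
print(tokens.shape)  # a short token sequence instead of many raw pixels
```

Because the tokenizer downsamples before the transformer sees anything, the sequence length (and hence the quadratic attention cost) stays small, which is the source of CCT's resource savings on low-resolution inputs.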
The CNN and MLP Mixed Transformer Model is a
hybrid method that integrates Convolutional Neural
Networks (CNNs) to extract features, employs
transformers to capture the global context, and uses
Multi-Layer Perceptrons (MLPs) for classification
(Kumar & Karthikeyan, 2021). This model addresses
class imbalance through a custom loss function and
achieves a notable accuracy of 90.17% in predicting
DR severity, highlighting its effectiveness in
managing the complexities of retinal images.
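One common way a custom loss can counteract class imbalance is to weight each sample's cross-entropy by the rarity of its true class. The sketch below is a generic class-weighted cross-entropy in numpy, offered as an illustration of the idea; the actual loss used by Kumar and Karthikeyan (2021) is not specified here, and the weight values are invented.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each sample's loss is scaled by the
    weight of its true class, up-weighting rare DR grades."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float((class_weights[labels] * per_sample).mean())

# 5 DR severity grades; rarer, more severe grades get larger weights
# (illustrative values only)
weights = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],   # confident grade-0 prediction
                   [0.1, 0.1, 0.1, 0.1, 2.0]])  # confident grade-4 prediction
labels = np.array([0, 4])
loss = weighted_cross_entropy(logits, labels, weights)
print(round(loss, 4))
```

With equal per-sample errors, the rare grade-4 case contributes five times as much gradient signal as the common grade-0 case, pushing the model to take minority grades seriously.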
Lastly, the RTNet model focuses on enhancing
segmentation accuracy by employing a dual-branch
structure (Huang, 2022): a Global Transformer
Block (GTB) extracts global features, while a
Relation Transformer Block (RTB) captures the
interdependencies between lesion features and
vascular patterns. This dual approach enables a
more thorough comprehension of the spatial
relationships in retinal images, which is essential
for accurate segmentation. RTNet has achieved
competitive performance on benchmark datasets
such as IDRiD and DDR, although it faces
challenges associated with dataset limitations and
the requirement for comprehensive pixel-level
annotations.
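The dual-branch idea can be sketched with two attention paths: self-attention over lesion tokens for the global branch, and cross-attention from lesion tokens to vessel tokens for the relation branch. This is a simplified numpy illustration under stated assumptions (single head, fusion by summation, random feature tokens); RTNet's actual blocks are more elaborate.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def dual_branch(lesion_tokens, vessel_tokens):
    """Global branch: self-attention over lesion tokens.
    Relation branch: cross-attention from lesions to vessels.
    Fuse by summation (a simplification of RTNet's design)."""
    global_feat = attention(lesion_tokens, lesion_tokens, lesion_tokens)
    relation_feat = attention(lesion_tokens, vessel_tokens, vessel_tokens)
    return global_feat + relation_feat

rng = np.random.default_rng(1)
lesions = rng.standard_normal((8, 32))   # hypothetical lesion feature tokens
vessels = rng.standard_normal((12, 32))  # hypothetical vessel feature tokens
fused = dual_branch(lesions, vessels)
print(fused.shape)
```

The relation branch is what lets each lesion token attend to vascular context, encoding the lesion-vessel interdependencies that the global branch alone would miss.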
This section has focused on transformer-based
models for the detection and segmentation of
diabetic retinopathy (DR). The dual transformer
encoder model addresses the token size issue in ViT
and has a unique architecture that enhances capturing
long-range dependencies, showing robustness for DR
detection. The Self-Supervised Image Transformer
(SSiT) uses self-supervised learning with saliency
maps, enabling pre-training on large datasets, but
faces challenges for clinical use. The MAE-enhanced
ViT leverages a large dataset to address data
limitations: self-supervised learning via masked
autoencoding helps it capture rich features,
improving classification, especially when labeled
data are scarce.
It shows higher accuracy than traditional methods,
offering promise for DR diagnosis. The Residual
Vision Transformer (ResViT) has a hybrid design
for feature representation but is complex and
resource-intensive for DR detection. The Compact
Convolutional Transformer (CCT) is designed for
low-resolution
images, achieving good accuracy while saving
resources. The CNN and MLP Mixed Transformer
Model combines different techniques to address class
imbalance and predict DR severity effectively. The
RTNet Model uses a dual-branch structure for better
segmentation accuracy, facing dataset and annotation
challenges. These models each have unique features
and challenges, contributing to the advancement of
DR detection and segmentation.
4 CONCLUSIONS
In conclusion, this review has explored the
advancements in deep learning models for diabetic
retinopathy detection. The improved CNN-based
models, such as Romero-Oraa's framework, bilayered
neural network, and weighted fusion deep learning
network, have demonstrated effectiveness in
addressing challenges related to fundus image
analysis and have achieved good diagnostic accuracy.
Transformer-based models, including the dual
transformer encoder model, the Self-Supervised
Image Transformer, and others, have also shown
significant progress in capturing long-range
dependencies and