Deep Learning Models for Diabetic Retinopathy Detection: A Review
of CNN and Transformer-Based Approaches
Guanglongyu Huo
Chengdu University of Technology, Erxianqiao Street, Chengdu City, Sichuan Province, China
Keywords: Diabetic Retinopathy, Deep Learning, Convolutional Neural Network, Transformer, Diagnosis.
Abstract: This article reviews the progress of deep learning neural network models in the detection of diabetic
retinopathy (DR) and discusses the significance of these advances for clinical practice. It examines improved
Convolutional Neural Network (CNN) and Transformer-based DR detection models. CNN-based models,
such as Romero-Oraa's framework, bilayered neural networks, and weighted fusion deep learning networks,
have shown promising results in addressing challenges such as variable lighting and image quality. Transformer-based
models, including dual transformer encoder models and self-supervised image transformers, utilize their
distinctive architectures to improve performance. These models improve the accuracy and efficiency of diagnosis,
supporting effective early intervention in DR treatment. In addition, the integration of these
advanced technologies not only streamlines the diagnostic process but also has the potential to relieve the
burden on healthcare systems by providing scalable solutions for large-scale screening, ultimately helping
to improve patient outcomes.
1 INTRODUCTION
Diabetic retinopathy (DR), an important complication
of diabetes, is the leading cause of blindness
worldwide (Solomon, 2017). Wong et al. (2016)
found that DR occurs in about one-third
of people with diabetes, with severe forms including
proliferative DR and diabetic macular edema.
DR is associated with prolonged diabetes, high blood
sugar, and hypertension. While traditionally viewed
as a microvascular disease, it also involves retinal
neurodegeneration. The development of DR is driven
by complex mechanisms related to hyperglycemia,
including genetic factors, free radicals, and
inflammatory mediators. Effective control of blood
glucose and blood pressure is crucial for prevention.
Treatments like anti-VEGF therapy and laser
photocoagulation can help manage vision loss.
Increased public awareness and regular screenings
are essential for improving outcomes and preventing
blindness in DR patients.
As the prevalence of diabetes continues to rise, the
need for effective screening and diagnostic tools
becomes increasingly critical. Traditional methods of
DR detection often rely on manual examination of
retinal images, which can be time-consuming and
subject to human error. In recent years, advances in
deep learning and artificial intelligence have paved
the way for automated systems that can enhance the
accuracy and efficiency of diabetic retinopathy
diagnosis.
This paper reviews various improved
convolutional neural network (CNN) models and
Transformer-based models for the detection and classification
of DR. By utilizing innovative architectures and
techniques, these models are designed to improve
diagnostic performance while addressing challenges
such as class imbalances, new image perspectives,
and the need for generalization across diverse
populations and imaging conditions.
Through a comprehensive analysis of recent
literature, this review emphasizes the progress of
deep learning in detecting diabetic retinopathy and
deliberates on the significance of these advancements
for clinical practice. By summarizing the most
innovative approaches and their performance metrics,
this work aims to provide insights for future research
directions in this critical area of medical imaging.
Ultimately, the goal is to facilitate the development of
more reliable and accessible tools for early detection
and intervention in diabetic retinopathy, thereby
improving patient outcomes and reducing the burden
of this preventable disease.
2 CNN-BASED MODELS
Recent advances in deep learning have led to the
development of various improved CNN-based models
for DR detection, each addressing specific challenges
associated with fundus image analysis. These models
leverage different techniques to enhance accuracy,
efficiency, and robustness in diagnosing DR.
In the field of diabetic retinopathy (DR) detection,
Romero-Oraa et al. (2024) proposed a novel framework
that harnesses deep learning techniques. The framework
is designed to automate the detection and grading of DR,
a crucial step in early diagnosis. By incorporating an
attention mechanism, the model can focus on different
aspects of the retinal images; specifically, it separates
dark structures from bright structures. This focused
attention enhances classification accuracy, since it
enables the model to better distinguish the various
features and patterns associated with DR. The framework
further decomposes the input images, which improves the
visibility of lesions. Additionally, it generates
interpretable attention maps that provide valuable
insight into the model's predictions, allowing clinicians
to better understand how the model arrived at its
conclusions. Verified on the Kaggle DR Detection dataset,
the model achieves an accuracy of 83.7% and a quadratic
weighted Kappa of 0.78, outperforming several
state-of-the-art methods. This performance makes the
framework a valuable diagnostic aid for clinicians in
the early detection and grading of DR.
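As a side note on the evaluation metric, the quadratic weighted Kappa reported above measures agreement between predicted and reference DR grades, penalizing predictions that fall further from the true grade more heavily. A minimal sketch of how it can be computed with scikit-learn follows; the example grades are purely illustrative, not data from the cited study.

```python
# Minimal sketch: quadratic weighted Kappa for DR grading
# (grades 0-4 follow the Kaggle DR scale). Example values are
# illustrative only.
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 2, 1, 0]   # reference DR grades
y_pred = [0, 1, 2, 2, 4, 3, 1, 0]   # model-predicted grades

# weights="quadratic" penalizes larger grade disagreements more,
# matching the QWK metric commonly reported for DR grading.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Quadratic weighted Kappa: {qwk:.3f}")
```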
The bilayered neural network is another notable
approach in DR detection. It employs a feedforward
architecture with two fully connected layers (Islam,
2021). Retinal images often present challenges due to
varying illumination and fields of vision. These
factors can complicate accurate detection of DR. The
bilayered approach is designed to address these
challenges. It allows for enhanced feature extraction
and classification, enabling the model to better
discern subtle differences in DR severity. Through its
unique design, the model is able to learn complex
patterns in fundus images, which is reflected in its
test-set accuracy of 93.33%. The incorporation of
resubstitution validation further optimizes its
performance. As a result, it holds promise as a
solution for automated DR detection in clinical
settings.
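To make the bilayered idea concrete, the following is a minimal PyTorch sketch of a feedforward classifier with two fully connected layers; the input size, hidden width, and five-grade output are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class BilayeredNet(nn.Module):
    """Feedforward classifier with two fully connected layers, in the
    spirit of the bilayered network described above. The dimensions
    and the 5 DR grades are assumptions."""
    def __init__(self, in_features=1024, hidden=128, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),  # first fully connected layer
            nn.ReLU(),
            nn.Linear(hidden, n_classes),    # second fully connected layer
        )

    def forward(self, x):
        return self.net(x)

model = BilayeredNet()
features = torch.randn(8, 1024)   # e.g. flattened fundus-image features
logits = model(features)          # shape: (8, 5)
```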
In contrast, the weighted fusion deep learning
network (WFDLN) tackles the challenges posed by
low-quality fundus images, a common issue in DR
diagnosis (Nneji, 2022). The network processes
dual-channel scans: contrast-limited adaptive
histogram equalization (CLAHE) images and
contrast-enhanced Canny edge detection (CECED)
images. This approach allows it to handle the
complexity of low-quality images more effectively.
WFDLN utilizes fine-tuned Inception V3 and VGG-16
backbones for feature extraction, and this combination
yields impressive performance metrics: on the Messidor
dataset, it achieved an accuracy of 98.5%, a sensitivity
of 98.9%, and a specificity of 98.0%. These results
highlight its effectiveness in accurate, automated
DR classification, demonstrating high accuracy and
robustness while addressing common challenges in
fundus image analysis.
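The dual-channel preprocessing can be sketched with standard OpenCV operations. The snippet below is a rough illustration assuming typical CLAHE and Canny parameters and a simple scalar-weighted fusion of the two backbones' predictions; the actual WFDLN parameters and fusion weights are not reproduced here.

```python
import cv2
import numpy as np

def dual_channel_inputs(fundus_bgr):
    """Build two WFDLN-style input channels from one fundus image:
    a CLAHE-enhanced image and a Canny edge map (CECED-style).
    Parameter values are illustrative assumptions."""
    gray = cv2.cvtColor(fundus_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    clahe_img = clahe.apply(gray)           # CLAHE channel
    edges = cv2.Canny(clahe_img, 50, 150)   # edge channel
    return clahe_img, edges

def weighted_fusion(p_inception, p_vgg, w=0.6):
    """Fuse per-channel predictions from the two backbones
    (fine-tuned Inception V3 and VGG-16 in the original work);
    the weight w is an assumed hyperparameter."""
    return w * p_inception + (1.0 - w) * p_vgg

img = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # stand-in
clahe_img, edges = dual_channel_inputs(img)
fused = weighted_fusion(np.array([0.1, 0.9]), np.array([0.3, 0.7]))
```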
Additionally, a deep learning-based approach has
been developed to predict the risk of developing
referable DR. This model utilizes a substantial dataset
of 156,363 fundus images from the EyePACS
database (Bora, 2021). The model's ability to predict
the risk of developing referable DR is a significant
advancement. When averaging scores from multiple
images, it reaches an AUC value of 0.81. This
indicates the potential for personalized risk
assessments, which can enhance screening strategies
and enable timely interventions.
The BigAug method introduces a novel approach
to generalizing deep learning models for medical
image segmentation across unseen domains by
applying stacked random transformations that improve
robustness against variations in medical imaging data
(Zhang, 2020). This approach is crucial, as medical
imaging data can vary significantly. Evaluations of
the BigAug method demonstrate competitive
performance comparable to fully supervised models.
This significant contribution enhances the
adaptability of deep learning techniques in diverse
clinical settings, particularly in the automatic
processing of fundus images for diabetic retinopathy
grading, thereby improving diagnostic accuracy and
efficiency.
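The stacked-transformation idea can be illustrated as a chain of random image-quality, appearance, and geometry perturbations applied during training. The sketch below uses torchvision transforms with assumed ranges; it is not the exact transformation set of Zhang et al.

```python
import torch
from torchvision import transforms

# BigAug-style stacked random transformations (illustrative ranges).
big_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),       # appearance
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # image quality
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),                  # geometry
])

fundus = torch.rand(3, 224, 224)   # stand-in fundus image tensor
augmented = big_aug(fundus)        # a new random variant on each call
```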
While not exclusively focused on diabetic
retinopathy, the IVGG13 model modifies the VGG16
architecture for pneumonia classification (Jiang, 2021).
This modification showcases how architectural
improvements can enhance training efficiency and
classification performance, and it highlights the
broader applicability of refined architectures in
medical image analysis,
providing insights that can be applied to DR detection
models as well.
Lastly, DiaNet is a dedicated architecture designed
to diagnose diabetes from retinal photographs, using
CNNs to extract features effectively (Islam, 2021).
The experimental results indicate significant accuracy,
establishing DiaNet as a promising non-invasive
diagnostic tool. This emphasizes the potential of
retinal imaging in identifying diabetes-related health
risks, which is relevant in the context of DR because
diabetes is a precursor to DR.
This section details various improved CNN-based
models for diabetic retinopathy detection. Romero-
Oraa's framework uses an attention mechanism and
image decomposition, achieving good accuracy on
the Kaggle dataset. The bilayered neural network
addresses illumination and vision challenges, with
high test set accuracy. WFDLN tackles low-quality
images well, showing strong performance on the
Messidor dataset. A model trained on EyePACS data
predicts the risk of referable DR. The BigAug method
improves generalization across imaging domains. The
IVGG13 model shows the benefits of architectural
refinement, and DiaNet diagnoses diabetes from retinal
images effectively, all contributing to advancements
in DR detection and related medical imaging tasks.
3 TRANSFORMER-BASED
MODELS
Recent progress in architectures based on
transformers has greatly improved the detection and
segmentation of DR by means of diverse innovative
methods. These models leverage the strengths of
transformers, such as their ability to capture
long-range dependencies and contextual information,
which are crucial for accurately analyzing retinal
images.
Yan, Yan and Pei (2023) proposed a novel dual
transformer encoder model specifically designed for
medical image classification. It addresses the fixed
token size of the Vision Transformer (ViT) by utilizing
two encoders with different hidden sizes, enabling the
model to better adapt to various medical image types,
a crucial factor in accurately analyzing retinal images,
which can vary in size and content. The key contribution
of this model lies in its dual transformer encoder
architecture, which enhances the model's ability to
capture long-range dependencies within medical images.
Long-range dependencies are important for understanding
the overall context and relationships within an image,
which is essential for accurate classification. The
model employs an LCA module to integrate features from
all encoder layers; this integration improves
classification performance by combining information
from different levels of the model's processing. With
two transformer encoders of varying hidden sizes, the
architecture enables improved multi-scale feature
extraction and efficient training, and it has
demonstrated superior robustness across multiple
datasets compared with single transformer encoders and
traditional CNNs. This shows its potential for
real-world medical imaging applications, particularly
the detection of diabetic retinopathy.
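A minimal sketch of the dual-encoder idea follows, assuming two standard PyTorch transformer encoders with different hidden sizes and a simple concatenation of mean-pooled features; the original model's LCA-based integration of all encoder layers is more elaborate.

```python
import torch
import torch.nn as nn

class DualTransformerEncoder(nn.Module):
    """Two transformer encoders with different hidden sizes process the
    same image tokens; pooled features are concatenated for
    classification. Dimensions and the mean-pool fusion are
    assumptions, standing in for the LCA module."""
    def __init__(self, in_dim=768, d_small=256, d_large=512, n_classes=5):
        super().__init__()
        self.proj_small = nn.Linear(in_dim, d_small)
        self.proj_large = nn.Linear(in_dim, d_large)
        self.enc_small = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_small, nhead=4,
                                       batch_first=True), num_layers=4)
        self.enc_large = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_large, nhead=8,
                                       batch_first=True), num_layers=4)
        self.head = nn.Linear(d_small + d_large, n_classes)

    def forward(self, tokens):               # tokens: (B, N, in_dim)
        s = self.enc_small(self.proj_small(tokens)).mean(dim=1)
        l = self.enc_large(self.proj_large(tokens)).mean(dim=1)
        return self.head(torch.cat([s, l], dim=-1))

model = DualTransformerEncoder()
patches = torch.randn(2, 196, 768)   # e.g. 14x14 patch embeddings
logits = model(patches)              # shape: (2, 5)
```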
Different from the dual transformer encoder model
with its two encoders of different hidden sizes,
Yang et al. (2024) proposed a method to classify
diabetic retinopathy using a masked autoencoder
(MAE)-enhanced vision transformer (ViT). The model
addresses the challenge of limited labeled data by
using a large dataset of more than 100,000 fundus
images, which are larger than the typical ViT input
size. Its key contribution is the use of
self-supervised learning through the MAE, which
enables the ViT to efficiently capture rich features
from retinal images. This approach improves
performance on referable DR classification, especially
when training data is scarce. By pre-training on
retinal images, the model reduces overfitting and
improves generalization. Compared with traditional
methods, the architecture shows higher classification
accuracy, indicating its practical potential for DR
detection. Overall, the integration of masked
autoencoders with ViT offers a promising avenue for
advancing automated DR diagnosis, especially in
clinical settings where rapid and accurate assessment
is critical.
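The core of MAE-style pre-training is random patch masking: only a small visible subset of patch tokens is passed to the encoder, and the masked patches are later reconstructed. A minimal sketch of the masking step follows, assuming the common 75% mask ratio; the details of Yang et al.'s pipeline may differ.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens
    for the encoder. The 75% ratio follows common MAE practice and is
    an assumption here."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # visible patch indices
    visible = torch.gather(
        tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

tokens = torch.randn(2, 196, 768)   # patch embeddings of a fundus image
visible, keep_idx = random_mask(tokens)
print(visible.shape)                # torch.Size([2, 49, 768])
```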
A prominent method is the Self-Supervised Image
Transformer (SSiT), which utilizes self-supervised
learning techniques guided by saliency maps (Huang,
2022). SSiT can pre-train on large datasets without
requiring a large amount of labeled data, a significant
advantage given how difficult and time-consuming
labeled medical images are to obtain. The model focuses
on salient regions within fundus images and employs
tasks such as saliency-guided contrastive learning
and segmentation prediction to enhance feature
representation. By focusing on these important regions,
the model can better capture the key features and
patterns associated with DR. Evaluated on multiple
publicly available datasets, the model maintains good
performance and accuracy, showing its ability to adapt
to different datasets and platforms, although
challenges related to data requirements, background
complexity, initialization sensitivity, generalization
capability, and the need for domain-specific knowledge
must be addressed to enhance its practical application
in clinical settings.
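The contrastive component of such pre-training can be sketched as an InfoNCE-style objective over embeddings of two augmented views of each fundus image. The snippet below is a generic illustration with an assumed temperature; SSiT's saliency-guided view selection and segmentation-prediction task are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive objective: embeddings of two views of the same image
    (z1[i], z2[i]) are pulled together, other pairs pushed apart.
    The temperature is an assumed hyperparameter."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```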
The Residual Vision Transformers (ResViT) model
uses a generative adversarial framework that combines
convolution operators with vision transformers
(Dalmaz, 2022). This hybrid design employs aggregated
residual transformer blocks to enhance feature
representation while maintaining computational
efficiency, and ResViT excels in medical image
synthesis. Its multimodal capability enables it to
process medical images of different modalities, which
gives it an advantage over SSiT for diverse image
processing tasks. However, the complexity of its
application and its heavy consumption of computing
resources make it difficult to apply to diabetic
retinopathy detection.
Another notable architecture is the Compact
Convolutional Transformer (CCT), which is
specifically designed for efficient DR detection using
low-resolution images (Khan, 2023). By combining
convolutional tokenization with a transformer
backbone, CCT achieved 84.52% accuracy while saving
computational resources compared with ResNet and
other traditional models, reducing training time
without compromising diagnostic accuracy.
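Convolutional tokenization, the distinguishing feature of CCT, replaces the ViT patch-cut with a small convolutional stem whose feature map is flattened into a token sequence. The sketch below assumes illustrative channel counts and kernel sizes.

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """CCT-style convolutional tokenizer: a conv stem produces
    overlapping tokens that preserve local structure in low-resolution
    fundus images. Channel counts and kernel sizes are assumptions."""
    def __init__(self, in_ch=3, embed_dim=128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                       # x: (B, 3, H, W)
        feat = self.stem(x)                     # (B, D, H/4, W/4)
        return feat.flatten(2).transpose(1, 2)  # (B, N, D) tokens

tokens = ConvTokenizer()(torch.rand(2, 3, 64, 64))  # low-res input
print(tokens.shape)                                  # torch.Size([2, 256, 128])
```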
The CNN and MLP Mixed Transformer Model is a
hybrid method that integrates convolutional neural
networks (CNNs) for feature extraction, transformers
to capture global context, and multi-layer perceptrons
(MLPs) for classification (Kumar & Karthikeyan, 2021).
This model addresses class imbalance through a custom
loss function and achieves a notable accuracy of
90.17% in predicting DR severity, highlighting its
effectiveness in managing the complexities of retinal
images.
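The exact custom loss of Kumar and Karthikeyan is not reproduced here, but a common way to counter DR class imbalance is a frequency-weighted cross-entropy, sketched below with assumed, illustrative class counts.

```python
import torch
import torch.nn as nn

# Rarer, more severe grades get larger weights so they contribute more
# to the gradient. The counts below are illustrative assumptions, not
# the custom loss of the cited paper.
class_counts = torch.tensor([25000., 2400., 5300., 870., 700.])  # grades 0-4
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)            # model outputs for a batch
labels = torch.randint(0, 5, (8,))    # ground-truth DR grades
loss = criterion(logits, labels)
```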
Lastly, the RTNet model focuses on enhancing
segmentation accuracy by employing a dual-branch
structure that includes a Global Transformer Block
(GTB) for extracting global features and a Relation
Transformer Block (RTB) for capturing interdependencies
between lesion features and vascular patterns (Huang,
2022). The GTB focuses on extracting global features,
while the RTB emphasizes the relationships between
different lesions and their connections to vascular
structures. This dual approach enables a more thorough
comprehension of spatial relationships in retinal
images, which is essential for accurate segmentation.
RTNet has achieved competitive performance on benchmark
datasets such as IDRiD and DDR, although it faces
challenges associated with dataset limitations and the
requirement for comprehensive pixel-level annotations.
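The relation-modeling idea can be sketched as cross-attention in which lesion features query vascular features, alongside a global self-attention branch. The following is a rough PyTorch illustration with assumed dimensions, not RTNet's actual implementation.

```python
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    """Dual-branch sketch: self-attention over lesion features (global
    branch) plus cross-attention from lesion to vessel features
    (relation branch). Dimensions and wiring are assumptions."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lesion_tokens, vessel_tokens):
        # global branch: self-attention over lesion features
        g, _ = self.self_attn(lesion_tokens, lesion_tokens, lesion_tokens)
        # relation branch: lesion features attend to vascular features
        r, _ = self.cross_attn(lesion_tokens, vessel_tokens, vessel_tokens)
        return g + r

block = RelationBlock()
out = block(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```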
This section is centered around models based on
transformers for the detection and segmentation of
diabetic retinopathy (DR). The dual transformer
encoder model addresses the token size issue in ViT
and has a unique architecture that enhances capturing
long-range dependencies, showing robustness for DR
detection. The Self-Supervised Image Transformer
(SSiT) uses self-supervised learning with saliency
maps, enabling pre-training on large datasets, but
faces challenges for clinical use. The MAE-enhanced
ViT uses a large dataset to address data limits;
self-supervised learning via MAE helps capture rich
features, improving classification, especially with
scarce data, and it shows higher accuracy than
traditional methods, offering promise for DR diagnosis.
Residual Vision Transformers (ResViT) has a hybrid
design for feature representation but is complex and
resource-intensive for DR detection. The Compact
Convolutional Transformer (CCT) is designed for low-resolution
images, achieving good accuracy while saving
resources. The CNN and MLP Mixed Transformer
Model combines different techniques to address class
imbalance and predict DR severity effectively. The
RTNet Model uses a dual-branch structure for better
segmentation accuracy, facing dataset and annotation
challenges. These models each have unique features
and challenges, contributing to the advancement of
DR detection and segmentation.
4 CONCLUSIONS
In conclusion, this review has explored the
advancements in deep learning models for diabetic
retinopathy detection. The improved CNN-based
models, such as Romero-Oraa's framework, bilayered
neural network, and weighted fusion deep learning
network, have demonstrated effectiveness in
addressing challenges related to fundus image
analysis and have achieved good diagnostic accuracy.
Transformer-based models, including the dual
transformer encoder model, the Self-Supervised Image
Transformer, and others, have also shown significant
progress in capturing long-range dependencies and
contextual information to enhance diabetic retinopathy
detection and segmentation.
These models offer promising solutions for
enhancing the accuracy and efficiency of diabetic
retinopathy diagnosis, which is crucial in the face of
the increasing prevalence of diabetes and the need for
early detection and intervention. Nevertheless,
challenges remain, such as data requirements,
background complexity, initialization sensitivity,
generalization capabilities, and the need for domain-
specific knowledge, especially for some transformer-
based models. Future research ought to concentrate
on tackling these challenges in order to further
enhance the practical application of these models in
clinical environments and ultimately make a
contribution to better patient results and a reduction
in the burden of DR.
REFERENCES
Bora, A. et al. (2021) Predicting the risk of developing
diabetic retinopathy using deep learning, The Lancet
Digital Health, 3, pp.10-19. Available at:
https://doi.org/10.1016/S2589-7500(20)30250-8
Dalmaz, O. et al. (2022) ResViT: Residual Vision
Transformers for Multimodal Medical Image Synthesis,
IEEE Transactions on Medical Imaging, 41(10),
pp.2598-2614. Available at:
https://ieeexplore.ieee.org/document/9758823
Huang, S. et al. (2022) RTNet: Relation Transformer
Network for Diabetic Retinopathy Multi-Lesion
Segmentation, IEEE Transactions on Medical Imaging,
41(6), pp.1596-1607. Available at:
https://ieeexplore.ieee.org/document/9684442
Huang, Y. et al. (2022) SSiT: Saliency-guided Self-
supervised Image Transformer for Diabetic Retinopathy
Grading. Available at: https://arxiv.org/abs/2210.10969
Islam, M. T. et al. (2021) DiaNet: A Deep Learning Based
Architecture to Diagnose Diabetes Using Retinal
Images Only, IEEE Access, 9, pp.15686-15695.
Available at:
https://ieeexplore.ieee.org/document/9328261
Jiang, Z. et al. (2021) An Improved VGG16 Model for
Pneumonia Image Classification, Applied Sciences, 11,
pp.1-19. Available at:
https://doi.org/10.3390/app112311185
Khan, I. U. et al. (2023) A Computer-Aided Diagnostic
System to Identify Diabetic Retinopathy, Utilizing a
Modified Compact Convolutional Transformer and
Low-Resolution Images to Reduce Computation Time,
Biomedicines, 11. Available at:
https://doi.org/10.3390/biomedicines11061566
Kumar, N. and Karthikeyan, R. (2021) Diabetic
Retinopathy Detection using CNN, Transformer and
MLP based Architectures, International Symposium on
Intelligent Signal Processing and Communication
Systems. DOI: 10.1109/ISPACS51563.2021.965102
Nneji, G. U. et al. (2022) Identification of Diabetic
Retinopathy Using Weighted Fusion Deep Learning
Based on Dual-Channel Fundus Scans, Diagnostics, 12.
Available at:
https://doi.org/10.3390/diagnostics12020540
Romero-Oraa, R. et al. (2024) Attention-based deep
learning framework for automatic fundus image
processing to aid in diabetic retinopathy grading,
Computer Methods and Programs in Biomedicine, 249.
Available at:
https://doi.org/10.1016/j.cmpb.2024.108160
Solomon, D. S. et al. (2017) Diabetic Retinopathy: A
Position Statement by the American Diabetes
Association, Diabetes Care, 40, pp.412-418.
DOI: 10.2337/dc16-264
Wong, T. et al. (2016) Diabetic retinopathy, Nat Rev Dis
Primers, 2. Available at:
https://doi.org/10.1038/nrdp.2016.12
Yan, F., Yan, B. and Pei, M. (2023) Dual Transformer
Encoder Model for Medical Image Classification,
International Conference on Image Processing. DOI:
10.1109/ICIP49359.2023.10222303
Yang, Y. et al. (2024) Vision transformer with masked
autoencoders for referable diabetic retinopathy
classification based on large-size retina image, PLoS
ONE, 19(3). Available at:
https://doi.org/10.1371/journal.pone.0299265
Zhang, L. et al. (2020) Generalizing Deep Learning for
Medical Image Segmentation to Unseen Domains via
Deep Stacked Transformation, IEEE Transactions on
Medical Imaging, 39(7), pp.2531-2540. Available at:
https://ieeexplore.ieee.org/abstract/document/8995481