
to the limited availability of large-scale datasets, vari-
ability in hand gestures, and the need for real-time
processing capabilities. The contributions of this pa-
per are as follows:
• Utilizing YOLOv5 for Arabic Sign Language
(ArSL) Recognition: This study explores the
application of the YOLOv5 model for detecting
and classifying 28 Arabic sign language alphabet
gestures, addressing the challenge of recognizing
gestures within a small dataset.
• Real-Time Model Implementation: The pro-
posed approach emphasizes real-time detection
and classification capabilities, making it suitable
for practical applications requiring instant recog-
nition.
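As a concrete illustration of the post-processing that underpins YOLO-style real-time detection, the sketch below implements the two core geometric operations, Intersection over Union and greedy non-maximum suppression, in plain Python. This is a generic sketch, not this paper's implementation; the 0.45 IoU threshold is a common default rather than a setting reported here.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: visit boxes in descending score
    order and keep each one only if it overlaps no kept box above
    iou_thresh, so each gesture yields a single surviving detection."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In a real-time loop, this suppression step runs on every frame after the network's forward pass, which is why its cost must stay negligible relative to inference.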
The paper is organized as follows: Section 2 presents
a literature review of advances in ArSL recogni-
tion, Section 3 details methodology and implemen-
tation, Section 4 discusses results and comparisons
with state-of-the-art methods, and Section 5 con-
cludes with future research directions.
2 LITERATURE REVIEW
Early efforts in ArSL recognition primarily relied
on classical machine learning techniques, emphasiz-
ing image processing and feature extraction for ges-
ture classification. (Aly and Mohammed, 2014) de-
veloped an ArSL recognition system using Local
Binary Patterns on Three Orthogonal Planes (LBP-
TOP) and SVM, with preprocessing steps such as
segmenting the hand and face via color-space conver-
sion of the RGB input. Similarly, (Tharwat
et al., 2021) proposed a system focusing on
28 Quranic dashed letters, employing classifiers such
as K-Nearest Neighbor (KNN), Multilayer Perceptron
(MLP), C4.5, and Naïve Bayes. Their approach uti-
lized a dataset of 9240 images captured under vary-
ing conditions and achieved a recognition accuracy of
99.5% for 14 letters using KNN. While these methods
demonstrated reasonable accuracy, they were con-
strained by limited scalability and the lack of real-
time implementation capabilities.
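The classification stage of such classical pipelines can be sketched as a plain k-nearest-neighbor vote over extracted feature vectors. The feature-extraction step (e.g. LBP-TOP) is omitted here, and the sample data and value of k are illustrative, not taken from the cited works.

```python
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    samples, where `train` is a list of (feature_vector, label) pairs
    and distance is plain Euclidean distance in feature space."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```

With real gesture data, the feature vectors would be the histogram descriptors produced by the chosen extractor rather than raw coordinates.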
Researchers have increasingly adopted advanced
deep learning techniques for sign language recogni-
tion across various languages. For instance, (Tao
et al., 2018) utilized CNNs to address ASL recog-
nition, highlighting CNNs’ ability to effectively cap-
ture sign gestures. Similarly, (Suliman et al., 2021)
proposed a method for ArSL recognition, combin-
ing CNNs for feature extraction and Long Short-Term
Memory (LSTM) networks for classification. Their
approach employed the AlexNet architecture to ex-
tract deep features from input images and utilized
LSTMs to maintain the temporal structure of video
frames. The system achieved an overall recognition
accuracy of 95.9% in signer-dependent scenarios and
43.62% in signer-independent scenarios.
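The role of the LSTM in such a hybrid can be made concrete with a toy recurrence: the sketch below implements a single scalar LSTM cell in plain Python, showing how the cell state carries information across video frames. The weights and the scalar per-frame feature are illustrative stand-ins for the AlexNet embeddings the cited work feeds into its LSTM; this is not their implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step over a scalar feature. W maps each gate name to
    (w_x, w_h, b); the cell state c is what preserves temporal context
    from earlier frames of the sign gesture."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])  # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])  # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])  # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run_sequence(xs, W):
    """Feed per-frame features through the recurrence, starting from a
    zero state, and return the final hidden state."""
    h = c = 0.0
    for x in xs:
        h, c = lstm_cell(x, h, c, W)
    return h
```

In the full system a classifier head would map the final hidden state to one of the sign classes.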
Pretrained models are widely used in sign lan-
guage recognition to leverage knowledge learned from
large datasets. (Duwairi and Halloush, 2022) em-
ployed VGGNet, achieving 97% accuracy on the
ArSL2018 dataset, demonstrating the efficacy of pre-
trained architectures. (Zakariah et al., 2022) explored
the use of EfficientNetB4 on the ArSL2018 dataset,
achieving a training accuracy of 98% and a testing
accuracy of 95%. Their work incorporated extensive
preprocessing and data augmentation to enhance con-
sistency and balance within the dataset.
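Such augmentation steps can be sketched in a few lines; the two transforms below, a horizontal flip and a brightness shift over a pixel grid, are generic examples rather than the specific augmentations used in the cited study.

```python
def hflip(img):
    """Horizontally flip an image given as a list of pixel rows."""
    return [row[::-1] for row in img]

def adjust_brightness(img, delta):
    """Shift every pixel value by delta, clamped to the 8-bit 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]
```

Note that for sign language a horizontal flip effectively swaps handedness, so whether it is a valid augmentation depends on the gesture set.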
In addition, pre-trained YOLO-based approaches
have achieved remarkable results. (Ningsih et al.,
2024) applied YOLOv5-NAS-S to BISINDO sign
language, achieving a mAP of 97.2% and Recall
of 99.6%. (Al Ahmadi et al., 2024) introduced
attention mechanisms within YOLO for ArSL de-
tection, achieving a mAP@0.5 of 0.9909. Simi-
larly, (Alaftekin et al., 2024) utilized an optimized
YOLOv4-CSP algorithm for real-time recognition of
Turkish Sign Language, achieving over 98% preci-
sion and recall, further demonstrating YOLO’s effi-
cacy in high-speed and accurate sign language detec-
tion tasks.
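The precision, recall, and mAP figures quoted above follow the standard object-detection definitions, which the sketch below computes in plain Python. This is a generic all-point formulation, not the evaluation code of any cited work; a true positive is a detection whose IoU with an unmatched ground-truth box exceeds the threshold (0.5 for mAP@0.5).

```python
def precision_recall_f1(tp, fp, fn):
    """Detection metrics from true-positive, false-positive and
    false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranked_hits, n_gt):
    """AP for one class: detections are sorted by confidence and
    ranked_hits[i] is True when detection i matches an unmatched
    ground-truth box (n_gt boxes in total). mAP is the mean of AP
    over all classes."""
    tp, ap = 0, 0.0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / n_gt if n_gt else 0.0
```

mAP@50-95 extends this by averaging AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.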
A significant limitation in ArSL research remains
the lack of standardized datasets (see Table 1). Most
studies rely on custom datasets with isolated signs,
such as ArSL2018, which is insufficient for compre-
hensive, continuous sign recognition (Al-Shamayleh
et al., 2020).
3 METHODOLOGY AND
IMPLEMENTATION
This section outlines the workflow of training and
evaluating the YOLOv5 model for ArSL recogni-
tion, as illustrated in Figure 1. The dataset is di-
vided into training, validation, and test sets. The
training and validation sets are utilized to train the
YOLOv5 model over 400 epochs, during which hy-
perparameters are fine-tuned to achieve optimal per-
formance. Following the completion of the training
process, the trained model is evaluated using the test
set based on evaluation metrics such as Accuracy,
Precision, Recall, F1 Score, Mean Average Precision
(mAP), mAP@50, mAP@50-95, Intersection over
Union (IoU), Logarithmic Loss, Confusion Matrix,
KDIR 2025 - 17th International Conference on Knowledge Discovery and Information Retrieval