residual connection (GCAR) for sign language recognition. Their approach combines spatial-temporal contextual learning with both joint-skeleton and motion information. The GCAR model, enhanced with a channel attention module, demonstrated high performance on large-scale datasets, including WLASL (90.31% Top-10 accuracy) and ASLLVD (34.41% accuracy). While the method shows efficiency and generalizability, the authors note computational challenges when scaling to larger datasets.
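To make the channel attention component concrete, the following minimal sketch shows a squeeze-and-excitation-style channel attention block of the kind commonly inserted into graph convolutional pipelines; the reduction ratio, layer sizes, and tensor layout are illustrative assumptions, not the authors' exact GCAR configuration.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        # Squeeze-and-excitation-style channel attention (illustrative
        # sketch, not the exact GCAR module).
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global context per channel
            self.fc = nn.Sequential(              # excitation: per-channel weights
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, T, V), e.g. time x joints for skeleton features
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                          # reweight channels before the residual add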
In their study, Abu Saleh Musa Miah and colleagues (Miah et al., 2024) present the GmTC
model for hand gesture recognition in multi-cultural
sign languages (McSL). This end-to-end system
combines graph-based features with general deep
learning through dual streams. A Graph
Convolutional Network (GCN) extracts distance-
based relationships among superpixels, while
attention-based features are produced by a Multi-Head Self-Attention (MHSA) module combined with a CNN. By
merging these features, the model improves
generalizability across diverse cultural datasets,
including Korean, Bangla, and Japanese Sign
Languages. Evaluations on five datasets show
superior accuracy compared to state-of-the-art
systems. The research also identifies challenges such
as computational complexity and fixed patch sizes in
image segmentation.
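As a rough illustration of such dual-stream fusion, the sketch below concatenates a pooled graph-stream feature with an attention-pooled CNN feature before classification; the backbone, dimensions, and classification head are hypothetical placeholders, not the GmTC architecture itself.

    import torch
    import torch.nn as nn

    class DualStreamFusion(nn.Module):
        # Illustrative two-stream fusion: graph features + MHSA/CNN features
        # (a generic stand-in, not the GmTC model).
        def __init__(self, graph_dim: int, img_channels: int = 3,
                     embed_dim: int = 128, num_classes: int = 30):
            super().__init__()
            # Stream 2: a small CNN producing a sequence of patch tokens
            self.cnn = nn.Sequential(
                nn.Conv2d(img_channels, embed_dim, kernel_size=7, stride=4, padding=3),
                nn.ReLU(inplace=True),
            )
            self.mhsa = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
            self.head = nn.Linear(graph_dim + embed_dim, num_classes)

        def forward(self, graph_feat: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
            # graph_feat: (B, graph_dim) pooled GCN output; image: (B, C, H, W)
            fmap = self.cnn(image)                        # (B, E, H', W')
            tokens = fmap.flatten(2).transpose(1, 2)      # (B, H'*W', E) patch tokens
            attn, _ = self.mhsa(tokens, tokens, tokens)   # self-attention over patches
            img_feat = attn.mean(dim=1)                   # pool attended tokens
            return self.head(torch.cat([graph_feat, img_feat], dim=1))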
Hamzah Luqman (Luqman, 2022) proposes a two-
stream network for isolated sign language
recognition, emphasizing accumulative video motion.
The method employs a Dynamic Motion Network
(DMN) for spatiotemporal feature extraction and an
Accumulative Motion Network (AMN) to encode
motion into a single frame. A Sign Recognition
Network (SRN) fuses and classifies features from
both streams. This approach addresses variations in
dynamic gestures and enhances recognition accuracy
in signer-independent scenarios. The model was
tested on Arabic and Argentinian sign language
datasets, achieving significant performance
improvements over existing techniques.
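The idea of collapsing a clip's motion into a single frame can be sketched generically as accumulated inter-frame differences; the function below is a simplified stand-in for the AMN encoding, with grayscale input and min-max normalization as assumptions.

    import numpy as np

    def accumulative_motion_image(frames: np.ndarray) -> np.ndarray:
        # frames: (T, H, W) grayscale clip with values in [0, 1].
        # Returns an (H, W) image where bright pixels moved often; a generic
        # illustration of accumulative motion encoding, not Luqman's exact AMN.
        diffs = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) inter-frame motion
        motion = diffs.sum(axis=0)                # accumulate over time
        return motion / (motion.max() + 1e-8)     # normalize to [0, 1]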
Giray Sercan Özcan et al. (Özcan et al., 2024)
investigate Zero-Shot Sign Language Recognition
(ZSSLR) by modeling hand and pose-based features.
Their framework utilizes ResNeXt and MViTv2 for
spatial feature extraction, ST-GCN for spatial-
temporal relationships, and CLIP for semantic
embedding. The method maps visual representations
to unseen textual class descriptions, enabling
recognition of previously unencountered classes.
Evaluated on benchmark ZSSLR datasets, the
approach demonstrates substantial improvements in
accuracy, setting a new standard for addressing
insufficient training data in sign language recognition.
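The core zero-shot step, matching a visual representation to textual class descriptions in a shared embedding space, reduces to a nearest-neighbor search over cosine similarities. The sketch below assumes the visual feature has already been projected into the joint space and that CLIP text embeddings are precomputed; it is a generic illustration, not the authors' full pipeline.

    import numpy as np

    def zero_shot_classify(visual_feat: np.ndarray, text_embeds: np.ndarray) -> int:
        # visual_feat: (D,) video representation projected into the joint
        #              embedding space (e.g. via a learned linear map).
        # text_embeds: (K, D) CLIP embeddings of K textual class descriptions.
        # Returns the index of the best-matching (possibly unseen) class.
        v = visual_feat / np.linalg.norm(visual_feat)
        t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
        return int(np.argmax(t @ v))   # cosine similarity, argmax over classes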
Jungpil Shin et al. (Shin et al., 2024) created an
innovative Korean Sign Language (KSL) recognition
system that combines handcrafted and deep learning
features to identify KSL alphabets. Their approach
utilized two streams: one based on skeleton data to
extract geometric features such as joint distances and
angles, and another employing a ResNet101
architecture to capture pixel-based representations.
The system merged these features and processed them
through a classification module, achieving high
recognition accuracy across newly developed KSL
alphabet datasets and established benchmarks like
ArSL and ASL. The researchers also contributed a
new KSL alphabet dataset featuring diverse
backgrounds, addressing limitations in existing
datasets. However, the authors identified the model's dependence on substantial computational resources, along with the need for further testing on larger datasets, as areas for improvement.
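A minimal example of the handcrafted skeleton stream: given 2-D hand landmarks, compute pairwise joint distances and segment angles. The 21-landmark setting and the specific feature set are illustrative assumptions; Shin et al.'s exact geometric features may differ.

    import numpy as np

    def skeleton_features(joints: np.ndarray) -> np.ndarray:
        # joints: (N, 2) array of N keypoints, e.g. 21 hand landmarks.
        # Returns pairwise joint distances plus the orientation angle of each
        # consecutive-joint segment, in the spirit of distance/angle features.
        n = joints.shape[0]
        diffs = joints[:, None, :] - joints[None, :, :]   # (N, N, 2)
        dists = np.linalg.norm(diffs, axis=-1)            # pairwise distances
        iu = np.triu_indices(n, k=1)                      # keep upper triangle only
        seg = np.diff(joints, axis=0)                     # (N-1, 2) bone segments
        angles = np.arctan2(seg[:, 1], seg[:, 0])         # segment orientations
        return np.concatenate([dists[iu], angles])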
Candy Obdulia Sosa-Jiménez et al. (Jiménez et al., 2022) developed a two-way translator system for
Mexican Sign Language (MSL) specifically designed
for primary healthcare settings. The system combines
sign recognition using Microsoft Kinect sensors and
hidden Markov models with MSL synthesis via a
signing avatar for real-time communication. It can
recognize 31 static and 51 dynamic signs, providing a
specialized vocabulary for medical consultations. The
research demonstrated the system's efficacy in
facilitating communication between deaf patients and
hearing doctors, with average accuracy and F1 scores
of 99% and 88%, respectively. Although innovative,
the system's reliance on specific hardware (Kinect)
could restrict its scalability and widespread
implementation.
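Recognition with hidden Markov models of this kind typically trains one HMM per sign and classifies a new sequence by maximum log-likelihood. A minimal sketch using the hmmlearn library follows; the 5-state Gaussian setup and the feature format are assumptions, not the system's actual parameters.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_sign_hmms(train_data: dict[str, list[np.ndarray]]) -> dict[str, GaussianHMM]:
        # train_data maps each sign label to a list of (T_i, D) feature
        # sequences (e.g. Kinect joint trajectories). Hypothetical setup.
        models = {}
        for sign, seqs in train_data.items():
            X = np.vstack(seqs)                # stack all training sequences
            lengths = [len(s) for s in seqs]   # per-sequence lengths for fit()
            m = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
            m.fit(X, lengths)
            models[sign] = m
        return models

    def classify(models: dict[str, GaussianHMM], seq: np.ndarray) -> str:
        # Pick the sign whose HMM assigns the observation sequence the
        # highest log-likelihood.
        return max(models, key=lambda sign: models[sign].score(seq))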
Zinah Raad Saeed et al. (Saeed et al., 2022)
performed a comprehensive review of sensory glove
systems for sign language pattern recognition,
examining studies from 2017 to 2022. They
emphasized the benefits of glove-based techniques,
such as high recognition accuracy and functionality in
low-light environments, while also noting challenges
including user comfort, cost, and limited datasets. The
review classified motivations, challenges, and
recommendations, stressing the importance of
developing scalable, affordable, and comfortable
designs. Despite advancements, the study identified
gaps in handling dynamic gestures and incorporating
non-manual signs like facial expressions, outlining a
direction for future research to address these
limitations.