
recognition, even in uncontrolled settings, that is private, fast, and scalable, providing the real-world boost needed to allow those who are Deaf to communicate better in more dynamic and varied environments."
2 PROBLEM STATEMENT
Although there has been tremendous progress in the area of sign language recognition systems, current technologies still face a number of fundamental challenges that prevent their practical deployment. Classical approaches mainly address isolated hand gestures or fingerspelled alphabets and do not capture the dynamic and continuous nature of actual sign language, which includes hand movements, facial expressions, and other contextually conditioned information. Furthermore, many of these systems require high computational power, making them unsuitable for on-device, real-time use, for instance on mobile or edge devices. Moreover, problems of background noise, variation in lighting conditions, occlusion of the hands, and limited vocabulary coverage persist, compromising the robustness and accuracy of these systems under everyday conditions.
These limitations prevent real-world deployment of sign language recognition systems as a better communication aid for Deaf people. A practical sign language recognition system must be robust, real-time, and inclusive: it must generalize to varied and complex environments, provide high accuracy with reliable posture tracking, and run on lightweight edge devices for continuous hand gesture recognition. Our contribution helps fill these gaps by presenting a deep learning-based method that fuses multiple modalities in real time and preserves privacy through an on-device implementation.
3 LITERATURE SURVEY
Although recent years have brought great improvements in sign language recognition (SLR) using deep learning, many works remain constrained in terms of coverage, robustness, and real-time operation. Several studies have investigated Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for static hand gesture classification and achieved good results. For example, Bankar et al. (2022) implemented a CNN model for recognizing alphabet-level signs, which achieves good accuracy on small datasets but does not support continuous or dynamic signs. Similarly, Aggarwal and Kaur (2023) focused on real-time CNN-based recognition, but their approach showed degraded recognition performance across varied environmental settings.
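As a rough illustration of this family of approaches, the following sketch shows a compact CNN for static, alphabet-level sign classification; the layer sizes, the 64x64 grayscale input, and the 24-class output are our own illustrative assumptions rather than the configuration of any cited model.

import torch
import torch.nn as nn

class StaticSignCNN(nn.Module):
    """Minimal CNN sketch for classifying single-frame hand crops."""
    def __init__(self, num_classes: int = 24):  # e.g. static fingerspelled letters
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 64, 64) grayscale hand crops
        return self.classifier(self.features(x))

logits = StaticSignCNN()(torch.randn(8, 1, 64, 64))  # shape (8, 24)

Such a model classifies each frame independently, which is precisely why it cannot represent continuous or dynamic signing.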
To better capture gesture dynamics, Jiang et al. (2021) introduced skeleton-aware multi-modal models that combine hand and body pose, and Chakraborty and Banerjee (2021) leveraged 3D CNNs to process spatiotemporal features. Nevertheless, such models can require substantial computational resources, making them infeasible on mobile or embedded end devices. Wu et al. (2021) attempted to address this with a hybrid CNN-HMM architecture, but the sequential nature of HMMs can jeopardize real-time inference.
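To make the spatiotemporal idea concrete, the sketch below applies 3D convolutions over short sign clips; the clip length, resolution, and channel widths are illustrative assumptions, not the cited authors' architectures.

import torch
import torch.nn as nn

class Clip3DCNN(nn.Module):
    """Minimal 3D-CNN sketch: kernels span time as well as space."""
    def __init__(self, num_classes: int = 100):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),            # downsample space, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),            # pool over time and space
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels=3, frames, height, width)
        return self.head(self.backbone(clip).flatten(1))

logits = Clip3DCNN()(torch.randn(2, 3, 16, 112, 112))  # shape (2, 100)

Because every 3D kernel convolves across frames as well as pixels, the compute and memory cost grows quickly with clip length, which is the main obstacle to mobile or embedded deployment noted above.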
For static gesture recognition with efficient pipelines, Tayade and Patil (2022) employed MediaPipe together with classical classifiers; their system performed well under controlled conditions but deteriorated in dynamic scenarios. Additionally, Sun et al. (2021) provide a detailed survey that highlights the absence of scalable, multimodal, and privacy-preserving SLR systems.
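A minimal sketch of such a landmark-plus-classical-classifier pipeline is given below; the 63-dimensional feature vector (21 MediaPipe hand landmarks times x, y, z) and the random-forest classifier are illustrative assumptions, not Tayade and Patil's exact configuration.

import cv2
import mediapipe as mp
import numpy as np
from sklearn.ensemble import RandomForestClassifier

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def hand_features(bgr_image):
    """Return a 63-dim landmark vector, or None if no hand is detected."""
    result = hands.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    landmarks = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in landmarks]).flatten()

# Training on pre-extracted features: X has shape (n_samples, 63),
# y holds the corresponding sign labels.
# clf = RandomForestClassifier(n_estimators=200).fit(X, y)
# pred = clf.predict(hand_features(frame).reshape(1, -1))

Because the classifier sees one landmark vector per frame, the pipeline stays lightweight, but, as noted above, it struggles once the sign depends on motion across frames.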
More recently, Hu et al. (2023) proposed SignBERT+, a hand-model-aware transformer architecture for better sign comprehension, and Zhou et al. (2021) addressed iterative alignment methods for continuous SLR. However, despite their high accuracy, these models are computationally expensive.
To address hardware challenges and scalability, Alsharif et al. (2024) and Abdellatif and Abdelghafar (2025) investigated lightweight models and transfer learning methods for deployment on edge devices. However, such systems mostly do not integrate multimodal fusion and target letter- or word-level recognition rather than full phrasal or conversational signing.
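As an illustration of this direction, the sketch below fine-tunes a lightweight pretrained backbone for a small sign vocabulary; MobileNetV2, the frozen feature extractor, and the 26-class head are our own illustrative assumptions rather than the cited authors' setups.

import torch
import torch.nn as nn
from torchvision import models

def build_edge_sign_classifier(num_classes: int = 26) -> nn.Module:
    # Start from ImageNet-pretrained weights and retrain only the head.
    backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
    for p in backbone.features.parameters():
        p.requires_grad = False               # freeze pretrained features
    backbone.classifier[1] = nn.Linear(backbone.last_channel, num_classes)
    return backbone

model = build_edge_sign_classifier()
logits = model(torch.randn(1, 3, 224, 224))   # shape (1, 26)

Freezing the backbone keeps training cheap and the exported model small, but the resulting classifier still operates on single frames and a closed letter- or word-level vocabulary, which is exactly the limitation noted above.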
Together, these studies show that while current deep learning architectures form a solid foundation, there is still a need for a robust, real-time, edge-enabled, multi-modal sign language recognition system that works effectively in uncontrolled real-world scenarios.