architectures such as Convolutional Neural Networks
(CNNs) and Long Short-Term Memory (LSTM)
models have made it possible to translate gestures
into text with near-human accuracy, approximating
spoken language semantics (Sun et al.; Wang et al.).
One notable system introduced at the UNI-TEAS
2024 conference translates spoken input to text and
then into International Sign Language (ISL). It
operates offline and predicts signs or phrases based
on a real-time video feed, achieving ~100% accuracy
in sign recognition and ~96% for phrase structures
(Goyal & Singh; Wang et al.).
A significant contribution to Indian Sign
Language research utilized MobileNetV2 with
transfer learning for accurate ISL gesture recognition,
aiming to improve accessibility for the hearing-
impaired in India (Karishma & Singh; Goyal &
Singh). Additionally, CNNs have been combined
with Generative Adversarial Networks (GANs) for
both recognition and video generation of ISL signs,
resulting in high-quality outputs as shown by a PSNR
of 31.14 dB and an SSIM of 0.9916 (Khadhraoui et
al.; Ni et al.). These metrics confirm minimal
distortion and high fidelity in the generated video
content.
Systems that recognize sign language using
depth-sensing tools like Microsoft Kinect have shown
effective alphabet recognition, further supporting
real-time communication (Dong et al.). Hand gesture
datasets, such as the 2D ASL dataset proposed by
Barczak et al., have served as foundational resources
for gesture training and classification. Furthermore,
gesture-based systems have been explored for
enhancing decision support in sports through visual
cues (Bhansali & Narvekar), and wearable
technologies like ISL-to-speech gloves have
demonstrated real-world applicability (Heera et al.).
Mobile applications integrating image processing
for sign translation provide scalable, platform-
independent solutions for American Sign Language
(Jin et al.). Likewise, early research into gesture
interfaces laid the groundwork for modern deep
learning-based systems (Lesha et al.). Altogether,
these efforts underscore the evolution of sign
language recognition from static image interpretation
to dynamic, intelligent systems capable of real-time
translation and inclusive communication (Murthy et
al.).
3 METHODOLOGY
3.1 Data Sets
• Feature Extraction: Relevant features such as
hand shape, finger orientation, and joint positions
need to be extracted from the hand images.
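As a concrete illustration, the sketch below extracts hand-joint features with MediaPipe Hands; the library choice and the flat 63-value feature vector are assumptions for illustration, not necessarily the pipeline used in this work.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def extract_hand_features(image_bgr):
    """Return a (63,) vector of x, y, z coordinates for the 21 hand landmarks,
    or None if no hand is detected."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    landmarks = result.multi_hand_landmarks[0].landmark
    return np.array([[lm.x, lm.y, lm.z] for lm in landmarks]).flatten()
```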
• Normalization: Normalize the extracted features in order to standardize the input to the model.
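A minimal normalization sketch, assuming the 21-landmark feature vector from the previous step: coordinates are made wrist-relative and rescaled, which is one common choice rather than the specific scheme used here.

```python
import numpy as np

def normalize_features(features):
    """Normalize a flattened (21, 3) landmark array: translate so the wrist
    (landmark 0) sits at the origin, then scale to a fixed range."""
    pts = features.reshape(21, 3)
    pts = pts - pts[0]                 # wrist-relative coordinates
    scale = np.max(np.abs(pts))
    if scale == 0:
        scale = 1.0                    # avoid division by zero
    return (pts / scale).flatten()
```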
• Deep Learning Model Selection: CNN for Spatial Features: Apply convolutional neural networks to extract spatial features of the hand gesture in every frame. RNN for Temporal Features: Use recurrent networks such as LSTMs to model the temporal dynamics of sign language gestures across several frames.
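One possible way to combine the two, sketched below with Keras: a TimeDistributed CNN produces per-frame spatial features and an LSTM models the sequence. The frame count, image size, layer widths, and class count are illustrative assumptions, not the configuration reported in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, IMG_SIZE, NUM_CLASSES = 30, 64, 26   # illustrative values only

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, IMG_SIZE, IMG_SIZE, 3)),
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(128),                              # temporal modelling
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```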
• Training the Model:
• Dataset Splitting: Divide the collected dataset into training, validation, and testing sets.
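A common split, shown as a sketch with scikit-learn; the 70/15/15 proportions and the variables X (feature sequences) and y (gesture labels) are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# 70% train, 15% validation, 15% test (illustrative proportions)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```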
• Training the Network: Train the CNN or RNN model on the training data and optimize its parameters so that sign language gestures are classified as accurately as possible.
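Continuing the sketch above, training might look as follows; the optimizer, loss, epoch count, and batch size are illustrative assumptions rather than the settings used in this work.

```python
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # integer class labels
              metrics=["accuracy"])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=20, batch_size=32)
```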
• Hyperparameter Tuning: Adjust the learning rate, batch size, and other hyperparameters to improve model performance.
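One simple way to do this is a small grid search over candidate values, keeping the configuration with the best validation accuracy; the grid below is purely illustrative.

```python
import tensorflow as tf

best = {"val_acc": 0.0}
for lr in (1e-2, 1e-3, 1e-4):
    for batch_size in (16, 32, 64):
        candidate = tf.keras.models.clone_model(model)   # fresh, untrained copy
        candidate.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        hist = candidate.fit(X_train, y_train,
                             validation_data=(X_val, y_val),
                             epochs=5, batch_size=batch_size, verbose=0)
        val_acc = max(hist.history["val_accuracy"])
        if val_acc > best["val_acc"]:
            best = {"val_acc": val_acc, "lr": lr, "batch_size": batch_size}
print(best)
```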
• Real-time Inference: Analyze the video stream frame by frame: extract each frame, extract features from it, and feed the per-frame features into the trained model to predict the corresponding sign language gesture.
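A hypothetical real-time loop with OpenCV, reusing extract_hand_features and normalize_features from above; model here is assumed to be a per-frame classifier over the 63-value landmark vector, and LABELS is an assumed list mapping class indices to gesture names.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                       # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    features = extract_hand_features(frame)
    if features is not None:
        x = normalize_features(features)[np.newaxis, :]   # shape (1, 63)
        probs = model.predict(x, verbose=0)[0]
        label = LABELS[int(np.argmax(probs))]
        cv2.putText(frame, label, (10, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("Sign Language Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):       # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```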
• Output: Translate the recognized signs into text or speech output (a speech-output sketch follows below).
• Important Consideration (Sign Language Variation): Consider the specific sign language dialect when gathering data and training the model.
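For the speech side of the output, one offline option is pyttsx3, sketched below; this is an assumed library choice, not necessarily the one used in this work.

```python
import pyttsx3

engine = pyttsx3.init()                 # offline text-to-speech engine

def speak_sign(label):
    """Show the recognized sign as text and speak it aloud."""
    print(f"Recognized sign: {label}")
    engine.say(label)
    engine.runAndWait()
```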
• Robustness to Noise: Implement ways of dealing with variations in lighting, background clutter, and hand movement.
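One way to build in such robustness during training is data augmentation, shown as a sketch with Keras preprocessing layers that jitter brightness, contrast, position, and rotation; the specific layers and ranges are assumptions (RandomBrightness requires a recent TensorFlow release).

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomBrightness(0.2),        # simulate lighting changes
    tf.keras.layers.RandomContrast(0.2),
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # small hand-position shifts
    tf.keras.layers.RandomRotation(0.05),
])
```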
• In this sign language recognition system, the detector recognizes numbers, and it can easily be extended to cover a wide range of other signs, including alphabets. The model used to develop our system is a convolutional neural network (CNN).
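A minimal single-frame CNN for the ten digit signs is sketched below; extending it to alphabets would mainly mean changing the number of output classes. All layer sizes are illustrative assumptions.

```python
from tensorflow.keras import layers, models

digit_cnn = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one class per digit sign
])
```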
• With the camera turned on, the user can make hand signs, and the system decodes each sign and displays it for the user. By signing, a person can convey a large amount of information in a short period of time. Sign language recognition