Ensemble Deep Learning for Multilingual Sign Language Translation
and Recognition
Ancylin Albert P, Karumanchi Dolly Sree and Nivethitha R
Department of CSE, Karunya Institute of Technology and Sciences, Coimbatore, India
Keywords: Sign Language Recognition, Multilingual Translation, Ensemble Deep Learning, EfficientNet, ResNet, DenseNet, Real-Time Gesture Recognition, Accessibility.
Abstract: This research introduces an advanced system for instantaneous sign language interpretation and conversion,
employing a fusion of sophisticated neural networks such as ResNet, DenseNet, and EfficientNet. The
innovative technology seeks to bridge the communication gap between hearing-impaired and hearing
individuals by precisely decoding sign language movements and converting them into spoken language. The
system accommodates various input formats and provides instant translations in multiple languages.
Empirical tests demonstrate that the EfficientNet model achieved superior performance with a 99.8% accuracy
rate, surpassing other models. This innovation enhances communication accessibility for the deaf community
and enables seamless interaction across language barriers. Ongoing research will concentrate on improving
computational efficiency and expanding language support capabilities.
1 INTRODUCTION
Human interaction hinges on communication,
allowing individuals to exchange thoughts, convey
feelings, and establish meaningful relationships that
contribute to personal development and social
integration. For those who are deaf or hard of hearing,
effective communication is even more crucial,
significantly impacting their ability to interact with
the world around them. Their primary mode of
expression is sign language—a sophisticated and
nuanced form of communication that incorporates
complex hand gestures, facial expressions, and body
language.
Although sign language is culturally significant
and widely used, a persistent communication gap
exists between its users and those unfamiliar with it.
This divide often marginalizes the deaf community,
limiting their access to crucial areas such as
education, healthcare, and employment. In medical
settings, misinterpretations can result in serious
errors, while workplace obstacles can impede career
advancement and personal growth. Addressing this
gap is not just a technological challenge but a societal
obligation that emphasizes inclusivity and
accessibility.
Efforts to develop sign language recognition
systems have been ongoing for years to tackle this
issue. However, current solutions often fall short due
to inherent limitations. Achieving high accuracy in
real-time gesture recognition remains a significant
challenge, as sign language varies widely among
users due to individual expression, regional dialects,
and environmental factors such as lighting and
background clutter. Moreover, many systems lack
support for multiple languages, restricting their
application to a single language or requiring extensive
customization for others. These limitations reduce
their global applicability, particularly in linguistically
diverse regions.
Another major obstacle is the reliance on
expensive specialized hardware, such as depth-
sensing cameras or motion-capture devices, which
hinders widespread adoption. Additionally, many
systems still depend on manual interpretation or
human intervention, reducing autonomy, increasing
costs, and limiting scalability.
To address these challenges, this research
introduces an advanced real-time sign language
recognition and multilingual translation system
powered by cutting-edge technologies. The system's
core utilizes an ensemble of deep neural networks
including ResNet, DenseNet, and EfficientNet to
accurately identify and interpret sign language
gestures. These networks, trained on diverse datasets,
ensure robust performance by accounting for user
variability and environmental conditions.
The system extends beyond gesture recognition by
incorporating multilingual translation capabilities,
converting recognized gestures into spoken and
written languages in real time. Its user-friendly
interface and real-time processing empower both deaf
individuals and their communication partners,
facilitating seamless interactions without the need for
third-party interpreters.
By ensuring high precision in gesture recognition,
the system offers consistent accuracy across a range
of gestures, even those with complex or subtle
variations. Furthermore, its scalable multilingual
support allows the system to handle multiple
languages, making it adaptable to global contexts
without requiring substantial customization.
Moreover, the system is more cost-effective because it eliminates the need for expensive specialized hardware, making it accessible to a wider audience.
This system's influence on society is significant.
Its capacity for instantaneous, cross-language sign
language interpretation has the potential to
revolutionize various sectors, including education,
healthcare, and the job market, while also improving
social interactions and promoting inclusiveness.
Future advancements might encompass the addition
of emotion detection, capturing the complete
expressive spectrum of sign language, or the
incorporation of AR and VR technologies to provide
immersive communication experiences.
To sum up, this study marks a critical
advancement towards a more inclusive world, where
cutting-edge technologies eliminate barriers,
empower individuals, and honor diversity on a
worldwide scale.
2 RELATED WORKS
Menglin Zhang et al. (Zhang et al., 2023) introduce
a deep learning model for distinguishing standard sign
language. Their approach combines enhanced hand
detection using an improved Faster R-CNN with a
correctness discrimination model that utilizes 3D and
deformable 2D convolutional networks. The model
incorporates a sequence attention mechanism to boost
feature extraction. The researchers developed the
SLCD dataset, annotating videos with category and
standardization correctness metrics. The model's
performance is assessed through semi-supervised
learning, showing high accuracy in hand detection
and correctness discrimination. The study notes
challenges in enhancing optical flow for better hand
detection accuracy in dynamic situations.
The research by B. Natarajan et al. (Natarajan et al., 2022) presents H-DNA, a comprehensive deep
learning framework for sign language recognition,
translation, and video generation. This system
integrates a hybrid CNN-BiLSTM for recognition,
Neural Machine Translation (NMT) for text-to-sign
conversion, and Dynamic GAN for video production.
H-DNA achieves high accuracy across multilingual
datasets and offers real-time application potential. It
also generates high-quality sign language videos with
natural gestures. While addressing challenges like
signer independence and complex background
handling, the study acknowledges limitations in
scaling the model to larger datasets and improving
alignment in dynamic gestures.
Deep R. Kothadiya et al. (Kothadiya et al., 2024)
propose a hybrid InceptionNet-based architecture for
isolated sign language recognition. This model
combines convolutional layers with an optimized
version of Inception v4, enhanced by auxiliary
classifiers and spatial factorization for feature
learning. The research emphasizes ensemble learning
to merge predictions from multiple models,
enhancing accuracy and robustness. The system
achieved 98.46% accuracy on the IISL-2020 dataset
and showed competitive performance on other
benchmark datasets. The study identifies challenges
in managing computational costs and training
complexities associated with ensemble methods, with
future work aimed at exploring lightweight models
and expanding real-time datasets.
Tangfei Tao et al. (Tao et al., 2024) offer
a thorough examination of conventional and deep
learning approaches for sign language recognition
(SLR). Their paper traces the progression from early
sensor- and glove-based methods to contemporary
computer vision and deep learning techniques, with a
focus on feature extraction and temporal modeling.
The review highlights the growing use of
transformers and graph neural networks to enhance
spatio-temporal learning. The authors classify
datasets and pinpoint challenges, including signer
dependency and novel sentences, while suggesting
future research directions for robust, real-world SLR
systems. This comprehensive review serves as a
valuable resource for understanding advancements
and ongoing issues in sign language recognition
research.
Abu Saleh Musa Miah and collaborators (Miah et al., 2024) introduce an innovative two-stream
multistage graph convolution with attention and
residual connection (GCAR) for sign language
recognition. Their approach combines spatial-
temporal contextual learning using both joint skeleton
and motion information. The GCAR model, enhanced
with a channel attention module, demonstrated high
performance on extensive datasets, including
WLASL (90.31% accuracy for Top-10) and
ASLLVD (34.41% accuracy). While the method
shows efficiency and generalizability, the authors
note computational challenges when dealing with
larger datasets.
In their study, Abu Saleh Musa Miah and colleagues (Miah et al., 2024) present the GmTC
model for hand gesture recognition in multi-cultural
sign languages (McSL). This end-to-end system
combines graph-based features with general deep
learning through dual streams. A Graph
Convolutional Network (GCN) extracts distance-
based relationships among superpixels, while
attention-based features are processed using a Multi-
Head Self-Attention (MHSA) and CNN module. By
merging these features, the model improves
generalizability across diverse cultural datasets,
including Korean, Bangla, and Japanese Sign
Languages. Evaluations on five datasets show
superior accuracy compared to state-of-the-art
systems. The research also identifies challenges such
as computational complexity and fixed patch sizes in
image segmentation.
Hamzah Luqman (Luqman, 2022) proposes a two-
stream network for isolated sign language
recognition, emphasizing accumulative video motion.
The method employs a Dynamic Motion Network
(DMN) for spatiotemporal feature extraction and an
Accumulative Motion Network (AMN) to encode
motion into a single frame. A Sign Recognition
Network (SRN) fuses and classifies features from
both streams. This approach addresses variations in
dynamic gestures and enhances recognition accuracy
in signer-independent scenarios. The model was
tested on Arabic and Argentinian sign language
datasets, achieving significant performance
improvements over existing techniques.
Giray Sercan Özcan et al. (Özcan et al., 2024)
investigate Zero-Shot Sign Language Recognition
(ZSSLR) by modeling hand and pose-based features.
Their framework utilizes ResNeXt and MViTv2 for
spatial feature extraction, ST-GCN for spatial-
temporal relationships, and CLIP for semantic
embedding. The method maps visual representations
to unseen textual class descriptions, enabling
recognition of previously unencountered classes.
Evaluated on benchmark ZSSLR datasets, the
approach demonstrates substantial improvements in
accuracy, setting a new standard for addressing
insufficient training data in sign language recognition.
Jungpil Shin et al. (Shin et al., 2024) created an
innovative Korean Sign Language (KSL) recognition
system that combines handcrafted and deep learning
features to identify KSL alphabets. Their approach
utilized two streams: one based on skeleton data to
extract geometric features such as joint distances and
angles, and another employing a ResNet101
architecture to capture pixel-based representations.
The system merged these features and processed them
through a classification module, achieving high
recognition accuracy across newly developed KSL
alphabet datasets and established benchmarks like
ArSL and ASL. The researchers also contributed a
new KSL alphabet dataset featuring diverse
backgrounds, addressing limitations in existing
datasets. However, the model's dependence on
substantial computational resources and the need for
additional testing on more extensive datasets were
identified as potential areas for enhancement.
Candy Obdulia Sosa-Jiménez et al. (Sosa-Jiménez et al., 2022) developed a two-way translator system for
Mexican Sign Language (MSL) specifically designed
for primary healthcare settings. The system combines
sign recognition using Microsoft Kinect sensors and
hidden Markov models with MSL synthesis via a
signing avatar for real-time communication. It can
recognize 31 static and 51 dynamic signs, providing a
specialized vocabulary for medical consultations. The
research demonstrated the system's efficacy in
facilitating communication between deaf patients and
hearing doctors, with average accuracy and F1 scores
of 99% and 88%, respectively. Although innovative,
the system's reliance on specific hardware (Kinect)
could restrict its scalability and widespread
implementation.
Zinah Raad Saeed et al. (Saeed et al., 2022)
performed a comprehensive review of sensory glove
systems for sign language pattern recognition,
examining studies from 2017 to 2022. They
emphasized the benefits of glove-based techniques,
such as high recognition accuracy and functionality in
low-light environments, while also noting challenges
including user comfort, cost, and limited datasets. The
review classified motivations, challenges, and
recommendations, stressing the importance of
developing scalable, affordable, and comfortable
designs. Despite advancements, the study identified
gaps in handling dynamic gestures and incorporating
non-manual signs like facial expressions, outlining a
direction for future research to address these
limitations.
Md. Amimul Ihsan et al. (Ihsan et al., 2024)
developed MediSign, a deep learning framework
designed to enhance communication between
hearing-impaired patients and doctors. The system
employs MobileNetV2 for feature extraction and an
attention-based BiLSTM to process temporal
information. The researchers created a custom dataset
featuring 30 medical-related signs performed by 20
diverse signers, achieving a validation accuracy of
95.83%. MediSign demonstrates robustness in
handling various backgrounds, lighting conditions,
and physiological differences among signers.
However, the study identifies areas for future
improvement, including expanding the dataset to
encompass more complex medical terminology and
evaluating the system's performance in real-world
settings.
3 METHODOLOGY
The proposed system utilizes a blend of advanced technologies to enable precise, real-time sign language interpretation and translation across multiple languages. This section details the specific technical procedures and
approaches implemented throughout the system's
various elements, highlighting how each component
integrates and operates to provide a fluid user
interface.
3.1 Ensemble DNN Architecture
A key feature of the system is its collection of deep
neural networks, engineered to boost the
dependability and precision of gesture identification.
By integrating multiple model structures, this
collective approach capitalizes on each model's
individual strengths to deliver exceptional results.
The ensemble consists of three carefully chosen
architectures:
3.1.1 EfficientNetB0:
In sign language detection, EfficientNet stands out for
its remarkable efficiency, delivering superior
performance by optimizing network depth, width, and
input resolution while minimizing computational
expenses. This optimization proves especially
valuable in real-time systems demanding rapid and
accurate recognition of hand gestures and
movements. EfficientNet's compound scaling method
enables models from B0 to B7 to effectively manage
the varying complexities of sign language datasets,
capturing intricate hand forms, motions, and facial
expressions. With its reduced parameter count,
EfficientNet is well-suited for mobile and embedded
devices, making it a preferred choice for developing
accessible and portable sign language translation
applications.
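As a concrete illustration of how such a backbone can be adapted to gesture classification, the sketch below assumes a tf.keras workflow, a 224x224 RGB input, and a placeholder NUM_CLASSES; none of these details are specified in the paper, so treat them as illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0

NUM_CLASSES = 26  # assumed number of gesture classes

# Pretrained EfficientNetB0 used as a (initially frozen) feature extractor,
# followed by a small classification head for sign gestures.
backbone = EfficientNetB0(include_top=False, weights="imagenet",
                          input_shape=(224, 224, 3))
backbone.trainable = False
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

After an initial training phase, the backbone can be unfrozen and fine-tuned at a lower learning rate, a common transfer-learning practice rather than a step prescribed by the paper.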
3.1.2 ResNet50:
ResNet's architecture, featuring residual connections,
excels in identifying complex patterns and features
crucial for sign language detection, such as shifts in
hand position and finger articulations. By addressing
the vanishing gradient issue, ResNet facilitates the use
of very deep networks capable of extracting the
intricate features essential for accurate gesture
classification. Variants such as ResNet-50 and
ResNet-101 are widely adopted in gesture recognition
due to their proficiency in learning complex spatial
features. In the realm of sign language detection,
ResNet's skip connections enhance feature
propagation, enabling the model to capture subtle
gesture variations and achieve high classification
accuracy across multiple sign categories.
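The skip-connection idea can be written down in a few lines. The function below is an illustrative residual block built from tf.keras layers, not an excerpt of ResNet50 itself; in practice a pretrained ResNet50 can be loaded from tf.keras.applications in the same way as EfficientNetB0 above.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    # Output is F(x) + shortcut(x); the additive shortcut lets gradients
    # bypass the convolutional path, mitigating vanishing gradients.
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # Project the shortcut when the spatial size or channel count changes.
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))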
3.1.3 DenseNet169:
DenseNet's architecture, characterized by densely
connected layers, is particularly well-suited for sign
language detection, where efficient gradient flow and
feature reuse are paramount. Unlike traditional
networks, DenseNet connects each layer to every
other layer in a feed-forward manner, facilitating the
capture of detailed gesture features and contextual
relationships across frames. This structure preserves
fine-grained spatial and motion information, which is
crucial for recognizing complex sequential signs.
DenseNet-121 and DenseNet-169 have demonstrated
strong performance in gesture recognition, achieving
high accuracy while maintaining a compact model
structure. DenseNet's ability to minimize redundant
computations while reusing features makes it highly
suitable for real-time, resource-efficient sign
language translation systems.
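The connectivity pattern described above can be summarized as a simplified dense block, sketched below with tf.keras layers; this is illustrative only and is independent of whichever DenseNet169 implementation is actually used.

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    # Each layer sees the concatenation of all feature maps produced so far,
    # encouraging feature reuse and short gradient paths.
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])
    return x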
The training process for each model involves
using a comprehensive, labeled dataset of sign
language movements. This dataset encompasses a
wide range of sign examples to ensure the models can
handle variations in user technique, geographical
differences, and linguistic variations.
To enhance the models' ability to generalize, the dataset is artificially expanded using data augmentation methods such as image flipping, rotation, resizing, and cropping.
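One possible realization of this augmentation pipeline is sketched below using tf.keras preprocessing layers; the specific parameter values are assumptions chosen for illustration, not the settings used in the paper.

import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # image flipping
    layers.RandomRotation(0.05),       # small rotations (about +/-18 degrees)
    layers.RandomZoom(0.1),            # mild resizing
    layers.RandomCrop(224, 224),       # cropping to the model input size
                                       # (assumes source frames larger than 224x224)
])

# Applied on the fly inside a tf.data pipeline, e.g.:
# train_ds = train_ds.map(lambda img, lbl: (augment(img, training=True), lbl))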
INCOFT 2025 - International Conference on Futuristic Technology
772
Figure 1: System Architecture.
Figure 2: DNN Architecture.
The models work independently to extract relevant features from the input data. These extracted features are then merged by a fusion step to produce a single, consolidated output, as sketched below.
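The fusion operator is not spelled out in the paper, so the sketch below assumes a simple scheme: each backbone independently produces a pooled feature vector, the three vectors are concatenated, and a shared softmax head classifies the gesture. Backbone-specific input preprocessing is omitted for brevity.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0, ResNet50, DenseNet169

NUM_CLASSES = 26  # assumed number of gesture classes
inputs = layers.Input(shape=(224, 224, 3))

def pooled(backbone_cls):
    # Each backbone acts as an independent feature extractor on the same input.
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=(224, 224, 3))
    return layers.GlobalAveragePooling2D()(backbone(inputs))

fused = layers.Concatenate()([pooled(EfficientNetB0),
                              pooled(ResNet50),
                              pooled(DenseNet169)])
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
ensemble = models.Model(inputs, outputs)
ensemble.compile(optimizer="adam", loss="categorical_crossentropy",
                 metrics=["accuracy"])

Alternatives such as averaging the per-model softmax outputs or weighted voting would fit the same description; concatenation is only one plausible choice.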
4 RESULTS AND DISCUSSION
This section presents the practical results of
implementing the newly developed sign language
recognition and translation system, examining its
effectiveness and potential impact on improving
communication accessibility. The system underwent
extensive evaluation using a comprehensive dataset
comprising more than 10,000 annotated examples of
sign language gestures.
Table 1: Accuracy of different models.
Model            Accuracy (%)
EfficientNetB0   99.8
ResNet50         74.2
DenseNet169      53
Figure 3: Training vs Validation Accuracy and Loss Graph
of EfficientNetB0.
Figure 4: Training vs Validation Accuracy and Loss Graph
of ResNet50.
Figure 5: Training vs Validation Accuracy and Loss Graph
of DenseNet169.
The findings indicate that EfficientNetB0 achieves markedly higher accuracy than the other models tested. This precision supports the system's ability to interpret sign language and convert it into spoken or written form in real time.
By incorporating multiple language translation
features, the system's utility is expanded, making it a
valuable resource for users from diverse linguistic
communities.
Although the results are encouraging, the system
requires substantial computing power due to the
intricate nature of the combined models and
instantaneous processing. Improving computational
performance remains a crucial focus for future
developments.
The effective application of the ensemble deep
neural network architecture in this context opens up
new avenues for exploring real-time gesture
recognition technologies. This could potentially
extend beyond sign language to other forms of
communication that rely on gesture interpretation.
5 CONCLUSION
The development and evaluation of an advanced
real-time sign language recognition and translation
system have been successfully accomplished in this
study. Utilizing a combination of sophisticated deep
neural networks, the system has exhibited remarkable
performance, achieving a 99.8% accuracy rate in
interpreting sign language gestures instantaneously.
The system's utility is further expanded through
the incorporation of multilingual translation features,
establishing it as a crucial tool for dismantling
communication barriers between deaf and hearing
individuals across various linguistic backgrounds.
While the system has made significant progress, it
faces challenges related to computational
requirements and the need for further optimization to
enhance its mobile device compatibility. Subsequent
research will address these issues by exploring more
efficient computational approaches, expanding the
dataset to encompass a wider array of sign languages,
and implementing compact models for improved
portability.
This research makes a substantial contribution to
the field of assistive technologies by advancing the
capabilities of real-time sign language translation. It
paves the way for a more inclusive future for the deaf
community and promotes enhanced communication
accessibility for all individuals.
REFERENCES
Menglin Zhang et al., "Deep Learning-Based Standard Sign
Language Discrimination," Tianjin University of
Technology, 2023.
B. Natarajan et al., "Development of an End-to-End Deep
Learning Framework for Sign Language Recognition,
Translation, and Video Generation," SASTRA Deemed
University, 2022.
Deep R. Kothadiya et al., "Hybrid InceptionNet Based
Enhanced Architecture for Isolated Sign Language
Recognition," Charotar University of Science and
Technology (CHARUSAT), 2024.
Tamer Shanableh, "Two-Stage Deep Learning Solution for
Continuous Arabic Sign Language Recognition Using
Word Count Prediction and Motion Images," IEEE
Access, 2023.
Tangfei Tao, Yizhe Zhao, Tianyu Liu, and Jieli Zhu, "Sign
Language Recognition: A Comprehensive Review of
Traditional and Deep Learning Approaches, Datasets,
and Challenges," IEEE Access, 2024.
Abu Saleh Musa Miah et al., "Sign Language Recognition
Using Graph and General Deep Neural Network Based
on Large Scale Dataset," IEEE Access, 2024.
Abu Saleh Musa Miah et al., "Hand Gesture Recognition for
Multi-Culture Sign Language Using Graph and General
Deep Learning Network," The University of Aizu,
Japan, 2024.
Hamzah Luqman, "An Efficient Two-Stream Network for
Isolated Sign Language Recognition Using
Accumulative Video Motion," King Fahd University of
Petroleum & Minerals, 2022.
Giray Sercan Özcan et al., "Hand and Pose-Based Feature
Selection for Zero-Shot Sign Language Recognition,"
Başkent University, Türkiye, 2024.
Jungpil Shin et al., "Korean Sign Language Alphabet
Recognition Through the Integration of Handcrafted
and Deep Learning-Based Two-Stream Feature
Extraction Approach," IEEE Access, 2024.
Candy Obdulia Sosa-Jiménez et al., "A Prototype for
Mexican Sign Language Recognition and Synthesis in
Support of a Primary Care Physician," IEEE Access,
2022.
Zinah Raad Saeed et al., "A Systematic Review on Systems-
Based Sensory Gloves for Sign Language Pattern
Recognition: An Update From 2017 to 2022," IEEE
Access, 2022.
Jungpil Shin et al., "Dynamic Korean Sign Language
Recognition Using Pose Estimation-Based and
Attention-Based Neural Network," IEEE Access, 2023.
Sunusi Bala Abdullahi et al., "IDF-Sign: Addressing
Inconsistent Depth Features for Dynamic Sign Word
Recognition," IEEE Access, 2023.
Md. Amimul Ihsan et al., "MediSign: An Attention-Based
CNN-BiLSTM Approach of Classifying Word Level
Signs for Patient-Doctor Interaction," IEEE Access,
2024.