A Comparative Analysis of Glove-Based and Image-Based Sign
Language Recognition Systems
Junkang Rong
School of Computing, Civil Aviation Flight University of China, Guanghan, Sichuan, 618300, China
Keywords: Sign Language Recognition, Image Recognition, Glove-Based Sensors.
Abstract: As an important means of communication for deaf and hearing-impaired individuals, sign language possesses
a unique linguistic system and mode of expression. However, because sign language is not widely used, standardization remains limited, and the sign languages used by different deaf communities differ significantly. Consequently, sign language recognition technology plays a crucial role in
facilitating barrier-free communication between deaf individuals and those with normal hearing. Sign
language recognition systems that utilize glove-based sensors and image recognition have made significant
advancements, thereby enabling more convenient communication. Glove-based sensors offer high precision
by capturing detailed hand gestures. They are particularly effective in low-light conditions or when the signer
is off-camera, providing a consistent recognition rate. On the other hand, image recognition systems excel in
their non-intrusive nature, allowing for sign language interpretation without the need for the signer to wear
any devices. They can process sign language in real-time and are ideal for video-based applications, making
them suitable for inclusive social interactions and educational tools. This paper will review glove-based and
image-based sign language recognition systems, covering their background, current research status, key
technologies, and potential application prospects.
1 INTRODUCTION
According to the World Health Organization, there
are more than 1 billion persons with disabilities
globally, with the deaf accounting for 10 per cent of
the disabled population. In China, according to the Sixth National Population Census and the Second National Sample Survey of Persons with Disabilities, about 82 million people are classified as disabled, of whom 20.54 million have hearing impairments and 1.3 million have speech impairments. As the main means of
communication for these people, sign language has a
unique syntax, semantics and vocabulary system
(Jones, 2021). However, since sign language is not widely used by the general population, standardized sign language has low prevalence, and dialectal variation exists among the sign languages used by different deaf groups, which poses a challenge to the development of sign language recognition technology.
Originating in the 1980s, sign language
recognition technology has steadily received
attention and research due to the ongoing
advancements in computer technology. Early sign
language recognition systems were mostly based on
wearable devices such as data gloves, and realized the
classification and recognition of sign language
through multi-sensor fusion technology (Cheok,
2019). However, such systems suffer from bulky equipment, high cost, and reduced naturalness of human-computer interaction. In recent
years, with the rise of computer vision and deep
learning technologies, the sign language recognition
system based on image recognition has gradually
become the mainstream of research (Wadhawan,
2021). The system captures sign language images or
videos through cameras, and realizes the automatic
recognition of sign language using techniques such as
image processing, feature extraction and
classification algorithms.
This paper reviews advancements in sign
language recognition, focusing on glove-based and
image-based systems. It examines their technologies,
evaluates their accuracy and convenience in
facilitating communication for the deaf and hearing-
impaired, and discusses their research status and
applications.
2 GLOVE-BASED SIGN
LANGUAGE RECOGNITION
Through the use of multi-sensor fusion technology,
the glove-based sign language recognition system
precisely obtains the angle information, movement
trajectory, and temporal information of the hand in
order to accomplish sign language classification and
recognition (Amin, 2022). Such systems achieve high recognition rates, but the equipment is bulky and costly, which reduces the ease and naturalness of human-computer interaction and makes them difficult to deploy in everyday life.
2.1 Representative Works
In recent years, researchers have developed a new
type of data glove, which adds contact sensors and
can effectively utilize the large amount of contact
information in Chinese sign language. For example, the novel data glove proposed by Zhang Yaxin et al. is designed specifically for Chinese sign language recognition; it is tailored to the characteristics of Chinese sign language and offers both low cost and high recognition accuracy. The glove addresses the shortcomings of existing data gloves for Chinese sign language recognition by improving how its bend and contact sensors are applied. In addition, the design accounts for the complexity of the signs users need to express, and the sensors have been adapted accordingly. The main objective of this
research is to improve the accuracy and efficiency of
sign language recognition to meet the needs in
practical applications. Yaxin Zhang's team
experimentally verified the feasibility of this new data
glove in virtual environment interaction tasks and
demonstrated its effectiveness in the field of
teleoperation. This suggests that the glove is not only
suitable for sign language recognition, but can also be
useful in other applications that require precise
gesture control. In conclusion, the novel data glove
proposed by them significantly improves the
performance of Chinese sign language recognition by
optimizing the sensor configuration and design, and
at the same time remains cost-effective and easy to operate, giving it broad potential across application scenarios. In terms of sensor design, the new glove adds contact sensors to the original configuration: it removes the thumb-crossing sensor and the abduction sensors between the middle and ring fingers and between the ring and little fingers (used to measure finger-spread angles), and adds contact sensors at the fingertips, yielding a total of 20 information points. Moreover, the system has low cost and recognizes Chinese sign language well.
As for depth information, the authors found that the depth cue reflected by the bending angle of the elbow joint is also indispensable, so the three aspects were considered together to produce a new type of wearable human-posture sensor. Compared with
the widely used q-type and p-type data gloves, this
data glove adds a new type of contact sensor,
effectively utilizes a large amount of contact
information in Chinese sign language, and has the
advantages of being cheaper, more suitable for the
characteristics of Chinese sign language, and higher
recognition accuracy. Finally, this data glove is
applied to a new Chinese sign language recognition
system, which can recognize Chinese sign language
words more accurately by combining with the visual
part (Zhang, 2001).
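To make the data flow concrete, the sketch below shows one plausible way such a glove's readings could be assembled into a feature vector and matched against stored sign templates. This is a hypothetical illustration rather than the authors' implementation: the channel layout (loosely mirroring the 20 information points described above), the ADC range, and the templates are all invented.

```python
import numpy as np

# Hypothetical 20-channel layout (the real glove's channel map is not
# public): 10 flex readings (2 per finger), 5 fingertip contact bits,
# and 5 orientation angles (3 wrist Euler angles, forearm rotation,
# and the elbow bend angle mentioned in the text).
def make_feature(flex, contact, angles):
    """Normalize raw channels into a single 20-dim feature vector."""
    flex = np.asarray(flex, dtype=float) / 1023.0     # assumed 10-bit ADC
    contact = np.asarray(contact, dtype=float)        # already 0/1
    angles = np.asarray(angles, dtype=float) / 180.0  # degrees, roughly [-1, 1]
    return np.concatenate([flex, contact, angles])

def classify(sample, templates, labels):
    """Nearest-template classification, one stored template per sign."""
    dists = np.linalg.norm(templates - sample, axis=1)
    return labels[int(np.argmin(dists))]

# Toy usage with two fabricated sign templates.
templates = np.stack([
    make_feature([900] * 10, [1, 1, 0, 0, 0], [10, 0, 45, 5, 90]),  # fist-like
    make_feature([100] * 10, [0, 0, 0, 0, 0], [0, 0, 90, 0, 10]),   # open palm
])
labels = ["sign_A", "sign_B"]
reading = make_feature([880] * 10, [1, 1, 0, 0, 0], [12, 2, 40, 6, 85])
print(classify(reading, templates, labels))  # -> "sign_A"
```

A real system would replace the nearest-template matcher with a trained classifier, but the normalization-then-match structure is representative of template-based glove recognition.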
This design has several advantages. First, the new glove's added contact sensors effectively exploit the abundant contact information in Chinese sign language. This makes it
possible to more accurately capture the details and
subtle changes of gestures during the recognition
process. Moreover, compared with the existing
CyberGlove model data glove, the new data glove has
a cost advantage and is better suited to the
characteristics of Chinese sign language. This means
that it is not only less expensive, but also more
compatible with Chinese sign language habits in
practical applications. By incorporating a visual
component, the data glove is able to recognize Chinese
sign language words more accurately. This high-
precision recognition capability is crucial to improving
the overall performance of the sign language
interpreting system. Importantly, the glove is able to
acquire and analyze finger contact information,
thereby reducing repetitive or unnecessary contact
information and improving recognition efficiency and
accuracy. The new data glove also has the ability to
differentiate between left and right fingers, which is
important for certain specific sign language gestures,
as different combinations of fingers may be required
for different gestures. The new data gloves utilize
common and inexpensive components such as
Bluetooth modules, gyroscopes and flexible sensors,
making the total cost significantly lower than current
data gloves with similar capabilities. This not only reduces the cost of the device but also lowers algorithmic complexity. Finally, the glove implements a
real-time recognition and decoding system on smart
terminals, ensuring fast data processing, which is very
important for the user experience in real applications.
Another representative work comes from Yin Yafeng's group, which proposed a technique based on area-aware time-sequence maps to achieve real-time sign language recognition and translation on lightweight edge devices, offering deaf people communication and exchange services anytime, anywhere. Specifically,
the technology utilizes computer vision and image
processing techniques to acquire images of sign
language movements through a cell phone camera or
other video capture device. The system then uses
algorithms to analyze and process these images to
recognize finger and hand positions and movements.
During recognition, the technique may combine a finite state machine with dynamic time warping (DTW) to handle continuous gesture movements in sign language videos. In addition, it may involve deep learning models, such as models built with Keras, for classifying and recognizing sign language actions. Ultimately,
the recognized sign language movements are
translated into text or speech and displayed via a
connected digital screen or other output device so that
they can be understood by the hearing impaired and
others. In terms of hardware design, edge devices
typically need to have high-performance computing
power to support complex tasks, while requiring low
power consumption and a small footprint. For
example, the NVIDIA Jetson Xavier NX is a lightweight device that comes pre-installed with Ubuntu, is easy to set up, and supports 12-24 V wide-voltage operation and -10 to 55 °C wide-temperature operation.
These devices are also equipped with abundant input
and output ports, which facilitate the connection of
various types of sensors for multi-stream video edge
inference and obstacle avoidance. On the software side, lightweight frameworks such as Caffe2 add support for mobile devices, and mainstream machine learning frameworks such as PyTorch and MXNet are beginning to be deployed on edge devices. EdgeOS, an IoT edge operating system,
is built specifically to adapt to edge-side devices, with
core functions such as industrial protocol parsing,
data filtering and distribution, and is characterized by cross-platform support, ease of use, and support for secondary development. The last consideration is algorithm optimization: by designing and optimizing lightweight convolutional neural network models, researchers in image recognition and related fields can achieve real-time image recognition tasks on edge devices. Techniques such as quantization, pruning, and knowledge distillation can further reduce computational overhead and model size. A combination of a lightweight decoder
and a pyramid pooling transformer for edge
intelligence captures spatial and spectral details in the
shallow layers via wavelet transforms to effectively
recognize edges and reduce noise while maintaining
computational efficiency. Another infrared weak-target detection algorithm for embedded edge computing devices first uses a lightweight backbone network for feature extraction, then obtains the final binary segmentation map through multiple upsampling layers and cross-layer feature fusion, ensuring high detection accuracy and a low false alarm rate (Yin, 2017).
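Among the techniques named above, DTW is the most self-contained, so a minimal textbook implementation is sketched below. This is not the cited system's code, and the gesture tracks are fabricated one-dimensional signals standing in for real feature sequences.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping between two feature sequences.

    seq_a: (n, d) array, seq_b: (m, d) array. Returns the minimal
    cumulative Euclidean alignment cost, which tolerates differences
    in signing speed between the two sequences.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Each cell extends the cheapest of match / insertion / deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Toy usage: the same gesture performed at two speeds should align cheaply.
template = np.sin(np.linspace(0, 3, 30))[:, None]   # reference gesture track
slow = np.sin(np.linspace(0, 3, 50))[:, None]        # slower performance
other = np.cos(np.linspace(0, 3, 30))[:, None]       # a different gesture
print(dtw_distance(template, slow) < dtw_distance(template, other))  # True
```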
2.2 Discussions on Glove-Based Sign
Language Recognition
Reducing the cost of data gloves while maintaining or
enhancing their recognition accuracy can be achieved
through several strategic approaches. One such
approach is optimizing sensor configuration, which
involves eliminating unnecessary sensors like the
thumb-crossing sensor and the abduction sensor that
measures the angle of finger spread. These sensors
can be redundant in certain applications and
contribute to increased costs. Additionally, replacing
high-cost sensors with low-cost contact sensors can
lead to significant cost reductions due to the reduced
price of the sensors and the associated signal
conversion circuitry.
Improving algorithms and data processing
techniques is another vital strategy. The utilization of
machine learning algorithms, such as the generalized
regression neural network (GRNN), can significantly
boost gesture recognition accuracy, with research
indicating potential achievement of up to 99%
accuracy. Furthermore, incorporating neural network
models and template matching techniques can
enhance the recognition rate of similar gesture letters,
with the algorithm achieving a recognition rate of
98.5%.
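As an illustration of the GRNN approach mentioned above, the sketch below implements a generalized regression neural network in its Nadaraya-Watson form, i.e., kernel-weighted averaging over stored training samples. The data, feature dimension, and smoothing parameter are invented for the example; the 99% and 98.5% figures come from the cited studies, not from this toy.

```python
import numpy as np

class GRNN:
    """Generalized regression neural network (Nadaraya-Watson form).

    Stores all training samples; predictions are kernel-weighted
    averages of the training targets, so there is no iterative
    training. sigma is the smoothing parameter, tuned per dataset.
    """
    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def fit(self, X, y_onehot):
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(y_onehot, dtype=float)
        return self

    def predict(self, x):
        d2 = np.sum((self.X - x) ** 2, axis=1)     # squared distances
        w = np.exp(-d2 / (2.0 * self.sigma ** 2))  # RBF kernel weights
        scores = w @ self.Y / (w.sum() + 1e-12)    # weighted class average
        return int(np.argmax(scores))

# Toy usage: 2 gesture classes in a hypothetical 3-channel flex-sensor space.
X = [[0.9, 0.8, 0.9], [0.85, 0.9, 0.95], [0.1, 0.2, 0.1], [0.15, 0.1, 0.2]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]               # one-hot labels
model = GRNN(sigma=0.3).fit(X, Y)
print(model.predict(np.array([0.88, 0.85, 0.9])))  # -> 0
```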
The adoption of high-precision, low-latency
sensors also plays a crucial role in enhancing
recognition precision. Selecting sensors that offer
high accuracy and optimizing their arrangement on
the glove can improve the stability and reliability of
data capture. Advanced technologies such as magnetic fingertip tracking sensors and electromagnetic-field localization tracking can provide highly accurate finger motion capture data.
Lastly, simplifying the structural design of the
data glove can contribute to cost reduction and
maintenance ease. Designing the glove with a
removable outer fabric allows for easy replacement of
the outer layer in case of staining, reducing
maintenance costs and prolonging the glove's lifespan.
These combined efforts not only make data gloves
more affordable but also ensure they remain effective
tools for gesture recognition.
3 IMAGE-BASED SIGN
LANGUAGE RECOGNITION
The vision-based sign language recognition system
utilizes a camera to acquire 2D images or videos of
sign language, and recognizes them through
algorithms such as image processing and machine
learning (Wiley, 2018). Such systems are closer to real social needs and well suited to human-computer interaction, but they suffer from lower recognition rates, poorer real-time performance, and limited applicability (Subburaj, 2022). In recent years, with
the development of deep learning technology, the
performance of vision-based sign language
recognition system has been significantly improved.
3.1 Representative Works
Researchers at Wuhan University proposed a continuous sign language recognition algorithm, an attention-based 3D convolutional neural network (ACN). 3D CNNs are mainly used to process data with a temporal dimension, such as videos or 3D images. Unlike 2D CNNs, 3D CNNs use a 3D
convolutional kernel that is capable of performing
convolutional operations in the time dimension to
capture spatio-temporal features in the data. For
example, in video analysis, 3D CNNs can take frame-to-frame motion information into account to better understand video content. By dynamically
assigning varying weights to different areas of the
input data as it is processed, the attention mechanism,
on the other hand, allows the model to focus on the
most relevant information, mimicking the functioning
of the human visual system. The attention mechanism
in deep learning can greatly enhance the model's
performance, particularly when handling complicated
and high-dimensional input. Finally, introducing the
attention mechanism into 3D CNNs can improve the
performance of the model by making it more focused
on the critical parts of the input data. For example, in
EEG signal emotion recognition, a 3D CNN that
combines the frequency-space attention mechanism
(FSA-3D-CNN) is able to simultaneously consider
the information of EEG signals in the three
dimensions of time, space, and frequency, thus
improving the accuracy of emotion recognition. The ACN algorithm itself is capable of recognizing continuous sign language against complex backgrounds. The algorithm preprocesses
sign language videos containing complex
backgrounds through a background removal module
and extracts spatio-temporal fusion information using a 3D-ResNet equipped with a spatial attention mechanism (Yang, 2023).
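The cited ACN is not reproduced here, but the following PyTorch sketch illustrates the general pattern the paragraph describes: a small 3D convolutional backbone whose spatio-temporal feature maps are reweighted by a learned spatial attention map before classification. All layer sizes and the number of sign classes are placeholders.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """Collapse channels to a single attention map and reweight features."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x):                    # x: (B, C, T, H, W)
        attn = torch.sigmoid(self.score(x))  # (B, 1, T, H, W) in [0, 1]
        return x * attn                      # emphasize informative regions

class TinySignNet3D(nn.Module):
    """Toy 3D CNN for clip-level sign classification (placeholder sizes)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.attention = SpatialAttention3D(32)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        return self.head(self.attention(self.features(clip)))

# Toy usage: a batch of two 16-frame 64x64 RGB clips.
logits = TinySignNet3D()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)                          # torch.Size([2, 10])
```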
The team of Prof. Hongwen Cao and Prof. Hong
Li from the School of Foreign Languages, Chongqing
University has made a new progress in the neural
mechanism of Chinese sign language word
recognition. This study examined the effects of word
frequency, word length, phonological neighborhood
word size, and likelihood on vocabulary recognition
in sign language and found that these factors were
similar to findings in spoken language, suggesting
that the same neural mechanisms exist in the process
of vocabulary recognition in sign language and
spoken language. However, the significant effect of
likelihood also suggests that the lexical recognition
process is also influenced by factors related to
linguistic modality. This study enriches the
understanding of the neural mechanisms of sign
languages in China, contributes to the further
understanding of the nature of natural language, and
provides important information about the
characteristics and patterns of lexical processing in
Chinese sign languages for both educators and
learners of sign languages. Recent advances in sign
language recognition technology include methods
combining sequence annotation and deep learning,
sign language recognition and translation techniques based on area-aware time-sequence maps, continuous sign
language recognition algorithms based on attentional
mechanisms, and research on neural mechanisms
(Zhang, 2023).
3.2 Discussions on Image-Based Sign
Language Recognition
Improving the robustness of sign language
recognition systems in complex environments is
crucial for accurate interpretation. One approach to
enhance robustness is through multimodal data fusion,
which leverages a combination of multiple sensors
and data sources. For instance, the integration of
CNNs with inertial measurement units and
stretchable strain sensors can more precisely perceive
hand poses and motion trajectories. Utilizing a variety
of multimodal data, including video feeds, keypoints,
and optical flow, allows for the training of a unified
visual backbone that significantly boosts recognition
performance.
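A minimal sketch of the multimodal late-fusion idea described above follows, assuming hypothetical per-modality encoders and feature sizes; it simply concatenates video, IMU, and strain-sensor embeddings before classification, which is one common fusion design but not the specific architecture of any cited system.

```python
import torch
import torch.nn as nn

class LateFusionSignNet(nn.Module):
    """Toy late fusion: encode each modality, concatenate, classify.

    All encoder shapes are placeholders; a real system would use, e.g.,
    a CNN video backbone and temporal models for the sensor streams.
    """
    def __init__(self, video_dim=512, imu_dim=6, strain_dim=10, num_classes=50):
        super().__init__()
        self.video_enc = nn.Linear(video_dim, 128)   # stand-in for a CNN backbone
        self.imu_enc = nn.Linear(imu_dim, 32)        # accel + gyro channels
        self.strain_enc = nn.Linear(strain_dim, 32)  # stretchable strain sensors
        self.classifier = nn.Linear(128 + 32 + 32, num_classes)

    def forward(self, video_feat, imu_feat, strain_feat):
        fused = torch.cat([
            torch.relu(self.video_enc(video_feat)),
            torch.relu(self.imu_enc(imu_feat)),
            torch.relu(self.strain_enc(strain_feat)),
        ], dim=-1)
        return self.classifier(fused)

# Toy usage with one pre-pooled feature vector per modality.
model = LateFusionSignNet()
logits = model(torch.randn(4, 512), torch.randn(4, 6), torch.randn(4, 10))
print(logits.shape)  # torch.Size([4, 50])
```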
Deep learning model optimization is another key
strategy. Advanced models such as the BLSTM
(Bidirectional Long Short-Term Memory) model
decompose consecutive sentences into word vectors,
thereby enhancing the recognition of continuous sign
language sentences. The fusion of attention mechanisms with connectionist temporal classification (CTC) methods enables the extraction and combination of short-term spatio-temporal features and hand-movement trajectory features, addressing redundant information and alignment issues in the spatio-temporal dimension.
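As a rough sketch of the BLSTM-plus-CTC recipe just described (not any cited system's code), the PyTorch fragment below runs a bidirectional LSTM over per-frame features and trains with CTC loss, which aligns unsegmented frame sequences to gloss sequences without frame-level labels. Vocabulary size, sequence lengths, and feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class BLSTMCTCRecognizer(nn.Module):
    """Toy continuous sign recognizer: per-frame features -> gloss logits."""
    def __init__(self, feat_dim=256, hidden=128, vocab=100):
        super().__init__()
        # vocab glosses + 1 reserved blank symbol (index 0) for CTC.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * hidden, vocab + 1)

    def forward(self, frames):                 # frames: (B, T, feat_dim)
        out, _ = self.lstm(frames)
        return self.proj(out).log_softmax(-1)  # (B, T, vocab+1)

model = BLSTMCTCRecognizer()
ctc = nn.CTCLoss(blank=0)

frames = torch.randn(2, 40, 256)               # 2 clips, 40 frames each
log_probs = model(frames).permute(1, 0, 2)     # CTCLoss expects (T, B, C)
targets = torch.randint(1, 101, (2, 8))        # 8 glosses per clip
loss = ctc(log_probs,
           targets,
           input_lengths=torch.full((2,), 40),
           target_lengths=torch.full((2,), 8))
loss.backward()
print(float(loss))
```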
To tackle the challenge of recognizing sign language from unseen signers (signer-independent recognition), data augmentation and diversification are essential.
Expanding the training dataset to include a broader
range of signers improves the system's ability to
generalize. Techniques like image generation for data
augmentation can further strengthen the model's
robustness, ensuring high accuracy in real-time
recognition scenarios.
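For illustration, a plausible frame-level augmentation pipeline is sketched below using torchvision; the specific transforms and parameters are assumptions, not those of any cited study. Horizontal flipping is deliberately omitted, since mirroring swaps the dominant and non-dominant hands and can change a sign's meaning.

```python
import torch
from PIL import Image
from torchvision import transforms

# Mild geometric and photometric jitter simulates different signers,
# camera placements, and lighting without destroying hand shape.
frame_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

# Toy usage on a stand-in video frame.
frame = Image.new("RGB", (320, 240))
augmented = frame_augment(frame)
print(augmented.shape)  # torch.Size([3, 224, 224])
```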
Lastly, the introduction of prior knowledge, including motor and linguistic priors, into the causal temporal recognition framework is beneficial. This
incorporation refines the robustness of feature
extraction by providing a deeper understanding of the
contextual semantics and the nuances of sign
language gestures. By integrating these strategies,
sign language recognition systems can be made more
resilient and effective in complex and varied
environments.
4 CONCLUSIONS
A significant application of computer vision and
machine learning technologies in the field of
accessible communication is the recognition of sign
language using an image-based system. The
performance of the sign language recognition system
will continue to improve with ongoing technological
advancements, providing the hearing impaired with a
more convenient and effective means of
communication. In the future, continued in-depth research is needed to solve the remaining problems and advance sign language recognition technology. The technology has broad application prospects: it can not only give hearing-impaired people more opportunities to communicate with others without barriers, but can also be applied in education to help teachers and students better understand sign language and improve teaching outcomes. In addition, sign language image recognition also has potential application value in the fields of intelligent transportation, remote control, and virtual reality.
REFERENCES
Amin, M. S., Rizvi, S. T. H., & Hossain, M. M. 2022. A
comparative review on applications of different sensors
for sign language recognition. Journal of Imaging, 8(4),
98.
Cheok, M. J., Omar, Z., & Jaward, M. H. 2019. A review
of hand gesture and sign language recognition
techniques. International Journal of Machine Learning
and Cybernetics, 10, 131-153.
Jones, G. A., Ni, D., & Wang, W. 2021. Nothing about us
without us: Deaf education and sign language access in
China. Deafness & Education International, 23(3),
179-200.
Subburaj, S., & Murugavalli, S. 2022. Survey on sign
language recognition in context of vision-based and
deep learning. Measurement: Sensors, 23, 100385.
Wadhawan, A., & Kumar, P. 2021. Sign language
recognition systems: A decade systematic literature
review. Archives of Computational Methods in Engineering, 28, 785-813.
Wiley, V., & Lucas, T. 2018. Computer vision and image
processing: a paper review. International Journal of
Artificial Intelligence Research, 2(1), 29-36.
Yang, G., Ding, X., Gao, Y., et al. 2023. Continuous Sign
Language Recognition with Complex Backgrounds
Based on Attention Mechanism. Journal of Wuhan
University (Natural Science Edition), 69(1), 97-105.
Yin, Y. 2017. Research on Behavior Perception Recognition Technology and System Based on Mobile Devices. Doctoral dissertation, Nanjing University.
Zhang, Y., Yuan, K., & Yang, X. 2001. A Novel Data Glove for Sign Language Recognition. Journal of University of Science and Technology Beijing, 23(4), 379-381.
Zhang, X., Cao, H., & Li, H. 2023. Neurophysiological
effects of frequency, length, phonological
neighborhood density, and iconicity on sign
recognition. NeuroReport, 34(17), 817-824.