CNN-based dynamic gesture recognition methods by
proposing a more efficient two-dimensional CNN
approach utilizing feature fusion for improved
precision and reduced computational demands. It
employs original frames to capture spatial
characteristics and optical flow computed from
keyframes to capture temporal characteristics; these
features are then fused and recognized by the 2D
CNN. A fractional-order Horn and Schunck
method extracts high-quality optical flow, and an
improved clustering algorithm identifies keyframes,
reducing data redundancy. The proposed method
achieved accuracies of 98.6% on the Cambridge
dataset and 97.6% on the Northwestern University
dataset, outperforming alternative techniques. The model,
with only 0.44 million parameters, significantly
reduced computational complexity and training time
compared to conventional 3D CNN models. Ablation
studies confirmed the efficiency of fractional-order
optical flow and keyframe retrieval, enhancing
recognition accuracy by over 10%. The approach
demonstrates efficient gesture recognition with
few parameters and low computation time (Yu,
2022).
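The paper's improved clustering algorithm for keyframe retrieval is not detailed in this summary. As an illustration only, keyframe selection can be sketched as a plain k-means over per-frame feature vectors, keeping the frame nearest each cluster center; the function name, parameters, and clustering choice below are assumptions, not the authors' method.

```python
import numpy as np

def select_keyframes(features, k, iters=20, seed=0):
    """Pick up to k keyframes by clustering per-frame feature
    vectors (a simple k-means stand-in for the paper's improved
    clustering). features: (n_frames, d). Returns frame indices."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    centroids = features[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # update centroids as cluster means (keep old centroid if empty)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    # keyframe = the frame closest to each final centroid
    d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
    return sorted(set(d.argmin(axis=0)))
```

Only the selected keyframes would then feed the optical-flow computation, which is how the redundancy reduction described above cuts the downstream cost.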
Another representative work proposes enhancing
human-computer interaction by developing a highly
accurate, hardware-free static hand gesture
recognition system using CNNs. The method
involves preprocessing with skin segmentation and
data augmentation to enhance model accuracy. The
CNN architecture consists of seven layers, including
max-pooling and convolutional layers, followed by a
fully connected layer. Dropout is applied to prevent
overfitting. The model is trained using the cross-
entropy loss function and Adam optimizer. The
study's results demonstrated outstanding performance
of the proposed CNN model in recognizing static
hand gestures, achieving testing accuracies of 96.5%
on the NUS II dataset and 96.57% on the Marcel
dataset. Incorporating skin segmentation and data
augmentation substantially increased the model's
accuracy and reduced misclassification rates. The
experiments confirmed the effectiveness of
the CNN approach in gesture recognition tasks, even
with complex backgrounds (Eid, 2023).
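The exact seven-layer configuration is not fully specified in this summary, but the building blocks it names (convolution, max pooling, and the cross-entropy loss) can be sketched in plain NumPy. All function names and shapes here are illustrative assumptions, not the cited architecture.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2D convolution for a single channel: x (H, W), w (kH, kW)."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def maxpool2(x):
    """2x2 max pooling with stride 2 (trims odd edges)."""
    H, W = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:H, :W].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, the loss the model trains on."""
    z = logits - logits.max()          # stabilize before exponentiating
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]
```

Stacking several such conv/pool stages before a fully connected layer, with dropout between layers, yields the kind of compact classifier the summary describes.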
In summary, CNN-based gesture recognition systems
can exceed 90% accuracy, and in relatively simple
application scenarios, such as recognizing a small
set of gesture types with little variation, can
approach or exceed 99%.
However, in more complex and diverse application
scenarios, such as recognizing a large number of
gesture types with subtle differences and presented
under various lighting conditions and angles, the
accuracy may be relatively lower, but it can still
outperform traditional machine learning or image
processing methods. The field of gesture recognition
based on convolutional neural networks is currently
experiencing rapid development and continuous
innovation. As a result of ongoing technological
advancements and the expansion of application
scenarios, gesture recognition technology will
become increasingly essential in various industries.
4 GESTURE RECOGNITION
BASED ON TRANSFORMER
The Transformer model possesses significant
advantages due to its self-attention mechanism, which
allows for dynamic weighting of input data, enabling
it to focus on relevant features. This results in
efficient processing and understanding of sequences,
making it particularly adept at handling long-range
dependencies. In gesture recognition, these
capabilities translate to robust interpretation of
gesture sequences, facilitating accurate identification
even in complex environments. The model's capacity
to capture minute details and temporal dynamics
within gestures leads to enhanced recognition
accuracy and real-time performance, making it a
potent tool for gesture-based human-computer
interaction (Ahmed, 2023).
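The self-attention mechanism described above can be sketched as scaled dot-product attention over a sequence of per-frame gesture embeddings. The shapes and projection matrices below are illustrative assumptions, not the configuration of any cited model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a gesture sequence.
    X: (T, d) per-frame embeddings; Wq, Wk, Wv: (d, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])   # (T, T) pairwise relevance
    # softmax over keys: each timestep dynamically weights every other
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V                             # (T, d_k) context-mixed features
```

Because every timestep attends to every other in one step, distant frames of a long gesture interact directly, which is the long-range-dependency advantage noted above.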
Motivated by the goal of developing an efficient
and accurate hand gesture recognition framework, a
representative work utilizes high-density surface
Electromyography (EMG) signals and deep learning,
aiming to enhance prosthetic hand control and
human-machine interactions. The research presents a
Compact Transformer-based Hand Gesture Recognition
(CTHGR) framework, built on the Vision Transformer
architecture, for hand gesture classification using
high-density surface EMG (HD-sEMG) signals. The
framework employs an attention mechanism for feature
extraction and leverages both spatial and temporal
features without requiring transfer learning. It
incorporates a hybrid model that fuses macroscopic
EMG data with microscopic neural drive information
extracted via Blind Source Separation, enhancing
gesture recognition accuracy. The method is
evaluated using various window sizes and electrode
channels, demonstrating improved performance over
conventional deep learning and machine learning
models. The study's results show that the proposed
CTHGR framework achieves high accuracy in
identifying hand movements using HD-sEMG signals,
with average accuracies ranging from 86.23% to
91.98% across different electrode channels and
window widths. The framework outperforms 3D
CNN models and traditional machine learning,