
hand, leveraging advancements in deep learning,
embedded systems, and convolutional neural
networks (CNNs).
2. The proposed system integrates a camera module
for real-time object detection, classification,
and grip formation, ensuring affordability and
reduced maintenance.
3. By enabling precise and responsive control
through a computer vision-based approach, this
solution enhances user interaction, functionality,
and mobility, addressing key limitations of tradi-
tional prosthetics.
2 RELATED WORK
Many studies have investigated the development of prosthetic hands with artificial intelligence, using different technologies and approaches to extend their functionality. These works explore various ways to integrate recent AI methods so that prosthetic hands become more controllable, adaptable, and functional overall, contributing significantly to the field's progress.
In the research by (Ujjwal Sharma and Singh, 2023), developments in object detection are surveyed, with YOLO as the main focus. Faster R-CNN with a ResNet backbone reached 77.4% mAP at 6 FPS, while YOLO v2 at a 416x416 input resolution achieved comparable accuracy (77.2% mAP) at 68 FPS. At a 480x480 resolution, YOLO v2 improved to 78.4% mAP at 60 FPS. YOLO v3, built on the DarkNet-53 backbone, matched ResNet-50 in accuracy while running faster. Dataset specifics are not provided.
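To make this detection step concrete, the sketch below runs a pretrained Darknet YOLO model through OpenCV's DNN module at the 416x416 resolution discussed above. The configuration and weight file names, the input image, and the 0.5 confidence threshold are illustrative assumptions, not details taken from the cited work.

```python
# Minimal sketch: YOLO-style inference via OpenCV's DNN module.
# "yolov3.cfg", "yolov3.weights", and "object.jpg" are placeholder paths.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
output_layers = net.getUnconnectedOutLayersNames()

image = cv2.imread("object.jpg")
h, w = image.shape[:2]

# 416x416 input resolution, matching the YOLO v2 setting mentioned above.
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)

# Each detection row is [cx, cy, bw, bh, objectness, class scores...], normalized to [0, 1].
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:  # assumed threshold
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            print(f"class {class_id}: conf {confidence:.2f}, box center ({cx:.0f}, {cy:.0f})")
```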
(Xia Zhao and Parmar, 2024) discussed the major contributions of Convolutional Neural Networks (CNNs) to computer vision tasks such as image classification, object detection, and video prediction, where CNNs surpass traditional methods in accuracy. The open challenges in this field involve the need for large training datasets, model complexity, and high computational cost; future research is expected to focus on optimizing architectures and reducing the dependency on labeled data to improve performance.
(Ross Girshick and Malik, 2014) introduced a straightforward and scalable object detection method built on CNNs, evaluated on the PASCAL VOC 2011 and 2012 datasets. The model improved mAP by more than 30% over previous results and achieved 47.9% accuracy on the segmentation task. This approach, which combines region proposals with CNN features, offers a simpler alternative to more complicated ensemble systems.
(Chunyuan Shi and Liu, 2020) explored CNNs
for recognizing grasp patterns in prosthetic hands,
reporting mono-modal accuracies of 80% for RGB,
85.4% for grayscale, and 89.8% for depth images.
The fusion of grayscale and depth data increased the
recognition rate to 94.6%. Additionally, Vision-EMG
achieved a 50% reduction in grasp-and-pick-up time
compared to Coded-EMG, highlighting superior performance.
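As an illustration of how grayscale and depth inputs can be combined, the following is a minimal two-stream fusion sketch in PyTorch; the layer sizes, the 64x64 input resolution, and the three grip classes are assumptions for illustration and do not reproduce the architecture of the cited study.

```python
# Minimal sketch: late fusion of grayscale and depth streams for grasp-pattern
# classification. Network depth and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamGraspNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()

        def stream():
            # One small convolutional feature extractor per single-channel modality.
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.gray_stream = stream()
        self.depth_stream = stream()
        # Features from both modalities are concatenated before classification.
        self.classifier = nn.Linear(32 * 2, num_classes)

    def forward(self, gray, depth):
        fused = torch.cat([self.gray_stream(gray), self.depth_stream(depth)], dim=1)
        return self.classifier(fused)

# Example forward pass on dummy 64x64 grayscale and depth images.
model = TwoStreamGraspNet()
logits = model(torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))
```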
(Meena Laad and Saiyed, 2024) compared two object detection CNNs, SSD with MobileNetV1 and Faster R-CNN with InceptionV2, on a custom dataset of 444 images (355 for training, 89 for testing). SSD was faster, but Faster R-CNN, though slower, was more accurate.
(Shripad Bhatlawande and Gadgil, 2023) conducted research into robotic grasping using RGB-D data. The study used the Cornell Grasp Dataset and applied graph segmentation and morphological image processing (MIP) with a Random Forest (RF) classifier, achieving 94.26% accuracy in grasp detection and outperforming other algorithms in both speed and accuracy.
(Douglas Morrison and Leitner, 2020) introduced the EGAD dataset, a more diverse benchmark for assessing robot-arm interaction with objects, particularly for grasp-centric tasks. The GG-CNN algorithm evaluated on EGAD succeeds only 58% of the time, indicating the difficulty of estimating natural grasp depth and orientation. More complex datasets such as EGAD expose the limitations of current algorithms, offering opportunities for improvement.
(Cloutier and Yang, 2013) reviewed various prosthetic hand control techniques, focusing on anticipatory pattern recognition, fuzzy clustering, neural networks, and electroneurographic (ENG) control. The reviewed methods use EMG signals for motion classification, achieving accuracy rates between 86% and 98%, while ENG interfaces provide a more natural control path through the peripheral nervous system.
3 METHODOLOGY
3.1 Data Collection
Data collection consists of recording videos of real-world objects that the computer vision model will later interpret for grip recognition. Each 15-second clip is processed into 300 frames to ensure that the data is of good quality (a minimal sketch of this step is given below).
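The sketch assumes OpenCV and even sampling across each clip; the function name, file paths, and output layout are placeholders.

```python
# Minimal sketch: sample 300 frames from a 15-second clip (about 20 frames per second).
import os
import cv2

def extract_frames(video_path, out_dir, num_frames=300):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)  # sample evenly across the clip
    saved, index = 0, 0
    while saved < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example: one clip per grip class, e.g. extract_frames("power_grip.mp4", "frames/power_grip")
```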
These frames are then labeled with one of three grip categories: power grip, precision grip, or pinch grip.
Class A: Power Grip – Utilized mostly by objects