Enhancing Autonomous Vehicle Navigation: Traffic Police Hand
Gesture Recognition for Self‑Driving Cars in India Using MoveNet
Thunder
Ippili Rahul and Saravanan Santhanam
Department of Computing Technologies, SRM Institute of Science and Technology, Kattankulathur, Chennai, Tamil Nadu,
India
Keywords: Traffic Gesture Recognition, Autonomous Vehicles, MoveNet Thunder, TensorFlow, Real‑Time Detection,
Carla Simulator, Sensor Fusion, Indian Traffic.
Abstract: The Traffic Police Hand Gesture Recognition System for autonomous vehicles is built on TensorFlow's
MoveNet Thunder model. The system detects hand signals such as 'Stop', 'Turn Left', and 'Move Forward',
which correspond to standard traffic police actions in India. Our custom dataset comprises eight thousand
gesture images recorded under diverse conditions. An architecture of dense and dropout layers delivers
accurate classification while protecting the network from overfitting. Haar cascade face detection identifies
the officer live before recording of gestures within the camera's field of view begins. The system achieved
an 89% success rate, with minor confusions between similar gestures. It was evaluated in the Carla simulator
under two conditions: with environmental effects enabled and without. The prototype proves to be an effective
component, combining safety with operational efficiency for autonomous vehicle navigation in controlled
traffic environments.
1 INTRODUCTION
Autonomous vehicles (AVs) are transforming
transportation, yet their reliance on traffic signals,
GPS, and on-board sensors often fails in human-
controlled traffic environments. In countries like
India, where traffic police hand gestures play an
important role in directing traffic, AVs must interpret
these signals correctly to operate safely and
efficiently.
Gesture recognition has recently improved greatly
thanks to advances in deep learning, pose estimation,
and sensor fusion. MoveNet Thunder, for example,
enables real-time pose detection, and fusing LiDAR
with vision data increases recognition accuracy.
Nevertheless, occlusions, lighting variation, and
gesture misclassification remain challenging.
In this research, we develop a real-time, accurate
traffic police hand gesture recognition system for
autonomous vehicles using MoveNet Thunder, deep
learning-based classification, and sensor fusion. By
integrating these technologies, AVs can detect
gestures such as "Stop," "Turn Left," and "Move
Forward" safely in live traffic environments. The
system's performance is validated in the Carla
simulator under various traffic and weather
conditions, making it a robust candidate for real-
world deployment.
2 LITERATURE SURVEY
Traffic police hand gesture recognition is central to
autonomous vehicle navigation in human-regulated
traffic. Edge computing (Ekatpure, 2023) improves
processing speed and decreases latency, while
perceptual enhancements (Ding et al., 2021) restore
vision in poor conditions. Localization techniques
based on SLAM, GNSS, and LiDAR offer additional
accuracy (Lu et al., 2021; Chiang et al., 2020; de
Miguel et al., 2020).
Among deep learning methods, CNNs,
reinforcement learning (Ibrahim et al., 2020; Xu et
al., 2024), and multi-sensor fusion (Tang et al., 2022;
Li et al., 2020) are employed to enhance gesture
recognition. AI-driven navigation first developed for
small UAS has been transferred to self-driving
vehicles (Bijjahalli et al., 2020). Advances in
recognition include OpenPose (Fu, 2022),
convolutional networks (Wiederer et al., 2020),
RGB-D Faster R-CNN (Wang & Ma, 2018), CNN-
RNN hybrids (Baek & Lee, 2022), and SlowFast
networks for motion detection (Zhu & Fang, 2024).
MoveNet has outperformed alternative pose
estimators and is therefore central to this work.
Action classification on top of pose estimation is
improved by neural networks (Sabaichai et al.,
2023), spatiotemporal integration improves
recognition (Anju et al., 2024), and comparative
evaluations of OpenPose, MoveNet, and PoseNet
guide model selection (Kaushik et al., 2023). In
sensor fusion, gesture tracking is supported by
combining LiDAR and vision data. Comparative
deep learning evaluations also examine
misclassification problems and ways of solving
them.
The proposed system uses MoveNet together with
deep learning classification and sensor inputs to
recognize traffic police hand signals dependably, as
required for navigational safety. The literature points
to a need for further work on spatial-temporal
models, real-time inference speed, and traffic-
environment-specific training datasets.
3 PROPOSED METHODOLOGY
This section outlines the methodology for
implementing Traffic Police Hand Gesture
Recognition for Self-Driving Cars in India using
MoveNet Thunder, detailing each step from data
collection to deployment.
3.1 Data Collection and Preprocessing
3.1.1 Dataset Creation
The system is built on a custom dataset of 8,000
images representing various hand gestures used by
traffic police in different environments and weather
conditions. These gestures include commands like
"Stop," "Turn Left," "Turn Right," and "Move
Forward." This dataset ensures that the model is
equipped to handle a range of real-world scenarios.
Each class is stored in subfolders, such as Pose1,
Pose2, etc., ensuring an organized structure for
training and testing.
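Since the paper does not list its loading code, the following is a minimal sketch of reading such a folder-per-class layout; the dataset root name and .jpg extension are assumptions for illustration.

```python
from pathlib import Path

# Hypothetical layout: dataset/Pose1/*.jpg, dataset/Pose2/*.jpg, ..., dataset/NoPose/*.jpg
DATA_DIR = Path("dataset")
class_names = sorted(p.name for p in DATA_DIR.iterdir() if p.is_dir())
samples = [(img_path, label)
           for label in class_names
           for img_path in sorted((DATA_DIR / label).glob("*.jpg"))]
print(f"{len(samples)} images across {len(class_names)} classes")
```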
3.1.2 Preprocessing and Augmentation
The images are preprocessed using OpenCV: they
are converted to grayscale for Haar cascade-based
face detection and resized for consistency across
samples. Using MediaPipe Pose, key body landmarks
are extracted, such as elbow and shoulder angles,
which are critical for gesture recognition. Data
augmentation through flipping, scaling, and rotation
increases dataset diversity and, in turn, the model's
ability to generalize.
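A minimal sketch of the landmark-angle extraction described above, using the MediaPipe Pose API; the choice of left-arm joints is illustrative rather than the paper's exact feature set.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    a, b, c = np.array(a), np.array(b), np.array(c)
    rad = np.arctan2(c[1] - b[1], c[0] - b[0]) - np.arctan2(a[1] - b[1], a[0] - b[0])
    deg = abs(np.degrees(rad))
    return 360.0 - deg if deg > 180.0 else deg

def extract_arm_angles(image_bgr):
    """Return (left elbow angle, left shoulder angle) or None if no pose is found."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        result = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        return None
    lm = result.pose_landmarks.landmark
    P = mp_pose.PoseLandmark
    pt = lambda p: (lm[p].x, lm[p].y)
    elbow = joint_angle(pt(P.LEFT_SHOULDER), pt(P.LEFT_ELBOW), pt(P.LEFT_WRIST))
    shoulder = joint_angle(pt(P.LEFT_HIP), pt(P.LEFT_SHOULDER), pt(P.LEFT_ELBOW))
    return elbow, shoulder
```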
3.1.3 Data Splitting
To evaluate the model thoroughly, the dataset is split
into training (80%), validation (15%), and testing
(5%) sets. This ensures that the model is trained
effectively and validated on unseen data before being
tested in real-world scenarios.
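One way to realize the 80/15/5 split with scikit-learn; X and y stand for the extracted features and labels, and the stratification is an assumption rather than something the paper states.

```python
from sklearn.model_selection import train_test_split

# First carve off 80% for training, then split the remaining 20% into
# validation (15% overall) and test (5% overall), i.e. 25% of the remainder.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)
```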
3.2 Model Development
3.2.1 Model Selection: MoveNet Thunder
The MoveNet Thunder model by TensorFlow is used
for real-time pose detection. This model can
accurately identify key body landmarks, making it
ideal for detecting traffic gestures. The detect()
function in the code runs multiple inference passes on
the same frame to improve detection accuracy. This
ensures that even in suboptimal lighting conditions or
partial occlusion, the gesture is correctly recognized.
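A minimal sketch of loading MoveNet Thunder from TensorFlow Hub and running one inference pass. The paper's detect() reportedly repeats such passes; since the graph is deterministic, repeated passes help only if the input crop is refined between them, which is omitted here.

```python
import tensorflow as tf
import tensorflow_hub as hub

# MoveNet Thunder, single-person pose estimation (TF Hub model URL)
movenet = hub.load("https://tfhub.dev/google/movenet/singlepose/thunder/4")
infer = movenet.signatures["serving_default"]

def detect(frame_rgb):
    """One pass: returns a [1, 1, 17, 3] array of (y, x, score) keypoints."""
    img = tf.image.resize_with_pad(tf.expand_dims(frame_rgb, axis=0), 256, 256)
    img = tf.cast(img, dtype=tf.int32)  # Thunder expects a 256x256 int32 tensor
    return infer(img)["output_0"].numpy()
```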
3.2.2 Neural Network Architecture for
Gesture Classification
The extracted pose landmarks are passed to a custom
neural network built using Keras and TensorFlow.
The architecture consists of:
- Dense layers, which learn complex patterns within the pose landmarks during training.
- Dropout layers, which deactivate random neurons during training and thereby reduce overfitting.
- A softmax output layer, which performs multi-class classification, distinguishing gestures such as "Stop" and "Turn Left".
This architecture is designed to handle noisy data
while maintaining high accuracy.
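A sketch of such a classifier in Keras, assuming the flattened 17-keypoint MoveNet output as input; the layer widths and dropout rates are illustrative, not the paper's exact values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_KEYPOINTS = 17   # MoveNet returns 17 landmarks, each (y, x, score)
NUM_CLASSES = 8      # Pose1..Pose7 plus NoPose

model = models.Sequential([
    layers.Input(shape=(NUM_KEYPOINTS * 3,)),         # flattened landmark vector
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                               # silence random units against overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),   # multi-class gesture probabilities
])
```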
3.3 Sensor Fusion and Real-Time
Inference
3.3.1 Camera and LiDAR Integration
The system integrates camera inputs with LiDAR
point clouds to enhance recognition accuracy.
Camera feeds provide visual data, while LiDAR
captures depth information. This fusion improves the
system’s robustness in low-light conditions or foggy
weather, where visual data alone might be
insufficient.
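The paper does not detail the fusion step; one simple scheme consistent with the description is to attach to each 2D keypoint the depth of the nearest projected LiDAR return. Projection of the point cloud through the camera intrinsics is assumed to happen upstream.

```python
import numpy as np

def depth_for_keypoints(keypoints_px, lidar_points_img, lidar_depths):
    """keypoints_px: (17, 2) pixel coordinates from the pose model.
    lidar_points_img: (N, 2) LiDAR points already projected into the image plane.
    lidar_depths: (N,) range of each projected point in metres.
    Returns one depth per keypoint, taken from the nearest projected return."""
    fused = []
    for kp in keypoints_px:
        d2 = np.sum((lidar_points_img - kp) ** 2, axis=1)
        fused.append(lidar_depths[np.argmin(d2)])
    return np.array(fused)
```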
3.3.2 Real-Time Detection and Face
Validation
Using OpenCV, frames are captured continuously
from the video feed, and faces are detected using Haar
cascades to validate whether the traffic officer is
facing the camera. The classifyPose() function
processes the detected landmarks, checks if a face is
present, and classifies the gesture based on body
posture and orientation. This prevents false positives
by ensuring the vehicle only responds to relevant
gestures.
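A minimal sketch of the face-validation gate using OpenCV's bundled Haar cascade; officer_facing_camera() is a hypothetical name standing in for the check performed inside classifyPose().

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def officer_facing_camera(frame):
    """True only if a frontal face is visible, gating gesture classification."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if officer_facing_camera(frame):
        pass  # hand the frame to pose extraction and gesture classification
```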
3.4 Model Training and Evaluation
3.4.1 Training Process
Training uses categorical cross-entropy loss with the
Adam optimizer. Key callbacks include
ModelCheckpoint, which saves the best model
weights, and EarlyStopping, which halts training
once validation accuracy stops improving. The
train_test_split() function partitions the dataset for
robust validation.
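A training sketch matching this description, continuing from the model and split above; the patience and batch size are assumptions, while the 130 epochs match the results section.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    ModelCheckpoint("best_model.keras", monitor="val_accuracy", save_best_only=True),
    EarlyStopping(monitor="val_accuracy", patience=10, restore_best_weights=True),
]

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=130, batch_size=32,
                    callbacks=callbacks)
```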
3.4.2 Confusion Matrix and Classification
Report
Model performance is assessed with the evaluate()
function, which produces both a confusion matrix
and a classification report. The report provides
precision, recall, and F1-score metrics, while the
confusion matrix exposes individual classification
errors. The system achieves 89% gesture
classification accuracy, though detection of the
"Pose6" and "Pose7" motions needs particular
attention.
3.5 Simulation and Testing Using Carla
3.5.1 Setup and Scenario Testing
Testing takes place in the Carla simulator under
multiple conditions, including dark environments,
fog, and dense traffic. Simulation confirms the
model's stability when dealing with realistic
predicaments. The Carla environment lets the team
observe how autonomous vehicles respond to
different gestures while evaluating the system's
ability to track human commands.
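A sketch of the two test configurations using the CARLA Python API; the specific fog and sun values are illustrative.

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Scenario with environmental conditions enabled: foggy night.
foggy_night = carla.WeatherParameters(
    cloudiness=80.0, precipitation=0.0, fog_density=60.0,
    fog_distance=10.0, sun_altitude_angle=-30.0)
world.set_weather(foggy_night)

# Baseline scenario: clear daytime defaults.
world.set_weather(carla.WeatherParameters.ClearNoon)
```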
3.5.2 Decision-Making Logic
Based on the detected gesture, the system sends
control commands to the vehicle. For example, upon
detecting the "Stop" gesture, the vehicle halts, and
when a "Turn Left" gesture is detected, the car takes
the left turn. The decision logic is implemented to
ensure safe and responsive navigation.
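A hypothetical label-to-action mapping illustrating this logic with CARLA vehicle controls; the paper does not publish its exact table, so the actions and control values below are assumptions.

```python
import carla  # reuses the client/world from the previous sketch

GESTURE_ACTIONS = {
    "Pose1": "stop",      # stop vehicles from left and right
    "Pose3": "stop",      # stop vehicles from behind
    "Pose4": "go_left",   # start vehicles from the left
    "Pose5": "go_right",  # start vehicles from the right
}

def apply_gesture(vehicle, label):
    action = GESTURE_ACTIONS.get(label)
    if action == "stop":
        vehicle.apply_control(carla.VehicleControl(throttle=0.0, brake=1.0))
    elif action == "go_left":
        vehicle.apply_control(carla.VehicleControl(throttle=0.4, steer=-0.5))
    elif action == "go_right":
        vehicle.apply_control(carla.VehicleControl(throttle=0.4, steer=0.5))
    # unknown labels (e.g. NoPose) leave the current control untouched
```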
3.6 Performance Optimization
3.6.1 Asynchronous Processing and Multi-
Threading
The system's performance is enhanced through
multi-threading, processing video frames and
running gesture detection in parallel. This keeps real-
time operation smooth with minimal latency, and an
FPS counter verifies that the system maintains
acceptable throughput for real-time driving.
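A minimal producer-consumer sketch of this arrangement: one thread captures frames while the main thread runs detection, and a two-slot queue bounds latency by dropping stale frames.

```python
import queue
import threading

import cv2

frames = queue.Queue(maxsize=2)   # small buffer: old frames are dropped, not queued

def capture_loop(cap):
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frames.full():
            try:
                frames.get_nowait()   # discard the stale frame
            except queue.Empty:
                pass
        frames.put(frame)

cap = cv2.VideoCapture(0)
threading.Thread(target=capture_loop, args=(cap,), daemon=True).start()

while True:
    frame = frames.get()
    # run pose detection + gesture classification here; capture continues in parallel
```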
3.6.2 Model Quantization and Pruning
The model is optimized for edge-device deployment
using quantization and pruning. These techniques
shrink the model and speed up inference while
preserving accuracy.
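Post-training quantization with the TFLite converter is one standard way to realize this; pruning would follow a similar pattern via the tensorflow_model_optimization toolkit.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # post-training quantization
tflite_model = converter.convert()

with open("gesture_classifier.tflite", "wb") as f:
    f.write(tflite_model)
```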
3.7 Deployment and Integration
3.7.1 Integration with Vehicle Systems
The gesture recognition system is integrated with the
vehicle's control unit via communication protocols
like MQTT or ROS. This allows the vehicle to
receive commands based on the recognized gestures
and take appropriate actions.
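A sketch of publishing recognized gestures over MQTT with paho-mqtt (1.x-style client); the broker address, topic name, and payload schema are assumptions.

```python
import json

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883)   # broker address/port are assumptions
client.loop_start()

def publish_gesture(label, confidence):
    """Send the recognized gesture to the vehicle control unit over MQTT."""
    payload = json.dumps({"gesture": label, "confidence": float(confidence)})
    client.publish("vehicle/gesture", payload, qos=1)   # topic name assumed
```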
3.7.2 Real-Time Monitoring with Streamlit
The Streamlit interface provides live feedback on the
recognized gestures and FPS, ensuring the system
performs reliably in real-world scenarios. This also
helps in debugging and monitoring the vehicle’s
behavior during field tests.
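A minimal Streamlit monitor consistent with this description; the classifier call is stubbed with placeholder values.

```python
import cv2
import streamlit as st

st.title("Traffic Gesture Monitor")
gesture_slot = st.empty()
frame_slot = st.empty()

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    label, fps = "Pose1", 24.0   # placeholders for the classifier output and FPS counter
    gesture_slot.metric("Detected gesture", label, f"{fps:.1f} FPS")
    frame_slot.image(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), channels="RGB")
```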
3.8 Future Enhancements
3.8.1 Dataset Expansion and Fine-Tuning
The dataset will be expanded to include more gesture
variations and challenging environmental conditions.
Fine-tuning the model on this enhanced dataset will
further improve accuracy and generalization.
3.8.2 Advanced Sensor Fusion
The system can be improved by exploring advanced
fusion techniques that integrate GPS, LiDAR, and
camera data, enhancing its ability to make real-time
decisions even in complex environments.
3.8.3 Field Testing and Deployment
The final step will involve field testing the system on
autonomous vehicles to validate its performance
outside simulated environments. Insights from these
tests will be used to further refine the model and
ensure seamless deployment.
4 ARCHITECTURE DIAGRAMS
Figure 1 shows the vehicle control system workflow.
Figure 1: Autonomous Vehicle Control System Workflow.
5 RESULTS AND DISCUSSION
The Traffic Police Hand Gesture Recognition Model
was evaluated using both training and validation
datasets over 130 epochs to monitor the system’s
performance. Below are the key observations from
the accuracy graph and the confusion matrix
generated from the final evaluation.
5.1 Model Accuracy Analysis
Figure 2: Model Accuracy Graph.
Figure 2 displays the training and validation
accuracy learning curves.
5.1.1 Training Accuracy
The training curve demonstrates a steady
improvement throughout the 130 epochs, reaching
approximately 87% accuracy by the end of training.
The curve follows a smooth upward trend, indicating
that the model successfully learns patterns from the
dataset without overfitting.
5.1.2 Validation Accuracy
Validation accuracy exceeds training accuracy from
an early stage and stabilizes at about 90% by the end
of training, indicating that the model generalizes
well to unseen data. The small gap between training
and validation accuracy shows that the dropout
layers and regularization techniques successfully
prevent overfitting.
The final accuracy values are:
- Training Accuracy: ~87%
- Validation Accuracy: ~90%
5.2 Confusion Matrix Analysis
Figure 3: Confusion Matrix.
The confusion matrix (see Figure 3) provides insight
into the model's performance across the eight gesture
classes, with the following key findings:
5.2.1 High Performance on Core Gestures
- Pose1 (Stop vehicles from left and right): 222 correctly predicted out of 238 samples, with minimal misclassification (only 10 samples predicted as Pose7). Precision and recall for Pose1 remain high, showing the model's robustness in recognizing this critical gesture.
- Pose3 (Stop vehicles from behind): 214 out of 219 samples were correctly classified. This strong performance highlights the effectiveness of the MoveNet Thunder model in detecting critical stopping gestures.
5.2.2 Challenges in Specific Classes
- Pose6 (Start vehicles on T-point): Some confusion is observed between Pose6 and Pose7, with 30 samples misclassified. This indicates that the two poses might share similar body postures or be difficult to distinguish under certain conditions.
- NoPose (Unknown Pose): Out of 52 samples, 16 were misclassified as Pose2. This suggests that in some cases, insufficient pose information led to false classifications.
5.2.3 Misclassification Patterns
Pose2 and Pose4 show occasional misclassification
into neighboring gesture classes, which could be
attributed to visual similarities in hand and body
movements. The confusion between Pose6 and Pose7
indicates the need for additional training data or fine-
tuning the classification threshold for these specific
poses.
6 QUANTITATIVE RESULTS
Table 1: Pose Classification Confusion Summary.
Class                          Total Samples   Correctly Classified   Misclassified   Top Misclassification
NoPose (Unknown Pose)                52                  36                 16          Pose2
Pose1 (Stop from sides)             238                 222                 16          Pose7
Pose2 (Stop from front)             221                 193                 28          Pose1
Pose3 (Stop from behind)            219                 214                  5          Pose2
Pose4 (Start from left)             216                 209                  7          Pose5
Pose5 (Start from right)            196                 172                 24          Pose4
Pose6 (Start on T-point)            158                 117                 41          Pose7
Pose7 (Stop from front/back)        228                 203                 25          Pose6
7 DISCUSSION AND INSIGHTS
The results indicate that the MoveNet Thunder-based
gesture recognition system performs exceptionally
well across most gesture classes. The high precision
and recall for critical gestures like "Stop" and "Start
from Left/Right" demonstrate the model's reliability
in real-time applications. However, misclassification
between Pose6 and Pose7 suggests a need for:
- Additional training data for these specific classes to help the model differentiate between subtle variations.
- Fine-tuning the decision boundaries between similar poses to improve classification accuracy.
The face detection mechanism using Haar cascades
ensures that the system only responds when a valid
human gesture is detected, minimizing false
positives. However, additional sensor fusion (LiDAR
and GPS integration) could further enhance
performance in challenging environments, such as
low-light conditions or foggy weather.
Overall, the model achieves 89% accuracy,
showing strong potential for real-world deployment
in self-driving cars. With further refinement,
particularly in handling similar poses (Pose6 and
Pose7), the system can achieve even higher
reliability. Testing in the Carla simulator ensures that
the system is well prepared for diverse traffic
scenarios, making it suitable for integration into
autonomous vehicle control systems under Indian
traffic conditions.
8 CONCLUSIONS
The Traffic Police Hand Gesture Recognition System
developed using TensorFlow’s MoveNet Thunder
model demonstrates strong potential for real-world
deployment in autonomous vehicles by accurately
interpreting complex human gestures relevant to
Indian traffic scenarios. With an overall accuracy of
89%, the system performs reliably across core
gestures, ensuring safe and efficient vehicle
navigation. However, minor misclassifications
between similar poses, such as Pose6 and Pose7,
highlight the need for further dataset expansion and
fine-tuning. The use of Carla simulator for testing
ensures robust performance under diverse
environmental conditions, making this system well-
prepared for seamless integration into autonomous
vehicle control systems, enhancing safety in human-
managed traffic environments.
REFERENCES
Anju, O. A., Saranyah, V., Snehavarshini, S.,
Bhuvaneshwari, C., & Soundharya, M. (2024, June).
The detection and classification of human poses by
Movenet depends on spatiotemporal data configuration.
In 2024 15th International Conference on Computing
Communication and Networking Technologies
(ICCCNT) (pp. 1-6). IEEE.
Baek, T., & Lee, Y. G. (2022). Traffic control hand signals
can be recognized by convolution neural networks
together with recurrent neural networks. Journal of
Computational Design and Engineering, 9(2), 296-309.
Bijjahalli, S., Sabatini, R., & Gardi, A. 2020. "Advances in
Intelligent and Autonomous Navigation Systems for
Small UAS." Progress in Aerospace Sciences 115:
100617.
Chiang, K. W., Tsai, G. J., Chu, H. J., & El-Sheimy, N.
(2020). Strategies for improved INS/GNSS/
Refreshed-SLAM integration for accurate lane-level
vehicle navigation. IEEE Transactions on Vehicular
Technology, 69(3), 2463-2476.
de Miguel, M. Á., García, F., & Armingol, J. M. 2020.
"Improved LiDAR Probabilistic Localization for
Autonomous Vehicles Using GNSS." Sensors 20(11):
3145.
Ekatpure, R. (2023). Edge computing for autonomous
vehicles: technical frameworks, data handling
methods, and performance measurements. Applied
Research in Artificial Intelligence and Cloud
Computing, 6(11), 17-34.
Fu, M. (2022, May). An OpenPose-based method for
detecting traffic officers' hand signals. In 2022
International Conference on Urban Planning and
Regional Economy (UPRE 2022) (pp. 38-42).
Atlantis Press.
Ibrahim, H. A., Azar, A. T., Ibrahim, Z. F., & Ammar, H.
H. 2020. "A Hybrid Deep Learning-Based Autonomous
Vehicle Navigation and Obstacles Avoidance." In
Proceedings of the International Conference on
Artificial Intelligence and Computer Vision
(AICV2020), 296-307. Springer International
Publishing.
Kaushik, P., Lohani, B. P., Thakur, A., Gupta, A., Khan, A.
K., & Kumar, A. (2023, September). Body Posture
Detection and Comparison Between OpenPose,
MoveNet and PoseNet. In 2023 6th International
Conference on Contemporary Computing and
Informatics (IC3I) (Vol. 6, pp. 234-238). IEEE.
Li, Q., Queralta, J. P., Gia, T. N., Zou, Z., & Westerlund, T.
2020. "Multi-Sensor Fusion for Navigation and
Mapping in Autonomous Vehicles: Accurate
Localization in Urban Environments." Unmanned
Systems 8(3): 229-237.
Lu, Y., Ma, H., Smart, E., & Yu, H. 2021. "Real-Time
Performance-Focused Localization Techniques for
Autonomous Vehicle: A Review." IEEE Transactions
on Intelligent Transportation Systems 23(7): 6082-
6100.
Sabaichai, T., Tancharoen, D., & Limpasuthum, P. (2023,
November). Human Action Classification Based on
Pose Estimation and Artificial Neural Network. In 2023
7th International Conference on Information
Technology (InCIT) (pp. 181-185). IEEE.
Tang, Y., Zhao, C., Wang, J., Zhang, C., Sun, Q., Zheng,
W. X., ... & Kurths, J. 2022. "Perception and
Navigation in Autonomous Systems in the Era of
Learning: A Survey." IEEE Transactions on Neural
Networks and Learning Systems 34(12): 9604-9624.
Ding, F., Yu, K., Gu, Z., Li, X., & Shi, Y. (2021).
"Perceptual Enhancement for Autonomous Vehicles:
Restoring Visually Degraded Images for Context
Prediction via Adversarial Training." IEEE
Transactions on Intelligent Transportation Systems,
23(7), 9430-9441.
Wang, G., & Ma, X. (2018, October). A traffic management
identification system uses an RGB-D sensor along with
faster R-CNN model. In 2018 International Conference
on Intelligent Informatics and Biomedical Sciences
(ICIIBMS) (Vol. 3, pp. 78-81). IEEE.
Wiederer, J., Bouazizi, A., Kressel, U., & Belagiannis, V.
(2020, October). Traffic control gesture recognition for
autonomous vehicles. In 2020 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS)
(pp. 10676-10683). IEEE.
Xu, L., Liu, J., Zhao, H., Zheng, T., Jiang, T., & Liu, L.
2024. "Autonomous Navigation of Unmanned Vehicle
Through Deep Reinforcement Learning." arXiv
preprint arXiv:2407.18962.
Zhu, X., & Fang, M. (2024, October). Zhu and Fang (2024)
presented SlowFast network as a method for detecting
traffic police gestures in their research. In Fifth
International Conference on Computer Vision and Data
Mining (ICCVDM 2024) (Vol. 13272, pp. 744-750).
SPIE.