Gesture Recognition for UAV-based Rescue Operation based on Deep Learning

Chang Liu¹ ᵃ and Tamás Szirányi¹,² ᵇ

¹ Department of Networked Systems and Services, Budapest University of Technology and Economics, BME Informatika épület, Magyar tudósok körútja 2, Budapest, Hungary

² Machine Perception Research Laboratory of Institute for Computer Science and Control (SZTAKI), H-1111 Budapest, Kende u. 13-17, Hungary

Keywords:

UAV Rescue, Human Gesture Recognition, UAV-human Communication, OpenPose, Neural Networks, Deep Learning.

Abstract:

UAVs play an important role in many application fields, especially in rescue. To achieve good communication between an onboard UAV system and humans, this work presents an approach that uses deep learning to accurately recognize various body gestures in the wild. The system not only recognizes human rescue gestures but also detects people, tracks them, and counts them. A dataset of ten basic rescue gestures (i.e. Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created with a UAV's camera. From the perspective of UAV rescue, feedback from the user is very important. The two most important rescue gestures are the novel dynamic gestures Attention and Cancel, which represent the set and reset functions respectively. The system shows a warning help message when the user is waving to the UAV. The user can also cancel the communication at any time by showing the drone the body rescue gesture that indicates cancellation. This work lays the groundwork for the rescue routes that the UAV will subsequently design based on user feedback. The deep learning model achieves 99.47% accuracy on the training data and 99.09% accuracy on the testing data.

1 INTRODUCTION

Gesture recognition is a popular research topic in computer vision and machine learning, and it has been widely associated with intelligent surveillance and human-computer interaction. Unmanned aerial vehicles (UAVs) are becoming increasingly popular for many commercial applications, such as photogrammetry (Gonçalves and Henriques, 2015), agriculture (Barbedo, 2019), measuring park-based physical activity (Park and Ewing, 2017), and search and rescue (Erdelj et al., 2017) (Peschel and Murphy, 2013). With the development of computer vision and drone technology, a growing number of researchers have produced significant results at the intersection of these two areas, such as UAV hand gesture control (Ma et al., 2017)(Li and Christensen, ), UAV-based pedestrian detection (De Smedt et al., 2015), and UAV gesture recognition (Perera et al., 2018)(Hu and Wang, 2018).

ᵃ https://orcid.org/0000-0001-6610-5348

ᵇ https://orcid.org/0000-0003-2989-0214

A UAV can overcome the problem of fixed coverage and can reach areas that are difficult to access. It can therefore provide great assistance to human beings in rescue.

Within these areas of research, a number of datasets have been published over the past few years. These datasets cover a wide range of research disciplines, but mainly relate to the security, industrial and agricultural domains. For human detection and action recognition, there are several open-source datasets collected by drones, such as datasets for object detection (Xia et al., 2019), object tracking (Carletti et al., 2018), human action detection (Barekatain et al., 2017), and hand gesture recognition (Natarajan et al., 2018). There are also a dataset for UAV control and gesture recognition (Perera et al., 2018) and an outdoor recorded drone video dataset for action recognition (Perera et al., 2019). So far, however, there is no suitable dataset describing the gestures that human beings make in difficult situations, especially in disasters in the wild. In this work, we propose a novel dataset describing some of the body gesture responses that humans make in a wilderness environment.

Liu, C. and Szirányi, T.
Gesture Recognition for UAV-based Rescue Operation based on Deep Learning.
DOI: 10.5220/0010522001800187
In Proceedings of the International Conference on Image Processing and Vision Engineering (IMPROVE 2021), pages 180-187
ISBN: 978-989-758-511-1

Drones can also use speech in wilderness rescue, but speech depends more heavily on the environment: speech recognition cannot avoid the noise caused by the external environment, which can affect the rescue. Even if speech is possible between the UAV and humans on the ground in a noisy (e.g., rotor noise) environment, the variety of languages and the potentially rich vocabulary used to describe a problem can make it impossible to understand what problem the person is reporting. A limited and well-oriented dictionary of gestures, by contrast, forces humans to communicate briefly. Gesture recognition is therefore a good way to avoid some communication drawbacks, although the rescue gestures must be chosen to be the most representative ones across different cultural backgrounds.

In drone rescue, communication between users and drones is a very important factor. It is therefore necessary to add corresponding feedback to the process of recognizing rescue gestures. Of the 10 basic body rescue gestures created in this paper, we have chosen a pair of dynamic gestures, a two-handed waving motion (Attention) and a one-handed waving motion (Cancel), as the two most basic communication vocabularies, well separated from the static gesture patterns. When a person stretches out two arms to call the drone, the drone issues a warning and enters help mode. When the person stretches out only one arm, the user wants to cancel communication with the drone; in other words, the user does not need any help and the system shuts down. The dynamic gestures Attention and Cancel are highlighted here because they act as the set and reset functions respectively. For people who do not want to interact with the drone (e.g., standing people), no communication between the drone and the user is established and no warning messages appear.

The rest of the paper is organized as follows. Section 2 presents the background, including machine specifications and UAV connectivity. Section 3 presents the gesture data collection strategy and the methodology: human detection, pose extraction, human tracking and counting, and body rescue gesture recognition, along with a description of the relevant models, training, and system information. Section 4 discusses the training results of the models and the experimental results. Conclusions and future work are given in Section 5.

2 BACKGROUND

Based on the experiments conducted by Hossain and Lee on different GPU systems (Hossain and Lee, 2019), we chose to embed a Jetson Xavier GPU into the drone for real-world applications. The real deployment of this work uses a UAV with an onboard Jetson Xavier GPU in the field, where there is no network to rely on. During the experiments we were unable to fly the drone in the field for external reasons, so we simulated the field environment in the lab and changed the system for the testing part, as shown in Figure 1. The lab tests were done on a 3DR SOLO UAV with a Raspberry Pi system that relied on a desktop ground station with a GTX Titan GPU. The drone communicates with the computer via a local network. Section 4 reports the measured running time of the system.

Figure 1: Testing of a Raspberry Pi system UAV with GPU-based ground station for rescue.

The ground station computer, which was used for system testing, is equipped with an NVIDIA GeForce GTX Titan GPU and an Intel(R) Core(TM) i7-5930K CPU. The UAV carries a Raspberry Pi, a single-board computer with a camera module and a BCM2835 CPU. The camera is a 1080P 5MP 160° fisheye surveillance camera module for Raspberry Pi with IR night vision. The resolution of the drone camera is set to 1280×960 for gesture recognition. In the test, the drone was flown in the laboratory at a height of about 3 metres. The higher the resolution of the drone camera, the higher the altitude at which the drone can fly while still recognizing gestures. With high-resolution camera sensors, the system can therefore also work properly at altitudes of more than ten metres. The system works well when the drone is flying diagonally above the user, as the drone can capture the entire body from this position.


3 METHODOLOGY

The framework of our proposed system is based on gesture recognition for communication between the UAV and humans. This section presents data collection, human detection, counting, and tracking, and explains the whole gesture recognition system with calling and canceling feedback. Figure 2 shows the framework of the whole system. First, we perform pose estimation, followed by human tracking and counting. Next comes the all-important rescue gesture recognition. Feedback from the human is essential for UAV gesture recognition systems: obtaining information about gestures without feedback does not help to improve autonomy. To obtain this feedback, the two most important gestures are the novel dynamic gestures Attention and Cancel, which act as the set and reset functions of the system, respectively. These dynamic gestures have been described in our paper (Licsár and Szirányi, 2005). The system uses gesture recognition to force the user to communicate briefly, quickly, and effectively with the drone in specific environments.

3.1 Data Collection

OpenPose (Cao et al., 2017) is a real-time multi-person framework developed by the Perceptual Computing Lab of Carnegie Mellon University (CMU) to jointly detect human body, hand, facial, and foot keypoints on single images. Because of the robustness of the OpenPose algorithm and its flexibility in extracting keypoints, we used it to detect the human skeleton and obtain skeletal data for different body gestures, thus laying the data foundation for subsequent recognition. The key idea of OpenPose is to use a convolutional neural network to generate two sets of maps, one for predicting joint positions (heatmaps) and the other for associating the joints into human skeletons (part affinity fields). In short, the input to OpenPose is an image and the output is the skeleton of every person detected by the algorithm. Each skeleton has 18 joints, including the head, neck, arms, and legs. Figure 3 shows the skeleton data and Table 1 gives the keypoint information.
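
As a minimal sketch of how one OpenPose skeleton could be stored and turned into classifier input, the Python snippet below maps the 18 joint indices of Table 1 to names and flattens the (x, y) coordinates into a 36-value feature vector. The function name `skeleton_to_features`, the use of normalised image coordinates, and the convention of filling undetected joints with zeros are assumptions made for illustration, not details taken from this paper.

```python
# Joint order follows Table 1 (18-keypoint OpenPose layout).
JOINT_NAMES = [
    "Nose", "Neck", "Right Shoulder", "Right Elbow", "Right Wrist",
    "Left Shoulder", "Left Elbow", "Left Wrist", "Right Hip", "Right Knee",
    "Right Foot", "Left Hip", "Left Knee", "Left Foot", "Right Eye",
    "Left Eye", "Right Ear", "Left Ear",
]


def skeleton_to_features(joints):
    """Flatten a skeleton into [x0, y0, x1, y1, ...].

    `joints` maps a joint index (0-17) to an (x, y) pair. Joints that were
    not detected are absent from the dict and are filled with 0.0
    (an assumption for this sketch).
    """
    features = []
    for index in range(len(JOINT_NAMES)):
        x, y = joints.get(index, (0.0, 0.0))
        features.extend([float(x), float(y)])
    return features


# Hypothetical partial detection: only nose, neck and right wrist were found.
example = {0: (0.50, 0.20), 1: (0.50, 0.30), 4: (0.35, 0.10)}
vector = skeleton_to_features(example)
print(len(vector))                   # 18 joints x 2 coordinates
print(JOINT_NAMES[4], vector[8:10])  # coordinates of the right wrist
```

A fixed-length vector of this kind is convenient because every skeleton then has the same shape regardless of how many joints were actually visible.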

As there is no publicly available dataset for wilderness rescue by drones, we created a new dataset specifically describing short and meaningful body rescue gestures made by humans in different situations. Because people in different countries have different cultural backgrounds, certain gestures may carry different meanings. We have therefore selected and defined 10 representative rescue gestures that convey clear, specific, and unambiguous messages in different scenarios. These gestures are Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction and PhoneCall. The dataset can of course be extended with further gestures.

Table 1: OpenPose joints information.

Number  Joint            Number  Joint
0       Nose             9       Right Knee
1       Neck             10      Right Foot
2       Right Shoulder   11      Left Hip
3       Right Elbow      12      Left Knee
4       Right Wrist      13      Left Foot
5       Left Shoulder    14      Right Eye
6       Left Elbow       15      Left Eye
7       Left Wrist       16      Right Ear
8       Right Hip        17      Left Ear

The dataset was collected using a 1080P 160° fisheye surveillance camera module for Raspberry Pi on the 3DR SOLO UAV system. The data were collected from six members of our laboratory, who also took part in the subsequent real-time test. Four of them were male and two were female, aged between twenty and thirty years. They performed each gesture with as many variations as possible. The system proposed in this paper recognises ten very common body rescue gestures in real time: Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction and PhoneCall. We collected as many Attention and Cancel samples as possible in order to make the system more robust for setting and resetting. Table 2 describes each gesture and Table 3 gives the details of the UAV rescue dataset.

In our dataset, the focus is on the two dynamic gestures (Attention and Cancel), which are completely separate from the static gesture patterns, as they represent the system's set and reset functions. The system issues an alert only if it recognises one of these two gestures. Attention indicates that the user needs to establish communication with the drone. Conversely, Cancel triggers an alert indicating that the user does not need to establish contact, and the system automatically shuts down. When other rescue gestures are recognised, the system does not issue an alert: the remaining eight gestures are considered signs of normal human activity and therefore do not lead to further interaction with the drone.
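
The set and reset behaviour described above can be sketched as a tiny state machine. This is only an illustration under stated assumptions: the class name `RescueSession`, the `help_mode` flag and the message strings are hypothetical and are not taken from the authors' implementation.

```python
class RescueSession:
    """Tracks whether help mode is active (set by Attention, reset by Cancel)."""

    def __init__(self):
        self.help_mode = False

    def update(self, gesture):
        """Return an alert message for Attention or Cancel, otherwise None."""
        if gesture == "Attention":
            self.help_mode = True      # set: the user asks for help
            return "WARNING: someone needs help"
        if gesture == "Cancel":
            self.help_mode = False     # reset: the user cancels communication
            return "Communication cancelled"
        # The other eight gestures count as normal activity: no alert.
        return None


session = RescueSession()
for gesture in ["Stand", "Attention", "Walk", "Cancel"]:
    alert = session.update(gesture)
    print(gesture, "->", alert, "| help_mode =", session.help_mode)
```

In this sketch a non-alert gesture such as Walk leaves the current mode unchanged, which is one possible reading of the text; the paper does not specify this detail.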


Figure 2: Framework of the whole system.

Figure 3: OpenPose skeleton data.

3.2 Gesture Recognition

Figure 4 shows the flow chart for human gesture recognition. The human skeleton is first detected in the input video stream using the OpenPose algorithm to obtain skeletal data, features are then extracted from this skeletal data, and finally the features are fed into a classifier to obtain recognition results. We performed real-time pose estimation with OpenPose by using a pre-trained model as the estimator (Lawrence, 2021). A deep neural network (DNN) model is used for predicting the user's rescue gestures. We use the Deep SORT algorithm (Wojke et al., 2017) for human tracking in multi-person scenes. The main difference from the original SORT algorithm (Bewley et al., 2016) is the integration of appearance information based on a deep appearance descriptor. Deep SORT allows us to calculate a deep feature for each bounding box and to use the similarity between deep features as a factor in the tracking logic. From the above we obtain information about each detected person. Next, by counting the number of people, we arrive at three scenarios: no one, a single person, and multiple people. In case 1, the drone does not detect any person, so no communication between the drone and a user can be established and gesture recognition has no meaning. In cases 2 and 3, the drone detects one or more people, enters the gesture recognition phase, and displays the recognition results corresponding to the users' body gestures, in order to enable communication between the user and the drone and thus to help humans. When one of the two dynamic gestures Attention and Cancel, which represent the system's set and reset functions respectively, appears, the system displays a warning and opens help mode or cancels the interaction.
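
To illustrate how the similarity between appearance features can enter the tracking logic, the sketch below compares the feature vector of a new detection with the stored features of existing tracks using cosine distance. It is a simplified illustration only: the three-dimensional vectors, the track ids, the function names and the 0.2 threshold are made-up assumptions, and the full Deep SORT algorithm additionally combines this appearance cue with motion information.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def best_matching_track(detection_feature, track_features, max_distance=0.2):
    """Return the id of the closest track, or None if none is close enough."""
    best_id, best_dist = None, max_distance
    for track_id, feature in track_features.items():
        dist = cosine_distance(detection_feature, feature)
        if dist < best_dist:
            best_id, best_dist = track_id, dist
    return best_id


# Hypothetical appearance features of two existing tracks.
tracks = {1: [0.9, 0.1, 0.0], 2: [0.0, 0.2, 0.9]}
print(best_matching_track([0.8, 0.2, 0.1], tracks))  # looks like track 1
print(best_matching_track([0.0, 1.0, 0.0], tracks))  # no track is close enough
```

A detection that matches no existing track would typically start a new track, which is also how the number of people in the scene can be kept up to date.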

Figure 4: Workflow of the human gesture recognition system.
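
The three counting scenarios in this workflow (no one, a single person, multiple people) can be sketched as a small dispatch step. The function names and the dictionary of track ids and gesture labels are hypothetical; the sketch assumes that tracking and gesture classification have already produced one label per tracked person.

```python
def scenario_for(person_count):
    """Map the number of tracked people to one of the three cases."""
    if person_count == 0:
        return "no one"           # case 1: gesture recognition is skipped
    if person_count == 1:
        return "single person"    # case 2
    return "multiple people"      # case 3


def process_frame(tracked_gestures):
    """`tracked_gestures` maps a track id to that person's recognised gesture.

    Returns the scenario and the ids of people whose gesture triggers an alert.
    """
    scenario = scenario_for(len(tracked_gestures))
    if scenario == "no one":
        return scenario, []
    alert_ids = [track_id for track_id, gesture in tracked_gestures.items()
                 if gesture in ("Attention", "Cancel")]
    return scenario, alert_ids


print(process_frame({}))
print(process_frame({1: "Walk", 2: "Attention", 3: "Sit"}))
```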

In contrast to other gesture recognition methods (e.g. using 3D convolutional neural networks (Carreira and Zisserman, 2017)), we chose the OpenPose skeleton as the basic feature for human gesture recognition. The reason is that skeleton features are simple and intuitive and make it easy to distinguish between different human gestures. In contrast, 3D CNNs are large networks that are both time-consuming and difficult to train. As for the classifier, we experimented with four different classifiers: kNN (Guo et al., 2003), SVM (Mavroforakis and Theodoridis, 2006), a deep neural network (Liu et al., 2017), and random forest (Pal, 2005). Of these, the DNN achieved the highest accuracy, so we chose the DNN as the classifier for gesture recognition. The implementation of these classifiers was from the Python library “sklearn”.

Table 2: UAV rescue gestures and corresponding keypoints.

Number  Name
1       Kick
2       Punch
3       Squat
4       Stand
5       Attention
6       Cancel
7       Walk
8       Sit
9       Direction
10      PhoneCall

Table 3: UAV rescue gesture dataset details.

Number  Name        No. of data
1       Kick        784
2       Punch       583
3       Squat       711
4       Stand       907
5       Attention   1623
6       Cancel      1994
7       Walk        722
8       Sit         942
9       Direction   962
10      PhoneCall   641
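
As a worked example based on the per-class counts of Table 3 and the 90%/10% train/test split used for the DNN, the snippet below computes the total dataset size, the approximate size of each split, and the share of the two dynamic gestures. The rounding convention for the test set is an assumption, so the exact split sizes used by the authors may differ by a sample.

```python
# Number of samples per gesture, copied from Table 3.
SAMPLES_PER_GESTURE = {
    "Kick": 784, "Punch": 583, "Squat": 711, "Stand": 907,
    "Attention": 1623, "Cancel": 1994, "Walk": 722, "Sit": 942,
    "Direction": 962, "PhoneCall": 641,
}

total = sum(SAMPLES_PER_GESTURE.values())
test_size = round(0.10 * total)   # rounding is an assumption of this sketch
train_size = total - test_size
dynamic = SAMPLES_PER_GESTURE["Attention"] + SAMPLES_PER_GESTURE["Cancel"]

print("total samples:", total)
print("train / test :", train_size, "/", test_size)
print("Attention + Cancel share: {:.1%}".format(dynamic / total))
```

With these counts the dataset contains 9869 samples, of which Attention and Cancel together make up a little over a third, consistent with the stated aim of collecting as many samples of these two gestures as possible.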

The DNN model has been programmed using the Keras Sequential API in Python. There are 4 dense layers with 128, 64, 16, and 10 units respectively, each followed by batch normalization. The last layer of the model uses Softmax activation and has 10 outputs. After building the DNN model for gesture recognition, the next step is training. The model is compiled using Keras with the TensorFlow backend. The categorical cross-entropy loss function is used because it is well suited to measuring the performance of a fully connected layer's output with Softmax activation. The Adam optimizer is used with an initial learning rate of 0.0001. The model has been trained for 50 epochs on a system with an Intel i7-5930K CPU and an NVIDIA GeForce GTX TITAN X GPU. The total dataset is split into two sets: 90% for training and 10% for testing. The accuracy and loss of the final body gesture recognition model are reported in Section 4.
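
To make the pairing of Softmax output and categorical cross-entropy loss concrete, here is a small dependency-free sketch of both functions applied to a single sample with 10 gesture classes. It is not the Keras implementation used in this work; the raw scores are made-up numbers and the helper names are chosen for illustration.

```python
import math

GESTURES = ["Kick", "Punch", "Squat", "Stand", "Attention",
            "Cancel", "Walk", "Sit", "Direction", "PhoneCall"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities):
    """Loss is -log of the probability assigned to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)


# Made-up raw scores of the last dense layer for one skeleton sample.
logits = [0.1, 0.0, 0.2, 0.3, 4.0, 1.0, 0.1, 0.0, 0.2, 0.1]
probs = softmax(logits)
target = [1.0 if name == "Attention" else 0.0 for name in GESTURES]

print("predicted:", GESTURES[probs.index(max(probs))])
print("loss if the true class is Attention:",
      round(categorical_cross_entropy(target, probs), 3))
```

The loss is small when the probability assigned to the true class is close to 1 and grows quickly as that probability falls, which is why it suits a Softmax output layer.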

4 EXPERIMENTS

As introduced in Section 2, the testing phase of the designed system was carried out in a simulated field environment in the laboratory, and the actual running time required for gesture recognition on the GPU-based ground station was 25 ms. Note that the results shown below are cropped images; the original images have a 4:3 aspect ratio. Because we tried to recreate the field environment without clutter (e.g. tables and chairs that we did not want to include), we cropped a fixed area of the output video. As the communication between the UAV and the GPU-based ground station in the lab relies on the local network, the requests sent by the client and accepted by the server directly reduce the frame rate, resulting in a very slow system. The system achieves only around 5 FPS in real-time operation, but running directly on the UAV with the Jetson Xavier GPU avoids this network overhead. Note also that in the laboratory tests the UAV should fly at an inclined position above the person, at a distance of approximately 2 to 3 metres from the user. The angled position ensures that the entire human body is recognised with a higher probability than when flying directly above the user's head and looking vertically downwards. As the work is based on the human skeleton, the flying position of the drone places some limitations on the recognition results.
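
The relation between the reported 25 ms recognition time and the observed rate of about 5 FPS can be checked with simple arithmetic. The sketch below assumes that the 25 ms figure is a per-frame time and that all stages run one after another for each frame; both are assumptions, and the result is only a rough estimate of the time spent outside gesture recognition.

```python
def fps_from_frame_time(frame_time_ms):
    """Frames per second if every frame takes `frame_time_ms` milliseconds."""
    return 1000.0 / frame_time_ms


def frame_time_from_fps(fps):
    """Milliseconds available per frame at a given frame rate."""
    return 1000.0 / fps


recognition_ms = 25.0   # reported gesture recognition time
observed_fps = 5.0      # reported end-to-end rate in the lab setup

upper_bound_fps = fps_from_frame_time(recognition_ms)
observed_frame_ms = frame_time_from_fps(observed_fps)
other_ms = observed_frame_ms - recognition_ms

print("rate if recognition were the only cost:", upper_bound_fps, "FPS")
print("time per frame at the observed rate   :", observed_frame_ms, "ms")
print("time spent outside recognition        :", other_ms, "ms")
```

Under these assumptions most of each frame's time budget is spent outside gesture recognition, which agrees with the explanation that client-server communication over the local network is the bottleneck.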

Figure 5: Model accuracy over the epochs.

Figure 6: Model loss over the epochs.

Using the human rescue gesture dataset described in Tables 2 and 3, we trained the deep neural network and obtained the accuracy and loss of the human gesture recognition model. The changes in accuracy and loss during training are shown in Figures 5 and 6. At first, the training and test accuracy increase rapidly. Thereafter, they grow slowly between 10 and 20 epochs and the two curves merge after 30 epochs. After 40 epochs the curves show less noise. The weights of the best-fitting model, the one with the highest test accuracy, are preserved. Both training and test losses decrease and converge, indicating a well-fitting model. After 50 epochs of training, the model achieved an accuracy of 99.47% on the training data and 99.09% on the test data. Figure 7 shows the normalised confusion matrix on the test set. The high density on the diagonal shows that the majority of human rescue gestures are correctly predicted. For most gestures, the performance is close to perfect. We also analysed the performance of the model using other standard metrics. We used the following equations to calculate macro-F1. Based on the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) of the samples, we calculated the precision (P) and recall (R) values respectively, resulting in macro F1 values mostly close to 1.00.

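$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{macroP} = \frac{1}{n}\sum_{i=1}^{n} P_i, \qquad \mathrm{macroR} = \frac{1}{n}\sum_{i=1}^{n} R_i \tag{2}$$

$$\mathrm{macroF1} = \frac{2 \times \mathrm{macroP} \times \mathrm{macroR}}{\mathrm{macroP} + \mathrm{macroR}} \tag{3}$$

Here $n$ is the number of gesture classes, and $P_i$ and $R_i$ are the precision and recall of class $i$.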
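
Equations (1) to (3) can be evaluated directly from a confusion matrix. The sketch below does this for a made-up 3-class matrix; it assumes that macro precision and macro recall are arithmetic means over the classes, and the function name `macro_metrics` and the toy numbers are illustrative only, not results from this paper.

```python
def macro_metrics(confusion):
    """Compute macro precision, macro recall and macro-F1.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j (rows: true labels, columns: predictions).
    """
    n = len(confusion)
    precisions, recalls = [], []
    for i in range(n):
        tp = confusion[i][i]
        fp = sum(confusion[row][i] for row in range(n)) - tp
        fn = sum(confusion[i]) - tp
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)  # Eq. (1)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)     # Eq. (1)
    macro_p = sum(precisions) / n                              # Eq. (2)
    macro_r = sum(recalls) / n                                 # Eq. (2)
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0  # Eq. (3)
    return macro_p, macro_r, macro_f1


# Made-up confusion matrix for three classes with 50 samples each.
toy = [
    [48, 2, 0],
    [1, 47, 2],
    [0, 0, 50],
]
macro_p, macro_r, macro_f1 = macro_metrics(toy)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Note that Equation (3) takes the harmonic mean of the macro-averaged precision and recall, which can differ slightly from averaging per-class F1 scores.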

Figure 7: Normalized confusion matrix with predicted labels on the X-axis and true labels on the Y-axis on the testing set.

Figure 8 shows the recognition of the Attention and Cancel gestures with warning messages in real time. Information about the number of people, the time and the frame rate (FPS) is also shown. The recognition results for two further basic gestures, chosen at random from the dataset, are described in detail below. Figure 9 shows a user pointing in a particular direction; the aim is to alert the drone to look in the direction in which the person is pointing. For example, when someone is lying on the ground in that direction, the drone may not be able to recognise that person's skeleton well because of the limitations of its flight position, and the Direction gesture is a good solution to this problem. Figure 9 also shows the user making a phone call gesture, which could be linked to hand-gesture number recognition at a future stage: when the user poses to make a call, hand number recognition could be performed to obtain the phone number the user wants to dial. This is left for an extension of this work.

Figure 8: Attention and Cancel.

Figure 9: Direction and PhoneCall.

When there is more than one person and one of them makes the Attention gesture towards the drone, the drone issues a warning that someone needs help. This is shown in Figure 10. Figure 10 also shows that the gestures of the people other than the one performing the Attention gesture are well recognised. Our gesture recognition system can identify approximately 10 people at once. It is worth noting that a person who is not fully visible in the frame will not be recognised. If a user makes a gesture that is not in our dataset, the gesture will not be recognised and the recognition result shown above that person will be blank.

Figure 10: Multiple people with Attention.

5 DISCUSSION AND FUTURE WORK

In this paper we present a real-time human gesture recognition system for UAV rescue. The system intended for practical application (a UAV with an onboard Jetson Xavier GPU) and the system used in the laboratory test (a Raspberry Pi UAV with a GPU-based ground station) are two different setups. The system not only detects people, tracks them and counts them, but also recognises the user's gestures.

The main innovations and contributions of this paper are as follows. Firstly, gesture recognition for wilderness rescue avoids interference, such as noise, from the external environment, which is its greatest advantage over speech recognition for rescue. A limited and well-directed dictionary of gestures forces short communication. Gesture recognition is therefore a good way to avoid certain communication deficiencies. Secondly, a dataset of ten basic human rescue gestures (i.e. Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction and PhoneCall) was created to describe some body gestures that humans make in the field. Finally, the two most important gestures are the novel dynamic gestures Attention and Cancel, representing the set and reset functions respectively. From a drone rescue perspective, the system successfully obtains feedback from users. This work provides the basis for the design of subsequent rescue routes.

In future work we need to include more generic rescue gestures in the gesture dataset. We also need to enable the system to automatically retrain the model on new data in a very short time, thus obtaining new models with new rescue gestures. Outdoor testing of drones equipped with Jetson Xavier GPUs is also planned as an extension of this work.


ACKNOWLEDGEMENTS

The work was carried out at the Institute for Computer Science and Control (SZTAKI), Hungary, and the author would like to thank her colleague László Sp for providing the infrastructure and technical support. This research was funded by the Stipendium Hungaricum scholarship and the China Scholarship Council. The research was supported by the Hungarian Ministry of Innovation and Technology and the National Research, Development and Innovation Office within the framework of the National Lab for Autonomous Systems.

REFERENCES

Barbedo, J. G. A. (2019). A review on the use of unmanned aerial vehicles and imaging sensors for monitoring and assessing plant stresses. Drones, 3:40.

Barekatain, M., Martí, M., Shih, H., Murray, S., Nakayama, K., Matsuo, Y., and Prendinger, H. (2017). Okutama-action: An aerial view video dataset for concurrent human action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2153–2160.

Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468.

Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). Real-time multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Carletti, V., Greco, A., Saggese, A., and Vento, M. (2018). Multi-object tracking by flying cameras based on a forward-backward interaction. IEEE Access, 6:43905–43919.

Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

De Smedt, F., Hulens, D., and Goedeme, T. (2015). On-board real-time tracking of pedestrians on a UAV. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.

Erdelj, M., Natalizio, E., Chowdhury, K. R., and Akyildiz, I. F. (2017). Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Computing, 16(1):24–32.

Gonçalves, J. and Henriques, R. (2015). UAV photogrammetry for topographic monitoring of coastal areas. ISPRS Journal of Photogrammetry and Remote Sensing, 104:101–111.

Guo, G., Wang, H., Bell, D., Bi, Y., and Greer, K. (2003). KNN model-based approach in classification. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 986–996. Springer.

Hossain, S. and Lee, D.-J. (2019). Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors, 19:3371.

Hu, B. and Wang, J. (2018). Deep learning based hand gesture recognition and UAV flight controls.

Lawrence, C. (2021). reaktor/vzw-care-tf-pose-estimation.

Li, S. and Christensen, H. Wavetofly: Control a UAV using body gestures.

Licsár, A. and Szirányi, T. (2005). User-adaptive hand gesture recognition system with interactive training. Image and Vision Computing, 23(12):1102–1114.

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing, 234:11–26.

Ma, Y., Liu, Y., Jin, R., Yuan, X., Sekha, R., Wilson, S., and Vaidyanathan, R. (2017). Hand gesture recognition with convolutional neural networks for the multimodal UAV control. In 2017 Workshop on Research, Education and Development of Unmanned Aerial Systems (RED-UAS), pages 198–203.

Mavroforakis, M. E. and Theodoridis, S. (2006). A geometric approach to support vector machine (SVM) classification. IEEE Transactions on Neural Networks, 17(3):671–682.

Natarajan, K., Nguyen, T. D., and Mete, M. (2018). Hand gesture controlled drones: An open source library. In 2018 1st International Conference on Data Intelligence and Security (ICDIS), pages 168–175.

Pal, M. (2005). Random forest classifier for remote sensing classification. International Journal of Remote Sensing, 26(1):217–222.

Park, K. and Ewing, R. (2017). The usability of unmanned aerial vehicles (UAVs) for measuring park-based physical activity. Landscape and Urban Planning, 167:157–164.

Perera, A. G., Law, Y. W., and Chahl, J. (2019). Drone-action: An outdoor recorded drone video dataset for action recognition. Drones, 3:82.

Perera, A. G., Wei Law, Y., and Chahl, J. (2018). UAV-gesture: A dataset for UAV control and gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.

Peschel, J. M. and Murphy, R. R. (2013). On the human–machine interaction of unmanned aerial system mission specialists. IEEE Transactions on Human-Machine Systems, 43(1):53–62.

Wojke, N., Bewley, A., and Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE.

Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., and Zhang, L. (2019). DOTA: A large-scale dataset for object detection in aerial images. arXiv:1711.10398 [cs].
