How Far Can a Drone be Detected?
A Drone-to-Drone Detection System Using Sensor Fusion
Juann Kim (1,a), Youngseo Kim (2,b), Heeyeon Shin (3,c), Yaqin Wang (4,d) and Eric T. Matson (4)
1 Dept. of Software, Sangmyung University, Cheonan, Republic of Korea
2 Dept. of Human Centered AI, Sangmyung University, Seoul, Republic of Korea
3 Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea
4 Computer and Information Technology, Purdue University, West Lafayette, U.S.A.
a https://orcid.org/0000-0002-0923-0115
b https://orcid.org/0000-0002-9019-2135
c https://orcid.org/0000-0002-3423-3780
d https://orcid.org/0000-0003-2954-0855
Keywords:
Drone Detection, Audio Classification, Object Detection, Machine Learning, Deep Learning, Decision
Fusion, Sensor Fusion.
Abstract:
The recent successes of drone detection models show that leveraging decision fusion of audio-based and vision-based features can achieve higher accuracy than using a single type of feature. In this paper, we investigate how far a drone can be detected at different distances. A drone-to-drone dataset was collected separately using a camera and a microphone. The data are evaluated with deep learning and machine learning techniques to show how far a drone can be detected. Two different types of sensors were used for collecting acoustic-based and vision-based features. A Convolutional Neural Network (CNN) and a Support Vector Machine (SVM) are applied to the audio-based features, namely Mel-Frequency Cepstral Coefficients (MFCC) and the Log-Mel Spectrogram. YOLOV5 is adopted for visual feature extraction and drone detection. Ultimately, by fusing the audio and computer vision domains, our proposed model achieves high performance at different distances.
1 INTRODUCTION
The application of Unmanned Aerial Vehicles (UAVs), or drones, is increasing rapidly in diverse fields including agriculture, construction, technical service, health care, and delivery systems. The benefits of drones are enormous: they operate without a pilot, apply to diverse fields, and require low-cost infrastructure. In particular, a Counter Unmanned Aerial System (CUAS) is required to detect and track malicious drones that approach protected or secure areas. Drone flights in air exclusion zones have repeatedly occurred. For instance, a man was detained in 2015 after flying his UAV 100 feet above the ground near the White House (H. Abdullah, 2015). Consequently, the importance of drone detection, and further drone localization, has come to the fore. Various sensing modalities, including radar, lidar, and many types of cameras and microphones, have been applied to drone detection and localization.
In this paper, two low-cost sensors, a camera and a microphone, are used for drone-to-drone detection. Through sensor fusion, the two different sensors can compensate for each other's weaknesses. The experiment is conducted in a scenario where two drones face each other in the air. The distance between the target drone and the moving plane of the detecting drone is set from 20 m to 60 m to examine how far the target drone can be detected. The collected dataset is used to develop Machine Learning and Deep Learning models that detect a drone using the two sensors. This paper focuses on UAV detection within a certain range using audio-based and vision-based approaches.
In previous research, various feature extraction methods were proposed, such as MFCC, Log Mel-Spectrogram, and the Short Time Fourier Transform (STFT) (S. Al-Emadi, A. Al-Ali, A. Mohammad, and A. Al-Ali, 2019), (Y. Wang, F. E. Fagian, K. E. Ho, and E. T. Matson, 2021). Building on these methods, various studies have succeeded in detecting drones using MFCC (S. Jeon, J.-W. Shin, Y.-J. Lee, W.-H. Kim, Y. Kwon, and H.-Y. Yang, 2017). In this work, an SVM and a Convolutional Neural Network are used for drone detection with audio data, and Log Mel-Spectrogram and MFCC are evaluated for feature extraction. For computer vision, a state-of-the-art one-stage detector, You Only Look Once (YOLOV5), is applied
to detect the drone. After comparing different models and feature extraction methods, we choose the CNN model with MFCC features for the sensor fusion. In our proposed fusion system, the YOLOV5 model first detects the target drone using visual data; then, the falsely detected samples are reclassified with a pre-trained CNN model based on audio data. Overall, the main contributions of this work can be summarized as follows:
We manually collect drone-to-drone audio and video data at distances of 20 to 60 meters.
We propose a novel drone detection scheme that reduces the error rate using the proposed sensor fusion.
The rest of the paper is organized as follows. Section 2 reviews several related works to frame the problem. Section 3 introduces our methodology, which includes data collection and data processing. Section 4 presents the experiments on the proposed system, evaluating the performance of each domain individually and of the sensor fusion of both domains. Lastly, Section 5 presents the conclusions and future work.
2 RELATED WORK
2.1 Radio Frequency and Radar-Based
Approach
Currently, various methods are used for drone detection and drone localization. In (Choi B, Oh D, Kim S, Chong J-W, Li Y-C., 2018), distance estimation, or drone localization, was studied in two settings. First, with only one drone, the implemented FMCW radar system showed that the maximum detectable distance between the drone and the radar system was greater than about 1005 to 1010 m. Meanwhile, when two drones were flying at the same time, one frame of the detection results showed a range of around 339 m. Thus, the range of drone-to-drone detection using radar is shorter than when only one drone is flying. However, radar-based detection is not well suited to drones made of plastic materials or to small drones at widely varying ranges (Liu, Hao, et al., 2017). Also, radar is a high-cost sensor.
2.2 Audio-Based Deep Learning
Approach
A radar system sees only a small cross-section, and radio frequency (RF) based systems do not operate well when GPS communication signals are weak; therefore, their performance is limited. In contrast, a microphone array overcomes the shortcomings of these sensors and shows excellent performance in drone localization and tracking. In (Christnacher, F., Hengy, S., Laurenzis, M., Matwyschuk, A., Naz, P., Schertzer, S., & Schmitt, G., 2016), four microphone sensors were used to predict the direction of arrival (DOA) of a drone, and localization was performed by obtaining azimuth and elevation angles with the multiple signal classification (MUSIC) algorithm; in practice, it showed very low performance. Meanwhile, in (Sedunov, A., Sutin, A., Sedunov, N., Salloum, H., Yakubovskiy, A., & Masters, D., 2016), (H. Salloum, A. Sedunov, N. Sedunov, A. Sutin and D. Masters, 2015), Acoustic Aircraft Detection (AAD) systems were developed and built. These systems can detect and track small airplanes and helicopters, but they do not consider situations with multiple noise sources.
2.3 Vision-Based Deep Learning
Approach
Drone detection using Computer Vision is one of the traditional approaches and has been widely used in the past. Furthermore, drone detection systems using computer vision-based technology have been shown to be sufficiently accurate and commercially viable by the experiments conducted in (Deng, S., Li, S., Xie, K., Song, W., Liao, X., Hao, A., & Qin, H, 2020). Among CNN-based deep learning models, YOLO, a one-stage model, can easily detect drone objects in real time and respond within a very short time. Thus, it can be applied to systems such as CUAS.
2.4 Depth Estimation and Distance
Prediction
Depth estimation and distance prediction have made enormous progress in recent years and achieved significant results with the advance of deep learning (Almalioglu, Yasin, et al., 2019), (Wu, Zhenyao, et al., 2019), (Feng, Tuo, and Dongbing Gu., 2019). In an early study, (Aswini, N., S. V. Uma, and V. Akhilesh., 2022) applied the YOLOV3 object detection model and mathematical principles for obstacle distance estimation; they set the maximum distance to the obstacle at 30 m, whereas we set the distance to the
drone to a maximum of 60 m. On the other hand, (Yip, Daniel A., et al., 2020) designed sound level measurements from audio recordings that provide objective distance estimates. In contrast, our paper utilizes both sound and visual sources for drone distance prediction.
Figure 1: Visual examples of the data samples. We use the manually collected audio and image input data. From the left column: no-drone data, drone data at a 20 m distance, drone data at 40 m, and drone data at 60 m. From the top row: image samples, Mel-Spectrogram feature maps from audio data, and MFCC feature maps from audio data.
3 METHODOLOGY
3.1 Data Collection
The data collection method of this paper is similar to that of our previous work (Kim. J, Lee. D, Kim. Y, Shin. H, Heo. Y, Wang. Y, & Matson, E. T, 2022). We use a DJI Matrice 200 as the target drone and a DJI Mavic 2 Pro as the detecting drone to collect data in this research. Similar to (Alaparthy, V., Mandal, S., & Cummings, M., 2021), a negative dataset, i.e., no-drone data, is also collected for drone detection; it contains environmental noise recorded while no drone is flying in the air, such as wind or bird sounds. Various previous studies (Liu, Hao, et al., 2017), (Hu, Yuanyuan, et al., 2019), (Al-Emadi, Sara, et al., 2019) placed the camera and microphone on the ground to collect data containing the target drone. In contrast, our dataset was collected from the camera and the microphone of the detecting drone while the two drones were flying in the air at the same time, facing each other.
The detecting drone hovered at an altitude of 10 m. While hovering, the audio and video data were collected using the built-in camera and an iPhone 6 attached to the detecting drone. The target drone maintained a distance of 20 m, 40 m, or 60 m from the moving plane of the detecting drone while moving horizontally and vertically. With the distance fixed, the target drone moved randomly within the camera range.
The weather conditions varied across the days of data collection, including windy, sunny, and foggy days with different humidity levels and wind speeds. Thus, diverse background images and noises are included in the data, while the other environmental factors remain the same as those of the drone data.
The audio data consists of four classes in .wav format. The data are collected in an environment with various other noises such as wind, birds, cows, insects, traffic, and airplanes.
For the vision data, the raw .mp4 video files are split into images every 30 frames, as sketched below. Each image has a 640 x 640 resolution and is saved in .jpg format.
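The following is a minimal sketch of this frame-extraction step, assuming OpenCV; the file paths, the output directory (which must already exist), and the explicit resize to 640 x 640 are illustrative assumptions rather than the authors' exact pipeline.

import cv2

def split_video_to_frames(video_path, out_dir, every_n=30, size=(640, 640)):
    # Read the .mp4 file and save every `every_n`-th frame as a 640 x 640 .jpg image.
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frame = cv2.resize(frame, size)  # enforce the 640 x 640 resolution
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical file names): split_video_to_frames("flight_40m.mp4", "images/40m")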
For each domain, audio and image, 1029 data samples are collected per class. Thus, a total of 4116 data samples are collected for each domain, as shown in Table 1.
3.2 Audio Data Augmentation
The raw audio data is split into one-second segments, which can sufficiently represent the acoustic features for training and testing (S. Seo, S. Yeo, H. Han, Y. Ko, K. E. Ho, and E. T. Matson, 2020), (Casabianca, Pietro, and Yu Zhang, 2021), (S. Al-Emadi, A. Al-Ali, A. Mohammad, and A. Al-Ali, 2019).
Table 1: Number of data samples per split and class (audio / image).

Split        No drone     20m          40m          60m
Train        720 / 720    719 / 719    719 / 719    719 / 719
Validation   204 / 204    205 / 205    205 / 205    205 / 205
Test         105 / 105    105 / 105    105 / 105    105 / 105
Before feature extraction, pitch shifting is used for audio data augmentation in order to improve generalization. Pitch shifting raises or lowers the pitch of the audio without affecting the speed of the sound. In (J. Salamon and J. P. Bello, 2017), pitch-shifting augmentation shows the best positive impact on performance and is the only method that does not have a negative impact on any type of environmental sound classification. Through this augmentation, the total number of audio samples is doubled from the original dataset to 8232.
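A minimal sketch of this augmentation step, assuming librosa and soundfile, is shown below; the +2 semitone shift amount and the file names are illustrative assumptions, not values stated in the paper.

import librosa
import soundfile as sf

def augment_pitch(in_wav, out_wav, n_steps=2):
    # Load the 1-second clip at its original sample rate.
    y, sr = librosa.load(in_wav, sr=None)
    # Shift the pitch by `n_steps` semitones; duration (speed) is unchanged.
    y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_wav, y_shift, sr)

# Each original clip gets one shifted copy, doubling the dataset:
# augment_pitch("drone_20m_0001.wav", "drone_20m_0001_ps.wav")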
3.3 Audio Feature Extraction
The audio features are extracted using two feature extraction methods: MFCC and Log Mel-Spectrogram. MFCC provides useful features for capturing the periodicity of the fundamental frequencies produced by the drone's rotor blades (Jeon, S., Shin, J. W., Lee, Y. J., Kim, W. H., Kwon, Y., & Yang, H. Y., 2017). Meanwhile, the Log Mel-Spectrogram has a low false alarm rate but a weaker drone detection ability, whereas MFCC has a strong drone detection ability at the cost of a higher false alarm rate compared to the Log Mel-Spectrogram (Dong, Qiushi, Yu Liu, and Xiaolin Liu, 2022). For the hyper-parameters, the number of mel bands is set to the default value of 128, and the number of MFCCs is likewise set to 128. Examples of the extracted feature maps for the four classes are shown in Figure 1.
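A hedged sketch of the two extractors with the hyper-parameters stated above (128 mel bands, 128 MFCCs), assuming librosa; the framing parameters (window and hop length) are left at library defaults, which is an assumption.

import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    # MFCC feature map of shape (128, T).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=128)
    # Log Mel-Spectrogram feature map of shape (128, T).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return mfcc, log_mel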
3.4 Vision Data Processing
To train the model for drone detection, all of the ground-truth objects in the images must first be labeled. This dataset is labeled using "LabelImg" (heartexlabs, 2014). The coordinates of the bounding boxes containing the location of the drones are generated as text files.
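Assuming the labels are exported in the standard YOLO text format (one "class x_center y_center width height" line per object, all values normalized to [0, 1]), which is what YOLOV5 training expects, a minimal reader could look like this; the format assumption is ours, as the paper only states that text files are generated.

def read_yolo_labels(txt_path):
    # Parse one YOLO-format label file into a list of (class, xc, yc, w, h) tuples.
    boxes = []
    with open(txt_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes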
4 EXPERIMENT
4.1 Overview
In this paper, a low-cost sensor fusion system for detecting the target drone over three distance intervals is proposed. The camera and the microphone used for this system are attached to the detecting drone. Generally, drone detection using vision-based features shows high performance (Madasamy, K., Shanmuganathan, V., Kandasamy, V., Lee, M. Y., & Thangadurai, M, 2021). However, the camera cannot perform its role properly in situations where vision is obstructed. In the dataset we collected, weather conditions, including cloudy and foggy conditions, are the main factors causing such obstruction. This can be compensated by using an additional sensor for drone detection, namely audio-based features. Therefore, drone detection is first performed with vision-based features using the YOLOV5 model; then, the falsely detected vision data is reclassified using audio data with a CNN model. The falsely detected data are specifically the samples classified as False Negative (FN) and False Positive (FP). The proposed system is described in Figure 2.
The sound of the detecting drone, to which the microphone is attached, is treated as background noise when detecting the other drone. Although the two drone sounds are recorded simultaneously, drone detection is still achieved in this paper. In fact, in (Kim. J, Lee. D, Kim. Y, Shin. H, Heo. Y, Wang. Y, & Matson, E. T, 2022), with the microphone attached to the detecting drone, another drone was detected through the audio signal at up to 20 m with an accuracy of 88.96%.
4.2 Audio Classification
4.2.1 Background
Machine Learning and Deep Learning approaches are well known for achieving high performance in drone detection systems using audio data. In (Seo, Y., Jang, B., & Im, S., 2018), compared to an SVM, a CNN showed a decrease in false positives and an increase in the correct detection rate, with the deep learning model outperforming the SVM by 8.31%. Following (Seo, Y., Jang, B., & Im, S., 2018), in this experiment the performance of the SVM and the CNN are both evaluated with two different features, the Mel-Spectrogram and the Mel-Frequency Cepstral Coefficients, and various SVM kernels are applied to classify the features.
An SVM finds an optimal hyperplane separating positive and negative samples based on the principle of structural risk minimization (Winters-Hilt, S., Yelundur, A., McChesney, C., & Landry, M., 2006). Meanwhile, the CNN model also demonstrates high performance on two-dimensional features in many applications, including audio-based features (Seo, Y., Jang, B., & Im, S., 2018). A CNN is composed of multiple layered neural networks including convolution layers, pooling layers, activation layers, and fully connected layers.
Figure 2: Overview of the drone detection system.
4.2.2 Machine Learning Training
The input to the Support Vector Machine must be one-dimensional (1 x N) for each data sample, so Principal Component Analysis (PCA) is applied for dimension reduction from the two-dimensional feature maps to (1 x N). N, the PCA hyper-parameter n_components, is set to 128.
Grid search refers to training a model with all possible combinations of hyper-parameters within user-specified ranges and selecting the combination that yields the highest performance.
Three different kernels are used: the Gaussian Radial Basis Function (RBF), sigmoid, and polynomial kernels. One of the most commonly used kernel functions is the radial basis function, which adds a "bump" around each data point:
K(x, x_i) = e^{-\gamma \| x - x_i \|^2}    (1)
where γ, r, and d below are kernel parameters. The sigmoid kernel is defined as:
K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)    (2)
The polynomial kernel function is directional; in other words, the direction of the two vectors in the low-dimensional space determines the output, due to the dot product in the kernel. The magnitude of the vector x also influences the output:
K(x, x_i) = (1 + x \cdot x_i^T)^d    (3)
where d is the degree of the kernel function.
In the kernel functions, the gamma hyper-parameter defines how far the influence of a single training point reaches; its range is set from 1e-3 to 1e-6 in multiplicative steps of 0.1. C is another hyper-parameter that controls the trade-off between a smooth decision boundary and classifying the training points correctly; its range is set from 1e-3 to 1e3 (A. Patle and D. S. Chouhan, 2013). For the model's stability, 5-fold cross validation is applied, as sketched below.
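The following scikit-learn sketch combines the PCA reduction, the kernel/gamma/C grid, and 5-fold cross validation described above; X_train and y_train are assumed to exist, and flattening the 2-D feature maps before PCA is our assumption about the preprocessing.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ("pca", PCA(n_components=128)),   # reduce each flattened feature map to 128 dimensions
    ("svm", SVC()),
])

param_grid = {
    "svm__kernel": ["rbf", "sigmoid", "poly"],
    "svm__gamma": [1e-6, 1e-5, 1e-4, 1e-3],            # range stated in the text
    "svm__C": [1e-3, 1e-2, 1e-1, 1, 10, 100, 1e3],     # range stated in the text
}

search = GridSearchCV(pipe, param_grid, cv=5)           # 5-fold cross validation
# search.fit(X_train.reshape(len(X_train), -1), y_train)
# print(search.best_params_, search.best_score_)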
4.2.3 Deep Learning Training
The Early Stopping technique is used during training in order to prevent overfitting; the monitor and patience hyper-parameters are set to the validation loss and 15, respectively. Softmax is used as the activation function of the last layer. Two different optimizers, Stochastic Gradient Descent (SGD) and Adam, are evaluated. Also, 5-fold cross validation is applied to obtain a more accurate estimate of the model's prediction performance.
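A minimal Keras sketch of this training setup (softmax output, early stopping on validation loss with patience 15, SGD or Adam) is given below; the convolutional layer sizes and the input shape are illustrative assumptions, since the paper does not specify the CNN architecture in detail.

from tensorflow import keras

def build_cnn(input_shape, n_classes=4):
    # Small 2-D CNN over an MFCC or log-mel feature map; layer sizes are assumptions.
    return keras.Sequential([
        keras.layers.Input(shape=input_shape),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),   # softmax output layer
    ])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=15)
model = build_cnn((128, 44, 1))   # ~44 frames per 1-second clip at librosa defaults (assumption)
model.compile(optimizer=keras.optimizers.SGD(),             # or keras.optimizers.Adam()
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])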
4.2.4 Result and Analysis
Using the 5-fold cross validation procedure with MFCC feature extraction, the SVM model obtains an accuracy of 65.5%. Among the different combinations of hyper-parameters, the highest performance is achieved when the kernel function is set to RBF, gamma to 1e-5, and C to 10. On the other hand, the 5-fold cross validation result of the SVM using Log-Mel features is 34.5%. Thus, better results are obtained when MFCC is used as the feature extraction method for the SVM model.
Overall, from Table 2 to Table 5, the CNN based on MFCC features with Stochastic Gradient Descent as the optimizer shows the highest and most stable performance in the multi-class classification. The model can detect up to 40 meters with more than 70% accuracy.
Comparing the two models, the CNN shows better performance than the SVM for drone detection based on audio features. Therefore, the CNN model based on MFCC features is employed for the second step of the proposed drone detection system, as shown in Figure 2.
Table 2: CNN-MFCC Adam.
Class Precision Recall F1 Accuracy
no drone 74.0% 66.0% 69.7% 66.0%
20m 75.0% 75.0% 74.7% 74.9%
40m 54.3% 57.3% 56.0% 57.5%
60m 59.0% 61.3% 60.3% 62.2%
Table 3: CNN-Mel Adam.
Class Precision Recall F1 Accuracy
no drone 75.7% 70.7% 73.0% 70.8%
20m 74.5% 68.0% 70.7% 67.9%
40m 55.3% 54.7% 54.7% 54.6%
60m 61.0% 69.0% 64.0% 69.8%
Table 4: CNN-MFCC SGD.
Class Precision Recall F1 Accuracy
no drone 75.3% 75.0% 75.0% 75.2%
20m 78.0% 78.3% 78.0% 78.4%
40m 60.7% 70.7% 65.0% 70.5%
60m 73.3% 60.3% 66.0% 61.0%
Table 5: CNN-Mel SGD.
Class Precision Recall F1 Accuracy
no drone 48.3% 45.7% 44.4% 45.4%
20m 60.3% 50.3% 54.0% 50.1%
40m 42.3% 63.0% 50.7% 62.9%
60m 45.7% 33.0% 38.3% 33.3%
4.3 Vision Object Detection
To detect the drone in images, Convolution with Batch normalization and Leaky ReLU (CBL), Spatial Pyramid Pooling (SPP), and Cross Stage Partial (CSP) blocks are used in the backbone of YOLOV5 (Ultralytics, "YOLOV5"), a powerful object detection model. The backbone network obtains feature maps of different sizes from the input images via pooling and convolution layers. The overall structure is shown in Figure 3.
Figure 3: YOLOV5 Architecture.
First, CBL is the fundamental feature extraction block, consisting of a convolution layer, batch normalization, and Leaky ReLU. SPP enhances performance by pooling feature maps with filters of different sizes and then merging them again. CSP divides the feature map of the base layer into two parts to suppress the massive inference computation caused by duplicate gradient information; the parts are then combined again by the cross-stage hierarchy method proposed in (Wang, Chien-Yao, et al.). In this way, the separated gradient information can have a large correlation difference by switching the transition and concatenation steps. Furthermore, CSP can considerably reduce the computational effort and improve inference cost and accuracy.
Five backbone variants, YOLOv5-n, s, m, l, and x, are available. Each model is distinguished by its depth multiple and width multiple. The larger the depth multiple, the more the BottleneckCSP block is repeated, producing a deeper model; the larger the width multiple, the greater the number of convolution filters in the corresponding layer.
The training is performed with the SGD optimizer using a momentum of 0.937 and a weight decay of 1e-5. For the other hyper-parameters, the learning rate is initialized to 0.01 and the batch size is set to 16. The number of iterations is set to 30. Our model architecture has 270 layers and 17K parameters. The performance on the drone detection task is measured by precision, recall, and accuracy.
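For context, the sketch below shows how a trained YOLOV5 checkpoint can be applied to one frame using the documented Ultralytics torch.hub loader; the checkpoint name "best.pt" and the image path are assumptions, and training itself is assumed to be done with the Ultralytics training scripts using the hyper-parameters listed above.

import torch

# Load a custom-trained YOLOv5 checkpoint (assumed to be saved as "best.pt").
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

results = model("images/40m/frame_00001.jpg")   # run detection on one 640 x 640 frame
detections = results.xyxy[0]                    # tensor rows: [x1, y1, x2, y2, conf, class]
drone_found = len(detections) > 0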
4.4 Sensor Fusion
As previously shown in Figure 2, the fusion method includes two steps. The first step is drone detection using the YOLOV5 model on vision-based features. Then, the data falsely detected by the YOLOV5 model in the first step are re-classified by the pre-trained CNN model in the second step. This proposed two-step system shows the highest performance among the three methodologies: using only audio-based features, using only vision-based features, and using the decision fusion of both. As shown in Figure 4, at a distance of 40 m, the accuracy of the fusion method reaches 88%, which is about 10% to 20% higher than the accuracies of the two individual single-sensor methods. Furthermore, for drone detection at different distances, it can clearly be seen that the performance decreases as the distance from the detecting drone increases.
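One way to read this two-step fusion at inference time is sketched below: the audio CNN is consulted only when the vision step misses the drone or is not confident. Note that in the paper's evaluation the re-classified samples are the FN/FP cases identified against ground truth, so the confidence-threshold fallback here is a simplifying assumption; vision_detector and audio_classifier are hypothetical callables standing in for the YOLOV5 model and the pre-trained CNN.

def fuse_decision(image, audio_clip, vision_detector, audio_classifier, conf_threshold=0.5):
    # Step 1: vision. `vision_detector` is assumed to return rows [x1, y1, x2, y2, conf, cls].
    boxes = vision_detector(image)
    if len(boxes) > 0 and max(b[4] for b in boxes) >= conf_threshold:
        return "drone"                      # confident visual detection, keep the vision result
    # Step 2: vision missed or was uncertain; fall back to the audio CNN.
    return audio_classifier(audio_clip)     # one of: "no drone", "20m", "40m", "60m"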
5 CONCLUSION AND FUTURE
WORKS
The proposed system combines a camera and a microphone to perform drone detection and distance interval estimation. First, YOLOV5 is trained on image data covering different distance ranges. The test samples that are classified as False Negative (FN) or False Positive (FP) are then re-classified by CNN models using MFCC features, pre-trained on the audio data.
Although the microphone is attached directly to the detecting drone, the model is able to detect the target drone flying at 20 m with an accuracy of 78%. When the distance increases from 20 m to 40 m, the performance of our proposed system is 10% higher than when using only vision and 17% higher than when using only audio. Even when the plane of the target drone is 60 m away from the detecting drone, it is still possible to detect the drone with up to 80% performance, as shown in Figure 4.
Figure 4: The detection accuracy declines as the distance between the detecting drone and the moving plane of the target drone increases.
Figure 4 shows that the detection performance decreases as the distance between the detecting drone and the moving plane of the target drone increases. This research has the limitation that only one type of target drone was used. In future work, various types of drones will be used. We also plan to apply other deep learning methods, such as LSTM and RCNN, to compare performance and find the best model for drone-to-drone detection using audio and computer vision sensors.
REFERENCES
A. Patle and D. S. Chouhan, "SVM kernel functions for classification," 2013 International Conference on Advances in Technology and Engineering (ICATE), 2013, pp. 1-9, doi: 10.1109/ICAdTE.2013.6524743.
Al-Emadi, S., Al-Ali, A., & Al-Ali, A. (2021). Audio-based
drone detection and identification using deep learning
techniques with dataset enhancement through genera-
tive adversarial networks. Sensors, 21(15), 4953.
Alaparthy, V., Mandal, S., & Cummings, M. (2021, March).
Machine Learning vs. Human Performance in the Re-
altime Acoustic Detection of Drones. In 2021 IEEE
Aerospace Conference (50100) (pp. 1-7). IEEE.
Almalioglu, Y., Saputra, M. R. U., De Gusmao, P. P.,
Markham, A., & Trigoni, N. (2019, May). GANVO:
Unsupervised deep monocular visual odometry and
depth estimation with generative adversarial net-
works. In 2019 International conference on robotics
and automation (ICRA) (pp. 5474-5480). IEEE.
Aswini, N., S. V. Uma, and V. Akhilesh. ”Drone to Obstacle
Distance Estimation Using YOLO V3 Network and
Mathematical Principles.”Journal of Physics: Confer-
ence Series. Vol. 2161. No. 1. IOP Publishing, 2022.
Casabianca, Pietro, and Yu Zhang. "Acoustic-based UAV detection using late fusion of deep neural networks." Drones 5.3 (2021): 54.
Choi B, Oh D, Kim S, Chong J-W, Li Y-C. Long-Range
Drone Detection of 24 G FMCW Radar with E-plane
Sectoral Horn Array. Sensors. 2018; 18(12):4171.
https://doi.org/10.3390/s18124171
Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison,
C. J., & Dahl, G. E. (2019). On empirical compar-
isons of optimizers for deep learning. arXiv preprint
arXiv:1910.05446.
Christnacher, F., Hengy, S., Laurenzis, M., Matwyschuk,
A., Naz, P., Schertzer, S., & Schmitt, G. (2016,
October). Optical and acoustical UAV detection. In
Electro-Optical Remote Sensing X (Vol. 9988, pp. 83-
95). SPIE.
Deng, S., Li, S., Xie, K., Song, W., Liao, X., Hao, A., &
Qin, H. (2020). A global-local self-adaptive network
for drone-view object detection. IEEE Transactions on
Image Processing, 30, 1556-1569.
Dong, Qiushi, Yu Liu, and Xiaolin Liu. "Drone sound detection system based on feature result-level fusion using deep learning." Multimedia Tools and Applications (2022): 1-23.
Feng, Tuo, and Dongbing Gu. ”SGANVO: Unsuper-
vised deep visual odometry and depth estimation
with stacked generative adversarial networks.”IEEE
Robotics and Automation Letters 4.4 (2019): 4431-
4437.
H. Abdullah, "Man Detained for Flying Drone Near White House," NBC News, May 15, 2015. [Online]. Available: https://www.nbcnews.com/news/us-news/man-detained-trying-fly-drone-near-white-house-n359011
H. Salloum, A. Sedunov, N. Sedunov, A. Sutin and D. Masters (2015).
J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, March 2017, doi: 10.1109/LSP.2017.2657381.
H. Liu, Z. Wei, Y. Chen, J. Pan, L. Lin and Y. Ren, "Drone Detection Based on an Audio-Assisted Camera Array," 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), 2017, pp. 402-406, doi: 10.1109/BigMM.2017.57.
Hu, Y., Wu, X., Zheng, G., & Liu, X. (2019, July). Ob-
ject detection of UAV for anti-UAV based on im-
proved YOLO v3. In 2019 Chinese Control Confer-
ence (CCC) (pp. 8386-8390). IEEE.
Heartexlabs, ”labelImg”, github.com (2014)
Jeon, S., Shin, J. W., Lee, Y. J., Kim, W. H., Kwon, Y.,
& Yang, H. Y. (2017, August). Empirical study of
drone sound detection in real-life environment with
deep neural networks. In 2017 25th European Signal
Processing Conference (EUSIPCO) (pp. 1858-1862).
IEEE.
Kim, J., Lee, D., Kim, Y., Shin, H., Heo, Y., Wang, Y., &
Matson, E. T. (2022). Deep Learning Based Malicious
Drone Detection Using Acoustic and Image Data (No.
9335). EasyChair.
Madasamy, K., Shanmuganathan, V., Kandasamy, V., Lee,
M. Y., & Thangadurai, M. (2021). OSDDY: embed-
ded system-based object surveillance detection system
with small drone using deep YOLO. EURASIP Jour-
nal on Image and Video Processing, 2021(1), 1-14.
S. Al-Emadi, A. Al-Ali, A. Mohammad and A. Al-Ali, "Audio Based Drone Detection and Identification using Deep Learning," 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), 2019, pp. 459-464, doi: 10.1109/IWCMC.2019.8766732.
Sedunov, A., Sutin, A., Sedunov, N., Salloum, H.,
Yakubovskiy, A., & Masters, D. (2016). Passive
acoustic system for tracking low-flying aircraft. IET
Radar, Sonar & Navigation, 10(9), 1561-1568.
S. Jeon, J.-W. Shin, Y.-J. Lee, W.-H. Kim, Y. Kwon and H.-Y. Yang, "Empirical study of drone sound detection in real-life environment with deep neural networks," 2017 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 1858-1862, doi: 10.23919/EUSIPCO.2017.8081531.
S. Seo, S. Yeo, H. Han, Y. Ko, K. E. Ho and E. T. Matson, "Single Node Detection on Direction of Approach," 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), 2020, pp. 1-6, doi: 10.1109/I2MTC43012.2020.9129016.
Seo, Y., Jang, B., & Im, S. (2018, November). Drone detec-
tion using convolutional neural networks with acous-
tic STFT features. In 2018 15th IEEE International
Conference on Advanced Video and Signal Based
Surveillance (AVSS) (pp. 1-6). IEEE.
Ultralytics, ”YOLOV5”, github.com
https://github.com/ultralytics/YOLOV5
Wang, C. Y., Liao, H. Y. M., Wu, Y. H., Chen, P. Y., Hsieh,
J. W., & Yeh, I. H. (2020). CSPNet: A new backbone
that can enhance learning capability of CNN. In Pro-
ceedings of the IEEE/CVF conference on computer
vision and pattern recognition workshops (pp. 390-
391).
Winters-Hilt, S., Yelundur, A., McChesney, C., & Landry,
M. (2006, September). Support vector machine im-
plementations for classification & clustering. In BMC
bioinformatics (Vol. 7, No. 2, pp. 1-18). BioMed Cen-
tral.
Wu, Z., Wu, X., Zhang, X., Wang, S., & Ju, L. (2019). Spa-
tial correspondence with generative adversarial net-
work: Learning depth from monocular videos. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision (pp. 7494-7504).
Y. Wang, F. E. Fagian, K. E. Ho and E. T. Matson, ”A Fea-
ture Engineering Focused System for Acoustic UAV
Detection,” 2021 Fifth IEEE International Conference
on Robotic Computing (IRC), 2021, pp. 125-130, doi:
10.1109/IRC52146.2021.00031.
Yip, D. A., Knight, E. C., Haave-Audet, E., Wilson, S. J.,
Charchuk, C., Scott, C. D., ... & Bayne, E. M. (2020).
Sound level measurements from audio recordings pro-
vide objective distance estimates for distance sam-
pling wildlife populations. Remote Sensing in Ecol-
ogy and Conservation, 6(3), 301-315.