InsectDSOT: A Neural Network for Insect Detection in Olive Trees
Lotfi Souifi¹, Afef Mdhaffar¹,², Ismael Bouassida Rodriguez¹, Mohamed Jmaiel¹,² and Bernd Freisleben³
¹ReDCAD Laboratory, ENIS, University of Sfax, B.P. 1173 Sfax, Tunisia
²Digital Research Center of Sfax, 3021 Sfax, Tunisia
³Dept. of Math. & Comp. Sci., Philipps-Universität Marburg, Germany
Keywords:
Deep Learning, Object Detection, Insect Detection, Yolov5, RepVGG.
Abstract:
Controlling insect pests in agricultural fields is a major concern. Despite technological developments, most farm management methods and technologies still rely on experts and do not yet meet the criteria required for precise insect pest control. In this paper, we present a neural network approach for detecting and counting insects. Using Yolov5n version 6.1 as a baseline model, we propose replacing the Conv layers in the original model's backbone and neck with RepVGG blocks. We use transfer learning to improve performance by first training our model on the MS COCO dataset and then using the resulting weights to initialize training on the target dataset. Our proposal is validated using the DIRT (Dacus Image Recognition Toolkit) dataset. The obtained results demonstrate that our approach, based on an improved Yolov5, achieves a precision of 86.1%. It outperforms four versions of the original Yolov5 as well as Yolov5-based versions with modified backbones based on lightweight models.
1 INTRODUCTION
Insects, pests, diseases, and weed infestations are
estimated to harm 40% of agricultural productivity
(Carvajal-Yepes et al., 2019). Plant diseases cost the global economy at least $220 billion annually, and invasive insects at least $70 billion (FAO, 2021). Therefore, it is crucial to detect insects and precisely count their numbers in order to propose an adequate spraying strategy. For small targets like insect pests, the detection ability of current models is inadequate for three reasons. First, there is a lack of information: small objects occupy fewer pixels in the image and carry less information, making it difficult to extract important characteristics without being influenced by the surrounding background. Second, the positional precision requirements are stringent: for tiny targets, even a small offset of the bounding box causes a large relative localization error, in both the training and prediction phases. Third, there is the object aggregation problem: after repeated downsampling, nearby targets are aggregated into a single point in the deep feature map, making it impossible to differentiate between separate objects; this also makes regressing the bounding boxes and converging the model harder. Moreover, the accuracy of lightweight models is typically low when compared to larger models.
Some studies use one of the two following ap-
proaches to tackle the problem of low accuracy in the
lightweight models. The first approach consists of using datasets in which insects appear at very large sizes (Yang et al., 2021). The disadvantage of this approach is that the output model performs poorly, particularly in real-world settings.
The second approach consists of adding more lay-
ers to existing models and/or combining them with
a new feature extractor from another technique. The
downside of this approach is that the output model
size generally increases. The second approach is used
by (Li et al., 2021b; He et al., 2019; Cardoso et al.,
2022). (Li et al., 2021b) integrated EfficientNet and
Yolo. The precision improvement is around 5%, but
the output model size is nearly 60 MB, which is con-
sidered to be large for edge deployment. (He et al., 2019) employed three meta-architectures (Faster R-
CNN (Girshick, 2015), R-FCN (Dai et al., 2016), and
SSD (Liu et al., 2016)) in conjunction with feature ex-
tractors (ResNet (He et al., 2016), Inception (Szegedy
et al., 2017), and MobileNet (Howard et al., 2017)) to
select the best model while taking the mean average
precision (mAP) into account. Their technique has
the drawback of having a low mAP (77%). (Cardoso
et al., 2022) evaluated four basic models: Faster R-
CNN, Yolov4 (Bochkovskiy et al., 2020), and Yolov5
(in small and large versions). The average precision
(AP) in three of the four selected models is low (i.e.,
approximately 80%) and the size of the final model is
too large for edge devices (i.e., 167 MB).
In this paper, we adopt the second approach to de-
velop a neural network, called InsectDSOT, for de-
tecting and counting insects. Using the Yolov5n 6.1
version as a baseline model, we replace the Conv lay-
ers in the original model’s backbone and neck with
RepVGG blocks. We use transfer learning to improve performance by first training our model on the MS COCO dataset and then using the resulting weights to initialize training on the target dataset. The model's accuracy is improved significantly, while the size of the resulting model increases by only 0.5 MB, which is reasonable for edge deployment. We test our hypothesis using a variety of backbones, including the MobileNet series, the ShuffleNet series, Yolov3, Yolo5Face, EfficientNet-Lite, PP-LCNet, GhostNet, and the Global Context Network, on the DIRT dataset of olive insects. Our experimental results demonstrate the usefulness of our proposal in terms of precision, without significantly increasing the model size.
The remainder of the paper is organized as fol-
lows. Section 2 provides an overview of related work.
Section 3 describes our proposal. Section 4 describes
our experiments. Section 5 concludes this paper and
outlines areas for future research.
2 RELATED WORK
(Chen et al., 2022) used MobileNetv3 (Howard et al.,
2019) with Yolov5 to detect airplanes in real time.
The difficulty with their technique is that the suggested model's accuracy is very low, falling below 60%. (Hou et al., 2022) proposed M-YOLO, an
infrared object detector created by combining Mo-
bileNetv2 (Sandler et al., 2018) with Yolov4. The
accuracy of the proposed model is low, indicating a
flaw in their technique. (Qi et al., 2021) proposed
YOLO5Face, a face detector based on YOLOv5.
This technique achieves good results and outperforms
other models when dealing with face identification, in
terms of size and performance. This method has not
been applied in the identification of small objects, par-
ticularly insects. (Liu et al., 2021) proposed combin-
ing the Global Context Network with Yolov3 (Red-
mon and Farhadi, 2018). Only six out of 24 insect types respond well to this approach; the common characteristic of those six is that they are large compared with the others. Their model's mean average precision is insufficient. (Qian et al., 2022) proposed using EfficientNet-Lite (Tan and Le, 2019) as the backbone of their method. The authors validated their technique by testing it against four different backbones: MobileNetv3-Small, GhostNet (Han et al., 2020), ShuffleNetV2 (Ma et al., 2018), and PP-LCNet (Cui et al., 2021). The findings demonstrate that all the other
models produce inferior outcomes when compared to
their suggestion.
Yolov3 is used by (Li et al., 2021a) with data gathered from the Internet. The flaw in this method is that image characteristics, such as lighting and insect size, may differ from those found in the real environment. In contrast to (Li et al., 2021a), (Bjerge
et al., 2022; Tresson et al., 2019) implemented their
method in the real world while using expensive ma-
terials. (Takimoto et al., 2021) combined detection and classification by using Yolov4 together with EfficientNet in their study. (Mamdouh
and Khattab, 2021) proposed a system for detecting
and counting olive flies using an enhanced version of
Yolov4. The suggested model’s drawback is that it has
a poor accuracy (84%) that can be improved. (Cabr-
era and Villanueva, 2022) used Yolov5 in its default
configuration without any additional changes. The
size of the insects in the images used by (Yuan et al., 2021) was at least 30% of the image, which is much larger than what would be found in the real world (where insects often occupy less than 10% of the image). In our
article, we focus on the YOLO (Yolov5) detector, a
one-stage anchor-based detector. Each deep learning
model typically has benefits and drawbacks as feature
extractors. Some models were created using feature extractors from already-built models, like SSD (Single Shot MultiBox Detector), which was originally proposed with the VGG16 feature extractor. Numerous stud-
ies have been conducted about combining model ad-
vantages to overcome other models’ disadvantages.
ResNet (and its variations), MobileNet (and its vari-
ations), and InceptionNet were the three models that
were most often used as backbones. The target mod-
els, such as SSD (Patel and Bhatt, 2019) (SSD with
Inception and MobileNet) and Yolo (Zha et al., 2021)
(uses Yolo with MobileNet), were both single-stage
detectors. Additionally, there are two stage detectors
like Faster R-CNN (Teng et al., 2022) (uses Faster-
RCNN with ResNet50). To summarize, most existing one-stage object detection methods use feature fusion to increase the accuracy on tiny objects. The main reason is that low-level feature maps carry sparse semantic information but localize objects accurately, while high-level feature maps carry rich semantic information but localize objects poorly.
3 InsectDSOT
This section presents the original architecture of
Yolov5 and the core principles of the RepVGG block.
Also, we present InsectDSOT, our improved Yolov5
for insect detection.
3.1 Architecture of Yolo
Yolov5 has undergone various modifications since its
initial release in 2020. For example, the adoption of
the C3 layer was the most significant enhancement
in version 4 (Jocher et al., 2021a). Version 6 (Jocher et al., 2021b) introduced the nano variant, which is the basis of our approach. Furthermore, some other enhancements were recently incorporated in versions 6.1 (Jocher et al., 2022a) and 6.2 (Jocher et al., 2022b). We use version 6.1 in our experiments, since the latest version (6.2) adds classification models, whereas we focus solely on detection. The Yolov5 architec-
ture is shown in Figure 1.
Figure 1: Yolov5 v6.1 architecture.
YOLOv5 is available in five versions: YOLOv5n,
YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x.
The only difference between the variants is the model
depth and layer width. We chose YOLOv5n because it has the smallest depth and feature map width, as well as the quickest training time. Also, the result of our proposal is a lightweight model that can be deployed on both edge devices and servers. The other three variants (m, l, and x) are wider and deeper versions of YOLOv5s.
Table 1 lists the depth and width multipliers of each Yolov5 version.

Table 1: Yolov5 versions: depth and width.
Yolov5 model   depth   layer width
N   0.33   0.25
S   0.33   0.5
M   0.67   0.75
L   1.0   1.0
X   1.33   1.25

3.2 RepVGG Model

According to (Ding et al., 2021), many advanced ConvNets outperform plain ConvNets in terms of accuracy, but suffer from significant limitations: (i) complex multi-branch designs (for example, residual addition in ResNet) make the model harder to implement and customize, slow down inference, and reduce memory efficiency; (ii) some modules (e.g., depthwise conv in MobileNets) increase memory access costs and are not well supported on a variety of devices. The authors
suggest RepVGG as a solution to these restrictions. RepVGG is a strong convolutional neural network architecture with a VGG-like inference-time body (Simonyan and Zisserman, 2014), made up basically of a stack of 3 × 3 convolutions and ReLU, while reaching the accuracy and speed of complicated multi-branch networks. We chose RepVGG-A0 since it has the fewest network parameters among the RepVGG configurations; nonetheless, RepVGG-A0 combines a reasonable network size with high recognition accuracy (it reaches a Top-1 accuracy of 72.41% on the 1000-class ImageNet-2012 test set). RepVGG provides the following benefits:
- The model features a VGG-like plain (also known as feed-forward) topology, i.e., each layer takes the output of the previous layer as input and feeds its output into the next layer.
- The model's body mainly makes use of 3 × 3 conv and ReLU.
- No automatic search (Zoph et al., 2018), manual refinement (Radosavovic et al., 2020), compound scaling (Tan and Le, 2019), or other heavy designs are used to instantiate the concrete architecture (including the particular depth and layer widths).
The differences between the ResNet and RepVGG
designs, as well as the RepVGG inference, are de-
picted in Figure 2:
(A) ResNet: It has a multi-path topology during both training and inference, which makes it slow and memory-inefficient.
(B) RepVGG Training: It has a multi-path topology only during training.
(C) RepVGG Inference: It has a single-path topology during inference, allowing a quick inference time.
RepVGG employs ResNet-like identity and 1 × 1 branches to ensure that a building block's training-time information flow is y = x + g(x) + f(x), as in (B).
Figure 2: Difference between RepVGG and ResNet archi-
tecture.
From the same point of view as (Veit et al., 2016), a model with n such blocks becomes an aggregation of $3^n$ members.
Figure 3: A RepVGG block’s hierarchical re-
parameterization.
To move from (B) to (C), RepVGG employs the
concept of re-parameterization to turn a learned block
into a single 3 × 3 convolutional layer for inference.
It is important to note that batch normalization (BN) (Ioffe and Szegedy, 2015) is used before the addition
for each branch. The re-parameterization uses $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ to denote the kernel of a 3 × 3 conv layer with $C_1$ input and $C_2$ output channels, and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1}$ for the kernel of the 1 × 1 branch. We use $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ as the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer following the 3 × 3 conv, $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ for the BN following the 1 × 1 conv, and $\mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}$ for the identity branch. Let $M^{(1)} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{(2)} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$ be the input and output, respectively, and let $*$ be the convolution operator. If $C_1 = C_2$, $H_1 = H_2$, and $W_1 = W_2$, we have

$$M^{(2)} = \mathrm{bn}(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) + \mathrm{bn}(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}), \quad (1)$$

where bn is the inference-time BN function, applied channel-wise:

$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M_{:,i,:,:} - \mu_i)\,\frac{\gamma_i}{\sigma_i} + \beta_i. \quad (2)$$
The final bias is obtained by summing the three bias vectors, and the final 3 × 3 kernel is obtained by putting the 1 × 1 kernels onto the central point of a 3 × 3 kernel, which is readily achieved by first zero-padding the two 1 × 1 kernels to 3 × 3 and then adding the three kernels up, as illustrated in Figure 3. As shown in Figure 3, we assume $C_2 = C_1 = 2$ for clarity; therefore, the 3 × 3 layer contains four 3 × 3 kernel matrices and the kernel of the 1 × 1 layer is a 2 × 2 matrix.
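For illustration, the following PyTorch sketch builds a training-time RepVGG block with 3 × 3, 1 × 1, and identity branches and fuses them into a single 3 × 3 convolution following Equations (1) and (2). It is a minimal sketch assuming stride 1 and equal input and output channels; the class and method names are chosen only for this example and this is not the official RepVGG implementation.

```python
import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time RepVGG block: 3x3 conv + BN, 1x1 conv + BN, identity BN."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn0 = nn.BatchNorm2d(channels)  # BN on the identity branch
        self.act = nn.ReLU()

    def forward(self, x):
        # Training-time information flow: y = f(x) + g(x) + x, each branch with its own BN.
        return self.act(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn0(x))

    @staticmethod
    def _fold_bn(kernel, bn):
        # Fold BN statistics into the kernel: W' = W * gamma / sigma, b' = beta - mu * gamma / sigma.
        sigma = torch.sqrt(bn.running_var + bn.eps)
        scale = (bn.weight / sigma).reshape(-1, 1, 1, 1)
        return kernel * scale, bn.bias - bn.running_mean * bn.weight / sigma

    @torch.no_grad()
    def reparameterize(self) -> nn.Conv2d:
        c = self.conv3.out_channels
        # Zero-pad the 1x1 kernel to 3x3 so its value sits at the central point.
        k1 = nn.functional.pad(self.conv1.weight, [1, 1, 1, 1])
        # View the identity branch as a 1x1 conv with an identity kernel, padded to 3x3.
        kid = torch.zeros(c, c, 3, 3, device=k1.device, dtype=k1.dtype)
        for i in range(c):
            kid[i, i, 1, 1] = 1.0
        k3, b3 = self._fold_bn(self.conv3.weight, self.bn3)
        k1, b1 = self._fold_bn(k1, self.bn1)
        kid, b0 = self._fold_bn(kid, self.bn0)
        fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
        fused.weight.copy_(k3 + k1 + kid)  # sum of the three 3x3 kernels (Figure 3)
        fused.bias.copy_(b3 + b1 + b0)     # sum of the three bias vectors
        return fused

# Sanity check: in eval mode, the fused conv reproduces the block's pre-activation output.
# block = RepVGGBlock(8).eval(); x = torch.randn(1, 8, 32, 32)
# fused = block.reparameterize()
# pre_act = block.bn3(block.conv3(x)) + block.bn1(block.conv1(x)) + block.bn0(x)
# assert torch.allclose(pre_act, fused(x), atol=1e-5)
```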
3.3 The Architecture of InsectDSOT
We used the Yolov5 model as the base model and RepVGG as the feature extractor. Since the primary purpose of the RepVGG block is to overcome some of the drawbacks of ConvNets, we replace the Conv layers in the first part (backbone) of Yolov5, as well as the final two Conv layers in the second part (neck), with RepVGG blocks. We only maintained three original Conv layers: one at the initial input of the model, while the other two are in the model's neck and have no direct influence on the model output, so we opted to retain them. Figure
4 illustrates the original Yolo’s architecture, with the
changed blocks of our InsectDSOT approach high-
lighted in red.
Figure 4: Modified Yolo version based on RepVGG block.
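For illustration, the sketch below shows how the backbone part of a Yolov5-style model configuration could look after this substitution, written as the [from, number, module, args] rows used in the yolov5 model YAML files. The layer arguments mirror the stock yolov5 v6 backbone, while the RepVGG module name and this exact layout are assumptions for illustration, not the exact configuration files of InsectDSOT.

```python
# Hypothetical backbone definition after the Conv -> RepVGG substitution,
# mirroring the [from, number, module, args] rows of the yolov5 v6 model YAML.
# The very first Conv (layer 0) is kept, as described above; "RepVGG" is an
# assumed module name registered alongside the other building blocks.
backbone = [
    [-1, 1, "Conv",   [64, 6, 2, 2]],   # 0: initial input conv, kept unchanged
    [-1, 1, "RepVGG", [128, 3, 2]],     # 1: stride-2 downsampling, Conv replaced
    [-1, 3, "C3",     [128]],           # 2
    [-1, 1, "RepVGG", [256, 3, 2]],     # 3
    [-1, 6, "C3",     [256]],           # 4
    [-1, 1, "RepVGG", [512, 3, 2]],     # 5
    [-1, 9, "C3",     [512]],           # 6
    [-1, 1, "RepVGG", [1024, 3, 2]],    # 7
    [-1, 3, "C3",     [1024]],          # 8
    [-1, 1, "SPPF",   [1024, 5]],       # 9
]
```

The two replaced Conv layers in the neck are changed in the same way.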
4 EXPERIMENTS
This section describes the used dataset, the conducted
experiments, and the obtained results.
4.1 Dataset
Our experiments are conducted on the DIRT (i.e.,
Dacus Image Recognition Toolkit) dataset (Kalama-
tianos et al., 2018). The liquid-based pheromone
McPhail traps are used to collect the DIRT images.
Soup images are produced when olive fruit flies are
trapped in liquid pheromone. There are 848 images
in the dataset. At least one olive fruit fly is shown
in every picture in DIRT. Originally, the dataset had
three classes: dacus, dacus male, and dacus female.
In our situation, we only consider the first class. Fig-
ures 5 and 6 depict two samples of the original DIRT
dataset and two samples of the augmented dataset, re-
spectively.
Figure 5: Original images of the DIRT dataset.
We removed several images from the augmented dataset since they do not match our criteria (i.e., images in which an insect occupies more than 10% of the entire image). The augmented dataset has 3877 images.
Figure 6: Augmented dataset: examples images.
Deep learning models often require a large amount
of training data to generate effective results. There-
fore, we applied data augmentation techniques. The
original 848 images were separated into three groups:
the training set (552 photos), the validation set (170
photos), and the final test set (126 photos). Then, we augmented all sets by applying zooming (from 10% to 40%) and rotation (from 10% to 60%) to each photo to obtain extra training and testing examples. After augmentation, we have 3877 images divided as follows: 2520 images for training, 775 images for validation, and 582 images for testing.
Table 2: Number of images in DIRT dataset.
Training Validation Final Test Total
DIRT 552 170 126 848
After DA 2520 775 582 3877
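The augmentation step can be sketched as follows, assuming torchvision transforms; the rotation range is interpreted here in degrees, the file name is made up, and the corresponding transformation of the bounding-box labels is omitted.

```python
# Zoom by 10-40% and rotate by 10-60 (interpreted as degrees) to create extra copies.
from PIL import Image
from torchvision import transforms

augmenter = transforms.RandomAffine(degrees=(10, 60), scale=(1.10, 1.40))

def augment(img: Image.Image, copies: int = 4) -> list:
    """Return `copies` randomly zoomed/rotated versions of a trap image."""
    return [augmenter(img) for _ in range(copies)]

# Example (hypothetical file name):
# extra = augment(Image.open("dirt/trap_0001.jpg"))
```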
In this paper, we used Google Colab to train each deep learning model for 100 epochs. We adopted the standard hyperparameters, since our study does not focus on hyperparameter tuning. The standard hyperparameters are: learning rate (lr) = 0.01; IoU training threshold = 0.2; objectness loss gain = 1.0; class loss gain = 0.5.
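For reference, these values correspond to the following overrides in a Yolov5-style hyperparameter file; the key names (lr0, iou_t, obj, cls) and the output file name follow the yolov5 hyp YAML convention and are assumptions for illustration, not an exact copy of the configuration used in the experiments.

```python
import yaml  # PyYAML

# Hyperparameter overrides listed above, written to a YOLOv5-style hyp file
# (key names assumed to follow the yolov5 hyp YAML convention).
hyp_overrides = {
    "lr0": 0.01,    # initial learning rate
    "iou_t": 0.20,  # IoU training threshold
    "obj": 1.0,     # objectness loss gain
    "cls": 0.5,     # class loss gain
}

with open("hyp.dirt.yaml", "w") as f:  # hypothetical file name
    yaml.safe_dump(hyp_overrides, f)
```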
4.2 Experimental Results
According to (Fawcett, 2006), the evaluation of an object detector is based on the following quantities: (i) True Positives (TP): the number of target insects that are correctly detected; (ii) False Negatives (FN): the number of target insects that are not detected; (iii) True Negatives (TN): the number of non-target objects that are correctly not detected as target insects; (iv) False Positives (FP): the number of non-target objects that are wrongly detected as target insects.
The above criteria need a definition of what constitutes a correct detection and an incorrect one. One frequent method is to use the intersection over union (IOU). In the object detection scope, the IOU divides the area of overlap between the predicted bounding box $B_p$ and the ground-truth bounding box $B_{gt}$ by the area of their union:

$$IOU = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}. \quad (3)$$

We can characterize a detection as correct or incorrect by comparing the IOU to a defined threshold $t$. If $IOU \geq t$, the detection is considered correct. If $IOU < t$, the detection is considered incorrect.
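A minimal sketch of Equation (3) and this thresholding rule, assuming boxes given as (x1, y1, x2, y2) pixel coordinates (which is not necessarily the DIRT label format):

```python
def iou(box_p, box_gt):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_p[0], box_gt[0]); y1 = max(box_p[1], box_gt[1])
    x2 = min(box_p[2], box_gt[2]); y2 = min(box_p[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

def is_correct(box_p, box_gt, t=0.5):
    # A prediction counts as correct when IoU >= t (t = 0.5 here only as an example).
    return iou(box_p, box_gt) >= t
```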
The remaining elements of the confusion matrix are as follows. The True Positive Rate (TPR), also known as sensitivity, is the fraction of target insects that are correctly detected:

$$TPR = \frac{TP}{TP + FN}. \quad (4)$$

The True Negative Rate (TNR), also known as specificity, is the fraction of non-insects that are correctly not detected:

$$TNR = \frac{TN}{TN + FP}. \quad (5)$$

The Positive Predictive Value (PPV), or precision, describes how many of the detections are truly positive:

$$PPV = \frac{TP}{TP + FP}. \quad (6)$$

The accuracy (ACC) is the number of correctly identified insects and non-insects divided by the total number of cases:

$$ACC(\%) = \frac{TN + TP}{TN + TP + FN + FP} \times 100. \quad (7)$$
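The following helper computes Equations (4)-(7) from raw confusion-matrix counts; it is a minimal sketch that assumes non-zero denominators.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute TPR, TNR, PPV and ACC from confusion-matrix counts (Eqs. 4-7)."""
    return {
        "TPR (recall)": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV (precision)": tp / (tp + fp),
        "ACC (%)": 100.0 * (tn + tp) / (tn + tp + fn + fp),
    }

# Example with made-up counts:
# detection_metrics(tp=120, fp=19, tn=40, fn=23)
```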
4.2.1 Yolov5 Performance
First, we trained the original versions of Yolov5 on the DIRT dataset, in order to compare them to our proposal. Table 3 shows the obtained results in terms of precision (Pr), recall (R), mAP, and model size.
Table 3: Original Yolov5 on the augmented DIRT dataset.
Model Name   Pr %   R %   mAP %   Size (MB)
Yolo N 82.9 81.4 84.4 3.9
Yolo S 83.9 83.7 85.2 14.4
Yolo M 85.1 84.2 85.7 42.3
Yolo L 86.0 84.6 86.2 92.9
Yolo X 86.4 85.5 87.7 173.3
As shown in Table 3, the main difference between the versions lies in their size relative to their performance. The difference in precision between versions X and N is 3.5 percentage points, while the X model is about 44 times larger (173.3 MB vs. 3.9 MB).
4.2.2 Evaluation of InsectDSOT
We focused on selecting lightweight feature extrac-
tion models with a limited set of trainable parameters.
This is primarily done to support a real-world appli-
cation by increasing the effectiveness of the feature
extraction step and assisting in lowering the amount
of data needed for fine-tuning. Deeper networks may
also have difficulties during training as a result of is-
sues like vanishing gradients brought on by the in-
creased network depth. When it comes to the employ-
ment of backbones, we choose one of two approaches:
First: remove the previous backbone and replace
it with the new backbone. In contrast to our
method and the Global Context block, most of the
used backbones adopt this strategy.
Second: replace only a section of the backbone
with the new block from the new model. This
method is used for both the Global Context and
our proposal.
In the case of RepVGG, we use both methodologies
in order to compare their outcomes. The confusion
matrix is shown in Figure 7.
Figure 7: Confusion Matrix of Yolov5 N with TR.
Table 4 displays the outcomes of both approaches.
We now compare our proposal to Yolov5’s origi-
nal versions and other Yolov5 models. We adhere to
the principle of transfer learning (TL). In our case, we
used the MS COCO (Microsoft Common Objects in Context) benchmark dataset (Lin et al., 2014) to train our model and then used the output model as pre-trained weights for further training. As a disadvantage
of using TL, only the precision of the model is raised;
the recall of our model is decreased by 1.2%, which
seems reasonable when compared to the 2.6% gain in
precision. Table 4 demonstrates that not all models perform well on tiny object detection, and some clearly outperform others. The two ShuffleNet variants perform poorly, in contrast to Yolo5Face, which, despite an output size of less than 1 MB, performs well enough on our dataset.
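This two-stage transfer learning step can be sketched as follows, assuming a plain PyTorch state-dict checkpoint; the file name, the 'model_state' key, and the class-count remark are assumptions for illustration, not the exact training pipeline used in the experiments.

```python
import torch
import torch.nn as nn

def init_from_coco(model: nn.Module, ckpt_path: str = "insectdsot_coco.pt") -> nn.Module:
    """Initialize a model (e.g., the modified Yolov5n) from a COCO-pretrained checkpoint.

    Only tensors whose name and shape match are copied, so the detection head
    (1 class on DIRT vs. 80 on COCO) keeps its fresh initialization.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model_state", state) if isinstance(state, dict) else state
    target = model.state_dict()
    filtered = {k: v for k, v in state.items() if k in target and v.shape == target[k].shape}
    model.load_state_dict(filtered, strict=False)
    return model

# ... then fine-tune on DIRT for 100 epochs with the hyperparameters listed above.
```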
We included InsectDSOT S, a larger configura-
tion of our proposal, to outperform all Yolov5 ver-
InsectDSOT: A Neural Network for Insect Detection in Olive Trees
339
Table 4: Comparison between used backbones.
Model Backbone   Version             Precision %   Recall %   mAP0.5 %   mAP0.5-0.95 %
MobileNet        V2                  79.2          78.7       82.4       31.7
MobileNet        V3 Small            78.3          77.3       80.1       29.3
MobileNet        V3 Large            81.4          79.2       84.5       31.1
ShuffleNet       V1                  74.0          72.4       77.3       26.7
ShuffleNet       V2                  74.5          75.3       78.3       26.8
Yolo5Face        BlazeFace           79.9          82.5       83.8       30.1
Yolo5Face        BlazeFPN            77.0          77.5       78.7       28.8
EfficientNet     EfficientNet-Lite   79.4          81.3       84.2       32.4
PP-LCNet         PP-LCNet            80.1          78.8       82.0       30.3
RepVGG           Entire RepVGG       83.6          82.9       86.2       34.0
GhostNet         GhostNet            82.8          82.5       85.2       31.7
Yolov3           Yolov3              83.7          82.5       86.2       32.6
Global Context   GC Net              80.5          84.1       85.2       30.8
InsectDSOT       RepVGG              83.5          85.1       86.4       33.7
InsectDSOT+TL    RepVGG              86.1          83.9       86.4       33.7
Figure 8: Detected image: (a) image in the labeling program; (b) output of our model; (c) difference between the two images.
Table 5: Our proposal compared with original Yolov5.
Model Name   Size (MB)   Precision (%)
Original N 3.9 82.9
Original S 14.4 83.3
Original M 42.3 85.1
Original L 92.9 86.0
InsectDSOT N 4.4 86.1
Original X 173.3 86.4
InsectDSOT S 15.4 87.0
sions. In terms of precision, InsectDSOT S outperforms Yolov5x by 0.6%. Table 5 displays the size and
precision of all models (Yolov5 and our InsectDSOT),
sorted from worst to best in terms of precision.
The image in Figure 8 was randomly selected
from the test dataset. 13 target insects were identi-
fied in the original image by experts in the labeling
process. We executed the detection procedure with
our proposal. The second image displays the output image, showing that our model successfully identified 12 out of 13 insects, which is a very good result. The missed insect is highlighted with a red rectangle to draw attention to it. We
built auto-counting into the model’s head. As a conse-
quence, the insect’s name and the number of detected
insects are added on top of the image, in addition to
the regular detection.
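The counting step can be sketched as follows, assuming the yolov5 torch.hub inference API and made-up file names; the overlay here is drawn with PIL for illustration rather than inside the model head.

```python
import torch
from PIL import Image, ImageDraw

# Load the trained detector (weights file name is an assumption).
model = torch.hub.load("ultralytics/yolov5", "custom", path="insectdsot.pt")

img = Image.open("trap_0001.jpg")            # hypothetical trap image
results = model(img)
boxes = results.xyxy[0]                      # tensor of (x1, y1, x2, y2, conf, cls) rows
count = int((boxes[:, 4] > 0.25).sum())      # detections above the confidence threshold

out = Image.fromarray(results.render()[0])   # image with the regular detection boxes drawn
ImageDraw.Draw(out).text((10, 10), f"dacus: {count}", fill="red")
out.save("trap_0001_counted.jpg")
```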
5 CONCLUSION
We presented InsectDSOT, an improved version of
Yolov5 for insect detection. The proposed approach
replaces the Conv layers in the original model's backbone and neck with RepVGG blocks. We use transfer learning to improve performance by (i) training our model on the MS COCO dataset and then (ii) using the resulting weights to initialize training on the target dataset. Our proposal has been validated
using the DIRT (Dacus Image Recognition Toolkit)
dataset. The obtained results demonstrate that InsectDSOT achieves 86.1% precision. It outperforms four versions of the original Yolov5 as well as Yolov5-based versions with modified backbones based on lightweight models. InsectDSOT achieves good results in terms of precision and recall when compared to the other approaches.
In the future, we plan to collect a dataset that in-
cludes many types of olive insects. Also, we will de-
ploy our model on edge devices. Furthermore, since classification support is included in the most recent Yolov5 release, we plan to integrate this functionality into our model, as more than one type of insect pest can damage olive trees.
ACKNOWLEDGEMENTS
This work is supported by the German Academic Ex-
change Service (DAAD), Germany (Transformation
Partnership: OLIVIA Project).
REFERENCES
Bjerge, K., Mann, H. M., and Høye, T. T. (2022). Real-time
insect tracking and monitoring with computer vision
and deep learning. Remote Sensing in Ecology and
Conservation, 8(3):315–327.
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. arXiv preprint arXiv:2004.10934.
Cabrera, J. and Villanueva, E. (2022). Investigating gener-
ative neural-network models for building pest insect
detectors in sticky trap images for the peruvian horti-
culture. In Annual International Conference on Infor-
mation Management and Big Data, pages 356–369.
Springer.
Cardoso, B., Silva, C., Costa, J., and Ribeiro, B. (2022).
Internet of things meets computer vision to make an
intelligent pest monitoring network. Applied Sciences,
12(18):9397.
Carvajal-Yepes, M., Cardwell, K., Nelson, A., Garrett,
K. A., Giovani, B., Saunders, D. G., Kamoun, S.,
Legg, J., Verdier, V., Lessel, J., et al. (2019). A
global surveillance system for crop diseases. Science,
364(6447):1237–1239.
Chen, Y., Yang, J., Wang, J., Zhou, X., Zou, J., and Li,
Y. (2022). An improved yolov5 real-time detection
method for aircraft target detection. In 2022 27th In-
ternational Conference on Automation and Comput-
ing (ICAC), pages 1–6. IEEE.
Cui, C., Gao, T., Wei, S., Du, Y., Guo, R., Dong, S., Lu, B.,
Zhou, Y., Lv, X., Liu, Q., et al. (2021). Pp-lcnet: A
lightweight cpu convolutional neural network. arXiv
preprint arXiv:2109.15099.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-fcn: Object de-
tection via region-based fully convolutional networks.
Advances in neural information processing systems,
29.
Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., and Sun,
J. (2021). Repvgg: Making vgg-style convnets great
again. 2021 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 13728–
13737.
FAO (2021). Food and agriculture organization of the
united nations. https://www.fao.org/news/story/en/
item/1402920/icode/. Accessed: 2022-10-25.
Fawcett, T. (2006). An introduction to roc analysis. Pattern
recognition letters, 27(8):861–874.
Girshick, R. B. (2015). Fast r-cnn. 2015 IEEE International
Conference on Computer Vision (ICCV), pages 1440–
1448.
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., and Xu, C.
(2020). Ghostnet: More features from cheap opera-
tions. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 1580–
1589.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778. IEEE.
He, Y., Zeng, H., Fan, Y., Ji, S., and Wu, J. (2019). Ap-
plication of deep learning in integrated pest manage-
ment: A real-time system for detection and diagnosis
of oilseed rape pests. Mobile Information Systems,
2019.
Hou, Z., Sun, Y., Guo, H., Li, J., Ma, S., and Fan, J. (2022).
M-yolo: an object detector based on global context
information for infrared images. Journal of Real-Time
Image Processing, pages 1–14.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B.,
Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V.,
et al. (2019). Searching for mobilenetv3. In Pro-
ceedings of the IEEE/CVF international conference
on computer vision, pages 1314–1324.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional neu-
ral networks for mobile vision applications. arXiv
preprint arXiv:1704.04861.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In International conference on ma-
chine learning, pages 448–456. PMLR.
InsectDSOT: A Neural Network for Insect Detection in Olive Trees
341
Jocher, G. R., Stoken, A., and Chaurasia, A. (2021a). ul-
tralytics/yolov5: v4.0 - nn.silu() activations. https:
//zenodo.org/record/4418161. Accessed: 2022-11-10.
Jocher, G. R., Stoken, A., and Chaurasia, A. (2021b). ultr-
alytics/yolov5: v6.0 - yolov5n ’nano’ models. https:
//zenodo.org/record/5563715. Accessed: 2022-11-10.
Jocher, G. R., Stoken, A., and Chaurasia, A. (2022a). ul-
tralytics/yolov5: v6.1 - tensorrt, tensorflow edge tpu
and openvino export and inference. Zenodo. https://zenodo.org/record/6222936. Ac-
cessed: 2022-11-10.
Jocher, G. R., Stoken, A., and Chaurasia, A. (2022b).
Yolov5 classification models, apple m1, reproducibil-
ity, clearml and deci.ai integrations. Zenodo. https://zenodo.org/record/7002879. Accessed: 2022-
11-10.
Kalamatianos, R., Karydis, I., Doukakis, D., and Avlonitis,
M. (2018). Dirt: The dacus image recognition toolkit.
Journal of Imaging, 4(11):129.
Li, K., Zhu, J., and Li, N. (2021a). Insect detection and
counting based on yolov3 model. In 2021 IEEE 4th
International Conference on Electronics Technology
(ICET), pages 1229–1233. IEEE.
Li, K., Zhu, J., and Li, N. (2021b). Lightweight auto-
matic identification and location detection model of
farmland pests. Wireless Communications and Mobile
Computing, 2021.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Liu, Q., Yan, Z., Wang, F., and Ding, C. (2021). Research
on object detection algorithm for small object of pests
based on yolov3. In 2021 International Conference
on Computer Information Science and Artificial Intel-
ligence (CISAI), pages 14–18.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Ma, N., Zhang, X., Zheng, H., and Sun, J. (2018). Shuf-
flenet v2: Practical guidelines for efficient cnn archi-
tecture design. In ECCV, volume 11218, pages 122–
138. Springer.
Mamdouh, N. and Khattab, A. (2021). Yolo-based deep
learning framework for olive fruit fly detection and
counting. IEEE Access, 9:84252–84262.
Patel, D. J. and Bhatt, N. (2019). Insect identification
among deep learning’s meta-architectures using ten-
sorflow. Int. J. Eng. Adv. Technol, 9(1):1910–1914.
Qi, D., Tan, W., Yao, Q., and Liu, J. (2021). Yolo5face:
Why reinventing a face detector. ArXiv preprint
ArXiv:2105.12931.
Qian, Y., Miao, Y., Huang, S., Qiao, X., Wang, M., Li, Y.,
Luo, L., Zhao, X., and Cao, L. (2022). Real-time
detection of eichhornia crassipes based on efficient
yolov5. Machines, 10(9):754.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and
Dollár, P. (2020). Designing network design spaces.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 10428–
10436.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. ArXiv, abs/1804.02767.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residu-
als and linear bottlenecks. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 4510–4520.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A.
(2017). Inception-v4, inception-resnet and the im-
pact of residual connections on learning. In Thirty-
first AAAI conference on artificial intelligence, pages 4278–4284. ACM.
Takimoto, H., Sato, Y., Nagano, A. J., Shimizu, K. K.,
and Kanagawa, A. (2021). Using a two-stage con-
volutional neural network to rapidly identify tiny her-
bivorous beetles in the field. Ecological Informatics,
66:101466.
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional conference on machine learning, pages 6105–
6114. PMLR.
Teng, Y., Zhang, J., Dong, S., Zheng, S., and Liu, L. (2022).
Msr-rcnn: A multi-class crop pest detection network
based on a multi-scale super-resolution feature en-
hancement module. Frontiers in Plant Science, 13.
Tresson, P., Tixier, P., Puech, W., and Carval, D. (2019).
Insect interaction analysis based on object detection
and cnn. In 2019 IEEE 21st International Workshop
on Multimedia Signal Processing (MMSP), pages 1–6.
IEEE.
Veit, A., Wilber, M. J., and Belongie, S. (2016). Residual
networks behave like ensembles of relatively shallow
networks. Advances in neural information processing
systems, 29.
Yang, Z., Yang, X., Li, M., and Li, W. (2021). Automated
garden-insect recognition using improved lightweight
convolution network. Information Processing in Agri-
culture.
Yuan, Z., Fang, W., Zhao, Y., and Sheng, V. S. (2021).
Research of insect recognition based on improved
yolov5. Journal of Artificial Intelligence, 3(4):145.
Zha, M., Qian, W., Yi, W., and Hua, J. (2021). A lightweight
yolov4-based forestry pest detection method using
coordinate attention and feature fusion. Entropy,
23(12):1587.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. (2018).
Learning transferable architectures for scalable image
recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 8697–
8710.