How to Train an Accurate and Efficient Object Detection Model on any Dataset
Galina Zalesskaya (Intel, Israel), Bogna Bylicka (Intel, Poland) and Eugene Liu (Intel, U.K.)
Keywords:
Deep Learning, Computer Vision, Object Detection, Light-Weight Models.
Abstract:
The rapidly evolving industry demands high accuracy of the models without the need for time-consuming and
computationally expensive experiments required for fine-tuning. Moreover, a model and training pipeline,
which was once carefully optimized for a specific dataset, rarely generalizes well to training on a different
dataset. This makes it unrealistic to have carefully fine-tuned models for each use case. To solve this, we
propose an alternative approach that also forms a backbone of Intel® Geti™ platform: a dataset-agnostic
template for object detection trainings, consisting of carefully chosen and pre-trained models together with a
robust training pipeline for further training. Our solution works out-of-the-box and provides a strong baseline
on a wide range of datasets. It can be used on its own or as a starting point for further fine-tuning for specific
use cases when needed. We obtained dataset-agnostic templates by performing parallel training on a corpus
of datasets and optimizing the choice of architectures and training tricks with respect to the average results
on the whole corpus. We examined a number of architectures, taking into account the performance-accuracy
trade-off. Consequently, we propose 3 finalists, VFNet, ATSS, and SSD, that can be deployed on CPU using
the OpenVINO™ toolkit. The source code is available as a part of the OpenVINO™ Training Extensions:
https://github.com/openvinotoolkit/training_extensions.
1 INTRODUCTION
Over recent years, deep learning methods have been
increasingly used for Computer Vision tasks, includ-
ing Object Detection. While most of the architec-
tures are optimized on well-known benchmarks like
COCO (Lin et al., 2014) (Wang et al., 2020) (Zhang et al.,
2022), further remarkable results have been obtained
using CNNs for domain-specific tasks, such as medical
(Yang and Yu, 2021), traffic (Ye et al., 2020), and
environmental (Li et al., 2020) ones. Such domain-specific
solutions, however, are usually heavily optimized for
a specific target dataset, from a carefully tailored
architecture to the choice of training tricks. This
not only makes them time-consuming and computa-
tionally heavy to obtain but also is the reason why
they lack good domain generalization. Training models
in such a way overly adjusts the techniques to a specific
dataset: it yields great predictions on that dataset but
poor transferability to a different one. Such a model is
therefore unlikely to train well on a new, previously
unseen dataset without additional time-consuming and
computationally expensive experiments to find the optimal
tweaks of the model architecture and training pipeline.
Hence, such an approach, even though used in the past
with great success, is not flexible enough for the rapidly
evolving needs across a wide variety of computer vision
use cases with a focus on time to value.
In this paper we propose an alternative approach
aimed at those situations, addressing the needs of fast-
paced industry and real-world applications. We pro-
vide a dataset-agnostic template for object detection
that is ready to be used straight away and guarantees a
strong baseline on a wide variety of datasets, without
needing additional time-consuming and computation-
ally demanding experiments. The template also provides
the flexibility for further fine-tuning; however,
it is not required. This methodology includes a care-
fully chosen and pre-trained model, together with a
robust and efficient training pipeline for further train-
ing.
To achieve this, we base this study on a care-
fully chosen corpus of 11 datasets that cover a wide
range of domains and sizes. They differ in both num-
ber of images as well as their complexity, including
the number of classes, object sizes, distinguishabil-
ity from the background, and resolution. The train-
ing procedure involves parallel experiments on all the
datasets from the corpus, optimizing the model with
respect to average metrics over the whole corpus.
To summarize, we introduce:
• a training paradigm together with a choice of a corpus of datasets;
• a universal, dataset-agnostic model template for object detection training;
• additional patience parameters for ReduceOnPlateau and Early Stopping, to correct for the different sizes of datasets and avoid over/under-fitting;
• a choice of data-agnostic augmentations and specific tricks for each of our 3 models that work on a wide range of datasets.
This methodology, as part of the Intel® Geti™ computer
vision platform, enables users to accelerate their
object detection training workloads for real-world
use cases across a variety of industries.
2 RELATED WORK
2.1 Architectures
In this section, we will briefly describe the architec-
tures we explore in our experiments.
SSD. Since its introduction, SSD (Liu et al., 2016)
has been a go-to model in the industry, thanks to its
high computational efficiency. As an anchor-based
detector, it predicts detections on the predefined an-
chors. All generated predictions are classified as pos-
itive or negative, and then an extra offset regression
on the positives is done for the final refinement of the
bounding box location. However, anchor-based models
present a serious disadvantage: they need manual tuning
of the number of anchors, and of their ratios and scales.
This fact negatively affects their ability to generalize
to different datasets.
FCOS. Anchor-free detectors address these con-
cerns. Instead of anchor boxes, they use single points
for detection, determining the center of the object.
FCOS (Tian et al., 2019) approaches object detection
in a per-pixel way, making use of the multi-level pre-
dictions thanks to FPN (Feature Pyramid Network).
Since the predictions are made for pixels, a new way
to define positives and negatives is introduced to re-
place the traditional IoU-based strategy. This brings
further improvements. Additionally, to suppress
low-quality detections during NMS, centerness, which
estimates localization quality, is used in combination
with the classification score as a new ranking measure.
YOLOX. (Ge et al., 2021) is a new anchor-free
member of another classic family of models. As a
basis it takes YOLOv3 (Redmon and Farhadi, 2018),
still widely used in the industry, and boosts it with
modern tricks, including a decoupled head, to deal
with the classification vs. regression conflict, and
an advanced label assignment strategy, SimOTA.
ATSS. (Zhang et al., 2020) improves detection performance
by introducing Adaptive Training Sample Selection,
a method that automatically selects positive and
negative samples based on the statistical
characteristics of objects.
VFNet. (Zhang et al., 2021) builds on FCOS and
ATSS advances, topping it up with a new ranking
measure for the NMS stage, to suppress low-quality
detections even more efficiently. The proposed metric,
the IoU-aware classification score, combines
information on the classification score and
localization estimation in a more natural way and in
only one variable. This not only gives better results
in terms of the quality of detections but also avoids
the overhead of the additional branch for localization
score regression.
Faster R-CNN and Cascade R-CNN. All the mod-
els considered so far were one-stage detectors. Faster
R-CNN (Ren et al., 2015) is a two-stage detector:
in the first stage the model coarsely scans the
whole image to generate region proposals, on which
it concentrates in more detail in the second stage. A
Region Proposal Network is introduced for the effective
generation of region proposals. It is an anchor-based
detector, with positive and negative samples defined
through a fixed IoU threshold. Cascade R-CNN
(Cai and Vasconcelos, 2018) is a multi-stage detector
that addresses the limitations of this single threshold:
it consists of a sequence of detectors trained with
increasing IoU thresholds.
2.2 Towards More Universal Models
Recently a lot of work has been done in the area
of moving towards more flexible, generalizable mod-
els in computer vision. This ranges from multi-task
training (Crawshaw, 2020), where a model is simul-
taneously trained to perform different tasks, to do-
main generalization (Gulrajani and Lopez-Paz, 2021),
where the goal is to build models that can predict well
on datasets out-of-distribution from those seen during
training. Another interesting direction is a few-shot
dataset generalization (Triantafillou et al., 2021). In
this approach, parallel training on diverse datasets is
used to build a universal template that can adapt to
unseen data using only a few examples. Similar work
to ours has been done in the classification domain
(Prokofiev and Sovrasov, 2021), where adaptive training
strategies are proposed and validated by training
lightweight models on a number of diverse image
classification datasets.
3 METHOD
This section describes in detail our approach to training
dataset-agnostic models, from the architecture to the
training tricks.
3.1 Architecture
To build a set of models applicable to various object
detection datasets regardless of difficulty and object
size, we experimented with architectures in three
categories: light-weight (about 10 GFLOPs), highly
accurate, and medium. A detailed comparison of the
different architectures used can be found in Table 2.
For the fast and medium models we chose the MobileNetV2
backbone (Sandler et al., 2018), which is light yet
accurate enough and commonly used in applications.
For the accurate model we used ResNet50 (He et al., 2016)
as a backbone.
For the light model we used a very fast SSD with 2
heads, which can be further optimized for CPU using
the OpenVINO™ toolkit; for the medium model, an ATSS head;
and for the accurate model, VFNet with modified deformable
convolutions (DCNv2 (Zhu et al., 2019)). In our
experiments, as described later, we were not limited
to these architectures to achieve specific goals of
performance and accuracy. For example, we also tried
a lighter ATSS as a candidate for the fast model, and
an SSD with more heads as a candidate for the medium
model, but the final setup with 3 different architectures
worked best.
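For reference, the three final templates can be summarized in code as below. This is a minimal sketch: the field names are purely illustrative and not tied to any framework API, while the GFLOPs and input resolutions are the values reported in this paper.

```python
# Illustrative summary of the three dataset-agnostic templates; field names are
# hypothetical, the numbers come from the resolutions and GFLOPs reported in this paper.
DETECTION_TEMPLATES = {
    "fast":     {"head": "SSD (2 heads)",     "backbone": "MobileNetV2",
                 "input_resolution": (864, 864),  "gflops": 9.36},
    "medium":   {"head": "ATSS",              "backbone": "MobileNetV2",
                 "input_resolution": (992, 736),  "gflops": 20.86},
    "accurate": {"head": "VFNet (with DCNv2)", "backbone": "ResNet50",
                 "input_resolution": (1344, 800), "gflops": 347.78},
}
```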
3.2 Training Pipeline
Pretraining Weights. To achieve fast model con-
vergence and to get high accuracy right from the
start, we used pretrained weights. First, we trained
a MobileNetV2 classification model on ImageNet-21k
(Ridnik et al., 2021), so the model could learn to
distinguish basic features and patterns. Afterward, we
kept only the trained MobileNetV2 backbone weights,
cutting off the fully connected classifier head. Then,
we used this pretrained backbone for our object detection
models and trained ATSS-MobileNetV2 and
SSD-MobileNetV2 on the COCO object detection
dataset (Lin et al., 2014) with 80 annotated classes.
For VFNet we used pretrained COCO weights from
OpenMMLab MMDetection.
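A minimal PyTorch-style sketch of the backbone weight transfer described above is given below; the checkpoint path, the "backbone." key prefix, and the filtering rule are assumptions that depend on how the classification model and the detector are actually defined.

```python
import torch

def load_pretrained_backbone(detector, ckpt_path="mobilenet_v2_imagenet21k.pth"):
    # Hypothetical checkpoint produced by the ImageNet-21k classification pretraining.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)

    # Keep only the backbone weights, dropping the fully connected classifier head;
    # the key names here are assumptions about the saved model's structure.
    backbone_weights = {
        k: v for k, v in state_dict.items()
        if k.startswith("backbone.") and "classifier" not in k
    }

    # strict=False tolerates detection-specific modules with no pretrained weights.
    return detector.load_state_dict(backbone_weights, strict=False)
```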
Augmentation. For augmentation, we used some
classic tricks that have gained the trust of researchers.
We augmented images with a random crop, horizontal
flip, and brightness and color distortions. The multi-
scale training was used for the medium and accurate models
to make them more robust. Also, after a series of
experiments we carefully and empirically chose certain
resolutions for each model in order to meet the
trade-off between accuracy and complexity. For example,
SSD performed best on a square 864x864 resolution,
ATSS on a rectangular one (992x736), and for VFNet we
used the default high resolution without any tuning
(1344x800). During resizing we did not
keep the initial aspect ratio for 3 reasons:
1. to add extra regularization;
2. to avoid blank spaces after padding;
3. to avoid spending too much time on image resizing
if the image aspect ratio differs a lot from the
default.
The impact of these tricks on accuracy and training
time can be found in the Ablation study (section 4.4).
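To make this concrete, below is a sketch of a training data pipeline in MMDetection 2.x config style (the framework used in this work) for the medium ATSS template. Only the base 992x736 resolution and the decision not to keep the aspect ratio come from the text above; the specific transforms, their parameters, and the additional multi-scale values are illustrative assumptions, not the published configuration.

```python
# Sketch of a train pipeline for the medium (ATSS) template, MMDetection 2.x style.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='PhotoMetricDistortion'),        # brightness and color distortions
    dict(type='MinIoURandomCrop'),             # random crop
    dict(type='Resize',                        # multi-scale training around 992x736
         img_scale=[(992, 736), (896, 672), (1088, 800)],
         multiscale_mode='value',
         keep_ratio=False),                    # aspect ratio deliberately not preserved
    dict(type='RandomFlip', flip_ratio=0.5),   # horizontal flip
    dict(type='Normalize',
         mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375],
         to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
```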
Anchor Clustering for SSD. Despite the wide application
of the SSD model, especially in production, thanks to its
high inference speed and good enough results, its
dependence on careful anchor box tuning is a huge
drawback: the optimal anchors depend on the dataset and
object size characteristics. This makes it a challenge to
build dataset-agnostic training using the SSD model. For
that reason, we followed the logic in (Anisimov and
Khanova, 2017) and used information from the training
dataset to modify the anchor boxes before training
starts. We collect object size statistics and cluster
them using the KMeans algorithm to find the optimal
anchor boxes with the highest overlap with the objects.
These modified anchor boxes are then passed
for training. This procedure is used for each dataset
and helps to make the light model more adaptive, im-
proving the result without additional model complex-
ity overhead.
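A minimal sketch of this statistics-based reclustering is shown below, assuming the ground-truth box sizes have already been rescaled to the network input resolution; the function name and the choice of sorting anchors by area are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_sizes(box_wh, n_anchors=4, seed=0):
    """Cluster training-set (width, height) pairs to obtain SSD anchor sizes.

    box_wh: (N, 2) array of ground-truth box sizes, already scaled to the
    network input resolution. Returns n_anchors (width, height) centers,
    sorted by area so that smaller anchors can go to earlier heads.
    """
    box_wh = np.asarray(box_wh, dtype=np.float32)
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(box_wh)
    centers = km.cluster_centers_
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

The resulting sizes would then replace the default anchor widths and heights of the SSD heads before training starts.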
Early Stopping and ReduceOnPlateau. The com-
plexity and size of the input dataset can vary drastically,
making it hard to predict the number of epochs
needed to get a satisfying result; this can often lead
to long training times or even overfitting. To prevent
this, we used Early Stopping (Prechelt, 2000) to stop
the training after a few epochs if they did not improve
the result further, using the average precision
(AP@[0.5,0.95]) on the validation subset as the target
metric. This problem with the unfixed
number of epochs also creates uncertainty in determining
the learning rate schedule for a given scenario.
To circumvent this, we utilized the adaptive
ReduceOnPlateau scheduler in tandem with Early Stopping,
as is commonly done. This scheduler reduces the learning
rate automatically if consecutive epochs do not lead to
improvement in the target metric, the average precision
on the validation dataset in our case. We implemented
both of these algorithms for the MMDetection
(Chen et al., 2019) framework that forms the backbone
of our codebase, an OpenVINO™ toolkit fork of the
OpenMMLab repository, which also powers the Intel Geti
computer vision platform.
Iteration Patience. Another challenge is that the
number of iterations in the epoch differs drastically
from dataset to dataset depending on its size. This
makes it difficult to choose an appropriate "patience"
parameter for Early Stopping and ReduceOnPlateau
in the case of dataset-agnostic training. For example,
if we use the same patience for large datasets and
small ones, the training with a small amount of data
will stop too early and will suffer from undertraining.
A possible solution here would be to copy the training
dataset several times for every training epoch, increas-
ing the number of iterations in the epoch. However,
this will also lead to large datasets suffering from
overfitting and long training times.
To overcome this, we adapted the classic approaches
of ReduceOnPlateau and Early Stopping by adding an
iteration patience parameter. It works similarly to
the epoch patience parameter but additionally ensures
that a specific number of iterations has been run
during those epochs before the criterion triggers.
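The sketch below illustrates the combined criterion, here for Early Stopping (the ReduceOnPlateau counterpart would track the same two counters before dropping the learning rate). The class and parameter names are illustrative; the actual implementation is an MMDetection training hook.

```python
class EarlyStoppingWithIterationPatience:
    """Stop only when the metric has stagnated for `epoch_patience` epochs AND
    at least `iteration_patience` iterations have run since the best epoch."""

    def __init__(self, epoch_patience=5, iteration_patience=1000):
        self.epoch_patience = epoch_patience
        self.iteration_patience = iteration_patience
        self.best_score = float("-inf")
        self.epochs_since_best = 0
        self.iters_since_best = 0

    def should_stop(self, val_score, iters_in_epoch):
        # Called once per epoch with the validation AP and the epoch's iteration count.
        self.iters_since_best += iters_in_epoch
        if val_score > self.best_score:
            self.best_score = val_score
            self.epochs_since_best = 0
            self.iters_since_best = 0
        else:
            self.epochs_since_best += 1
        return (self.epochs_since_best >= self.epoch_patience
                and self.iters_since_best >= self.iteration_patience)
```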
This approach increased the accuracy on the small
datasets; however, it has some drawbacks as well.
The iteration patience can vary from architecture
to architecture and must be chosen manually, which
adds an extra hyperparameter. This leads to a less
universal training pipeline, and careful hyperparameter
selection is required when trying new architectures,
which can be time-consuming.
Another caveat is that on small datasets a large
part of the training cycle is still spent on validation
and checkpoint creation because of the short epochs.
As a result, up to 50% of the training time is spent on
not-so-useful checkpoints which are clearly undertrained
and not really informative. Ideally, the time spent on
training should therefore be maximized compared to the
time spent on validation and checkpoint creation.
4 EXPERIMENTS
4.1 Datasets
For training purposes, we used 11 public datasets
varying in terms of the domains, number of images,
classes, object sizes, overall difficulty, and horizon-
tal/vertical alignment. Most of them were obtained
from the Roboflow public dataset storage. BCCD, Pothole,
WGISD1 (Santos et al., 2020), WGISD5 (Santos et al., 2020)
and Wildfire datasets represent the category with a small
number of training images. Here WGISD1 and WGISD5 share
the same images, but WGISD1 annotates all objects as
1 class (grape), while WGISD5 classifies each grape
variety. Aerial, Dice, MinneApple (Häni et al., 2020)
and PCB (Huang and Wei, 2019) represent datasets with
small objects, whose average size is less than 5% of
the image. PKLOT and UNO are large datasets with more
than 4000 images.
The samples of the images can be found in Fig-
ure 1. For illustration purposes, images were cropped
to the square format. The detailed statistics with split-
ting into sections can be found in Table 1.
4.2 Evaluation Protocol
For validation, we used a COCO Average Precision
with IoU threshold from 0.5 to 0.95 (AP@[0.5,0.95])
as the common metric for the object detection task.
AP = \sum_n (R_n - R_{n-1}) P_n
where P_n and R_n are the precision and recall at the
n-th threshold. AP averages 10 values because the IoU
thresholds belong to [0.5, 0.95] with an interval of 0.05.
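As a minimal illustration of the formula above, the snippet below computes AP at a single IoU threshold from precision/recall values sorted by decreasing confidence; it is a sketch of the definition, not the interpolated pycocotools implementation typically used for COCO evaluation.

```python
import numpy as np

def average_precision(precision, recall):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0."""
    precision = np.asarray(precision, dtype=np.float64)
    recall = np.concatenate(([0.0], np.asarray(recall, dtype=np.float64)))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

# AP@[0.5,0.95] then averages this value over the 10 IoU thresholds 0.50, 0.55, ..., 0.95.
```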
We collected train, validation, and test AP for
each dataset, evaluating the final weights on the
train, validation, and test subsets respectively.
Figure 1: Samples of the Training Datasets (BCCD, Pothole, Wildfire, WGISD1 and WGISD5, PKLot, Aerial, Dice, MinneApple, PCB, Uno), grouped into small datasets, small objects, and large datasets.
Table 1: Object Detection Training Datasets.
Dataset | Number of classes | Average object size, % of image | Train images | Validation images
BCCD | 3 | 17x21 | 255 | 73
Pothole | 1 | 23x14 | 465 | 133
WGISD1 | 1 | 7x16 | 110 | 27
WGISD5 | 5 | 7x16 | 110 | 27
Wildfire | 1 | 18x15 | 516 | 147
Aerial | 5 | 4x6 | 52 | 15
Dice | 6 | 5x3 | 251 | 71
MinneApple | 1 | 4x2 | 469 | 134
PCB | 6 | 3x3 | 333 | 222
PKLOT | 2 | 5x8 | 8691 | 2483
UNO | 14 | 7x7 | 6295 | 1798
The difference between train and validation accuracy
was used to monitor whether the model was overfitting.
In addition,
the test AP showed how the used techniques affected the
accuracy on the unseen parts of the datasets and thus
the generalization ability. The validation AP acted as
the main metric for comparing and choosing the best
approach.
To rank the different training strategies and archi-
tectures, the train, validation, and test AP scores for
each dataset were averaged across the data corpus.
Each of these AP scores contributes equally to the fi-
nal metric without any weighting strategy. As an aux-
iliary metric we used the same average AP scores but
on different subsets:
• on datasets with small objects, where the average object size is less than 5% of the image, to monitor how the network handles the hard task of small object detection;
• on large datasets that contain more than 4000 images, to monitor how the network uses the potential of a large dataset and how its size impacts the training time.
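A minimal sketch of this corpus-level ranking is given below; the dataset groupings follow Section 4.1, while the function and variable names are illustrative.

```python
import numpy as np

SMALL_OBJECT_DATASETS = {"Aerial", "Dice", "MinneApple", "PCB"}  # avg object < 5% of image
LARGE_DATASETS = {"PKLOT", "UNO"}                                # more than 4000 images

def corpus_metrics(val_ap):
    """Unweighted average AP over the whole corpus and over the two auxiliary subsets.

    val_ap: dict mapping dataset name to its validation AP@[0.5,0.95].
    """
    return {
        "avg_ap": float(np.mean(list(val_ap.values()))),
        "avg_ap_small_objects": float(np.mean(
            [v for k, v in val_ap.items() if k in SMALL_OBJECT_DATASETS])),
        "avg_ap_large_datasets": float(np.mean(
            [v for k, v in val_ap.items() if k in LARGE_DATASETS])),
    }
```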
4.3 Results
After more than 300 training rounds, we have cho-
sen the 3 best architectures in terms of accuracy-
performance trade-off: the lightweight and fast, the
medium, and the heavy but highly accurate. These
are SSD on MobileNetV2, ATSS on MobileNetV2, and VFNet
on ResNet50, respectively. To increase the overall
accuracy and keep the training time within acceptable
ranges we applied some common and architecture-aware
practices, which are listed below and described in
detail in the Training pipeline section 3.2. The
contribution of each of these techniques to both
improving the accuracy and reducing the training time
can be found in the Ablation study (section 4.4).
To choose the best models, we also experimented
with well-known architectures in the scope of the
lightweight, medium, and heavy models, such as
YOLOX, FCOS, Faster R-CNN, and Cascade R-CNN.
Table 2: Architecture Comparison.
Model | AVG AP, % | GFLOPs | AVG AP on small datasets, % | AVG AP on small object datasets, % | AVG AP on large datasets, %
Fast models
YOLOX | 49.6 | 6.45 | 48.8 | 35.1 | 80.7
SSD | 52.0 | 9.36 | 46.9 | 38.7 | 91.2
Medium models
FCOS | 56.7 | 20.86 | 52.2 | 43.8 | 93.8
ATSS | 60.3 | 20.86 | 55.2 | 49.7 | 94.1
Accurate models
Faster R-CNN | 61.8 | 433.40 | 55.4 | 53.7 | 94.0
Cascade R-CNN | 62.5 | 488.54 | 56.2 | 55.0 | 93.6
VFNet | 64.7 | 347.78 | 59.5 | 56.1 | 95.2
The average AP on the validation subsets is reported here.
Table 3: Detailed Results of the Chosen Models on All Datasets.
Model | BCCD | Pothole | WGISD1 | WGISD5 | Wildfire | Aerial | Dice | MinneApple | PCB | PKLOT | UNO
Fast models
YOLOX | 63.5 | 44.7 | 44.1 | 47.7 | 47.7 | 24.2 | 58.2 | 23.0 | 34.9 | 80.9 | 80.5
SSD | 60.0 | 37.5 | 45.9 | 40.2 | 51.0 | 22.7 | 61.3 | 31.1 | 39.8 | 94.7 | 87.7
Medium models
FCOS | 63.5 | 42.4 | 55.2 | 55.0 | 44.7 | 29.3 | 59.8 | 39.5 | 46.7 | 97.3 | 90.2
ATSS | 63.5 | 47.9 | 57.5 | 55.4 | 51.5 | 36.8 | 73.8 | 41.0 | 47.3 | 97.7 | 90.5
Accurate models
Faster R-CNN | 64.2 | 50.2 | 59.4 | 52.3 | 51.1 | 45.3 | 77.8 | 40.9 | 50.8 | 97.3 | 90.7
Cascade R-CNN | 65.5 | 51.2 | 60.5 | 53.2 | 50.4 | 45.9 | 79.5 | 42.7 | 51.7 | 95.9 | 91.3
VFNet | 65.2 | 55.1 | 63.8 | 59.5 | 53.9 | 46.3 | 81.2 | 44.4 | 52.4 | 98.6 | 91.8
The average AP on the validation subsets is reported here. BCCD-Wildfire are the small datasets, Aerial-PCB the datasets with small objects, and PKLOT-UNO the large datasets.
We have tuned parameters, resolution, training
strategy, and augmentation individually to achieve the
best accuracy with comparable complexity. The accuracy
and complexity figures from these architecture
experiments can be found in Table 2.
For all of the models we used the following training setup:
• start from the weights pretrained on COCO;
• augment images with a crop, flip, and photo distortions;
• use the ReduceOnPlateau learning rate scheduler with iteration patience;
• use Early Stopping to prevent overfitting on the large datasets and iteration patience to prevent underfitting on the small ones.
Additional special techniques included:
• for SSD: anchor box reclustering and pretraining the COCO weights on ImageNet-21k (Ridnik et al., 2021);
• for ATSS: multiscale training and using ImageNet-21k weights;
• for VFNet: multiscale training.
Detailed results of the validation accuracy of the
described models on all datasets are given in Table 3.
4.4 Ablation Study
To estimate the impact of each training trick on the
final accuracy, we conducted ablation experiments by
removing each of them from the training pipeline.
Based on our experiments, each of these tricks improved
the target metric by approximately 1 AP; see the
results in Table 4.
Table 4: Impact of Each Training Trick on the Accuracy of the Candidate Models.
Configuration | SSD AVG AP, % | ATSS AVG AP, % | VFNet AVG AP, %
Final solution | 52.0 | 60.3 | 64.7
w/o pretraining on COCO | 51.2 | 56.8 | 47.7
w/o iteration patience | 44.6 | 58.6 | 64.3
w/o ReduceOnPlateau LR scheduler | 50.9 | 58.9 | 63.7
w/o multiscale training | - | 59.2 | 64.0
Architecture-aware features:
w/o anchor reclustering | 48.2 | - | -
w/o modified DCN | - | - | 64.1
The average AP on the validation subsets is reported here.
Table 5: Impact of Each Training Trick on the Training Time of the Candidate Models.
Configuration | SSD: AVG training time, min / AVG epochs | ATSS: AVG training time, min / AVG epochs | VFNet: AVG training time, min / AVG epochs
Final solution | 45.18 / 145 | 36.27 / 92 | 230.91 / 71
w/o Early Stopping | 245.20 / 300 | 306.25 / 300 | 986.96 / 300
w/o Tuning resolution | 56.64 / 136 | 50.64 / 111 | - / -
A GeForce RTX 3090 was used to benchmark training time.
The iteration
patience parameter showed great improvement (7.4
AP) for the fast SSD model, which stopped too early
without the new parameter. The best trick for the medium
ATSS is using COCO weights, which improved accuracy
by 3.5 AP. For VFNet the most productive trick was
also using COCO weights, which increased the accuracy
by 17 AP.
Since we focused not only on the overall accuracy
but also on getting results within an acceptable time
range and fast inference of the trained model, we also
used some tricks to reduce the training time and com-
plexity of the model, while preserving the accuracy of
predictions. We measured the time spent on training in a
single-GPU setup, using a GeForce RTX 3090. See Table 5
for the exact numbers.
The first, counter-intuitive, observation here is that
the small SSD model on average requires more epochs and
more time to fulfill its potential and train sufficiently
than the medium ATSS model. Among the training time
tricks, the most powerful was Early Stopping, which
reduced the training time by 4.2-8.6 times. Also, using
the standard high resolution (1344x800) for the input
images increased the training time for the SSD and ATSS
architectures by 25-40%. For the SSD architecture, the
accuracy also decreased by 0.9 AP because this resolution
is too high for such a small model. For ATSS, the
accuracy increased by 1.7 AP, mostly on datasets with
small objects, but inference slowed down by 1.5 times,
rendering it too slow for our criteria for a medium model.
5 CONCLUSIONS
In this research, we have outlined an alternative ap-
proach to training deep neural network models for
fast-paced real-world use cases, across a variety of
industries. This novel approach addresses the need
for achieving high accuracy without performing time-
consuming and computationally expensive training or
fine-tuning for individual use cases. This methodol-
ogy forms the backbone of the Intel Geti computer vi-
sion platform that enables rapidly building computer
vision models for real world use cases.
As a first step, we have proposed a corpus of 11
publicly available datasets, spanning various domains
and dataset sizes. Subsequently, we have performed
more than 300 experiments, each consisting of parallel
trainings, one per dataset in the
selected corpus. We have selected average AP results
over all datasets in the corpus as a target metric to op-
timize our final models.
Using this methodology, we have analyzed a num-
ber of architectures, concretely, SSD and YOLOX
as fast model architectures for inference, ATSS and
FCOS as medium model architectures, and VFNet,
Cascade-RCNN, and Faster-RCNN as accurate mod-
els. In the process, we have identified tricks and
partial optimizations that helped us improve the average
AP scores across the dataset corpus.
We encountered a significant challenge in adjusting
the optimal training time while building a universal
template for a wide range of possible real-world use
cases. This also affected the learning rate scheduler.
To solve this, we introduced an additional iteration
patience parameter for Early Stopping and ReduceOnPlateau,
along with the epoch patience parameter. This enabled us
to balance the requirements for small and large datasets
while maintaining optimal training times irrespective of
the dataset sizes.
Our efforts result in 3 dataset-agnostic templates for
object detection training, one per performance-accuracy
regime, that provide a strong baseline on a wide variety
of datasets and can be deployed on CPU using the
OpenVINO™ toolkit.
REFERENCES
Anisimov, D. and Khanova, T. (2017). Towards lightweight
convolutional neural networks for object detection. In
2017 14th IEEE international conference on advanced
video and signal based surveillance (AVSS), pages 1–
8. IEEE.
Cai, Z. and Vasconcelos, N. (2018). Cascade r-cnn: Delving
into high quality object detection. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 6154–6162.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X.,
Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng,
D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu,
R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy,
C. C., and Lin, D. (2019). MMDetection: Open mm-
lab detection toolbox and benchmark. arXiv preprint
arXiv:1906.07155.
Crawshaw, M. (2020). Multi-task learning with deep
neural networks: A survey. arXiv preprint
arXiv:2009.09796.
Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). Yolox:
Exceeding yolo series in 2021. arXiv preprint
arXiv:2107.08430.
Gulrajani, I. and Lopez-Paz, D. (2021). In search of lost
domain generalization. In International Conference
on Learning Representations.
Häni, N., Roy, P., and Isler, V. (2020). Minneapple: a bench-
mark dataset for apple detection and segmentation.
IEEE Robotics and Automation Letters, 5(2):852–858.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Huang, W. and Wei, P. (2019). A pcb dataset for de-
fects detection and classification. arXiv preprint
arXiv:1901.08204.
Li, S., Yan, Q., and Liu, P. (2020). An efficient fire de-
tection method based on multiscale feature extrac-
tion, implicit deep supervision and channel attention
mechanism. IEEE Transactions on Image Processing,
29:8467–8475.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
C.-Y., and Berg, A. C. (2016). Ssd: Single shot multi-
box detector. In Leibe, B., Matas, J., Sebe, N., and
Welling, M., editors, Computer Vision – ECCV 2016,
pages 21–37, Cham. Springer International Publish-
ing.
Prechelt, L. (2000). Early stopping - but when?
Prokofiev, K. and Sovrasov, V. (2021). Towards effi-
cient and data agnostic image classification training
pipeline for embedded systems.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incre-
mental improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28:91–99.
Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor,
L. (2021). Imagenet-21k pretraining for the masses.
arXiv preprint arXiv:2104.10972.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residu-
als and linear bottlenecks. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 4510–4520.
Santos, T. T., de Souza, L. L., dos Santos, A. A., and Avila,
S. (2020). Grape detection, segmentation, and track-
ing using deep neural networks and three-dimensional
association. Computers and Electronics in Agricul-
ture, 170:105247.
Tian, Z., Shen, C., Chen, H., and He, T. (2019). Fcos:
Fully convolutional one-stage object detection. In
2019 IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 9626–9635.
Triantafillou, E., Larochelle, H., Zemel, R., and Du-
moulin, V. (2021). Learning a universal template
for few-shot dataset generalization. arXiv preprint
arXiv:2105.07029.
Wang, C., Bochkovskiy, A., and Liao, H. (2020). Scaled-yolov4:
Scaling cross stage partial network. arXiv preprint
arXiv:2011.08036.
Yang, R. and Yu, Y. (2021). Artificial convolutional neural
network in object detection and semantic segmenta-
tion for medical imaging analysis. Frontiers in Oncol-
ogy, 11:573.
Ye, T., Zhang, X., Zhang, Y., and Liu, J. (2020). Railway
traffic object detection using differential feature fusion
convolution neural network. IEEE Transactions on
Intelligent Transportation Systems, 22(3):1375–1387.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni,
L. M., and Shum, H.-Y. (2022). Dino: Detr with im-
proved denoising anchor boxes for end-to-end object
detection. arXiv preprint arXiv:2203.03605.
Zhang, H., Wang, Y., Dayoub, F., and Sunderhauf, N.
(2021). Varifocalnet: An iou-aware dense object de-
tector. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 8514–8523.
Zhang, S., Chi, C., Yao, Y., Lei, Z., and Li, S. Z. (2020).
Bridging the gap between anchor-based and anchor-
free detection via adaptive training sample selection.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Zhu, X., Hu, H., Lin, S., and Dai, J. (2019). Deformable
convnets v2: More deformable, better results. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 9308–9316.