Rethinking the Backbone Architecture for Tiny Object Detection
Jinlai Ning (https://orcid.org/0000-0002-0460-7657), Haoyan Guan (https://orcid.org/0000-0003-1936-2442) and Michael Spratling (https://orcid.org/0000-0001-9531-2813)
Department of Informatics, King’s College London, London, U.K.
Keywords: Tiny Object Detection, Backbone, Pre-Training.
Abstract:
Tiny object detection has become an active area of research because images with tiny targets are common in
several important real-world scenarios. However, existing tiny object detection methods use standard deep
neural networks as their backbone architecture. We argue that such backbones are inappropriate for detecting
tiny objects as they are designed for the classification of larger objects, and do not have the spatial resolution
to identify small targets. Specifically, such backbones use max-pooling or a large stride at early stages in the
architecture. This produces lower resolution feature-maps that can be efficiently processed by subsequent lay-
ers. However, such low-resolution feature-maps do not contain information that can reliably discriminate tiny
objects. To solve this problem we design “bottom-heavy” versions of backbones that allocate more resources
to processing higher-resolution features without introducing any additional computational burden overall. We
also investigate if pre-training these backbones on images of appropriate size, using CIFAR100 and Ima-
geNet32, can further improve performance on tiny object detection. Results on TinyPerson and WiderFace
show that detectors with our proposed backbones achieve better results than the current state-of-the-art meth-
ods.
1 INTRODUCTION
Tiny object detection is a sub-field of object detection
in the field of computer vision, and has many applica-
tions including maritime search-and-rescue, surveil-
lance, and driving assistance. It is an active area of
research because classical methods of standard ob-
ject detection (Ren et al., 2015; Liu et al., 2015; Lin
et al., 2017a; Cai and Vasconcelos, 2018; Redmon
and Farhadi, 2018; Tian et al., 2019) that work well
on datasets such as MS COCO (Lin et al., 2014) and
Pascal VOC (Everingham et al., 2015) perform poorly
on tiny object datasets such as TinyPerson (Yu et al.,
2020) and WiderFace (Yang et al., 2016). Detecting
tiny objects is still a challenging task for these stan-
dard methods. As a result, many methods have been
designed specifically for tiny object detection (Tong
and Wu, 2022). As described in (Tong and Wu, 2022),
these existing methods adapt standard object detec-
tion frameworks to be more suitable for tiny object
detection by: using super-resolution techniques (Yang
et al., 2019; Tang et al., 2018); exploiting contextual
information (Hong et al., 2022); using data augmen-
tation techniques (Yun et al., 2019; Yu et al., 2020;
Jiang et al., 2021); employing multi-scale represen-
tation learning approaches (Lin et al., 2017b; Gong
et al., 2021); using anchor mechanisms more appro-
priate for small objects (Zhang et al., 2017; Zhang
et al., 2018); designing training strategies specific to
small objects (Singh et al., 2018; Krishna and Jawa-
har, 2017); or using loss functions specific for small
and tiny objects (Liu et al., 2021a; Lin et al., 2017c).
In this article we propose a different approach that
does not fit into any of these categories.
Although previous approaches vary from each
other in specific details, they all depend on standard
deep neural network backbones, such as ResNet (He
et al., 2015), to extract features. The features are then
used for feature fusion, position regression and cat-
egory classification. Despite the impressive results
previous algorithms have obtained, we believe that by
re-deploying the same standard backbones as are used
for general object detection (Ren et al., 2015; Lin
et al., 2017a; Cai and Vasconcelos, 2018; Tian et al.,
2019), these methods suffer from poor feature extrac-
tion for tiny object detection. The standard backbone
architectures down-sample the feature maps rapidly in
the first few layers, and this down-sampling removes
much of the information about tiny objects that were
present in the original image. As a result, subsequent
Figure 1: An illustration of (a) an existing backbone (ResNet50), and (b) our modified backbone (BH-ResNet50). Our modified backbone delays down-sampling and changes the number of layers at different depths.
layers in the backbone, and all the methods for im-
proving tiny object detection that fall into the seven
categories mentioned in the preceding paragraph, can
only work with features that are relatively uninforma-
tive and poor at identifying tiny objects.
Down-sampling is a widely applied operation that
has proven highly effective in convolutional neural
networks (CNNs). It can be achieved using pooling
layers that summarize the features in each patch, or
by using a convolutional layer with a stride greater
than one (Springenberg et al., 2015). Down-sampling
can improve translation invariance, avoid over-fitting,
and decrease computational costs. For tiny object de-
tection, down-sampling itself is not an issue. Rather
it is the improper use of down-sampling that results
in poor feature-extraction. Tiny objects occupy very
few pixels and down-sampling could potentially re-
move important features that identify such objects.
The only way to preserve information about small fea-
tures is for convolutional filters in the earliest layers
to encode these features and pass this information on
to subsequent layers. However, in existing backbones
the number of convolutional filters in the early lay-
ers is kept to a minimum to reduce the computational
burden, and this likely means that not all discrimina-
tive features of tiny objects are able to be preserved.
For instance, ResNets reduce the feature-map size by a
factor of four within the first two layers of the net-
work. Using such backbones to handle tiny object
detection is likely to result in information about tiny
objects disappearing in the feature maps before it is
fully extracted.
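To make this concrete, consider the following back-of-the-envelope calculation (an illustrative sketch only, using the cumulative strides of a standard ResNet stem and stages): an object at the upper end of the tiny interval spans only a couple of feature-map cells per side once the cumulative stride reaches 16.

```python
# Back-of-the-envelope illustration: footprint of a tiny object on the
# feature maps of a standard ResNet (cumulative strides 2, 4, 8, 16, 32).
object_size_px = 20  # upper end of the "tiny" interval in TinyPerson

cumulative_strides = {
    "stem conv (7x7, stride 2)": 2,
    "max-pool (stride 2)":       4,
    "stage 2":                   8,
    "stage 3":                  16,
    "stage 4":                  32,
}

for layer, stride in cumulative_strides.items():
    cells = object_size_px / stride
    print(f"{layer:27s}: ~{cells:4.1f} feature-map cells per side")
```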
Outside the domain of object detection, previous
work has addressed information loss caused by down-
sampling in a number of different ways. Zhao et
al. (Zhao et al., 2017) introduced random shifting
in down-sampling during the training process to sup-
press contextual information loss without extra com-
putational cost. Ren et al. (Jiahuan et al., 2021) pro-
posed a low-rank feature recovery module to try to re-
cover the lost information. In other sub-fields of com-
puter vision, such as hyper-spectral target detection
(Bhandari and Tiwari, 2021), and pluralistic image in-
painting (Liu et al., 2022), several methods have been
designed to handle information loss. Similarly, the
Go-playing AI, AlphaGo (Silver et al., 2016), did
not use layers with strides greater than one, to avoid
losing any spatial information about the board.
However, in the domain of tiny object detection, this
problem has so far been overlooked, presumably be-
cause of the convenience of reusing the same back-
bones employed by general-purpose object detectors
(Ren et al., 2015; Liu et al., 2015; Redmon et al.,
2015).
In this paper we address this neglected problem
and propose alternative backbone architectures that
are more appropriate for tiny object detection. We
show that simple changes to the architecture (as illus-
trated in Fig. 1) that delay down-sampling operations,
and move more convolutional filters to earlier layers
where they can process high-resolution information,
can bring clear improvement to object detection per-
formance. We demonstrate the effectiveness of this
strategy by making the same modifications to two
different backbone architectures: ResNet (He et al.,
2015) and HRNet (Wang et al., 2021b). Experiments
on TinyPerson (Yu et al., 2020) and WiderFace (Yang
et al., 2016) demonstrate that replacing standard back-
bones with our modified versions always results in an
improvement in performance for a number of differ-
ent object detection frameworks. This is achieved despite the modified backbones having fewer parameters than the corresponding original ones. These results support our claim that unsuitable use of down-sampling occurs in the backbones commonly used for tiny object detection and that this problem should be taken seriously to help improve detection accuracy.

Table 1: The size of objects in several typical general object detection datasets (first three rows) and tiny object detection datasets (last two rows). Absolute size is defined as the square root of the object’s absolute bounding box area, measured in pixels. It is reported as mean ± standard deviation.

Dataset | Absolute size | Proportion of small/tiny objects
MS COCO | 99.5 ± 107.5 | 32% (absolute size < 32 pixels)
PASCAL VOC | – | 10% (absolute size < 32 pixels)
CityPersons | 79.8 ± 67.5 | 8% (absolute size < 20 pixels)
TinyPerson | 18.0 ± 17.4 | 73% (absolute size < 20 pixels)
WiderFace | 32.8 ± 52.7 | 56% (absolute size < 20 pixels)
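For concreteness, the absolute-size statistic used in Table 1 can be computed as follows (a minimal sketch; the bounding boxes are invented values, not taken from any of the datasets):

```python
import math
import statistics

# Hypothetical (width, height) bounding boxes, in pixels.
boxes = [(12, 25), (6, 9), (15, 18), (40, 80), (8, 14)]

# Absolute size = square root of the bounding-box area (Table 1).
sizes = [math.sqrt(w * h) for w, h in boxes]

mean, std = statistics.mean(sizes), statistics.pstdev(sizes)
tiny_fraction = sum(s < 20 for s in sizes) / len(sizes)

print(f"absolute size: {mean:.1f} ± {std:.1f} pixels")
print(f"proportion with absolute size < 20 pixels: {tiny_fraction:.0%}")
```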
2 RELATED WORK
Object detection aims to locate, using a bounding box,
and predict the category of each object in an image.
The majority of state-of-the-art algorithms are based
on deep learning techniques (Ren et al., 2015; Tan
et al., 2020; Tian et al., 2019; Wang et al., 2021a;
Liu et al., 2021c), although classic image process-
ing methods have also made contributions in the early
years (Dalal and Triggs, 2005; Felzenszwalb et al.,
2008). CNN-based networks have dominated the ob-
ject detection field for many years. These meth-
ods can be classified as one-stage (Redmon et al.,
2015; Bochkovskiy et al., 2020; Liu et al., 2015; Tian
et al., 2019) or two-stage (Girshick et al., 2014; He
et al., 2014; Ren et al., 2015) algorithms, depend-
ing on whether or not detection is performed by one
end-to-end network. Rather than designing bespoke
backbone architectures, methods in both groups use a
backbone that is a standard CNN architecture that has
been pre-trained on a classification task, and then had
the fully-connected layers removed.
Large datasets are the basic resources required for
deep learning-based methods. For object detection,
MS COCO (Lin et al., 2014) and PASCAL VOC (Ev-
eringham et al., 2015) are widely used as benchmarks
to evaluate the performance of different algorithms.
Both of them contain a variety of natural images, con-
taining a large number of different categories, and can
be used to test how object-detection methods perform
on general tasks.
Tiny object detection is a sub-field of object de-
tection. It focuses on detecting tiny objects in im-
ages because such objects are common in real-world
scenarios, but are hard to detect with standard ob-
ject detection methods. The main difference between
tiny object detection datasets such as TinyPerson (Yu
et al., 2020) and WiderFace (Yang et al., 2016) and
general object detection datasets is the scale of the tar-
get objects. As shown in Table 1, tiny object datasets
contain more small and tiny objects while general
datasets contain objects at a wider range of scales.
As mentioned in the Introduction, previous methods
for tiny object detection fall into seven categories
(Tong and Wu, 2022). Representative examples from
each category will be reviewed in the following para-
graphs, with results summarized in Tables 2 and 3.
Super-resolution techniques increase the resolu-
tion of the image, to enable standard techniques for
larger object detection to be applied successfully to
images containing tiny targets. Increasing the reso-
lution can be done for the whole image, for example
using a generative neural network (Bai et al., 2018),
or can be applied to a small region of interest (Yang
et al., 2019).
Contextual information (i.e., features from the sur-
rounding region of the image) can be used to help de-
tect small objects. Hence, many techniques include
context in the computation. One example is the scale
selection pyramid network (SSPNet) that consists of a
context attention module (CAM), scale enhancement
module (SEM), and scale selection module (SSM)
(Hong et al., 2022). CAM includes context informa-
tion by generating hierarchical attention heat-maps.
SEM ensures feature maps of different scales only fo-
cus on objects of suitable scale rather than other ob-
jects and the background. SSM exploits feature fu-
sion between shallow and deep features to keep the
information shared by different layers.
Data augmentation aims to improve performance
by extending the size of the training dataset through
identity-preserving image transformations. Yu et al.
(Yu et al., 2020) proposed a simple yet efficient aug-
mentation approach named scale match. This method
can transform the distribution of object sizes from a
general dataset to be similar to a task-specific dataset.
In that paper, MS COCO was transformed with scale
match and then used as additional data for training a
detector for the TinyPerson detection task. One step
in scale match is to sample a size-bin from the his-
Table 2: Detection performance of previous methods, including the current state-of-the-art methods, on TinyPerson, sorted by performance on the mAP^50_tiny metric.

Method | mAP^50_tiny | mAP^50_tiny1 | mAP^50_tiny2 | mAP^50_tiny3 | mAP^50_small
Faster R-CNN-FPN (Ren et al., 2015) | 47.35 | 30.25 | 51.58 | 58.95 | 63.18
Faster R-CNN-FPN with S-α (Gong et al., 2021) | 48.39 | 31.68 | 52.20 | 60.01 | 65.15
Faster R-CNN-RFLA (Xu et al., 2022) | 48.86 | 30.35 | 54.15 | 61.28 | 66.69
Faster R-CNN-FPN-SM (Yu et al., 2020) | 51.33 | 33.91 | 55.16 | 62.58 | 66.96
Faster R-CNN-FPN-SM+ (Jiang et al., 2021) | 51.46 | 33.74 | 55.32 | 62.95 | 67.37
RetinaNet-SSPNet (Hong et al., 2022) | 54.66 | 42.72 | 60.16 | 61.52 | 65.24
Faster R-CNN with SFRF (Liu et al., 2021b) | 57.24 | 51.59 | 64.51 | 67.78 | 65.33
Cascade R-CNN-SSPNet (Hong et al., 2022) | 58.59 | 45.75 | 62.03 | 65.83 | 71.80
Faster R-CNN-SSPNet (Hong et al., 2022) | 59.13 | 47.56 | 62.36 | 66.15 | 71.17
Table 3: Detection performance of previous methods, including the current state-of-the-art methods, on WiderFace, sorted by performance on the mAP_medium metric.

Method | mAP_easy | mAP_medium | mAP_hard
HR (Hu and Ramanan, 2017) | 92.5 | 91.0 | 80.6
S3FD (Zhang et al., 2017) | 93.7 | 92.4 | 85.2
FaceGAN (Bai et al., 2018) | 94.4 | 93.3 | 87.3
SFA (Luo et al., 2019) | 94.9 | 93.6 | 86.6
LSFHI (Zhang et al., 2018) | 95.7 | 94.9 | 89.7
Pyramid-Box (Tang et al., 2018) | 96.1 | 95.0 | 88.9
RetinaFace (Deng et al., 2020) | 96.5 | 95.6 | 90.4
TinaFace (Zhu et al., 2020) | 96.3 | 95.7 | 93.1
togram of object sizes; this sampling is done at ran-
dom, with probabilities determined by a given image
in the external dataset. To prevent the sampled size-
bin differing greatly from the size of objects in the
external image, monotone scale match (MSM) is used
to sample the range of sizes monotonically from small
to large.
2021) proposed an enhanced version of scale match,
SM+, that improves the scale match from image level
to instance level and uses probabilistic structure in-
painting (PSI) to handle the background.
Multi-scale representation learning aims to en-
hance detection performance by making use of fea-
tures at different scales, represented in different layers
of the CNN. This technique is also commonly applied
in general object detection, for example through the
feature pyramid network (FPN) (Lin et al., 2017b).
Gong et al. (Gong et al., 2021) argue that the FPN
brings not only a positive impact but also a negative
one caused by the top-down connections. They de-
sign a statistic-based fusion factor to adaptively ad-
just the weight of different layers during the feature
fusion. Tang et al. (Tang et al., 2018) argue that not
all high-level semantic features are really helpful to
smaller targets. They therefore modify FPN to a low-
level feature pyramid network (LFPN) that starts the
top-down structure from a middle layer rather than the
high layer. Liu et al. (Liu et al., 2021b) devise a fea-
ture rescaling and fusion (SFRF) network that selects
and generates a new resized feature map with a high-
density distribution of tiny objects through the use of a
Nonparametric Adaptive Dense Perceiving Algorithm
(NADPA) module.
Anchors are predefined locations where objects
might be detected and are widely applied in two-step
detectors such as Faster R-CNN (Ren et al., 2015) and
Cascade R-CNN (Cai and Vasconcelos, 2018). Zhang
et al. (Zhang et al., 2017) guarantee that anchors of
different scales have the same density on the image,
so that various scales can approximately match the
same number of anchors. They also propose a scale
compensation anchor matching strategy to ensure that
all scales have enough anchors matched. Deng et
al. (Deng et al., 2020) specify different sizes of an-
chors in different layers of features to ensure all sizes
of objects can be matched properly. Zhang et al.
(Zhang et al., 2018) develop a single-level face de-
tection framework to specifically detect small faces,
which uses different dilation rates for anchors with
different sizes and performs anchor-level hard ex-
ample mining. Xu et al. (Xu et al., 2022) design a
novel label assignment method to mitigate the lack
of positive samples and the gap between the
uniformly distributed prior anchors and the Gaussian-
Figure 2: The proportion of FLOPs per stage for (a) standard ResNet50, and (b) our modified BH-ResNet50.
distributed receptive field.
Adapting the training strategy to be more appro-
priate for tiny object detection can be beneficial. For
example, Luo et al. (Luo et al., 2019) propose a multi-
branch small face attention (SFA) detector. It trains
the model with multi-scale training images to improve
robustness and performance. Hu et al. (Hu and Ra-
manan, 2017) train separate face detectors efficiently
by utilizing features extracted from multiple layers of
a single feature hierarchy for different scales.
Optimizing the loss function is an effective strat-
egy to improve the detection performance of tiny ob-
jects. Zhu et al. (Zhu et al., 2020) modify the loss
function of bounding box regression from the widely
applied smooth L1 loss (Girshick, 2015) to DIoU loss
(Zheng et al., 2020). It ensures the loss function is
consistent with the objective of bounding box regres-
sion by including the IoU metric.
Our proposed approach, of modifying the back-
bone, is orthogonal to all this previous work. Hence,
potentially all these diverse techniques might be fur-
ther improved by combining them with our proposed
approach.
3 METHODOLOGY
A natural idea to preserve the high-resolution features
required for tiny object detection is to dispense with
several of the down-sampling operations. However, such a simplis-
tic approach would significantly increase the compu-
tational complexity of the backbone as the number
of calculations increases quadratically with the side
length of a feature-map. To avoid this issue, our proposed
bottom-heavy (BH) architectures perform as much
down-sampling as the corresponding original archi-
tecture, but the number of convolutional filters is in-
creased in the earlier stages of the network, and de-
creased in the later stages. This decrease in the num-
ber of convolutional filters in the later layers reduces
the amount of computation performed in the deeper
layers to compensate for the increased computation
performed in the earlier layers. Specifically, in our
BH architecture, the number of convolutional layers
is increased before the feature-map is down-sampled
for the third time, and decreased after this point. The
proportion of FLOPs performed in different stages of
a ResNet50 and a BH-ResNet50 is shown in Fig. 2.
The change in the distribution of the convolutional
layers is designed to ensure that the original backbone
and the modified version perform the same number
of floating-point operations, to allow a fair compar-
ison of the proposed backbones with existing ones.
Another result of these modifications is that the BH
networks have fewer parameters than their original
counterparts. Full details are provided in Table 4.
These modifications ensure that high-resolution infor-
mation that can distinguish tiny objects is processed
more thoroughly, using more convolutional layers.
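As a concrete illustration of this reallocation (a minimal sketch, not the implementation used in our experiments), a ResNet-50-style network with a bottom-heavy block allocation can be assembled with torchvision by changing the per-stage repeat counts and replacing the aggressive stem; the exact stride placement of BH-ResNet50 in Table 4 is only approximated here.

```python
import torch
from torch import nn
from torchvision.models.resnet import ResNet, Bottleneck

def build(layers):
    """ResNet-50-style network with a custom number of bottleneck blocks per stage."""
    return ResNet(Bottleneck, layers)

def make_bottom_heavy(model):
    """Approximate the BH stem: a 3x3 stride-1 convolution instead of 7x7 stride-2.

    Note: Table 4 also moves one down-sampling into stage 1; that detail is
    omitted here, so this only approximates BH-ResNet50.
    """
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    return model

standard = build([3, 4, 6, 3])                         # stock ResNet-50 allocation
bottom_heavy = make_bottom_heavy(build([7, 6, 2, 1]))  # bottom-heavy allocation (Table 4)

x = torch.randn(1, 3, 256, 320)  # a small input, for a quick sanity check
print(standard(x).shape, bottom_heavy(x).shape)  # both: torch.Size([1, 1000])

def param_count_m(model):
    return sum(p.numel() for p in model.parameters()) / 1e6

print(f"params: standard {param_count_m(standard):.1f}M, "
      f"bottom-heavy {param_count_m(bottom_heavy):.1f}M")
```

The printed parameter counts should show the same trend as Table 4: shifting blocks towards the high-resolution stages reduces the total number of parameters, because early-stage blocks use far fewer channels than late-stage ones.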
4 EXPERIMENTS
4.1 Pre-Training
Pre-training is a process that initializes model pa-
rameters with previously learned ones. For object
detection tasks, it is common to start training the
whole model with a backbone that has been pre-
trained on an image classification task, typically Im-
ageNet. Compared with training the whole detection
network from scratch, using a pre-trained backbone
is better to maintain a stable and fast training process
(Pan and Yang, 2010; Han et al., 2021). Addition-
ally, the learned knowledge from classification also
provides a solid foundation for detection. Our back-
bones are pre-trained on CIFAR-100 (Krizhevsky and
Hinton, 2009), ImageNet (Deng et al., 2009) and
a down-sampled variant of ImageNet, ImageNet32 (Chrabaszcz et al., 2017).
Table 4: The architectures of three standard deep neural network backbones (ResNet50, HRNet32, and HRNet18) and the proposed bottom-heavy (BH) versions. Convolutional layers are specified using the notation width × height, number of channels, and stride. Pooling layers are specified using the notation width × height, type, and stride, where type is M for max pooling. Repeated blocks are shown in square brackets, with the number of repeats indicated after the multiplication sign following the brackets. Where the stride is unspecified it is equal to one. ’*’ indicates that the stride of that layer equals two, but only for the first block in a set of repeated layers. Rather than illustrating all parallel branches of HRNet, only the deepest branch is shown for clarity. The number of giga-floating-point operations (GFLOPs) is calculated for an input size of 640 pixels × 512 pixels.

ResNet50 (23.23M parameters, 26.91 GFLOPs):
Stem: 7 × 7, 64, 2; 3 × 3, M, 2
Stage 1: [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
Stage 2: [1 × 1, 128; 3 × 3, 128, 2; 1 × 1, 512] × 4
Stage 3: [1 × 1, 256; 3 × 3, 256, 2; 1 × 1, 1024] × 6
Stage 4: [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3

BH-ResNet50 (10.91M parameters, 27.12 GFLOPs):
Stem: 3 × 3, 64, 1; 3 × 3, M, 2
Stage 1: [1 × 1, 64; 3 × 3, 64, 2; 1 × 1, 256] × 7
Stage 2: [1 × 1, 128; 3 × 3, 128, 2; 1 × 1, 512] × 6
Stage 3: [1 × 1, 256; 3 × 3, 256, 2; 1 × 1, 1024] × 2
Stage 4: [1 × 1, 512; 3 × 3, 512, 2; 1 × 1, 2048] × 1

HRNet32 (29.31M parameters, 51.93 GFLOPs):
Stem: 3 × 3, 64, 2; 3 × 3, 64, 2
Stage 1: [3 × 3, 64; 3 × 3, 64] × 4
Stage 2: [3 × 3, 64; 3 × 3, 64] × 4
Stage 3: [3 × 3, 128; 3 × 3, 128] × 16
Stage 4: [3 × 3, 256; 3 × 3, 256] × 12

BH-HRNet32 (12.36M parameters, 51.89 GFLOPs):
Stem: 3 × 3, 64, 1; 3 × 3, 64, 2; 3 × 3, M, 2
Stage 1: [3 × 3, 64; 3 × 3, 64] × 4
Stage 2: [3 × 3, 64; 3 × 3, 64] × 4
Stage 3: [3 × 3, 128; 3 × 3, 128] × 8
Stage 4: [3 × 3, 256; 3 × 3, 256] × 3

HRNet18 (9.56M parameters, 21.71 GFLOPs):
Stem: 3 × 3, 64, 2; 3 × 3, 64, 2
Stage 1: [3 × 3, 36; 3 × 3, 36] × 4
Stage 2: [3 × 3, 64; 3 × 3, 64] × 4
Stage 3: [3 × 3, 72; 3 × 3, 72] × 16
Stage 4: [3 × 3, 144; 3 × 3, 144] × 12

BH-HRNet18 (5.85M parameters, 21.66 GFLOPs):
Stem: 3 × 3, 64, 1; 3 × 3, 64, 2; 3 × 3, M, 2
Stage 1: [3 × 3, 36; 3 × 3, 36] × 4
Stage 2: [3 × 3, 36; 3 × 3, 36] × 4
Stage 3: [3 × 3, 72; 3 × 3, 72] × 12
Stage 4: [3 × 3, 144; 3 × 3, 144] × 6
Table 5: Classification top-1 accuracy of backbone architectures after pre-training on CIFAR-100, ImageNet32 and ImageNet.

Method | CIFAR-100 | ImageNet32 | ImageNet
ResNet50 | 65.82% | 43.12% | 75.92%
BH-ResNet50 | 68.57% | 48.75% | 74.68%
HRNet18 | 55.75% | – | 76.80%
BH-HRNet18 | 59.91% | 50.68% | 76.62%
HRNet32 | 55.08% | – | 78.50%
BH-HRNet32 | 61.71% | 56.00% | 78.23%
The CIFAR-100 dataset (Krizhevsky and Hinton,
2009) consists of 60000 32×32 colour images in 100
classes, separated into 50000 training and 10000 test
images. It is chosen for pre-training because: (1) the
image size of CIFAR-100 is consistent with the size
of target objects in the tiny object detection datasets;
(2) it has a reasonable training time that is appropriate
for prototyping a new backbone; (3) some of the cate-
gories in CIFAR-100 such as man, woman, baby, boy,
girl, sea etc. are similar to those in the TinyPerson
and WiderFace datasets, and hence, it is expected that
the parameters will transfer from image recognition
to object detection.
We also perform experiments using backbones
pre-trained on ImageNet. This allows us to com-
pare the performance of the new architecture with
published results for existing object detectors that are
pre-trained on ImageNet. ImageNet is a large visual
dataset containing more than 20000 categories and 14
million images. It is considered a standard choice to
pre-train a backbone for object detection tasks. In ad-
dition, we also repeat some experiments on a down-
sampled variant of ImageNet, named ImageNet32.
ImageNet32 contains exactly the same number of im-
ages as the original ImageNet, but all images are re-
sized to 32×32 pixels. This dataset has the same ad-
vantages as CIFAR-100, but provides more training
images.
For data augmentation, we use random horizontal
flipping, random cropping and resizing. The classifi-
cation results at the end of pre-training are shown in
Table 5. It can be seen that the proposed BH networks
produce more accurate classifications for the datasets
with small image sizes, compared to the correspond-
ing original networks. This supports our claim that
these bottom-heavy networks are more appropriate
for the discrimination of small objects. This improve-
ment in performance on small objects comes at the
cost of a small decrease in classification accuracy for
large images. There may be scope for improving these
pre-training results (and as a consequence improving
the object detection results) as no great effort has been
taken to optimise the hyper-parameter settings used during pre-
training: we had insufficient computational resources
to do so, and searching for the best settings is not the
main point of this paper.
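For reference, a minimal sketch of this kind of augmentation pipeline with torchvision is given below; the crop scale and normalisation statistics are illustrative choices, not settings reported here.

```python
from torchvision import datasets, transforms

# Illustrative augmentation for 32x32 CIFAR-100 pre-training images:
# random horizontal flipping plus random cropping and resizing.
# The normalisation statistics are commonly used CIFAR-100 values,
# not values taken from this paper.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(32, scale=(0.6, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5071, 0.4866, 0.4409),
                         std=(0.2673, 0.2564, 0.2762)),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=train_transform)
print(len(train_set), "training images")  # 50000
```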
Table 6: Detection performance of previous methods and our modified (BH) architecture on TinyPerson. Pre-training is done on CIFAR-100.

Columns: Backbone | mAP^50_tiny | mAP^50_tiny1 | mAP^50_tiny2 | mAP^50_tiny3 | mAP^50_small | FLOPs

Faster R-CNN-FPN (Ren et al., 2015):
ResNet50 | 43.02 | 26.09 | 47.43 | 54.24 | 60.13 | 75.58
BH-ResNet50 | 46.53 | 31.65 | 50.51 | 56.60 | 59.45 | 75.79
HRNet32 | 37.67 | 24.62 | 41.84 | 47.29 | 51.39 | 100.66
BH-HRNet32 | 41.82 | 27.19 | 46.03 | 51.85 | 55.39 | 100.62
HRNet18 | 36.94 | 24.05 | 41.46 | 45.94 | 50.57 | 69.33
BH-HRNet18 | 40.27 | 25.56 | 44.50 | 50.18 | 53.65 | 69.28

Faster R-CNN-SM (Yu et al., 2020):
ResNet50 | 43.21 | 27.12 | 47.44 | 54.52 | 59.78 | 75.58
BH-ResNet50 | 46.69 | 31.19 | 50.88 | 57.13 | 61.09 | 75.59

Adap-FCOS (ucas-vg, 2020):
ResNet50 | 36.29 | 21.70 | 40.60 | 46.10 | 49.87 | 174.74
BH-ResNet50 | 41.42 | 25.92 | 45.49 | 51.33 | 57.08 | 174.95

Faster R-CNN-RFLA (Xu et al., 2022):
ResNet50 | 42.41 | 24.73 | 45.39 | 55.11 | 61.50 | 75.59
BH-ResNet50 | 45.34 | 27.09 | 49.27 | 58.41 | 62.63 | 75.58

RetinaNet-SSPNet (Hong et al., 2022):
ResNet50 | 48.37 | 36.78 | 54.47 | 55.05 | 60.74 | 260.06
BH-ResNet50 | 49.23 | 36.01 | 54.47 | 56.17 | 61.60 | 260.27

Cascade R-CNN-SSPNet (Hong et al., 2022):
ResNet50 | 48.79 | 35.59 | 50.75 | 57.59 | 65.63 | 186.62
BH-ResNet50 | 48.96 | 37.69 | 52.15 | 55.69 | 62.62 | 186.83

Faster R-CNN-SSPNet (Hong et al., 2022):
ResNet50 | 48.05 | 36.40 | 50.38 | 56.51 | 62.94 | 158.82
BH-ResNet50 | 49.78 | 39.99 | 52.26 | 57.14 | 64.44 | 159.03
4.2 Fine-Tuning
We trained object detectors using the TinyPerson and
WiderFace datasets. TinyPerson has 1610 large-scale
seaside images taken from a long distance. It con-
tains 794 and 816 images for training and testing, re-
spectively. The average absolute size of target ob-
jects is about 18 pixels and non-small objects (ab-
solute size > 32 pixels) are rare. TinyPerson divides
objects into different intervals based on the absolute
size (measured in pixels): tiny [2, 20], small [20, 32]
and all [2, ]. The tiny set is partitioned into three
(overlapping) subsets: tiny1 [2, 8], tiny2 [8, 12] and
tiny3 [12, 20]. The WiderFace dataset is a face detec-
tion benchmark containing 32,203 images of different
human faces with a high degree of variability in scale,
pose and occlusion. The average size of objects in
WiderFace is about 32 pixels. It separates objects into
three evaluation sets (easy, medium and hard) based
on the difficulty of detecting them.
The resolution of the images in TinyPerson ranges
from 497 × 700 pixels to 4064 × 6354 pixels. Di-
rectly feeding large images into the backbone would
produce out-of-memory issues. Meanwhile, simply
resizing such images to make them more manage-
able would risk losing information about tiny objects.
Therefore we follow the procedure described in (Yu
et al., 2020) for training with this dataset. Specifi-
cally, images are cut into overlapping patches of size
640 pixels×512 pixels with 30 pixels overlap. We
use the default training schedule for Faster-RCNN
and Adap-FCOS (ucas-vg, 2020) on TinyPerson: 12
epochs with stochastic gradient descent (SGD) as the
optimizer. The learning rate is initialized to 0.01 and
decreased by a factor of 0.1 after 8 and 11 epochs for
training on 2 GPUs. The only difference when training
the SSPNet-based architectures is that their default
schedule is 10 epochs. For WiderFace, the input size
is 640 pixels×640 pixels. The default training sched-
ule of 630 epochs with SGD (Zhu et al., 2020) is used.
The learning rate is initialized to 0.00375 and follows
a cosine restart scheduler that restarts the learning rate
every 30 epochs. We do not apply any complicated
image augmentation methods but follow the same set-
tings used by the corresponding state-of-the-art meth-
ods.
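The patch-cutting step described above can be sketched as follows (an illustrative approximation, not the exact pre-processing code; the border handling in particular is a guess):

```python
import numpy as np

def cut_into_patches(image, patch_w=640, patch_h=512, overlap=30):
    """Cut an H x W x C image into overlapping patches, approximating the
    TinyPerson pre-processing described above (640x512 crops, 30 px overlap)."""
    h, w = image.shape[:2]
    stride_x, stride_y = patch_w - overlap, patch_h - overlap
    patches = []
    for y in range(0, max(h - overlap, 1), stride_y):
        for x in range(0, max(w - overlap, 1), stride_x):
            # Clamp the window so border patches keep the full patch size.
            x0, y0 = min(x, max(w - patch_w, 0)), min(y, max(h - patch_h, 0))
            patches.append(((x0, y0), image[y0:y0 + patch_h, x0:x0 + patch_w]))
    return patches

image = np.zeros((2048, 3072, 3), dtype=np.uint8)  # a stand-in for a TinyPerson image
patches = cut_into_patches(image)
print(len(patches), "patches of size", patches[0][1].shape)
```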
5 RESULTS
5.1 Experimental Setup and Evaluation Metrics
Mean average precision (mAP) is a standard crite-
rion for evaluation in object detection. AP calculates
the average precision among a set of different recalls.
mAP is the mean AP among all classes. Whether
a bounding box is marked as a positive or negative
prediction is determined by whether the intersection-
over-union (IoU) of the predicted and ground-truth
bounding boxes is greater than a set threshold. The
mAP on a set of objects i with the percentage IoU
threshold j is noted as mAP^j_i.
Table 7: Detection performance of previous methods and our modified (BH) architecture on TinyPerson. Pre-training is done on ImageNet and down-sampled ImageNet (indicated with †). Only the results with our proposed backbones are trained by ourselves, while all others are previously reported results.

Columns: Backbone | mAP^50_tiny | mAP^50_tiny1 | mAP^50_tiny2 | mAP^50_tiny3 | mAP^50_small | FLOPs

Faster R-CNN-FPN (Ren et al., 2015):
ResNet50† | 47.81 | 31.78 | 53.54 | 58.34 | 64.56 | 75.58
ResNet50 | 47.35 | 30.25 | 51.58 | 58.95 | 63.18 | 75.58
BH-ResNet50† | 52.60 | 38.18 | 58.18 | 61.66 | 67.42 | 75.59
BH-ResNet50 | 52.03 | 37.95 | 57.58 | 60.99 | 65.71 | 75.59
HRNet32 | 53.05 | 35.61 | 58.46 | 63.87 | 68.02 | 100.66
BH-HRNet32† | 52.62 | 38.76 | 57.64 | 62.28 | 67.42 | 100.62
BH-HRNet32 | 53.29 | 38.08 | 58.92 | 63.08 | 67.72 | 100.62
HRNet18 | 52.28 | 34.92 | 58.39 | 63.36 | 67.86 | 69.33
BH-HRNet18† | 50.04 | 35.43 | 55.10 | 60.04 | 65.03 | 69.28
BH-HRNet18 | 52.69 | 36.12 | 59.13 | 63.48 | 67.45 | 69.28

Faster R-CNN-SM (Yu et al., 2020):
ResNet50† | 50.65 | 33.68 | 55.69 | 61.40 | 67.39 | 75.58
ResNet50 | 51.33 | 33.91 | 55.16 | 62.58 | 66.96 | 75.58
BH-ResNet50† | 52.04 | 37.87 | 57.01 | 61.32 | 65.91 | 75.59
BH-ResNet50 | 51.48 | 37.35 | 57.35 | 60.49 | 65.47 | 75.59

Adap-FCOS (ucas-vg, 2020):
ResNet50† | 42.13 | 22.62 | 46.50 | 54.32 | 57.61 | 174.74
ResNet50 | 47.42 | 29.05 | 52.06 | 59.15 | 65.06 | 174.74
BH-ResNet50† | 48.20 | 28.56 | 53.00 | 59.87 | 66.51 | 174.95
BH-ResNet50 | 49.16 | 31.13 | 53.89 | 60.09 | 67.60 | 174.95

Faster R-CNN-RFLA (Xu et al., 2022):
ResNet50† | 45.78 | 27.72 | 49.09 | 58.33 | 64.38 | 75.58
ResNet50 | 48.86 | 30.35 | 54.15 | 61.28 | 66.69 | 75.58
BH-ResNet50† | 50.07 | 32.29 | 54.83 | 62.60 | 68.21 | 75.59
BH-ResNet50 | 52.27 | 34.36 | 56.62 | 64.48 | 67.77 | 75.59

RetinaNet-SSPNet (Hong et al., 2022):
ResNet50† | 51.28 | 39.38 | 56.60 | 58.55 | 65.01 | 260.06
ResNet50 | 50.98 | 38.97 | 57.26 | 57.68 | 63.38 | 260.06
BH-ResNet50† | 55.34 | 42.45 | 61.04 | 62.36 | 67.63 | 260.27
BH-ResNet50 | 52.70 | 39.61 | 58.82 | 60.35 | 69.00 | 260.27

Cascade R-CNN-SSPNet (Hong et al., 2022):
ResNet50† | 54.31 | 39.25 | 56.81 | 63.54 | 69.40 | 186.62
ResNet50 | 56.60 | 42.41 | 59.56 | 65.36 | 71.20 | 186.62
BH-ResNet50† | 56.54 | 43.65 | 59.54 | 64.45 | 69.59 | 186.83
BH-ResNet50 | 57.89 | 44.39 | 60.86 | 66.25 | 71.03 | 186.83

Faster R-CNN-SSPNet (Hong et al., 2022):
ResNet50† | 50.48 | 36.81 | 51.08 | 60.87 | 67.38 | 158.82
ResNet50 | 58.58 | 44.92 | 62.22 | 67.19 | 71.88 | 158.82
BH-ResNet50† | 56.22 | 46.07 | 58.03 | 63.36 | 69.54 | 159.03
BH-ResNet50 | 58.97 | 47.22 | 61.61 | 67.45 | 72.37 | 159.03
For example, mAP^50_tiny represents the mAP among objects in the tiny interval (absolute size [2, 20]) with IoU threshold 0.5.
Floating-point operations (FLOPs) are used to mea-
sure the computational complexity of a model. The
higher the number of FLOPs, the more calculations
are performed by the method, for a given size of the
image. In Tables 6 and 7, FLOPs are measured for an
image size of 640×512 pixels.
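For reference, the IoU test that decides whether a prediction is counted as positive can be written as follows (a standard computation given here only for illustration, not taken from the evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A prediction is a true positive at mAP^50 if its IoU with a ground-truth box is >= 0.5.
prediction, ground_truth = (10, 10, 28, 30), (12, 11, 30, 32)
print(f"IoU = {iou(prediction, ground_truth):.2f}, positive at the 0.5 threshold:",
      iou(prediction, ground_truth) >= 0.5)
```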
5.2 Comparison to the State of the Art
To evaluate our proposed backbone, we tested sev-
eral architectures that have previously been shown to
produce state-of-the-art performance on TinyPerson
and WiderFace, replacing the standard backbone with
our proposed bottom-heavy (BH) equivalent. The re-
sults for TinyPerson are shown in Tables 6 and 7.
In each case better performance was achieved using
the proposed backbones than with the original ones,
as summarized in Table 8. This was achieved with-
out increasing the computational complexity and us-
ing backbones containing fewer parameters than the
originals. It can also be seen from Table 8 that
the proposed BH modifications produce a significant
improvement in detection performance compared to
standard backbones, and that this improvement is
highest when using CIFAR-100 and ImageNet32 for
pre-training.
A pre-training dataset should be consistent with
the target task (Pan and Yang, 2010). Down-sampled
ImageNet is a simple alternative to ImageNet that has
higher consistency with tiny object detection from the
perspective of object size. However, from our experi-
ments, only for some methods did pre-training on ImageNet32
Table 8: The difference in mAP^50_tiny between the standard backbones and our bottom-heavy (BH) versions, calculated as the mAP^50_tiny of the BH architecture minus that of the corresponding standard architecture.

Method | CIFAR-100 | ImageNet32 | ImageNet
Faster R-CNN (Ren et al., 2015) | +3.51 | +4.79 | +4.68
Faster R-CNN-SM (Yu et al., 2020) | +3.48 | +1.39 | +0.15
Adap-FCOS (ucas-vg, 2020) | +5.13 | +6.07 | +1.74
Faster R-CNN-RFLA (Xu et al., 2022) | +2.93 | +4.29 | +3.41
RetinaNet-SSPNet (Hong et al., 2022) | +0.86 | +4.06 | +1.72
Cascade R-CNN-SSPNet (Hong et al., 2022) | +0.17 | +2.23 | +1.29
Faster R-CNN-SSPNet (Hong et al., 2022) | +1.73 | +5.74 | +0.39
Average change in mAP^50_tiny | +2.54 | +3.96 | +2.04
Table 9: Detection performance of previous methods and our modified architecture on WiderFace. Pre-training is done on ImageNet. BN, GN and DCN indicate batch normalization, group normalization and deformable convolution networks, respectively.

Columns: Backbone | mAP_easy | mAP_medium | mAP_hard | FLOPs

TinaFace w BN (Zhu et al., 2020):
ResNet50 | 95.77 | 95.43 | 92.23 | 191.30
BH-ResNet50 | 95.56 | 95.71 | 92.28 | 191.54

TinaFace w GN & DCN (Zhu et al., 2020):
ResNet50 | 96.27 | 95.67 | 93.06 | 184.29
BH-ResNet50 | 96.61 | 95.98 | 93.26 | 188.60
outperform pre-training on ImageNet. It is
likely that ImageNet32 is more relevant to tiny object
detection but the blurring of images caused by down-
sampling also makes it harder to learn useful repre-
sentations from this dataset. The low top-1 classifica-
tion accuracy, as shown in Table 5, provides evidence
for this. CIFAR-100 also uses images that are consis-
tent in size with the target objects in tiny object detec-
tion, but detection is much worse overall when using
CIFAR-100 for pre-training, presumably because the
small size of this dataset means that the representa-
tions that are learnt generalise more poorly.
When pre-training on ImageNet, although our
method still outperforms the state-of-the-art methods,
the improvement is not as significant as when pre-
training with CIFAR-100 and ImageNet32. This dif-
ference is likely due to two reasons. One is that our
backbones work better when they are pre-trained on
a dataset that is more consistent with the tiny objects
because our backbones are designed specifically for
such objects. Another reason is that we have simply used
the same hyper-parameters to pre-train our BH net-
works on ImageNet as are used for the corresponding
standard architectures, as we did not have the compu-
tational resources to search for better hyper-
parameters for pre-training BH networks
on such a large dataset.
The results for WiderFace are shown in Ta-
ble 9. We evaluate performance using the current best
method, TinaFace (Zhu et al., 2020), with two differ-
ent settings. Our backbones outperform the standard
backbones on the medium and hard subsets in both
cases, and achieve a state-of-the-art result of 96.61%
on the easy subset.
Overall, improved performance was generated us-
ing different backbone families (ResNet and HRNet),
different network depths, pre-training with different
datasets (CIFAR-100, ImageNet and down-sampled
ImageNet), and when integrating the proposed back-
bones into several state-of-the-art tiny object detec-
tion frameworks. These results demonstrate that
the proposed method of network modification gener-
alises, and could potentially be used with other ar-
chitectures to make them more suitable for small and
tiny object classification and detection. Furthermore,
our results demonstrate that the use of standard back-
bones is harmful for tiny object detection. The results
also suggest that current architectures are potentially
wasting computational resources by performing un-
necessary computations in deep layers.
6 CONCLUSION
This paper proposes that standard deep neural net-
work architectures used as feature-extraction front-
ends in object-detection algorithms produce poor fea-
tures for tiny object detection tasks. Our claim is that
early down-sampling in such architectures results in
information loss and features that are poor at rep-
resenting small objects. To test this claim we de-
signed bottom-heavy versions of popular backbone
architectures, ResNet and HRNet, that increase the
number of convolutional layers in the shallow, high-
resolution, stages of the network, and have fewer con-
volutional layers at later stages in the network. These
changes in the distribution of the convolutional lay-
ers are made to ensure that the computational com-
plexity of the modified backbones matches that of
the original, standard, network. Experimental results
show that these changes, despite reducing the num-
ber of parameters in the networks, result in more ac-
curate object detection across a number of object-
detectors, using different backbones, and pre-training
schemes (CIFAR-100, ImageNet and ImageNet32),
for two standard benchmark datasets (TinyPerson and
WiderFace). The architectures proposed in this paper
are not a final answer to the information loss prob-
lem. Rather they serve as a motivation for developing
improved feature-extraction backbones appropriate to
the task. Hopefully, our current results and insights
will inspire the development of even better backbones
for tiny object detectors in future research.
ACKNOWLEDGEMENTS
The authors acknowledge the use of the research
computing facilities at King’s College London, Ros-
alind, the King’s Computational Research, Engineer-
ing and Technology Environment (CREATE) (Lon-
don, 2022), and the Joint Academic Data science En-
deavour (JADE) facility. This research was funded by
the King’s - China Scholarship Council (K-CSC).
REFERENCES
Bai, Y., Zhang, Y., Ding, M., and Ghanem, B. (2018). Find-
ing tiny faces in the wild with generative adversarial
network. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Bhandari, A. and Tiwari, K. (2021). Loss of target informa-
tion in full pixel and subpixel target detection in hy-
perspectral data with and without dimensionality re-
duction. Evolving Systems, 12.
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion.
Cai, Z. and Vasconcelos, N. (2018). Cascade r-cnn: Delv-
ing into high quality object detection. 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 6154–6162.
Chrabaszcz, P., Loshchilov, I., and Hutter, F. (2017). A
downsampled variant of imagenet as an alternative to
the cifar datasets. ArXiv, abs/1707.08819.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In 2005 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition (CVPR’05), volume 1, pages 886–893
vol. 1.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Deng, J., Guo, J., Ververas, E., Kotsia, I., and Zafeiriou, S.
(2020). Retinaface: Single-shot multi-level face local-
isation in the wild. In CVPR.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams,
C. K. I., Winn, J., and Zisserman, A. (2015). The pas-
cal visual object classes challenge: A retrospective.
International Journal of Computer Vision, 111(1):98–
136.
Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008).
A discriminatively trained, multiscale, deformable
part model. In 2008 IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–8.
Girshick, R. (2015). Fast R-CNN. In International Confer-
ence on Computer Vision (ICCV).
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In 2014 IEEE Conference
on Computer Vision and Pattern Recognition, pages
580–587.
Gong, Y., Yu, X., Ding, Y., Peng, X., Zhao, J., and Han, Z.
(2021). Effective fusion factor in fpn for tiny object
detection. 2021 IEEE Winter Conference on Applica-
tions of Computer Vision (WACV), pages 1159–1167.
Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu,
J., Yao, Y., Zhang, A., Zhang, L., Han, W., Huang,
M., Jin, Q., Lan, Y., Liu, Y., Liu, Z., Lu, Z., Qiu, X.,
Song, R., Tang, J., Wen, J.-R., Yuan, J., Zhao, W. X.,
and Zhu, J. (2021). Pre-trained models: Past, present
and future. AI Open, 2:225–250.
He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spa-
tial Pyramid Pooling in Deep Convolutional Net-
works for Visual Recognition. arXiv e-prints, page
arXiv:1406.4729.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
Hong, M., Li, S., Yang, Y., Zhu, F., Zhao, Q., and Lu, L.
(2022). Sspnet: Scale selection pyramid network for
tiny person detection from uav images. IEEE Geo-
science and Remote Sensing Letters, 19:1–5.
Hu, P. and Ramanan, D. (2017). Finding tiny faces. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Jiahuan, R., Zhao, Z., Jicong, F., Haijun, Z., Mingliang, X.,
and Meng, W. (2021). Robust low-rank deep feature
recovery in CNNs: Toward low information loss and
fast convergence. In 2021 IEEE International Confer-
ence on Data Mining (ICDM), pages 529–538.
Jiang, N., Yu, X., Peng, X., Gong, Y., and Han, Z. (2021).
Sm+: Refined scale match for tiny person detec-
tion. In ICASSP 2021 - 2021 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 1815–1819.
Krishna, H. and Jawahar, C. (2017). Improving small ob-
ject detection. In 2017 4th IAPR Asian Conference on
Pattern Recognition (ACPR), pages 340–345.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple
layers of features from tiny images. Technical Re-
port 0, University of Toronto, Toronto, Ontario.
Lin, T., Goyal, P., Girshick, R. B., He, K., and Dollár, P.
(2017a). Focal loss for dense object detection. CoRR,
abs/1708.02002.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick,
R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
and Zitnick, C. L. (2014). Microsoft COCO: common
objects in context. CoRR, abs/1405.0312.
Lin, T.-Y., Dollár, P., Girshick, R. B., He, K., Hariharan,
B., and Belongie, S. J. (2017b). Feature pyramid net-
works for object detection. 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 936–944.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P.
(2017c). Focal loss for dense object detection. In
Proceedings of the IEEE International Conference on
Computer Vision (ICCV).
Liu, G., Han, J., and Rong, W. (2021a). Feedback-driven
loss function for small object detection. Image and
Vision Computing, 111:104197.
Liu, J., Gu, Y., Han, S., Zhang, Z., Guo, J., and Cheng, X.
(2021b). Feature rescaling and fusion for tiny object
detection. IEEE Access, 9:62946–62955.
Liu, Q., Tan, Z., Chen, D., Chu, Q., Dai, X., Chen, Y., Liu,
M., Yuan, L., and Yu, N. (2022). Reduce information
loss in transformers for pluralistic image inpainting.
ArXiv, abs/2205.05076.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S. E.,
Fu, C., and Berg, A. C. (2015). SSD: single shot multi-
box detector. CoRR, abs/1512.02325.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021c). Swin transformer: Hierar-
chical vision transformer using shifted windows. In
Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision (ICCV).
London, K. C. (2022). King’s computational research,
engineering and technology environment (CREATE).
https://doi.org/10.18742/rnvf-m076/.
Luo, S., Li, X., Zhu, R., and Zhang, X. (2019). Sfa: Small
faces attention face detector. IEEE Access, 7:171609–
171620.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 22(10):1345–1359.
Redmon, J., Divvala, S. K., Girshick, R. B., and Farhadi, A.
(2015). You only look once: Unified, real-time object
detection. CoRR, abs/1506.02640.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. CoRR, abs/1804.02767.
Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster
R-CNN: towards real-time object detection with re-
gion proposal networks. CoRR, abs/1506.01497.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
van den Driessche, G., Schrittwieser, J., Antonoglou,
I., Panneershelvam, V., Lanctot, M., Dieleman, S.,
Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I.,
Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel,
T., and Hassabis, D. (2016). Mastering the game of
Go with deep neural networks and tree search. Na-
ture, 529(7587):484–489.
Singh, B., Najibi, M., and Davis, L. S. (2018). SNIPER:
Efficient multi-scale training. NeurIPS.
Springenberg, J. T., Dosovitskiy, A., Brox, T., and Ried-
miller, M. A. (2015). Striving for simplicity: The all
convolutional net. CoRR, abs/1412.6806.
Tan, M., Pang, R., and Le, Q. V. (2020). Efficientdet: Scal-
able and efficient object detection. 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 10778–10787.
Tang, X., Du, D. K., He, Z., and Liu, J. (2018). Pyra-
midbox: A context-assisted single shot face detector.
In Proceedings of the European Conference on Com-
puter Vision (ECCV).
Tian, Z., Shen, C., Chen, H., and He, T. (2019). Fcos:
Fully convolutional one-stage object detection. 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 9626–9635.
Tong, K. and Wu, Y. (2022). Deep learning-based detec-
tion from the perspective of small or tiny objects: A
survey. Image and Vision Computing, 123:104471.
ucas-vg (2020). TOV mmdetection.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2021a).
Scaled-YOLOv4: Scaling cross stage partial network.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
13029–13038.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao,
Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., and
Xiao, B. (2021b). Deep high-resolution representation
learning for visual recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 43:3349–
3364.
Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., and Xia, G.-S.
(2022). Rfla: Gaussian receptive based label assign-
ment for tiny object detection. In European Confer-
ence on Computer Vision (ECCV).
Yang, S., Luo, P., Loy, C. C., and Tang, X. (2016). Wider
face: A face detection benchmark. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Yang, Z., Chai, X., Wang, R., Guo, W., Wang, W., Pu, L.,
and Chen, X. (2019). Prior knowledge guided small
object detection on high-resolution images. In 2019
IEEE International Conference on Image Processing
(ICIP), pages 86–90.
Yu, X., Gong, Y., Jiang, N., Ye, Q., and Han, Z. (2020).
Scale match for tiny person detection. 2020 IEEE
Winter Conference on Applications of Computer Vi-
sion (WACV), pages 1246–1254.
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo,
Y. (2019). Cutmix: Regularization strategy to train
strong classifiers with localizable features. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV).
Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., and Li, S. Z.
(2017). S3fd: Single shot scale-invariant face detector.
In Proceedings of the IEEE International Conference
on Computer Vision (ICCV).
Zhang, Z., Shen, W., Qiao, S., Wang, Y., Wang, B.,
and Yuille, A. L. (2018). Robust face detection
via learning small faces on hard images. CoRR,
abs/1811.11662.
Zhao, G., Wang, J., and Zhang, Z. (2017). Random shift-
ing for CNN: a solution to reduce information loss
in down-sampling layers. In Sierra, C., editor, Pro-
ceedings of the Twenty-Sixth International Joint Con-
ference on Artificial Intelligence, IJCAI 2017, Mel-
bourne, Australia, August 19-25, 2017, pages 3476–
3482. ijcai.org.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D.
(2020). Distance-IoU loss: Faster and better learning
for bounding box regression. In The AAAI Conference
on Artificial Intelligence (AAAI).
Zhu, Y., Cai, H., Zhang, S., Wang, C., and Xiong, Y. (2020).
Tinaface: Strong but simple baseline for face detec-
tion. arXiv preprint arXiv:2011.13183.