Pyramid Swin Transformer: Different-Size Windows Swin Transformer for Image Classification and Object Detection

Chenyu Wang (1,2), Toshio Endo (1), Takahiro Hirofuchi (2) and Tsutomu Ikegami (2)
(1) Tokyo Institute of Technology, Tokyo, Japan
(2) National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
ORCID: Chenyu Wang https://orcid.org/0000-0001-8770-3275, Toshio Endo https://orcid.org/0000-0001-7297-6211, Takahiro Hirofuchi https://orcid.org/0000-0002-1253-6625, Tsutomu Ikegami https://orcid.org/0000-0003-2977-6390
Keywords:
Swin Transformer, Object Detection, Image Classification, Feature Pyramid Network, Multiscale.
Abstract:
We present the Pyramid Swin Transformer for object detection and image classification. It takes advantage of more shifted-window operations and smaller windows of more different sizes, and it adds a Feature Pyramid Network for object detection, which yields excellent results. The architecture is implemented in four stages, each containing layers with different window sizes. We evaluate our architecture on ImageNet classification and COCO detection: the Pyramid Swin Transformer achieves 85.4% top-1 accuracy on ImageNet classification and 54.3 box AP on COCO.
1 INTRODUCTION
Image classification and object detection are both critical tasks in computer vision; they are used to identify and categorize objects such as humans, animals, fruits, or buildings. Object detection is more complicated than image classification because it must also recognize the positions of objects and mark them. Object detection is currently used for security, medicine, self-driving cars, identity verification, and other purposes, and it has expanded rapidly in recent years along with the development of new tools and procedures. Since convolutional neural networks were first applied successfully in computer vision, research on them has flourished, and they have been used for various computer vision tasks, including image classification and object detection. In 2012, AlexNet (Krizhevsky et al., 2017) set a new accuracy record in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015).
After AlexNet (Krizhevsky et al., 2017), CNN research grew explosively, and studies of various backbone networks such as GoogLeNet (Szegedy et al., 2015), VGG (Simonyan and Zisserman, 2014), and ResNet (He et al., 2016) followed one after another.
Figure 1: Pyramid Swin Transformer. a) The model used for image classification, where each layer performs feature sampling with the Swin Transformer; b) the model used for object detection, which adds a feature pyramid network to model a).
At the same time, research on object detection frameworks has also been advancing. Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) are two-stage detectors: a proposal module presents candidate regions, which the network then classifies as objects or background. Two-stage detectors tend to be slower but more accurate than single-stage detectors such as YOLO (Redmon et al., 2016) and the Single Shot MultiBox Detector (SSD) (Liu et al., 2016).
More recently, convolutional neural networks (CNNs) have been outperformed by Vision Transformers (ViT), which have shown great potential and have been adapted for various vision applications (Arnab et al., 2021; Carion et al., 2020; Beal et al., 2020; Strudel et al., 2021; Wang et al., 2021). The Transformer (Vaswani et al., 2017) is a relatively new deep learning model, yet it has been widely researched and applied in natural language processing (NLP) and computer vision (CV). The Transformer was initially introduced as a
sequence-to-sequence model for machine translation (Sutskever et al., 2014). Later studies showed that Transformer-based pre-trained models (PTMs) (Qiu et al., 2020) can achieve state-of-the-art performance on various tasks. As a result, the Transformer has become the preferred design in NLP, particularly for PTMs. Recent research (Dosovitskiy et al., 2020) has shown that the Transformer also performs well on CV tasks, and it has naturally been adopted in CV (Dosovitskiy et al., 2020; Liu et al., 2021) and audio processing (Chen et al., 2021; Dong et al., 2018).
Our Pyramid Swin Transformer is an improved version of the Swin Transformer (Liu et al., 2021), and we propose two models, one for image classification and one for object detection. We improve on the original Swin Transformer by using smaller windows of more different sizes and more shifted-window operations, which enhances detection performance. Compared to the Swin Transformer, our Pyramid Swin Transformer uses windows of varying sizes at a single scale to perform multiple window multi-head self-attention computations, as shown in Figure 2. This slightly increases the amount of computation, but it better solves the problem of information exchange between windows.
2 RELATED WORK
Applying the Transformer to computer vision is arguably one of the most important attempts in the history of the field. Transformer-like frameworks allow us to move beyond CNNs and capture more global context. With more in-depth research, such models are becoming increasingly applicable to computer vision and capable of performing more tasks quickly and accurately. There is no doubt that the Swin Transformer (Liu et al., 2021) is a promising improvement.
2.1 Feature Pyramid Network
The Feature Pyramid Network (FPN) (Lin et al., 2017) is a fast, high-accuracy feature extractor designed around the pyramid concept. It can replace the feature extractors of detectors such as Faster R-CNN (Ren et al., 2015) and generates many feature layers (multi-scale feature maps) that carry higher-quality information for object detection than traditional feature pyramids. FPN consists of a bottom-up and a top-down pathway. The bottom-up pathway is the standard convolutional network for feature extraction: spatial resolution decreases as we ascend, while the semantic value of each layer grows as more high-level structures are recognized. FPN adds a top-down pathway to build higher-resolution layers from a semantically rich layer. While the reconstructed layers are semantically strong, object positions are no longer exact after all the down-sampling and up-sampling, so FPN adds lateral connections between the reconstructed layers and the corresponding feature maps to improve the detector's predictions. FPN has been widely employed in frameworks for object detection (Ronneberger et al., 2015; Zhang et al., 2018; Peng et al., 2018) and semantic segmentation (Liu et al., 2018) because of its excellent results and practicality, and all of them have obtained successful outcomes. FPN has been shown to significantly improve object detection accuracy at a modest increase in processing cost.
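To make the top-down pathway and the lateral connections concrete, the following is a minimal FPN sketch in PyTorch; the module names and channel widths are assumptions for illustration, not the implementation used later in this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=96):
        super().__init__()
        # 1x1 lateral convs project every backbone stage to a common width.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth the merged maps.
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: list of backbone maps, highest resolution first.
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample the coarser map and add it to the lateral below.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(x) for s, x in zip(self.smooths, laterals)]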
2.2 Swin Transformer
The first part of the name Swin Transformer (Liu et al., 2021) is derived from Shifted Windows, which is also its main feature. The idea of shifted windows is not new to the research community: thanks to its efficiency, it is one of the aspects that helped CNNs succeed in computer vision, but it had not been used in Transformers before. The original intention of the Swin Transformer's authors was to make the Vision Transformer behave like a convolutional neural network that can be divided into several blocks for cascaded feature extraction, which leads to the concept of multi-scale features. The standard Transformer (Dosovitskiy et al., 2020) faces some challenges when applied directly to the vision domain, mainly from two aspects: the different scales of objects and the large resolution of images. The scale problem does not exist in natural language processing. To address the resulting sequence-length problem, researchers have made a series of attempts, including feeding later (downsampled) feature maps to the Transformer, breaking the image into multiple patches to reduce its resolution, and dividing the image into small windows and computing self-attention (Vaswani et al., 2017) within each window. Because self-attention is computed within a window, the computational complexity grows linearly rather than quadratically with image size, giving the Swin Transformer the ability to pre-train models at particularly high resolutions.
Figure 2: Pyramid Swin Transformer. The right part shows the overall Pyramid Swin Transformer, where each layer is computed by the Swin Transformer, and the far-right part is the Feature Pyramid Network. The left part shows our Pyramid Swin Transformer layer, where we use windows of different sizes to compute multi-head self-attention.
The hierarchical structure also has the advantage of being flexible enough to provide information at different scales. Following the original Vision Transformer, the Swin Transformer is likely one of the most exciting pieces of research: it resolved the original ViT's problems by using hierarchical feature maps and shifted-window multi-head self-attention, and it is now widely applied as a backbone architecture in various vision tasks, such as image classification and object detection.
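As a rough illustration of the shifted-window idea (not the authors' code), the partition into non-overlapping windows and the cyclic shift can be sketched in PyTorch as follows; window_partition and the window size ws are illustrative names.

import torch

def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

B, H, W, C, ws = 1, 8, 8, 96, 4
x = torch.randn(B, H, W, C)
windows = window_partition(x, ws)                  # regular window partition
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)    # shifted-window partition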
The Swin Transformer does a wonderful job of resolving the multi-scale and computational-complexity issues of ViT (Dosovitskiy et al., 2020), but the use of window multi-head self-attention also introduces a new issue: the windows become independent of one another. Although the authors incorporated shifted-window multi-head self-attention, which connects certain windows, there is still a lack of information exchange between some windows at large scales. We therefore present the Pyramid Swin Transformer to address this communication problem between windows, and it appears to mitigate the problem to some extent.
3 METHOD
Recent developments in vision Transformer backbone designs are mostly concerned with improvements to the attention operator. A new topology design can add a new dimension to ViTs, allowing even more powerful visual expressivity. One of the best is the Swin Transformer (Liu et al., 2021), on which our research is based. In this section, we describe the structure of our Pyramid Swin Transformer.
3.1 Architecture
Our architecture is mainly based on the Swin Transformer (Liu et al., 2021). As shown in Figure 2, our architecture for object detection adds a feature pyramid part and adjusts the original Swin Transformer. We use a hierarchical network in which the first stage has the largest feature map (64 × 64). In this stage, we partition the feature map in four ways (into 16 × 16, 8 × 8, 4 × 4, and 2 × 2 windows), whose corresponding window sizes are 4 × 4, 8 × 8, 16 × 16, and 32 × 32. For a 32 × 32 window, the self-attention sequence length is 1024, which incurs a large amount of computation, so we use this window size only in stage 1; the other stages never use it.
Each layer consists of two steps: window multi-head self-attention followed by shifted-window multi-head self-attention. Every layer is structured the same way and includes two multi-head self-attention computations, just as in the Swin Transformer (Liu et al., 2021) and as shown in Figure 3, except that we split the feature map into smaller blocks, and the number of different window sizes per layer follows a hierarchical progression from more to fewer, which is more conducive to global connectivity. To limit computation, we compute shifted-window multi-head self-attention only once per layer. The last layer of each stage, except the fourth stage, has a 2 × 2 window, which increases window-to-window information exchange and, as a result, global relevance.
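The following sketch illustrates, under our own naming assumptions, how window multi-head self-attention with several window sizes could be applied to one feature map and merged by a simple average; the actual Pyramid Swin layer details (normalization, MLPs, shifted windows, and the real merging rule) are omitted and assumed.

import torch
import torch.nn as nn

def window_partition(x, ws):
    # (B, H, W, C) -> (B * num_windows, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, B, H, W, C):
    # inverse of window_partition
    win = win.reshape(B, H // ws, W // ws, ws, ws, C)
    return win.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class MultiSizeWindowAttention(nn.Module):
    def __init__(self, dim=96, window_sizes=(4, 8, 16, 32), num_heads=3):
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )

    def forward(self, x):
        # x: (B, H, W, C); H and W must be divisible by every window size.
        B, H, W, C = x.shape
        out = torch.zeros_like(x)
        for ws, attn in zip(self.window_sizes, self.attn):
            win = window_partition(x, ws)
            y, _ = attn(win, win, win)                  # self-attention inside each window
            out = out + window_reverse(y, ws, B, H, W, C)
        return out / len(self.window_sizes)             # assumed merge: plain average

layer = MultiSizeWindowAttention()
print(layer(torch.randn(1, 64, 64, 96)).shape)          # torch.Size([1, 64, 64, 96])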
The overall architecture is shown in Table 1. The input image size is 256 × 256. We call the base model Pyramid Swin-R: its first stage has 4 layers, the second stage 3 layers, the third stage 2 layers, and the last stage 2 layers. We also implement a larger model, Pyramid Swin-L, which differs from Pyramid Swin-R only in the number of channels and layers. The details are as follows:
Pyramid Swin-R: C = 96, layers = {4, 3, 2, 2}
Pyramid Swin-L: C = 192, layers = {4, 3, 3, 2}
where C is the channel number of the first stage. Each layer consists of two multi-head self-attention computations.
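For reference, the two variants can be written down as a plain configuration; this is only a sketch, and only the values listed above and in Table 1 are taken from the paper.

# Variant configuration sketch; C is the stage-1 channel count and channels
# double at every subsequent stage (see Table 1).
PYRAMID_SWIN_CONFIGS = {
    "pyramid_swin_r": {"channels": 96,  "layers": (4, 3, 2, 2)},
    "pyramid_swin_l": {"channels": 192, "layers": (4, 3, 3, 2)},
}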
We utilize different window sizes in each layer, and each layer complements the previous one. The problem with the Swin Transformer was insufficient information exchange between windows at the low semantic levels; our improvements enhance the information exchange between separate windows. The main concept behind the Pyramid Swin Transformer is to keep applying windows of various sizes at a single scale to improve their direct information exchange.
Suppose each window has size 2^i × 2^i on a feature map of size h × w. The computational complexity of a global multi-head self-attention module and of a window-based one is:

Ω(MSA) = 4hwC^2 + 2(hw)^2 C,
Ω(W-MSA) = 4hwC^2 + 2^(2i+1) hwC,
where C is the number of channels. The former is quadratic in the feature map size h × w, while the latter depends mainly on i, with i ∈ {0, ..., 5}. In our design, because the computation is too large when i = 5, we minimize the use of i = 5 throughout the framework; in fact, we use i = 5 only once, at the first layer of the first stage. While window multi-head self-attention scales well with h × w, global self-attention computation is typically costly, so window self-attention has excellent potential for lowering computational requirements.
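A quick numeric check of these two formulas, using the symbols h, w, C, and i from the text, can be written as follows (a sketch that counts only the dominant terms):

def msa_flops(h, w, C):
    # global multi-head self-attention: quadratic in h * w
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, i):
    # window of size 2^i x 2^i  ->  M^2 = 2^(2i), giving the 2^(2i+1) h w C term
    return 4 * h * w * C**2 + 2 ** (2 * i + 1) * h * w * C

h = w = 64
C = 96
print(msa_flops(h, w, C))        # global attention on the stage-1 map
print(wmsa_flops(h, w, C, 5))    # 32 x 32 windows, the costly case used only once
print(wmsa_flops(h, w, C, 2))    # 4 x 4 windows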
For self-attention computation, we follow (Raffel et al., 2020; Hu et al., 2019; Liu et al., 2021) in including a relative position bias β ∈ R^(M^2 × M^2) for each head:

Attention(Q, K, V) = SoftMax(QK^T / √d + β) V,

where Q, K, V ∈ R^(M^2 × d) are the query, key, and value matrices, d is the query/key dimension, and M^2 is the window size (the number of tokens in a window).
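A hedged sketch of window attention with a relative position bias β follows. Parameterizing the bias by relative offsets is common practice but an assumption here, as are all module and variable names.

import torch
import torch.nn as nn

class WindowAttentionWithBias(nn.Module):
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        M = window_size
        # One learnable bias per relative offset, gathered into an M^2 x M^2 table.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        coords = coords.flatten(1)                              # (2, M^2)
        rel = coords[:, :, None] - coords[:, None, :]           # (2, M^2, M^2)
        rel = rel.permute(1, 2, 0) + (M - 1)                    # shift offsets to >= 0
        index = rel[..., 0] * (2 * M - 1) + rel[..., 1]         # (M^2, M^2)
        self.register_buffer("index", index)

    def forward(self, x):                                       # x: (B_, M^2, C)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                    # each (B_, heads, N, d)
        attn = (q * self.scale) @ k.transpose(-2, -1)           # QK^T / sqrt(d)
        bias = self.bias_table[self.index].permute(2, 0, 1)     # beta: (heads, N, N)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(x)

attn = WindowAttentionWithBias(dim=96, window_size=8, num_heads=3)
print(attn(torch.randn(4, 64, 96)).shape)   # 4 windows of 8 x 8 = 64 tokens each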
3.2 Pyramid Swin for Object Detection
We now describe how to apply the Pyramid Swin Transformer backbone to object detection. The hierarchical structure of the Pyramid Swin Transformer generates multi-scale feature maps in four stages, which integrate neatly into a Feature Pyramid Network (FPN) for object detection applications, as shown in Figure 2. Our Pyramid Swin Transformer creates semantically robust feature maps at all scales using the top-down pathway with lateral connections in the FPN, and with this FPN on top of the Pyramid Swin Transformer backbone we can apply it to different detection architectures. In the Feature Pyramid Network, we use pixel shuffle to upsample the small feature maps, which in our experiments achieves better results than the usual pooling method. We always set the feature pyramid's channel count to 96, and it corresponds to the Swin Transformer portion on the left side of the pyramid.
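As an illustration of the pixel-shuffle upsampling mentioned above, a minimal sketch is given below; the channel choices and module names are assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class PixelShuffleUpsample(nn.Module):
    """Upsample a (B, C, H, W) map by 2x while keeping C channels."""
    def __init__(self, channels=96, scale=2):
        super().__init__()
        # Expand channels so pixel shuffle can trade them for spatial resolution.
        self.expand = nn.Conv2d(channels, channels * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.expand(x))

up = PixelShuffleUpsample(96)
coarse = torch.randn(1, 96, 8, 8)
print(up(coarse).shape)   # torch.Size([1, 96, 16, 16])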
Object detection typically uses training inputs with a wider range of sizes than ImageNet classification, where the input image is a crop with a fixed resolution (such as 224 × 224). We initialize the parameters from the ImageNet pre-training weights, whose position embeddings correspond to a 256 × 256 input size, and then interpolate them to the required sizes for object detection training.
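A sketch of this interpolation step is shown below, assuming a flattened 64 × 64 token grid for the 256 × 256 pre-training input and bicubic resampling; the exact embedding layout and interpolation mode used by the authors are not specified here.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    # pos_embed: (1, old_grid * old_grid, C) -> (1, new_grid * new_grid, C)
    C = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, old_grid, old_grid, C).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, C)

pe = torch.randn(1, 64 * 64, 96)            # embedding for a 64 x 64 token grid
print(resize_pos_embed(pe, 64, 80).shape)   # torch.Size([1, 6400, 96]), i.e. an 80 x 80 grid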
4 EXPERIMENT
We conduct experiments on ImageNet-1K image classification (Deng et al., 2009) and COCO object detection (Lin et al., 2014). In the following sections, we compare the proposed Pyramid Swin Transformer architecture with the prior state of the art on these two tasks.
4.1 Image Classification on ImageNet
Settings. For a fair comparison, we benchmark the proposed Pyramid Swin Transformer on ImageNet-1K (Deng et al., 2009), which contains 1.28M training images and 50K validation images from 1,000 classes. Top-1 accuracy on a single crop is reported. Our training recipe follows the Swin Transformer (Liu et al., 2021). ImageNet-1K training mostly follows (Touvron et al., 2021): we employ an AdamW (Kingma and Ba, 2014) optimizer for 300 epochs with a cosine-decay learning-rate scheduler, the same as the Swin Transformer (Liu et al., 2021). We include most of the augmentation and regularization strategies of (Touvron et al., 2021) in training, except for repeated augmentation (Hoffer et al., 2020) and EMA (Polyak and Juditsky, 1992). Note that this contrasts with the situation where such augmentation is essential to stabilize ViT training (Dosovitskiy et al., 2020).
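The optimizer and schedule named above can be set up as follows; the learning rate and weight decay shown are illustrative assumptions, not the paper's values.

import torch

model = torch.nn.Linear(10, 10)   # stand-in for the actual backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training epoch over ImageNet-1K would run here ...
    optimizer.step()              # placeholder step so the loop is runnable
    scheduler.step()              # cosine decay of the learning rate per epoch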
Image Classification on ImageNet. Table 3 shows the results of our Pyramid Swin Transformer and state-of-the-art CNNs and Transformers.
Table 1: Pyramid Swin Transformer detailed architecture specifications. Input image size is 256 × 256.

Pyramid Swin-R:
Stage   | Output Size | Layers | Channels | Windows          | Window size       | Heads
Stage 1 | 64²         | 4      | 96       | 16², 8², 4², 2²  | 4², 8², 16², 32²  | 3
Stage 2 | 32²         | 3      | 192      | 8², 4², 2²       | 4², 8², 16²       | 6
Stage 3 | 16²         | 2      | 384      | 4², 2²           | 4², 8²            | 12
Stage 4 | 8²          | 2      | 768      | 2², 1²           | 4², 8²            | 24

Pyramid Swin-L:
Stage   | Output Size | Layers | Channels | Windows          | Window size       | Heads
Stage 1 | 64²         | 4      | 192      | 16², 8², 4², 2²  | 4², 8², 16², 32²  | 3
Stage 2 | 32²         | 3      | 384      | 8², 4², 2²       | 4², 8², 16²       | 6
Stage 3 | 16²         | 3      | 768      | 8², 4², 2²       | 2², 4², 8²        | 12
Stage 4 | 8²          | 2      | 1536     | 2², 1²           | 4², 8²            | 24
Table 2: Test environment.
CPU: Intel(R) Xeon(R) Silver 4110
Memory: 16 GB
GPU: NVIDIA Tesla V100 PCIe
GPU Memory: 16 GB
PyTorch: 1.7.1
CUDA: 11.6
OS: Ubuntu 18.04
The models are grouped by computation. Compared with state-of-the-art convolutional networks and vision Transformer models such as RegNet (Radosavovic et al., 2020), EfficientNet (Tan and Le, 2019), CoAtNet (Dai et al., 2021), ViT (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021), MViT (Fan et al., 2021), Swin (Liu et al., 2021), and SwinV2 (Liu et al., 2022), our Pyramid Swin Transformer achieves slightly better accuracy. All details are shown in Table 3.
Comparison on ImageNet. Our design outperforms several CNN systems even when we use the regular model (Pyramid Swin-R). In image classification, our design has no evident advantage over other Transformer systems. Compared to the Swin Transformer (Liu et al., 2021), our improved Pyramid Swin Transformer achieves greater accuracy while using fewer FLOPs and parameters. For example, Pyramid Swin-R (84.6%) achieves the same result as SwinV2-B (Liu et al., 2022). This suggests that the overall Swin Transformer architecture has limitations and that simply increasing the number of parameters does not yield good results. On the regular-size model, Pyramid Swin-R (84.6%) improves +0.1% over Swin-B, with fewer FLOPs and parameters.
Table 3: Comparison with previously reported ImageNet-1K work. We pre-train on ImageNet-1K. Pyramid Swin is trained for 300 epochs without any external data or models.

Method          | Resolution | Params | FLOPs | Top-1 Acc.
RegNetY-4G      | 224²       | 21M    | 4G    | 80.0
RegNetY-8G      | 224²       | 39M    | 8G    | 81.7
RegNetY-16G     | 224²       | 84M    | 16G   | 82.9
EfficientNet-B1 | 240²       | 8M     | 1G    | 78.8
EfficientNet-B2 | 260²       | 9M     | 1G    | 79.8
EfficientNet-B3 | 300²       | 12M    | 2G    | 81.6
EfficientNet-B4 | 380²       | 19M    | 4G    | 82.9
EfficientNet-B5 | 456²       | 30M    | 10G   | 83.6
EfficientNet-B6 | 528²       | 43M    | 19G   | 84.0
EfficientNet-B7 | 600²       | 66M    | 37G   | 84.4
CoAtNet-0       | 224²       | 25M    | 4G    | 81.6
CoAtNet-1       | 224²       | 42M    | 8G    | 83.3
CoAtNet-2       | 224²       | 75M    | 16G   | 84.1
CoAtNet-3       | 224²       | 168M   | 35G   | 84.6
ViT-B/16        | 384²       | 86M    | 55G   | 77.9
ViT-L/16        | 384²       | 307M   | 191G  | 76.5
DeiT-S          | 224²       | 22M    | 5G    | 79.8
DeiT-B          | 224²       | 86M    | 18G   | 81.8
DeiT-B          | 384²       | 86M    | 55G   | 83.1
MViT-B-16       | 224²       | 37M    | 8G    | 83.0
MViT-B-24       | 224²       | 72M    | 15G   | 84.0
MViT-B-24       | 320²       | 73M    | 33G   | 84.8
Swin-T          | 224²       | 28M    | 5G    | 81.3
Swin-S          | 224²       | 50M    | 9G    | 83.0
Swin-B          | 224²       | 88M    | 15G   | 83.5
Swin-B          | 384²       | 88M    | 47G   | 84.5
SwinV2-T        | 256²       | 28M    | 7G    | 82.8
SwinV2-S        | 256²       | 50M    | 13G   | 84.1
SwinV2-B        | 256²       | 88M    | 22G   | 84.6
P. Swin-R       | 256²       | 77M    | 18G   | 84.6
P. Swin-L       | 256²       | 164M   | 39G   | 85.4
On the large-size model, Pyramid Swin-L (85.4%) improves +0.8% over SwinV2-B (84.6%). Our Pyramid Swin Transformer is only +0.1% more accurate than Swin-B at equal model size; however, Swin-B uses a higher resolution. We match SwinV2-B's accuracy at the same image size with less computation. Compared with MViT (320 × 320), our large model (Pyramid Swin-L) has a higher accuracy, but the amount of computation also increases greatly. Compared with MViT (224 × 224), our regular model (Pyramid Swin-R) is +0.6% more accurate. The effect of our architecture on image classification is thus not very pronounced; its main contribution is reducing computation while maintaining accuracy.
4.2 Object Detection on COCO
Settings. We conduct object detection experiments on the Microsoft COCO (Lin et al., 2014) dataset. An ablation study is conducted on the validation set, and test-dev is used for the system-level comparison. We use the standard Mask R-CNN (He et al., 2017) and Cascade Mask R-CNN (Cai and Vasconcelos, 2018) detection frameworks implemented in Detectron. The backbone networks we compare against are ResNet (He et al., 2016), ResNeXt (Xie et al., 2017), PVT-S (Wang et al., 2021), ViL-S-RPB (Zhang et al., 2021), and Swin (Liu et al., 2021). For a fair comparison, we follow the same protocol as the Swin Transformer (Liu et al., 2021) and use the same settings for these frameworks, including multi-scale training (Carion et al., 2020; Sun et al., 2021). For Pyramid Swin, we take the backbone pre-trained on ImageNet-1K. The input sizes of the four stages are set to [64, 32, 16, 8], consistent with the self-attention sizes used in ImageNet-1K pre-training.
With Mask R-CNN. On the regular-size model, our Pyramid Swin achieves the highest accuracy when we use the Mask R-CNN framework. Pyramid Swin-R, at 50.3 box AP, improves +1.8 box AP over Swin-B (Liu et al., 2021) with fewer FLOPs and parameters. Compared to ViL-B-RPB, our Pyramid Swin has an advantage of +0.7 box AP. On the large-size model, Pyramid Swin-L achieves 51.6 box AP, improving +3.1 box AP over Swin-B, albeit with far more FLOPs and parameters.
With Cascade Mask R-CNN. On the regular-size model, our Pyramid Swin also achieves the highest accuracy when we use the Cascade Mask R-CNN framework. Pyramid Swin-R gets 53.6 box AP, improving +1.7 box AP over Swin-B with fewer FLOPs and parameters. For the large-size model, our Pyramid Swin-L achieves 54.3 box AP, improving +2.4 box AP over Swin-B.
Table 4: Results on COCO object detection. C.Mask indicates Cascade Mask R-CNN; R.Pointsv2 indicates RepPointsv2.

a) Mask R-CNN
Model      | AP box | AP mask | FLOPs | Params
Res50      | 41.0   | 37.1    | 260G  | 44M
Res101     | 42.8   | 38.5    | 336G  | 63M
X101-64    | 44.4   | 39.7    | 493G  | 101M
PVT-S      | 43.0   | 39.9    | 245G  | 44M
PVT-M      | 44.2   | 40.5    | 302G  | 64M
PVT-L      | 44.5   | 40.7    | 364G  | 81M
ViL-S-RPB  | 47.1   | 42.1    | 277G  | 45M
ViL-M-RPB  | 48.9   | 44.2    | 352G  | 60M
ViL-B-RPB  | 49.6   | 44.5    | 384G  | 76M
Swin-T     | 46.0   | 41.6    | 264G  | 48M
Swin-S     | 48.5   | 43.3    | 354G  | 69M
Swin-B     | 48.5   | 43.4    | 496G  | 107M
P. Swin-R  | 50.3   | 44.8    | 463G  | 94M
P. Swin-L  | 51.6   | 45.3    | 1014G | 193M

b) Cascade Mask R-CNN
Model      | AP box | AP mask | FLOPs | Params
Res50      | 46.3   | 40.1    | 739G  | 82M
Res101     | 47.7   | 40.8    | 819G  | 101M
Swin-T     | 50.5   | 43.7    | 745G  | 86M
Swin-S     | 51.8   | 44.7    | 838G  | 107M
Swin-B     | 51.9   | 45.0    | 982G  | 145M
P. Swin-R  | 53.6   | 46.4    | 902G  | 136M
P. Swin-L  | 54.3   | 47.1    | 1867G | 273M
Our Pyramid Swin Transformer is less effective in image classification than in object detection because we did not significantly change Swin's original framework, so the improvement on image classification is small. In contrast, a feature pyramid network is added for object detection, yielding more significant gains. As seen with Swin, going from Swin-S to Swin-B increases the number of parameters and FLOPs without a significant gain in accuracy, which may indicate a limit of the Swin framework. We address this weakness by slightly increasing computation while still obtaining a clearly better outcome.
5 CONCLUSION
We present an improved version of the Swin Transformer, the Pyramid Swin Transformer, which uses windows of different sizes to perform multiple window multi-head self-attention operations on
the same scale, improving on the Swin Transformer. For the Pyramid Swin Transformer, we created two models, one for image classification and one for object detection. For image classification, our Pyramid Swin-R achieves the same result as SwinV2-B (Liu et al., 2022) on the ImageNet-1K test, while the Pyramid Swin-L model outperforms the original framework by +0.6%, and we achieve better results for object detection. With the Mask R-CNN framework, Pyramid Swin-R achieves 50.3 box AP and 44.8 mask AP and Pyramid Swin-L achieves 51.6 box AP and 45.3 mask AP; with the Cascade Mask R-CNN framework, our Pyramid Swin-R gains 53.6 box AP and 46.4 mask AP and Pyramid Swin-L achieves 54.3 box AP and 47.1 mask AP, improving significantly over the Swin Transformer's box AP and mask AP. In the future, we aim to run speed tests and develop lighter architectures for more in-depth comparisons with existing architectures. We will also add semantic segmentation on ADE20K (Zhou et al., 2019) and video recognition on Kinetics-400 (Kay et al., 2017), which are also important metrics for judging our architecture.
ACKNOWLEDGEMENT
This work was partly supported by JSPS KAKENHI
Grant Number 20H04165.
REFERENCES
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. (2021). Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846.
Beal, J., Kim, E., Tzeng, E., Park, D. H., Zhai, A., and
Kislyuk, D. (2020). Toward transformer-based object
detection. arXiv preprint arXiv:2012.09958.
Cai, Z. and Vasconcelos, N. (2018). Cascade r-cnn: Delving
into high quality object detection. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 6154–6162.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Chen, X., Wu, Y., Wang, Z., Liu, S., and Li, J. (2021).
Developing real-time streaming transformer trans-
ducer for speech recognition on large-scale dataset.
In ICASSP 2021-2021 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 5904–5908. IEEE.
Dai, Z., Liu, H., Le, Q. V., and Tan, M. (2021). Coatnet:
Marrying convolution and attention for all data sizes.
Advances in Neural Information Processing Systems,
34:3965–3977.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Dong, L., Xu, S., and Xu, B. (2018). Speech-transformer: a
no-recurrence sequence-to-sequence model for speech
recognition. In 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 5884–5888. IEEE.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Ma-
lik, J., and Feichtenhofer, C. (2021). Multiscale vi-
sion transformers. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
6824–6835.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T.,
and Soudry, D. (2020). Augment your batch: Improv-
ing generalization through instance repetition. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8129–8138.
Hu, H., Zhang, Z., Xie, Z., and Lin, S. (2019). Local
relation networks for image recognition. In 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 3463–3472.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., et al. (2017). The kinetics human action
video dataset. arXiv preprint arXiv:1705.06950.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-
agenet classification with deep convolutional neural
networks. Communications of the ACM, 60(6):84–90.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). Path ag-
gregation network for instance segmentation. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 8759–8768.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning,
J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin
transformer v2: Scaling up capacity and resolution.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 12009–
12019.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 10012–10022.
Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K.,
Yu, G., and Sun, J. (2018). Megdet: A large mini-
batch object detector. In Proceedings of the IEEE con-
ference on Computer Vision and Pattern Recognition,
pages 6181–6189.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of
stochastic approximation by averaging. SIAM journal
on control and optimization, 30(4):838–855.
Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., and Huang,
X. (2020). Pre-trained models for natural language
processing: A survey. Science China Technological
Sciences, 63(10):1872–1897.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10428–10436.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. (2020).
Exploring the limits of transfer learning with a uni-
fied text-to-text transformer. J. Mach. Learn. Res.,
21(140):1–67.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. In International Conference on Medical
image computing and computer-assisted intervention,
pages 234–241. Springer.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021).
Segmenter: Transformer for semantic segmentation.
In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 7262–7272.
Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan,
W., Tomizuka, M., Li, L., Yuan, Z., Wang, C.,
et al. (2021). Sparse r-cnn: End-to-end object detec-
tion with learnable proposals. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 14454–14463.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence
to sequence learning with neural networks. Advances
in neural information processing systems, 27.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional conference on machine learning, pages 6105–
6114. PMLR.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D.,
Lu, T., Luo, P., and Shao, L. (2021). Pyramid vi-
sion transformer: A versatile backbone for dense pre-
diction without convolutions. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 568–578.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500.
Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L.,
and Gao, J. (2021). Multi-scale vision longformer:
A new vision transformer for high-resolution image
encoding. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, pages 2998–
3008.
Zhang, Z., Zhang, X., Peng, C., Xue, X., and Sun, J. (2018).
Exfuse: Enhancing feature fusion for semantic seg-
mentation. In Proceedings of the European conference
on computer vision (ECCV), pages 269–284.
Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso,
A., and Torralba, A. (2019). Semantic understanding
of scenes through the ade20k dataset. International
Journal of Computer Vision, 127(3):302–321.