Applying Positional Encoding to Enhance Vision-Language Transformers
Xuehao Liu (https://orcid.org/0000-0001-9815-489X), Sarah Jane Delany (https://orcid.org/0000-0002-2062-7439) and Susan McKeever (https://orcid.org/0000-0003-1766-2441)
School of Computer Science, Technological University Dublin, Ireland
Keywords:
Image Captioning, Positional Encoding, Vision-Language Transformer.
Abstract:
Positional encoding is used in both natural language and computer vision transformers. It provides information
on the sequence order and relative position of input tokens (such as words in a sentence), improving performance.
Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional
encoding schemes to enrich the input information. We show that capturing the location of visual features can
help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA)
vision-language transformers, as an example transformer for implanting positional encoding, and use image
captioning as a downstream task to test performance. We added two types of positional encoding to Oscar:
DETR, an absolute positional encoding approach, and iRPE, a relative positional encoding approach. With the
same training protocol and data, both positional encodings improved the image captioning performance of
Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
1 INTRODUCTION
Transformer-based models have been widely adopted
in the fields of language and vision over the past five
years. There are two essential parts of a Transformer-
based model: the self-attention block and the posi-
tional encoding. The self-attention mechanism of the
transformer captures long-distance relationships between tokens more effectively than traditional Recurrent Neural Networks (RNNs). However, it is invariant to the ordering of the input tokens (Shaw et al., 2018): the same token (e.g. a word) appears identical to the self-attention mechanism regardless of where it occurs in the input sequence (e.g. a sentence). The consequence is that valuable relative positional information is not used. For example, “he genuinely needs to do that” and “he needs to do that genuinely” have different meanings. Positional encoding is therefore added to the input tokens as additional information, supplying the sequence order to the transformer. The vanilla transformer (Vaswani et al., 2017) adds sinusoidal signals at different frequencies to tokens at different locations. Similarly, for visual-input transformers, DETR (Carion et al., 2020) proposed 2d absolute positional encoding, which consists of two sinusoidal
signals in two dimensions, in order to provide loca-
tion information for object region features. Relative
positional encoding has recently been introduced in
other works (Shaw et al., 2018; Dai et al., 2019; Wu
et al., 2021; Chu et al., 2021) as an improvement to
the original absolute positional encoding.
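For concreteness, the following is a minimal sketch of the vanilla 1-d sinusoidal encoding described above, written in PyTorch; the function name and tensor layout are our own illustration rather than code taken from the cited works.

```python
import torch


def sinusoidal_encoding_1d(num_positions: int, d_model: int) -> torch.Tensor:
    """Vanilla 1-d sinusoidal positional encoding (Vaswani et al., 2017).

    Each position is described by d_model/2 sine/cosine pairs at
    geometrically spaced frequencies; the result is summed with the
    token embeddings.
    """
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    # Frequencies 1 / 10000^(2i/d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe


# Example: 50 token positions encoded into 512-dimensional vectors
pe = sinusoidal_encoding_1d(50, 512)
```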
In addition to vision-only and language-only
tasks, transformers are now used in tasks that involve
both modalities. Cross-modal transformers (Li et al.,
2021; Chen et al., 2020b; Zhou et al., 2020; Yu et al.,
2021; Li et al., 2020a) have achieved considerable success in a variety of downstream tasks, such as image captioning, by combining vision features and language token embeddings. The vision features are extracted from either Convolutional Neural Networks (CNNs) or object detectors. The transformer can have two self-attention blocks that take the two modalities separately (Zhou et al., 2020), or a single transformer encoder (Li et al., 2020b) for both kinds of input. Most research in this domain has focused on the challenge of aligning vision representations and word embeddings. As a multi-stream transformer, Meter (Dou et al., 2022) shares attention between the two modality attention blocks. mPLUG (Li et al., 2022) proposes an asymmetric co-attention block, which allows the text encoder to take visual attention from any attention layer. Another simple improvement is to use a larger pre-training dataset (Li et al., 2020b; Chen et al., 2020b; Zhang et al., 2021; Wang et al., 2022).
The visual representation input consists of the out-
put vector from an object detector or a CNN clas-
sifier. The feature vector is concatenated with the
height and width of the object bounding box (Li et al.,
2020b; Yu et al., 2021). However, we did not find
any previous study that refined the visual features
with positional encoding for vision-language trans-
formers. Our hypothesis is that the location of ob-
jects will contribute information, so including object
location information using positional encoding could
result in better models. To verify this, we implanted two typical positional encodings in a leading cross-modal transformer, Oscar (Li et al., 2020b): DETR, a 2d absolute positional encoding approach, and iRPE, the SOTA relative positional encoding approach. We found that simply adding DETR positional encoding to Mask r-CNN (He et al., 2017) features improved the performance of Oscar. Applying positional encoding to the query, key and value in the self-attention head gave further improvement.
We summarise our contributions as follows:
To the best of our knowledge, this is the first work to introduce positional encoding into vision-language pre-trained transformers. We build the visual feature vectors with two kinds of positional encoding.
With positional encoding, we demonstrate that, with the same amount of training data, Oscar reaches better image captioning performance than the original model. The Bleu4 score increased by 24.1% and the CIDEr score increased by 14.6%.
The improvement of Oscar indicates that adding
positional encoding into the vision-language
transformers can enhance the performance of
vision-language downstream tasks.
2 RELATED WORK
The relatively recent success of transformer models
is evident in their use as pre-trained models for vi-
sion and language tasks. While positional encoding
has shown good success in object detection and im-
age classification tasks, it has not been widely used in
vision-language pre-trained models. This section de-
scribes the main vision-language models and outlines
how positional encoding has been successfully used
to date.
Since the transformer structure was first intro-
duced (Vaswani et al., 2017), attention-based mod-
els have been the model of choice in both language
and vision areas. Pretrained language models such as
Figure 1: A comparison between the original Oscar (a) and Oscar with positional encoding (b): (a) an overview of the original Oscar structure, in which Faster r-CNN provides the object labels and visual features that are fed into Oscar together with the caption; (b) an overview of Oscar with positional encoding, in which Mask r-CNN provides the object labels, visual features and object mask, with the mask used to compute the positional encoding.
BERT (Devlin et al., 2018) and GPT (Radford et al.,
2018) leveraged the advantage of the attention mecha-
nism, with a better capability to model long-term relationships than RNN methods. Moreover, off-the-shelf attention models are able to process visual inputs. Image GPT (iGPT) (Chen et al., 2020a), Pyra-
mid ViT (Wang et al., 2021), Swin Transformer (Liu
et al., 2021), and DETR (Carion et al., 2020) take vi-
sual features as input and use attention models to do
object detection and image classification tasks.
Both multi-stream and single-stream transformers
(Khan et al., 2022) have been applied to image cap-
tioning. Following the intuition of taking visual fea-
tures as transformer input, ViLBERT (Zhou et al.,
2020) was designed with two parallel transformer
blocks as a co-attention framework, which takes vi-
sual feature vectors and language token embeddings
separately (Vaswani et al., 2017). The output of
different modality attention heads is then multiplied
across the different modalities. ViLBERT is consid-
ered a multi-stream transformer.
In contrast, single-stream transformers have one
transformer block for both visual and language in-
puts simultaneously. They take image region fea-
tures and captions as input, and multiply them across
the modalities at the first layer of the transformer
encoder. Unicoder-VL (Li et al., 2020a), UNITER
(Chen et al., 2020b), Oscar (Li et al., 2020b), and
OFA (Wang et al., 2022) are all classified as single-
stream transformers and have used a number of other
techniques to improve the performance of transformer
architectures. Unicoder-VL used three objectives in the pre-training process: Masked Language Modeling (MLM), which predicts a token based on the surrounding words and image features; Masked Object Classification (MOC), which includes zero-padding of the input region features; and Visual-Linguistic Matching, which considers whether the vision and language inputs are semantically similar.
UNITER, however, included the objective of Masked Region Feature Regression (MRFR) in pre-training, which adds a fully-connected layer on top of the transformer output and learns an L2 regression between the input region-of-interest features and the vector predicted by the transformer. Another general
strategy used is to pre-train on multiple datasets to
generalize the transformer further. UNITER is pre-
trained on COCO (Lin et al., 2014), Visual Genome
(VG) (Krishna et al., 2017), Conceptual Captions
(CC) (Sharma et al., 2018), and SBU Captions (Or-
donez et al., 2011). The large amount of image-
caption pairs across different datasets provides extra
generalization for the model to reach a better perfor-
mance on downstream tasks such as image caption-
ing.
Oscar innovatively changed the image-caption in-
put pair to a caption-tag-image pair. The tags are
English words obtained from the Faster r-CNN (Ren
et al., 2015) object detector. Oscar also pre-trained the
transformer with a larger group of datasets, including
Open Images (Kuznetsova et al., 2020) and Object365
(Shao et al., 2019). More recently, OFA used multi-modal and uni-modal data combined across more than 10 vi-
sion and language datasets, and trained across a wider
range of downstream tasks. OFA achieved a higher
performance in image captioning compared to other
pre-trained transformers.
Positional encoding was first introduced for lan-
guage transformers as a sinusoidal signal added be-
tween token embeddings and multi-head attention
blocks (Vaswani et al., 2017). However, in vision
transformers, a 2-d location cannot be captured by a 1-d sinusoidal signal, so the original positional encoding was extended to a 2-d encoding to cater for image features. All of these vision-language transformers use the original 1-d positional encoding; none of the transformers examined exploit positional encod-
ing as an extra visual input. In this paper we explore
adding positional encoding to the visual features.
Considering the absolute and relative position of a
visual object, there are two kinds of positional encod-
ing: Absolute PE and Relative PE:
Absolute PE adds the 2-d sinusoidal encoding di-
rectly to the image feature vector. ViT (Dosovitskiy et al., 2020) first applied both 1-d and 2-d positional encoding to image patch inputs, demonstrating that positional encoding of patches, even in simple raster order, can significantly improve performance. DETR (Carion et al., 2020) then introduced 2d absolute positional encoding into a vision transformer: for a positional encoding of length d, DETR uses d/2 sine and cosine functions computed at different frequencies. Following the structure of DETR, Deformable-DETR (Zhu et al., 2020) added a sparse prior to the attention head. For a query element, Deformable-DETR focuses on only a few elements selected by the sparse prior, which reduces the number of training epochs required to reach good performance.
Relative PE adds a weighted positional encoding inside the attention layers, based on relative distances between tokens. Relative
positional encoding was first proposed in (Shaw et al.,
2018). It is a weighted vector computed using the
query and key based on a clipped relative distance.
Transformer-XL (Dai et al., 2019) further improved
Shaw’s positional encoding by introducing a trainable
offset for the query and key weight. In a simpler de-
sign, Huang et al. (Huang et al., 2020) proposed to
subtract the relative position from the original abso-
lute positional encoding. Image Relative Positional
Encoding (iRPE) (Wu et al., 2021) showed that po-
sitional encoding can be added into the self-attention
module with a bias or contextual mode. It also in-
troduced the concept of adding positional encoding
to any of the query, key, and value. These works focus on adding positional encoding to word tokens for language inputs. For visual inputs, (Ramachandran et al., 2019) proposed replacing convolutions with a fully attentional layer, where the positional encoding added to the input is the 2-d relative distance to the central query pixel. CPVT (Chu et al., 2021) generates the positional encoding by applying a convolution to the original image feature.
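To make the clipping idea behind relative positional encoding concrete, below is a minimal sketch of a Shaw-style relative position embedding, in which each pair of positions indexes a learned embedding by its clipped relative distance; the class name and parameter defaults are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn


class RelativePositionBias(nn.Module):
    """Simplified Shaw-style relative positional encoding (Shaw et al., 2018)."""

    def __init__(self, d_head: int, max_distance: int = 16):
        super().__init__()
        self.max_distance = max_distance
        # One learned embedding per clipped distance in [-max_distance, max_distance]
        self.embeddings = nn.Embedding(2 * max_distance + 1, d_head)

    def forward(self, seq_len: int) -> torch.Tensor:
        positions = torch.arange(seq_len)
        rel = positions[None, :] - positions[:, None]            # (L, L) signed distances
        rel = rel.clamp(-self.max_distance, self.max_distance)   # the clipping step
        return self.embeddings(rel + self.max_distance)          # (L, L, d_head)


# The (L, L, d_head) tensor is typically contracted with the queries and
# added to the attention logits before the softmax.
bias = RelativePositionBias(d_head=64)(seq_len=10)
```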
3 APPROACH
To determine whether the performance of vision-
language transformers can benefit from positional en-
coding, we applied positional encoding to a SOTA transformer and compared multiple positional encoding approaches. We chose Oscar (Li et al.,
2020b) as the example transformer. Although Oscar has a relatively simple, typical single-stream transformer architecture, it achieves competitive performance. In this section, we first review the input structure and training objectives of Oscar, and then explain our approach to including positional encoding in Oscar.
3.1 Example Transformer: Oscar
Oscar is a pre-trained transformer for vision-language
downstream tasks such as text retrieval, image re-
trieval, image captioning, and visual question answer-
ing. We take image captioning as the target task for
our work. Similar to other vision-language transformers, Oscar takes two modalities as input: token embeddings and vision features, noting that the input structure can differ depending on the task for which Oscar is trained. We took the off-the-shelf Oscar model with its BERT self-attention backbone unchanged; the only modification is that positional encoding is added to the input for training, as an additional visual feature.
3.1.1 Input Structure for Training
The input to Oscar is a triple representing three as-
pects of an image: the caption, the tags, and the ob-
ject features. The caption is the word embedding se-
quence of the image caption. The tags are the English
words for the object labels. In the original Oscar the
object features are extracted from Faster r-CNN (Ren
et al., 2015). The three parts of the input are sepa-
rated by the special token [SEP], and the entire input
sequence starts with the class token [CLS]. In our approach, we use object labels (tags) and object features extracted from Mask r-CNN (He et al., 2017), since we use the object mask, rather than the bounding box location, to generate the positional encoding.
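The sketch below illustrates how such a triple can be assembled into a single input sequence; the dimensions, projection layer and helper name are our own assumptions for illustration and do not reproduce the released Oscar code.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768        # BERT hidden size (illustrative)
REGION_FEAT_DIM = 2048  # detector feature dimension (illustrative)


def build_input_triple(word_embedding: nn.Embedding,
                       region_projection: nn.Linear,
                       cls_id: int, sep_id: int,
                       caption_ids: torch.Tensor,
                       tag_ids: torch.Tensor,
                       region_features: torch.Tensor) -> torch.Tensor:
    """Assemble [CLS] caption [SEP] tags [SEP] followed by region features.

    Text token ids are embedded with the word embedding table; region
    features (num_regions, REGION_FEAT_DIM) are projected into the same
    hidden space and concatenated along the sequence dimension.
    """
    token_ids = torch.cat([
        torch.tensor([cls_id]), caption_ids,
        torch.tensor([sep_id]), tag_ids,
        torch.tensor([sep_id]),
    ])
    text_embeds = word_embedding(token_ids)              # (num_text_tokens, HIDDEN_DIM)
    region_embeds = region_projection(region_features)   # (num_regions, HIDDEN_DIM)
    return torch.cat([text_embeds, region_embeds], dim=0)


# Usage with randomly initialised modules, purely for illustration
embed = nn.Embedding(30522, HIDDEN_DIM)
proj = nn.Linear(REGION_FEAT_DIM, HIDDEN_DIM)
sequence = build_input_triple(embed, proj, cls_id=101, sep_id=102,
                              caption_ids=torch.randint(0, 30522, (12,)),
                              tag_ids=torch.randint(0, 30522, (5,)),
                              region_features=torch.randn(10, REGION_FEAT_DIM))
```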
3.1.2 Pre-Training Objective
We follow the same loss objectives as Oscar (Li et al., 2020b) and BERT (Devlin et al., 2018). The losses are computed as (i) a Contrastive Loss, verifying the two modalities of the input, the visual part (tags and object features) and the language part (caption); and (ii) a Masked Token Loss (MTL), predicting the masked tokens. Both are described below, followed by an illustrative sketch.
Contrastive Loss: The contrastive loss is from
the perspective of the modalities: the model should be able to recognize whether the visual modality is paired with the language modality. In the transformer, the special token [CLS] is the representation of the vision-language input. Similar to Oscar, we generate false input triples for 50% of the samples by randomly replacing the visual part with one from elsewhere in the dataset. A fully-connected layer on the [CLS] embedding then predicts whether the input is a triple from a real image or a false input triple.
Masked Token Loss (MTL): As in language-only masked modelling, the model should be able to recover a missing token given the surrounding tokens; here the context is a combination of language and vision. For each input sequence, we randomly mask 15% of the English word tokens with the special token [MASK] and predict each masked token.
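A minimal sketch of the two pre-training losses is shown below; the head layers, tensor layout and variable names are illustrative assumptions rather than the released Oscar implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, vocab_size = 768, 30522                # illustrative sizes
contrastive_head = nn.Linear(hidden_dim, 2)        # real vs. corrupted triple
mlm_head = nn.Linear(hidden_dim, vocab_size)       # masked-token prediction


def pretraining_losses(sequence_output: torch.Tensor,
                       is_corrupted: torch.Tensor,
                       masked_positions: torch.Tensor,
                       masked_token_ids: torch.Tensor) -> torch.Tensor:
    """Contrastive loss on the [CLS] state plus Masked Token Loss.

    sequence_output: (batch, seq_len, hidden_dim) transformer outputs,
        with the [CLS] token at position 0.
    is_corrupted: (batch,) integer labels marking the ~50% of triples
        whose visual part was randomly replaced.
    masked_positions / masked_token_ids: flat indices and gold ids of the
        ~15% masked text tokens.
    """
    cls_state = sequence_output[:, 0]                        # (batch, hidden_dim)
    contrastive = F.cross_entropy(contrastive_head(cls_state), is_corrupted)

    flat = sequence_output.reshape(-1, sequence_output.size(-1))
    masked_states = flat[masked_positions]                   # (num_masked, hidden_dim)
    mtl = F.cross_entropy(mlm_head(masked_states), masked_token_ids)
    return contrastive + mtl
```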
3.1.3 Image Captioning Finetuning
After pre-training, Oscar has built the object-semantic mapping between objects and English tokens. The next step is to fine-tune Oscar for the downstream task of image captioning.
There are two steps to image captioning finetuning:
captioning pre-training and caption generation train-
ing. The loss objectives used are the seq2seq objec-
tives of image captioning used in the original Oscar.
Captioning Pre-training: The input of caption-
ing pre-training is the same as the input structure
of Section 3.1.1, and the loss objective is the MTL loss of Section 3.1.2: 15% of the input tokens are masked and the model predicts the corresponding missing tokens. Tokens in the caption part can attend to both the object labels and the features, but they cannot attend to tokens after the current token.
Caption Generation Training: The input for image captioning consists of the object labels and feature vectors, rather than the full triples used in pre-training. The goal is to infer the first part of the triple, the caption. The model first takes the special token [CLS], the object labels and the object feature vectors. Generation then starts with the model predicting a sampled token based on this input; the sampled token and a [MASK] token form the input for the next round of word prediction, and the whole inference process stops when the [EOS] token is predicted (a sketch of this loop follows below). Following Oscar, the objective of caption generation is SCST (Self-Critical Sequence Training) (Rennie et al., 2017), where the inference process is treated as a Reinforcement Learning process and the reward is based on the CIDEr (Vedantam et al., 2015) score against a random baseline.
Having examined Oscar’s structure and down-
stream task training process, we move next to the po-
sitional encoding we have selected to apply to Os-
car. We implemented two positional encoding ap-
proaches: The first approach we consider is DETR
Figure 2: Illustration of Oscar with DETR/iRPE positional encoding. For both DETR and iRPE, the positional encoding is added to the region feature vector before it is pushed into Oscar; the only difference is that iRPE has one additional multi-head attention block, in which the iRPE positional encoding can be added in bias or contextual mode.
(Carion et al., 2020), a 2d absolute positional encoding that uses the original sinusoidal encoding (Vaswani
et al., 2017). The performance of DETR positional
encoding will determine if a 2d positional encoding
applied to visual features will help the model. We
then consider iRPE, a relative positional encoding that
is more complex and achieves better performance im-
provements on transformer tasks than DETR. iRPE
(Wu et al., 2021) was proposed as positional encoding
for vision tasks only, for image classification and for
object detection. In this paper we propose using it for
image captioning, a vision-language task for which it
has not been used before.
3.2 Oscar with DETR Positional
Encoding
The original DETR is a vision-only transformer for
object detection. The input is the feature vector ex-
tracted by ResNet50 (He et al., 2016). The positional
encoding is calculated directly from the image fea-
ture. In our approach, we use the 2-d mask generated by Mask r-CNN (He et al., 2017). For the 2d coordinates, d/2 sine and cosine functions at different frequencies are applied to the mask and then concatenated into a d-dimensional positional encoding, which is added directly to the object feature vector before building the input triples. Figure 2 illustrates the structure of Oscar with the positional encoding added.
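A sketch of this step is given below, following the style of DETR's sinusoidal position embedding; pooling the 2-d encoding over the object's mask pixels is one plausible reading of the step described above, and the function names and defaults are our own.

```python
import torch


def sine_embedding_2d(mask: torch.Tensor, d_model: int = 256,
                      temperature: float = 10000.0) -> torch.Tensor:
    """2-d sinusoidal encoding in the style of DETR's PositionEmbeddingSine.

    `mask` is an (H, W) binary tensor. Each spatial axis receives
    d_model/2 channels of interleaved sine/cosine at different
    frequencies; the two axes are concatenated to d_model channels.
    """
    num_feats = d_model // 2
    y_embed = mask.cumsum(0, dtype=torch.float32)   # row position within the mask
    x_embed = mask.cumsum(1, dtype=torch.float32)   # column position within the mask
    dim_t = temperature ** (2 * (torch.arange(num_feats) // 2) / num_feats)

    pos_x = x_embed[:, :, None] / dim_t
    pos_y = y_embed[:, :, None] / dim_t
    pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()), dim=3).flatten(2)
    pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, :, 1::2].cos()), dim=3).flatten(2)
    return torch.cat((pos_y, pos_x), dim=2)          # (H, W, d_model)


def object_positional_encoding(object_mask: torch.Tensor,
                               d_model: int = 2048) -> torch.Tensor:
    """Pool the 2-d encoding over the object's mask pixels, giving a single
    d_model vector that can be added to the object feature vector."""
    pos_map = sine_embedding_2d(object_mask, d_model)
    return pos_map[object_mask.bool()].mean(dim=0)
```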
3.3 Oscar with iRPE Positional
Encoding
iRPE positional encoding is calculated based on the
relative distance between objects. The positional encoding uses a piecewise mapping function from the actual distance to a clipped distance, in order to save computational cost. iRPE can be added to the query, key or value in either contextual or bias mode. Before the first layer of the encoder, the positional encoding is added to the object feature vector as the input to the iRPE self-attention block; this attention head is shown in Figure 2. The input to Oscar is then the English token embeddings, including both captions and object tags, combined with the output of the iRPE self-attention block.
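Below is a sketch of the piecewise distance-to-index mapping that underlies iRPE, re-implemented from our reading of the iRPE paper; the parameter defaults are illustrative. In bias mode the resulting index selects a learned scalar added to the attention logits, while in contextual mode it selects a vector that interacts with the query, key or value content.

```python
import math


def irpe_bucket(distance: int, alpha: int = 8, beta: int = 16, gamma: int = 32) -> int:
    """Piecewise mapping from a relative distance to an encoding index.

    Small distances (|x| <= alpha) keep their exact index; larger distances
    are compressed logarithmically and clipped at beta, which bounds the
    number of learned relative encodings and the computational cost.
    """
    if abs(distance) <= alpha:
        return distance
    compressed = alpha + math.log(abs(distance) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return int(math.copysign(min(beta, round(compressed)), distance))


# Nearby objects get distinct indices; distant ones share clipped indices.
print([irpe_bucket(d) for d in (1, 4, 12, 40, 200)])   # e.g. [1, 4, 10, 16, 16]
```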
4 EVALUATION & RESULTS
Our aim was to evaluate the impact of each of the two
positional encoding schemes DETR and iRPE on the
performance of Oscar, a vision-language transformer,
for the task of image captioning.
Due to limited GPU training resources, our aim is simply to establish an implementation of Oscar as a suitable baseline, and then to add the positional encoding approaches to show that they improve on the baseline performance.
Similar to the original work that proposed Oscar
(Li et al., 2020b), both pre-training and image cap-
tioning training are on the COCO dataset (Chen et al.,
2015). All approaches are evaluated on the COCO
validation set of 5K images.
All of the models are pre-trained for 5 epochs
with a batch size of 256. Both the image caption-
ing pre-training and finetuning are then conducted
for 10 epochs. The original Oscar weights are downloaded from the original model zoo (Li et al., 2020b). For iRPE positional encoding, we used the product method and contextual mode, which was shown to be the best choice in the original iRPE paper (Wu et al., 2021).
We measured performance using the same metrics as originally used to evaluate Oscar. These include the following (a small scoring example follows the list):
Bleu: Bleu (Papineni et al., 2002) is a common
metric for machine translation. It measures the co-occurrence of n-grams between the ground truth and the predicted sentence. The Bleu4 score that we use compares the 4-gram precision between the caption generated by the model and the ground truth.
METEOR: METEOR (Denkowski and Lavie,
2014) is also a score focusing on the co-occurrence of word chunks, where a word chunk is a sequence of n-grams. The length of word chunks is not limited by the length of the n-grams, and the measure penalises fragmented chunks.
CIDEr: For each n-gram in both the reference and the predicted sentence, a term frequency-inverse document frequency (TF-IDF) weight is calculated. The cosine similarity between the weighted sentence representations gives the final CIDEr (Vedantam et al., 2015) score.
Spice: Spice (Anderson et al., 2016) parses a sentence into a directed graph, which is further deconstructed into tuples of words. The score is the F1 score on tuple matches between the predicted and ground truth sentences.
Rouge L: Rouge L (Lin, 2004) is a widely used metric for text summarisation. Given the Longest Common Subsequence (LCS) between two sentences, Rouge L is calculated as an F-measure based on the LCS.
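As an illustration, such caption metrics are commonly computed with the pycocoevalcap package; the snippet below assumes that package is installed and uses toy captions, so it is a usage sketch rather than our evaluation code.

```python
# Assumes the pycocoevalcap package is installed (pip install pycocoevalcap).
# Both inputs map an image id to a list of caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

references = {"img1": ["a dog runs across the grass", "a dog running on grass"]}
predictions = {"img1": ["a dog is running on the grass"]}

bleu_scores, _ = Bleu(4).compute_score(references, predictions)   # Bleu1..Bleu4
cider_score, _ = Cider().compute_score(references, predictions)
rouge_score, _ = Rouge().compute_score(references, predictions)
# Note: CIDEr is a corpus-level metric; values are only meaningful over a
# full evaluation set, not a single toy example.
print(f"Bleu4={bleu_scores[3]:.3f}  CIDEr={cider_score:.3f}  RougeL={rouge_score:.3f}")
```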
5 RESULTS AND DISCUSSION
Table 1 reports the image captioning performance
for our scenarios: baseline Oscar and the addition
of the two positional encoding approaches, DETR
and iRPE. The original Oscar implanted with posi-
tional encoding from DETR is labelled Oscar+DETR.
With iRPE relative positional encoding, we explored
adding it in a number of ways as iRPE positional en-
coding can be added to any of the query, key or value.
Oscar+iRPE (Q) means that Oscar uses positional encoding applied only to the query. Similarly, Oscar+iRPE (QK) and Oscar+iRPE (QKV) mean that the positional encoding is applied to the query and key, and to the query, key and value, respectively.
Adding positional encoding improves on the base-
line Oscar in all cases, across all five metrics. While
better performance than the baseline is achieved with DETR, Oscar+iRPE (QKV) has the highest score on four of the five criteria, all except Bleu4. In general, iRPE significantly outperforms DETR. Oscar+iRPE (Q), with iRPE applied to the query only, is the only iRPE configuration that does not outperform the less complex DETR positional encoding.
The improvement of adding positional encoding
to the Oscar baseline is significant across all 5 evalu-
ation metrics. The simpler absolute positioning ap-
proach of DETR (2d sinusoidal signals at different
frequencies) is outperformed by the relation position-
ing approach in iRPE. Image captioning using iRPE
improves by up to 24.1% when measured with Bleu4
and up to 14.6% when measured with CIDEr. Whilst
our results are demonstrated for image captioning,
our results suggest that improved positional encod-
ing enriches the knowledge available to the model.
This holds promise for improvements in other vision-
language application areas such as visual question an-
swering and image retrieval.
Our results are shown against a baseline im-
plementation of Oscar. Future work can examine whether scaling up to the hundreds of training epochs and the enlarged training sets used in Oscar's original implementation (Li et al., 2020b) affects the positional encoding results. Other vision-language tasks can also be examined with positional encoding to investigate its impact. We implemented two common positional encoding schemes, but there are many further choices whose performance impact could be examined.
6 CONCLUSION
In this paper we added two types of positional en-
coding into a SOTA vision-language transformer, Os-
car. Positional encoding provides additional loca-
tion information that has been shown to improve
performance in vision-only transformers on vision-
only tasks such as object detection and image clas-
sification. In this paper we have shown that po-
sitional encoding can significantly improve perfor-
Table 1: Image captioning performance on the COCO validation set. The percentage in parentheses is the improvement
compared to the Oscar (baseline) performance.
Bleu4 METEOR CIDEr Spice Rouge L
Oscar (baseline) 0.277 0.249 99.6 0.184 0.528
Oscar+DETR 0.318 (14.8%) 0.257 (3%) 109.6 (10%) 0.189 (2.7%) 0.546 (3.4%)
Oscar+iRPE (Q) 0.316 (14.0%) 0.252 (1.2%) 103.9 (4.3%) 0.188 (2.1%) 0.547 (3.5%)
Oscar+iRPE (QK) 0.344 (24.1%) 0.265 (6.4%) 112.4 (12.8%) 0.197 (7%) 0.560 (6%)
Oscar+iRPE (QKV) 0.342 (23.4%) 0.269 (8%) 114.2 (14.6%) 0.200 (8.6%) 0.564 (6.8%)
mance in vision-language transformers, on the task
of image captioning. While absolute positional encoding (implemented using the DETR approach) improved performance, relative positional encoding (using iRPE) provided a significantly greater benefit.
We compared the image captioning performance of Oscar with different kinds of positional encoding. Using the same training data, the experimental results show that positional encoding improved image captioning performance. Training Oscar with more data and more epochs could further validate these findings. In addition, implanting other positional encoding approaches in different vision-language transformers is a promising direction for future work.
REFERENCES
Anderson, P., Fernando, B., Johnson, M., and Gould, S.
(2016). Spice: Semantic propositional image caption
evaluation. In European conference on computer vi-
sion, pages 382–398. Springer.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan,
D., and Sutskever, I. (2020a). Generative pretraining
from pixels. In International conference on machine
learning, pages 1691–1703. PMLR.
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S.,
Dollár, P., and Zitnick, C. L. (2015). Microsoft coco
captions: Data collection and evaluation server. arXiv
preprint arXiv:1504.00325.
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan,
Z., Cheng, Y., and Liu, J. (2020b). Uniter: Universal
image-text representation learning. In European con-
ference on computer vision, pages 104–120. Springer.
Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia,
H., and Shen, C. (2021). Conditional positional
encodings for vision transformers. arXiv preprint
arXiv:2102.10882.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and
Salakhutdinov, R. (2019). Transformer-xl: Attentive
language models beyond a fixed-length context. arXiv
preprint arXiv:1901.02860.
Denkowski, M. and Lavie, A. (2014). Meteor universal:
Language specific translation evaluation for any target
language. In Proceedings of the ninth workshop on
statistical machine translation, pages 376–380.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Dou, Z.-Y., Xu, Y., Gan, Z., Wang, J., Wang, S., Wang,
L., Zhu, C., Zhang, P., Yuan, L., Peng, N., et al.
(2022). An empirical study of training end-to-end
vision-and-language transformers. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 18166–18176.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In Proceedings of the IEEE international
conference on computer vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Huang, Z., Liang, D., Xu, P., and Xiang, B. (2020). Im-
prove transformer models with better relative position
embeddings. arXiv preprint arXiv:2009.13658.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S.,
and Shah, M. (2022). Transformers in vision: A sur-
vey. ACM computing surveys (CSUR), 54(10s):1–41.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-
J., Shamma, D. A., et al. (2017). Visual genome:
Connecting language and vision using crowdsourced
dense image annotations. International journal of
computer vision, 123(1):32–73.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin,
I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci,
M., Kolesnikov, A., et al. (2020). The open images
dataset v4. International Journal of Computer Vision,
128(7):1956–1981.
Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B.,
Ye, J., Chen, H., Xu, G., Cao, Z., et al. (2022).
mplug: Effective and efficient vision-language learn-
ing by cross-modal skip-connections. arXiv preprint
arXiv:2205.12005.
Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. (2020a).
Unicoder-vl: A universal encoder for vision and lan-
guage by cross-modal pre-training. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 34, pages 11336–11344.
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., and
Hoi, S. C. H. (2021). Align before fuse: Vision and
language representation learning with momentum dis-
tillation. Advances in neural information processing
systems, 34:9694–9705.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang,
L., Hu, H., Dong, L., Wei, F., et al. (2020b). Os-
car: Object-semantics aligned pre-training for vision-
language tasks. In European Conference on Computer
Vision, pages 121–137. Springer.
Lin, C.-Y. (2004). Rouge: A package for automatic evalu-
ation of summaries. In Text summarization branches
out, pages 74–81.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 10012–10022.
Ordonez, V., Kulkarni, G., and Berg, T. (2011). Im2text:
Describing images using 1 million captioned pho-
tographs. Advances in neural information processing
systems, 24.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.,
et al. (2018). Improving language understanding by
generative pre-training.
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Lev-
skaya, A., and Shlens, J. (2019). Stand-alone self-
attention in vision models. Advances in Neural Infor-
mation Processing Systems, 32.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel,
V. (2017). Self-critical sequence training for image
captioning. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7008–
7024.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X.,
Li, J., and Sun, J. (2019). Objects365: A large-
scale, high-quality dataset for object detection. In Pro-
ceedings of the IEEE/CVF international conference
on computer vision, pages 8430–8439.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018).
Conceptual captions: A cleaned, hypernymed, image
alt-text dataset for automatic image captioning. In
Proceedings of the 56th Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long
Papers), pages 2556–2565.
Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-
attention with relative position representations. arXiv
preprint arXiv:1803.02155.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015).
Cider: Consensus-based image description evalua-
tion. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 4566–
4575.
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J.,
Zhou, C., Zhou, J., and Yang, H. (2022). Ofa: Unify-
ing architectures, tasks, and modalities through a sim-
ple sequence-to-sequence learning framework. In In-
ternational Conference on Machine Learning, pages
23318–23340. PMLR.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D.,
Lu, T., Luo, P., and Shao, L. (2021). Pyramid vi-
sion transformer: A versatile backbone for dense pre-
diction without convolutions. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 568–578.
Wu, K., Peng, H., Chen, M., Fu, J., and Chao, H. (2021).
Rethinking and improving relative position encod-
ing for vision transformer. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 10033–10041.
Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., and
Wang, H. (2021). Ernie-vil: Knowledge enhanced
vision-language representations through scene graphs.
In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 35, pages 3208–3216.
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L.,
Choi, Y., and Gao, J. (2021). Vinvl: Revisiting visual
representations in vision-language models. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5579–5588.
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., and
Gao, J. (2020). Unified vision-language pre-training
for image captioning and vqa. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 34, pages 13041–13049.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J.
(2020). Deformable detr: Deformable transform-
ers for end-to-end object detection. arXiv preprint
arXiv:2010.04159.