Self-Modularized Transformer:
Learn to Modularize Networks for Systematic Generalization
Yuichi Kamata¹, Moyuru Yamada¹ and Takayuki Okatani²
¹Fujitsu Ltd., Kawasaki, Kanagawa, Japan
²Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Keywords:
Neural Module Network, Systematic Generalization, Visual Question Answering.
Abstract:
Visual Question Answering (VQA) is a task of answering questions about images that fundamentally requires systematic generalization capabilities, i.e., handling novel combinations of known visual attributes (e.g., color and shape) or visual sub-tasks (e.g., FILTER and COUNT). Recent research reports that Neural Module Networks (NMNs), which compose modules that tackle sub-tasks according to a given layout, are a promising approach for systematic generalization in VQA. However, their performance heavily relies on human-designed sub-tasks and their layout. Despite being crucial for training, most datasets do not contain these annotations. To overcome this important limitation of NMNs, we propose the Self-Modularized Transformer (SMT), a novel Transformer-based NMN that concurrently learns to decompose the question into sub-tasks and to compose modules without such annotations. SMT outperforms state-of-the-art NMNs and multi-modal Transformers on systematic generalization to novel combinations of sub-tasks in VQA.
1 INTRODUCTION
Recent studies suggest that systematic generaliza-
tion remains challenging for state-of-the-art neural
network models. Systematic generalization is the
ability to generalize novel compositions of known
concepts beyond the training distribution (Lake and
Baroni, 2018; Bahdanau et al., 2019; Ruis et al.,
2020). Even models that are successful in-distribution, e.g., the Transformer (Vaswani et al., 2017), degrade considerably under systematic generalization (Yamada et al., 2022; Bergen et al., 2021).
Visual Question Answering (VQA) (Antol et al.,
2015) is the task of answering questions about im-
ages. The core of VQA is complex visual reasoning,
a composition of sub-tasks, e.g., FIND, FILTER, and
COUNT. This compositional structure yields a distribution of image-question pairs of combinatorial size, which training data cannot fully capture. Thus, VQA fundamentally requires systematic generalization capabilities.
Neural Module Networks (NMNs) show promis-
ing performance for systematic generalization in
VQA (Johnson et al., 2017a; Bahdanau et al., 2020;
D’Amario et al., 2021). NMNs decompose a ques-
tion in VQA into sub-tasks, and each sub-task is tack-
led with a shallow neural network called a module.
NMNs take a sequence of sub-tasks, i.e., a layout of the sub-tasks, as input instead of the question, and thereby alleviate the gap between in-distribution generalization and systematic generalization.
The performance of NMNs, however, signifi-
cantly depends on the human-designed sub-tasks and
their layouts (equivalently, programs), i.e., how to de-
sign a library of modules that covers all questions in
a target dataset and how to compose them to pro-
vide correct answers to the questions. These annotations are essential for training NMNs but are not included in most datasets, such as VQA v2.0 (Goyal et al., 2017). A program generator can also be used to convert questions into programs at test time (Johnson et al., 2017b), but program–question pairs are needed to train it. Due to these crucial limitations, NMNs cannot bring their excellent systematic generalization capabilities to datasets that do not contain programs.
In this paper, to eliminate the above limitations of
NMNs, we propose a novel Transformer-based NMN
called the Self-Modularized Transformer (SMT) that
simultaneously learns to decompose the question into
the sub-tasks and compose modules without the pro-
gram. This is a significant challenge because the network must determine whether the modules, their composition, or both are incorrect based only on the error between the predicted and actual answer. Two losses are introduced to facilitate this learning. SMT outperforms the state-
of-the-art NMNs and multi-modal Transformers for
the systematic generalization to the novel combina-
tions of the sub-tasks in VQA.
2 RELATED WORK
Neural Module Networks (NMNs) (Andreas et al.,
2016b) are commonly used to solve complex vi-
sual reasoning tasks such as CLEVR (Johnson et al.,
2017a). NMNs decompose a question into a sequence
of sub-tasks (i.e., program). With ground truth (GT)
programs NMNs have solved CLEVR perfectly (Yi
et al., 2018; Shi et al., 2019). Program generator (An-
dreas et al., 2016a; Yi et al., 2018; Akula et al., 2021),
which infers a program from a question, is proposed
to apply the NMNs to the questions for which the GT
programs are not provided. However, the question
and GT program pairs are required to train the pro-
gram generator.
Stack-NMN (Hu et al., 2018) predicts applicability weights for each module and, similar to our approach, learns soft weights to select modules. While Stack-NMN implements modules specialized to the sub-tasks defined in CLEVR, we use larger modules consisting of Transformer blocks so that the sub-tasks can be acquired flexibly through training without pre-defining them.
Multi-modal Transformers (Lu et al., 2019; Kamath et al., 2021; Tan and Bansal, 2019) are recent successful models on vision-and-language tasks (Goyal et al., 2017; Hudson and Manning, 2019; Yu et al., 2016), aided by pretraining on large amounts of data (Krishna et al., 2017; Sharma et al., 2018), and have been reported to achieve high accuracy on CLEVR without using any GT programs (Kamath et al., 2021). The Transformer architecture has also been adopted for the program generator of NMNs (Chen et al., 2021). We use Transformer blocks to build the modules for each sub-task.
The CLOSURE dataset (Bahdanau et al., 2020) provides novel combinations of the sub-tasks in the CLEVR dataset. The authors of CLOSURE also proposed Vector-NMN, which outperforms all previous NMNs and achieves promising performance for systematic generalization with GT programs. Recently, several NMNs that use GT programs (Bahdanau et al., 2020; Yamada et al., 2022) or a program generator (Akula et al., 2021) have improved performance on CLOSURE. However, unlike our approach, they still rely on GT programs to train the model or the program generator.
3 APPROACH
In this section, we introduce the Self-Modularized Transformer (SMT), which consists of a set of Transformer modules and a controller network that selects the modules for the sub-tasks at each layer.
3.1 Architecture
Figure 1 depicts an overview of SMT, which is composed of Transformer blocks as modules. We design this architecture based on preliminary experimental results (see Sec. 4.5). Conventional $L$-layer Transformer models stack $L$ Transformer blocks sequentially, whereas our $M$-module SMT arranges all $M$ Transformer blocks in parallel and reuses that set of modules to stack $L$ layers (i.e., every Transformer block performs its sub-task $L$ times). The Transformer modules at the $l$-th layer receive the output of the previous layer $S^{l-1}$ as input, and the layer output is a weighted summation of the module outputs $S^l_m$:

$$S^l_m = \mathrm{Transformer}_m(S^{l-1}), \qquad S^l = \sum_{m=1}^{M} w^l_m \cdot S^l_m, \tag{1}$$

where $w^l_m$ is the weight for the output of the $m$-th module at the $l$-th layer.
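To make the layer structure concrete, the following is a minimal PyTorch sketch of Eq. (1), not the authors' implementation: it assumes standard nn.TransformerEncoderLayer blocks as the modules, the hidden size, head count, and module count are illustrative placeholders, and the weights w are assumed to come from the controller of Eq. (5).

```python
import torch
import torch.nn as nn

class ParallelModuleLayer(nn.Module):
    """One SMT-style layer: M Transformer modules applied in parallel to S^{l-1},
    combined by a weighted sum as in Eq. (1). The weights w are supplied
    externally (e.g., by the controller of Eq. (5))."""

    def __init__(self, num_modules: int = 6, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(num_modules)]
        )

    def forward(self, s_prev: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # s_prev: (batch, seq_len, d_model); w: (batch, M) module-selection weights.
        outputs = torch.stack([block(s_prev) for block in self.blocks], dim=1)  # (B, M, T, D)
        return (w[:, :, None, None] * outputs).sum(dim=1)                       # S^l: (B, T, D)

# Toy usage: 6 modules, batch of 2, sequence of 10 tokens.
layer = ParallelModuleLayer()
s = torch.randn(2, 10, 256)
w = torch.softmax(torch.randn(2, 6), dim=-1)
print(layer(s, w).shape)  # torch.Size([2, 10, 256])
```

Stacking $L$ such layers that share the same set of modules gives the parallel-then-repeat arrangement described above.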
Following the approach of conventional vision-and-language Transformers (Lu et al., 2019; Li et al., 2020), the image features $f = \{f^{(1)}, \cdots, f^{(O)}\}$ and regional positions $r = \{r^{(1)}, \cdots, r^{(O)}\}$ of the $O$ objects extracted by the object detector are embedded by the linear layers $\mathrm{FC}_I$ and $\mathrm{FC}_R$, respectively, and summed to create a sequence of object features with layer normalization:

$$V = \mathrm{LayerNorm}(\mathrm{FC}_I(f) + \mathrm{FC}_R(r)). \tag{2}$$
Also, a sequence of question embeddings $Q$ is created by converting the $W$ tokens of the question sentence $q = \{q^{(1)}, \cdots, q^{(W)}\}$ with the embedding layer $E_w$ and adding positional embeddings. In addition, the question sentence $q$ is fed into a dependency parser to obtain a dependency tag and a dependency head for each word in the sentence, as shown in Figure 2. The dependency tags are converted to embeddings $T = \{T^{(1)}, \cdots, T^{(W)}\}$ and added to the question embeddings $Q$ to obtain the extended question embeddings:

$$Q^{ex} = \mathrm{LayerNorm}(Q + T). \tag{3}$$
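A small sketch of how Eqs. (2) and (3) could be implemented, assuming 2048-dimensional detector features, 4-dimensional box coordinates as the regional positions, and illustrative vocabulary, tag-set, and hidden sizes; these dimensions are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Sketch of Eqs. (2)-(3): object features and regional positions are projected
    by FC_I / FC_R and summed into V; question token embeddings, positional
    embeddings, and dependency-tag embeddings are summed into Q_ex.
    All sizes below are illustrative placeholders."""

    def __init__(self, d_model=256, feat_dim=2048, pos_dim=4,
                 vocab_size=30522, num_tags=50, max_len=64):
        super().__init__()
        self.fc_i = nn.Linear(feat_dim, d_model)           # FC_I
        self.fc_r = nn.Linear(pos_dim, d_model)            # FC_R
        self.norm_v = nn.LayerNorm(d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # E_w
        self.pos_emb = nn.Embedding(max_len, d_model)      # positional embeddings
        self.tag_emb = nn.Embedding(num_tags, d_model)     # dependency-tag embeddings T
        self.norm_q = nn.LayerNorm(d_model)

    def forward(self, f, r, q_ids, tag_ids):
        v = self.norm_v(self.fc_i(f) + self.fc_r(r))                    # Eq. (2)
        positions = torch.arange(q_ids.size(1), device=q_ids.device)
        q = self.tok_emb(q_ids) + self.pos_emb(positions)[None]
        q_ex = self.norm_q(q + self.tag_emb(tag_ids))                   # Eq. (3)
        return v, q_ex

emb = InputEmbeddings()
v, q_ex = emb(torch.randn(2, 36, 2048), torch.rand(2, 36, 4),
              torch.randint(0, 30522, (2, 12)), torch.randint(0, 50, (2, 12)))
print(v.shape, q_ex.shape)  # torch.Size([2, 36, 256]) torch.Size([2, 12, 256])
```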
The dependency heads are transformed into a matrix $A^{head}$ that defines attention masks for the Transformer modules such that only a single phrase of the dependency parse is presented sequentially to the modules of each layer.
Figure 1: Overview of Self-Modularized Transformer (SMT). It takes an image and a question as the input and outputs the answer by selecting one from predefined candidates. All Transformer modules are aligned in parallel, and that set of modules is stacked in $L$ layers repeatedly. A dependency parser divides the question into phrases for the modules. Selections of the modules are determined by a controller network ($\mathrm{MLP}_{ctrl}$).
For example, for the question sentence $q = \{$There is another cube that $\cdots\}$, the attention mask $[1\ 1\ 0\ 1\ 0\ \cdots]$ at layer 1 indicates the phrase "There is cube" as hard attention over the words in the question. To keep the model size (i.e., the number of layers) the same across the whole dataset, every row of the dependency head matrix $A^{head}$ beyond the number of phrases is padded with the attention mask over the language tokens set to zero.
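The exact construction of $A^{head}$ is not fully specified above; the sketch below shows one plausible reading, assuming that a phrase consists of a head word together with its immediate dependents in the spaCy parse and that layers beyond the number of phrases receive all-zero masks.

```python
import numpy as np
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def dependency_head_masks(question: str, num_layers: int = 22) -> np.ndarray:
    """One plausible construction of A_head: one binary attention mask per layer,
    each covering a single phrase (here: a head word plus its direct dependents).
    Rows beyond the number of phrases are left all-zero, i.e., zero-padded."""
    doc = nlp(question)
    phrases = []
    for token in doc:
        children = list(token.children)
        if children:  # treat a head word and its immediate dependents as one phrase
            phrases.append([token.i] + [child.i for child in children])
    masks = np.zeros((num_layers, len(doc)), dtype=np.int64)
    for layer, indices in enumerate(phrases[:num_layers]):
        masks[layer, indices] = 1
    return masks

m = dependency_head_masks("There is another cube that is the same size "
                          "as the brown cube; what is its color?")
print(m[0])  # a hard-attention mask over one phrase (exact grouping depends on the parse)
```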
We add embeddings of special tokens to the head and tail of the object embeddings $V$ and the extended question embeddings $Q^{ex}$ (i.e., the tokens $\langle\mathrm{BOS}\rangle$, $\langle\mathrm{EOS}\rangle$, $\langle\mathrm{BOI}\rangle$, and $\langle\mathrm{EOI}\rangle$). Finally, the initial input for the first layer is

$$S^0 = \{E_w(\langle\mathrm{BOS}\rangle) \oplus Q^{ex} \oplus E_w(\langle\mathrm{EOS}\rangle) \oplus E_w(\langle\mathrm{BOI}\rangle) \oplus V \oplus E_w(\langle\mathrm{EOI}\rangle)\}, \tag{4}$$

where $\oplus$ denotes a concatenation operation and $E_w$ indicates the embedding layer for question tokens.
The transformed first tokens (i.e., at the position of $\langle\mathrm{BOS}\rangle$) obtained from each module are fed into a controller network $\mathrm{MLP}_{ctrl}$ to predict the weights for module selection. The controller network is a multi-layer perceptron (MLP). The weights of the modules at the $l$-th layer are

$$\{w^l_1, \cdots, w^l_M\} = \mathrm{softmax}(\{\mathrm{MLP}_{ctrl}(S^l_1[0]), \cdots, \mathrm{MLP}_{ctrl}(S^l_M[0])\}), \tag{5}$$

where $S^l_m[0]$ denotes the first token in the output of the $m$-th module.
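A minimal sketch of the controller of Eq. (5), assuming a shared two-layer MLP that scores the first ($\langle\mathrm{BOS}\rangle$-position) token of each module's output; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Sketch of Eq. (5): a shared MLP scores the first token of each module's
    output, and a softmax across modules yields the selection weights."""

    def __init__(self, d_model: int = 256, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, module_outputs: torch.Tensor) -> torch.Tensor:
        # module_outputs: (batch, M, seq_len, d_model); position 0 is the BOS token.
        first_tokens = module_outputs[:, :, 0, :]       # S^l_m[0], shape (B, M, D)
        # Sec. 4.1 stops gradients from the controller to these tokens, which would
        # correspond to first_tokens = first_tokens.detach() during training.
        scores = self.mlp(first_tokens).squeeze(-1)     # (B, M)
        return torch.softmax(scores, dim=-1)            # {w^l_1, ..., w^l_M}

ctrl = Controller()
w = ctrl(torch.randn(2, 6, 10, 256))
print(w.shape, w.sum(dim=-1))  # (2, 6); each row sums to 1
```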
3.2 Losses for Module Selectivity
We introduce two losses, i.e., a sparsity loss and a diversity loss, to improve the module selectivity.
Following conventional approaches, the loss for the VQA task $\mathcal{L}_T$ is a cross-entropy loss for classification over predetermined answer candidates. The first token in the output of the last layer, $S^L[0]$, is fed into the head network $\mathrm{MLP}_{head}$ for classification, and the class probabilities of the answers $p_a$ are obtained via the softmax function:

$$p_a = \mathrm{softmax}(\mathrm{MLP}_{head}(S^L[0])). \tag{6}$$
Then, we adopt a sparsity loss so that each layer has a high module selectivity:

$$\mathcal{L}_S = \frac{1}{B} \sum_{b=1}^{B} \left( \sum_{l=1}^{L} \sum_{m=1}^{M} w^l_m(b)^{1/2} - L \right), \tag{7}$$

where $B$ is the batch size. For each layer, since the weights are non-negative and sum to 1, the sum of the weights raised to a power less than one is at least 1, with equality only when the selection is sparse.
Figure 2: An example of the approach to dependency parsing. Dependency tag embeddings are added to word embeddings, and dependency heads are transformed into attention masks that present a single phrase sequentially for each layer.
In addition, a diversity loss is added so that all the modules are selected uniformly ($\approx L/M$ each) throughout the dataset:

$$\mathcal{L}_D = \sum_{m=1}^{M} \left| \frac{1}{B} \sum_{b=1}^{B} \sum_{l=1}^{L} w^l_m(b) - \frac{L}{M} \right|. \tag{8}$$
Then, the final loss function is

$$\mathcal{L}_T + \gamma \mathcal{L}_S + \varepsilon \mathcal{L}_D, \tag{9}$$

where γ and ε are coefficients for adjusting the sparsity and diversity.
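The following sketch implements Eqs. (7)-(9) as reconstructed above (treating the deviation in Eq. (8) as an absolute value, which is an assumption); `weights` stacks the module-selection weights over layers.

```python
import torch

def sparsity_loss(weights: torch.Tensor) -> torch.Tensor:
    """Eq. (7): weights has shape (B, L, M), each (b, l) row summing to 1.
    Per layer, sum_m sqrt(w) >= 1, so subtracting L gives a non-negative penalty
    that vanishes only when every layer selects a single module."""
    B, L, M = weights.shape
    return (weights.clamp(min=0).sqrt().sum(dim=(1, 2)) - L).mean()

def diversity_loss(weights: torch.Tensor) -> torch.Tensor:
    """Eq. (8), with an absolute-value deviation (an assumption): push each
    module's batch-averaged total weight over the layers towards L / M."""
    B, L, M = weights.shape
    per_module = weights.sum(dim=1).mean(dim=0)   # (M,): (1/B) sum_b sum_l w^l_m(b)
    return (per_module - L / M).abs().sum()

def total_loss(task_loss, weights, gamma=5e-4, eps=5e-5):
    """Eq. (9): L_T + gamma * L_S + eps * L_D."""
    return task_loss + gamma * sparsity_loss(weights) + eps * diversity_loss(weights)

w = torch.softmax(torch.randn(4, 22, 6), dim=-1)  # (batch, layers, modules)
print(sparsity_loss(w).item(), diversity_loss(w).item())
```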
4 EXPERIMENTS
4.1 Preprocess and Parameters
We use Faster R-CNN with ResNet-101 backbone
pretrained by the bottom-up attention method (Ander-
son et al., 2018) to extract object features from the
input images, and we set a fixed number of objects
(O = 36). The byte-level Byte-Pair-Encoding tokenizer and Transformer blocks (1032 hidden units and 12 attention heads), as in RoBERTa (Liu et al., 2019), are applied to our model. We use spaCy (https://spacy.io/) to extract the dependencies from the input question. Moreover, the number of layers is set to 22, the maximum number of phrases in the question sentences obtained by dependency parsing. We set the loss coefficients γ and ε to 5e-4 and 5e-5, respectively, and train the model using the Adam optimizer with a maximum learning rate of 8e-5, 4,000 warm-up steps, and linear decay. Training is executed for 40 epochs on 16 A100 GPUs with a total batch size of 512. Since the loss decreases rapidly, the gradients from the controller $\mathrm{MLP}_{ctrl}$ to the first tokens are not allowed to back-propagate.
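The warm-up and linear-decay schedule described above could be set up as follows; this is a sketch rather than the authors' training code, and the total number of training steps is a placeholder.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 10)                 # stand-in for the SMT parameters
optimizer = Adam(model.parameters(), lr=8e-5)   # maximum learning rate of 8e-5

warmup_steps, total_steps = 4_000, 100_000      # total_steps is a placeholder

def lr_lambda(step: int) -> float:
    # Linear warm-up for the first 4,000 steps, then linear decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for _ in range(5):                              # inside the training loop
    optimizer.step()                            # (no gradients here; shown only to advance the schedule)
    scheduler.step()
print(optimizer.param_groups[0]["lr"])
```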
4.2 Datasets
CLEVR is a VQA dataset that consists of images depicting simple 3D objects with a finite set of attributes and questions composed of combinations of fundamental sub-tasks; it is used to focus the evaluation on reasoning and systematic generalization. It contains three splits: a training set of 70k images and 700k questions, a validation set of 15k images and 150k questions, and a test set of 15k images and 15k questions.
CLOSURE is a dataset complementary to CLEVR that consists of images from the CLEVR validation set and questions built from novel combinations of the same fundamental sub-tasks as CLEVR. It provides a validation set, a test set, and a small training set for few-shot learning. The validation and test sets consist of seven question cases, depending on the combination of phrases, and each case contains 36k questions. By training on CLEVR and evaluating on CLOSURE, we can evaluate the ability to generalize systematically to novel combinations of known sub-tasks.
4.3 Results
Table 1 shows the evaluation results on the CLEVR
validation set and the CLOSURE test set for the
models trained on the CLEVR training set. Our
SMT achieved new state-of-the-art accuracy on CLO-
SURE compared to the models without GT programs
cited in the first-row group.
Table 1: Performance on the systematic generalization of novel linguistic combinations. For the proposed model, the mean ± standard deviation over five runs is reported. The result of MDETR is cited from (Yamada et al., 2022) and the others are cited from (Akula et al., 2021); variance is reported.

Model                                  | GT programs (training / evaluation) | CLEVR val   | CLOSURE
MAC (Hudson and Manning, 2018)         |              - / -                  | 99.1        | 71.6
ViLBERT (Lu et al., 2019)              |              - / -                  | 95.3        | 51.2
MDETR (Kamath et al., 2021)            |              - / -                  | 99.7        | 53.3

RoBERTa, 22-layer                      |              - / -                  | 98.5 ± 0.03 | 65.7 ± 2.1
SMT (Ours)                             |              - / -                  | 98.3 ± 0.05 | 77.0 ± 2.2

NS-VQA (Yi et al., 2018)               |              ✓ / -                  | 99.2        | 76.4
PG-Vector-NMN (Bahdanau et al., 2020)  |              ✓ / -                  | 98.8        | 71.0
NMN w/ CoSAtt (Akula et al., 2021)     |              ✓ / -                  | 98.9 ± 0.1  | 88.0 ± 0.2
Moreover, our performance on CLOSURE cannot be achieved by simply inputting tokens of questions and images into a 22-layer RoBERTa, as shown in the second-row group. On the other hand, in comparison to the models using a program generator, cited in the third-row group, the proposed model is largely inferior to NMN w/ CoSAtt but is comparable to NS-VQA and superior to PG-Vector-NMN. Therefore, our model shows promising performance for systematic generalization on novel linguistic compositions without any program annotations.
4.4 Ablation Studies
Table 2 shows the effect of the approaches applied to SMT. If the information from dependency parsing is not used (the second row), the performance on CLOSURE decreases significantly. This result shows that decomposing the question into phrases based on dependency parsing, instead of GT programs, greatly contributes to systematic generalization for novel combinations of linguistic constructs. On the other hand, SMT causes a degradation on "and_mat_spa", which remains a major issue to address. When our sparsity loss and diversity loss are not used (the third row), the performance on CLOSURE is slightly decreased; therefore, we believe that improving the module selectivity with the proposed losses contributes to the systematic generalization performance.
4.5 Preliminary Experiments
We first investigated whether the proposed approach can achieve systematic generalization when GT programs are used. To this end, we evaluate the performance on CLOSURE with the following two models, illustrated in Figure 3. In Fig. 3 (a), we input the embedding of the function name in the GT programs (e.g., FIND, FILTER, or COUNT) to the controller $\mathrm{MLP}_{ctrl}$ instead of the first token in each module's output. This setting provides the oracle for the selection of modules.
Figure 3: Illustration of the models using the GT programs for the preliminary experiments. (a) Instead of $\langle\mathrm{BOS}\rangle$ tokens, an embedding of the GT function is input to $\mathrm{MLP}_{ctrl}$; (b) furthermore, instead of question tokens, an embedding of the GT argument is input to each module.
In Fig. 3 (b), in addition to the previous setting, the modules take the embedding of the argument name in the GT programs (e.g., "red", "left", or "sphere") as the linguistic input instead of the question. This setting also provides the oracle for the linguistic composition of the question. The number of layers is set to 26 (i.e., the maximum program length), and unused functions and arguments are filled with a $\langle\mathrm{PAD}\rangle$ token (i.e., padded to align the size of the input tokens).

The results in Table 3 show that the performance on CLOSURE improves substantially, except for the "or_mat" and "or_mat_spa" cases, when both functions and arguments (i.e., the GT programs) are used: the functions select the modules, and the arguments restrict the tokens of the linguistic input.
Table 2: Ablation studies for SMT. The mean ± standard deviation over five runs is reported. The first column reports CLEVR val accuracy; the remaining columns report CLOSURE accuracy per question case and overall.

Model        | CLEVR val   | and_mat_spa | or_mat     | or_mat_spa | compare_mat | compare_mat_spa | embed_spa_mat | embed_mat_spa | overall
SMT          | 98.3 ± 0.05 | 66.1 ± 16.1 | 77.4 ± 3.4 | 46.5 ± 1.8 | 93.3 ± 2.3  | 94.6 ± 1.4      | 96.1 ± 0.8    | 64.8 ± 2.0    | 77.0 ± 2.2
w/o parsing  | 98.5 ± 0.04 | 91.6 ± 3.6  | 26.0 ± 5.6 | 36.4 ± 5.9 | 81.3 ± 6.2  | 77.8 ± 4.0      | 98.5 ± 0.2    | 62.7 ± 0.7    | 67.8 ± 2.6
w/o S&D loss | 98.3 ± 0.01 | 60.9 ± 9.2  | 77.3 ± 3.9 | 48.4 ± 4.8 | 89.9 ± 1.6  | 92.0 ± 2.1      | 96.1 ± 0.7    | 62.6 ± 0.8    | 75.3 ± 1.8
Table 3: Preliminary experiments leading to our proposed model. CLOSURE accuracy per question case.

Model                      | and_mat_spa | or_mat | or_mat_spa | compare_mat | compare_mat_spa | embed_spa_mat | embed_mat_spa
SMT, 26-layer, w/o parsing | 86.0        | 25.6   | 30.1       | 83.3        | 79.5            | 98.3          | 62.1
  w/ GT func.              | 90.3        | 41.3   | 44.0       | 80.7        | 76.5            | 98.6          | 63.7
  w/ GT prog.              | 97.4        | 59.0   | 44.7       | 93.8        | 95.9            | 99.1          | 89.3
4.6 Visualization of Module Selectivity
For each question case in the CLOSURE test set, we visualize the mean weight maps of module selection in SMT for different loss coefficients for sparsity and diversity (i.e., γ and ε) in Fig. 4. The weight maps differ across the CLOSURE question cases, and the module selectivity becomes higher as the sparsity and diversity coefficients increase. We show the original setting in Fig. 4 (a) and a higher-selectivity setting in Fig. 4 (b). With the higher module selectivity, the accuracy on CLOSURE appears to be slightly reduced. We hypothesize that the learning capacity of the modules is constrained if their arrangement is strictly controlled before they have learned, through trial and error, how to complete the task to a certain extent.
5 CONCLUSION
We proposed the Self-Modularized Transformer (SMT), which learns to decompose questions into sub-tasks and to compose modules without such annotations. SMT outperforms a previous state-of-the-art NMN that does not use human-annotated programs on the CLOSURE dataset, which provides novel combinations of the sub-tasks. Our analysis reveals that restricting the input tokens to a part of the question sentence is the key to learning the sub-tasks effectively. This approach helps to achieve systematic generalization even for datasets without human-designed annotations, to which NMNs cannot be applied.
With the proposed SMT, the standard deviation of the CLOSURE accuracy is large (on "and_mat_spa" in particular); therefore, we intend to improve the control approach for module selection ($\mathrm{MLP}_{ctrl}$) in future work. In addition, since dependency parsing does not significantly contribute to the performance improvement on "and_mat_spa" and "embed_mat_spa", it is necessary to analyze the key factors for learning those sub-tasks individually.
REFERENCES
Akula, A., Jampani, V., Changpinyo, S., and Zhu, S.-C.
(2021). Robust visual reasoning via language guided
neural module networks. In NeurIPS, pages 11041–
11053.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M.,
Gould, S., and Zhang, L. (2018). Bottom-up and top-
down attention for image captioning and visual ques-
tion answering. In CVPR, pages 6077–6086.
Andreas, J., Rohrbach, M., Darrell, T., and Klein, D.
(2016a). Learning to compose neural networks for
question answering. In NAACL, pages 1545–1554.
Andreas, J., Rohrbach, M., Darrell, T., and Klein, D.
(2016b). Neural module networks. In CVPR, pages
39–48.
Figure 4: Visualization of the mean weight maps of module selection (module no. vs. layer no.) for each question case on the CLOSURE test set. Numbers in brackets indicate the accuracy for each question case on CLOSURE. (a) γ = 5e-4, ε = 5e-5: and_mat_spa (61.2), or_mat (79.7), or_mat_spa (47.4), compare_mat (93.0), compare_mat_spa (93.8), embed_spa_mat (96.6), embed_mat_spa (67.0), overall (77.0). (b) γ = 2e-3, ε = 1e-3: and_mat_spa (79.1), or_mat (77.3), or_mat_spa (51.9), compare_mat (78.7), compare_mat_spa (79.6), embed_spa_mat (97.0), embed_mat_spa (62.7), overall (75.2).
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D.,
Lawrence Zitnick, C., and Parikh, D. (2015). VQA:
Visual Question Answering. In ICCV, pages 2425–
2433.
Bahdanau, D., de Vries, H., O’Donnell, T. J., Murty, S.,
Beaudoin, P., Bengio, Y., and Courville, A. C. (2020).
CLOSURE: assessing systematic generalization of
CLEVR models. arXiv preprint, arXiv:1912.05783.
Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H.,
de Vries, H., and Courville, A. (2019). Systematic
generalization: What is required and can it be learned?
In ICLR.
Bergen, L., O’Donnell, T. J., and Bahdanau, D. (2021).
Systematic generalization with edge transformers. In
NeurIPS, pages 1390–1402.
Chen, W., Gan, Z., Li, L., Cheng, Y., Wang, W., and Liu,
J. (2021). Meta module network for compositional
visual reasoning. In WACV, pages 655–664.
D’Amario, V., Sasaki, T., and Boix, X. (2021). How mod-
ular should neural module networks be for systematic
generalization? In NeurIPS, pages 23374–23385.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and
Parikh, D. (2017). Making the V in VQA matter:
Elevating the role of image understanding in Visual
Question Answering. In CVPR, pages 6904–6913.
Hu, R., Andreas, J., Darrell, T., and Saenko, K. (2018). Ex-
plainable neural computation via stack neural module
networks. In ECCV, pages 53–69.
Hudson, D. A. and Manning, C. D. (2018). Compositional
attention networks for machine reasoning. In ICLR.
Hudson, D. A. and Manning, C. D. (2019). GQA: A new
dataset for real-world visual reasoning and compo-
sitional question answering. In CVPR, pages 6693–
6702.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei,
L., Lawrence Zitnick, C., and Girshick, R. (2017a).
CLEVR: A diagnostic dataset for compositional lan-
guage and elementary visual reasoning. In CVPR,
pages 2901–2910.
Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J.,
Fei-Fei, L., Zitnick, C. L., and Girshick, R. (2017b).
Inferring and executing programs for visual reasoning.
In ICCV.
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I.,
and Carion, N. (2021). MDETR-modulated detection
for end-to-end multi-modal understanding. In ICCV,
pages 1780–1790.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-
J., Shamma, D. A., et al. (2017). Visual genome:
Connecting language and vision using crowdsourced
dense image annotations. Int. Journal of Computer
Vision, 123(1):32–73.
Lake, B. and Baroni, M. (2018). Generalization with-
out systematicity: On the compositional skills of
sequence-to-sequence recurrent networks. In ICML,
pages 2873–2882.
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-
W. (2020). What does BERT with vision look at? In
ACL, pages 5265–5275.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen,
D., Levy, O., Lewis, M., Zettlemoyer, L., and
Stoyanov, V. (2019). RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.
Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). ViL-
BERT: Pretraining task-agnostic visiolinguistic repre-
sentations for vision-and-language tasks. In NeurIPS,
pages 13–23.
Ruis, L., Andreas, J., Baroni, M., Bouchacourt, D., and
Lake, B. M. (2020). A benchmark for systematic gen-
eralization in grounded language understanding. In
NeurIPS, pages 19861–19872.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018).
Conceptual captions: A cleaned, hypernymed, image
alt-text dataset for automatic image captioning. In
ACL, volume 1, pages 2556–2565.
Shi, J., Zhang, H., and Li, J. (2019). Explainable and ex-
plicit visual reasoning over scene graphs. In CVPR,
pages 8376–8384.
Tan, H. and Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5100–5111.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In NeurIPS, pages
5998–6008.
Yamada, M., D’Amario, V., Takemoto, K., Boix, X., and
Sasaki, T. (2022). Transformer module networks for
systematic generalization in visual question answer-
ing. arXiv preprint arXiv:2201.11316.
Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenen-
baum, J. (2018). Neural-symbolic VQA: Disentan-
gling reasoning from vision and language understand-
ing. In NeurIPS, pages 1039–1050.
Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L.
(2016). Modeling context in referring expressions. In
ECCV, pages 69–85.