Self-Modularized Transformer: Learn to Modularize Networks for Systematic Generalization

Yuichi Kamata; Moyuru Yamada; Takayuki Okatani

doi:10.5220/0011682100003417

2023

Abstract

Visual Question Answering (VQA) is the task of answering questions about images. It fundamentally requires systematic generalization capabilities, i.e., handling novel combinations of known visual attributes (e.g., color and shape) or visual sub-tasks (e.g., FILTER and COUNT). Recent research reports that Neural Module Networks (NMNs), which compose modules that tackle sub-tasks according to a given layout, are a promising approach to systematic generalization in VQA. However, their performance relies heavily on human-designed sub-tasks and layouts. Although these annotations are crucial for training, most datasets do not contain them. To overcome this limitation of NMNs, this paper proposes the Self-Modularized Transformer (SMT), a novel Transformer-based NMN that concurrently learns to decompose the question into sub-tasks and to compose modules without such annotations. SMT outperforms state-of-the-art NMNs and multi-modal Transformers in systematic generalization to novel combinations of sub-tasks in VQA.

Paper Citation

in Harvard Style

Kamata Y., Yamada M. and Okatani T. (2023). Self-Modularized Transformer: Learn to Modularize Networks for Systematic Generalization. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP; ISBN 978-989-758-634-7, SciTePress, pages 599-606. DOI: 10.5220/0011682100003417

in Bibtex Style

@conference{visapp23,
author={Yuichi Kamata and Moyuru Yamada and Takayuki Okatani},
title={Self-Modularized Transformer: Learn to Modularize Networks for Systematic Generalization},
booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP},
year={2023},
pages={599-606},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011682100003417},
isbn={978-989-758-634-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP
TI - Self-Modularized Transformer: Learn to Modularize Networks for Systematic Generalization
SN - 978-989-758-634-7
AU - Kamata Y.
AU - Yamada M.
AU - Okatani T.
PY - 2023
SP - 599
EP - 606
DO - 10.5220/0011682100003417
PB - SciTePress