RecViT: Enhancing Vision Transformer with Top-Down Information Flow

Štefan Pócoš, Iveta Bečková, Igor Farkaš

2024

Abstract

We propose and analyse a novel neural network architecture, the recurrent vision transformer (RecViT). Building upon the popular vision transformer (ViT), we add a biologically inspired top-down connection, letting the network ‘reconsider’ its initial prediction. Moreover, the recurrent connection makes it possible to feed multiple similar, yet slightly modified or augmented, inputs into the network in a single forward pass. As it has been shown that a top-down connection can increase accuracy in the case of convolutional networks, we analyse our architecture, combined with multiple training strategies, in the adversarial examples (AEs) scenario. Our results show that some versions of RecViT indeed exhibit more robust behaviour than the baseline ViT, yielding ≈18 % and ≈22 % absolute improvement in robustness on the tested datasets, while the accuracy drop was only ≈1 %. We also leverage the fact that transformer networks have a certain level of inherent explainability. By visualising attention maps of various input images, we gain some insight into the inner workings of our network. Finally, using annotated segmentation masks, we numerically compare the quality of attention maps on original and adversarial images.
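
The abstract does not specify how the top-down connection is wired, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy single-layer self-attention encoder in NumPy, where a summary of the previous pass (the first token) is projected and added to every token of the next input. The class name `ToyRecurrentEncoder`, the feedback projection `W_fb`, the `tanh` squashing, and all sizes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class ToyRecurrentEncoder:
    """Toy attention encoder with a hypothetical top-down feedback loop."""

    def __init__(self, dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.Wq = rng.normal(0.0, scale, (dim, dim))
        self.Wk = rng.normal(0.0, scale, (dim, dim))
        self.Wv = rng.normal(0.0, scale, (dim, dim))
        # Assumed top-down projection: maps the previous summary to a feedback vector.
        self.W_fb = rng.normal(0.0, scale, (dim, dim))
        self.W_out = rng.normal(0.0, scale, (dim, n_classes))

    def encode(self, tokens):
        """Single-head self-attention with a residual connection."""
        q, k, v = tokens @ self.Wq, tokens @ self.Wk, tokens @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        return tokens + attn @ v

    def forward(self, views):
        """Process a sequence of views; each pass sees feedback from the previous one."""
        feedback = np.zeros(self.dim)  # no top-down signal before the first pass
        predictions = []
        for tokens in views:
            hidden = self.encode(tokens + feedback)  # feedback is added to every token
            summary = hidden[0]  # first token plays the role of a class token
            predictions.append(softmax(summary @ self.W_out))
            feedback = np.tanh(summary @ self.W_fb)
        return predictions


# Usage: the same image (1 class token + 4 patch tokens) and a slightly perturbed copy.
rng = np.random.default_rng(1)
image_tokens = rng.normal(size=(5, 8))
views = [image_tokens, image_tokens + 0.01 * rng.normal(size=(5, 8))]
model = ToyRecurrentEncoder(dim=8, n_classes=3)
predictions = model.forward(views)
```

Here the second entry of `predictions` is the 'reconsidered' output, computed after the first pass has fed its summary back into the input. The weights are random and untrained, so the values themselves carry no meaning; only the data flow is of interest.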

Paper Citation


in Harvard Style

Pócoš Š., Bečková I. and Farkaš I. (2024). RecViT: Enhancing Vision Transformer with Top-Down Information Flow. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 749-756. DOI: 10.5220/0012464700003660


in Bibtex Style

@conference{visapp24,
author={Štefan Pócoš and Iveta Bečková and Igor Farkaš},
title={RecViT: Enhancing Vision Transformer with Top-Down Information Flow},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP},
year={2024},
pages={749-756},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012464700003660},
isbn={978-989-758-679-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP
TI - RecViT: Enhancing Vision Transformer with Top-Down Information Flow
SN - 978-989-758-679-8
AU - Pócoš Š.
AU - Bečková I.
AU - Farkaš I.
PY - 2024
SP - 749
EP - 756
DO - 10.5220/0012464700003660
PB - SciTePress