AR-VPT: Simple Auto-Regressive Prompts for Adapting Frozen ViTs to Videos

Muhammad Zain Yousuf, Syed Wasim, Syed Hasany, Muhammad Farhan

2024

Abstract

The rapid progress of deep learning in image recognition has driven increasing interest in video recognition. While image recognition has benefited from an abundance of pre-trained models, video recognition remains challenging due to the absence of strong pre-trained models and the computational cost of training from scratch. Transfer learning techniques have been used to leverage pre-trained networks for video recognition by extracting features from individual frames and combining them for decision-making. In this paper, we explore the use of Visual-Prompt Tuning (VPT), a computationally efficient technique previously proposed for image recognition, for video recognition. Our contributions are two-fold. First, we introduce the Auto-Regressive Visual Prompt Tuning (AR-VPT) method, which performs temporal modeling and thereby addresses the weakness of VPT in this respect. Second, we achieve significantly improved performance compared to vanilla VPT on three benchmark datasets: UCF-101, Diving-48, and Something-Something-v2. Our proposed method achieves an optimal trade-off between performance and computational cost, making it a promising approach for video recognition tasks.
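The contrast the abstract draws between frame-wise VPT and auto-regressive prompts can be illustrated with a toy sketch. The abstract does not say how AR-VPT propagates its prompts, so everything below is an assumption made for illustration only: `frozen_encoder` is a hypothetical stand-in for a frozen ViT, and the update rule (learned base prompts plus the previous frame's prompt outputs) is one plausible reading of "auto-regressive", not the authors' implementation.

```python
def mean_token(tokens):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def frozen_encoder(tokens):
    """Hypothetical stand-in for a frozen ViT: mixes each token with the
    mean of all tokens. It has no trainable parameters."""
    context = mean_token(tokens)
    return [[0.5 * t[d] + 0.5 * context[d] for d in range(len(t))]
            for t in tokens]


def encode_video(frames, base_prompts, autoregressive):
    """Encode frames one by one and average the per-frame features.

    frames: list of frames, each a list of patch-token vectors.
    base_prompts: the prompt vectors (the only trainable part in VPT).
    autoregressive: if True, the prompts fed to frame t are the base
        prompts plus the prompt outputs of frame t-1 (assumed update rule).
    """
    num_prompts = len(base_prompts)
    prompts = [p[:] for p in base_prompts]
    frame_features = []
    for patches in frames:
        out = frozen_encoder(prompts + patches)  # prompts are prepended
        prompt_out, patch_out = out[:num_prompts], out[num_prompts:]
        frame_features.append(mean_token(patch_out))
        if autoregressive:
            # Carry information from this frame into the next frame's prompts.
            prompts = [[b + o for b, o in zip(base, carried)]
                       for base, carried in zip(base_prompts, prompt_out)]
    # Combine per-frame features by simple averaging (late fusion).
    return mean_token(frame_features)


# Three toy frames, two 2-dimensional patch tokens each, and one prompt.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [4.0, 0.0]],
    [[0.0, 3.0], [1.0, 1.0]],
]
base_prompts = [[0.1, -0.1]]

static_feature = encode_video(frames, base_prompts, autoregressive=False)
ar_feature = encode_video(frames, base_prompts, autoregressive=True)
print(static_feature, ar_feature)
```

In this toy setting the frame-wise variant returns the same video feature when the frames are reversed, because each frame is encoded independently and the features are averaged. The auto-regressive variant does not, since each frame's prompts depend on the frames before it. This order sensitivity is the kind of temporal modeling the abstract says vanilla VPT lacks.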

Paper Citation


in Harvard Style

Zain Yousuf M., Wasim S., Hasany S. and Farhan M. (2024). AR-VPT: Simple Auto-Regressive Prompts for Adapting Frozen ViTs to Videos. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 632-638. DOI: 10.5220/0012392000003660


in Bibtex Style

@conference{visapp24,
author={Muhammad Zain Yousuf and Syed Wasim and Syed Hasany and Muhammad Farhan},
title={AR-VPT: Simple Auto-Regressive Prompts for Adapting Frozen ViTs to Videos},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2024},
pages={632-638},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012392000003660},
isbn={978-989-758-679-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - AR-VPT: Simple Auto-Regressive Prompts for Adapting Frozen ViTs to Videos
SN - 978-989-758-679-8
AU - Zain Yousuf M.
AU - Wasim S.
AU - Hasany S.
AU - Farhan M.
PY - 2024
SP - 632
EP - 638
DO - 10.5220/0012392000003660
PB - SciTePress