Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers

Xuehao Liu, Sarah Jane Delany, Susan McKeever

2024

Abstract

Large-scale transformer models are costly to fine-tune on new tasks: training demands substantial compute, time, and data, mainly because of their extensive parameter count. To address this, zero-shot and few-shot learning, aided by techniques such as prompts and parameter-efficient modules, have emerged. However, these techniques are often tailored to vision-only or language-only tasks, leaving their effectiveness on multi-modal tasks such as image captioning largely unexplored. This paper explores the effectiveness of prompts and parameter-efficient modules in reducing the training effort for image captioning. Rather than performing extensive fine-tuning, we trained only the prompt and parameter-efficient modules on the pretrained Oscar transformer model using the COCO dataset. We tested five prompt tuning approaches and two parameter-efficient methods. Notably, combining visual prompt tuning (VPT) with Adapter and LoRA led to a 2% CIDEr score improvement after just one epoch of training, with a minimal increase in trainable parameters (5.7%). Our work paves the way towards using single-stream transformer models for a variety of fine-tuned tasks, with a potentially large reduction in retraining time and processing resources.
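As a rough illustration of why modules such as LoRA add so few trainable parameters, the sketch below counts the parameters that a low-rank update adds to one frozen linear layer. It assumes the standard LoRA formulation, in which a frozen weight of shape `d_out x d_in` is augmented with two trainable factors of rank `r`. The layer size (768) and rank (8) are illustrative assumptions and are not taken from the paper, so the resulting fraction is not comparable to the 5.7% reported above.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count parameters in the two low-rank factors:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of trainable parameters when the original weight stays frozen."""
    frozen = d_in * d_out                      # original weight, not updated
    added = lora_trainable_params(d_in, d_out, rank)
    return added / (frozen + added)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d, r = 768, 8
    print("added parameters:", lora_trainable_params(d, d, r))
    print("trainable fraction: {:.4f}".format(trainable_fraction(d, d, r)))
```

Under these assumed values the low-rank factors account for roughly 2% of the layer's parameters. The actual overhead in a full model depends on which layers receive the modules and on the prompt and Adapter sizes used.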

Paper Citation


in Harvard Style

Liu X., Delany S. and McKeever S. (2024). Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 501-508. DOI: 10.5220/0012364800003660


in Bibtex Style

@conference{visapp24,
author={Xuehao Liu and Sarah Jane Delany and Susan McKeever},
title={Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2024},
pages={501-508},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012364800003660},
isbn={978-989-758-679-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Applying Prompts and Parameter-Efficient Methods to Enhance Single-Stream Vision-Language Transformers
SN - 978-989-758-679-8
AU - Liu X.
AU - Delany S.
AU - McKeever S.
PY - 2024
SP - 501
EP - 508
DO - 10.5220/0012364800003660
PB - SciTePress