Bridging Vision and Language: A CNN-Transformer Model for Image Captioning
Radha Seelaboyina, Naveen Rampa, Alle Sai Shivanandha, Chamakura Yashwanth Reddy
2025
Abstract
Computer vision and natural language processing have recently converged on the task of automatically generating descriptive sentences for images, known as image captioning. This requires understanding the semantics of an image and composing well-structured sentences that describe its visual content in textual form. Recent advances in artificial intelligence (AI) have led researchers toward deep learning methods that exploit large datasets and greater computing power to build effective models. The encoder-decoder architecture, which combines Convolutional Neural Networks (CNNs) with Transformers, is most commonly employed for this purpose. A pre-trained CNN, e.g., EfficientNetB0, first extracts image features, which a Transformer-based decoder then processes to produce relevant captions. The model is trained on the Flickr_8k dataset of 8,000 images, each paired with five different captions, which improves the contextual richness of the generated descriptions.
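As a rough illustration of the encoder-decoder pipeline the abstract describes, the sketch below wires a pre-trained EfficientNetB0 feature extractor to a single Transformer-style decoder block in TensorFlow/Keras (2.10+). The image size, vocabulary size, sequence length, and layer widths are assumptions for illustration only, not the authors' reported configuration.

import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 224      # assumed input resolution for EfficientNetB0
VOCAB_SIZE = 8500   # assumed caption vocabulary size for Flickr_8k
SEQ_LEN = 25        # assumed maximum caption length
EMBED_DIM = 256     # assumed model / embedding dimension

# Encoder: pre-trained EfficientNetB0 used as a frozen image feature extractor.
cnn = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(IMG_SIZE, IMG_SIZE, 3))
cnn.trainable = False

image_input = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3), name="image")
feat = cnn(image_input)                              # (batch, 7, 7, 1280)
feat = layers.Reshape((-1, feat.shape[-1]))(feat)    # flatten to (batch, 49, 1280)
feat = layers.Dense(EMBED_DIM)(feat)                 # project to decoder width

# Decoder: embedded caption tokens, causal self-attention, then cross-attention
# over the image features; a softmax head predicts the next word.
# (Positional embeddings, residuals, and block stacking are omitted for brevity.)
caption_input = layers.Input(shape=(SEQ_LEN,), dtype="int64", name="caption")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_input)
x = layers.MultiHeadAttention(num_heads=4, key_dim=EMBED_DIM)(
    query=x, value=x, key=x, use_causal_mask=True)
x = layers.LayerNormalization()(x)
x = layers.MultiHeadAttention(num_heads=4, key_dim=EMBED_DIM)(
    query=x, value=feat, key=feat)
x = layers.LayerNormalization()(x)
x = layers.Dense(EMBED_DIM, activation="relu")(x)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = tf.keras.Model([image_input, caption_input], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

A full implementation would add positional embeddings, residual connections, and several stacked decoder blocks; this sketch keeps only the components named in the abstract.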
Paper Citation
in Harvard Style
Seelaboyina R., Rampa N., Shivanandha A. and Reddy C. (2025). Bridging Vision and Language: A CNN-Transformer Model for Image Captioning. In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT'25; ISBN 978-989-758-777-1, SciTePress, pages 125-132. DOI: 10.5220/0013923800004919
in BibTeX Style
@conference{icrdicct25,
author={Radha Seelaboyina and Naveen Rampa and Alle Sai Shivanandha and Chamakura Yashwanth Reddy},
title={Bridging Vision and Language: A CNN-Transformer Model for Image Captioning},
booktitle={Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT'25},
year={2025},
pages={125-132},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013923800004919},
isbn={978-989-758-777-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies - ICRDICCT'25
TI - Bridging Vision and Language: A CNN-Transformer Model for Image Captioning
SN - 978-989-758-777-1
AU - Seelaboyina R.
AU - Rampa N.
AU - Shivanandha A.
AU - Reddy C.
PY - 2025
SP - 125
EP - 132
DO - 10.5220/0013923800004919
PB - SciTePress