
REFERENCES
Aburass, S. and Dorgham, O. (2023). Performance evaluation of Swin vision transformer model using gradient accumulation optimization technique.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433.
Coquenet, D., Rambour, C., Dalsasso, E., and Thome, N. (2023). Leveraging vision-language foundation models for fine-grained downstream tasks.
Huang, T., Qasemi, E., Li, B., Wang, H., Brahman, F., Chen, M., and Chaturvedi, S. (2023). Affective and dynamic beam search for story generation.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation.
Li, P., Yang, Q., Geng, X., Zhou, W., Ding, Z., and Nian, Y. (2024). Exploring diverse methods in visual question answering. In 2024 5th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), pages 681–685. IEEE.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR).
Lv, K., Yang, Y., Liu, T., Gao, Q., Guo, Q., and Qiu, X. (2024). Full parameter fine-tuning for large language models with limited resources.
Ma, J., Wang, P., Kong, D., Wang, Z., Liu, J., Pei, H., and Zhao, J. (2024). Robust visual question answering: Datasets, methods, and future challenges.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Mao, A., Mohri, M., and Zhong, Y. (2023). Cross-entropy loss functions: Theoretical analysis and applications.
Nguyen, D.-K. and Okatani, T. (2018). Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6087–6096.
Ondeng, O., Ouma, H., and Akuon, P. (2023). A review of transformer-based approaches for image captioning. Applied Sciences, 13(19).
O'Shea, K. and Nash, R. (2015). An introduction to convolutional neural networks. CoRR, abs/1511.08458.
Piergiovanni, A., Kuo, W., and Angelova, A. (2022). Pre-training image-language transformers for open-vocabulary tasks.
Ramaswamy, V. V., Lin, S. Y., Zhao, D., Adcock, A. B., van der Maaten, L., Ghadiyaram, D., and Russakovsky, O. (2023). GeoDE: A geographically diverse evaluation dataset for object recognition.
Ravi, S., Chinchure, A., Sigal, L., Liao, R., and Shwartz, V. (2022). VLC-BERT: Visual question answering with contextualized commonsense knowledge.
Risch, J., Möller, T., Gutsch, J., and Pietsch, M. (2021). Semantic answer similarity for evaluating question answering models.
Ruan, B.-K., Shuai, H.-H., and Cheng, W.-H. (2022). Vision transformers: State of the art and research challenges.
Salaberria, A., Azkune, G., Lopez de Lacalle, O., Soroa, A., and Agirre, E. (2023). Image captioning for effective use of language models in knowledge-based visual question answering. Expert Systems with Applications, 212:118669.
Salehin, I. and Kang, D.-K. (2023). A review on dropout regularization approaches for deep neural networks within the scholarly domain. Electronics, 12(14).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need.
Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., and Wang, L. (2022). GIT: A generative image-to-text transformer for vision and language.