Visual Question Answering Analysis: Datasets, Methods, and Image
Featurization Techniques
Vijay Kumari 1, Abhimanyu Sethi 1, Yashvardhan Sharma 1 and Lavika Goel 2
1 Birla Institute of Technology and Science, Pilani, Rajasthan, India
2 Malaviya National Institute of Technology, Jaipur, Rajasthan, India
Keywords:
Computer Vision, Natural Language Processing (NLP), Visual Question Answering (VQA), Attention
Mechanism, Convolutional Neural Networks.
Abstract:
Holistic scene understanding is a long-standing objective at the core of Artificial Intelligence (AI). Multimodal tasks that synergize capabilities spanning multiple domains, such as visual-linguistic capabilities, into intelligent systems are thus a desideratum for the next step in AI. Visual Question Answering (VQA), which integrates Computer Vision and Natural Language Processing into the task of answering natural language questions about an image, represents one such domain. There is a need to explore Deep Learning techniques that can help such systems move beyond the language biases of real-world priors that presently prevent them from serving as a veritable touchstone for holistic scene understanding. Furthermore, the effectiveness of the Transformer architecture in the image featurization pipeline of VQA systems remains largely untested. Hence, an exhaustive study of the performance of various model architectures under varied training conditions on VQA datasets such as VizWiz and VQA v2 is imperative to further this area of research. This study explores architectures that utilize image and question co-attention for the task of VQA, along with several CNN architectures, including ResNet, VGG, EfficientNet, and DenseNet. The Vision Transformer architecture is also explored for image featurization, and several loss functions, namely cross-entropy, focal loss, and UniLoss, are employed for training the models. Finally, the trained model is deployed using Flask, and a GUI has been implemented that lets users submit an image and an accompanying question about the image to generate an answer in response.
1 INTRODUCTION
There has been an astronomical surge in Computer Vision and Deep Learning research for tasks pertaining to visual capabilities. Provided with adequate data, deep Convolutional Neural Networks (CNNs) have come to be on par with human-level performance in classification tasks (He et al., 2016a). However, a holistic understanding of images, a desideratum for the next step in AI, requires multimodal learning capabilities. Recent research has paved the way for exploring tasks that broaden the spectrum of Artificial Intelligence towards more holistic capabilities, including multimodal learning, an essential sub-field for truly AI-complete tasks.
1.1 Background
Visual Question Answering (VQA) is the multimodal
task of answering natural language questions about
an image. The task of open-ended VQA entails several capabilities across the domain of AI, such as activity recognition, object detection, spatial awareness, and attribute classification.
A robust VQA system is expected to be capable of reasoning about images and answering questions spanning a wide range of Computer Vision tasks. Recent work has welcomed self-attention-based architectures, specifically Transformers, as the de-facto standard for Natural Language Processing tasks. Owing to their computationally efficient architecture, models of extraordinary size can now be trained on enormous corpora of data. Computer Vision tasks, however, have remained predominantly based on convolutional architectures until recent studies began utilizing Transformer architectures in hopes of transferring their scalability and efficiency (Dosovitskiy et al., 2020). To that end, Vision Transformers (Dosovitskiy et al., 2020) have been incorporated into the pipeline of a VQA system as image feature extractors in this study.
1.2 Motivation
Apart from immediate applications in assisting the visually impaired, VQA systems serve as an essential component of the Visual Turing Test for image understanding. Multimodal research in the form of image and video captioning is extensively studied and is arguably the sibling task most closely related to VQA, as it often entails an understanding of complex object relationships and attributes to describe the contents of visual media in natural language. However, visual captioning is often shown to lack the fine-grained scene understanding required of a Visual Turing Test, and the lack of a fast, cheap, and reliable automatic evaluation for generated captions further casts doubt upon its capacity as an "AI-complete" task.
1.3 Objectives/Contributions
1. This work explores attention and co-attention mechanisms pertaining to visual and linguistic modalities.
2. It explores Vision Transformers to leverage the advantages of the Transformer architecture over Convolutional Neural Networks for image featurization.
3. It presents a comparative study of different techniques (Transformers, loss functions, multimodal fusion) in the field of VQA over the VQA v2 and VizWiz datasets.
2 VQA DATASETS, EVALUATION METRICS AND METHODOLOGY
Some of the earliest works in the domain of Visual
Question Answering were motivated by the devel-
opment of a Visual Turing Test for a Computer Vi-
sion system and combined visual-linguistic parsing
for natural language query answering. These works
were, however, limited in their scope to constrained
settings due to the unavailability of adequate datasets.
The years 2014 and 2015 saw some of the earliest
VQA datasets being publicly released, marking VQA
research’s rise. Some of them are presented below.
2.1 Early VQA Datasets
DAQUAR
One of the smallest yet earliest significant VQA
datasets to arrive, the Dataset for Question Answer-
ing on Real-world images (DAQUAR) (Malinowski
and Fritz, 2014) consisted of 6794 training and 5674
test question-answer pairs. Adding to its small size, DAQUAR consisted exclusively of indoor scenes with significant clutter and adverse lighting conditions, which made questions difficult to answer.
VQAv1
The first version of the VQA dataset (Antol et al., 2015) consisted of both 'real' and 'abstract' images. The real-images portion of the dataset comprises over 80k, 40k, and 80k images from Microsoft Common Objects in Context (MS COCO) (Lin et al., 2014) for the training, validation, and test sets, respectively. These images are complex and diverse, and hence suitable for the VQA task. The abstract-scenes subset, containing 50k cartoon scenes, was introduced to attract researchers looking to explore the high-level reasoning required for the task of VQA. The dataset comes with innate language biases, in that a large proportion of its questions can be answered without looking at the corresponding images. Additionally, some questions are highly subjective and would not strictly reflect an algorithm's true capability of solving the VQA problem.
COCO-QA
In response to the need for a more comprehensive, diverse, and complex dataset, COCO-QA aimed to create a much larger body of question-answer pairs, automatically synthesized by converting image descriptions into QA form. These QA pairs were generated on the MS-COCO dataset (Lin et al., 2014). COCO-QA contains 78,736 training and 38,948 test QA pairs. The largest potential limitation of this dataset is the question generation itself, in that it is limited to the objects described in the COCO dataset descriptions. Moreover, some of the generated questions suffer from being grammatically unclear and ambiguous.
Visual Genome
Visual Genome (Krishna et al., 2017) was the largest
dataset when it was first released, consisting of over
100k images from the MS-COCO (Lin et al., 2014)
and YFCC100M (Thomee et al., 2016) datasets.
What is unique about the Visual Genome dataset is that the questions were constrained to start with one of the 'W' question words, with an average of 17 questions per image, resulting in approximately 1.7 million QA pairs.
2.2 Evaluation Metrics
Several automatic evaluation metrics in the form of
BLEU (Papineni et al., 2002), METEOR (Banerjee
and Lavie, 2005), and ROUGE (Lin, 2004), originally developed for machine translation evaluation, have been lent to captioning tasks as well; each has its own set of limitations in dealing with natural language, which can be highly subjective.
Simple Accuracy
The task of Visual Question Answering faces similar
limitations but to a larger degree. VQA is framed ei-
ther as an open-ended problem, which entails the for-
mulation of a string to generate a natural language an-
swer or as a multiple-choice problem which reduces
to a multi-class classification problem. For the latter,
simple accuracy is often employed as the evaluation
metric of choice, though it may also be utilized for the
former. However, this can prove largely inadequate
due to the binary nature of the metric. Additionally,
this also overlooks the possibility of multiple correct
answers.
WUPS
Wu-Palmer Similarity (WUPS) (Wu and Palmer, 1994) takes semantics into consideration by striving to measure the difference between the predicted answer and the ground truth. It assigns a value between 0 and 1 based on the similarity of the predicted and ground-truth answers. Semantically similar words, such as 'whale' and 'blue whale', have a higher WUPS score than, say, 'whale' and 'table'.
VQA Challenge
The VQA Dataset contains ten answers per question
by different annotators. The evaluation metric utilized
by the dataset and associated challenge is:
$$\mathrm{Accuracy}_{VQA} = \min\left(\frac{n}{3},\, 1\right) \qquad (1)$$

where $n$ denotes the number of annotators who gave the same answer as the one predicted. A model is given full credit for a question if the predicted answer matches the answers of three or more annotators.
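As a concrete illustration, the following minimal Python sketch computes this consensus accuracy for a single question. The string normalization shown is an assumption, and the official evaluation script additionally averages the score over subsets of the ten annotators, which is omitted here.

```python
def vqa_accuracy(predicted: str, annotator_answers: list) -> float:
    """Consensus accuracy from Eq. (1): full credit once three or more of the
    (typically ten) annotators agree with the predicted answer."""
    n = sum(ans.strip().lower() == predicted.strip().lower()
            for ans in annotator_answers)
    return min(n / 3.0, 1.0)

# Example: 2 of 10 annotators agree -> accuracy 2/3
print(vqa_accuracy("blue", ["blue", "blue", "navy", "black"] + ["dark"] * 6))
```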
Human Evaluation
Finally, there is the method of human evaluation, but it presents a long list of problems that make such a method infeasible in terms of time, resources, and expenses. Additionally, measuring a system's performance iteratively in order to improve it greatly adds to the problem. Lastly, judging the quality of an answer requires criteria to be given to the judges. The ideal evaluation metric for a VQA system remains an open question.
2.3 Existing VQA Models and
Methodology
Some of the earliest VQA systems follow the most intuitive form of the VQA pipeline, and models today continue to evolve from this fundamental architecture, in which the extracted features of both the image and the input question are often fused together and then fed into a multi-layer perceptron (MLP) classifier (Antol et al., 2015; Kafle and Kanan, 2016; Zhou et al., 2015). Some of the recurring methods of combining features include element-wise addition, element-wise product, and concatenation.
Featurization approaches have varied extensively across classification frameworks. The authors
of (Zhou et al., 2015) employed the GoogLeNet archi-
tecture for extraction of image features, and a bag-of-
words model to represent the question features, which
were concatenated and then fed to a multi-class lo-
gistic regression classifier. Skip-thought vectors were
utilized by (Kafle and Kanan, 2016) for question fea-
tures and ResNet-152 for visual extraction.
The authors of (Antol et al., 2015) utilized an
LSTM encoder to represent question features with
GoogLeNet for visual features. After equalizing the
dimensionality of the two features, they were fused
using the Hadamard product, which was propagated
to a 2-layer MLP. In (Malinowski et al., 2015), each word, concatenated with the corresponding CNN features of the image, was sequentially fed into an LSTM. A softmax classifier then predicted the answer.
Several studies have pointed towards the idea that using only global features may overlook the significance of task-relevant regions of the input space. By employing attention mechanisms, models learn to 'attend' to the most relevant regions based on the given task. Attention-based architectures have proven to perform well in several NLP and vision tasks, including image captioning, semantic segmentation, object detection, and machine translation. For the task of VQA as well, instead of global features, several models have incorporated spatial attention to formulate region-specific CNN features. The motivation behind this approach is that certain visual regions in an image, and likewise certain words in a question, contain more information about the task than others. A VQA model consists of the following components or phases:
Image Featurisation
CNNs pre-trained on ImageNet have consistently served as well-performing networks for extracting features for the task of VQA. Two of the most widely
employed are the VGG (Simonyan and Zisserman,
2014) and ResNet (He et al., 2016b) networks. Owing to the increasing availability and accessibility of computational resources, recent works have leaned towards utilizing the more complex ResNets, which often achieve better results as well.
Question Featurisation
Skip-thought vectors (Kiros et al., 2015), Bag-of-
words (BOW), gated recurrent units (GRU) (Cho
et al., 2014), and long short-term memory (LSTM)
encoders (Hochreiter and Schmidhuber, 1997) are
some of the methods adopted for question featuriza-
tion.
Feature Integration
Simple mechanisms such as element-wise addition,
element-wise multiplication, and concatenation (An-
tol et al., 2015; Kafle and Kanan, 2016; Zhou et al.,
2015) are widely used in VQA research. Additionally,
Bilinear pooling (Kim et al., 2016; Saito et al., 2017)
has often been employed. Computing spatial atten-
tion maps for the visual features based on question
features (Yang et al., 2016), and using Bayesian mod-
els that utilize the question-image-answer feature dis-
tributions (Kafle and Kanan, 2016; Malinowski and
Fritz, 2014) are also prevalent.
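As a simple illustration of these fusion operators, the hedged PyTorch sketch below combines an image feature vector and a question feature vector of equal dimensionality; the dimension of 512 is an assumption for illustration only.

```python
import torch

d = 512                                  # assumed common feature dimension
img_feat = torch.randn(1, d)             # image feature vector
q_feat = torch.randn(1, d)               # question feature vector

fused_add = img_feat + q_feat                       # element-wise addition
fused_mul = img_feat * q_feat                       # element-wise (Hadamard) product
fused_cat = torch.cat([img_feat, q_feat], dim=1)    # concatenation -> (1, 2d)
```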
Answer Generation
The classification framework is the most common, but some works have also generated multi-word answers sequentially, such as (Malinowski et al., 2015), which used LSTMs for the same.
3 DATA SET(S) USED
The following datasets were employed for experiments with VQA systems (chronologically):
3.1 VizWiz
The dataset contains 20,523 image/question pairs and 205,230 answer/answer-confidence pairs for training. The validation set has 4,319 image/question pairs and 43,190 answer/answer-confidence pairs. The test set has 8,000 image/question pairs distributed among four question types: yes/no, number, other, and unanswerable.
3.2 VQA v2
The dataset contains 82,783 images, 443,757 questions, and 4,437,570 answers for training. The validation set has 40,504 images, 214,354 questions, and 2,143,540 answers. The test set has 81,434 images and 447,793 questions.
4 PROPOSED TECHNIQUE(S)
AND ALGORITHM(S)
4.1 Proposed Model
The proposed model extends the co-attention model first devised in (Lu et al., 2016), illustrated below. The model uses joint attention to simultaneously attend to both image regions and question parts. The co-attention mechanism is implemented hierarchically at three levels (word, phrase, and sentence/question level).
4.1.1 Symbolic Notations
Let the question with $M$ words be denoted as $Q = \{q_1, q_2, \ldots, q_M\}$, with $q_m$ being the $m$-th word's feature vector. The word-level, phrase-level, and question-level embeddings at position $m$ are represented as $q^{wo}_m$, $q^{ph}_m$, and $q^{qu}_m$, respectively. Image feature vectors at $K$ spatial locations, on the other hand, are represented as $I = \{i_1, i_2, \ldots, i_K\}$. $\hat{i}^l$ and $\hat{q}^l$ denote the co-attention features at each stage of the hierarchy for the image and question, respectively, where $l$ denotes the level in the hierarchy, i.e., $l \in \{wo, ph, qu\}$. $W$ represents weights throughout.
4.2 Word, Phrase, and Sentence Levels
Tokenized and one-hot encoded questions are first fed through an embedding matrix, the output of which gives the word-level embeddings $Q^{wo} = \{q^{wo}_1, q^{wo}_2, \ldots, q^{wo}_M\}$. 1-D convolutions are then applied with three filter sizes corresponding to unigram, bigram, and trigram scopes:

$$\hat{q}^{ph}_{l,m} = \tanh\left(W_l\, q^{wo}_{m:m+l-1}\right), \quad l \in \{1, 2, 3\} \qquad (2)$$

which is then max-pooled across the three scales to obtain the final phrase-level embedding at each location:

$$\hat{q}^{ph}_m = \max\left(\hat{q}^{ph}_{1,m},\, \hat{q}^{ph}_{2,m},\, \hat{q}^{ph}_{3,m}\right), \quad m \in \{1, 2, \ldots, M\} \qquad (3)$$
The phrase-level embeddings are then encoded through an LSTM to obtain the question-level (sentence-level) embeddings of the entire question. A detailed illustration of the hierarchical embeddings is presented in fig. 1.
Figure 1: Proposed VQA architecture for hierarchical em-
beddings.
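To make the hierarchy concrete, the following is a minimal PyTorch sketch of the question-side featurization (embedding, n-gram convolutions with max-pooling over scales, and an LSTM); the vocabulary size, feature dimension, and padding scheme are illustrative assumptions rather than the exact settings used in the experiments.

```python
import torch
import torch.nn as nn

class QuestionHierarchy(nn.Module):
    """Word-, phrase-, and question-level features (Eqs. 2-3, after Lu et al., 2016).
    Vocabulary size, feature dimension d, and padding are illustrative assumptions."""
    def __init__(self, vocab_size=10000, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # 1-D convolutions over the word sequence: unigram, bigram, trigram filters.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=k, padding=k - 1) for k in (1, 2, 3)]
        )
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, M) word indices
        q_word = self.embed(tokens)                  # (B, M, d) word level
        x = q_word.transpose(1, 2)                   # (B, d, M) for Conv1d
        M = x.size(2)
        # tanh(W_l q_{m:m+l-1}) per n-gram size, trimmed to length M (Eq. 2),
        # then max-pooled over the three scales (Eq. 3).
        grams = [torch.tanh(conv(x))[:, :, :M] for conv in self.convs]
        q_phrase = torch.stack(grams).max(dim=0).values.transpose(1, 2)  # (B, M, d)
        q_question, _ = self.lstm(q_phrase)          # (B, M, d) question level
        return q_word, q_phrase, q_question
```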
4.2.1 Core Mechanism
This mechanism is carried out at each stage of the
hierarchy. Similarity is computed between the re-
spective features at each image-question location pair.
Concretely, an affinity matrix $A \in \mathbb{R}^{M \times K}$ is computed between the image feature map $I \in \mathbb{R}^{d \times K}$ and the question representation $Q \in \mathbb{R}^{d \times M}$:

$$A = \tanh\left(Q^T W_b I\right) \qquad (4)$$
The image and question attention maps are then learned as follows.

Image attention:
$$Z_i = \tanh\left(W_i I + (W_q Q) A\right), \qquad a^i = \mathrm{softmax}\left(w^T_{zi} Z_i\right) \qquad (5)$$

Question attention:
$$Z_q = \tanh\left(W_q Q + (W_i I) A^T\right), \qquad a^q = \mathrm{softmax}\left(w^T_{zq} Z_q\right) \qquad (6)$$

where $a^i \in \mathbb{R}^K$ and $a^q \in \mathbb{R}^M$ represent the attention probabilities of image region $i_k$ and word $q_m$, respectively. Consequently, the attended image and question features are calculated as weighted sums:

$$\hat{i} = \sum_{k=1}^{K} a^i_k\, i_k, \qquad \hat{q} = \sum_{m=1}^{M} a^q_m\, q_m \qquad (7)$$

at each stage of the hierarchy.
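The hedged PyTorch sketch below illustrates one way to realize this parallel co-attention step in batched form; the dimensions d and k follow Table 2, but the exact parameterization is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Sketch of the parallel co-attention step (Eqs. 4-7); sizes are assumptions."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_b = nn.Linear(d, d, bias=False)   # affinity weights
        self.W_i = nn.Linear(d, k, bias=False)   # image projection
        self.W_q = nn.Linear(d, k, bias=False)   # question projection
        self.w_zi = nn.Linear(k, 1, bias=False)
        self.w_zq = nn.Linear(k, 1, bias=False)

    def forward(self, I, Q):
        # I: (B, K, d) image features; Q: (B, M, d) question features.
        A = torch.tanh(Q @ self.W_b(I).transpose(1, 2))                   # (B, M, K)
        Z_i = torch.tanh(self.W_i(I) + A.transpose(1, 2) @ self.W_q(Q))   # (B, K, k)
        Z_q = torch.tanh(self.W_q(Q) + A @ self.W_i(I))                   # (B, M, k)
        a_i = F.softmax(self.w_zi(Z_i), dim=1)       # (B, K, 1) region weights
        a_q = F.softmax(self.w_zq(Z_q), dim=1)       # (B, M, 1) word weights
        i_hat = (a_i * I).sum(dim=1)                 # attended image feature (B, d)
        q_hat = (a_q * Q).sum(dim=1)                 # attended question feature (B, d)
        return i_hat, q_hat
```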
4.2.2 Predicting Answers
A multi-layer perceptron consumes the co-attention
features of question and image encompassing the hi-
erarchy to predict the answer.
$$\begin{aligned}
z^{wo} &= \tanh\left(W_{wo}\left(\hat{q}^{wo} + \hat{i}^{wo}\right)\right) \\
z^{ph} &= \tanh\left(W_{ph}\left[\left(\hat{q}^{ph} + \hat{i}^{ph}\right), z^{wo}\right]\right) \\
z^{qu} &= \tanh\left(W_{qu}\left[\left(\hat{q}^{qu} + \hat{i}^{qu}\right), z^{ph}\right]\right) \\
p &= \mathrm{softmax}\left(W_z z^{qu}\right)
\end{aligned} \qquad (8)$$

where $p$ represents the answer probabilities and $[\cdot, \cdot]$ denotes concatenation.
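A hedged PyTorch sketch of this recursive encoding is given below; the hidden size and answer-vocabulary size mirror Table 2 but remain assumptions about the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    """Sketch of the hierarchical answer MLP of Eq. (8); sizes are assumptions."""
    def __init__(self, d=512, num_answers=1000):
        super().__init__()
        self.W_wo = nn.Linear(d, d)
        self.W_ph = nn.Linear(2 * d, d)
        self.W_qu = nn.Linear(2 * d, d)
        self.W_z = nn.Linear(d, num_answers)

    def forward(self, q_wo, i_wo, q_ph, i_ph, q_qu, i_qu):
        # Each input is an attended feature vector of shape (batch, d).
        z_wo = torch.tanh(self.W_wo(q_wo + i_wo))
        z_ph = torch.tanh(self.W_ph(torch.cat([q_ph + i_ph, z_wo], dim=1)))
        z_qu = torch.tanh(self.W_qu(torch.cat([q_qu + i_qu, z_ph], dim=1)))
        return F.softmax(self.W_z(z_qu), dim=1)      # answer probabilities p
```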
Figure 2: Simplified representation of the proposed VQA pipeline.
4.3 Proposed Techniques
A simple representation of the pipeline is shown in fig. 2. An ablation study was carried out over all combinations of the techniques below; the results are summarised in table 3.
Image Featurization: An array of pre-trained image feature extractors was employed:
1. CNNs: VGG19, Resnet101, EfficientNetB5,
DenseNet169 pre-trained on ImageNet
2. Transformers: Vision Transformers pre-trained
on the ImageNet and ImageNet 21k datasets
Loss Functions: Three different loss functions were employed. Their formulas are listed below, and a brief implementation sketch follows the list.
1. Categorical Cross-Entropy

$$\mathrm{CCE} = -\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij} \qquad (9)$$
2. Focal Loss

$$\mathrm{FL} = -\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha\left(1 - \hat{y}_{ij}\right)^{\gamma} y_{ij} \log \hat{y}_{ij} \qquad (10)$$

where $\alpha$ is the weight-adjustment hyperparameter and $\gamma$ adjusts the curve. The higher the value of $\gamma$, the lower the loss contributed by well-classified examples, and vice versa. Upon iterative experimentation, values of $\gamma = 2$ and $\alpha = 1$ yielded the best results.
3. UniLoss

$$\mathrm{UL} = -\sum_{i=1}^{N}\left((1 - \varepsilon)\log \hat{y}_{ik} + \sum_{j=1}^{C}\left(0 + \frac{\varepsilon}{C}\right)\log \hat{y}_{ij}\right) \qquad (11)$$

where the hyperparameter $\varepsilon$ indicates the degree to which the majority-class examples' influence should be divided towards the other classes, and $\hat{y}_{ik}$ is the predicted probability of sample $i$ on its true $k$-th category. Upon iterative experimentation, a value of $\varepsilon = 0.5$ yielded the best results.
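As referenced above, the following hedged PyTorch sketch implements the focal loss and UniLoss terms for a multi-class classifier; it assumes single-label (one-hot) targets and raw logits as inputs, which may differ from the exact training code used here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=1.0, gamma=2.0):
    """Multi-class focal loss sketch (Eq. 10); `targets` holds class indices.
    alpha/gamma defaults follow the values reported as best in the text."""
    log_p = F.log_softmax(logits, dim=1)
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()   # prob. of true class
    return -(alpha * (1 - p_t) ** gamma * p_t.log()).sum()

def uniloss(logits, targets, eps=0.5):
    """UniLoss sketch (Eq. 11), assuming the epsilon mass is spread uniformly
    over all C classes in a label-smoothing fashion."""
    log_p = F.log_softmax(logits, dim=1)
    true_term = (1 - eps) * log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    smooth_term = (eps / logits.size(1)) * log_p.sum(dim=1)
    return -(true_term + smooth_term).sum()
```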
Vision Transformers
The base variant of the Vision Transformer (ViT-Base) with 12 layers and pre-trained weights was employed, with a patch size of 32x32. The output of the transformer, with the extra learnable [class] token at the beginning discarded, was used as the image features for the VQA pipeline.
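A hedged sketch of this kind of feature extraction, using a publicly available ViT-Base/32 checkpoint from the Hugging Face transformers library, is shown below; the checkpoint name, preprocessing, and image path are illustrative assumptions rather than the exact setup used in the experiments.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

# One plausible ViT-Base/32 checkpoint pre-trained on ImageNet-21k (assumption).
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch32-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch32-224-in21k").eval()

image = Image.open("example.jpg").convert("RGB")     # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = vit(**inputs).last_hidden_state         # (1, 1 + num_patches, 768)
patch_features = hidden[:, 1:, :]                    # drop the learnable [CLS] token
# `patch_features` plays the role of the K spatial image features I = {i_1, ..., i_K}.
```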
5 EXPERIMENTS AND ANALYSIS
This section analyzes the effectiveness of different image featurization techniques and loss functions with the co-attention architecture on the VizWiz and VQAv2 datasets.
5.1 VizWiz Dataset
Architecture
Pre-trained VGG16 and ResNet152 CNNs were employed for image featurization, and GloVe word embeddings followed by bidirectional LSTM layers were used for question featurization. Visual and linguistic features were fused using element-wise multiplication. When utilizing only the part of the dataset with yes/no questions, the top layers consisted of a single dense layer with 512 neurons followed by an output layer with one neuron and sigmoid activation for binary classification. When utilizing the entire dataset, the top layers were exactly as specified in the accompanying fig. 3. The loss functions employed were categorical cross-entropy, focal loss, and UniLoss. All models were trained for 25 epochs.
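A hedged Keras sketch of the yes/no variant of this architecture is given below; the projection sizes, vocabulary size, ReLU activation of the dense head, and the omission of VGG16 input preprocessing and GloVe weight loading are simplifying assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, EMB_DIM, HIDDEN = 10000, 22, 300, 512   # assumed sizes

# Image branch: frozen VGG16 (ImageNet), globally pooled, projected to HIDDEN dims.
# (VGG16 preprocess_input is omitted here for brevity.)
img_in = layers.Input(shape=(224, 224, 3))
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet", pooling="avg")
vgg.trainable = False
img_feat = layers.Dense(HIDDEN, activation="tanh")(vgg(img_in))

# Question branch: (GloVe-initialised) embedding + bidirectional LSTM.
q_in = layers.Input(shape=(SEQ_LEN,))
q_emb = layers.Embedding(VOCAB, EMB_DIM)(q_in)            # load GloVe weights here
q_feat = layers.Bidirectional(layers.LSTM(HIDDEN // 2))(q_emb)   # 2 x 256 = 512 dims

# Element-wise multiplicative fusion, dense head, sigmoid output for yes/no.
fused = layers.Multiply()([img_feat, q_feat])
hidden = layers.Dense(512, activation="relu")(fused)      # activation is an assumption
out = layers.Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=[img_in, q_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```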
Figure 3: Architecture for the complete VizWiz dataset.
Experimental Setup
A summary of the hyperparameter settings for the VizWiz dataset is as follows: the maximum number of answers is 1000, the learning rate is 1e-3, and the number of epochs is 25.
Results
Accuracies of 98.98% and 60.51% were obtained on the training and test sets, respectively, for the yes/no part of the VizWiz dataset. The results on the complete dataset are summarized in table 1. The table shows that the ResNet152-based models outperform the VGG16-based ones, with cross-entropy giving the best training accuracy and focal loss the best validation accuracy.
Table 1: Results on the entirety of the VizWiz dataset.

Pretrained CNN | Loss function | Training accuracy | Validation accuracy
VGG16 | Cross-Entropy | 57.89 | 45.00
VGG16 | Focal Loss | 57.42 | 45.72
VGG16 | UniLoss | 57.21 | 45.16
ResNet152 | Cross-Entropy | 60.41 | 46.74
ResNet152 | Focal Loss | 58.92 | 47.03
ResNet152 | UniLoss | 58.85 | 47.01
5.2 VQAv2 Dataset
Architecture
The hierarchical question-image co-attention model (Lu et al., 2016), which utilizes visual-linguistic co-attention for the task of VQA, was implemented. For the purposes of experimentation, only the training portion of the dataset was utilized, over which a 75-25 split was maintained for the training and validation sets. This means that the 82,783 training images and the associated 443,757 training questions were split between the training and validation sets. All models were trained for 60 epochs. Categorical cross-entropy, focal loss, and UniLoss were used for experimentation.
Experimental Setup
A summary of the hyperparameter settings for the VQAv2 dataset is provided in Table 2.
Results
The results of the experiments conducted on the VQA v2 dataset using the proposed model under varying settings are summarised in table 3. Experiments were conducted with different combinations of image featurization techniques and loss functions. The table shows that the Vision Transformer with cross-entropy loss performs best on the training set, while on the validation set DenseNet169 with the focal loss function performs best.
Table 2: Summary of main hyperparameters and their values.

Hyperparameter | Value
Maximum answers | 1000
Maximum words in a sequence | 22
Epochs | 150
Dimension d | 512
Dimension k | 256
Learning rate | 1e-4
Dropout rate | 0.5
Optimizer | Adam
Focal loss gamma | 2
Focal loss alpha | 0.25
Focal loss epsilon | 1e-9
Table 3: Results on the VQAv2 dataset (training portion, 75-25 split).

Feature Extractor | Loss function | Training F1 Score | Validation F1 Score
VGG19 | Cross-Entropy | 43.19 | 40.42
VGG19 | Focal Loss | 42.34 | 40.45
VGG19 | UniLoss | 40.56 | 40.12
Vision Transformer | Cross-Entropy | 55.16 | 40.13
Vision Transformer | Focal Loss | 52.38 | 39.66
Vision Transformer | UniLoss | 50.66 | 39.94
ResNet101 | Cross-Entropy | 44.87 | 41.41
ResNet101 | Focal Loss | 43.42 | 41.67
ResNet101 | UniLoss | 43.26 | 40.10
EfficientNetB5 | Cross-Entropy | 52.79 | 41.68
EfficientNetB5 | Focal Loss | 50.65 | 41.58
EfficientNetB5 | UniLoss | 43.26 | 40.10
DenseNet169 | Cross-Entropy | 53.45 | 41.94
DenseNet169 | Focal Loss | 51.65 | 42.08
DenseNet169 | UniLoss | 43.26 | 41.10
5.3 VQA Web Application
In order to provide an interface for posing queries, a web application was created. Figure 4 depicts the graphical interface. The Flask framework is used to build the web application's backend, and HTML, CSS, and JavaScript are used to build its front end. The application accepts an image and a text query as input and returns the relevant answer.
Figure 4: The GUI for the deployed VQA app.
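A minimal, hedged sketch of what such a Flask backend can look like is shown below; the route, template name, and the predict_answer() helper are illustrative assumptions and not the deployed application's exact code.

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    answer = None
    if request.method == "POST":
        image_file = request.files["image"]             # uploaded image
        question = request.form["question"]             # accompanying question
        answer = predict_answer(image_file, question)   # hypothetical call to the trained VQA model
    return render_template("index.html", answer=answer)

if __name__ == "__main__":
    app.run(debug=True)
```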
6 CONCLUSIONS
There appears to be an indicative trend regarding attention mechanisms and their benefits for a multimodal problem such as Visual Question Answering. Although Computer Vision problems have historically been unable to leverage the advantages that Transformer architectures hold over convolutional networks, the research on Vision Transformers has been a considerable step in this regard. Pre-trained Vision Transformers were utilized in our experiments for image featurization. While they offer similar results in terms of accuracy, the models train much faster than when CNNs of a similar scale are employed. Co-attention mechanisms incorporated into Transformer architectures seem to be the way forward for VQA problems. They could hold great potential in terms of holistically understanding the visual medium as well as learning the correct regions to attend to based on the input question in natural language.
Detailed studies have shown that most models perform only slightly worse when they infer answers based solely on the question rather than when considering both the input image and the question. There is, therefore, a need to focus on visual attention so that the visual information is adequately taken into account when generating an answer. All of the data and the final deployed model have been posted to the "https://github.com/AbhimanyuSethi-98/VQA-Flask-App.git" repository.
ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to the Department of Science and Technology (DST/ICPS/CLUSTER/DataScience/2018/Proposal-16:(T-856)) for providing financial support to the Department of CSIS, Birla Institute of Technology and Science, Pilani, India.
REFERENCES
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D.,
Zitnick, C. L., and Parikh, D. (2015). Vqa: Visual
question answering. In Proceedings of the IEEE inter-
national conference on computer vision, pages 2425–
2433.
Banerjee, S. and Lavie, A. (2005). Meteor: An automatic
metric for mt evaluation with improved correlation
with human judgments. In Proceedings of the acl
workshop on intrinsic and extrinsic evaluation mea-
sures for machine translation and/or summarization,
pages 65–72.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Kafle, K. and Kanan, C. (2016). Answer-type prediction for
visual question answering. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 4976–4984.
Kim, J.-H., Lee, S.-W., Kwak, D., Heo, M.-O., Kim, J., Ha,
J.-W., and Zhang, B.-T. (2016). Multimodal residual
learning for visual qa. Advances in neural information
processing systems, 29.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Ur-
tasun, R., Torralba, A., and Fidler, S. (2015). Skip-
thought vectors. Advances in neural information pro-
cessing systems, 28.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-
J., Shamma, D. A., et al. (2017). Visual genome:
Connecting language and vision using crowdsourced
dense image annotations. International journal of
computer vision, 123(1):32–73.
Lin, C.-Y. (2004). Rouge: A package for automatic evalu-
ation of summaries. In Text summarization branches
out, pages 74–81.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
Lu, J., Yang, J., Batra, D., and Parikh, D. (2016). Hierar-
chical question-image co-attention for visual question
answering. Advances in neural information process-
ing systems, 29.
Malinowski, M. and Fritz, M. (2014). A multi-world ap-
proach to question answering about real-world scenes
based on uncertain input. Advances in neural infor-
mation processing systems, 27.
Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask
your neurons: A neural-based approach to answering
questions about images. In Proceedings of the IEEE
international conference on computer vision, pages 1–
9.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Saito, K., Shin, A., Ushiku, Y., and Harada, T. (2017). Du-
alnet: Domain-invariant network for visual question
answering. In 2017 IEEE International Conference on
Multimedia and Expo (ICME), pages 829–834. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B.,
Ni, K., Poland, D., Borth, D., and Li, L.-J. (2016).
Yfcc100m: The new data in multimedia research.
Communications of the ACM, 59(2):64–73.
Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico.
Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. (2016).
Stacked attention networks for image question an-
swering. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 21–
29.
Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., and Fergus,
R. (2015). Simple baseline for visual question answer-
ing. arXiv preprint arXiv:1512.02167.