Combining Text and Image Knowledge with GANs for Zero-Shot Action
Recognition in Videos
Kaiqiang Huang, Luis Miralles-Pechuán and Susan Mckeever
School of Computing, Technological University Dublin, Central Quad, Grangegorman, Dublin, Ireland
Keywords: Human Action Recognition, Zero-Shot Learning, Generative Adversarial Networks, Semantic Knowledge Source.
Abstract:
The recognition of actions in videos is an active research area in machine learning, relevant to multiple domains
such as health monitoring, security and social media analysis. Zero-Shot Action Recognition (ZSAR) is a
challenging problem in which models are trained to identify action classes that have not been seen during the
training process. According to the literature, the most promising ZSAR approaches make use of Generative
Adversarial Networks (GANs). GANs can synthesise visual embeddings for unseen classes conditioned on
either textual information or images related to the class labels. In this paper, we propose a Dual-GAN approach
based on the VAEGAN model to show that fusing image-based and text-based knowledge sources is an
effective way to improve ZSAR performance. We conduct empirical ZSAR experiments of our approach on
the UCF101 dataset. We apply the following embedding fusion methods for combining text-driven and image-
driven information: averaging, summation, maximum, and minimum. Our best Dual-GAN result is achieved
with the maximum embedding fusion method, yielding an average accuracy of 46.37%, an improvement of at
least 5.37% over the leading approaches.
1 INTRODUCTION
Over the last decade, the problem of Human Action
Recognition (HAR) has been addressed by a vari-
ety of supervised learning approaches, for example,
identifying whether a video belongs to a given trained
class (e.g. Jumping) (Wang and Schmid, 2013). Re-
cently, a challenging research problem termed Zero-
Shot Action Recognition (ZSAR) has been studied to
recognise video instances of unseen classes (i.e. not
used during the training process) by transferring se-
mantic knowledge from the seen classes to the unseen
ones in the HAR field.
Most approaches in the early stage of ZSAR re-
search have used projection-based meth-
ods. These methods learn a projection function to map
the visual embedding of seen classes to their corre-
sponding semantic embeddings. For example, a pro-
jection function can be used to map the visual fea-
ture of the Running class to the Word2Vec embed-
ding of the Running class label. The learned projec-
tion function is then applied to recognise novel un-
seen classes by measuring a similarity-based metric
between the ground-truth embeddings and the pre-
dicted embeddings on the testing videos (Liu et al.,
2011; Xian et al., 2016; Huang et al., 2021a). How-
ever, the video samples of seen and unseen classes
can differ substantially. Without any adaptation tech-
nique between seen and unseen classes, projection-
based approaches therefore suffer from a large distri-
bution mismatch during the test phase. To mitigate
this problem, recent ZSAR approaches have adopted
Generative Adversarial Networks (GANs) for syn-
thetic data generation, a natural candidate for the
zero-shot learning task involving new unseen
classes. ZSAR approaches using GANs aim to
synthesise visual embeddings of unseen classes based
on their corresponding semantic embeddings to mit-
igate the discrepancy between seen and synthesised
data. After the synthesised data is generated for un-
seen classes, a classifier is trained with the real seen
and the synthesised unseen data in a fully-supervised
fashion to make predictions for a given test sample
(Mandal et al., 2019; Narayan et al., 2020; Huang
et al., 2021b).
In this work, we propose a Dual-GAN approach
based on the VAEGAN model (Narayan et al., 2020)
that fuses two semantic embeddings obtained from
different knowledge sources (i.e. text and image)
for the ZSAR task.

Figure 1: High-level perspective of the pipeline for the proposed Dual-GAN approach based on VAEGAN model.

For our experiments, we used a
commonly-used benchmark dataset in the HAR field
named UCF101. In our paper, we answer the follow-
ing two research questions: (1) Can image-based se-
mantic embeddings, which have not been applied to
the GAN-based model yet, have a higher ZSAR per-
formance than the existing approaches based on text-
based semantic embeddings? and (2) Can our pro-
posed Dual-GAN approach incorporating two knowl-
edge sources get higher accuracy than a Single-GAN
approach (using either text or image)? Our main hy-
pothesis is that combined semantic embeddings pro-
duced from two knowledge sources (e.g. text and im-
age) that contain complementary information could
improve the ZSAR performance in the GAN-based
framework.
We summarise our contributions as follows:
1. We investigate two different knowledge sources
(i.e. texts and images) that can be used to repre-
sent semantic meaning for action classes.
2. We propose a Dual-GAN approach based on the
VAEGAN model to generate high-quality visual
embeddings for unseen classes by fusing seman-
tic embeddings obtained from two knowledge
sources (i.e. texts and images). The fusion meth-
ods include averaging, summation, maximum and
minimum.
3. Our Dual-GAN model outperforms the existing
ZSAR approaches that use a GAN-based ap-
proach to synthesising unseen class representa-
tions. To the best of our knowledge, there are no
previous works that employ a method of combin-
ing semantic embeddings derived from two differ-
ent knowledge sources in the context of the GAN-
based framework.
The rest of this paper is structured as follows. In
Section 2, we provide a literature review of various
approaches for the ZSAR. In Section 3, we introduce
our proposed Dual-GAN approach based on the VAE-
GAN model using two knowledge sources for ZSAR.
In Section 4, we describe the methodology, which in-
cludes the process of collecting images and feature
fusion methods. In Section 5, we explain the exper-
imental configurations and implementations in more
detail. In Section 6, we show the results and key find-
ings. Finally, in Section 7, we conclude the paper and
propose a few ideas for future work.
2 RELATED WORK
In this section, we review the related literature on the
approaches in the early stage of the ZSAR research, as
well as on the generative approaches based on GANs.
In addition, we summarise the existing works that
propose different types of semantic embedding, es-
pecially in the GAN-based framework.
In the early stage of research on ZSAR, several
works (Xu et al., 2015; Li et al., 2016) proposed pro-
jection functions to map from a visual representation
of video instances to a semantic representation of the
class prototype that the video belongs to (i.e. typically
an embedding space of a class label). These learned
projection functions encode the relationship between
visual embeddings and semantic embeddings using
seen data. The learned projection function is then
used to recognise new unseen classes by measuring
the similarity between the ground-truth and the pre-
dicted semantic representations of the video instances
in the embedding space. However, classes with simi-
lar semantic knowledge may have large variations in
the visual space. For example, the action classes
Diving and Swimming share descriptive attributes
such as 'outdoor activity' and 'involves water', but their
video samples would look very different since Diving
and Swimming have quite different body movements.
Therefore, building a high-accuracy projection func-
tion is a significant challenge, as the large variation in
the visual embeddings can cause ambiguity in the
visual-semantic mapping.
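To make the projection-based family concrete, the following is a minimal sketch (not any specific published model) of a ridge-regression projection from visual embeddings to semantic prototypes, with nearest-neighbour matching at test time; all array names and sizes are illustrative.

import numpy as np

def fit_projection(X_seen, S_seen, reg=1.0):
    """Learn W mapping visual embeddings X (n x d_v) to semantic
    embeddings S (n x d_s) with a ridge-regression closed form."""
    d_v = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + reg * np.eye(d_v),
                           X_seen.T @ S_seen)          # (d_v x d_s)

def predict_unseen(X_test, W, unseen_prototypes):
    """Project test visual embeddings and pick the nearest unseen
    class prototype by cosine similarity."""
    P = X_test @ W                                      # predicted semantic embeddings
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    C = unseen_prototypes / np.linalg.norm(unseen_prototypes, axis=1, keepdims=True)
    return np.argmax(P @ C.T, axis=1)                   # index of the most similar class

# toy usage with random data standing in for I3D features and Word2Vec prototypes
rng = np.random.default_rng(0)
W = fit_projection(rng.normal(size=(200, 64)), rng.normal(size=(200, 16)))
print(predict_unseen(rng.normal(size=(5, 64)), W, rng.normal(size=(10, 16))))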
Recently, advanced generative-based methods
have been used to synthesise visual embeddings of
unseen classes according to their semantic embed-
dings. Some authors (Xian et al., 2018) proposed a
conditional Wasserstein GAN (WGAN) model using
classification loss to synthesise visual embeddings of
unseen classes. The visual embeddings of the un-
seen classes are then synthesised using a trained con-
ditional WGAN and used together with the real vi-
sual embeddings of seen classes to train a discrimi-
native classifier in a fully-supervised manner. There
are other authors (Mandal et al., 2019; Narayan et al.,
2020; Mishra et al., 2020) who also apply extra com-
ponents to enforce a cycle-consistency constraint on
the reconstruction of the semantic embeddings dur-
ing training. The extra components assist to produce
a higher quality generator to synthesise semantically
consistent visual embeddings of unseen classes. Al-
though these generative-based methods show promis-
ing results for the ZSAR task, they still struggle to
generate higher quality and more satisfying visual
embeddings of unseen classes since the generated un-
seen data is directly used to train a supervised-based
classifier along with seen data.
Also, as mentioned in Section 1, if we can ob-
tain richer and more representative knowledge incor-
porated into the semantic embedding of the actions,
intuitively we should improve downstream ZSAR ac-
curacy when identifying unseen classes. The authors
of the paper (Wang and Chen, 2017) enhanced the
word vectors of the label by collecting and modelling
textual descriptions of action classes. The contextual
information (e.g. textual descriptions related to ac-
tion classes) removes, to some extent, the ambiguity
in the original word vectors
of action labels. For example, the class Haircut has the
description 'A hairstyle, hairdo, or haircut refers
to the styling of hair, usually on the human scalp.
Sometimes, this could also mean an editing of facial
or body hair.' In that same work (Wang and Chen,
2017), the authors also proposed a method to collect
images related to the action labels for representing
visually discriminative semantic embedding. How-
ever, the work only evaluated the proposed semantic
embeddings in a projection-based approach, not in a
GAN-based one. Similarly, the authors (Hong et al.,
2020) proposed a description text dataset whose def-
inition was taken from the official Wikipedia website
for the UCF101 action dataset and evaluated it in the
GAN-based model.

Table 1: Dataset used for evaluations.

Dataset   #Classes   #Instances   Seen/Unseen Proportion
UCF101    101        13,320       51/50
3 APPROACH
In this section, we explain our Dual-GAN approach
for Zero-Shot Action Recognition and how it fuses
semantic embeddings from two knowledge sources:
text and images, as shown in Fig. 1.
As shown in Fig. 1, the high-level perspective
of the pipeline for the proposed Dual-GAN approach
contains two steps. Step 1 aims to synthesise the vi-
sual embeddings of unseen classes conditioned on the
corresponding semantic embeddings obtained from
two different knowledge sources (i.e. texts and im-
ages) through the two VAEGAN components: Text-
VAEGAN and Image-VAEGAN. After that, the image-
driven and text-driven unseen visual embeddings
are combined by a fusion opera-
tion (e.g. averaging) to form a new dataset that con-
tains the original seen data and the synthesised un-
seen data along with their respective labels. Step 2 fo-
cuses on training a classifier in a supervised learning
fashion with the new dataset generated in the previous
step. It is noted that the generator of each VAEGAN
component is only trained with seen data (i.e. video
instances and labels). Each VAEGAN component
is able to synthesise semantically consistent visual embeddings
conditioned on a semantic embedding (e.g. either the
Word2Vec of the action label or the image-based rep-
resentation of the action label) without having access
to any video instances of the unseen classes.
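As a condensed, hedged sketch of these two steps, the function below builds the new dataset of Step 1; the synthesise calls are placeholders standing in for the trained Text-VAEGAN and Image-VAEGAN generators, not an actual API, and all names are illustrative.

import torch

def build_dual_gan_training_set(text_vaegan, image_vaegan, text_sem, image_sem,
                                seen_x, seen_y, unseen_labels, n_per_class=400,
                                fuse=torch.maximum):
    """Step 1: synthesise unseen visual embeddings from each trained branch and
    fuse them element-wise. The returned dataset (real seen + fused unseen)
    is what Step 2's fully-supervised classifier is trained on."""
    synth_x, synth_y = [], []
    for label in unseen_labels:
        x_text = text_vaegan.synthesise(text_sem[label], n_per_class)     # placeholder call
        x_image = image_vaegan.synthesise(image_sem[label], n_per_class)  # placeholder call
        synth_x.append(fuse(x_text, x_image))
        synth_y.append(torch.full((n_per_class,), label, dtype=torch.long))
    new_x = torch.cat([seen_x] + synth_x)
    new_y = torch.cat([seen_y] + synth_y)
    return new_x, new_y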
To expand the high-level pipeline described
above, we implemented the VAEGAN component
with a similar structure to the work proposed in
(Narayan et al., 2020) and shown in Fig. 2. To keep
this paper self-contained, we describe the VAEGAN
component, which recently yielded promising results
for the ZSAR task, in more detail. As mentioned in
Section 1, GANs can synthesise visual embeddings
that are close to the distribution of real instances, but
they can suffer from an issue termed mode collapse
(Arjovsky and Bottou, 2017), which leads to low di-
versity in the synthesised visual embeddings.

Figure 2: The details of one VAEGAN component (Huang et al., 2021b).

Similar to GANs, variational autoencoders
(VAEs) (Kingma and Welling, 2013) are another
generative model that employs an encoder to repre-
sent the input as a latent variable with a Gaussian
distribution assumption and a decoder to transform
the input from the latent variable. The generation
of unseen visual embedding with VAE gives more
stable outputs than with GANs (Verma et al., 2018).
Hence, the architecture of the VAEGAN component
combines the advantages of VAE and GAN by
assembling the decoder of the VAE and the generator
of the GAN to ultimately synthesise semantically
consistent visual representations.
As shown in Fig. 2, the real visual embedding of
seen classes x extracted from a deep neural network
along with the semantic embeddings a are the input
to the encoder E. The output of E is the latent code
z that compresses the information from visual repre-
sentations x, optimised by the Kullback-Leibler diver-
gence. The random noise and semantic embeddings
a are the input to the generator G, which synthesises
the visual representation x'; the synthesised visual
representations x' and real visual representations x are
compared using a binary cross-entropy loss.
The discriminator D takes either x or x' along with
the corresponding semantic embeddings a as the in-
put, and determines whether the input is real or syn-
thesised. The WGAN loss is applied to the output
of D to distinguish between the real and the synthe-
sised visual representations. Additionally, both the
Semantic Embedding Decoder SED and the feedback
module F improve the process of visual represen-
tation synthesis and reduce ambiguities among ac-
tion classes during the zero-shot classification pro-
cess. The SED takes either x or x' as input and recon-
structs the semantic embedding a'; it is trained using a
cycle-consistency loss.
The feedback module F transforms the latent em-
bedding of SED and feeds it back to the latent represen-
tation of G, which refines x' to achieve an enhanced
visual representation synthesis.

Table 2: The details of knowledge sources and semantic embeddings.

Semantics          Source   Embedding   Dimensions
Labels             Text     Word2Vec    300
Descriptions       Text     Word2Vec    300
Collected Images   Image    GoogLeNet   1024
Collected Images   Image    ResNet101   2048

It is worth noting that
the generator G transforms the semantic embeddings
to visual representations, while SED transforms the
visual representations to semantic embeddings. Con-
sequently, the G and the SED include supplementary
information regarding visual representation and the
supplementary information can assist to improve the
quality of the visual representation synthesis and re-
duce ambiguity and misclassification among action
classes.
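For clarity, the overall training objective of one VAEGAN component can be sketched, in LaTeX notation and under the assumption that the weightings follow the description above (the exact formulation is given in Narayan et al. (2020)), as

\mathcal{L}(E, G, D, SED) = \mathrm{KL}\big(q_E(z \mid x, a)\,\|\,p(z)\big) + \mathrm{BCE}(x, x') + \alpha\,\mathcal{L}_{\mathrm{WGAN}}(D; x, x', a) + \beta\,\mathcal{L}_{\mathrm{cyc}}(a, a'), \quad x' = G(z, a),\; a' = SED(x'),

where \sigma additionally scales the feedback signal from F that modulates the latent representation of G.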
The key approach to achieving ZSAR is to trans-
fer semantic knowledge containing enriched and dis-
criminative information from seen action classes to
unseen action classes. Semantic embeddings derived
from multiple knowledge sources can potentially de-
liver more discriminative representations than a sin-
gle source (Xiang et al., 2021). In this pa-
per, we propose two improvements for ZSAR. First,
we believe it is possible to improve the ZSAR per-
formance by introducing a combination of text-based
descriptions and images to represent semantic embed-
ding for the corresponding action class. Therefore,
we use two GANs rather than one, and we combine
the features generated by each GAN into a new array
calculated by applying one of the following methods:
averaging, maximum, minimum, or summa-
tion. Second, for extracting textual features, we em-
ploy an approach that uses textual descriptions for the
action rather than the action class label itself. Intu-
itively, a textual description should contain more in-
formative and contextual semantic meaning than just
the class label. For the visual information, we use im-
ages related to the action class that provide enriched
visual cues for representing the semantic meaning.

Table 3: Experimental configurations for comparing text-driven semantic embedding to image-driven semantic embedding in the Single-GAN model.

Dataset   Knowledge Source   Semantic Embedding
UCF101    Text (baseline)    Action Class Word2Vec
UCF101    Text               Description Word2Vec
UCF101    Image              GoogLeNet
UCF101    Image              ResNet101
4 METHODOLOGY
In this section, we describe our methodology to per-
form the ZSAR task based on the proposed Dual-
GAN model on the UCF101. We also introduce the
method for collecting images for each action class and
the method for extracting visual-based and text-based
semantic embeddings in more detail.
Dataset. We select the UCF101 (Soomro et al.,
2012) dataset, which is widely used as a benchmark to
evaluate ZSAR performance. The details of the
dataset are described in Table 1. Following the works
(Mandal et al., 2019; Narayan et al., 2020), we use the
same splits for model training and evaluation. Each
dataset has 30 independent splits and each split is
randomly generated by keeping the same seen/unseen
proportion so that all splits contain different seen and
unseen classes for training and test. In other words,
some classes are seen classes in one split, but these
classes can be unseen ones in other splits.
Image Collection. We apply a similar strategy to
collect images to the one proposed by (Wang and
Chen, 2017), in which the following steps are fol-
lowed. First, we consider the action labels as the key-
words to search for related images with an image
search engine (i.e. Google Image Source), using an
image scraping tool available at https://github.com/
Joeclinton1/google-images-download.git. For example,
we use the keyword Playing YoYo for searching im-
ages for the class YoYo. Then, after collecting the im-
ages, we remove the irrelevant and small-size images
for each class. As a result, we obtain 15,845 images
(157 images per class on average).
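As an illustrative sketch of the size-based filtering step (the threshold and the per-class folder layout here are assumptions, not the exact values used in our pipeline):

from pathlib import Path
from PIL import Image

def filter_small_images(class_dir, min_side=128):
    """Delete downloaded images whose shorter side is below min_side pixels,
    or which cannot be opened at all (corrupted downloads)."""
    kept = 0
    for path in Path(class_dir).glob("*.*"):
        try:
            with Image.open(path) as img:
                if min(img.size) < min_side:
                    path.unlink()          # too small to be useful
                else:
                    kept += 1
        except OSError:
            path.unlink()                  # unreadable / corrupted file
    return kept

# e.g. filter_small_images("images/PlayingYoYo")   # hypothetical folder name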
Visual & Semantic Embeddings. To extract real
visual embedding x in Fig. 2, we adopted the off-the-
shelf I3D model for visual feature extraction provided
by (Mandal et al., 2019). I3D was originally proposed
by (Carreira and Zisserman, 2017) and it contains
RGB and optical-flow Inflated 3D networks that gen-
erate appearance and flow features from the Mixed 5c
layer. For
each video instance, the outputs from the Mixed 5c
layer for both networks are averaged through a tem-
poral dimension, pooled in the spatial dimension, and
then flattened to obtain a 4096-dimensional vector for
appearance and flow features. In the end, both appear-
ance and flow features are concatenated to represent a
video with an 8192-dimensional vector.
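A minimal sketch of how the two streams can be pooled and concatenated into the 8192-dimensional video representation follows; the channel count and pooled spatial size are assumptions chosen to be consistent with the 4096-dimensional per-stream vectors, and the actual Mixed 5c feature maps come from the pre-trained I3D networks of (Mandal et al., 2019).

import torch
import torch.nn.functional as F

def pool_i3d_stream(mixed_5c, spatial_out=2):
    """mixed_5c: (T, C, H, W) feature map from one I3D stream.
    Average over time, pool spatially, then flatten to a vector."""
    x = mixed_5c.mean(dim=0)                                  # (C, H, W): temporal average
    x = F.adaptive_avg_pool2d(x.unsqueeze(0), spatial_out)    # (1, C, 2, 2): spatial pooling
    return x.flatten()                                        # (C * 2 * 2,)

# illustrative shapes only: 1024 channels x 2x2 pooled grid gives a 4096-d vector per stream
rgb_map = torch.randn(8, 1024, 7, 7)    # appearance stream (assumed shape)
flow_map = torch.randn(8, 1024, 7, 7)   # flow stream (assumed shape)
video_embedding = torch.cat([pool_i3d_stream(rgb_map), pool_i3d_stream(flow_map)])
print(video_embedding.shape)            # torch.Size([8192])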
We produce four types of semantic embedding a
that can be used to condition the VAEGAN as shown
in Fig. 2. The summary of semantic embedding is
given in Table 2. The semantic embedding of ac-
tion labels is extracted with Word2Vec (Mikolov
et al., 2013), a skip-gram model pre-trained on a
large-scale text corpus (i.e. the Google News dataset),
which delivers a 300-dimensional vector for each ac-
tion class label.
The text-based descriptions per class are provided by
the work (Wang and Chen, 2017), motivated by the
fact that a class label is not adequate to represent the
complex concepts in human actions. The idea is that
each label is expanded into a description of that la-
bel and Word2Vec is then used to represent each word
of that description. Then, we simply average all the
vectors generated by Word2Vec, which also delivers a
300-dimensional vector for each class.
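A minimal sketch of this description-embedding step, assuming a pre-trained Word2Vec model loaded with gensim (the model path and the whitespace tokenisation are illustrative):

import numpy as np
from gensim.models import KeyedVectors

# hypothetical path to the 300-d Google News Word2Vec binary
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def description_embedding(description):
    """Average the Word2Vec vectors of all in-vocabulary words in the
    class description, giving one 300-d vector per action class."""
    words = [w for w in description.lower().split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[w] for w in words], axis=0)

haircut_vec = description_embedding(
    "A hairstyle, hairdo, or haircut refers to the styling of hair, usually on the human scalp")
print(haircut_vec.shape)   # (300,)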
To extract features for collected images, we ap-
ply two off-the-shelf models: GoogLeNet (Szegedy
et al., 2015) and ResNet101 (He et al., 2016) which
were both pre-trained on the ImageNet dataset. The
output of the average pooling layer before the last fully
connected layer is used as the deep image feature for
both pre-trained models. Finally, all the extracted im-
age features are averaged for each action class.
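A hedged sketch of this per-class image embedding with torchvision's pre-trained ResNet101 (a recent torchvision API is assumed; older versions use pretrained=True, and the folder layout is illustrative):

import torch
from torchvision import models, transforms
from PIL import Image
from pathlib import Path

# ResNet101 pre-trained on ImageNet, with the final fully connected layer
# replaced by identity so the 2048-d pooled feature is returned directly
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def class_semantic_embedding(class_dir):
    """Average the 2048-d pooled ResNet101 features of all images
    collected for one action class."""
    feats = []
    for path in Path(class_dir).glob("*.jpg"):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(backbone(img).squeeze(0))
    return torch.stack(feats).mean(dim=0)        # (2048,)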
Embeddings Fusion. As shown in Step 1 of Fig.
1, we aim to synthesise and combine different vi-
sual embeddings for unseen classes using various
knowledge sources in the proposed Dual-GAN ap-
proach. We consider four methods to fuse the
pseudo unseen visual embeddings conditioned on the
text-based and the image-based knowledge sources:
averaging, summation, maximum and min-
imum. For averaging, we calculate the mean of
the unseen visual embedding from the text-based se-
mantic knowledge source and the unseen visual em-
bedding from the image-based semantic knowledge
source. For summation, the values at corresponding
positions of the two synthesised unseen visual embed-
dings are summed. For maximum, the larger value at
each position of the two synthesised visual embed-
dings is selected. Similarly, for minimum, the smaller
value at each position is selected. All four embedding
fusion methods are empirically evaluated on the
dataset using the proposed Dual-GAN approach.

Table 4: Comparing our results to the TF-VAEGAN.

Model                              Accuracy (UCF101)
TF-VAEGAN (Narayan et al., 2020)   41.00%
Single-GAN (ours)                  38.42%

Table 5: Results from the Single-GAN approach for the UCF101 dataset. Acc denotes mean average accuracy and Std denotes standard deviation. W2V denotes Word2Vec.

Semantic Embedding    Acc      Std
Action Class W2V      28.02%   3.04%
Description W2V       29.09%   2.61%
GoogLeNet             44.35%   2.87%
ResNet-101            45.87%   3.42%
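The four fusion methods described in this section reduce to simple element-wise reductions over the two synthesised embedding tensors; a minimal PyTorch sketch (tensor names and shapes are illustrative):

import torch

def fuse_embeddings(text_driven, image_driven, method="max"):
    """Element-wise fusion of the unseen visual embeddings synthesised by
    the Text-VAEGAN and the Image-VAEGAN branches. Both tensors have
    shape (num_synthesised, feature_dim)."""
    if method == "avg":
        return (text_driven + image_driven) / 2.0
    if method == "sum":
        return text_driven + image_driven
    if method == "max":
        return torch.maximum(text_driven, image_driven)
    if method == "min":
        return torch.minimum(text_driven, image_driven)
    raise ValueError(f"unknown fusion method: {method}")

# e.g. 400 synthesised 8192-d embeddings per unseen class from each branch
fused = fuse_embeddings(torch.randn(400, 8192), torch.randn(400, 8192), "max")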
Evaluation Metrics. Class accuracy is a standard
metric in the ZSAR field. To represent the perfor-
mance of the methodologies, we use the average per-
class accuracies introduced by the work (Xian et al.,
2017). The mean per-class accuracy averaged over
30 independent splits will be reported along with the
standard deviation.
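A minimal sketch of the metric, following the average per-class accuracy definition of (Xian et al., 2017):

import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class contributes
    equally regardless of how many test videos it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(accs))

# reported value: mean and standard deviation over the 30 independent splits
split_scores = [mean_per_class_accuracy([0, 0, 1, 2], [0, 1, 1, 2])]  # toy example
print(np.mean(split_scores), np.std(split_scores))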
5 EXPERIMENTS
In this section, we present the experimental config-
urations for comparing our proposed Dual-GAN ap-
proach that incorporates two knowledge sources (i.e.
texts and images) with other state-of-the-art method-
ologies. The implementations are then described in
detail.
Experiments and Baseline. For answering the first
research question described in Section 1, we aim
to investigate whether the synthesised visual embed-
dings conditioned on the image-driven knowledge
source can lead to better ZSAR accuracies than those
from the text-driven knowledge source using a Single-
GAN model. The Single-GAN model follows only
one line of the Dual-GAN pipeline (using either Text-
VAEGAN or Image-VAEGAN depending on which
knowledge source is used) without the process of em-
bedding fusion illustrated in Fig. 1. Table 3 shows
the two text-driven knowledge sources (i.e. class
label and description) and the two image-driven knowl-
edge sources (i.e. GoogLeNet and ResNet101) that will
be evaluated. As the baseline, we use
the Word2Vec of action class label to represent the
semantic embedding for the UCF101.
To answer the second research question intro-
duced in Section 1, about whether two sources can work
better than just one, we aim to investigate and eval-
uate which embedding fusion method is the best.
The embedding fusion methods are averaging (Avg.),
summation (Sum.), maximum (Max.) and minimum
(Min.). The results from Dual-GAN experiments are
compared to the results from the Single-GAN to in-
vestigate whether Dual-GAN can deliver better ZSAR
performance than Single-GAN.
Implementation. Similar to our last work (Huang
et al., 2021b), the structures of discriminator D, en-
coder E, and generator G are designed as two-layer
fully connected networks with 4096 hidden
units. The semantic embedding decoder SED and the
feedback module F have the same structure as D, E
and G. Leaky ReLU is used as the activation func-
tion, except at the output of G, where a sigmoid acti-
vation is applied to calculate the binary cross-entropy
loss. The whole framework is trained using an Adam
optimiser with a learning rate of 10^-4. The supervised-
learning classifier is a single-layer fully connected
network whose number of output units equals the num-
ber of unseen classes. We apply the same hyper-parameters as
our last work and the work (Narayan et al., 2020),
with α, β and σ set to 10, 0.01 and 1, respec-
tively. As explained in the work (Xian et al., 2019),
α is the coefficient for weighting the WGAN loss,
β is a hyper-parameter for weighting the decoder re-
construction error in the semantic embedding decoder
SED, and σ is used in the feedback module F to con-
trol the feedback modulation. The gradient penalty
coefficient λ is initially set to 10 for training a GAN.
All experiments were conducted on Google Colab,
which provides a Tesla P100 GPU and 25 GB of
memory.
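For concreteness, a hedged sketch of the generator and discriminator building blocks described above (two-layer fully connected networks, 4096 hidden units, Leaky ReLU activations, a sigmoid output on the generator, Adam with a learning rate of 10^-4); the layer sizes and names below are illustrative rather than an exact reproduction of our implementation:

import torch
import torch.nn as nn

VIS_DIM, SEM_DIM, LATENT_DIM, HID = 8192, 300, 300, 4096   # illustrative sizes

class Generator(nn.Module):
    """Maps a latent/noise vector concatenated with a semantic embedding to a
    synthesised visual embedding (sigmoid output for the BCE reconstruction loss)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + SEM_DIM, HID), nn.LeakyReLU(0.2),
            nn.Linear(HID, VIS_DIM), nn.Sigmoid())
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class Discriminator(nn.Module):
    """Scores a (visual embedding, semantic embedding) pair as real or
    synthesised; a single unbounded output for the WGAN loss."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VIS_DIM + SEM_DIM, HID), nn.LeakyReLU(0.2),
            nn.Linear(HID, 1))
    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
x_fake = G(torch.randn(8, LATENT_DIM), torch.randn(8, SEM_DIM))   # (8, 8192)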
Additionally, the number of synthesised visual
embeddings is a hyper-parameter in the experiments.
Therefore, for efficiently conducting the experiments,
we synthesised 400 visual embeddings for each un-
seen class for the UCF101, which can yield decent
results within a reasonable time. Our code
is available online at https://github.com/kaiqiangh/kg gnn gan,
and is compatible with PyTorch 1.9.0 and CUDA 11.1.
Table 6: A comparison of the Dual-GAN model with different fusion methods for UCF101. Acc and Std denote mean average accuracy and standard deviation (in %), respectively. * denotes the best result among all cases.

Dual Semantic Embedding               Avg Acc / Std    Sum Acc / Std    Max Acc / Std    Min Acc / Std
Action Class Word2Vec & GoogLeNet     41.20 / 3.21     41.14 / 3.17     41.84 / 3.22     41.06 / 3.19
Action Class Word2Vec & ResNet101     41.29 / 3.34     41.05 / 3.38     41.95 / 3.37     41.24 / 3.33
Description Word2Vec & GoogLeNet      45.01 / 2.78     44.73 / 2.71     45.59 / 2.77     44.85 / 2.66
Description Word2Vec & ResNet101      45.58 / 3.00     45.57 / 3.12     46.37* / 3.10    45.37 / 3.00
6 RESULTS & ANALYSIS
In this section, we present and analyse the results of
empirical experiments for all configurations described
in Section 5. For each configuration, the mean aver-
age accuracy is reported along with the standard de-
viation.
Verification of Experimental Baseline. Our first
experimental run is to confirm that we have set up
the TF-VAEGAN experimental pipeline correctly. We
compare our results to the work (Narayan et al., 2020)
that our model is built upon, using identical semantic
embeddings. The result is shown in Table 4. For the
UCF101, the annotated class-level attributes provided
by the work (Liu et al., 2011) are used and our result
is 2.58% lower. Note that, since annotated attributes
do not scale readily to other datasets,
attribute-based semantic information will not be used
for further experiments and comparisons.
Is Image Source Better than Text Source? Ta-
ble 5 shows the results of evaluating the text-based
(i.e. action class and textual description) and image-
based (GoogLeNet and ResNet101) semantic embed-
dings on our Single-GAN implementations. As can
be seen, the Single-GAN results for the UCF101
are consistent with our hypothesis, as the image-based
ResNet101 semantic embedding outperforms action
class Word2Vec, description Word2Vec and image-
based GoogLeNet by large margins of 17.85% and
16.78%, and a small margin of 1.52%, respectively.
The video instances from UCF101 have a clean back-
ground with single and centred actors, which can be
accurately represented by either textual descriptions
or relevant images. Moreover, ResNet101 delivers
a slight boost over GoogLeNet due to its better gener-
alisation capability. In addition, we suggest
that using textual descriptions for action classes has
the potential risk of reducing the model performance,
depending on how representative the video sam-
ples are.

Figure 3: A comparison of Dual-GAN using different fusion methods on UCF101.
Is the Dual-GAN Approach Better than the Single-
GAN? As can be seen in Table 6, the Max. fusion
method clearly surpasses the others for all Dual Se-
mantic Embedding cases on the UCF101, where the
Max. fusion of descriptions and ResNet101 delivers
the best performance at 46.37%, surpassing the
baseline (i.e. action class Word2Vec in the Single-
GAN model) by a large margin of 18.35%. We sug-
gest that the textual descriptions used to represent the
semantic embedding of the class have a positive impact
on ZSAR performance. Additionally, as shown in
Fig. 3, the Max. fusion also performs the best on aver-
age.
Table 7: A comparison of ZSAR performance among our best result and the existing (generative-based) approaches for the UCF101 dataset.

Method                              UCF101
GMM (Mishra et al., 2018)           20.3%
CLSWGAN (Xian et al., 2018)         37.5%
CEWGAN (Mandal et al., 2019)        38.3%
f-VAEGAN (Xian et al., 2019)        38.2%
TF-VAEGAN (Narayan et al., 2020)    41.0%
Dual-GAN (ours)                     46.37%
For further investigations, we compare our best re-
sults to the existing approaches that follow the GAN-
based framework on the UCF101 dataset, presented
in Table 7. Our Dual-GAN model outperforms the other
approaches by at least 5.37% for the UCF101. This in-
dicates that fusing embeddings derived from different
knowledge sources (i.e. texts and images) delivers a
performance boost in ZSAR. Note that we do not
re-implement and evaluate other approaches, but di-
rectly report the results from the work (Narayan et al.,
2020).
As a result, we summarise our main findings
as follows: (1) The image-driven semantic embed-
ding is not always better than the text-driven one;
its advantage depends on the quality of the video sam-
ples. (2) All cases of using the Dual-GAN model
outperform their counterpart cases of using Single-
GAN since the fused semantic embedding obtained
from two knowledge sources is capable of producing
more representative semantics for the classes. (3) The
Max. fusion method generally performs better than
other methods in most cases. Additionally, a limi-
tation of this work is that we do not fine-tune the
proposed Dual-GAN model by optimising the hyper-
parameters, such as the number of synthesised visual
embeddings of unseen classes.
7 CONCLUSIONS
In this work, we have empirically evaluated ZSAR
performance using either text-driven or image-driven
semantic embeddings related to the action classes in
the GAN-based framework on UCF101. We have also
investigated the impact of combining both text and
image knowledge by applying different fusion meth-
ods (i.e. averaging, summation, maximum, mini-
mum).
We have shown that applying the image-driven
semantic embedding can deliver significant boosts
over the text-driven one, within a range be-
tween 15.26% (GoogLeNet against Description) and
17.85% (ResNet101 against Action Class), in the
Single-GAN framework for UCF101. Furthermore,
our proposed Dual-GAN model outperforms the base-
line (i.e. action class in the Single-GAN model) by a
large margin of 18.35% and improves on the exist-
ing GAN-based approaches by at least 5.37%.
As future work, we aim to investigate generalised
ZSAR, a more challenging task that tests both
seen and unseen classes together in the classification
stage. Also, we will explore other approaches to pro-
duce more enriched and meaningful semantic embed-
ding that can also mitigate the problem of the seman-
tic gap between classes and video samples. We are
also planning to use other fusion methods such as con-
catenation or using two different classifiers and cal-
culating the predicted class as a combination of both
classifiers. Lastly, we plan to use other supervised
methods such as Random Forest, Support Vector Ma-
chines, or Deep Learning to see if they are able to
deliver better results.
ACKNOWLEDGEMENTS
This project is funded under the Fiosraigh Scholarship
of Technological University Dublin.
REFERENCES
Arjovsky, M. and Bottou, L. (2017). Towards principled
methods for training generative adversarial networks.
arXiv preprint arXiv:1701.04862.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset.
In proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6299–6308.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hong, M., Li, G., Zhang, X., and Huang, Q. (2020). Gener-
alized zero-shot video classification via generative ad-
versarial networks. In Proceedings of the 28th ACM
International Conference on Multimedia, pages 2419–
2426.
Huang, K., Delany, S. J., and McKeever, S. (2021a). Fairer
evaluation of zero shot action recognition in videos.
In VISIGRAPP (5: VISAPP), pages 206–215.
Huang, K., Miralles-Pechuán, L., and Mckeever, S.
(2021b). Zero-shot action recognition with knowledge
enhanced generative adversarial networks. In Pro-
ceedings of the 13th International Joint Conference on
Computational Intelligence, pages 254–264.
Kingma, D. P. and Welling, M. (2013). Auto-encoding vari-
ational bayes. arXiv preprint arXiv:1312.6114.
Li, Y., Hu, S.-h., and Li, B. (2016). Recognizing unseen ac-
tions in a domain-adapted embedding space. In 2016
IEEE International Conference on Image Processing
(ICIP), pages 4195–4199. IEEE.
Liu, J., Kuipers, B., and Savarese, S. (2011). Recognizing
human actions by attributes. In CVPR 2011, pages
3337–3344. IEEE.
Mandal, D., Narayan, S., Dwivedi, S. K., Gupta, V.,
Ahmed, S., Khan, F. S., and Shao, L. (2019). Out-of-
distribution detection for generalized zero-shot action
recognition. In Proceedings of CVPR, pages 9985–
9993.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
Mishra, A., Pandey, A., and Murthy, H. A. (2020). Zero-
shot learning for action recognition using synthesized
features. Neurocomputing, 390:117–130.
Mishra, A., Verma, V. K., Reddy, M. S. K., Arulkumar, S.,
Rai, P., and Mittal, A. (2018). A generative approach
to zero-shot and few-shot action recognition. In 2018
IEEE Winter Conference on WACV, pages 372–380.
IEEE.
Narayan, S., Gupta, A., Khan, F. S., Snoek, C. G., and
Shao, L. (2020). Latent embedding feedback and dis-
criminative features for zero-shot classification. In
Computer Vision–ECCV 2020: 16th European Con-
ference, Glasgow, UK, August 23–28, 2020, Proceed-
ings, Part XXII 16, pages 479–495. Springer.
Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101:
A dataset of 101 human actions classes from videos in
the wild. arXiv preprint arXiv:1212.0402.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Verma, V. K., Arora, G., Mishra, A., and Rai, P. (2018).
Generalized zero-shot learning via synthesized exam-
ples. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 4281–
4289.
Wang, H. and Schmid, C. (2013). Action recognition with
improved trajectories. In Proceedings of IEEE ICCV,
pages 3551–3558.
Wang, Q. and Chen, K. (2017). Alternative semantic rep-
resentations for zero-shot human action recognition.
In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, pages 87–
102. Springer.
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and
Schiele, B. (2016). Latent embeddings for zero-shot
classification. In Proceedings of CVPR, pages 69–77.
Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. (2018). Fea-
ture generating networks for zero-shot learning. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 5542–5551.
Xian, Y., Schiele, B., and Akata, Z. (2017). Zero-shot
learning-the good, the bad and the ugly. In Proceed-
ings of the IEEE Conference on CVPR, pages 4582–
4591.
Xian, Y., Sharma, S., Schiele, B., and Akata, Z. (2019).
f-vaegan-d2: A feature generating framework for any-
shot learning. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 10275–10284.
Xiang, H., Xie, C., Zeng, T., and Yang, Y. (2021).
Multi-knowledge fusion for new feature generation
in generalized zero-shot learning. arXiv preprint
arXiv:2102.11566.
Xu, X., Hospedales, T., and Gong, S. (2015). Semantic
embedding space for zero-shot action recognition. In
2015 IEEE International Conference on Image Pro-
cessing (ICIP), pages 63–67. IEEE.