Generating Images from Caption and Vice Versa via CLIP-Guided Generative Latent Space Search
Federico A. Galatolo (https://orcid.org/0000-0001-7193-3754), Mario G. C. A. Cimino (https://orcid.org/0000-0002-1031-1959) and Gigliola Vaglini (https://orcid.org/0000-0003-1949-6504)
Department of Information Engineering, University of Pisa, 56122 Pisa, Italy
Keywords: CLIP, Generative Adversarial Networks, GPT2, Genetic Algorithms.
Abstract: In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, provides similar embeddings. Building on this property, CLIP-GLaSS takes a caption (or an image) as input and generates the image (or the caption) whose CLIP embedding is the most similar to the input one. This optimal image (or caption) is produced via a generative network, after an exploration by a genetic algorithm. Promising results are shown, based on the experimentation of the image Generators BigGAN and StyleGAN2, and of the text Generator GPT2.
1 INTRODUCTION AND BACKGROUND
In recent years, generative neural networks have shown promising results in a wide range of fields. The state of the art in the field of image generation is represented by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs are capable of generating realistic synthetic images by leveraging two competing neural networks: a Generator and a Discriminator. The objective of the Generator is to generate images capable of fooling the Discriminator. The objective of the Discriminator is to distinguish between images generated by the Generator and the original ones.
In the field of Natural Language Processing (NLP), unprecedented results have been achieved by transformer architectures (Hu, 2019). One of the best-known and most studied transformers is the Generative Pre-trained Transformer 2 (GPT2) (Radford et al., 2019). GPT2 has 1.5 billion parameters and was trained on a language modelling task over the text of 8 million web pages.
Generating images from a descriptive caption has always been a challenging problem. In recent years, architectures such as StackGAN++ (Zhang et al., 2018) and AlignDRAW (Mansimov et al., 2015) showed promising results in this field, although
being limited to the visual and textual domains of
the training dataset. Very recently (January 2021),
a novel deep network which learns visual concepts
from natural language supervision has been released
by OpenAI. CLIP (Contrastive Language–Image Pre-
training) (Radford et al., 2021) consists of two en-
coders: one for images and another for texts. CLIP
encoders are capable of producing similar embed-
dings for images and texts representing similar con-
cepts. CLIP can be applied to visual classification tasks without training: to distinguish the object X from the object Y in an image dataset, it is sufficient to check, for each image, whether the text description "a photo of X" or "a photo of Y" is more likely to be paired with it.
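As a purely illustrative sketch (not part of the original paper), assuming OpenAI's publicly released clip Python package and a hypothetical image file photo.jpg, this zero-shot classification recipe can be written as follows:

```python
# Minimal sketch of CLIP zero-shot classification, assuming the openai/CLIP
# package (pip install git+https://github.com/openai/CLIP.git) and torch.
# The image path and the candidate class descriptions are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # image embedding
    text_features = model.encode_text(texts)     # one embedding per caption
    # Cosine similarity between the image and each candidate caption
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # the most likely "a photo of X" identifies the predicted class
```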
In this paper, we propose a framework based on
CLIP to generate (i.e., to build without a support-
ing database) the best image corresponding to a target
caption. The proposed framework can also be used
to generate a caption corresponding to a given image.
More specifically, the framework takes a caption (or
an image) as an input, and generates the image (or
the caption) whose CLIP embedding is most similar
to the input one. This optimal image (or text) is pro-
duced via a generative network after an exploration by
a genetic algorithm. Early experimentation of the pro-
posed CLIP-guided Generative Latent Space Search
(CLIP-GLaSS) has been carried out, on the basis of
the image Generators BigGAN and StyleGAN2, and
of the text Generator GPT2, for the text-to-image and
image-to-text tasks, respectively.
The paper is structured as follows. Section 2 fo-
cuses on the Design of the CLIP-GLaSS framework.
Experimental studies are covered by Section 3. Con-
clusions are drawn in Section 4. The source code of
CLIP-GLaSS has been publicly released.
2 DESIGN OF THE CLIP-GLaSS FRAMEWORK
The main components of the CLIP-GLaSS framework are: (i) the CLIP network, for producing image and text embeddings; (ii) an Image or Text Generator, for generating a text/image output with a similar embedding; (iii) a Genetic Algorithm, to explore the latent space of the Generator and find the most similar embedding between image and text. In the case of the text-to-image task, the Image Generator is based on domain-specific or mixed-domain Generative Adversarial Networks, such as Nvidia's StyleGAN2 and DeepMind's BigGAN, respectively. In the case of the image-to-text task, the Generative Pre-trained Transformer 2 (GPT2) has been used. Finally, as a Genetic Algorithm, NSGA-II (Deb et al., 2002) has been employed, in order to solve a multi-objective optimization problem. A classical Genetic Algorithm can be employed when solving a single-objective optimization problem. The advantage of the Genetic Algorithm is that it is independent of the type of generative architecture, and it is characterized by a high degree of exploration. Since the genetic algorithm is considered well known, the next subsections detail the first two components.
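As a purely illustrative sketch of how such a multi-objective search could be wired together (an assumption for illustration, not necessarily the implementation of the released code), the NSGA-II implementation of the pymoo library can be wrapped around the Generator latent space as follows; eval_similarity and eval_discriminator_loss are hypothetical callables standing for the two objectives detailed in the next subsections:

```python
# Minimal NSGA-II sketch with pymoo (assumed version >= 0.6).
# eval_similarity and eval_discriminator_loss are hypothetical stand-ins
# for the two objectives; the latent bounds are illustrative.
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize


class LatentSearch(ElementwiseProblem):
    def __init__(self, latent_dim, eval_similarity, eval_discriminator_loss):
        super().__init__(n_var=latent_dim, n_obj=2, xl=-2.0, xu=2.0)
        self.eval_similarity = eval_similarity
        self.eval_discriminator_loss = eval_discriminator_loss

    def _evaluate(self, z, out, *args, **kwargs):
        # NSGA-II minimizes both objectives, so the similarity is negated
        out["F"] = [-self.eval_similarity(z), self.eval_discriminator_loss(z)]


def run_search(problem, pop_size=50, n_gen=500):
    algorithm = NSGA2(pop_size=pop_size)
    return minimize(problem, algorithm, ("n_gen", n_gen), verbose=False)
```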
2.1 The CLIP Network
CLIP is composed of two Neural Networks: an Image Encoder (IE) and a Text Encoder (TE). The two encoders produce similar embeddings if the image and the text contain similar visual and textual concepts. CLIP was trained using images and related snippets of text scraped from the internet, to perform a contrastive learning task. Contrastive learning consists of training a model to predict the correct similarity between data samples. CLIP was inspired by some prior work: in 2013, Richard Socher et al. trained a model to map images into feature vectors close to the semantic word vectors corresponding to their class. The model showed some zero-shot capabilities (Socher et al., 2013). Later, in 2017, Ang Li et al. trained a model, using images scraped from the internet and related user comments, to predict the corresponding comment word n-grams given an image. The resulting model was able to achieve a zero-shot 11.5% accuracy on ImageNet (Li et al., 2017). Because of the wide range of visual concepts learned from natural language via contrastive learning, CLIP shows great zero-shot capabilities. CLIP's zero-shot performance was tested on over 30 different tasks, spanning from OCR to geo-localization. The model showed competitive results against fully supervised baselines. CLIP was able to match the accuracy of the original ResNet-50 classifier on ImageNet in a zero-shot setting, without seeing any of the 1.28 million training images.
2.2 Design of the Text-to-Image Task
Figure 1 shows the architectural design of the CLIP-GLaSS framework for the text-to-image task. Here, the CLIP text/image embeddings are represented in light/dark blue, respectively. The similarity s of their outputs is sent to the Genetic Algorithm (the green box on the top) to be maximized. The Genetic Algorithm controls the input z of the image Generator, whose components are represented in orange. The image Generator is based on a Generative Adversarial Network (GAN). To clearly understand the overall behavior of the framework, the GAN is first introduced in the following.
Figure 1: Architectural design of the CLIP-GLaSS framework for the text-to-image task.
The Generative Adversarial Network is a framework proposed in 2014 (Goodfellow et al., 2014), and consists of two competing neural networks: a Generator and a Discriminator. The Generator produces candidates, while the Discriminator evaluates them. The Generator learns to map from a seeding space, called latent space (z), to the data space of interest, while the Discriminator distinguishes candidates produced by the Generator ($d_{img}$) from the real data (R). In Figure 1, the output of the Discriminator is evaluated in terms of a loss (l) to be minimized. During the training, the Generator objective is to increase the error rate of the Discriminator, by producing novel candidates that the Discriminator classifies as real. For this purpose, it is seeded with randomized input sampled from a predefined latent space (noise distribution $p_z$). Independent backpropagation procedures are applied to both networks, allowing the Generator to produce better images while the Discriminator becomes more skilled at distinguishing synthetic images. For this purpose, the Generator and the Discriminator are simultaneously trained to perform their respective tasks in a zero-sum game. The Generator and the Discriminator are typically a transposed convolutional and a convolutional neural network, respectively. In the proposed architecture, pre-trained GANs have been used. Specifically, the Discriminator takes an image and provides a single scalar representing the probability that its input comes from the original data rather than from the Generator. More formally, let us assume that the Discriminator D is trained to output 1 if its input image is original, and 0 if it is synthetically generated. Then, using the Cross Entropy loss, the joint training objective can be expressed as:
\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (1)
where $\mathbb{E}_{x \sim p}$ denotes the expectation over the probability distribution $p$.
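For concreteness, a minimal PyTorch sketch of the two terms of Equation (1) is reported below (an illustration, not code from the original work); G, D, real_images and latent_dim are assumed to be given, with D outputting a probability as described above:

```python
# Sketch of the two terms of Eq. (1) with binary cross entropy (PyTorch).
# G, D, real_images and latent_dim are assumed to be defined elsewhere;
# D is assumed to output a probability of shape (batch, 1).
import torch
import torch.nn.functional as F

def gan_losses(G, D, real_images, latent_dim):
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)       # z ~ p_z
    fake_images = G(z)

    ones = torch.ones(batch, 1)              # target label for real data
    zeros = torch.zeros(batch, 1)            # target label for generated data

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_loss = F.binary_cross_entropy(D(real_images), ones) + \
             F.binary_cross_entropy(D(fake_images.detach()), zeros)

    # Generator: non-saturating variant of the second term, i.e. -log D(G(z))
    g_loss = F.binary_cross_entropy(D(fake_images), ones)

    return d_loss, g_loss
```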
At the end of the training process, the Generator is able to generate data indistinguishable (by the Discriminator) from the data originally provided. It has been shown that the resulting Generator is also able to produce semantically meaningful interpolations in the domain of the training data (Wang et al., 2019). Figure 1 focuses on the search carried out by the Genetic Algorithm over the pre-trained networks. Overall, the Generator is seeded by z (according to a noise distribution $p_z$) and provides the related image to both the Discriminator and the CLIP image encoder ($CLIP_{IE}$). The latter's output is compared with the CLIP text encoding ($CLIP_{TE}$) of the target text, via a similarity function Sim. As a similarity function, the cosine similarity has been used, according to (Radford et al., 2021). One objective of the Genetic Algorithm is to provide the best z that maximizes this similarity. More formally, given the target text T, the optimization problem is:
\max_z \; \mathrm{sim}(CLIP_{IE}(G(z)), CLIP_{TE}(T)) \quad (2)
The second objective of the Genetic Algorithm is the classification loss l of the Discriminator, which is calculated between the Discriminator output $d_{img}$ for the generated image and the value R associated with the output for a real image. More formally, the optimization problem is:
\min_z \; \mathrm{Loss}(D(G(z)), R) \quad (3)
After solving the optimization problem, the resulting optimal image provided by the Generator is $img_{opt}$.
It is worth noticing that, when using a generative architecture without a Discriminator, the second objective is missing. As a consequence, the optimization problem becomes single-objective, since it considers only the similarity between the CLIP embeddings.
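To make the two objectives concrete, the following sketch (an illustrative assumption, not the released implementation) evaluates a candidate latent vector z against Equations (2) and (3); generator, discriminator, clip_model (the openai/CLIP model) and clip_preprocess (a hypothetical helper that resizes and normalizes the generated image for CLIP) are assumed to be available:

```python
# Sketch of the two objectives (Eqs. 2 and 3) for a candidate latent vector z.
# generator, discriminator and clip_model are assumed pre-trained and frozen;
# clip_preprocess is a hypothetical helper adapting the image for CLIP.
import torch
import torch.nn.functional as F

def objectives(z, target_tokens, generator, discriminator,
               clip_model, clip_preprocess):
    with torch.no_grad():
        image = generator(z)                                         # G(z)
        image_emb = clip_model.encode_image(clip_preprocess(image))  # CLIP_IE(G(z))
        text_emb = clip_model.encode_text(target_tokens)             # CLIP_TE(T)

        # Eq. (2): cosine similarity between image and text embeddings
        similarity = F.cosine_similarity(image_emb, text_emb).mean()

        # Eq. (3): Discriminator loss with respect to the "real" target R
        d_out = discriminator(image)
        disc_loss = F.binary_cross_entropy(torch.sigmoid(d_out),
                                           torch.ones_like(d_out))

    # NSGA-II minimizes both objectives, so the similarity is negated
    return -similarity.item(), disc_loss.item()
```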
2.3 Design of the Image-to-Text Task
Figure 2 shows the architectural design of the CLIP-GLaSS framework for the image-to-text task. Similarly to the text-to-image task, the CLIP image/text embeddings are represented in dark/light blue, respectively. The similarity s of their outputs is sent to the Genetic Algorithm to be maximized. The Genetic Algorithm controls the seed z of the text Generator, represented in orange. The state of the art for text generators is based on transformers, such as XLNet (Yang et al., 2019), the Conditional Transformer Language Model (CTRL) (Keskar et al., 2019) and the Generative Pre-trained Transformer 2 (GPT2) (Radford et al., 2019). As an example, in this paper GPT2 has been used.
More specifically, GPT2 is a transformer model developed by OpenAI as an improved version of the Generative Pre-trained Transformer (GPT). The transformer architecture was introduced in 2017, and it is mainly used for solving NLP tasks. Although transformers are built to work with temporal data (like Recurrent Neural Networks or Long Short-Term Memory Neural Networks), they do not require that temporal data be processed sequentially. Hence, transformers allow fast, large-scale parallel training. Transformers are composed of a stack of encoders and decoders, and heavily rely on a mechanism called attention, which is able to highlight important information while discarding the unimportant one. The training data for GPT2 was scraped from the internet using a custom web scraper that emphasizes document quality using some meta-heuristics. The resulting dataset contains text from over 45 million links and consists of almost 40 GB of text. GPT2
was trained using a Model-Agnostic Meta-Learning (MAML) method (Finn et al., 2017), and tested with a zero-shot approach on a variety of tasks: text generation, translation, summarization, question answering, etc. It was able to match, and even surpass, the state of the art of fully supervised models in a zero-shot setting. It is important to notice that GPT2 does not use a human-like alphabet, but an alphabet of tokens generated using Byte Pair Encoding (BPE) (Sennrich et al., 2015). GPT2 can be used as a generative architecture by setting context tokens and using them to predict the next token, until the stop-sentence token is predicted or a target text length is reached. We will refer to the input context tokens as its latent space.
Figure 2: Architectural design of the CLIP-GLaSS framework for the image-to-text task.
In Figure 2, the Generator is seeded by z and provides an output text accordingly. The output text is used to feed the CLIP text encoder ($CLIP_{TE}$). Finally, the similarity between the text embedding ($CLIP_{TE}$) and the image embedding ($CLIP_{IE}$) is computed. The optimization problem of the Genetic Algorithm is to maximize this similarity. After solving the optimization problem, the resulting optimal text generated by GPT2 is $text_{opt}$.
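As an illustrative sketch (assuming the Hugging Face transformers implementation of GPT2, which may differ from the one used in the released code), a candidate latent vector of context-token ids can be decoded into a caption as follows; the static prefix "the picture of" anticipates the experimental setup of Section 3.2, and the generation parameters are placeholders:

```python
# Sketch: decode a latent vector of GPT2 context-token ids into a caption,
# assuming the Hugging Face transformers GPT2 implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_caption(z_token_ids, prefix="the picture of", max_new_tokens=20):
    # z_token_ids: list of integers in [0, vocab_size), i.e. the GPT2 latent space
    context = list(z_token_ids) + tokenizer.encode(" " + prefix)
    input_ids = torch.tensor([context])
    with torch.no_grad():
        output = model.generate(input_ids,
                                max_new_tokens=max_new_tokens,
                                do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
    # Only the text generated after the context is kept as the caption
    new_tokens = output[0, input_ids.shape[1]:]
    return prefix + tokenizer.decode(new_tokens)
```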
3 EXPERIMENTAL STUDIES
The CLIP-GLaSS framework has been implemented
and publicly released as an open source GitHub
repository (Galatolo, 2021), along with an in-browser
demonstration. Experiments have been carried out on an Intel Core i9-9900K CPU and a GeForce RTX 2080 GPU. Given the input image/caption, 500 generations are executed by the Genetic Algorithm to find the optimal caption/image. The processing time for one input is on the order of 5-10 minutes, depending on the generative architecture used.
In this section, some pilot experiments of CLIP-GLaSS are considered, to show its generation capabilities. Each output is assessed in terms of quality (i.e., the absence of artifacts) and relevance (i.e., the absence of unexpected elements), both evaluated with the naked eye as low, medium, or high.
3.1 Examples of the Text-to-Image Task
Different state-of-the-art pre-trained networks have
been used as Generator/Discriminator. In particu-
lar, two GANs have been used: DeepMind’s Big-
GAN (Brock et al., 2018) and Nvidia’s StyleGAN2
(Karras et al., 2020). The original BigGAN, trained
on the ImageNet dataset, has been considered (Rus-
sakovsky et al., 2015). Three versions of Style-
GAN2 have been used: (i) StyleGAN2-face, which
is trained on Flickr-Faces-HQ (Karras et al., 2019);
(ii) StyleGAN2-church, trained on a subset of LSUN
(Yu et al., 2015) made of church buildings; (iii)
StyleGAN2-car, trained on a subset of LSUN made
of cars.
DeepMind publicly released only the BigGAN Generator. For this case, a single-objective genetic algorithm is used when optimizing its latent space. In contrast, Nvidia released both the StyleGAN2 Generator and Discriminator.
The BigGAN latent space z is composed of 1000 booleans, representing the 1000 ImageNet classes, and of 128 real numbers meant to be sampled from a truncated normal distribution in the range [-2, 2]. When optimizing its latent space, mixed genetic operators are employed to correctly perform the initialization, mutation and crossover operations. The StyleGAN2 latent space z is made of 512 real numbers meant to be sampled from a normal distribution.
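A minimal sketch of how such latent individuals could be sampled and mutated is reported below (an illustration under stated assumptions; the genetic operators of the released code may differ, e.g. the boolean class part is initialized here with a single active class):

```python
# Sketch of latent-space sampling and mutation for the genetic search.
# The operator parameters are illustrative assumptions.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng()

def init_biggan_latent():
    # 1000 booleans for the ImageNet classes (a single random class is
    # activated here, purely for illustration) plus 128 reals sampled
    # from a truncated normal distribution in [-2, 2]
    class_part = np.zeros(1000, dtype=bool)
    class_part[rng.integers(0, 1000)] = True
    noise_part = truncnorm.rvs(-2, 2, size=128)
    return class_part, noise_part

def init_stylegan2_latent():
    # 512 reals sampled from a standard normal distribution
    return rng.standard_normal(512)

def mutate_reals(z, sigma=0.1, p=0.05):
    # Gaussian mutation applied to each real-valued gene with probability p
    mask = rng.random(z.shape) < p
    return z + mask * rng.normal(0.0, sigma, size=z.shape)
```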
Figure 3 shows three representative examples of input caption and related output image generated via StyleGAN2-face. Since this GAN is specialized in faces, the overall result is very good: the quality and the relevance of all images are high, except for Image (b), whose relevance is medium due to the blond hair at the bottom.
Figure 3: Input captions and related output images generated via StyleGAN2-face: (a) "the face of a man with brown eyes and stubble beard"; (b) "the face of a blonde woman with blue eyes and glasses"; (c) "the face of a bald man with beard and brown eyes".
Figure 4 shows three representative examples of input caption and related output image generated via StyleGAN2-car. Although this GAN is specialized in cars, the quality of all images is medium, due to the presence of artifacts. The relevance is high for (a) and (b), but medium for (c), because the "intersection" is not visible.
Figure 4: Input captions and related output images generated via StyleGAN2-car: (a) "a black SUV in the snow"; (b) "a red truck in a street"; (c) "a yellow beetle in an intersection".
Figure 5: Input captions and related output images generated via StyleGAN2-church: (a) "a gothic church a grass field"; (b) "an orthodox church in moscow"; (c) "a basilica in rome".
Figure 5 shows three representative examples of input caption and related output image generated via StyleGAN2-church. This GAN is specialized in images of church buildings. Indeed, the image relevance is high, and the quality is almost high, due to the presence of minor artifacts. The examples clearly show that CLIP-GLaSS is able to combine the right image elements to match the target caption, when using input texts that belong to the same domain as the generative network.
Differently from the previous cases, Figure 6 shows the output images generated via StyleGAN2 on three domains (face, car, and church). To assess the role of the Discriminator, the optimization is also performed without it. In Figure 6, the images produced without the Discriminator are marked with an asterisk in the caption. Specifically, using the input text "the face of a blonde girl with glasses", StyleGAN2-face achieves high quality and relevance for (a) and (d), as expected. On the other hand, the low performance of StyleGAN2-car and StyleGAN2-church is apparent. However, it is worth noting that in Figure 6 (f) the picture resembles a face with glasses, generated via two windows on a building.
Figure 7 and Figure 8 show the output images generated via StyleGAN2, with and without the Discriminator, for the input captions "a blue car in the snow" and "a gothic church in the city", respectively. Not surprisingly, the GAN that is specialized in the category of interest (face, car, church) provides the best relevance, but medium-low quality, due to the presence of artifacts.
Figure 6: Output images generated via StyleGAN2, with and without (*) the Discriminator, for the input caption "the face of a blonde girl with glasses": (a) StyleGAN2-face, (b) StyleGAN2-car, (c) StyleGAN2-church, (d) StyleGAN2-face*, (e) StyleGAN2-car*, (f) StyleGAN2-church*.
Figure 7: Output images generated via StyleGAN2, with and without (*) the Discriminator, for the input caption "a blue car in the snow": (a) StyleGAN2-face, (b) StyleGAN2-car, (c) StyleGAN2-church, (d) StyleGAN2-face*, (e) StyleGAN2-car*, (f) StyleGAN2-church*.
Overall, the resulting images resemble the target
caption, but the medium-low quality of the images
suggests that, to correctly perform this task, a bigger
Generator (i.e. with more parameters) trained on a
wider variety of images is needed. In other words,
the Generator cannot create images that do not be-
long to its training domain. In general, it is appar-
ent that the Discriminator guarantees a better quality
of the output images. Interestingly, when the Discriminator is not used, even if the target text is outside of the Generator domain, the framework is able to generate
images that somehow resemble the target text. For example, Figure 7(d) is generated via StyleGAN2-face with the input caption "a blue car in the snow": it represents the face of a man in the snow, wearing a blue sweater. Another example is Figure 7(f), generated by StyleGAN2-church without the Discriminator: it represents a church with a blue blob whose shape resembles a car. Figure 8 shows the examples generated by the three domain-specific StyleGAN2 networks for the caption "a gothic church in the city". Specifically, StyleGAN2-car without the Discriminator generates an image with a background building that recalls the Gothic style.
Figure 8: Output images generated via StyleGAN2, with and without (*) the Discriminator, for the input caption "a gothic church in the city": (a) StyleGAN2-face, (b) StyleGAN2-car, (c) StyleGAN2-church, (d) StyleGAN2-face*, (e) StyleGAN2-car*, (f) StyleGAN2-church*.
Finally, the CLIP-GLaSS framework has been experimented with BigGAN, a large-scale GAN trained on ImageNet, i.e., on images from multiple domains. Figure 9 shows three captions and the related images generated by BigGAN. Although the overall quality is medium, due to the presence of artifacts, the relevance is high.
Figure 9: Input captions and related output images generated via BigGAN: (a) "a clock in a wall"; (b) "a dog in the woods"; (c) "a mountain lake".
3.2 Examples of the Image-to-Text Task
This section focuses on some representative examples of the caption generation capabilities of CLIP-GLaSS. Specifically, 20 context tokens have been used as the GPT2 latent space z, to which three fixed tokens have been concatenated, representing the static context "the picture of". The latent space of the context tokens is an integer space of numbers ranging from 0 to 50257, which is the BPE vocabulary size. Indeed, GPT2 adopts a subword-level vocabulary: it
does not predict the next word but the next subword token. When optimizing the GPT2 latent space, integer genetic operators have been used to perform the initialization, mutation and crossover operations.
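A minimal sketch of such integer operators over the token-id latent space is reported below (illustrative assumptions; the operator parameters are placeholders):

```python
# Sketch of integer genetic operators over the GPT2 context-token space.
# VOCAB_SIZE and the operator parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng()
VOCAB_SIZE = 50257   # GPT2 BPE vocabulary size
N_TOKENS = 20        # number of context tokens used as latent space

def init_tokens():
    # Random context: 20 integer token ids
    return rng.integers(0, VOCAB_SIZE, size=N_TOKENS)

def mutate_tokens(tokens, p=0.1):
    # Each token is replaced by a random one with probability p
    mask = rng.random(tokens.shape) < p
    random_tokens = rng.integers(0, VOCAB_SIZE, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)

def crossover_tokens(parent_a, parent_b):
    # Single-point crossover between two token sequences
    point = rng.integers(1, N_TOKENS)
    return np.concatenate([parent_a[:point], parent_b[point:]])
```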
Figure 10: Input images and related output captions generated via GPT2: (a) "the picture of a dog that has a high fatality rate"; (b) "the picture of the fish."; (c) "the picture of a zebra in a zebra coat."; (d) "the picture of the orchestra"; (e) "the picture of the art of knots"; (f) "the picture of the world's largest radio telescope"; (g) "the picture of the pottery of the city of Suzhou"; (h) "the picture of the world's first 'safe' phone"; (i) "the picture of the butterfly package".
Figure 10 shows nine input images, randomly extracted from ImageNet, with the related captions generated via GPT2. The results clearly show the caption generation potential of CLIP-GLaSS. Most captions have both high quality and high relevance. Some captions, e.g., (a), (c), and (h), have high relevance and medium quality because of textual artifacts. For example, caption (a), "the picture of a dog that has a high fatality rate", can be related to the fact that the dog is lying on the ground; caption (h), "the picture of the world's first 'safe' phone", can be related to the fact that the phone in the picture resembles a safe.
4 CONCLUSIONS
In this research work we have introduced CLIP-
GLaSS, a zero-shot framework which takes an in-
put caption and generates a corresponding image, and
vice versa. CLIP-GLaSS is based on the CLIP neural
network, for generating close embeddings of seman-
tically close texts or images, a Generator network,
for controlling the respective generation of images or
texts, and a Genetic Algorithm, to explore the Gener-
ator space to find the best image or text. The design choices are first detailed, and then a set of pilot experiments is discussed, using the generative networks BigGAN, StyleGAN2 and GPT2. Results show the
high potential of the proposed framework, in terms of
quality and relevance of the output image or text, en-
couraging further comparative research. The source
code has been publicly released, along with an in-
browser demonstration.
ACKNOWLEDGEMENT
Work partially supported by the Italian Ministry of
Education and Research (MIUR) in the framework of
the CrossLab project (Departments of Excellence).
REFERENCES
Brock, A., Donahue, J., and Simonyan, K. (2018). Large
scale gan training for high fidelity natural image syn-
thesis. arXiv preprint arXiv:1809.11096.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002).
A fast and elitist multiobjective genetic algorithm:
Nsga-ii. IEEE transactions on evolutionary compu-
tation, 6(2):182–197.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic
meta-learning for fast adaptation of deep networks.
In International Conference on Machine Learning,
pages 1126–1135. PMLR.
Galatolo, F. A. (2021). CLIP-GLaSS repository on GitHub. https://github.com/galatolofederico/clip-glass.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial networks. arXiv
preprint arXiv:1406.2661.
Hu, D. (2019). An introductory survey on attention mecha-
nisms in nlp problems. In Proceedings of SAI Intelli-
gent Systems Conference, pages 432–448. Springer.
Karras, T., Laine, S., and Aila, T. (2019). A style-based
generator architecture for generative adversarial net-
works. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
4401–4410.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen,
J., and Aila, T. (2020). Analyzing and improving
the image quality of stylegan. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 8110–8119.
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and
Socher, R. (2019). Ctrl: A conditional transformer
language model for controllable generation. arXiv
preprint arXiv:1909.05858.
Li, A., Jabri, A., Joulin, A., and van der Maaten, L. (2017).
Learning visual n-grams from web data. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision, pages 4183–4192.
Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov,
R. (2015). Generating images from captions with at-
tention. arXiv preprint arXiv:1511.02793.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J., Brundage, M., and Sutskever, I. (2019). Better language models and their implications. OpenAI Blog, https://openai.com/blog/better-language-models.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural
machine translation of rare words with subword units.
arXiv preprint arXiv:1508.07909.
Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Man-
ning, C. D., and Ng, A. Y. (2013). Zero-shot learn-
ing through cross-modal transfer. arXiv preprint
arXiv:1301.3666.
Wang, Z., She, Q., and Ward, T. E. (2019). Generative ad-
versarial networks in computer vision: A survey and
taxonomy. arXiv preprint arXiv:1906.01529.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.,
and Le, Q. V. (2019). Xlnet: Generalized autoregres-
sive pretraining for language understanding. arXiv
preprint arXiv:1906.08237.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and
Xiao, J. (2015). Lsun: Construction of a large-scale
image dataset using deep learning with humans in the
loop. arXiv preprint arXiv:1506.03365.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang,
X., and Metaxas, D. N. (2018). Stackgan++: Realis-
tic image synthesis with stacked generative adversar-
ial networks. IEEE transactions on pattern analysis
and machine intelligence, 41(8):1947–1962.