image quality and R-precision as a complementary indicator of how well the generated images match the original text. The
model is compared with GAN-INT-CLS, GAWWN,
StackGAN, StackGAN-v2, and PPGN. On the CUB test set, AttnGAN achieves an IS of 4.36, significantly better than the previous best of 3.82 among these methods, while on COCO the best IS improves from 9.58 to 25.89. These results show that AttnGAN generates higher-quality images than the other models. For AttnGAN itself, increasing the hyperparameter λ in the objective function improves both the IS and the R-precision, which indicates that the proposed attention mechanism plays a significant role in model optimization (Xu et al., 2018).
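The word-level attention that AttnGAN builds on can be sketched in plain NumPy: each image sub-region attends over the word features, producing a word-context vector that conditions the next generation stage. The function name and shapes below are illustrative, not the paper's implementation.

```python
import numpy as np

def word_attention(word_feats, region_feats):
    """Word-level attention (sketch). word_feats: (T, D) word embeddings;
    region_feats: (N, D) image sub-region features. Returns one
    word-context vector per region, shape (N, D)."""
    scores = region_feats @ word_feats.T           # (N, T) similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over the words
    return alpha @ word_feats                      # weighted word contexts
```

Because the softmax is taken over words, regions that correlate with a particular word (e.g. "beak") draw their context mostly from that word's embedding, which is what lets the attention steer fine-grained details.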
2.4 DM-GAN Model
This model consists of two stages: a coarse initial-generation stage and a refinement stage based on dynamic memory.
Memory Writing, Key Addressing, Value Reading,
and Response are the four components that make up
the refinement stage. This model's primary innovation
is Memory Writing, which embeds word features into
the memory feature space using a convolution
operation (Zhu et al., 2019). This lets the model estimate the importance of each word and highlight the information carried by the most relevant words. The model can then refine the image using those related words rather than relying on partial text information or only sentence-level information. Because the refinement operates at the word level, it is more fine-grained than in previous models.
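The four refinement components can be sketched as a single pass over NumPy arrays. The projection matrices Mw, Mk, and Mv below are hypothetical stand-ins for the learned convolutions in the paper; only the data flow (write, address, read, respond) follows the description above.

```python
import numpy as np

def dynamic_memory_refine(word_feats, img_feats, Mw, Mk, Mv):
    """Sketch of DM-GAN's refinement stage. Mw, Mk, Mv are hypothetical
    projection matrices standing in for the learned convolutions.
    word_feats: (T, Dw) word embeddings; img_feats: (N, Di) region features."""
    memory = word_feats @ Mw                       # memory writing: embed words
    keys = memory @ Mk                             # key addressing space, (T, Di)
    scores = img_feats @ keys.T                    # (N, T) region-to-word match
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)              # attention over memory slots
    read = w @ (memory @ Mv)                       # value reading, (N, Dv)
    return np.concatenate([img_feats, read], axis=1)  # response: fuse and refine
```

In DM-GAN proper the response step feeds a further generator; here concatenation simply stands in for that fusion.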
Minfeng Zhu et al. test the model on CUB and MS COCO. They use IS to assess image quality, R-precision as a complementary indicator of text-image consistency, and FID; a lower FID indicates less separation between the generated and real image distributions. The model
is compared with GAN-INT-CLS, GAWWN,
StackGAN, StackGAN-v2, PPGN, and AttnGAN.
The IS of the DM-GAN model improves from 25.89
to 30.49 (17.77%) on the COCO dataset and from
4.36 to 4.75 (8.94%) on the CUB dataset, both of
which are noticeably better than the other methods.
The outcomes demonstrate that the DM-GAN model
produces images of superior quality in comparison to
alternative techniques. As DM-GAN improves its
comprehension of the data distribution, FID decreases
from 23.98 to 16.09 on CUB and from 35.49 to 32.64
on MS COCO. R-precision improves by 4.49% on CUB and 3.09% on COCO. A higher R-precision means that the
images generated by DM-GAN are more accurate in
relation to the textual description. This further
demonstrates the efficacy of the dynamic memory mechanism (Zhu et al., 2019).
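For intuition about the FID numbers above, the metric measures the distance between two Gaussians fitted to feature statistics of real and generated images. A minimal sketch, assuming diagonal covariances (the real metric uses full covariance matrices of Inception features and a matrix square root):

```python
import numpy as np

def fid_diag(mu1, var1, mu2, var2):
    """Simplified FID between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1*C2)^(1/2)), with C1, C2 diagonal.
    Identical distributions give 0; larger values mean more separation."""
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.sum(np.asarray(var1) + np.asarray(var2)
                      - 2.0 * np.sqrt(np.asarray(var1) * np.asarray(var2)))
    return float(mean_term + cov_term)
```

This makes explicit why a drop from 23.98 to 16.09 signals that the generated distribution has moved closer to the real one.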
2.5 SD-GAN Model
This model uses a Siamese structure to extract semantics that are common across different textual descriptions. This enables the model to handle the generation bias caused by differences in expression and to resolve the semantic inconsistency that arises when the same content is phrased in different ways. Meanwhile, semantic diversity and details are preserved, yielding more detailed generation. The core module of
the model is divided into a text encoder and a
hierarchical GAN (Yin et al., 2019). The text encoder
uses a bi-directional LSTM to extract semantic
features. The hierarchical GAN uses several
generators to progressively generate images from low
resolution to high resolution. Semantic-Conditioned
Batch Normalization (SCBN) is also introduced in
this model to enhance the embedding relationship
between visual features and textual semantics. It
enables the linguistic embedding to manipulate the
visual feature maps by scaling them up or down,
negating them, or shutting them off (Yin et al., 2019).
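SCBN can be sketched as ordinary batch normalization whose scale and shift are predicted from the sentence embedding rather than being fixed learned parameters. The projection matrices W_gamma and W_beta below are hypothetical stand-ins for the learned layers in the paper.

```python
import numpy as np

def scbn(feat_maps, sent_emb, W_gamma, W_beta, eps=1e-5):
    """Semantic-Conditioned Batch Normalization (sketch). feat_maps:
    (B, C, H, W) visual features; sent_emb: (B, S) sentence embedding.
    W_gamma, W_beta of shape (S, C) are hypothetical learned projections
    that let the text scale up/down, negate, or shut off each channel."""
    mu = feat_maps.mean(axis=(0, 2, 3), keepdims=True)
    var = feat_maps.var(axis=(0, 2, 3), keepdims=True)
    normed = (feat_maps - mu) / np.sqrt(var + eps)  # standard batch norm
    gamma = 1.0 + sent_emb @ W_gamma                # per-sample channel scales
    beta = sent_emb @ W_beta                        # per-sample channel shifts
    return gamma[:, :, None, None] * normed + beta[:, :, None, None]
```

A negative predicted gamma negates a channel and a gamma near -1 (giving a total scale near zero) effectively shuts it off, matching the manipulations described above.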
Guojun Yin et al. test the model using the CUB
and MS COCO datasets. They use IS as the indicator to assess image quality. To determine
whether the produced images match the written
description, they also employ a human evaluation
procedure. GAN-INT-CLS, GAWWN, StackGAN,
StackGAN++, PPGN, AttnGAN, HDGAN, Cascaded
C4Synth, Recurrent C4Synth, LayoutSynthesis, and
SceneGraph are the models that are compared with IS.
On CUB, SD-GAN achieves an IS of 4.67, surpassing the previous best of 4.36 set by AttnGAN; on MS COCO, it reaches 35.69 against AttnGAN's previous best of 25.89. These results demonstrate that SD-GAN produces the highest-quality images.
For human evaluation, SD-GAN is compared with StackGAN and AttnGAN. When evaluating the images produced by these three models on CUB, the testers select the SD-GAN image as the best 68.76% of the time; on MS COCO the figure is 75.78%. This demonstrates how well the images
produced by SD-GAN match the original textual
description. In general, SD-GAN produces images
that are more consistent and of higher quality than
those produced by earlier models (Yin et al., 2019).