Stable Diffusion (Rombach, Blattmann, Lorenz, et al., 2022) allows the diffusion model to be trained with limited resources by performing the diffusion process in a compressed latent space: training complexity is reduced while image detail is retained, so the generated images achieve high fidelity.
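Concretely, the resource savings come from applying the denoising objective to compressed latents rather than full-resolution pixels. A minimal sketch of such a latent-space training step is given below; vae_encoder and unet are hypothetical placeholders, and the cosine noise schedule is a simplifying assumption rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def latent_diffusion_loss(vae_encoder, unet, image, text_emb, num_timesteps=1000):
    z0 = vae_encoder(image)                                # compress the image to a latent tensor
    t = torch.randint(0, num_timesteps, (z0.size(0),), device=z0.device)
    noise = torch.randn_like(z0)
    # Simplified cosine noise schedule (an assumption for illustration only).
    alpha_bar = torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, *[1] * (z0.dim() - 1))
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise   # forward diffusion in latent space
    return F.mse_loss(unet(z_t, t, text_emb), noise)       # train the denoiser to predict the added noise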
ERNIE-ViLG 2.0 (Feng, Zhang, Yu, et al., 2023) is the first text-to-image generation model for the Chinese domain. It guides the model to pay more attention to the alignment between image elements and text semantics during learning, so that the model can handle the relationships among different objects and attributes. All of the above works use classifier-free guidance; that is, no separate classifier is trained to guide the model's learning.
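A minimal sketch of classifier-free guidance, the classifier-free mechanism referred to above, is shown here; denoiser is a hypothetical noise-prediction network and the guidance scale is an illustrative default.

def classifier_free_guidance(denoiser, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    eps_cond = denoiser(x_t, t, text_emb)    # prediction conditioned on the text prompt
    eps_uncond = denoiser(x_t, t, null_emb)  # prediction with an empty / null prompt
    # Extrapolate away from the unconditional prediction toward the conditional one,
    # replacing the gradient signal a trained classifier would otherwise provide.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)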
2.3 Text Image Editing Based on the CLIP
Model
The CLIP (Radford, Kim, Hallacy, et al., 2021) model is trained with contrastive learning on 400 million image-text pairs collected from the Internet, acquiring a multimodal embedding space that can be used to measure the semantic similarity between a text and an image.
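As an illustration of how such a joint embedding space scores text-image similarity, the following sketch uses the Hugging Face transformers CLIP wrappers; the model checkpoint name, input image path, and candidate captions are illustrative assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # illustrative input image (assumption)
texts = ["a photo of a cat", "a photo of a dog"]  # candidate descriptions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Cosine similarity in the shared embedding space ranks how well each text matches the image.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)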
Patashnik et al. (Patashnik, Wu, Shechtman, et al., 2021) combined the generative power of the StyleGAN generator with the vision-language capability of CLIP, proposing three ways of combining CLIP with StyleGAN. The first approach is text-guided latent optimization, in which the CLIP model serves as a loss network. The second is a latent residual mapper trained for a specific text prompt: starting from an initial point in latent space, the mapper produces local steps in latent space. Finally, a method that maps a text prompt to an input-agnostic direction in StyleGAN's style space provides control over the manipulation strength and the degree of disentanglement.
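A minimal sketch of the first approach, text-guided latent optimization, is given below; generator, clip_loss, and identity_loss are hypothetical stand-ins for a pretrained StyleGAN generator and the CLIP and identity losses, and the loop length, learning rate, and loss weights are illustrative.

import torch

def edit_latent(w_source, text_prompt, generator, clip_loss, identity_loss,
                steps=200, lr=0.1, lambda_l2=0.008, lambda_id=0.005):
    w = w_source.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = generator(w)                                # G(w): render the candidate edit
        loss = (clip_loss(image, text_prompt)               # match the text prompt in CLIP space
                + lambda_l2 * (w - w_source).pow(2).sum()   # stay close to the source latent
                + lambda_id * identity_loss(image))         # preserve the subject's identity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()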
CLIP as a loss network: CLIPDraw (Frans, Soros, et al., 2022) computes its loss as the cosine distance between the text-prompt features and the generated-image features obtained from CLIP. VQGAN-CLIP (Crowson, Biderman, Kornis, et al., 2022) adjusts the intermediate latent vector by comparing, in the CLIP model, the similarity of the generated image to the specified text, thus producing images consistent with the text description.
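The shared ingredient in these methods is a CLIP cosine-distance loss. A minimal sketch, assuming a CLIP model that exposes encode_image and encode_text in the style of the open-source CLIP package, is:

import torch.nn.functional as F

def clip_cosine_loss(clip_model, image, tokenized_text):
    img_feat = F.normalize(clip_model.encode_image(image), dim=-1)
    txt_feat = F.normalize(clip_model.encode_text(tokenized_text), dim=-1)
    # 1 - cosine similarity: minimizing this pulls the generated image toward the prompt.
    return (1.0 - (img_feat * txt_feat).sum(dim=-1)).mean()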
CLIP as a joint embedding space: StyleGAN-NADA (Gal, Patashnik, Maron, et al., 2022) encodes the difference between domains as a text-described direction in the CLIP embedding space. It uses two generators: one remains frozen to provide samples from the source domain, while the other is optimized so that its images move away from the source domain along that direction in CLIP space, although adhesion and distortion can still appear during face editing. HairCLIP (Wei, Chen, Zhou, et al., 2022) builds on StyleCLIP: image and text features extracted by CLIP serve as conditions, and disentangled information injection assigns different attributes to different sub-mappers.
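A minimal sketch of the directional loss idea behind StyleGAN-NADA follows; the CLIP encoders, generator outputs, and pre-tokenized source/target prompts are hypothetical inputs.

import torch.nn.functional as F

def directional_loss(clip_model, img_frozen, img_trainable, src_tokens, tgt_tokens):
    # Direction of change in image space: trainable generator output minus frozen output.
    delta_img = F.normalize(clip_model.encode_image(img_trainable)
                            - clip_model.encode_image(img_frozen), dim=-1)
    # Direction of change in text space: target-domain prompt minus source-domain prompt.
    delta_txt = F.normalize(clip_model.encode_text(tgt_tokens)
                            - clip_model.encode_text(src_tokens), dim=-1)
    # Align the two directions in the shared CLIP embedding space.
    return (1.0 - (delta_img * delta_txt).sum(dim=-1)).mean()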
2.4 Text Image Editing Based on the
Autoregressive Model
Early visual autoregressive models performed visual synthesis pixel by pixel; however, these methods are computationally costly on high-dimensional data. With the rise of VQ-VAE (Oord, Vinyals, Kavukcuoglu, et al., 2017), image synthesis tasks can benefit from large-scale pre-trained models, allowing autoregressive models to be combined with VQ-VAE or other methods to achieve better image quality.
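A minimal sketch of the VQ-VAE quantization step that turns an image into discrete tokens is shown below; features and codebook stand in for hypothetical encoder outputs and learned codebook embeddings.

import torch

def quantize(features, codebook):
    # features: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    distances = torch.cdist(features, codebook)   # pairwise L2 distances to every codebook entry
    indices = distances.argmin(dim=-1)            # nearest codebook entry per feature
    return codebook[indices], indices             # quantized vectors and the discrete token ids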
DALL-E (Ramesh, Pavlov, Goh, et al., 2021) borrows ideas from the Transformer (Vaswani, Shazeer, Parmar, et al., 2017) to model text and image tokens autoregressively as a single data stream. The generated images with the highest similarity to the text are then selected by a CLIP (Cheng, 2023) classifier. However, DALL-E generates images at a relatively small size, and a large amount of memory is required to generate high-precision images.
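A minimal sketch of the single-data-stream idea, in which text tokens and quantized image tokens are concatenated and modeled with one next-token objective, is given below; the token tensors and the transformer are hypothetical placeholders rather than DALL-E's actual components.

import torch
import torch.nn.functional as F

def single_stream_loss(transformer, text_tokens, image_tokens):
    # Concatenate the two modalities into one token sequence: [text ... | image ...].
    sequence = torch.cat([text_tokens, image_tokens], dim=1)
    logits = transformer(sequence[:, :-1])        # predict every next token from its prefix
    targets = sequence[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))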
Parti (Yu, Xu, Koh, et al., 2022) is a two-stage model composed of an image tokenizer and an autoregressive model: the first stage trains the image tokenizer; the second stage trains an autoregressive sequence-to-sequence model that generates image tokens from text tokens. Text-based generative image editing has become an important part of the image editing field.
The advantages of autoregressive text image editing are clear probabilistic modeling and stable training. However, this approach must convert the image into tokens for autoregressive prediction, which leads to slow inference and a large number of parameters. Text image editing based on diffusion models also belongs to the family of likelihood-based models, so it is easier to train and the generated images are more diverse, making it better suited to training on large-scale datasets than GANs. Although such iterative methods achieve stable training with simple objectives, they incur high computational cost at inference time.