Background Image Editing with HyperStyle and Semantic Segmentation
Syuusuke Ishihata, Ryohei Orihara, Yuichi Sei, Yasuyuki Tahara and Akihiko Ohsuga
The University of Electro-Communications, Tokyo, Japan
Keywords:
StyleGAN, GAN Inversion, Image Editing.
Abstract:
Recently, research has been conducted on applying StyleGAN to image editing tasks. Although the technique
can be applied to editing background images, specifying the object to be edited is difficult because background
images are more diverse than foreground images such as face images. For example, because natural language
instructions can be ambiguous, the edited images may not match the user's intention. It is also challenging to
resolve the dependency between style and content in image editing. In our study, we propose an editing method
that adapts Style Transformer, a recent GAN inversion encoder approach, to HyperStyle and introduces semantic
segmentation to maintain the reconstruction quality while separating the style and the content of the background
image. The content is edited while keeping the original style by manipulating the coarse part of the latent variables
and the residual parameters obtained by HyperStyle, and the style is edited without changing the content by
manipulating the medium and fine parts of the latent variables, as in conventional StyleGAN. The qualitative
evaluation confirms that our model enables the editing of image content and style separately, and the quantitative
evaluation validates that the reconstruction quality is comparable to the conventional method.
1 INTRODUCTION
StyleGAN (Karras et al., 2019), a Generative
Adversarial Network (GAN) model trained with unsuper-
vised learning, is capable of generating high-quality
images and has excellent interpolation performance
between images. Several research projects use the
ability of StyleGAN for image editing tasks. For ex-
ample, StyleCLIP (Patashnik et al., 2021) edits the
input image to match the content of the text. La-
tent codes are edited to produce images that match
the content of the natural language using Contrastive
Language-Image Pre-training (CLIP) (Radford et al.,
2021), a model used to classify natural language and
images, as the loss function. In addition, a task called
GAN Inversion estimates latent variables such that
the Generator reconstructs the input image from them,
and the estimated latent variables can be manipulated
to edit the target image.
For background images, it is useful to edit
style and content separately, as shown in Figure 1.
For example, it can reduce the time required to create
a photo book or video work, from the point of view of
generating various images from a single photo. How-
ever, background images include outdoor scenes, such as
mountains or forests, and indoor scenes, such as rooms or
stadiums. Since background images are more diverse
than foreground images such as face data, the qual-
ity of the images generated by GANs is compromised,
and editability is reduced accordingly.
Figure 1: Overview of our research on image editing. We
aim to edit either the content, style or both in the image.
Another problem is the difficulty of editing con-
tent. Although it is possible to edit the style of an
image using StyleCLIP, it is difficult to edit the im-
age intuitively when the image content is specified
using text only. The results may differ from the ed-
itor's intention, appearing in the images as undesirable
effects such as a slight misalignment or a different
front-back relationship between objects. We checked
whether StyleCLIP could edit images in considera-
tion of the content. In Figure 2, although the input
image should be edited so that the tree is placed on
the left side, it has been edited into an image where
the whole image has been covered by a tree. The
‘left’ part of the text prompt was ignored, so editing
that takes the content into account was insufficient.
On the other hand, a semantic segmentation mask
provides a visual representation of the editor’s in-
tended content. Therefore, a semantic segmentation
mask might be convenient for editing content. The
GAN Inversion task has an approach using Hyper-
Networks that modifies the parameters of the Gen-
erator to achieve both reconstruction quality and ed-
itability of the generated image. In the case of back-
ground images, an encoder-based approach such as
pixel2style2pixel (pSp) (Richardson et al., 2021) re-
sults in lower reconstruction quality because the
Generator's performance is degraded by the
diversity of background images. Because HyperNet-
works improve the performance of the Generator, they
can mitigate this problem. In the GAN Inversion task,
the performance is measured in a space defined by
two axes, namely, reconstruction quality and editabil-
ity. While both axes are important, our study focuses
particularly on editability.
Therefore, we propose a framework for manip-
ulating content with a semantic segmentation mask
while maintaining the style editability of StyleGAN.
The GAN Inversion approach called HyperStyle
(Alaluf et al., 2021) uses HyperNetworks’s outputs to
adjust the Generator weights to improve the recon-
struction quality. For background images, the ef-
fect of the Generator adjustment is a change in the
overall shape. We hypothesize that using a semantic
segmentation mask as input to HyperNetworks could
control the content of the image. We introduce two
HyperStyle networks, one with the same inputs as the
conventional method and the other taking the seman-
tic segmentation mask and the input image, to achieve
better control of the content.
This paper is organized as follows. We describe
related work in Section 2, the proposed methodology
in Section 3, the experiments in Section 4, and a dis-
cussion of the results in Section 5. We summarize this
paper and discuss future work in Section 6.
Figure 2: Example of editing a background image in Style-
CLIP. The text prompt is ‘Tree on the left’.
2 RELATED WORK
2.1 Generative Adversarial Networks
Generative Adversarial Networks (GANs) are gener-
ative models that use two neural networks, Genera-
tor and Discriminator. The Generator tries to fool the
Discriminator into recognizing the generated data as train-
ing data, while the Discriminator counters the Gen-
erator by correctly recognizing the generated data
as fake. This alternating learning approach im-
proves the quality of the generated data. Based on
the approach, various image generation and trans-
formation approaches have been proposed, including
DCGAN, which uses a Convolutional Neural Net-
work (CNN) to improve GAN’s performance, Pix2pix
(Isola et al., 2017) and Pix2pixHD (Wang et al.,
2018) for image transformation, PGGAN (Karras
et al., 2018) for higher resolution, and AttnGAN (Xu
et al., 2018) for generating images from text. Ex-
amples of image transformations include converting
black-and-white images or line drawings to color im-
ages; there are also approaches that synthesize images
from a semantic segmentation mask. Unlike U-
Net (Ronneberger et al., 2015)-based approaches such
as pix2pix and pix2pixHD, semantic segmentation
is incorporated into normalization approaches such
as SPatially-Adaptive (DE)normalization (SPADE)
(Park et al., 2019) and semantic region-adaptive nor-
malization (SEAN) (Zhu et al., 2020) to synthe-
size images according to their segmentation-labeled
shapes. While these image-to-image approaches al-
low rough editing of images, it is difficult to control
the style in the way that StyleGAN does.
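To make the alternating training scheme concrete, the following minimal PyTorch sketch shows one adversarial update step for a generic GAN; the Generator G, Discriminator D, optimizers, and data are placeholders and do not correspond to any specific model discussed in this paper.

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=512, device="cpu"):
    """One alternating update: first the Discriminator, then the Generator."""
    batch = real.size(0)
    z = torch.randn(batch, z_dim, device=device)

    # Discriminator step: push real images toward 1 and generated images toward 0.
    fake = G(z).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(batch, 1, device=device))
              + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(batch, 1, device=device)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to make D classify freshly generated images as real.
    fake = G(z)
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(batch, 1, device=device))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()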
StyleGAN is a generative model that enables the
generation of high-resolution images. The latent code
w, transformed from a stochastically generated
variable z by an MLP-based Mapping Network, con-
trols the image style. It is possible to control the rep-
resentation of the coarse to the fine style of an im-
age through w. For example, in the case of face images,
face orientation and age can be changed by adding
vectors in the latent space. StyleGAN2 is an im-
proved version of StyleGAN, which eliminates Adap-
tive Instance Normalization (AdaIN) and instead normal-
izes the convolution weights with weight demodulation.
Since GAN Inversion estimates latent
variables from images, it makes images easier to edit
than StyleGAN alone, which transforms noise into latent
variables with a Mapping Network. Our method uses GAN In-
version and employs the Generator of StyleGAN2.
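As a rough illustration of this kind of latent-space editing, the sketch below maps a noise vector z to a style code w and shifts it along a precomputed semantic direction; mapping_network, generator, and direction are assumed inputs, not components of a particular release.

import torch

@torch.no_grad()
def edit_with_direction(mapping_network, generator, direction, alpha=3.0, z_dim=512):
    """Generate an image, then re-generate it with the style code shifted
    along a semantic direction (e.g. 'age' or 'pose' for face models)."""
    z = torch.randn(1, z_dim)
    w = mapping_network(z)                      # z -> w (style code)
    original = generator(w)                     # image from the unedited code
    edited = generator(w + alpha * direction)   # same image with the attribute shifted
    return original, edited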
In the Generator of a GAN (e.g., StyleGAN), be-
cause the latent variable used in the input determines
the image to be generated, manipulation of latent vari-
ables can be used to edit the image in the desired way.
For example, StyleCLIP (Patashnik et al., 2021) is an
approach for editing images with text that takes ad-
vantage of the expressive ability of StyleGAN. In
addition to StyleGAN, it uses CLIP (Radford et al.,
2021), a multimodal image classification model that
learns the relationship between natural language text
and images, as a loss function. Using CLIP as the loss
function, image editing can be guided by the content
of the text.
2.2 GAN Inversion
GAN Inversion is the estimation of latent variables
from a real image such that the GAN generator can re-
produce the image. Such methods include (1) ap-
proaches that directly optimize the latent variables,
such as Pivotal Tuning Inversion (PTI) (Roich et al.,
2021), where reconstruction quality is high but opti-
mization takes time, and (2) encoder-based approaches
that encode images into latent vectors, such as
pixel2style2pixel (pSp) (Richardson et al., 2021) and
Style Transformer (Hu et al., 2022). The encoder
estimates a latent variable such that the Generator
produces the same image as the input, as shown in
Figure 3. Although encoder-based approaches have
a faster estimation time, reconstruction quality tends
to be lower. Since the Generator is usually a pre-trained
model, many approaches train only the encoder.
Figure 3: Overview of a general encoder-based GAN Inver-
sion pipeline. First, the encoder estimates latent codes from
images, which are input to the pre-trained Generator. Then,
the Generator creates the same images as the encoder's input.
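As a point of reference, the optimization-based flavour of GAN Inversion can be sketched as a direct search for a latent code that reconstructs the target image; this is a simplified illustration under the assumption of a frozen, differentiable generator, not the PTI procedure itself.

import torch
import torch.nn.functional as F

def invert_by_optimization(generator, target, steps=500, lr=0.01, z_dim=512):
    """Directly optimize a latent code so that the (frozen) generator
    reproduces the target image. Slower than an encoder, but usually
    yields a more faithful reconstruction."""
    w = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = generator(w)
        loss = F.mse_loss(recon, target)   # a perceptual term is often added here
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()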
For GAN Inversion, especially for encoder-based ap-
proaches, there are also methods that improve the
reconstruction quality of the generated images by
updating the Generator parameters with HyperNetworks
(Ha et al., 2017), a model that learns the parameters of
another Neural Network (NN); HyperStyle (Alaluf
et al., 2021) is one such method. In HyperStyle, the
adjusted parameter $\hat{\theta}$ is obtained by modifying the
Generator's parameter $\theta$ as in the following equation:

$\hat{\theta}_l^{i,j} = \theta_l^{i,j} \cdot (1 + \Delta_l^{i,j})$    (1)

where $\theta_l^{i,j}$ denotes the weights for the $j$-th channel of the
$i$-th filter in the $l$-th convolutional layer of the Generator,
and $\Delta_l^{i,j}$ is the corresponding offset predicted by the
HyperNetworks. An-
other approach similar to HyperStyle is HyperInverter
(Dinh et al., 2022). Our study used HyperStyle, one
of the HyperNetworks-based approaches.
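The following sketch illustrates how offsets of the form in Equation (1) could be applied to a Generator, assuming the HyperNetworks return one offset tensor per targeted convolution weight; the helper names and the dictionary interface are hypothetical.

import torch

def modulate_generator_weights(generator, offsets):
    """Apply channel-wise offsets to selected convolution weights,
    theta_hat = theta * (1 + delta), following Equation (1).
    `offsets` maps a parameter name to its predicted delta tensor."""
    modulated = {}
    for name, theta in generator.named_parameters():
        delta = offsets.get(name)
        modulated[name] = theta * (1.0 + delta) if delta is not None else theta
    return modulated  # e.g. fed to a functional forward pass of the generator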
3 METHOD
The goal of our study is to enable flexible editing by
separating style and content without compromising
reconstruction quality. Therefore, we focused on the
GAN Inversion method. In GAN Inversion, it is said
that the relationship between reconstruction quality
and editability is a trade-off (Tov et al., 2021). Many
approaches have been devised to solve the problem.
One of the approaches is HyperStyle which aims to
eliminate the trade-off. We first analyze HyperStyle,
and then describe the architectural details and loss
functions.
Figure 4: Example of the reconstruction quality of GAN In-
version. W Encoder and Style Transformer are encoder
networks for GAN Inversion. In both cases, the output im-
age is blurred, but HyperStyle improves the reconstruction
quality and clarifies the shape of objects.
Figure 5: Confirmation of content information carried by the
residual parameters obtained from HyperStyle. The input im-
ages are (a) and (b). (c) is the GAN Inversion of (a) using a
StyleGAN2 Generator adjusted by the residual parameters
obtained from (b). (d) is the GAN Inversion of (b) using a
StyleGAN2 Generator adjusted by the residual parameters
obtained from (a).
3.1 HyperStyle
The encoder-based approaches lose the information
of the input image when encoding it into latent vari-
ables. Since background images are a more diverse
Figure 6: Our network architecture. The Seg Encoder and the GAN Inversion encoder, Style Transformer, estimate latent variables.
Then, the semantic segmentation mask, the initial reconstructed images, and the real images are input to two HyperStyle networks.
The input to HyperStyleA is a pair of the initial reconstructed image and a real image, and the input to HyperStyleB is a pair
of the semantic segmentation mask and a real image. The outputs of the two HyperStyle networks, $\Delta\theta_1$ and $\Delta\theta_2$, are added together and are
used to modify the Generator's weights.
data set than foreground images such as face images,
it is difficult to reconstruct an image with encoder-
only GAN Inversion like W Encoder and Style Trans-
former. W Encoder is a model for W space defined in
the ablation study of pSp, and HyperStyle also uses
this encoder to improve editability. Although Style
Transformer originally obtains latent codes in W+
space, HyperStyle uses W space, so we adjust the out-
put of Style Transformer to be a point in W space.
As shown in columns 2 and 3 in Figure 4, the recon-
structed image results are quite different from the in-
put image, and the entire image is blurred. The style
of the whole image remains; however, the content
information of the input image is lost when it is encoded
into latent variables. HyperStyle recovers this lost
information. When the residual parameters taken from
HyperStyle's HyperNetworks are replaced with the residual pa-
rameters obtained from another image, the content of
the image becomes that of the other image. For exam-
ple, (c) in Figure 5 shows the GAN Inversion of (a),
where the residual parameters for the Generator are
replaced with those obtained from (b). The content is
that of (b), while the style remains the same as in (a).
This suggests that the residual parameters play an
important role in content editing.
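The observation in Figure 5 corresponds to the following hypothetical swap, in which the latent code of one image is combined with the residual parameters predicted for another; encoder, hypernetwork, and generate_with_offsets are assumed helpers rather than the exact interfaces of our implementation.

import torch

@torch.no_grad()
def swap_residuals(encoder, hypernetwork, generate_with_offsets, image_a, image_b):
    """Keep the style (latent code) of image_a but inject the content
    carried by the residual parameters predicted for image_b."""
    w_a = encoder(image_a)                 # latent code of image A (style)
    offsets_b = hypernetwork(image_b)      # residual parameters of image B (content)
    return generate_with_offsets(w_a, offsets_b)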
3.2 Architecture
Figure 6 shows our architecture. Two HyperStyles
are prepared, and their outputs, the residual param-
eters, are added to the trained Generator parameters.
The input data to the two HyperStyle networks are
semantic segmentation masks, initially reconstructed
images, and real images.
The input to HyperStyleA is a pair of the initial
reconstructed image and a real image, and the in-
put to HyperStyleB is a pair of semantic segmenta-
tion mask and a real image. The Generator, modified
by the residual parameters, generates a reconstructed
image that reproduces the real image from the latent
variables. Among the latent variables, those fed to the
lower convolution layers of the Generator should make it
possible to control the shape of the image using infor-
mation related to its content, such as the output of the
semantic segmentation mask encoder (Seg Encoder).
Therefore, the low- to medium-resolution parts of the
latent variables obtained from Style Transformer are
replaced with the outputs of the Seg Encoder. The out-
put of the Style Transformer is essentially a point in
W+ space, with a different latent variable input to each
layer. However, to achieve both editability and recon-
struction quality, HyperStyle assumes the latent vari-
ables are distributed in W space, which is considered
to have higher editability although its reconstruction
quality is lower than that of W+ space. In this paper,
we adjust the output of the Style Transformer so that the
latent variable is a single vector $w \in \mathbb{R}^{1 \times 512}$, to avoid
impairing editability.
Due to the nature of StyleGAN, the input latent
variables to the high-resolution layer can control the
representation of the fine parts of the image.
Figure 7: Results of image reconstruction. The proposed method is visually comparable to the other HyperStyle-based methods, and the
reconstruction results are significantly better than those of the encoder-only Style Transformer method.
Therefore, when mixing the style without changing the
content, the medium to high resolution portion of
the latent variables are replaced. As discussed in Section 3.1,
the residual parameters produced by HyperNetworks contribute
significantly to recovering the missing content infor-
mation. In addition, the input latent variables to the
low-resolution layer can control the representation of
the rough part of the image. To mix the content while
preserving the style of the original image, the residual
parameters and the output of the Seg Encoder from
the original image are replaced with the residual pa-
rameters and the output of the Seg Encoder from an-
other image.
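The mixing rules above can be summarized in the sketch below. The number of style layers, the coarse/fine split, and all module interfaces are illustrative assumptions (e.g., the Seg Encoder is assumed to return latent codes for the coarse layers), not the exact implementation.

import torch

NUM_STYLE_LAYERS = 14  # assumed layout for a 256x256 StyleGAN2 generator
COARSE_LAYERS = 4      # assumed split between coarse and medium/fine layers

@torch.no_grad()
def mix(style_transformer, seg_encoder, hypernets, generate_with_offsets,
        source, source_mask, reference, reference_mask, mode="style"):
    """mode='style': keep the source content, take medium/fine styles from the reference.
    mode='content': keep the source style, take coarse codes and residuals from the reference."""
    # Broadcast the single W-space code to a per-layer (W+-like) layout.
    w = style_transformer(source).unsqueeze(1).repeat(1, NUM_STYLE_LAYERS, 1)
    # Coarse layers come from the Seg Encoder (assumed shape: B x COARSE_LAYERS x 512).
    w[:, :COARSE_LAYERS] = seg_encoder(source_mask)
    # Combined residual offsets from HyperStyleA and HyperStyleB (Delta theta_1 + Delta theta_2).
    offsets = hypernets(source, source_mask)

    if mode == "style":
        w_ref = style_transformer(reference).unsqueeze(1).repeat(1, NUM_STYLE_LAYERS, 1)
        w[:, COARSE_LAYERS:] = w_ref[:, COARSE_LAYERS:]   # medium/fine style from the reference
    else:  # content mixing
        w[:, :COARSE_LAYERS] = seg_encoder(reference_mask)  # coarse content from the reference
        offsets = hypernets(reference, reference_mask)       # residuals from the reference

    return generate_with_offsets(w, offsets)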
3.3 Loss Function
The Style Transformer model is trained in
advance, and the HyperStyle networks are trained on top
of it. The loss function of the model is the same as
that of HyperStyle (Alaluf et al., 2021), as shown in
the following formula:

$\lambda_2 \mathcal{L}_2(x, \hat{y}) + \lambda_{\text{sim}} \mathcal{L}_{\text{sim}}(x, y, \hat{y}) + \lambda_{\text{perc}} \mathcal{L}_{\text{LPIPS}}(x, \hat{y})$    (2)

x and y are identical in this task, and both are im-
ages from the original data set. $\hat{y}$ is the output of
StyleGAN with the adjusted parameters. For the similarity
loss, an identity-based face recognition model is of-
ten used for tasks that generate and edit face images
(Richardson et al., 2021). In our study, because the
focus is on editing background images, MoCo-based
(Tov et al., 2021) similarity loss is used.
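A sketch of the combined objective in Equation (2), assuming the lpips package for the perceptual term and abstracting the MoCo-based similarity loss as a feature extractor whose embeddings are compared with cosine similarity:

import torch
import torch.nn.functional as F
import lpips  # pip install lpips; provides the LPIPS perceptual metric

lpips_loss = lpips.LPIPS(net="alex")

def hyperstyle_loss(x, y_hat, feature_extractor, lam2=1.0, lam_sim=0.5, lam_perc=0.8):
    """lambda_2 * L2 + lambda_sim * (1 - cosine similarity of features)
    + lambda_perc * LPIPS, as in Equation (2). Here y == x, so only x and y_hat appear."""
    l2 = F.mse_loss(y_hat, x)
    feat_x, feat_y = feature_extractor(x), feature_extractor(y_hat)
    sim = 1.0 - F.cosine_similarity(feat_x, feat_y, dim=-1).mean()
    perc = lpips_loss(y_hat, x).mean()
    return lam2 * l2 + lam_sim * sim + lam_perc * perc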
4 EXPERIMENTS
4.1 Implementation Details
It is necessary to train the model on a dataset with
as many scenes as possible. We use the ADE20K
dataset (Zhou et al., 2017). The dataset has 20,210
training images and 2,000 validation images of background
scenes. On this dataset, we use a Style-
GAN2 (Karras et al., 2020) Generator pre-trained for 200,000
iterations and a Style Transformer (Hu et al., 2022) pre-trained for
100,000 iterations. The output is an image with a reso-
lution of 256 × 256, and the same resolution is used for both the
input and the semantic segmentation images. Our method
is implemented in Pytorch. We train the models of the
method, namely HyperStyleA, B, and Seg Encoder,
for 200,000 iterations. We set $\lambda_2 = 1.0$, $\lambda_{\text{perc}} = 0.8$,
and $\lambda_{\text{sim}} = 0.5$ in the loss function. Ranger (Wright,
2019) is employed as the optimization approach with
a learning rate of 0.0001.
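A hypothetical training setup matching the hyperparameters above, assuming the Ranger implementation from the cited repository exposes the standard PyTorch optimizer interface:

import torch
from ranger import Ranger  # from the cited lessw2020/Ranger-Deep-Learning-Optimizer repository

def build_optimizer(hyperstyle_a, hyperstyle_b, seg_encoder, lr=1e-4):
    """Only the two HyperStyle networks and the Seg Encoder are trained;
    the StyleGAN2 Generator and Style Transformer stay frozen."""
    params = (list(hyperstyle_a.parameters())
              + list(hyperstyle_b.parameters())
              + list(seg_encoder.parameters()))
    return Ranger(params, lr=lr)

# Loss weights from Equation (2) and the iteration budget used in our experiments.
LAMBDA_2, LAMBDA_PERC, LAMBDA_SIM = 1.0, 0.8, 0.5
TOTAL_ITERATIONS = 200_000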
4.2 Reconstruction Quality Evaluation
The reconstruction quality is evaluated using qualita-
tive and quantitative evaluation.
Table 1: Quantitative results of background image reconstruction quality, with FID and KID scores, on the ADE20k (Zhou et al., 2017)
validation data. For each metric, lower (↓) or higher (↑) is better.

                                   Quality of Image Reconstruction                      Fidelity
Method                             L2 (↓)    LPIPS (↓)   PSNR (↑)    MS-SSIM (↑)   FID (↓)   KID ×10^3 (↓)
Proposed method                    0.06412   0.27514     18.25542    0.57358       44.97     18.370
Style Transformer + HyperStyle     0.05276   0.22515     19.11787    0.64569       47.50     19.990
pSp + HyperStyle                   0.05982   0.28317     18.54680    0.59826       58.06     26.121
HyperStyle                         0.05547   0.23650     18.88020    0.62725       48.10     19.907
Style Transformer                  0.10120   0.45700     16.20895    0.36557       159.99    113.650
pSp                                0.08000   0.36000     17.22778    0.47954       78.22     39.871
W Encoder                          0.10000   0.44000     16.14369    0.36623       169.00    110.610
The evaluation of reconstruction quality shows
that the results of our method are visually compara-
ble to the approaches using HyperStyle in columns 3
to 5 in Figure 7. In particular, the proposed method
reproduces the shape of the image better than the ap-
proach using pSp as an encoder (column 5). In addi-
tion, the entire image is blurred for Style Transformer
(column 6), which is a simple encoder. The proposed
method produces reconstructed images that are closer to
the original input image.
In the experiment, a quantitative evaluation is con-
ducted using the same indicators as for HyperInverter
(Dinh et al., 2022), a method similar to HyperStyle.
We evaluated the quality of image reconstruction us-
ing L2 distance, Learned Perceptual Image Patch
Similarity (LPIPS) (Zhang et al., 2018), Peak Signal
to Noise Ratio (PSNR) and multi-scale structural sim-
ilarity (MS-SSIM) (Wang et al., 2003), and the real-
ism of the images using the Fréchet Inception Distance (FID)
(Heusel et al., 2017) and Kernel Inception Distance
(KID) (Bińkowski et al., 2018) metrics, which are of-
ten used for GAN methods. The FID and KID metrics
measure the fidelity between real and generated im-
ages; the lower these scores are, the closer the
generated images are to the real ones. The results are shown
in Table 1. It compares (1) the three encoder meth-
ods, W Encoder, pSp, and Style Transformer used in
our method, and (2) the approaches that adapt Hyper-
Style to them. The HyperStyle row in Table 1 is the
approach in which the encoder is the W Encoder. The results
show that our method achieves the best FID and KID scores,
i.e., the most realistic images, although its recon-
struction quality is slightly inferior to that of the other methods
using HyperStyle.
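For reference, the per-image reconstruction terms can be computed along the following lines; this is a plain L2/PSNR sketch, while LPIPS, MS-SSIM, FID, and KID would come from their reference implementations.

import torch

def l2_and_psnr(x, y_hat, max_val=1.0):
    """Pixel-wise L2 distance and PSNR between an input image x and its
    reconstruction y_hat, both assumed to be scaled to [0, max_val]."""
    mse = torch.mean((x - y_hat) ** 2)
    psnr = 10.0 * torch.log10(max_val ** 2 / mse)
    return mse.item(), psnr.item()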
4.3 Editability Evaluation
The evaluation is based on two criteria: The result of
mixing the style only while preserving the content of
the original image (Figure 8), and the result of mixing
Figure 8: Result of style editing. This is style-only mixing,
and the content component is unchanged. The first row and
the first column are the input images. The style of the image
in each column is edited to the style of the image in the first
row.
the content only while preserving the original image
style (Figure 9).
In Figure 8, the style is transferred from the im-
age in the first row onto the image in each column.
It can be seen that the global style of the image in
the first column has been changed and the content of
the original image has been retained. For example, as
shown in the second row, the sky has changed from
clear to cloudy. Similarly, in the third row, the sky
has changed, but the shape of the cloud remains the
same. On the other hand, Figure 9 shows that the im-
age content is that of the reference image while the
source image’s style is retained. It can be seen that
the shape of the buildings and the ground has changed
without any color change. These results show
that style and content can be edited separately, indi-
cating that our method has high editability.
Figure 9: Result of content editing. The first row and the
first column are the input images. This is content-only mix-
ing, and the image's style is fixed. The content of the image
in each column is edited to the content of the image in the
first row.
5 DISCUSSION
In previous research, it was difficult to edit image
style and content separately for highly diverse do-
mains such as background images. Traditional
image editing approaches based on latent variable manip-
ulation, as in pSp, also slightly change the style of the
image when trying to edit its content.
Figures 8 and 9 show that our method can edit the im-
age style and content independently. The advantage
of our method is that either the style or the content
can be edited without the dependency between style and
content that affects conventional image editing.
Table 1 shows that other HyperStyle-based ap-
proaches without semantic segmentation tend to have
better metrics on reconstruction quality than the pro-
posed method. We assume that the reason lies in how
the two HyperStyle networks are trained: the amount
of parameter change applied to the Generator and the
number of HyperNetworks to be trained have both
increased, which may have prevented fine adjustment
of the parameters. However, the evalua-
tion by FID and KID scores shows better results than
the other methods, indicating its superiority in terms
of fidelity. These results suggest that improving the
quality of reconstructed images is an issue.
In our method, latent variables in W space are
employed for editability; however, W+ space, such as
that used by pSp, is said to provide higher recon-
struction quality. Looking at the W Encoder and pSp
results in Table 1, W+ space indeed appears to have an
advantage in reconstruction quality. However, when
HyperStyle is applied, the reconstruction quality is
better with the encoder in W space. We assume the
reason is that the improvement in reconstruction quality
in W space depends more on the residual parameters,
which are related to Generator performance, than it
does in W+ space. Therefore, W space is more appro-
priate for a method using HyperStyle. We expect that
the reconstruction quality can be improved by refining
factors other than the latent space, such as a GAN
Inversion encoder that predicts more accurate latent
variables in W space and the residual parameters pro-
duced by HyperNetworks.
6 CONCLUSIONS
In this paper, the problem of poor reconstruction qual-
ity of GAN Inversion due to the diversity of back-
ground images is solved by HyperStyle, a method
to update the parameters of the Generator using Hy-
perNetworks. In addition, we confirm that edit-
ing content, which could not be achieved with text alone, is feasi-
ble using HyperNetworks' residual parameters. Our
method allows flexible editing of background images
with style and content separately while the quality of
reconstruction images is comparable to existing ap-
proaches, such as HyperStyle.
There are three directions for future work.
The first is to improve content editability. Mixing
of image content could be achieved using the output
of Seg Encoder and HyperNetworks’ residual param-
eters. However, editing is limited because a reference
image other than the image to be edited is required.
In the future, we would like to explore methods like
SPADE (Park et al., 2019) and SEAN (Zhu et al.,
2020) that control content editing using semantic seg-
mentation masks only. We aim to realize an intuitive
method of editing background images that considers
usability.
The next step is to improve reconstruction qual-
ity. The residual parameters produced by HyperNetworks are
important for this purpose. We need to explore a GAN
Inversion encoder that can improve the reconstruction
quality while taking them into account.
Lastly, we will use text-based style control. In the
case of a text-based image editing approach such as
StyleCLIP, style control is achieved by manipulating
latent vectors. Our method would be able to do the
same by manipulating latent variables according to
the textual content. Initially, we will apply StyleCLIP
to our method.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Numbers JP21H03496, JP22K12157.
REFERENCES
Alaluf, Y., Tov, O., Mokady, R., Gal, R., and Bermano,
A. H. (2021). Hyperstyle: Stylegan inversion with
hypernetworks for real image editing.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A.
(2018). Demystifying MMD GANs.
Dinh, T. M., Tran, A. T., Nguyen, R., and Hua, B.-S. (2022).
Hyperinverter: Improving stylegan inversion via hy-
pernetwork. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Ha, D., Dai, A. M., and Le, Q. V. (2017). Hypernetworks.
In International Conference on Learning Representa-
tions.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H.,
Fergus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Hu, X., Huang, Q., Shi, Z., Li, S., Gao, C., Sun, L., and Li,
Q. (2022). Style transformer for image inversion and
editing. arXiv preprint arXiv:2203.07932.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. CVPR.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Pro-
gressive growing of GANs for improved quality, sta-
bility, and variation. In International Conference on
Learning Representations.
Karras, T., Laine, S., and Aila, T. (2019). A style-based
generator architecture for generative adversarial net-
works. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR).
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020). Analyzing and improving the im-
age quality of StyleGAN. In Proc. CVPR.
Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. (2019).
Semantic image synthesis with spatially-adaptive nor-
malization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition.
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and
Lischinski, D. (2021). Styleclip: Text-driven manip-
ulation of stylegan imagery. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 2085–2094.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision. arXiv preprint arXiv:2103.00020.
Richardson, E., Alaluf, Y., Patashnik, O., Nitzan, Y., Azar,
Y., Shapiro, S., and Cohen-Or, D. (2021). Encoding
in style: a stylegan encoder for image-to-image trans-
lation. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR).
Roich, D., Mokady, R., Bermano, A. H., and Cohen-Or, D.
(2021). Pivotal tuning for latent-based editing of real
images. ACM Trans. Graph.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. CoRR, abs/1505.04597.
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., and Cohen-Or,
D. (2021). Designing an encoder for stylegan image
manipulation. arXiv preprint arXiv:2102.02766.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and
Catanzaro, B. (2018). High-resolution image synthe-
sis and semantic manipulation with conditional gans.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition.
Wang, Z., Simoncelli, E. P., and Bovik, A. C. (2003). Mul-
tiscale structural similarity for image quality assess-
ment. In The Thirty-Seventh Asilomar Conference
on Signals, Systems & Computers, pages 1398–1402.
Wright, L. (2019). Ranger - a synergistic opti-
mizer. https://github.com/lessw2020/Ranger-Deep-
Learning-Optimizer.
Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang,
X., and He, X. (2018). Attngan: Fine-grained text to
image generation with attentional generative adversar-
ial networks.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018). The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and
Torralba, A. (2017). Scene parsing through ade20k
dataset. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition.
Zhu, P., Abdal, R., Qin, Y., and Wonka, P. (2020). Sean:
Image synthesis with semantic region-adaptive nor-
malization. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR).