Alias-Free GAN for 3D-Aware Image Generation

Attila Szabó¹, Yevgeniy Puzikov², Sahan Ayvaz¹, Sonia Aurelio², Peter Gehler², Reza Shirvany² and Malte Alf¹
¹Zalando SE, Switzerland
²Zalando SE, Germany
peter.gehler@zalando.de, reza.shirvany@zalando.de, malte.alf@zalando.ch
Keywords: GAN, NeRF, 3D-aware, Generative AI.
Abstract: In this work we build a 3D-aware generative model that produces high-quality results with fast inference times. A 3D-aware model generates images and offers the user control over camera parameters, so that an object can be shown from different viewpoints. The model we build combines the best of two worlds in a very direct way: alias-free Generative Adversarial Networks (GANs) and Neural Radiance Field (NeRF) rendering, followed by image super-resolution. We show that fast and high-quality image synthesis is possible with careful modifications of the well-designed StyleGAN3 architecture. Our design overcomes the viewpoint inconsistency and aliasing artefacts that a direct application of a lower-resolution NeRF would exhibit. We show experimental evaluation on two standard benchmark datasets, FFHQ and AFHQv2, and achieve the best or competitive performance on both. Our method does not sacrifice speed: we can render images at megapixel resolution at interactive frame rates.
1 INTRODUCTION
3D-aware image generative models aim to generate 2D images in a way that the viewpoint is controllable by the user: the camera parameters can be specified at inference time. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) can be used to solve this task, where a generator is combined with a renderer to produce an image. GAN training can be unsupervised: only an image collection of unrelated samples and no ground truth labels are required. However, there is no out-of-the-box solution, and the task is very challenging. The details of the GAN architecture, the renderer, the 3D representation and the interplay between these modules matter a lot. In this work we propose a novel design for a 3D-aware GAN that combines the best practices of modern 2D and 3D models. It is alias-free, produces high-resolution results, is 3D-aware and has fast inference time.
The first 3D-aware GANs built on explicit 3D representations, for example voxels (Gadelha et al., 2017) and meshes (Szabó et al., 2019). More recent work (Chan et al., 2021), (Chan et al., 2022) then used volumetric rendering and a Neural Radiance Field (NeRF) (Mildenhall et al., 2022) renderer.
Figure 1: An example result of our model: three images rendered from three manually chosen viewpoints. The images are high-resolution and of high quality, with no visible artefacts. Our 3D model is trained using unlabelled 2D images without any knowledge of viewpoints at training time. In contrast to previous work, it does not employ any task-specific priors or regularization.
Compared to voxel- and mesh-based methods, NeRF parameterisations offer more flexibility and produce higher-quality images. Ideally, one would
just run a vanilla NeRF at a high resolution with
dense depth sampling. In principle, this could work
very well, but the computational cost and memory
requirements make this naive approach infeasible.
Thus, recent approaches were proposed to reduce
the requirements on computation and memory, e.g.
GIRAFFE (Niemeyer and Geiger, 2021b), where a
NeRF is used to render a feature map at a low res-
olution, followed by 2D super-resolution.
Rendering features at low resolution, however, creates image artefacts.
A slight change in the camera viewpoint creates a wobbling effect in the image. This wobbling is caused by aliasing that stems from the design of 2D convolutional networks, and it motivated the work of StyleGAN3 (Karras et al., 2021a), which was specifically designed to be alias-free. In our work we take the idea of (Karras et al., 2021a) and lift it to 3D, which lets us avoid these image artefacts. The empirical results show that our model performs better than or on par with previous 3D-aware methods that address these shortcomings with more complex components, such as extra regularization terms.
Putting the ideas together, our model generates images in three stages. First, we sample points on a 3D grid and apply alias-free convolutions to them, which produces a 3D feature grid. Second, the features are processed with volumetric rendering, which technically is a weighted sum along the depth axis of the 3D grid. The result of the rendering is a low-resolution 2D grid (image) of features. Finally, this 2D feature grid is supplied to the super-resolution network, which is an alias-free 2D convolutional network. An example result can be seen in Figure 1, a sample from our model rendered from three different viewpoints.
The key contributions of this work are:
- We design and implement a novel alias-free 3D-aware generative model that combines state-of-the-art NeRF and GAN components.
- Quantitative results show that our approach achieves state-of-the-art (SOTA) or competitive results on FFHQ and AFHQv2 at high resolutions, while maintaining interactive frame rates.
- Qualitatively, we show viewpoint consistency when we control variables such as appearance, horizontal and vertical translation, and rotation.
2 RELATED WORK AND MODEL
PRELIMINARIES
Different types of 3D-aware generative models ex-
ist, prominent examples are autoencoders (Shi et al.,
2021), diffusion models (Poole et al., 2022) (Kim and
Chun, 2022) (Wang et al., 2022) and GANs. The
GAN architecture (Szabó et al., 2019) (Kwak et al.,
2022) (Chan et al., 2022) (Xue et al., 2022) (Sun et al.,
2022) remains a strong contender in this space and is
the model of choice for our construction. In this sec-
tion we quickly review the main ingredients for this
model and in Section 3 explain our 3D extension.
2.1 GAN
A GAN consists of a pair of neural networks, a gen-
erator G and a discriminator D that compete during
training. The generator network produces novel im-
ages and the discriminator network is trained to distin-
guish between real and generated images. The origi-
nal loss function from (Goodfellow et al., 2014) is a
min_G max_D  E_{x_real}[log(D(x_real))] + E_z[log(1 - D(G(z)))],   (1)
where the training images are drawn from the real image distribution, x_real ∼ p_real, and the latent vectors are usually drawn from a normal distribution, z ∼ N(0, I). In theory, with perfect training, the generator learns to create samples x_fake = G(z) from the data distribution that are indistinguishable from real data points. In practice, however, quite some engineering and network design is required to train the models to achieve good performance, e.g. trading off the learning rates of the generator and the discriminator.
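For concreteness, a minimal training-step sketch of this objective is given below, assuming hypothetical PyTorch modules G and D (outputting raw logits) and pre-built optimizers; the generator update uses the common non-saturating variant of the loss rather than the literal minimax form.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, x_real, opt_g, opt_d, z_dim=512):
    """One alternating update of Eq. (1). G, D, opt_g and opt_d are assumed to
    exist; D outputs raw logits so that softplus gives the log-sigmoid terms."""
    z = torch.randn(x_real.size(0), z_dim, device=x_real.device)

    # Discriminator step: maximise log D(x_real) + log(1 - D(G(z))).
    d_loss = F.softplus(-D(x_real)).mean() + F.softplus(D(G(z).detach())).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating form, maximise log D(G(z)).
    g_loss = F.softplus(-D(G(z))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```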
Several GAN variants have been proposed since the original work, and the Alias-Free GAN (Karras et al., 2021a) is a modern variant that produces high-quality images and includes equivariance properties (especially the StyleGAN3-R variant). Intuitively, StyleGAN3-R emulates an implicit representation, similar to a neural network applied to each pixel location separately, which gives the RGB pixel color. StyleGAN3-R operates on band-limited continuous signals, but represents them as discrete 2D grids based on the Nyquist–Shannon sampling theorem (Shannon, 1949). When a nonlinear function is applied to a band-limited signal, the result is not necessarily band-limited. Thus, one can define an alias-free nonlinear function by filtering the result of the vanilla nonlinear function f with a low-pass filter. In the discrete domain the alias-free function F is
F(Z) = s^2 · Ш_s ⊙ (φ_s ∗ f(φ_s ∗ Z)),   (2)
where Z is the sampling grid, s is the sampling rate, φ_s is the ideal low-pass filter with band limit s/2, Ш_s is the Dirac comb, ⊙ denotes element-wise multiplication and ∗ is continuous convolution. StyleGAN3-R uses radially symmetric jinc filters to achieve rotational equivariance. Notice that this operation can only be performed by temporarily entering the continuous domain. In practice, however, it is enough to approximate it by first upsampling, then applying the vanilla function f, and finally downsampling.
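As an illustration of this upsample-apply-downsample approximation, here is a minimal 1D sketch that uses FFT-based ideal low-pass resampling in place of the windowed sinc/jinc filters of StyleGAN3-R; the upsampling factor and the periodic boundary handling are assumptions made for brevity.

```python
import numpy as np

def lowpass_resample(x, new_len):
    """Ideal low-pass resampling of a periodic 1D signal via the FFT:
    only frequencies below the Nyquist limit of the coarser grid are kept."""
    X = np.fft.rfft(x)
    keep = min(len(X), new_len // 2 + 1)
    Y = np.zeros(new_len // 2 + 1, dtype=complex)
    Y[:keep] = X[:keep]
    return np.fft.irfft(Y, n=new_len) * (new_len / len(x))

def alias_free_leaky_relu(z, up=4, alpha=0.2):
    """Approximation of Eq. (2) for f = leaky ReLU:
    upsample, apply the pointwise nonlinearity, low-pass filter and downsample."""
    n = len(z)
    hi = lowpass_resample(z, up * n)          # temporary, finer sampling grid
    hi = np.where(hi >= 0.0, hi, alpha * hi)  # the vanilla nonlinearity f
    return lowpass_resample(hi, n)            # band-limit back to the rate s
```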
[Figure 2 diagram: generator inputs z, K, M → Mapping (w) and point casting (p_ndc → p_w) → Fourier Features → AFConv_1 … AFConv_r → Renderer (using Mask and Distance Mapping) → f_{r+1} → Super-resolution Network → x_fake]
Figure 2: Our generator takes the latent z and the intrinsic and extrinsic camera parameters K and M. 3D points are sampled
on a grid, then Fourier features are computed on them, which go through the alias-free 3D convolutional layers. Then the
features are rendered and passed to the super-resolution network to get the image as an output.
2.2 3D-Aware GAN
A 3D-aware GAN (Szabó et al., 2019) can control
camera viewpoints by means of generator condition-
ing. More precisely, it takes the intrinsic and extrinsic
camera parameters K and M, respectively, as inputs.
The training objective now includes the camera parameters and reads
min_G max_D  E_{x_real}[log(D(x_real))] + E_{z,K,M}[log(1 - D(G(z, K, M)))].   (3)
The camera parameters K,M are either sampled
from a fixed distribution (as in our work) or a
viewpoint distribution can be learned alongside the
networks (Niemeyer and Geiger, 2021a).
An image generator is now composed of a neural
network NN that produces a 3D representation (e.g. a
mesh), followed by a renderer R that takes the 3D rep-
resentation and camera parameters and produces the
image. As for all GAN models, optionally, a super-
resolution network can be used to upscale the image:
x_fake = G(z, K, M) = SuperRes(R(NN(z), K, M)).   (4)
In order to train a generative model of 3D shapes from
natural 2D images, 3D GANs exploit the idea that
a realistic 3D object should yield a realistic render-
ing from any plausible viewpoint. By randomizing
the choice of the viewpoint, model training forces the
generator network to learn a 3D representation dis-
entangled from the viewpoint. The work of (Szabó
et al., 2019) provides a theory for such systems, which
is a special case of a general theory of Ambient-
GAN (Bora et al., 2018).
This design then offers several possibilities re-
garding the choice of the 3D representation. One
can use meshes (Szabó et al., 2019), voxels (Gadelha
et al., 2017) (Schwarz et al., 2022), multi-plane im-
ages (Kumar et al., 2023), radiance manifolds (Deng
et al., 2022b) (Xiang et al., 2022) (Deng et al., 2022a),
signed distance functions (Or-El et al., 2022) (Burkov
et al., 2022) (Liu and Liu, 2022); each of these representations is paired with its corresponding differentiable renderer. Arguably, the most popular representation for modern 3D GANs is the Neural Radiance Field (Chan et al., 2021) (Gu et al., 2022) (Zhou
et al., 2021) (Kaneko, 2022) (Tang et al., 2022).
2.3 Volumetric Rendering
Volumetric rendering techniques (Max, 1995) (Meetz
et al., 1991) (Rushmeier and Torrance,
1987) (Williams and Max, 1992) (Kajiya, 1986)
model the physical process of image for-
mation and are capable of representing the scene
unambiguously and accurately. A popular formula-
tion is the radiance field equation
C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt,   (5)

where T(t) = exp(-∫_{t_n}^{t} σ(r(s)) ds).   (6)
Here, T is the transmittance, σ is the density and c is the color at the location r(t) = o + t·d, where o is the camera center and d is the direction of a ray. The integral bounds t_n and t_f are the distances to the near and far plane, respectively. In practice, Eq. 5 is ap-
proximated by numerical integration, where, for each
pixel, points are sampled along the corresponding ray.
Recently, Neural Radiance Fields (NeRF) (Mildenhall et al., 2022) proposed to use volumetric rendering,
where the volume is parameterised by a Multi-layer
Perceptron (MLP): it takes 3D coordinates as inputs,
[Figure 3 diagram: camera frustum with near plane n, camera-object distance d, object radius r_obj and background radius r_bcg]
Figure 3: The grey region shows the part of the scene that is part of the object mask, i.e. potentially visible. We render the part that is both masked and inside the frustum.
and outputs the corresponding density and color. In our work, instead of an MLP, we use a 3D convolutional network, which calculates the features for the rendering on a 3D grid of points.
For 3D-aware GANs, the NeRF is conditioned on
the latent variables z, so the neural network takes
both the 3D locations and z as inputs. The neural
network is not always an MLP: the computational cost of MLPs is high, and with current hardware it is not feasible to use them for high-resolution image synthe-
sis as is. Thus, more efficient architectures were in-
troduced. VoxGRAF (Schwarz et al., 2022) uses a
sparse voxel grid to speed up computation by skip-
ping empty space. Tri-plane representations (Chan
et al., 2022) (Skorokhodov et al., 2022) (Xu et al.,
2023) run standard 2D convolutional networks, then
rearrange their outputs as three planes that are per-
pendicular to each other, then features are sampled
by projecting the 3D points onto them. This is much
faster than having to compute a full MLP forward pass
for each point.
GIRAFFE (Niemeyer and Geiger, 2021b) pro-
posed to render scenes using a low-resolution NeRF
model, followed by a super-resolution module. The
MLP in their case does not directly compute RGB
pixel values, but instead creates a high-dimensional
feature map.
In our work, we build upon StyleGAN3-R and
NeRF rendering. As they both produce and use im-
plicit representations, StyleGAN3-R can be naturally
modified to allow 3D viewpoint control. Similar to
GIRAFFE, we render a low-resolution feature map,
and then upsample it to produce high-resolution im-
ages.
3 APPROACH
In this section we present the construction of our model and show how we combine the alias-free properties of (Karras et al., 2021a) with 3D NeRF rendering. For this we need to equip the generator with an explicit rendering function that ensures geometry and viewpoint consistency across different samples. We explain every step in detail in this section; the main steps are the generation of the 3D representation, alias-free convolutions, rendering, and a final super-resolution step that maps to the target resolution. An overview of the method is shown in Figure 2.
The training procedure remains the same as before: we optimize the minimax objective w.r.t. the generator G and discriminator D parameters as in Eq. 3. The discriminator is taken as is from the vanilla StyleGAN3 implementation. During training we need to sample viewpoints and camera parameters, and we explain these choices in the respective sections.
3.1 Viewpoint Sampling
Viewpoints are parameterised by polar coordinates.
The horizontal and vertical angles are sampled inde-
pendently within the ranges of ±80° and ±20°, respectively, while the radius is sampled within [5, 25].
The camera center is placed according to the polar
coordinates and it points at the origin. The focal
length is set to be equal to the radius, thus the unit
ball around the origin always fits tightly to the ren-
dered square image. Once these parameters are given,
they determine the camera parameters K and M as de-
scribed below. Note we use dimensionless units for
the focal length and the 3D coordinates, as we do not
have any ground truth sizes in meters.
3.2 Camera Parameters
The cameras are parameterised with intrinsic K and
extrinsic M matrices:
K = [[f, 0, 0], [0, f, 0], [0, 0, 1]],   M = [[R, t], [0, 1]],   (7)

where f ∈ ℝ_+ is the focal length, R ∈ ℝ^{3×3} is a rotation matrix and t ∈ ℝ^3 is a translation; thus M ∈ ℝ^{4×4} is the (world-to-camera) matrix that describes a rigid movement.
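A minimal sketch of Sections 3.1 and 3.2 is given below: it samples a viewpoint and assembles K and M. The uniform sampling of the angles and radius, as well as the axis convention (y up, camera looking along +z towards the origin), are assumptions not stated explicitly in the text.

```python
import numpy as np

def sample_camera(rng):
    """Sample a viewpoint (Sec. 3.1) and build K and M as in Eq. (7)."""
    az = np.deg2rad(rng.uniform(-80.0, 80.0))   # horizontal angle
    el = np.deg2rad(rng.uniform(-20.0, 20.0))   # vertical angle
    d = rng.uniform(5.0, 25.0)                  # radius; focal length f = d
    f = d

    # Camera centre on a sphere of radius d, looking at the origin.
    c = d * np.array([np.cos(el) * np.sin(az), np.sin(el), -np.cos(el) * np.cos(az)])
    forward = -c / np.linalg.norm(c)                        # camera +z axis
    right = np.cross([0.0, 1.0, 0.0], forward)
    right /= np.linalg.norm(right)
    up = np.cross(forward, right)

    R = np.stack([right, up, forward])                      # world-to-camera rotation
    t = -R @ c
    K = np.array([[f, 0, 0], [0, f, 0], [0, 0, 1.0]])
    M = np.eye(4); M[:3, :3] = R; M[:3, 3] = t
    return K, M, d
```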
3.3 Normalised Device Coordinates
We use Normalised Device Coordinates (NDC) for
sampling points during the volumetric rendering.
The points in this coordinate system are p_ndc = (x_ndc, y_ndc, z_ndc) ∈ ℝ^3 and they relate to the points in the camera frame by a perspective transformation

p^h_ndc ∝ P p^h_cam,   (8)

where the superscript h denotes homogeneous coordinates, p^h_cam = (x_cam, y_cam, z_cam, 1). The perspective transformation matrix is given by

P = [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, a, b], [0, 0, 1, 0]],   (9)
where f ∈ ℝ is again the focal length, P ∈ ℝ^{4×4} and a, b ∈ ℝ. To remain consistent with K, only the entries a and b are free parameters. We set them in such a way that the transformation brings the near plane n > 0 of the camera to zero in NDC, and at depth d > 0 the Jacobian of the transformation becomes proportional to the identity matrix:

a = f·d/n,   b = -f·d.   (10)

This way, P is designed to produce the least amount of distortion in the rendered volume.
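A short sketch of Eqs. (9)-(10); it assumes the convention that z_cam > 0 in front of the camera, under which the two stated constraints give a = f·d/n and b = -f·d.

```python
import numpy as np

def perspective_matrix(f, n, d):
    """NDC projection of Eq. (9) with a, b chosen as in Eq. (10):
    the near plane n maps to z_ndc = 0 and the Jacobian at depth d is isotropic."""
    a, b = f * d / n, -f * d
    return np.array([[f, 0, 0, 0],
                     [0, f, 0, 0],
                     [0, 0, a, b],
                     [0, 0, 1, 0.0]])

# Sanity check of the two constraints: z_ndc(z) = a + b / z, so
# z_ndc(n) = f*d/n - f*d/n = 0  and  d z_ndc / d z at z = d is -b / d**2 = f / d.
```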
3.4 Scene
In our setting, we assume that the object is located at the origin of the world coordinate frame, inside a sphere with radius r_obj = 1.5 (which is slightly bigger than √2, so the sides of the frustums we use for rendering fit inside it). We also set a background radius r_bcg = 2.0 to allow the model to put some background behind the object. Figure 3 shows which part of the scene is visible. The distance between the camera and the object center is d. If a point is closer to the camera than d and is inside the sphere with radius r_obj, it is considered for rendering. Points further away than the distance d are rendered if they are inside the sphere with radius r_bcg. For calculating P in Eq. 9, we set the near plane n = d - r_obj.
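The visibility rule above can be sketched as the following masking function (used later as Mask in Eq. (16)); passing the camera centre directly instead of the full extrinsic matrix M is a simplification.

```python
import numpy as np

def scene_mask(p_w, cam_center, d, r_obj=1.5, r_bcg=2.0):
    """Sec. 3.4 visibility mask for world-space points p_w of shape (..., 3):
    use the object sphere in front of the object centre and the larger
    background sphere behind it."""
    dist_cam = np.linalg.norm(p_w - cam_center, axis=-1)   # distance to the camera
    dist_org = np.linalg.norm(p_w, axis=-1)                # distance to the origin
    near = (dist_cam < d) & (dist_org < r_obj)
    far = (dist_cam >= d) & (dist_org < r_bcg)
    return (near | far).astype(np.float32)
```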
3.5 Point Casting
We first generate points in NDC space on a regular 3D grid. The size of the grid is D × H × W, denoting the grid sizes for depth, height and width, respectively. We chose H = W = 52, which corresponds to a 32 × 32 pixel grid plus a 10-point margin on each side. We do not use margins for the depth axis and set D = 32. Excluding the margin, the ranges are x_ndc ∈ [-1, +1], y_ndc ∈ [-1, +1] and z_ndc ∈ [0, +4]. Note that the neighboring points are 2× less dense along the depth axis. We denote the point grid with an overloaded notation p_ndc, where a single point p_ndc[k, j, i] is indexed by k, j and i, which correspond to the depth and the v, u pixel coordinates, respectively. To make the notation succinct, we omit indices if they are not necessary, e.g. p_ndc[k] is a 2D slice of the 3D grid.
Next, we back-project the points to the world coordinate frame by applying the inverse perspective transformation given by the matrix (PM)^{-1}. We get a 3D grid of points p_w in the world coordinate frame:

p^h_w ∝ (PM)^{-1} p^h_ndc,   (11)

where, as before, the superscript h denotes homogeneous coordinates.
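A sketch of the point-casting step: build the NDC grid of Sec. 3.5 and back-project it with (PM)^{-1} as in Eq. (11). Pixel-centred spacing is an assumption; the paper only specifies the grid sizes, margins and NDC ranges.

```python
import numpy as np

def cast_points(P, M, D=32, H=52, W=52, margin=10):
    """Regular NDC grid (32x32 plus a 10-point margin per image axis, no depth
    margin) back-projected to world space; returns an array of shape (D, H, W, 3)."""
    s_xy = 2.0 / (W - 2 * margin)              # x, y spacing; inner points span [-1, 1]
    s_z = 4.0 / D                              # depth spacing (2x coarser), z in [0, 4]
    x = (np.arange(W) - W / 2 + 0.5) * s_xy
    y = (np.arange(H) - H / 2 + 0.5) * s_xy
    z = (np.arange(D) + 0.5) * s_z
    zz, yy, xx = np.meshgrid(z, y, x, indexing="ij")

    p_ndc = np.stack([xx, yy, zz, np.ones_like(xx)], axis=-1)   # homogeneous coords
    p_w = p_ndc @ np.linalg.inv(P @ M).T                        # Eq. (11)
    return p_w[..., :3] / p_w[..., 3:4]                         # dehomogenise
```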
3.6 Fourier Features
Similarly to StyleGAN3-R, we first compute Fourier
features (Tancik et al., 2020), to supply the network
with input, except in our case we compute them for
3D points instead of 2D pixel locations. The Fourier
features are the input to the first layer, so we denote
them f_0:

f_0 = [cos(ω_1^T p_w), sin(ω_1^T p_w), …, cos(ω_L^T p_w), sin(ω_L^T p_w)],   (12)
Fourier features are a concatenation of 2L sine and cosine waves. The ω_l ∈ ℝ^3 parameters are randomly sampled from a uniform 3D ball and fixed at the beginning of training. We choose 2L = 128.
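The Fourier-feature computation can be sketched as follows; drawing the frequencies from the unit ball is an assumption, since the paper does not state the radius (i.e. the bandwidth) of the ball.

```python
import numpy as np

def fourier_features(p_w, n_freq=64, seed=0):
    """Eq. (12): 2L-channel Fourier features of points p_w of shape (..., 3),
    with 2L = 128 frequencies fixed at the beginning of training."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_freq, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = rng.uniform(0.0, 1.0, size=(n_freq, 1)) ** (1.0 / 3.0)
    omega = dirs * radii                        # uniform samples in the unit ball
    phase = p_w @ omega.T                       # (..., n_freq) dot products
    return np.concatenate([np.cos(phase), np.sin(phase)], axis=-1)
```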
3.7 Alias-Free Convolutions
Next, we apply the StyleGAN3-R rotation-invariant
alias-free convolutional layer AFConv
1
for all of the
2D slices along the depth axis of the 3D grid. In order
to make the features alias-free, we apply a low pass
filter on them and downsample by 2× for every slice.
AFConv
1
is conditioned on w = Mapping(z),
where Mapping is an MLP, which enable condition-
ing on the latents z, thus,
f
1
[k] = AFConv
1
(down2x(f
0
[k]),w). (13)
AFConv_1 contains a convolution with a 1×1 kernel and a leaky ReLU activation. The nonlinearity of the leaky ReLU function is handled via Eq. 2: first upsampling the features, then applying the nonlinearity, then downsampling with a low-pass filter. Each subsequent layer depends on the one before it,

f_m[k] = AFConv_m(f_{m-1}[k], w),   (14)

for a total of r layers. The last features f_r are then the input to the renderer.
Table 1: Quantitative evaluation results using Fréchet Inception Distance (FID) for the FFHQ and AFHQv2-Cats datasets. The resolution of the generated images is given next to the dataset's name. Scores for the compared approaches are taken from the corresponding papers or from (Xue et al., 2022). The best and second-best scores are coloured in red and orange, respectively.

Method                                  FFHQ-256  FFHQ-512  FFHQ-1024  AFHQv2-256  AFHQv2-512
GIRAFFE (Niemeyer and Geiger, 2021b)    32        -         70.08      33.39       -
Lift. SG (Shi et al., 2021)             29.81     -         -          -           -
GRAM (Deng et al., 2022b)               17.9      -         -          14.6        -
GRAM-HD (Xiang et al., 2022)            13.00     -         12.0       7.05        7.67
GIRAFFE-HD (Xue et al., 2022)           11.93     -         10.13      12.36       -
VoxGRAF (Schwarz et al., 2022)          9.6       -         -          9.6         -
CIPS-3D (Zhou et al., 2021)             6.97      -         12.26      -           -
GMNR (Kumar et al., 2023)               9.20      6.81      6.58       -           6.01
OmniAvatar (Xu et al., 2023)            -         5.70      -          -           -
EG3D (Chan et al., 2022)                4.80      4.70      -          3.88        2.77
SURF-GAN (Kwak et al., 2022)            4.72      -         -          -           -
IDE-3D (Sun et al., 2022)               -         4.60      -          -           -
Ours                                    3.94      4.10      3.14       4.66        4.57
Note that the downsampling is necessary only for the Fourier features and not for the rest of the intermediate layers (m > 1). The perspective transformation may cause aliasing artefacts during the sampling process; these are handled by the denser sampling and the subsequent down-sampling of the Fourier features. The number of layers before rendering, r = 3, is chosen to match the number of convolutional layers in StyleGAN3 at resolution 16×16. We set the number of channels to 128.
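The slice-wise application of Eqs. (13)-(14) amounts to folding the depth axis into the batch dimension and running shared 2D layers; the sketch below assumes hypothetical StyleGAN3-style layers (`af_layers`, each taking features and the latent w) and uses average pooling as a crude stand-in for the filtered 2× downsampling of the Fourier features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicewiseAFStack(nn.Module):
    """Apply 2D alias-free layers to every depth slice of the 3D feature grid."""
    def __init__(self, af_layers):
        super().__init__()
        self.af_layers = nn.ModuleList(af_layers)   # r modulated 2D layers

    def forward(self, f0, w):
        # f0: (B, C, D, H, W) Fourier features, w: (B, w_dim) mapped latent.
        B, C, D, H, W = f0.shape
        f = f0.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)  # fold depth into batch
        f = F.avg_pool2d(f, 2)              # stand-in for the low-pass 2x downsampling
        w = w.repeat_interleave(D, dim=0)   # same latent for every slice
        for layer in self.af_layers:        # Eqs. (13)-(14), shared across slices
            f = layer(f, w)
        C2, H2, W2 = f.shape[1], f.shape[2], f.shape[3]
        return f.reshape(B, D, C2, H2, W2).permute(0, 2, 1, 3, 4)
```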
3.8 Rendering
Now that the 3D features are computed, we use them
to render the 2D image. For each point on the 3D grid,
we calculate the corresponding distance and density.
The distance between neighboring points along the depth axis is

δ[k] = ‖p_w[k+1] - p_w[k-1]‖ / 2,   (15)

where negative indices correspond to points on the margin.
The density σ at a 3D point is chosen to be the first channel of f_r. Let Mask denote a masking function associated with the scene; it determines whether a point should be used during rendering. The mask removes points from the rendering process by setting their densities to zero, and we further clip any negative values that may exist:

σ = max(0, Mask(p_w, M) ⊙ up2x(f_r^1)),   (16)

where the superscript denotes the index of the channel. Note that the mask computation requires the extrinsic camera parameters, and the features are up-sampled so that they match the sampling rate of the 3D points.
Given the densities and the distances we can perform volumetric rendering. We sum along the depth axis and numerically integrate over the points k along each ray:

f_{r+1} = Σ_{k=1}^{D} T[k] (1 - exp(-δ[k]σ[k])) up2x(f_r[k]),   (17)

where T are the transmittance values,

T[k] = exp(-Σ_{m=1}^{k-1} δ[m]σ[m]).   (18)

The result f_{r+1} is a 2D feature map.
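A minimal NumPy sketch of Eqs. (15)-(18) over the 3D grid; it assumes the features have already been upsampled to the grid resolution (the up2x in Eqs. (16)-(17)) and uses one-sided differences at the first and last depth slice instead of the paper's margin indexing.

```python
import numpy as np

def render_features(f_r, p_w, mask):
    """Volumetric rendering of the feature grid along the depth axis.
    f_r: (D, C, H, W) features, p_w: (D, H, W, 3) grid points, mask: (D, H, W)."""
    # Eq. (15): spacing between neighbouring depth samples.
    delta = np.empty(p_w.shape[:3])
    delta[1:-1] = np.linalg.norm(p_w[2:] - p_w[:-2], axis=-1) / 2.0
    delta[0] = np.linalg.norm(p_w[1] - p_w[0], axis=-1)
    delta[-1] = np.linalg.norm(p_w[-1] - p_w[-2], axis=-1)

    sigma = np.maximum(0.0, mask * f_r[:, 0])          # Eq. (16): masked densities
    tau = delta * sigma
    cum = np.concatenate([np.zeros((1,) + tau.shape[1:]), np.cumsum(tau, axis=0)[:-1]])
    T = np.exp(-cum)                                   # Eq. (18): transmittance
    weight = T * (1.0 - np.exp(-tau))                  # alpha of each depth sample
    return (weight[:, None] * f_r).sum(axis=0)         # Eq. (17): (C, H, W) feature map
```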
Notice that full 3D equivariance with the perspective camera model and the volumetric rendering is hard to define. Thus, throughout the paper we only use alias-free operations along the width and height axes and not along the depth axis. This is an approximation that we found to work well in practice.
3.9 Super-Resolution
The super-resolution part is borrowed from the StyleGAN3-R architecture: it is the convolutional network comprising the higher-resolution 2D convolutional layers. The output of the super-resolution network is the generated RGB image,

x_generated = SuperRes(f_{r+1}).   (19)
4 EXPERIMENTS
We evaluate our approach on the task of unconditional multi-view image generation on two real-world image datasets that allow a comparison with prior work. The first dataset is FFHQ (Karras et al., 2021b), a set of 70,000 human face images at 1024² pixel resolution. FFHQ exhibits considerable variation in terms of age,
Figure 4: Sample images generated by our model trained on the FFHQ-1024² and AFHQv2-Cats-512² datasets.
ethnicity and image background. Because of its high resolution we can test generation at different resolutions, namely 256², 512², and 1024². To test the gen-
eralization capability of the proposed approach, we
also conduct experiments on AFHQv2 (Choi et al.,
2020; Karras et al., 2021a), a collection of 15,000
images of animal faces at a resolution of 512² pix-
els. AFHQv2 includes three domains (cats, dogs,
wildlife), each consisting of 5,000 images. We
follow previous work and use the 5,065 cat image
subset of this dataset to evaluate our method and
compare it to the most recent SOTA image synthe-
sis methods: GMNR (Kumar et al., 2023), OmniA-
vatar (Xu et al., 2023), IDE-3D (Sun et al., 2022),
GIRAFFE-HD (Xue et al., 2022), GRAM-HD (Xi-
ang et al., 2022), SURF-GAN (Kwak et al., 2022),
VoxGRAF (Schwarz et al., 2022), EG3D (Chan et al.,
2022), Lift. SG (Shi et al., 2021), pi-GAN (Chan
et al., 2021), and GIRAFFE (Niemeyer and Geiger,
2021b). We adopt the same training setup as in previous work; we also augment both datasets with horizontal flips. In contrast to other methods, e.g. EG3D, we do not use any additional pose estimators, adaptive data augmentation or transfer learning techniques.
4.1 Quantitative Results
In terms of metric-based evaluation, we assessed im-
age quality with the FID (Heusel et al., 2017), a com-
mon metric used to estimate the distance between
generated and real images. To compute the FID scores
for the proposed approach, we used 50,000 images
(a) GIRAFFE (b) piGAN
(c) Lifting StyleGAN (d) GIRAFFE-HD
(e) EG3D (f) Ours
Figure 5: Qualitative comparison between our approach and recent SOTA methods on the FFHQ-256² dataset.
generated by a trained model and all real images
from the respective dataset. As can be seen from Ta-
ble 1, the proposed approach demonstrates compet-
itive FID performance on both datasets, surpassing
prior work on all FFHQ variants and being the sec-
ond best on AFHQ. The results on AFHQv2 are be-
hind those achieved by the best approach EG3D. We
attribute this to the difference in pose distribution be-
tween FFHQ and AFHQv2 where AFHQv2 is much
more complex and diverse. For EG3D a pose esti-
mator is used so that a pose distribution is known at
training time, for simplicity we choose the same dis-
tribution for both FFHQ and AFHQv2.
Following (Chan et al., 2022), we evaluate multi-
view facial identity consistency (ID) by calculating
the mean ArcFace (Deng et al., 2019) cosine similar-
ity score between pairs of views of the same synthe-
sized face rendered from random camera poses. As
can be seen from Table 2, our approach compares
favourably with the current SOTA contenders.
Table 2: Multi-view identity consistency (ID) for FFHQ. We indicate the image resolution used for training and evaluation.

Method     Resolution  ID
GIRAFFE    256²        0.64
π-GAN      128²        0.67
Lift. SG   256²        0.58
EG3D       256²        0.76
EG3D       512²        0.77
SURF-GAN   128²        0.66
IDE-3D     512²        0.76
Ours       256²        0.73
Ours       512²        0.76
Ours       1024²       0.78
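The ID protocol can be sketched as follows; `generator`, `embed` (an ArcFace feature extractor) and `sample_pose` are hypothetical callables standing in for the real pipeline, and the number of identities is an arbitrary choice.

```python
import numpy as np

def id_consistency(generator, embed, sample_pose, n_ids=1024, z_dim=512, seed=0):
    """Mean cosine similarity between ArcFace embeddings of two random views
    of the same synthesised identity (protocol of Table 2, sketched)."""
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n_ids):
        z = rng.standard_normal(z_dim)
        e1 = embed(generator(z, *sample_pose(rng)))     # first random view
        e2 = embed(generator(z, *sample_pose(rng)))     # second random view
        sims.append(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))
    return float(np.mean(sims))
```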
Figure 6: Style mixing with our approach on FFHQ-1024 with mixing regularization. We take the coarse layers (0-7) from the images in the first column and the fine layers (8-15) from the images in the first row. Coarse layers determine the facial traits, hair length and hairstyle. Fine layers influence the skin tone and hair colouring.
4.2 Qualitative Results
In Figure 4 we show some sample images generated
by our approach on both datasets with the highest
available resolution, FFHQ-1024 and AFHQv2-512.
Manual examination of the images verifies the high
quality, viewpoint consistency and diversity of the
outputs.
To put the results into context, we follow previ-
ous work and compare the image samples generated
by the competing approaches side-by-side, shown in
Figure 5. Some of the methods have clearly visi-
ble artefacts. For example, the faces generated by
GIRAFFE exhibit a halo around the hair region, the
hair strands are also inconsistently positioned when
looked at from different viewpoints. π-GAN gener-
ates overly-smoothed faces, making them look un-
realistic. Lifting StyleGAN generates well-formed
faces, but struggles with capturing details (note the
blur around the hair regions). Our method, on the
other hand, synthesizes high-quality images which
are viewpoint-consistent, detailed and realistic: note
the correct positioning and lack of artefacts when
generating fine details, like hair strands or earrings.
Qualitatively, our method, GIRAFFE-HD and EG3D all produce photorealistic results. Many images have the effect that the eyes of the person look directly into the camera from all viewpoints. This is not an error in viewpoint consistency, but a well-known ambiguity: when the geometry of the eye is inverted, it causes the illusion that the eye looks at the camera all the time. As most faces in FFHQ look directly towards the camera, it is natural for the network to learn the inverted geometry, and all 3D-aware methods suffer from this.

Table 3: Rendering speed in images/second at three different rendering resolutions. All compared approaches were evaluated on a single GPU, but the corresponding numbers are taken from the original papers, so they serve as a reference, not a fair speed comparison.

Method     256²   512²   1024²
EG3D       36     35     -
GIRAFFE    181    161    -
GMNR       313    78.9   17.6
GRAM-HD    -      -      90
Lift. SG   51     -      -
pi-GAN     5      1      -
SURF-GAN   72     -      -
VoxGRAF    64     -      -
Ours       30     26     23
4.3 Speed
In Table 3 we list the inference speed of our approach
and other methods we compared against. Since we used standard components, we achieve a high throughput with the trained models of about 23 frames per second at the highest tested resolution on a single V100 GPU.
The numbers are given as a rough reference: the
approaches were benchmarked by the respective au-
thors on different hardware and with different re-
quirements. For example, while our method per-
forms end-to-end image synthesis, GRAM-HD (Xi-
ang et al., 2022) caches the manifold surfaces and HR
radiance maps as textured 3D meshes, and then runs
fast free-view synthesis with an efficient mesh raster-
izer from (Laine et al., 2020). Our method is capable
of real-time inference even at 1024² resolution, without sacrificing image quality.
4.4 Style Mixing
The ability to modulate image style by feeding two or
more different latent code vectors to different layers
of the generator at inference time is known as style
mixing (Karras et al., 2021a; Karras et al., 2020; Kar-
ras et al., 2021b). Given the fact that our approach is
based on the StyleGAN3 architecture, it is reassuring
that the style mixing abilities are preserved. In Fig-
ure 6 we can observe that there is a clear separation of
the roles the coarse and fine layers of the model take
on: coarse layers are responsible for the overall head
(a) object appearance
(b) horizontal translation
(c) vertical translation
(d) rotation
Figure 7: Conditioning and control: varying a) the latent z
that controls appearance, b) the camera matrix M for hori-
zontal translation, c) vertical translation and d) rotation.
pose, coarse face details and hairstyle, while fine layers control the skin tone and hair colouring.
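Style mixing only requires swapping the per-layer w codes at inference time; a sketch is given below for a StyleGAN-like generator with `mapping` and `synthesis` sub-networks (these attribute names, the layer count and the 8/16 split are assumptions mirroring Figure 6, not the exact API).

```python
import torch

@torch.no_grad()
def style_mix(G, z_coarse, z_fine, n_layers=16, split=8):
    """Feed w from z_coarse to layers 0..split-1 and w from z_fine to the rest."""
    w_a = G.mapping(z_coarse)                  # (B, w_dim) for the coarse layers
    w_b = G.mapping(z_fine)                    # (B, w_dim) for the fine layers
    w = torch.stack([w_a] * split + [w_b] * (n_layers - split), dim=1)  # (B, L, w_dim)
    return G.synthesis(w)
```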
4.5 3D Controllability
In Figure 7 we visually demonstrate that our method is capable of generating 3D-controllable images, a capability that emerges naturally from volume rendering of the input Fourier features. To explore this capability, we decompose the camera motion into the individual components of (1) vertical translation, (2) horizontal translation and (3) spherical rotation. For vertical translation, Figure 7 (c) shows that the object identities are preserved while the viewpoints are consistent. Our qualitative results for horizontal translation and spherical rotation show compelling evidence that our method provides multi-view consistency in terms of subject identities and backgrounds.
5 DISCUSSION
In this paper we constructed a 3D-aware generative model that renders images of both high quality and high resolution, while maintaining fast inference and giving viewpoint control to the user. We have demonstrated these capabilities both qualitatively and quantitatively, while keeping the design as simple as possible.
We argue that a benefit of the proposed construction is the avoidance of extra regularization terms, dual discriminators or specialized data-augmentation strategies. The model retains the respective advantages of its ingredients "simply" by a careful combination of NeRF and the alias-free StyleGAN3-R. The training protocol follows the standard procedure of StyleGAN3-R, which is what we hoped for when starting the investigation, since specialized protocols are hard to attain and prone to be sub-optimal.
There are several limitations that we plan to ad-
dress as future work. Currently our method does not
provide 3D depth or normals as output, as they can
only be extracted at a very low 16 ×16 pixel resolu-
tion. It would require specialized depth up-sampling
for any usable resolution.
Another interesting direction could be to learn the viewpoint distribution, similar to CAMPARI (Niemeyer and Geiger, 2021a). Training a 3D-aware GAN requires a good match between the viewpoint distribution used for sampling and the one present in the training data. A mismatch, with either wider or narrower viewpoints, can lead to instability and incorrect geometry. We expect that learning the viewpoint distribution would lead to better performance, e.g. on the AFHQ dataset.
We understand the presented model and results as a promising step towards more complete 3D generation. In particular, we are interested in full 3D human generation, and our model contains some of the necessary features, such as being alias-free, high-quality and 3D-aware, to move into this more challenging domain.
ACKNOWLEDGEMENTS
We thank Nikolay Jetchev for the insights and sugges-
tions which helped us tremendously while developing
the approach. We also thank the anonymous review-
ers for their comments and suggestions.
REFERENCES
Bora, A., Price, E., and Dimakis, A. G. (2018). Ambi-
entgan: Generative models from lossy measurements.
In International Conference on Learning Representa-
tions.
Burkov, E., Rakhimov, R., Safin, A., Burnaev, E., and
Lempitsky, V. S. (2022). Multi-neus: 3d head por-
traits from single image with neural implicit functions.
IEEE Access, 11:95681–95691.
Chan, E. R., Lin, C. Z., Chan, M. A., Nagano, K., Pan, B.,
Mello, S. D., Gallo, O., Guibas, L. J., Tremblay, J.,
Khamis, S., Karras, T., and Wetzstein, G. (2022). Ef-
ficient geometry-aware 3d generative adversarial net-
works. In Conference on Computer Vision and Pattern
Recognition, pages 16102–16112.
Chan, E. R., Monteiro, M., Kellnhofer, P., Wu, J., and Wet-
zstein, G. (2021). Pi-gan: Periodic implicit generative
adversarial networks for 3d-aware image synthesis. In
Conference on Computer Vision and Pattern Recogni-
tion, pages 5799–5809.
Choi, Y., Uh, Y., Yoo, J., and Ha, J. (2020). Stargan v2: Di-
verse image synthesis for multiple domains. In Con-
ference on Computer Vision and Pattern Recognition,
pages 8185–8194.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Ar-
cface: Additive angular margin loss for deep face
recognition. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Deng, Y., Wang, B., and Shum, H.-Y. (2022a). Learning
detailed radiance manifolds for high-fidelity and 3d-
consistent portrait synthesis from monocular image.
2023 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 4423–4433.
Deng, Y., Yang, J., Xiang, J., and Tong, X. (2022b).
GRAM: generative radiance manifolds for 3d-aware
image generation. In Conference on Computer Vision
and Pattern Recognition, pages 10663–10673.
Gadelha, M., Maji, S., and Wang, R. (2017). 3d shape in-
duction from 2d views of multiple objects. In Interna-
tional Conference on 3D Vision, pages 402–411.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A. C., and Ben-
gio, Y. (2014). Generative adversarial nets. In Con-
ference on Neural Information Processing Systems,
pages 2672–2680.
Gu, J., Liu, L., Wang, P., and Theobalt, C. (2022). Stylenerf:
A style-based 3d aware generator for high-resolution
image synthesis. In International Conference on
Learning Representations.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
In Advances in Neural Information Processing Sys-
tems, pages 6626–6637.
Kajiya, J. T. (1986). The rendering equation. In Pro-
ceedings of the 13th annual conference on Computer
graphics and interactive techniques, pages 143–150.
Kaneko, T. (2022). Ar-nerf: Unsupervised learning of depth
and defocus effects from natural images with aper-
ture rendering neural radiance fields. In Conference
on Computer Vision and Pattern Recognition, pages
18387–18397.
Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J.,
Lehtinen, J., and Aila, T. (2021a). Alias-free gener-
ative adversarial networks. In Conference on Neural
Information Processing Systems, pages 852–863.
Karras, T., Laine, S., and Aila, T. (2021b). A style-
based generator architecture for generative adversar-
ial networks. IEEE Trans. Pattern Anal. Mach. Intell.,
43(12):4217–4228.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020). Analyzing and improving the im-
age quality of stylegan. In Conference on Computer
Vision and Pattern Recognition, pages 8107–8116.
Kim, G. and Chun, S. Y. (2022). Datid-3d: Diversity-
preserved domain adaptation using text-to-image dif-
fusion for 3d generative model. 2023 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 14203–14213.
Kumar, A., Bhunia, A. K., Narayan, S., Cholakkal, H., An-
wer, R. M., Khan, S. S., Yang, M., and Khan, F. S.
(2023). Generative multiplane neural radiance for 3d-
aware image generation. ArXiv, abs/2304.01172.
Kwak, J., Li, Y., Yoon, D., Kim, D., Han, D. K., and Ko, H.
(2022). Injecting 3d perception of controllable nerf-
gan into stylegan for editable portrait image synthesis.
In European Conference on Computer Vision, volume
13677 of Lecture Notes in Computer Science, pages
236–253. Springer.
Laine, S., Hellsten, J., Karras, T., Seol, Y., Lehtinen, J.,
and Aila, T. (2020). Modular primitives for high-
performance differentiable rendering. ACM Trans.
Graph., 39(6):194:1–194:14.
Liu, F. and Liu, X. (2022). 2d gans meet unsupervised
single-view 3d reconstruction. In European Confer-
ence on Computer Vision, volume 13661, pages 497–
514. Springer.
Max, N. (1995). Optical models for direct volume render-
ing. IEEE Transactions on Visualization and Com-
puter Graphics, 1(2):99–108.
Meetz, K., Meinzer, H., Baur, H., Engelmann, U., and
Scheppelmann, D. (1991). The heidelberg ray tracing
model. IEEE Computer Graphics and Applications,
11(06):34–43.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., and Ng, R. (2022). Nerf: represent-
ing scenes as neural radiance fields for view synthesis.
Commun. ACM, 65(1):99–106.
Niemeyer, M. and Geiger, A. (2021a). CAMPARI: camera-
aware decomposed generative neural radiance fields.
In International Conference on 3D Vision, pages 951–
961.
Niemeyer, M. and Geiger, A. (2021b). GIRAFFE: repre-
senting scenes as compositional generative neural fea-
ture fields. In Conference on Computer Vision and
Pattern Recognition, pages 11453–11464.
Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J. J., and
Kemelmacher-Shlizerman, I. (2022). Stylesdf: High-
resolution 3d-consistent image and geometry genera-
tion. In Conference on Computer Vision and Pattern
Recognition, pages 13493–13503.
Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. (2022).
Dreamfusion: Text-to-3d using 2d diffusion. ArXiv,
abs/2209.14988.
Rushmeier, H. E. and Torrance, K. E. (1987). The zonal
method for calculating light intensities in the presence
of a participating medium. ACM SIGGRAPH Com-
puter Graphics, 21(4):293–302.
Schwarz, K., Sauer, A., Niemeyer, M., Liao, Y., and Geiger,
A. (2022). Voxgraf: Fast 3d-aware image synthesis
with sparse voxel grids. In Conference on Neural In-
formation Processing Systems.
Shannon, C. E. (1949). Communication in the presence of
noise. Proceedings of the IRE, 37(1):10–21.
Shi, Y., Aggarwal, D., and Jain, A. K. (2021). Lifting 2d
stylegan for 3d-aware face generation. In Conference
on Computer Vision and Pattern Recognition, pages
6258–6266.
Skorokhodov, I., Tulyakov, S., Wang, Y., and Wonka, P.
(2022). Epigraf: Rethinking training of 3d gans. In
Conference on Neural Information Processing Sys-
tems.
Sun, J., Wang, X., Shi, Y., Wang, L., Wang, J., and Liu,
Y. (2022). IDE-3D: interactive disentangled editing
for high-resolution 3d-aware portrait synthesis. ACM
Trans. Graph., 41(6):270:1–270:10.
Szabó, A., Meishvili, G., and Favaro, P. (2019). Unsuper-
vised generative 3d shape learning from natural im-
ages. ArXiv, abs/1910.00287.
Tancik, M., Srinivasan, P. P., Mildenhall, B., Fridovich-
Keil, S., Raghavan, N., Singhal, U., Ramamoorthi,
R., Barron, J. T., and Ng, R. (2020). Fourier fea-
tures let networks learn high frequency functions in
low dimensional domains. In Conference on Neural
Information Processing Systems.
Tang, J., Zhang, B., Yang, B., Zhang, T., Chen, D., Ma, L.,
and Wen, F. (2022). 3dfaceshop: Explicitly control-
lable 3d-aware portrait generation. IEEE transactions
on visualization and computer graphics, PP.
Wang, T., Zhang, B., Zhang, T., Gu, S., Bao, J., Baltrušaitis, T., Shen, J., Chen, D., Wen, F., Chen, Q., and Guo, B. (2022). Rodin: A
generative model for sculpting 3d digital avatars us-
ing diffusion. 2023 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
4563–4573.
Williams, P. L. and Max, N. (1992). A volume density op-
tical model. In Proceedings of the 1992 workshop on
Volume visualization, pages 61–68.
Xiang, J., Yang, J., Deng, Y., and Tong, X. (2022). Gram-
hd: 3d-consistent image generation at high reso-
lution with generative radiance manifolds. ArXiv,
abs/2206.07255.
Xu, H., Song, G., Jiang, Z., Zhang, J., Shi, Y., Liu, J.,
Ma, W.-C., Feng, J., and Luo, L. (2023). Omnia-
vatar: Geometry-guided controllable 3d head synthe-
sis. 2023 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 12814–
12824.
Xue, Y., Li, Y., Singh, K. K., and Lee, Y. J. (2022). GI-
RAFFE HD: A high-resolution 3d-aware generative
model. In Conference on Computer Vision and Pat-
tern Recognition, pages 18419–18428.
Zhou, P., Xie, L., Ni, B., and Tian, Q. (2021). Cips-3d:
A 3d-aware generator of gans based on conditionally-
independent pixel synthesis. ArXiv, abs/2110.09788.