NeRF-Diffusion for 3D-Consistent Face Generation and Editing
Héctor Laria¹,², Kai Wang¹, Joost van de Weijer¹,², Bogdan Raducanu¹,² and Yaxing Wang³
¹Computer Vision Center, Barcelona, Spain
²Universitat Autònoma de Barcelona, Spain
³Nankai University, China
Keywords:
NeRF, Diffusion Models, 3D Generation, Multi-View Consistency, Face Generation.
Abstract:
Generating high-fidelity 3D-aware images without 3D supervision is a valuable capability in various applica-
tions. Current methods based on NeRF features, SDF information, or triplane features have limited variation
after training. To address this, we propose a novel approach that combines pretrained models for shape and
content generation. Our method leverages a pretrained Neural Radiance Field as a shape prior and a diffusion
model for content generation. By conditioning the diffusion model with 3D features, we enhance its ability to
generate novel views with 3D awareness. We introduce a consistency token shared between the NeRF module
and the diffusion model to maintain 3D consistency during sampling. Moreover, our framework allows for
text editing of 3D-aware image generation, enabling users to modify the style over 3D views while preserving
semantic content. Our contributions include incorporating 3D awareness into a text-to-image model, address-
ing identity consistency in 3D view synthesis, and enabling text editing of 3D-aware image generation. We
provide detailed explanations, including the shape prior based on the NeRF model and the content generation
process using the diffusion model. We also discuss challenges such as shape consistency and sampling satu-
ration. Experimental results demonstrate the effectiveness and visual quality of our approach.
1 INTRODUCTION
Producing high-fidelity images in a 3D-consistent
manner while simultaneously capturing the geometry
of objects from image collections without the need for
any 3D supervision is referred to as 3D-aware gener-
ative image synthesis (Xia and Xue, 2022). This ca-
pability is valuable in various applications, including
virtual reality, robotics, and content creation. While
current methods excel at 3D shape estimation us-
ing NeRF features (Gu et al., 2022), SDF informa-
tion (Or-El et al., 2022), or triplane features (Chan
et al., 2022), they fall short in guiding the generation
process based on specific prompts or other conditional
factors once the model is trained.
On the other hand, recent advancements in text-
to-image (T2I) generators (Rombach et al., 2022;
Ramesh et al., 2021; Ramesh et al., 2022; Sa-
haria et al., 2022; Midjourney.com, 2022) have al-
lowed for realistic image synthesis based on widely
different textual descriptions, but even these models
have limitations when it comes to embracing novel
conditions. To overcome these limitations, vari-
ous solutions have been proposed, such as Control-
Net (Zhang and Agrawala, 2023), T2I-adapter (Mou
et al., 2023), GLIGEN (Li et al., 2023b), Univer-
sal Guidance (Bansal et al., 2023), which enable T2I
models to accept additional conditions like sketches,
segmentation maps, bounding boxes, and layout im-
ages. However, these approaches do not address the
problem of 2D novel view synthesis. A naive solu-
tion is to condition the Stable Diffusion model with
3D features extracted from 3D-aware image synthe-
sis methods, as shown in Fig. 2 (top). However,
this approach does not guarantee identity preserva-
tion. Another issue is how to extend T2I models with
3D capabilities. Several methods (Poole et al., 2022;
Seo et al., 2023; Tang et al., 2023) have successfully
achieved novel views of a target, but these are con-
strained to learning a model for a single subject.
In this paper, we present a novel approach for gen-
erating 3D-aware image synthesis with text-editing
capability. Our method breaks down this task into
simpler sub-tasks and leverages off-the-shelf pre-
trained architectures for each stage. By doing so,
we take advantage of the rich representations learned
by these pretrained models, requiring only modest re-
sources to achieve convincing results. The proposed
approach consists of two main components: a shape
prior based on a pretrained Neural Radiance Field
(NeRF) model and a diffusion model for content gen-
eration. The NeRF model provides a robust shape rep-
resentation that is consistent among viewpoint trans-
lations. We condition the NeRF model with a shared
style vector to control the scene’s identity. The diffu-
sion model is based on Stable Diffusion (SD), a pre-
trained leading T2I generator with outstanding quality
and T2I capabilities. By conditioning the SD model
with 3D features extracted from 3D-aware image syn-
thesis methods, we enhance its capability to generate
novel views with 3D awareness. However, maintain-
ing 3D consistency during the iterative sampling pro-
cess of the diffusion model is challenging. To address
this, we introduce a consistency token that is shared
between the NeRF module and the diffusion model,
ensuring cohesion between style and shape. An addi-
tional advantage of our framework is its ability to re-
place the consistency token to modify the style among
3D views using text editing. This feature allows users
to manipulate and modify the generated views while
preserving their inherent semantic content. We show-
case these capabilities in Figure 1 (middle, bottom).
Our contributions include:
Successfully incorporating 3D awareness into a
T2I model,
Addressing the challenge of identity consistency
in 3D view synthesis and
Enabling text editing of 3D-aware image genera-
tion.
2 RELATED WORK
3D Generative Models for Novel View Synthesis
from Single-View Images. The aim of 3D-aware
generative image synthesis is to produce high-quality
images with 3D consistency, capturing object surface
details from 2D image collections without explicit 3D
supervision. Early works, like GRAF (Schwarz et al.,
2020) and its successors, leverage NeRF and adver-
sarial frameworks for realistic scene representation
and training. CIPS-3D (Zhou et al., 2021) combines
shallow NeRF and deep 2D Implicit Neural Repre-
sentation (INR) for shape and appearance synthesis,
while StyleSDF (Or-El et al., 2022) integrates SDF-
based 3D representation into StyleGAN. EG3D (Chan
et al., 2022) proposes a triplane hybrid 3D repre-
sentation, employing a StyleGAN2 generator for im-
age rendering with super-resolution. StyleNeRF (Gu
et al., 2022) enhances rendering efficiency by inte-
grating NeRF into a style-based generator for high-
resolution 3D image generation.
However, these models are often domain-specific
and lack generalization to unseen datasets. To over-
come this limitation, we propose a method based on
StyleNeRF, a state-of-the-art 3D GAN. Our approach
enables novel 3D-aware view synthesis from a single
view through fine-tuning a pretrained Text-to-Image
(T2I) model, requiring only a single finetuning itera-
tion.
3D-Aware Text-Edit Image Synthesis. DreamFu-
sion (Poole et al., 2022) pioneered Score Distillation
Sampling (SDS), utilizing a frozen diffusion model
as a critic for NeRF model learning. SDS allowed
users to specify subjects through text prompts, but it
was limited to one subject at a time during training.
3DFuse (Seo et al., 2023) addressed prompt ambigu-
ity by optimizing an initial image and introduced a se-
mantic identity space for consistency tokens. Make-
it-3D (Tang et al., 2023) learned a NeRF model and
a texture point cloud in two stages for single-object
synthesis. In contrast, our method exploits diffu-
sion model-generated views for high-quality genera-
tion without multi-stage processing. All these meth-
ods distill knowledge from a diffusion model, while
our approach uses pretrained NeRF models as geo-
metric priors, employing the diffusion model for gen-
erating consistent views across diverse subjects. Im-
portantly, our method is not limited to single-subject
generation.
T2I Model Augmentation with Additional Con-
ditions. To enrich Text-to-Image (T2I) models
with fine-grained details, studies extend T2I diffu-
sion models with additional conditioning modalities.
Composer (Huang et al., 2023) explores multiple con-
trol signals alongside text descriptions, training from
scratch with extensive datasets but incurring high
GPU and financial costs. Leveraging pretrained T2I
models like Stable Diffusion (Rombach et al., 2022),
ControlNet (Zhang and Agrawala, 2023) integrates
various controls into the denoising U-Net, enabling
task-specific guidance. GLIGEN (Li et al., 2023b),
T2I-Adapter (Mou et al., 2023), and others use large-
scale pretrained T2I diffusion models for customized
image generation, partially fine-tuning them. Some
methods (Mou et al., 2023; Hu et al., 2023) offer com-
posable controls for robust augmentation. Building on this line of work, we enhance ControlNet for 3D-aware image syn-
thesis, maintaining identity consistency with depth or
NeRF features. We introduce a module to improve
handling of real image inputs and obtain correspond-
ing 3D views.
Figure 1: Naive generation of 3D-aware content with T2I models (top). Our approach links the identity among views (middle) and is amenable to consistent editing (bottom). ControlNet can obtain high-quality visual results per frame; however, the generated images lack 3D consistency, as can be seen from the presence or absence of glasses at some viewpoints and the inconsistent background.
Editing Methods for NeRF with Prompts. Simplifying the diffusion generation to an optimization-based prompt-guided approach draws parallels with
recent studies, notably StyleNeRF (Gu et al., 2022).
This involves latent code optimization using a CLIP
loss guided by a text prompt. CLIP-NeRF (Wang
et al., 2022) disentangles a pretrained NeRF, trans-
lating the CLIP latent space to condition the NeRF
model. NeRF-Art (Wang et al., 2023) modulates
a pretrained NeRF model for shape and appearance
through a perceptual loss mechanism with CLIP.
Instruct-NeRF2NeRF (Haque et al., 2023) employs
an external diffusion model for data modification akin
to knowledge distillation. Instruct 3D-to-3D (Kamata
et al., 2023) adjusts the NeRF model directly, reminis-
cent of DreamFusion (Poole et al., 2022), using score
distillation. These methods require fine-tuning or sup-
plementary modules for each prompt-guided edit. In
contrast, our approach utilizes generic NeRF features,
adjusting appearance seamlessly without retraining or
specific adjustments for each edit. Zero-1-to-3 (Liu
et al., 2023) is closely related but requires extensive
fine-tuning of the entire diffusion model and dataset
compilation with camera extrinsics. Our methodol-
ogy achieves satisfactory outcomes without relying
on data, employing a generic explicit shape model and
only requiring a control module for shape feature in-
terpretation and diffusion guidance.
3 METHODS
Generating 3D-aware image synthesis is a challeng-
ing task that requires inferring and preserving both
geometry and textures. In this paper, we present a
novel approach that breaks down this complex task
into sub-tasks and utilizes off-the-shelf pretrained ar-
chitectures for each sub-task. Our approach enables
us to achieve high-quality results by leveraging the
rich representation of these pretrained models, requir-
ing modest additional training.
Firstly, we employ a pretrained neural radiance
field model (NeRF) to estimate a robust shape prior
that is consistent among viewpoint translations. This
prior conditions the pretrained diffusion model that
generates the final image. However, due to the iter-
ative nature of diffusion model sampling, changes in
the conditioning result in changes in the output, lead-
ing to a loss of 3D consistency. Our method solves
this by introducing a consistency token that automat-
ically maintains the desired details during the genera-
tion of new views. This token is shared between the
NeRF module and the diffusion model to achieve co-
hesion between style and shape.
The proposed framework, illustrated in Figure 2,
leverages prior knowledge from several separate mod-
els to learn a new task in a zero-shot fashion. This ap-
proach enables additional capabilities like image-to-
3D inversion, zero-shot image translation, and editing
as shown in Figure 1. In the following sections, we
introduce the method and its components in detail.
3.1 Shape Prior
We employ a pretrained Neural Radiance Field
(NeRF) representation (Mildenhall et al., 2020),
which has shown state-of-the-art results in 3D-scene
reconstruction and novel view synthesis tasks. NeRFs
are typically parametrized by multi-layer perceptrons
(MLPs) $f: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$ representing the scene, where $\mathbf{x} \in \mathbb{R}^3$ and $\mathbf{d} \in \mathbb{R}^2$ represent the position and viewing direction, respectively. The output of the MLPs includes a volume density $\sigma \in \mathbb{R}^+$ and a view-dependent color $\mathbf{c} \in \mathbb{R}^3$.
Figure 2: The proposed method consists of a controllable diffusion model augmented with a spatial prior and a consistency token for consistent geometry and identity preservation among different views.
To ensure representation coherence between the
shape and texture of the generated scenes, we condi-
tion the NeRF model with a shared style vector w (Gu
et al., 2022; Karras et al., 2020) that controls the per-
son’s identity. The conditioned module is denoted by $f_w: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}_w, \sigma_w)$, where $w = m(z)$ and $z \in \mathcal{Z}$. Following the StyleGAN (Gu et al., 2022) architecture, $m(\cdot)$ is a mapping network and $\mathcal{Z}$ is the unit Gaussian sphere where noise is drawn from.
We perform 2D aggregation on the output of the
NeRF model to reduce the dimensionality of the con-
ditioning map. This enables efficient computation of
the NeRF output and reduces the number of parame-
ters required for the final model. The modified condi-
tioning map is used to generate a view of the 3D scene
through a subsequent diffusion model.
However, rendering high-resolution images using
NeRF is computationally demanding due to its pixel-
wise ray computation. Conversely, low-resolution
renders are more efficient but may compromise 3D
consistency when performing pixel-space operations
like upsampling. To find a balance between efficiency
and consistency, we sample at the same spatial reso-
lution as the latent features $z \in \mathbb{R}^{C \times 64 \times 64}$ of the dif-
fusion model. Subsequently, as explained in the next
section, the encoder E(·) of the next stage is modified
to ensure spatial consistency.
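As an illustration of this design, the sketch below renders the conditioning map directly at the 64×64 resolution of the diffusion latents. Since the text only describes the step as 2D aggregation, compositing color and expected depth along each ray is an assumption about what the conditioning map could contain.

```python
import torch

@torch.no_grad()
def render_feature_map(f_w, w, rays_o, rays_d, near=0.1, far=2.0, n_samples=32):
    """Aggregate NeRF outputs along each ray into a per-pixel feature (here: color + depth).

    rays_o, rays_d: (H*W, 3) camera rays for a 64x64 target view.
    Returns a (4, 64, 64) conditioning map matching the spatial size of the SD latents.
    """
    H = W = 64
    t = torch.linspace(near, far, n_samples)                       # depths sampled along each ray
    pts = rays_o[:, None] + rays_d[:, None] * t[None, :, None]     # (H*W, S, 3) sample positions
    dirs = rays_d[:, None].expand_as(pts)[..., :2]                 # (H*W, S, 2) viewing directions
    w_rep = w.expand(pts.shape[0] * n_samples, -1)                 # shared style vector
    c, sigma = f_w(pts.reshape(-1, 3), dirs.reshape(-1, 2), w_rep)
    c = c.reshape(H * W, n_samples, 3)
    sigma = sigma.reshape(H * W, n_samples)

    delta = t[1] - t[0]
    alpha = 1 - torch.exp(-sigma * delta)                          # standard volume rendering
    T = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                 1 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * T                                            # (H*W, S) compositing weights
    rgb = (weights[..., None] * c).sum(1)                          # composited color
    depth = (weights * t[None]).sum(1, keepdim=True)               # expected depth
    feat = torch.cat([rgb, depth], dim=-1)                         # (H*W, 4)
    return feat.T.reshape(4, H, W)
```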
3.2 Content Generation
There are several image generation models that
can incorporate spatial conditioning, such as GANs,
VAEs, and NeRFs. In this study, we choose diffusion
models due to their high quality, output diversity, and
flexible inference conditioning, including their ability
to allow for text-based image editing.
One way to train a conditioned generation model
is by means of a ControlNet (Zhang and Agrawala,
2023) approach, which learns a zero-convolution
gated module to interact with the decoder of the
frozen generative model.
Figure 3: Comparison of sample quality. Identity is correctly maintained even on saturated conditioning (right).
However, we found that a naive application of ControlNet to this problem suffers from identity drift, referring to the fact that dif-
ferent view conditions modify the denoising path.
This effect leads to a significantly different output
even with the same initial conditions, as seen in Fig-
ure 1 (top). Our hypothesis is that this occurs be-
cause, at each step, the diffusion model lacks con-
straints within the space of possible textures to ap-
proximate. Consequently, it tends to derive general
(and thus inconsistent) guidelines from both the cur-
rent denoising state $x_t$ and the text conditioning.
To secure the identity among random states, we propose learning an injective translator function $f: \mathcal{W} \to \mathcal{T}$ that maps any style-space instance in $\mathcal{W}$ to the token space $\mathcal{T}$. This enables us to condition the diffusion model on a concrete style, similar to how text would condition the model. The translator is implemented as an adapter network, $t_w = A(w)$, that maps the style vector $w$ to the corresponding textual token $t_w$; the conditioning modules and the adapter network are learned jointly. We call $t_w$ an identity token throughout the text, as it condenses all the information of the current style and can act as a powerful word embedding. In fact, parametrizing $A(\cdot)$ as a single linear layer is sufficient to harness this conditioning. Our final architecture
(Figure 2) produces significantly improved results,
as shown in Figure 1 (middle). To avoid interfering
with the text conditioning capabilities of the diffusion
model, we repurpose some unused text tokens to in-
put this information into the network cross-attention
mechanism, a common practice in other works (Gal
et al., 2022; Han et al., 2023; Li et al., 2023a).
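A minimal sketch of this adapter and of how the identity token could be written into a spare token slot follows; the 77×768 text-embedding shape is taken from Stable Diffusion's CLIP encoder, while choosing the last slot as the "unused" token is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class IdentityTokenAdapter(nn.Module):
    """A(.): a single linear layer mapping the style vector w to an identity token t_w."""
    def __init__(self, w_dim=512, token_dim=768):
        super().__init__()
        self.proj = nn.Linear(w_dim, token_dim)

    def forward(self, w):
        return self.proj(w)                       # (B, token_dim)

def inject_identity_token(text_embeddings, t_w, slot=-1):
    """Write the identity token into an unused slot of the text-encoder output
    so it reaches the cross-attention layers alongside the prompt tokens."""
    text_embeddings = text_embeddings.clone()
    text_embeddings[:, slot] = t_w
    return text_embeddings

# Usage (shapes follow Stable Diffusion's CLIP text encoder: 77 tokens x 768 dims).
adapter = IdentityTokenAdapter()
w = torch.randn(2, 512)                           # shared style vectors for a batch of 2
prompt_emb = torch.randn(2, 77, 768)              # placeholder for the text-encoder output
cond = inject_identity_token(prompt_emb, adapter(w))
```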
To ensure shape consistency, we modify the input
encoder E(·) of ControlNet by replacing the upsam-
pling and following downsampling blocks with 1×1
convolutions to avoid pixel-space operations that may
cause spatial inconsistency.
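A sketch of the modified input encoder E(·) is shown below: only 1×1 convolutions are used, so every pixel of the 64×64 NeRF feature map is transformed independently and no spatial resampling occurs. The channel widths are illustrative, and the zero-initialized output layer follows the ControlNet convention.

```python
import torch
import torch.nn as nn

class ShapeFeatureEncoder(nn.Module):
    """E(.): maps the C x 64 x 64 NeRF conditioning map to the U-Net feature width
    using only 1x1 convolutions, so no pixel is mixed with its neighbours."""
    def __init__(self, in_channels=4, hidden=64, out_channels=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.SiLU(),
            nn.Conv2d(hidden, out_channels, kernel_size=1),
        )
        nn.init.zeros_(self.net[-1].weight)       # zero-init output, ControlNet-style
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, cond_map):                  # (B, C, 64, 64)
        return self.net(cond_map)                 # (B, 320, 64, 64), same spatial size
```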
3.3 Sampling Saturation
In our proposed method, we employ a conditioning
module during the sampling process to guide the gen-
eration of images with a specific shape, and we utilize classifier-free guidance (Ho and Salimans, 2021) for texture and view-conditional details such as light reflections and shadows. However, we have observed that the use of classifier-free guidance can result in style saturation, leading to excessive values for color, lighting, and other accents at any guidance scale. We call this phenomenon sampling saturation. A similar effect can be observed when the guidance scale is excessively increased, as shown in Figure 3.
Table 1: Method ablation of all the proposed contributions. TL1, TL2, TL4 are view consistency metrics. FI is an identity consistency metric. FID is an image quality metric.
Method TL1 TL2 TL4 FI FID
ControlNet 0.1875 0.1966 0.2160 0.2852±0.13 29.76
+ identity (saturated) 0.1489 0.1717 0.2075 0.0840±0.03 38.33
+ identity 0.1332 0.1545 0.1786 0.1188±0.05 32.67
Concretely, in order to overcome this undesirable effect, our method extends the text conditioning of $N$ token embeddings to include an identity token, $y = (t_{1,\dots,N}, t_{id})$. The conditioning is sampled as usual,
$$\nabla_x \log p_\gamma(x \mid y) = \nabla_x \log p(x) + \gamma\big(\nabla_x \log p(x \mid y) - \nabla_x \log p(x)\big), \quad (1)$$
where $\gamma$ is the guidance scale. Let us consider the conditioned model $p^{CN}(x)$ that adds residual information to the frozen generative model, effectively modifying the output distribution to $p(x)\,r(x \mid c)$, where $p(x)$ is the prior knowledge and $r(x \mid c)$ is the learned residual control given the conditioning input $c$. If we apply guidance sampling to this model, it yields
$$\begin{aligned} \nabla_x \log p^{CN}_{\gamma}(x \mid y, c) &= \nabla_x \log p^{CN}(x) + \gamma\big(\nabla_x \log p^{CN}(x \mid y, c) - \nabla_x \log p^{CN}(x)\big) \\ &= \nabla_x \log p(x) + \nabla_x \log r(x \mid c) + \gamma\big(\nabla_x \log p(x \mid y) + \nabla_x \log r(x \mid y, c) \\ &\qquad - \nabla_x \log p(x) - \nabla_x \log r(x \mid c)\big). \quad (2) \end{aligned}$$
We observe that the unconditional control term $\nabla_x \log r(x \mid c)$ lacks the identity consistency information embedded in $y$, which causes it to push the evaluation away from the desired identity $t_{id}$ at each step. To overcome this mismatch, we introduce the identity token back into the unconditional term for aligned generation, which leads to the following equation used for sampling:
$$\begin{aligned} \nabla_x \log p^{CN}_{\gamma}(x \mid t_{1,\dots,N}, t_{id}, c) &= \nabla_x \log p(x) + \nabla_x \log r(x \mid c, t_{id}) \\ &\quad + \gamma\big(\nabla_x \log p(x \mid t_{1,\dots,N}, t_{id}) + \nabla_x \log r(x \mid t_{1,\dots,N}, t_{id}, c) \\ &\qquad - \nabla_x \log p(x) - \nabla_x \log r(x \mid c, t_{id})\big). \quad (3) \end{aligned}$$
By including the identity token in the uncondi-
tional term of the guidance, we ensure that the model
is aware of the true identity of the object being gener-
ated, even if the learned conditioning is heavily biased
towards other text conditioning. This prevents the ac-
cumulation of identity error during iterative sampling
and leads to more stable and accurate image genera-
tion. This adjustment can be extended to any work
that uses this type of conditioning on pretrained mod-
els and incorporates information not present in the
spatial conditioning input c.
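A sketch of one guidance step implementing Eq. (3) is given below: the identity token t_id is kept in both branches, and only the plain text tokens are dropped from the "unconditional" one. The unet/controlnet call signatures and the zero-embedding used as an empty prompt are assumptions of this sketch, not the actual Stable Diffusion or ControlNet APIs.

```python
import torch

def guided_eps(unet, controlnet, x_t, t, text_tokens, t_id, cond_map, guidance_scale=5.0):
    """Classifier-free guidance where the identity token stays in both branches.

    text_tokens: (B, N, D) prompt embeddings; t_id: (B, 1, D) identity token.
    unet(x, t, context, residuals) and controlnet(x, t, context, cond_map) are assumed
    to return an eps-prediction and control residuals, respectively.
    """
    # Conditional branch: full prompt + identity token.
    ctx_cond = torch.cat([text_tokens, t_id], dim=1)
    eps_cond = unet(x_t, t, ctx_cond, controlnet(x_t, t, ctx_cond, cond_map))

    # "Unconditional" branch: empty prompt (zeros here), but the identity token is kept (Eq. 3).
    empty = torch.zeros_like(text_tokens)
    ctx_uncond = torch.cat([empty, t_id], dim=1)
    eps_uncond = unet(x_t, t, ctx_uncond, controlnet(x_t, t, ctx_uncond, cond_map))

    # Standard CFG combination with guidance scale gamma.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```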
3.4 Training
During training, we first sample a tuple of 2D aggre-
gated NeRF features c, style vector w and generated
image x from a pretrained frozen StyleNeRF model.
The features are passed to the ControlNet module,
along with the conditioning text and style (now con-
verted to identity token by the adapter), to produce
residuals for the frozen diffusion model. The frozen
diffusion model receives a noised version of the image
x, the conditioning text and the residuals produced, to
generate an ε output following ε-parametrization from
Stable Diffusion. The whole architecture is driven
by the L2 distance between the estimated and added
noise at that denoising step.
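A compact sketch of one such training step, under the following assumptions: the frozen StyleNeRF supplies (conditioning map, style vector, image); the frozen SD autoencoder, text encoder and U-Net are available as callables; the noise scheduler exposes a diffusers-style interface; and only the ControlNet branch and the adapter receive gradients.

```python
import torch
import torch.nn.functional as F

def training_step(batch, vae_encode, text_encode, unet, controlnet, adapter,
                  noise_scheduler, optimizer, prompt="A photo of a person"):
    cond_map, w, image = batch                      # sampled from the frozen StyleNeRF

    with torch.no_grad():
        latents = vae_encode(image)                 # frozen SD autoencoder
        text_emb = text_encode([prompt] * latents.shape[0])

    # Inject the identity token produced by the adapter into the text conditioning.
    t_w = adapter(w)                                # (B, 768)
    text_emb = torch.cat([text_emb, t_w.unsqueeze(1)], dim=1)

    # Standard eps-parametrized diffusion objective.
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.num_train_timesteps, (latents.shape[0],),
                      device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    residuals = controlnet(noisy_latents, t, text_emb, cond_map)   # trainable branch
    eps_pred = unet(noisy_latents, t, text_emb, residuals)         # frozen SD U-Net

    loss = F.mse_loss(eps_pred, noise)              # L2 between predicted and added noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```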
It is important to highlight that, despite 3D-aware
generative models like StyleNeRF and EG3D lacking
text-editing capabilities compared to current diffusion
model methods, we can inherit the text-editing prop-
erty of the original text-to-image diffusion model in
Stable Diffusion by conditioning it with NeRF fea-
tures. Importantly, this inheritance occurs without re-
quiring additional training or model augmentations.
4 EXPERIMENTS
4.1 Experimental Setup
Dataset. For training, we use a pretrained StyleNeRF
on FFHQ (Karras et al., 2021) in a zero-shot fashion,
simply using the generated output resized to 512 pixel
resolution.
Figure 4: Generation of novel views given a real image, and generation of 3D-aware content based on text editing of the original subject.
Figure 5: Further generation of 3D-aware content based on text editing.
Evaluation metrics. We assess our method using a temporal loss (TL) (Wang et al., 2020) to mea-
sure scene consistency, since the problem we are fac-
ing in evaluating the quality of our outputs is simi-
lar to those found in video generation problems. It
is a crucial metric to ensure that the generated se-
quences exhibit smooth transitions, maintain realism
over time, and align with human perception. To fur-
ther evaluate our method, we make a sequence where
we have subject and background and only vary the
viewing angle. To make sure the identity is consis-
tent, we employ one more face-specific metric we
called Face Identity (FI), by comparing face embed-
dings from a VGG-Face network of different views to
a reference front-facing generation. We use the Light-
Face (Serengil and Ozpinar, 2020) framework for this
metric, by computing the mean of all distances of all
views to the reference image. We also show the stan-
dard deviation for completeness. Additionally, we measure visual quality to make sure the approach retains the same or similar quality as the base models, although this measure does not take 3D consistency into account. We use the well-known Fréchet Inception Distance (FID) to compare sets of single frames against real data.
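A sketch of how FI can be computed, assuming a generic embed() callable that returns a VGG-Face embedding (for instance via the LightFace/deepface framework); the use of cosine distance is an assumption of the sketch, as the text only specifies a mean distance of all views to the reference image.

```python
import numpy as np

def face_identity(reference_img, view_imgs, embed):
    """FI: mean (and std) distance between each view's face embedding and the
    embedding of the reference front-facing generation."""
    ref = np.asarray(embed(reference_img), dtype=np.float64)
    dists = []
    for img in view_imgs:
        emb = np.asarray(embed(img), dtype=np.float64)
        cos = np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb))
        dists.append(1.0 - cos)                     # cosine distance to the reference
    return float(np.mean(dists)), float(np.std(dists))
```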
4.2 Results
Ablation. In our experiments, we made several de-
sign choices to evaluate the performance and effec-
tiveness of our proposed method. Firstly, we utilized
the NeRF++ model (Zhang et al., 2020) as our base
neural radiance field model. This choice was moti-
vated by the model’s ability to disentangle foreground
and background. Additionally, the NeRF++ model
has been pretrained on faces (Gu et al., 2022), making
it suitable for our face generation task.
To provide a baseline for comparison, we employ
ControlNet and evaluate its performance. We perform
a series of experiments to extend it to achieve con-
sistent identity generation, as shown in Table 1. There
we showcase enhanced view consistency across con-
secutive frames, namely, the next, second, and fourth
frames (TL1,2,4). We also explore subject identity
consistency (FI). We attribute this improvement to the modified encoder, which is able to extract finer details
from the object and the background, as well as the in-
troduction of the consistency (or identity) token for
controllable diffusion generation.
Notably, we observe that saturated images exhibit
improved subject identity consistency. However, it is
worth noting that this effect is primarily due to the
saturation of colors, which conceals subtle variations
in the subject’s appearance across different views,
thereby enhancing the perceived consistency. Further-
more, we observe that the quality of individual images
remains high, as indicated by the FID metric, across
our experiments. It is worth noting that our primary
focus is elsewhere, and the FID metric may not fully
capture the subtleties of consistency and variations in
our results.
Sampling Saturation. To illustrate the effective-
ness of our approach in removing the effect of satu-
ration, we compared our results with those obtained
using the original iterative sampling approach with-
out identity token, as seen in Table 1. We found
that our method consistently outperforms the original
sampling method in terms of both quantitative metrics
and visual quality of generated images. Specifically,
the proposed method preserves the identity and con-
sistency better while retaining the image quality.
Prompt Editing. As argued in the introduction,
existing methods for 3D-aware novel view synthe-
sis (StyleNeRF) lack the ability to guide the gener-
ation process based on prompts (or other condition-
ing factors). Existing diffusion models do have this
property, but they do not generate 3D-aware novel
views (regular generation (Rombach et al., 2022;
Ramesh et al., 2022; Midjourney.com, 2022), Con-
trolNet (Zhang and Agrawala, 2023)) or they require a
separate model for each instance (DreamBooth (Ruiz
et al., 2023), DreamFusion (Poole et al., 2022), etc.).
To the best of our knowledge, we present the first
model which performs 3D-aware image synthesis,
with inference-time text editing capabilities.
Our method allows for the generation of novel
views and editing of real faces using a single input
image. To achieve this, we estimate the camera po-
sition, angle, and subject identity using a pretrained
projector. We then sample the NeRF to obtain the
shape structure of the face, while the texture is esti-
mated using the adapter A on regular diffusion sam-
pling. These results can be seen in Figure 4.
The method enables quick editing of a generated
or inverted real image while simultaneously gener-
ating new views. Because the learned conditioning
plays a significant role in determining the image out-
put, we make a slight adjustment in the frozen diffu-
sion model. Specifically, we increase the weight of
the desired new prompt tokens by approximately 1.5
times, aiming to enhance their influence on the gen-
eration process. Despite this modification, the output
remains consistent, as shown in Figures 4 and 5.
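One way such a roughly 1.5× token weighting can be realized is by scaling the embeddings of the edit tokens before cross-attention, as sketched below; the text does not specify the exact mechanism, so the function and the chosen token positions are illustrative.

```python
import torch

def reweight_edit_tokens(text_embeddings, edit_positions, scale=1.5):
    """Amplify the influence of the edit prompt by scaling its token embeddings.

    text_embeddings: (B, N, D) output of the text encoder (identity token included).
    edit_positions:  indices of the tokens describing the desired edit.
    """
    emb = text_embeddings.clone()
    emb[:, edit_positions] = emb[:, edit_positions] * scale
    return emb

# Usage: boost tokens 2-4 of an edit prompt such as "a photo of a metal mask".
emb = torch.randn(1, 78, 768)                       # 77 text tokens + 1 identity token
emb = reweight_edit_tokens(emb, edit_positions=[2, 3, 4])
```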
5 CONCLUSIONS
We have presented a novel approach to generate new
viewpoints from a single image by breaking down the
complex task into sub-tasks and utilizing off-the-shelf
pretrained architectures for each sub-task. Our ap-
proach enabled us to achieve high-quality results by
leveraging the rich representation of these pretrained
models and requiring only modest additional training.
Specifically, we employed a pretrained NeRF model
to estimate a robust shape prior, which we used to
condition the pretrained diffusion model for image
generation. We also introduced a consistency token
that automatically maintained the desired details dur-
ing the generation of new views, resulting in a cohe-
sive relationship between style and shape. We also identified and addressed an issue with style saturation during iter-
ative sampling with classifier-free guidance. Finally,
our proposed framework enables additional capabili-
ties such as image-to-3D inversion, zero-shot image
translation, and editing.
Limitations and Future Work. The main dependency of the method is on the 3D-aware model used as shape prior: the quality of the generated shape largely determines the consistency of the results. Another limitation is the image generation quality of these 3D-aware models; since we do not use real data, we are bound by their quality as the consistency between input and output tightens. Future work includes investigating better priors, improving generation quality beyond that of the NeRF, and better prompt editing.
ACKNOWLEDGEMENTS
This work is funded by the Spanish Gov-
ernment grants PID2022-143257NBI00,
TED2021-132513B-I00, PID2019-104174GB-
I00, MCIN/AEI/10.13039/501100011033 and by
CERCA Programme / Generalitat de Catalunya.
REFERENCES
Bansal, A., Chu, H.-M., Schwarzschild, A., Sengupta,
S., Goldblum, M., Geiping, J., and Goldstein, T.
(2023). Universal guidance for diffusion models.
arXiv preprint arXiv:2302.07121.
Chan, E. R., Lin, C. Z., Chan, M. A., Nagano, K., Pan, B.,
De Mello, S., Gallo, O., Guibas, L. J., Tremblay, J.,
Khamis, S., et al. (2022). Efficient geometry-aware
3d generative adversarial networks. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 16123–16133.
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano,
A. H., Chechik, G., and Cohen-Or, D. (2022). An
image is worth one word: Personalizing text-to-image
generation using textual inversion.
Gu, J., Liu, L., Wang, P., and Theobalt, C. (2022). Stylenerf:
A style-based 3d aware generator for high-resolution
image synthesis. In International Conference on
Learning Representations.
Han, I., Yang, S., Kwon, T., and Ye, J. C. (2023). Highly
personalized text embedding for image manipulation
by stable diffusion. arXiv preprint arXiv:2303.08767.
Haque, A., Tancik, M., Efros, A., Holynski, A., and
Kanazawa, A. (2023). Instruct-nerf2nerf: Editing
3d scenes with instructions. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion.
Ho, J. and Salimans, T. (2021). Classifier-free diffusion
guidance. In NeurIPS 2021 Workshop on Deep Gen-
erative Models and Downstream Applications.
Hu, M., Zheng, J., Liu, D., Zheng, C., Wang, C., Tao,
D., and Cham, T.-J. (2023). Cocktail: Mixing multi-
modality controls for text-conditional image genera-
tion. arXiv preprint arXiv:2306.00964.
Huang, L., Chen, D., Liu, Y., Shen, Y., Zhao, D., and Zhou,
J. (2023). Composer: Creative and controllable image
synthesis with composable conditions. arXiv preprint
arXiv:2302.09778.
Kamata, H., Sakuma, Y., Hayakawa, A., Ishii, M., and
Narihira, T. (2023). Instruct 3d-to-3d: Text in-
struction guided 3d-to-3d conversion. arXiv preprint
arXiv:2303.15780.
Karras, T., Laine, S., and Aila, T. (2021). A style-based
generator architecture for generative adversarial net-
works. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 43(12):4217–4228.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020). Analyzing and improving the im-
age quality of StyleGAN. In CVPR.
Li, J., Li, D., Savarese, S., and Hoi, S. (2023a). Blip-
2: Bootstrapping language-image pre-training with
frozen image encoders and large language models.
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., and
Lee, Y. J. (2023b). Gligen: Open-set grounded text-to-
image generation. arXiv preprint arXiv:2301.07093.
Liu, R., Wu, R., Hoorick, B. V., Tokmakov, P., Zakharov,
S., and Vondrick, C. (2023). Zero-1-to-3: Zero-shot
one image to 3d object.
Midjourney.com (2022). Midjourney. https://www.midjourney.com.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., and Ng, R. (2020). Nerf: Repre-
senting scenes as neural radiance fields for view syn-
thesis. In ECCV.
Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y.,
and Qie, X. (2023). T2i-adapter: Learning adapters
to dig out more controllable ability for text-to-image
diffusion models. arXiv preprint arXiv:2302.08453.
Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J. J., and
Kemelmacher-Shlizerman, I. (2022). Stylesdf: High-
resolution 3d-consistent image and geometry gener-
ation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
13503–13513.
Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. (2022).
Dreamfusion: Text-to-3d using 2d diffusion. arXiv.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and
Chen, M. (2022). Hierarchical text-conditional im-
age generation with clip latents. arXiv preprint
arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Rad-
ford, A., Chen, M., and Sutskever, I. (2021). Zero-shot
text-to-image generation. In International Conference
on Machine Learning, pages 8821–8831. PMLR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 10684–10695.
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and
Aberman, K. (2023). Dreambooth: Fine tuning text-
to-image diffusion models for subject-driven genera-
tion. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
ton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi,
S. S., Lopes, R. G., et al. (2022). Photorealistic text-
to-image diffusion models with deep language under-
standing. arXiv preprint arXiv:2205.11487.
Schwarz, K., Liao, Y., Niemeyer, M., and Geiger, A. (2020).
Graf: Generative radiance fields for 3d-aware image
synthesis. Advances in Neural Information Processing
Systems, 33:20154–20166.
Seo, J., Jang, W., Kwak, M.-S., Ko, J., Kim, H., Kim, J.,
Kim, J.-H., Lee, J., and Kim, S. (2023). Let 2d diffu-
sion model know 3d-consistency for robust text-to-3d
generation. arXiv preprint arXiv:2303.07937.
Serengil, S. I. and Ozpinar, A. (2020). Lightface: A hy-
brid deep face recognition framework. In 2020 Inno-
vations in Intelligent Systems and Applications Con-
ference (ASYU), pages 23–27. IEEE.
Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L.,
and Chen, D. (2023). Make-it-3d: High-fidelity 3d
creation from a single image with diffusion prior.
Wang, C., Chai, M., He, M., Chen, D., and Liao, J.
(2022). Clip-nerf: Text-and-image driven manipula-
tion of neural radiance fields. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3835–3844.
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., and Liao,
J. (2023). Nerf-art: Text-driven neural radiance fields
stylization. IEEE Transactions on Visualization and
Computer Graphics, pages 1–15.
Wang, W., Yang, S., Xu, J., and Liu, J. (2020). Consistent
video style transfer via relaxation and regularization.
IEEE Transactions on Image Processing, 29:9125–
9139.
Xia, W. and Xue, J.-H. (2022). A survey on 3d-aware image
synthesis. arXiv preprint arXiv:2210.14267.
Zhang, K., Riegler, G., Snavely, N., and Koltun, V. (2020).
Nerf++: Analyzing and improving neural radiance
fields. arXiv:2010.07492.
Zhang, L. and Agrawala, M. (2023). Adding conditional
control to text-to-image diffusion models. arXiv
preprint arXiv:2302.05543.
Zhou, P., Xie, L., Ni, B., and Tian, Q. (2021).
Cips-3d: A 3d-aware generator of gans based on
conditionally-independent pixel synthesis. arXiv
preprint arXiv:2110.09788.