Towards the Detection of Diffusion Model Deepfakes
Jonas Ricker¹ (https://orcid.org/0000-0002-7186-3634), Simon Damm¹ (https://orcid.org/0000-0002-4584-1765), Thorsten Holz² (https://orcid.org/0000-0002-2783-1264) and Asja Fischer¹ (https://orcid.org/0000-0002-1916-7033)
¹ Ruhr University Bochum, Bochum, Germany
² CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Keywords:
Deepfake Detection, Diffusion Models, Generative Adversarial Networks, Frequency Analysis.
Abstract:
In the course of the past few years, diffusion models (DMs) have reached an unprecedented level of visual
quality. However, relatively little attention has been paid to the detection of DM-generated images, which
is critical to prevent adverse impacts on our society. In contrast, generative adversarial networks (GANs)
have been extensively studied from a forensic perspective. In this work, we therefore take the natural next
step to evaluate whether previous methods can be used to detect images generated by DMs. Our experiments
yield two key findings: (1) state-of-the-art GAN detectors are unable to reliably distinguish real from DM-
generated images, but (2) re-training them on DM-generated images allows for almost perfect detection, which
remarkably even generalizes to GANs. Together with a feature space analysis, our results lead to the hypothesis
that DMs produce fewer detectable artifacts and are thus more difficult to detect compared to GANs. One
possible reason for this is the absence of grid-like frequency artifacts in DM-generated images, which are
a known weakness of GANs. However, we make the interesting observation that diffusion models tend to
underestimate high frequencies, which we attribute to the learning objective.
1 INTRODUCTION
In the recent past, diffusion models (DMs) have
shown a lot of promise as a method for synthesizing
images. Such models provide better (or at least sim-
ilar) performance compared to generative adversarial
networks (GANs) and enable powerful text-to-image
models such as DALL·E 2 (Ramesh et al., 2022), Ima-
gen (Saharia et al., 2022), and Stable Diffusion (Rom-
bach et al., 2022). Advances in image synthesis have
resulted in very high-quality images being generated,
and humans can hardly tell if a given picture is an
actual or artificially generated image (so-called deep-
fake) (Nightingale and Farid, 2022). This progress
has many implications in practice and poses a danger
to our digital society: Deepfakes can be used for dis-
information campaigns, as such images appear par-
ticularly credible due to their sensory comprehensi-
bility. Disinformation aims to discredit opponents in
public perception, to create sentiment for or against
certain social groups, and thus influence public opin-
ion. In effect, deepfakes lead to an erosion of trust
in institutions and individuals, support conspiracy theories, and foster entrenched political polarization. DM-based text-to-image models entail par-
ticular risks, since an adversary can specifically cre-
ate images supporting their narrative, with very little
technical knowledge required. A recent example of
public deception featuring DM-generated images—
although without malicious intent—is the depiction of
Pope Francis in a puffer jacket (Huang, 2023). De-
spite the growing concern about deepfakes and the
continuous improvement of DMs, there is only a lim-
ited amount of research on their detection.
In this paper, we conduct an extensive experimen-
tal study on the detectability of images generated by
DMs. Since previous work on the detection of GAN-
generated images (e.g., (Wang et al., 2020; Grag-
naniello et al., 2021; Mandelli et al., 2022)) resulted
in effective detection methods, we raise the question
whether these can be applied to DM-generated im-
ages. Our analysis of five state-of-the-art GANs and
five DMs demonstrates that existing detection meth-
ods suffer from severe performance degradation when
applied to DM-generated images, with the AUROC
dropping by 15.2 % on average compared to GANs.
However, we show that by re-training, the detection
accuracy can be drastically improved, proving that
images generated by DMs can be detected. Remark-
ably, a detector trained on DM-generated images is
capable of detecting images from GANs, while the
opposite direction does not hold. Our analysis in
feature space suggests that DM-generated images are
harder to detect because they contain fewer genera-
tion artifacts, particularly in the frequency domain.
However, we observe a previously overlooked mis-
match towards higher frequencies. Further analysis
suggests that this is caused by the training objective
of DMs, which favors perceptual image quality in-
stead of accurate reproduction of high-frequency de-
tails. We believe that our results provide the founda-
tion for further research on the effective detection of
deepfakes generated by DMs. Our code, data, and the
extended version of this paper (with additional experi-
ments) are available at https://github.com/jonasricker/
diffusion-model-deepfake-detection.
2 RELATED WORK
Fake Image Detection. In the wake of the
emergence of powerful image synthesis methods, the
forensic analysis of deepfake images received in-
creased attention, leading to a variety of detection
methods (Verdoliva, 2020). Existing approaches can
be broadly categorized into two groups. Methods in
the first group exploit either semantic inconsisten-
cies like irregular eye reflections (Hu et al., 2021)
or known generation artifacts in the spatial (Nataraj
et al., 2019; McCloskey and Albright, 2019) or
frequency domain (Frank et al., 2020). The sec-
ond group uses neural networks to learn a feature
representation in which real images can be distin-
guished from generated ones. Wang et al. demon-
strate that training a standard convolutional neural
network (CNN) on real and fake images from a sin-
gle GAN yields a classifier capable of detecting im-
ages generated by a variety of unknown GANs (Wang
et al., 2020). Given the rapid evolution of generative
models, developing detectors which generalize to new
generators is crucial and therefore a major field of re-
search (Xuan et al., 2019; Chai et al., 2020; Wang
et al., 2020; Cozzolino et al., 2021; Gragnaniello
et al., 2021; Girish et al., 2021; Mandelli et al., 2022;
Jeong et al., 2022).
Since DMs have been proposed only recently, few
works analyze their forensic properties. Farid per-
forms an initial exploration of lighting (Farid, 2022a)
and perspective (Farid, 2022b) inconsistencies in im-
ages generated by DALL·E 2 (Ramesh et al., 2022),
showing that DMs often generate physically implau-
sible scenes. A novel approach specifically targeted at
DMs is proposed by Wang et al. (2023), who observe
that DM-generated images can be more accurately re-
constructed by a pre-trained DM than real images.
The difference between the original and reconstructed
image then serves as the input for a binary classifier.
Another work (Sha et al., 2023) focuses on text-to-
image models like Stable Diffusion (Rombach et al.,
2022). They find that incorporating the prompt with
which an image was generated (or a generated cap-
tion if the real prompt is not available) into the detec-
tor improves classification. In a work related to ours
(Corvi et al., 2023b), it is shown that GAN detectors
perform poorly on DM-generated images. Therefore,
a pressing challenge is to develop universal detection
methods that are effective against different kinds of
generative models, mainly GANs and DMs. Ojha
et al. make a first step in this direction (Ojha et al.,
2023). Instead of training a classifier directly on real
and fake images, which according to their hypothesis
leads to poor generalization since the detector focuses
on, e.g., GAN-specific artifacts, they propose to use
a pre-trained vision transformer (CLIP-ViT (Dosovit-
skiy et al., 2021; Radford et al., 2021)), extended with
a final classification layer.
Frequency Artifacts in Generated Images. Zhang
et al. were the first to demonstrate that the spectrum
of GAN-generated images contains visible artifacts in
the form of a periodic, grid-like pattern due to trans-
posed convolution operations (Zhang et al., 2019).
These findings were later reproduced (Wang et al.,
2020) and extended to the discrete cosine transform
(DCT) (Frank et al., 2020). Another characteristic
was discovered in (Durall et al., 2020), who showed
that GANs are unable to correctly reproduce the spec-
tral distribution of the training data. In particular, gen-
erated images contain increased magnitudes at high
frequencies. While several works attribute these spec-
tral discrepancies to transposed convolutions (Zhang
et al., 2019; Durall et al., 2020) or, more generally,
up-sampling operations (Frank et al., 2020; Chan-
drasegaran et al., 2021), no consensus on their ori-
gin has yet been reached. Some works explain them
by the spectral bias of convolution layers due to lin-
ear dependencies (Dzanic et al., 2020; Khayatkhoei
and Elgammal, 2022), while others suggest the dis-
criminator is not able to provide an accurate training
signal (Chen et al., 2021; Schwarz et al., 2021).
In contrast, whether images generated by DMs ex-
hibit grid-like frequency patterns appears to strongly
depend on the specific model (Sha et al., 2023; Corvi
et al., 2023a; Ojha et al., 2023). Another interest-
ing observation is made by Rissanen et al. who an-
alyze the generative process of diffusion models in
the frequency domain (Rissanen et al., 2023). They
state that diffusion models have an inductive bias ac-
cording to which, during the reverse process, higher
frequencies are added to existing lower frequencies.
Other works (Kingma et al., 2021; Song et al., 2022b)
experiment with adding Fourier features to improve
learning of high-frequency content, the former report-
ing it leads to much better likelihoods.
3 BACKGROUND ON DMs
DMs are a class of probabilistic generative models,
originally inspired by nonequilibrium thermodynam-
ics (Sohl-Dickstein et al., 2015). The most common
formulations build either on DDPM (Ho et al., 2020)
or the score-based modeling perspective (Song and
Ermon, 2019; Song and Ermon, 2020; Song et al.,
2022b). Numerous modifications and improvements
have been proposed, leading to higher perceptual
quality (Nichol and Dhariwal, 2021; Dhariwal and
Nichol, 2021; Choi et al., 2022; Rombach et al., 2022)
and increased sampling speed (Song et al., 2022a;
Liu et al., 2022; Salimans and Ho, 2022; Xiao et al.,
2022). In short, DMs model a data distribution by
gradually disturbing a sample from this distribution
and then learning to reverse this diffusion process.
To be more precise, we briefly review the forward
and backward process for the seminal work in (Ho
et al., 2020). In the diffusion (or forward) process for
DDPMs, a sample $x_0$ (an image in most applications) is repeatedly corrupted by Gaussian noise in sequential steps $t = 1, \ldots, T$ according to a monotonically increasing noise schedule $\{\beta_t\}_{t=1}^{T}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right). \tag{1}$$
With $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, we can directly sample from the forward process at arbitrary times:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right). \tag{2}$$
The noise schedule is typically designed to satisfy $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$. During the denoising (or reverse) process, we aim to iteratively sample from $q(x_{t-1} \mid x_t)$ to ultimately obtain a clean image from $x_T \sim \mathcal{N}(0, I)$. However, since $q(x_{t-1} \mid x_t)$ is intractable as it depends on the entire underlying data distribution, it is approximated by a deep neural network. More formally, $q(x_{t-1} \mid x_t)$ is approximated by

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(\mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right), \tag{3}$$
where the mean $\mu_\theta$ and covariance $\Sigma_\theta$ are given by the output of the model (or the latter is set to a constant as proposed in (Ho et al., 2020)). Predicting the mean of the denoised sample $\mu_\theta(x_t, t)$ is conceptually equivalent to predicting the noise that should be removed, denoted by $\varepsilon_\theta(x_t, t)$. Predominantly, the latter approach is implemented (e.g., (Ho et al., 2020; Dhariwal and Nichol, 2021)), such that training a DM boils down to minimizing a (weighted) mean squared error (MSE) $\lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2$ between the true and predicted noise. Note that this objective can be interpreted as a weighted ELBO with data augmentation (Kingma and Gao, 2023). For a recent overview on DMs see (Yang et al., 2023).
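To make this concrete, the following is a minimal sketch (not the original training code) of the closed-form forward sampling from Eq. (2) and the simplified noise-prediction loss, assuming a hypothetical PyTorch noise-prediction network model(x_t, t) and a linear noise schedule as in (Ho et al., 2020):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule as in DDPM
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    # closed-form sample x_t ~ q(x_t | x_0), Eq. (2)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

def simple_loss(model, x0):
    # L_simple: MSE between true and predicted noise at a uniformly drawn step
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return torch.nn.functional.mse_loss(model(x_t, t), eps)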
4 DATASET
To ensure technical correctness, we decide to ana-
lyze a set of generative models for which pre-trained
checkpoints and/or samples of the same dataset,
namely LSUN Bedroom (Yu et al., 2016) (256×256),
are available. Otherwise, both the detectability of
generated samples and their spectral properties might
suffer from biases, making them difficult to compare.
An overview of the dataset is given in Table 1, and we
provide details and example images in the appendix.
All samples are either directly downloaded or
generated using code and pre-trained models pro-
vided by the original publications. We consider data
from ten models in total, five GANs and five DMs.
This includes the seminal models ProGAN (Karras
et al., 2018) and StyleGAN (Karras et al., 2019),
as well as the more recent ProjectedGAN (Sauer
et al., 2021). Note that Diff(usion)-StyleGAN2 and
Diff(usion)-ProjectedGAN (Wang et al., 2022a) (the
current state of the art on LSUN Bedroom) use a
forward diffusion process to optimize GAN train-
ing, but this does not change the GAN model archi-
tecture.

Table 1: Models evaluated in this work. Fréchet inception distances (FIDs) on LSUN Bedroom are taken from the original publications and from (Dhariwal and Nichol, 2021) in the case of IDDPM. The lower the FID, the higher the image quality.

Class  Method              FID
GAN    ProGAN              8.34
GAN    StyleGAN            2.65
GAN    ProjectedGAN        1.52
GAN    Diff-StyleGAN2      3.65
GAN    Diff-ProjectedGAN   1.43
DM     DDPM                6.36
DM     IDDPM               4.24
DM     ADM                 1.90
DM     PNDM                5.68
DM     LDM                 2.95

From the class of DMs, we consider the
original DDPM (Ho et al., 2020), its successor ID-
DPM (Nichol and Dhariwal, 2021), and ADM (Dhari-
wal and Nichol, 2021), the latter outperforming sev-
eral GANs with an FID (Heusel et al., 2017) of
1.90 on LSUN Bedroom. PNDM (Liu et al., 2022)
speeds up the sampling process by a factor of 20 us-
ing pseudo numerical methods, which can be applied
to existing pre-trained DMs. Lastly, LDM (Rombach
et al., 2022) uses an adversarially trained autoencoder
that transforms an image from the pixel space to a la-
tent space (and back). Training the DM in this more
suitable latent space reduces the computational com-
plexity and therefore enables training on higher reso-
lutions. The success of this approach is underpinned
by the groundbreaking results of Stable Diffusion, a
powerful and publicly available text-to-image model
based on LDM.
5 DETECTION ANALYSIS
In this section we analyze how well state-of-the-art
fake image detectors can distinguish DM-generated
from real images. At first, we apply pre-trained de-
tectors known to be effective against GANs, followed
by a study on the generalization abilities of re-trained
detectors. Based on our findings, we conduct an in-
depth feature space analysis to gain a better under-
standing on how fake images are detected.
Detection Methods. We evaluate three state-of-the-
art CNN-based detectors: Wang2020 (Wang et al.,
2020), Gragnaniello2021 (Gragnaniello et al., 2021),
and Mandelli2022 (Mandelli et al., 2022). They are
supposed to perform well on images from unseen gen-
erative models, but it is unclear whether this holds for
DM-generated images as well.
Performance Metrics. The performance of the an-
alyzed classifiers is estimated in terms of the widely
used area under the receiver operating characteristic
curve (AUROC). However, the AUROC is overly op-
timistic, as it merely captures the potential of a classifier while the optimal threshold is usually unknown in practice (Cozzolino et al., 2021). Thus, we additionally adopt the probability of detection at a fixed false alarm rate (Pd@FAR), i.e., the true positive rate achieved when the false alarm rate is fixed. Intu-
itively, this corresponds to picking the y-coordinate of
the ROC curve given an x-coordinate. This metric is a
valid choice for realistic scenarios such as large-scale
content filtering on social media, where only a small
amount of false positives is tolerable. We consider a
fixed false alarm rate of 1%.
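For illustration, both metrics can be read off the ROC curve; a small sketch using scikit-learn (the labels and scores below are toy values, not our data):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def detection_metrics(y_true, y_score, max_far=0.01):
    # AUROC and probability of detection at a fixed false alarm rate
    auroc = roc_auc_score(y_true, y_score)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # Pd@FAR: highest TPR among all thresholds whose FPR stays below max_far
    pd_at_far = tpr[fpr <= max_far].max()
    return auroc, pd_at_far

# toy usage: 1 = fake, 0 = real; scores are hypothetical detector outputs
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])
print(detection_metrics(y_true, y_score))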
Evaluating Pre-Trained Detectors. At first, we test
the performance of the pre-trained detectors based on
20000 samples, equally divided into real and gener-
ated images. While Wang2020 and Gragnaniello2021
are trained on images from a single GAN (ProGAN or
StyleGAN2), Mandelli2022 is trained on images from
a diverse set of generative models. The results in the
upper half of Table 2 show that all GAN-generated
images can be effectively distinguished from real im-
ages, with Gragnaniello2021 yielding the best results.
For DM-generated images, however, the performance of all
detectors significantly drops, on average by 15.2 %
AUROC compared to GANs. Although the average
AUROC of 91.4 % achieved by the best-performing
model Gragnaniello2021 (ProGAN variant) appears
promising, we argue that in a realistic setting with 1 %
tolerable false positives, detecting only 25.7 % of all
fake images is unacceptable.
To verify that our findings are not limited to our
dataset, we extend our evaluation to images from
DMs trained on other datasets, variations of ADM,
and popular text-to-image models. We provide de-
tails on these additional datasets in the appendix. The
results, given in the lower half of Table 2, support
the finding that detectors perform significantly worse
on DM-generated images. Images from PNDM and
LDM trained on LSUN Church are detected better,
which we attribute to a dataset-specific bias.
Generalization of Re-Trained Detectors. Given the
findings presented above, the question arises whether
DMs evade detection in principle, or whether the de-
tection performance can be increased by re-training a
detector. We select the architecture from Wang2020
since the original training code is available and train-
ing is relatively efficient. Furthermore, we choose
the configuration Blur+JPEG (0.5) as it yields slightly
better scores on average. For each of the ten gener-
ators, we train a detector according to the authors’
instructions, using 78000 samples for training and
2000 samples for validation (equally divided into real
(LSUN Bedroom) and fake). We also consider three
aggregated settings in which we train on all images
generated by GANs, DMs, and both, respectively.
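For illustration, a minimal sketch of such a re-training setup (a ResNet-50 binary classifier as in Wang2020; the dataset path, augmentation, and hyperparameters here are simplified placeholders, not the authors' exact Blur+JPEG configuration):

import torch
import torchvision
from torch.utils.data import DataLoader

# Binary real-vs-fake classifier on top of an ImageNet-pretrained ResNet-50,
# following the general recipe of Wang et al. (2020).
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 1)

transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
])
# expects train_dir/real and train_dir/fake with LSUN Bedroom vs. generated images
dataset = torchvision.datasets.ImageFolder("train_dir", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.BCEWithLogitsLoss()

for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images).squeeze(1), labels.float())
    loss.backward()
    optimizer.step()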
We report AUROC and Pd@1%FAR for each de-
tector evaluated on all datasets in Figure 1, based on
20000 held-out test samples (10000 real and 10000
fake per generator). All detectors achieve near-
perfect scores when evaluated on the dataset they
were trained on (represented by the values in the diag-
onal). While this is unsurprising for GANs, it shows
that DMs do exhibit detectable features that a detector
can learn. Regarding generalization, it appears that
detectors trained on images from a single DM per-
form better on images from unseen DMs compared to
Table 2: Detection performance of pre-trained universal detectors. For Wang2020 and Gragnaniello2021, we consider two
different variants, respectively. In the upper half, we report the performance of models trained on LSUN Bedroom, while
results on additional datasets are given in the second half. The best score (determined by the highest Pd@1%) for each
generator is highlighted in bold. We report average scores in gray.
Each cell reports AUROC / Pd@1%. Columns (left to right): Wang2020 Blur+JPEG (0.5), Wang2020 Blur+JPEG (0.1), Gragnaniello2021 ProGAN, Gragnaniello2021 StyleGAN2, Mandelli2022.
ProGAN 100.0 / 100.0 100.0 / 100.0 100.0 / 100.0 100.0 / 100.0 91.2 / 27.5
StyleGAN 98.7 / 81.4 99.0 / 84.4 100.0 / 100.0 100.0 / 100.0 89.6 / 14.7
ProjectedGAN 94.8 / 49.1 90.9 / 34.5 100.0 / 99.3 99.9 / 97.8 59.4 / 2.4
Diff-StyleGAN2 99.9 / 97.9 100.0 / 99.3 100.0 / 100.0 100.0 / 100.0 100.0 / 99.9
Diff-ProjectedGAN 93.8 / 43.3 88.8 / 27.2 99.9 / 99.2 99.8 / 96.6 62.1 / 2.8
Average 97.4 / 74.3 95.7 / 69.1 100.0 / 99.7 99.9 / 98.9 80.4 / 29.5
DDPM 85.2 / 14.2 80.8 / 9.3 96.5 / 39.1 95.1 / 30.7 57.4 / 0.6
IDDPM 81.6 / 10.6 79.9 / 7.8 94.3 / 25.7 92.8 / 21.2 62.9 / 1.3
ADM 68.3 / 3.4 68.8 / 4.0 77.8 / 5.2 70.6 / 2.5 60.5 / 1.8
PNDM 79.0 / 9.2 75.5 / 6.3 91.6 / 16.6 91.5 / 22.2 71.6 / 4.0
LDM 78.7 / 7.4 77.7 / 6.9 96.7 / 42.1 97.0 / 48.9 54.8 / 2.1
Average 78.6 / 9.0 76.6 / 6.8 91.4 / 25.7 89.4 / 25.1 61.4 / 2.0
ADM (LSUN Cat) 58.4 / 2.5 58.1 / 3.3 60.2 / 4.2 51.7 / 1.8 55.6 / 1.3
ADM (LSUN Horse) 55.5 / 1.5 53.4 / 2.2 56.1 / 2.7 50.2 / 1.4 44.2 / 0.5
ADM (ImageNet) 69.1 / 4.1 71.7 / 4.5 72.1 / 3.5 83.9 / 16.6 60.1 / 0.9
ADM-G-U (ImageNet) 67.2 / 3.7 62.3 / 1.2 66.8 / 1.6 78.9 / 10.2 60.0 / 1.0
PNDM (LSUN Church) 76.9 / 10.2 77.6 / 12.0 90.9 / 24.5 99.3 / 85.8 56.4 / 1.9
LDM (LSUN Church) 86.3 / 19.8 82.2 / 14.2 98.8 / 75.5 99.5 / 90.2 58.9 / 1.3
LDM (FFHQ) 69.4 / 3.6 71.0 / 3.6 91.1 / 25.4 67.2 / 2.1 63.0 / 0.6
ADM’ (FFHQ) 77.7 / 8.7 81.4 / 8.8 87.7 / 17.8 89.0 / 17.2 69.8 / 2.0
P2 (FFHQ) 79.5 / 8.9 83.2 / 9.2 89.2 / 11.5 91.1 / 18.9 72.5 / 2.7
Stable Diffusion v1-1 42.4 / 1.5 51.4 / 2.0 73.2 / 4.0 75.2 / 13.6 76.1 / 4.2
Stable Diffusion v1-5 43.7 / 1.4 52.6 / 2.1 72.9 / 2.8 79.8 / 18.3 75.3 / 4.1
Stable Diffusion v2-1 46.1 / 1.4 47.3 / 1.1 62.8 / 1.1 55.1 / 1.1 37.0 / 0.5
Midjourney v5 52.7 / 3.0 57.1 / 3.0 69.9 / 3.3 67.1 / 3.3 18.3 / 0.3
Figure 1: Detection performance for re-trained detectors. Rows correspond to the generator tested on, columns to the generator(s) trained on. The columns GAN, DM, and All correspond to models trained on samples from all GANs, all DMs, and both, respectively.

(a) AUROC
Trained on:        ProGAN StyleGAN ProjectedGAN Diff-StyleGAN2 Diff-ProjectedGAN DDPM IDDPM ADM PNDM LDM GAN DM All
ProGAN             100.0 92.5 98.7 99.9 98.0 98.0 98.5 98.5 98.0 97.5 100.0 99.7 100.0
StyleGAN           94.2 100.0 91.6 99.7 92.8 89.3 97.5 98.2 94.4 92.2 100.0 99.9 100.0
ProjectedGAN       86.9 66.2 100.0 90.0 100.0 89.0 88.5 82.6 88.8 75.3 100.0 97.9 100.0
Diff-StyleGAN2     96.1 94.1 94.8 100.0 93.3 93.5 93.1 83.4 97.1 91.3 100.0 99.0 100.0
Diff-ProjectedGAN  86.2 66.5 99.9 88.6 100.0 87.6 88.0 82.6 86.5 79.9 100.0 98.4 100.0
DDPM               83.7 63.0 78.8 87.2 79.1 100.0 99.9 99.6 99.1 97.9 93.1 100.0 100.0
IDDPM              81.6 67.3 77.7 81.1 78.2 99.9 100.0 100.0 97.9 97.2 90.0 100.0 100.0
ADM                67.8 56.5 64.5 56.6 65.0 93.2 98.0 99.9 87.0 91.0 70.5 100.0 100.0
PNDM               81.3 62.7 78.0 84.3 74.1 99.6 99.9 99.8 100.0 97.8 95.1 100.0 100.0
LDM                75.2 54.7 71.6 73.0 69.0 97.0 99.1 99.4 93.9 100.0 88.3 100.0 100.0

(b) Pd@1%FAR
Trained on:        ProGAN StyleGAN ProjectedGAN Diff-StyleGAN2 Diff-ProjectedGAN DDPM IDDPM ADM PNDM LDM GAN DM All
ProGAN             100.0 42.2 78.8 98.8 76.4 86.8 92.4 93.7 82.0 75.7 100.0 98.9 100.0
StyleGAN           45.1 100.0 34.4 95.0 42.6 27.7 75.5 84.9 48.0 37.0 100.0 98.8 100.0
ProjectedGAN       24.1 2.8 100.0 35.6 99.5 34.2 42.1 38.5 27.8 15.2 100.0 85.4 100.0
Diff-StyleGAN2     58.3 42.8 50.7 100.0 48.1 43.8 43.8 25.9 66.6 44.6 100.0 90.6 100.0
Diff-ProjectedGAN  23.7 3.0 98.2 28.0 100.0 29.9 37.8 32.9 20.2 17.3 100.0 87.6 100.0
DDPM               13.0 2.1 8.3 18.8 9.0 99.8 99.1 92.6 80.2 68.7 33.9 100.0 100.0
IDDPM              10.6 2.5 7.7 10.3 8.1 96.3 99.7 99.3 62.4 61.6 23.9 100.0 100.0
ADM                3.2 0.9 2.5 1.3 3.1 37.5 69.7 98.2 16.7 30.1 4.2 100.0 99.9
PNDM               11.0 2.3 8.4 16.1 7.0 91.6 97.7 96.6 100.0 75.2 48.9 100.0 100.0
LDM                5.8 0.7 4.3 4.9 4.0 62.7 85.3 91.5 36.0 100.0 20.8 100.0 100.0
[Figure 2: Feature space visualization for the detector Wang2020 via t-SNE of real and generated images in two dimensions. The features correspond to the representation prior to the last fully-connected layer of the given detector. Panels: (a) pre-trained (Blur+JPEG (0.5)), (b) trained on GANs and DMs, (c) trained on all GANs, (d) trained on all DMs.]
detectors trained on images from a single GAN. For
instance, the detector trained solely on images from
ADM achieves a Pd@1%FAR greater than 90 % for
all other DMs. These findings suggest that images
generated by DMs not only contain detectable fea-
tures, but that these are similar across different archi-
tectures and training procedures.
Surprisingly, detectors trained on images from
DMs are significantly more successful in detecting
GAN-generated images than vice versa. This be-
comes most apparent when analyzing the detectors
that are trained on all GANs and DMs, respectively.
While the detector trained on images from all GANs
achieves an average Pd@1%FAR of 26.34 % on DM-
generated images, the detector trained on images from
all DMs on average detects 94.26 % of all GAN-
generated samples.
Analysis of the Learned Feature Spaces. We con-
duct a more in-depth analysis of the learned feature
spaces to better understand this behavior. We utilize
t-SNE (van der Maaten and Hinton, 2008) to visualize
the extracted features prior to the last fully-connected
layer in Figure 2. For the pre-trained Wang2020 we
observe a relatively clear separation between real and
GAN-generated images, while there exists a greater
overlap between real and DM-generated images (Fig-
ure 2a). These results match the classification results
from Table 2. Looking at the detector which is trained
on DM-generated images only (Figure 2d), the feature
representations for GAN- and DM-generated images
appear to be similar. In contrast, the detectors trained
using GAN-generated images or both (Figures 2c and
2b) seem to learn distinct feature representations for
GAN- and DM-generated images.
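A sketch of how such a visualization can be obtained (extracting the representation before the final fully-connected layer of a ResNet-style detector via a forward hook, then projecting with scikit-learn's t-SNE; the detector and the image batch are assumed to be loaded elsewhere):

import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def penultimate_features(model, images):
    # features right before the last fully-connected layer of a ResNet-style detector
    feats = []
    hook = model.avgpool.register_forward_hook(
        lambda module, inp, out: feats.append(out.flatten(1).cpu()))
    model.eval()
    model(images)               # forward pass only to trigger the hook
    hook.remove()
    return torch.cat(feats).numpy()

# 'detector' and 'images' (real and generated) are assumed to be loaded elsewhere
features = penultimate_features(detector, images)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)  # (N, 2) for plotting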
Based on these results, we argue that the hypothe-
sis, according to which a detector trained on one fam-
ily of generative models cannot generalize to a differ-
ent family (Ojha et al., 2023), only holds true “in one
direction”. Given the feature space visualizations, de-
tectors trained on GAN-generated images appear to
focus mostly on GAN-specific artifacts, which may
be more prominent and easier to learn. In contrast,
a detector trained exclusively on DM-generated im-
ages learns a feature representation in which images
generated by GANs and DMs are mapped to simi-
lar embeddings. As a consequence, this detector can
generalize to GAN-generated images, since it is not
“distracted” by family-specific patterns, but learns to
detect artifacts which are present in both GAN- and
DM-generated images.
This also implies that DM-generated images con-
tain fewer family-specific artifacts. This becomes ap-
parent when analyzing them in the frequency domain,
which we demonstrate in the following section.
6 FREQUENCY ANALYSIS
For detecting GAN-generated images, exploiting ar-
tifacts in the frequency domain has proven to be
highly effective (Frank et al., 2020). Since DMs contain building blocks similar to those of GANs (especially up-sampling operations in the underlying U-Net (Ronneberger et al., 2015)), it seems reasonable to suspect that DM-generated images exhibit similar artifacts. In
this section, we analyze the spectral properties of
DM-generated images and compare them to those of
GAN-generated images. We investigate potential rea-
sons for the identified frequency characteristics by an-
alyzing the denoising process.
Transforms. We use two frequency transforms that have been applied successfully in both traditional image forensics (Lyu, 2008) and deepfake detection: the discrete Fourier transform (DFT) and the reduced spectrum (Durall et al., 2020; Dzanic et al., 2020; Schwarz et al., 2021), which is a 1D representation of the DFT. While the DFT visualizes frequency artifacts, the reduced spectrum can be used to identify spectrum discrepancies.

[Figure 3: Mean DFT spectrum of real and generated images for (a) GANs (Real, ProGAN, StyleGAN, ProjectedGAN, Diff-StyleGAN2, Diff-ProjectedGAN) and (b) DMs (Real, DDPM, IDDPM, ADM, PNDM, LDM). To increase visibility, the color bar is limited to [10^-5, 10^-1], with values lying outside this interval being clipped.]
Analysis of Frequency Artifacts. Figure 3
depicts the absolute DFT spectrum averaged over
10000 images from each GAN and DM trained on
LSUN Bedroom. Before applying the DFT, images
are transformed to grayscale and, following previous
works (Marra et al., 2019; Wang et al., 2020), high-
pass filtered by subtracting a median-filtered version
of the image. For all GANs we observe signifi-
cant artifacts, predominantly in the form of a regular
grid, corresponding to previous findings (Zhang et al.,
2019; Frank et al., 2020). In contrast, the DFT spec-
tra of images generated by DMs (see Figure 3b) are
significantly more similar to the real spectrum with al-
most no visible artifacts. LDM is an exception: while
being less pronounced than for GANs, generated im-
ages exhibit a clearly visible grid across their spec-
trum. As mentioned in Section 4, the architecture of
LDM differs from the remaining DMs as the final im-
age is generated using an adversarially trained autoen-
coder, which could explain the discrepancies. This
observation supports previous findings which suggest
that the discriminator is responsible for spectrum de-
viations (Chen et al., 2021; Schwarz et al., 2021).
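For reference, a sketch of both transforms for a single grayscale image using NumPy and SciPy; the median filter size and the azimuthal-averaging variant of the reduced spectrum are our assumptions, and the averaging over 10000 images used for the figures is omitted:

import numpy as np
from scipy.ndimage import median_filter

def dft_spectrum(gray_img, highpass=True):
    # absolute (shifted) DFT spectrum, optionally high-pass filtered
    img = gray_img.astype(np.float64)
    if highpass:
        img = img - median_filter(img, size=3)   # subtract a median-filtered version
    return np.abs(np.fft.fftshift(np.fft.fft2(img)))

def reduced_spectrum(spectrum):
    # 1D reduced spectrum: azimuthal average of the squared spectrum over radial frequency
    h, w = spectrum.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)   # integer radius of each frequency bin
    power = spectrum ** 2
    radial_sum = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return radial_sum / np.maximum(counts, 1)          # mean power per radius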
We conclude that “traditional” DMs, which gen-
erate images by gradual denoising, do not produce
the frequency artifacts known from GANs. Regard-
ing our results in Section 5, this could explain why
detectors trained on GAN images do not generalize to
DMs, while training on DM-generated images leads
to better generalization.
Analysis of Spectrum Discrepancies. In a second
experiment we analyze how well GANs and DMs are
able to reproduce the spectral distribution of real im-
ages. We visualize the reduced spectra for all gen-
erators in Figure 4, again averaged over 10000 im-
ages. Except for Diff-StyleGAN2, all GANs con-
tain the previously reported elevated high frequen-
cies. Among the DMs, these can only be observed for
LDM. This strengthens the hypothesis that it is the
autoencoder which causes GAN-like frequency char-
acteristics. However, we observe that all DMs have
a tendency to underestimate the spectral density to-
wards the higher end of the frequency spectrum. This
is particularly noticeable for DDPM, IDDPM, and
ADM.
Source of Spectrum Underestimation. Based on
these findings, we conduct an additional experiment
to identify the source of this spectrum underestima-
tion. Since DMs generate images via gradual denois-
ing, we analyze how the spectrum evolves during this
denoising process. For this experiment, we use code
and model from ADM (Dhariwal and Nichol, 2021)
trained on LSUN Bedroom. We generate samples at
different time steps t and compare the reduced spec-
trum (averaged over 512 images) to that of 50000 real
images. The results are shown in Figure 5.
We adopt the figure type from (Schwarz et al., 2021) and depict the relative spectral density error $\tilde{S}_\text{err} = \tilde{S}_\text{fake} / \tilde{S}_\text{real} - 1$, with the colorbar clipped at -1 and 1. At $t = T = 1000$, the image is pure Gaussian noise, which naturally causes strong spectrum deviations. Around $t = 300$, the error starts to decrease, but interestingly it appears that the optimum is not reached at $t = 0$, but at $t \approx 10$. It should be noted that while at this step the frequency spectrum is closest to that of real images, the images still contain visible noise.
[Figure 4: Mean reduced spectrum (spectral density over f/f_nyq) of real and generated images for (a) GANs and (b) DMs. The part of the spectrum where GAN-characteristic discrepancies occur is magnified.]
During the final denoising steps, $\tilde{S}_\text{err}$ becomes negative, predominantly for higher frequencies, which corresponds to our observations in Figure 4b.
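In code, the relative spectral density error is simply the elementwise ratio of the two (averaged) reduced spectra; a short sketch reusing the hypothetical reduced_spectrum helper from the earlier snippet:

import numpy as np

def spectral_density_error(reduced_fake, reduced_real, clip=1.0):
    # relative spectral density error S_err = S_fake / S_real - 1, clipped for display
    err = reduced_fake / reduced_real - 1.0
    return np.clip(err, -clip, clip)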
We hypothesize that this underestimation towards
higher frequencies stems from the learning objective
used to train DMs. Recalling Section 3, DMs are
trained to minimize the MSE between the true and
predicted noise at different time steps. The weighting
of the MSE therefore controls the relative importance
of each step. While the semantic content of an image
is generated early during the denoising process, high-
frequency details are synthesized near t = 0 (Kingma
et al., 2021). Theoretically, using the variational
lower bound L
vlb
as the training objective would yield
the highest log-likelihood. However, training DMs
[Figure 5: Spectral density error $\tilde{S}_\text{err}$ throughout the denoising process, shown for (a) all sampling steps ($0 \le t \le 1000$) and (b) a close-up of the last 100 steps ($0 \le t \le 100$). The error is computed relative to the spectrum of real images. The colorbar is clipped at -1 and 1.]
with $L_\text{vlb}$ is difficult (Ho et al., 2020; Nichol and Dhariwal, 2021), which is why in practice modified objectives are used. The loss proposed in (Ho et al., 2020), $L_\text{simple} = \mathbb{E}_{t, x_0, \varepsilon}\left[\lVert \varepsilon - \varepsilon_\theta(x_t, t) \rVert^2\right]$, for example, considers each denoising step as equally important. Compared to $L_\text{vlb}$, the steps near $t = 0$ are significantly down-weighted, trading log-likelihood for higher perceptual image quality. The MSE of ADM over $t$ shown in Figure 6 demonstrates that the final denoising steps are the most difficult (which is already plain to see as the signal-to-noise ratio increases for $t \to 0$, i.e., the to-be-predicted noise makes up ever smaller fractions of $x_t$). The hybrid training objective $L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}$ (Nichol and Dhariwal, 2021), used in IDDPM and ADM, incorporates $L_\text{vlb}$ (with $\lambda = 0.001$) and already improves upon DDPM in modeling the high-frequency details of an image, but still does not match them accurately.
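For reference, the weighting that $L_\text{simple}$ discards can be made explicit; up to a constant, the ELBO term for step $t$ in (Ho et al., 2020) can be written in the noise-prediction parametrization (with $\sigma_t^2$ denoting the variance of the reverse step) as

$$L_{t-1} = \mathbb{E}_{x_0, \varepsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\, \alpha_t\, (1-\bar{\alpha}_t)}\, \bigl\lVert \varepsilon - \varepsilon_\theta(x_t, t) \bigr\rVert^2\right] + C,$$

whereas $L_\text{simple}$ replaces the weighting factor by one. Since this factor is largest for the steps near $t = 0$ and decays towards $t = T$, the replacement down-weights the final denoising steps relative to the ELBO.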
In summary, we conclude that the denoising steps
near t = 0, which govern the high-frequency con-
tent of generated images, are the most difficult to
model. By down-weighting the importance of these
steps (relative to $L_\text{vlb}$), DMs achieve remarkable
perceptual image quality (or benchmark metrics such
as FID), but seem to fall short of accurately matching
the high-frequency distribution of real data.
[Figure 6: Mean and standard deviation of the MSE over t for ADM on LSUN Bedroom after training. The denoising steps towards t = 0, accounting for high frequencies, have a higher error.]
higher error.
Towards the Detection of Diffusion Model Deepfakes
453
[Figure 7: Spectral density error $\tilde{S}_\text{err}$ for different numbers of denoising steps. The error is computed relative to the spectrum of real images. The colorbar is clipped at -1 and 1. Note that the y-axis is not scaled linearly.]
Effect of the Number of Sampling Steps. Lastly,
we analyze how the number of sampling steps during
the denoising process affects the frequency spectrum.
Previous work reported that increasing the number
of steps leads to an improved log-likelihood, corre-
sponding to better reproduction of higher frequencies
(Nichol and Dhariwal, 2021). Our results in Figure 7
confirm these findings: increasing the number of de-
noising steps reduces the underestimation.
7 CONCLUSION
Deepfakes pose a severe risk for society, and diffu-
sion models have the potential to raise disinformation
campaigns to a new level. Despite the urgency of the
problem, research about detecting DM-generated im-
ages is still in its infancy. In this work, we provide a
much-needed step towards the detection of DM deep-
fakes. Instead of starting from the ground up, we build
on previous achievements in the forensic analysis of
GANs. We show that, after re-training, current state-
of-the-art detection methods can successfully distin-
guish real from DM-generated images. Further anal-
ysis suggests that DMs produce fewer detectable ar-
tifacts than GANs, explaining why detectors trained
on DM-generated images generalize to GANs, but
not vice versa. While artifacts in the frequency do-
main have been shown to be a characteristic feature
of GAN-generated images, we find that DMs pre-
dominantly do not have this weakness. However, we
observe a systematic underestimation of the spectral
density, which we attribute to the loss function of
DMs. Whether this mismatch can be exploited for
novel detection methods should be part of future re-
search. We hope that our work can foster the forensic
analysis of images generated by DMs and spark fur-
ther research towards the effective detection of deep-
fakes.
ACKNOWLEDGEMENTS
Funded by the Deutsche Forschungsgemeinschaft
(DFG, German Research Foundation) under Ger-
many’s Excellence Strategy - EXC 2092 CASA -
390781972.
REFERENCES
Chai, L., Bau, D., Lim, S.-N., and Isola, P. (2020). What
makes fake images detectable? Understanding proper-
ties that generalize. In European Conference on Com-
puter Vision (ECCV).
Chandrasegaran, K., Tran, N.-T., and Cheung, N.-M.
(2021). A closer look at Fourier spectrum discrep-
ancies for CNN-generated images detection. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Chen, Y., Li, G., Jin, C., Liu, S., and Li, T. (2021). SSD-
GAN: Measuring the realness in the spatial and spec-
tral domains. In AAAI Conference on Artificial Intel-
ligence (AAAI).
Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., and Yoon,
S. (2022). Perception prioritized training of diffusion
models. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Corvi, R., Cozzolino, D., Poggi, G., Nagano, K., and Ver-
doliva, L. (2023a). Intriguing properties of synthetic
images: From generative adversarial networks to dif-
fusion models. In IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR) Workshops.
Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano,
K., and Verdoliva, L. (2023b). On the detection of
synthetic images generated by diffusion models. In
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP).
Cozzolino, D., Gragnaniello, D., Poggi, G., and Verdoliva,
L. (2021). Towards universal GAN image detection.
In International Conference on Visual Communica-
tions and Image Processing (VCIP).
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat
GANs on image synthesis. In Advances in Neural In-
formation Processing Systems (NeurIPS).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. International
Conference on Learning Representations (ICLR).
Durall, R., Keuper, M., and Keuper, J. (2020). Watch your
up-convolution: CNN based generative deep neural
networks are failing to reproduce spectral distribu-
tions. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Dzanic, T., Shah, K., and Witherden, F. (2020). Fourier
spectrum discrepancies in deep network generated im-
ages. In Advances in Neural Information Processing
Systems (NeurIPS).
Farid, H. (2022a). Lighting (in)consistency of paint by text.
arXiv preprint.
Farid, H. (2022b). Perspective (in)consistency of paint by
text. arXiv preprint.
Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A.,
Kolossa, D., and Holz, T. (2020). Leveraging fre-
quency analysis for deep fake image recognition.
In International Conference on Machine Learning
(ICML).
Girish, S., Suri, S., Rambhatla, S. S., and Shrivastava, A.
(2021). Towards discovery and attribution of open-
world GAN generated images. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Gragnaniello, D., Cozzolino, D., Marra, F., Poggi, G., and
Verdoliva, L. (2021). Are GAN generated images easy
to detect? A critical analysis of the state-of-the-art.
In IEEE International Conference on Multimedia and
Expo (ICME).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). GANs trained by a two time-
scale update rule converge to a local nash equilibrium.
In Advances in Neural Information Processing Sys-
tems (NeurIPS).
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion
probabilistic models. In Advances in Neural Informa-
tion Processing Systems (NeurIPS).
Hu, S., Li, Y., and Lyu, S. (2021). Exposing GAN-
Generated faces using inconsistent corneal specular
highlights. In IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
Huang, K. (2023). Why Pope Francis is the star of A.I.-
generated photos. The New York Times.
Jeong, Y., Kim, D., Ro, Y., Kim, P., and Choi, J. (2022). Fin-
gerprintNet: Synthesized fingerprints for generated
image detection. In European Conference on Com-
puter Vision (ECCV).
Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Pro-
gressive growing of GANs for improved quality, sta-
bility, and variation. In International Conference on
Learning Representations (ICLR).
Karras, T., Laine, S., and Aila, T. (2019). A style-based
generator architecture for generative adversarial net-
works. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Khayatkhoei, M. and Elgammal, A. (2022). Spatial fre-
quency bias in convolutional generative adversarial
networks. AAAI Conference on Artificial Intelligence
(AAAI).
Kingma, D., Salimans, T., Poole, B., and Ho, J. (2021).
Variational diffusion models. In Advances in Neural
Information Processing Systems (NeurIPS).
Kingma, D. P. and Gao, R. (2023). Understanding diffusion
objectives as the ELBO with simple data augmenta-
tion. In Advances in Neural Information Processing
Systems (NeurIPS).
Liu, L., Ren, Y., Lin, Z., and Zhao, Z. (2022). Pseudo nu-
merical methods for diffusion models on manifolds.
In International Conference on Learning Representa-
tions (ICLR).
Lyu, S. (2008). Natural Image Statistics in Digital Image
Forensics. PhD thesis, Dartmouth College.
Mandelli, S., Bonettini, N., Bestagini, P., and Tubaro, S.
(2022). Detecting GAN-generated images by orthog-
onal training of multiple CNNs. In IEEE International
Conference on Image Processing (ICIP).
Marra, F., Gragnaniello, D., Verdoliva, L., and Poggi, G.
(2019). Do GANs leave artificial fingerprints? In
IEEE Conference on Multimedia Information Pro-
cessing and Retrieval (MIPR).
McCloskey, S. and Albright, M. (2019). Detecting GAN-
generated imagery using saturation cues. In IEEE In-
ternational Conference on Image Processing (ICIP).
Nataraj, L., Mohammed, T. M., Manjunath, B. S., Chan-
drasekaran, S., Flenner, A., Bappy, J. H., and Roy-
Chowdhury, A. K. (2019). Detecting GAN generated
fake images using co-occurrence matrices. Electronic
Imaging.
Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising
diffusion probabilistic models. In International Con-
ference on Machine Learning (ICML).
Nightingale, S. J. and Farid, H. (2022). AI-synthesized
faces are indistinguishable from real faces and more
trustworthy. Proceedings of the National Academy of
Sciences.
Ojha, U., Li, Y., and Lee, Y. J. (2023). Towards universal
fake image detectors that generalize across generative
models. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh,
G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P.,
Clark, J., Krueger, G., and Sutskever, I. (2021). Learn-
ing transferable visual models from natural language
supervision. In International Conference on Machine
Learning (ICML).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen,
M. (2022). Hierarchical text-conditional image gener-
ation with CLIP latents. arXiv preprint.
Rissanen, S., Heinonen, M., and Solin, A. (2023). Gener-
ative modelling with inverse heat dissipation. In In-
ternational Conference on Learning Representations
(ICLR).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Om-
mer, B. (2022). High-resolution image synthesis with
latent diffusion models. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
Net: Convolutional networks for biomedical image
segmentation. In International Conference on Med-
ical Image Computing and Computer Assisted Inter-
vention.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet large scale visual recognition challenge. Inter-
national Journal of Computer Vision (IJCV).
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
ton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi,
S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J.,
and Norouzi, M. (2022). Photorealistic text-to-image
diffusion models with deep language understanding.
arXiv preprint.
Salimans, T. and Ho, J. (2022). Progressive distillation for
fast sampling of diffusion models. In International
Conference on Learning Representations (ICLR).
Sauer, A., Chitta, K., Müller, J., and Geiger, A. (2021). Pro-
jected GANs converge faster. In Advances in Neural
Information Processing Systems (NeurIPS).
Schwarz, K., Liao, Y., and Geiger, A. (2021). On the fre-
quency bias of generative models. In Advances in
Neural Information Processing Systems (NeurIPS).
Sha, Z., Li, Z., Yu, N., and Zhang, Y. (2023). DE-FAKE:
Detection and attribution of fake images generated by
text-to-image diffusion models. ACM SIGSAC Con-
ference on Computer and Communications Security
(CCS).
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and
Ganguli, S. (2015). Deep unsupervised learning us-
ing nonequilibrium thermodynamics. In International
Conference on Machine Learning (ICML).
Song, J., Meng, C., and Ermon, S. (2022a). Denoising dif-
fusion implicit models. In International Conference
on Learning Representations (ICLR).
Song, Y. and Ermon, S. (2019). Generative modeling
by estimating gradients of the data distribution. In
Advances in Neural Information Processing Systems
(NeurIPS).
Song, Y. and Ermon, S. (2020). Improved techniques for
training score-based generative models. In Advances
in Neural Information Processing Systems (NeurIPS).
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A.,
Ermon, S., and Poole, B. (2022b). Score-based gen-
erative modeling through stochastic differential equa-
tions. In International Conference on Learning Rep-
resentations (ICLR).
van der Maaten, L. and Hinton, G. (2008). Visualizing data
using t-SNE. Journal of Machine Learning Research
(JMLR).
Verdoliva, L. (2020). Media forensics and DeepFakes: An
overview. IEEE Journal of Selected Topics in Signal
Processing.
Wang, S.-Y., Wang, O., Zhang, R., Owens, A., and Efros,
A. A. (2020). CNN-generated images are surprisingly
easy to spot... for now. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen,
H., and Li, H. (2023). DIRE for diffusion-generated
image detection. IEEE International Conference on
Computer Vision (ICCV).
Wang, Z., Zheng, H., He, P., Chen, W., and Zhou, M.
(2022a). Diffusion-GAN: Training GANs with dif-
fusion. arXiv preprint.
Wang, Z. J., Montoya, E., Munechika, D., Yang, H.,
Hoover, B., and Chau, D. H. (2022b). DiffusionDB:
A large-scale prompt gallery dataset for text-to-image
generative models. arXiv preprint.
Xiao, Z., Kreis, K., and Vahdat, A. (2022). Tackling the
generative learning trilemma with denoising diffusion
GANs. In International Conference on Learning Rep-
resentations (ICLR).
Xuan, X., Peng, B., Wang, W., and Dong, J. (2019). On the
generalization of GAN image forensics. In Biometric
Recognition (CCBR).
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y.,
Zhang, W., Cui, B., and Yang, M.-H. (2023). Diffu-
sion models: A comprehensive survey of methods and
applications. ACM Computing Surveys.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and
Xiao, J. (2016). LSUN: Construction of a large-scale
image dataset using deep learning with humans in the
loop. arXiv preprint.
Zhang, X., Karaman, S., and Chang, S.-F. (2019). Detecting
and simulating artifacts in GAN fake images. In IEEE
International Workshop on Information Forensics and
Security (WIFS).
APPENDIX
Details on LSUN Bedroom Dataset
LSUN Bedroom (Yu et al., 2016). We download and extract the lmdb database files using the official repository [1]. The images are center-cropped to 256×256 pixels.
ProGAN (Karras et al., 2018). We download the first 10000 samples from the non-curated collection provided by the authors [2].
StyleGAN (Karras et al., 2019). We download the first 10000 samples generated with ψ = 0.5 from the non-curated collection provided by the authors [3].
ProjectedGAN (Sauer et al., 2021). We sample 10000 images using code and pre-trained models provided by the authors [4] using the default configuration (--trunc=1.0).
Diff-StyleGAN2 and Diff-ProjectedGAN (Wang et al., 2022a). We sample 10000 images using code and pre-trained models provided by the authors [5] using the default configuration.
DDPM (Ho et al., 2020), IDDPM (Nichol and Dhariwal, 2021), and ADM (Dhariwal and Nichol, 2021). We download the samples provided by the authors of ADM [6] and extract the first 10000 samples for each generator. For ADM on LSUN, we select the models trained with dropout.
[1] https://github.com/fyu/lsun
[2] https://github.com/tkarras/progressive_growing_of_gans
[3] https://github.com/NVlabs/stylegan
[4] https://github.com/autonomousvision/projected_gan
[5] https://github.com/Zhendong-Wang/Diffusion-GAN
[6] https://github.com/openai/guided-diffusion
[Figure 8: Non-curated example images for real LSUN Bedroom, GAN-generated, and DM-generated images (columns: Real, ProGAN, StyleGAN, ProjectedGAN, Diff-StyleGAN2, Diff-ProjectedGAN, DDPM, IDDPM, ADM, PNDM, LDM).]
PNDM (Liu et al., 2022). We sample 10000 images using code and pre-trained models provided by the authors [7]. We specify --method F-PNDM and --sample_speed 20 for LSUN Bedroom and --sample_speed 10 for LSUN Church, as these are the settings leading to the lowest FID according to Tables 5 and 6 in the original publication.
LDM (Rombach et al., 2022). We sample 10000 images using code and pre-trained models provided by the authors [8] using settings from the corresponding table in the repository. For LSUN Church there is an inconsistency between the repository and the paper; we choose 200 DDIM steps (-c 200) as reported in the paper.
Details on Additional Datasets
Here we provide details on the additional datasets
analyzed in Table 2. Note that ADM-G-U refers to
the two-stage up-sampling stack in which images are
generated at a resolution of 64×64 and subsequently
up-sampled to 256×256 pixels using a second model
(Dhariwal and Nichol, 2021). The generated images
are obtained according to the instructions given in the
previous section.
Due to the relevance of facial images in the con-
text of deepfakes, we also include two DMs not yet
considered, P2 and ADM’ (Choi et al., 2022), trained
on FFHQ (Karras et al., 2019). ADM’ is a smaller
version of ADM with 93 million instead of more
than 500 million parameters [9].

[7] https://github.com/luping-liu/PNDM
[8] https://github.com/CompVis/latent-diffusion
[9] https://github.com/jychoi118/P2-weighting#training-your-models

P2 is similar to ADM’
but features a modified weighting scheme which im-
proves performance by assigning higher weights to
diffusion steps where perceptually rich contents are
learned (Choi et al., 2022). We download checkpoints
for both models from the official repository and sam-
ple images according to the authors’ instructions.
Real images from LSUN (Yu et al., 2016), Ima-
geNet (Russakovsky et al., 2015), and FFHQ (Kar-
ras et al., 2019) are downloaded from their official
sources. Images from LSUN Cat/Horse, FFHQ, and
ImageNet are resized and cropped to 256×256 pixels
by applying the same pre-processing that was used
when preparing the training data for the model they
are compared against. For all datasets we collect
10000 real and 10000 generated images.
Images from Stable Diffusion [10] are generated using the diffusers library [11] with default settings. For each version, we generate 10000 images using prompts from DiffusionDB (Wang et al., 2022b). Since Midjourney [12] is proprietary, we collect 300 images created using the "--v 5" flag from the official Discord server. As real images, we take a subset of 10000 images from LAION-Aesthetics V2 [13] with aesthetics scores greater than 6.5. For the detection experiments, we use the entire images; for computing frequency spectra, we take center crops of size 256×256.
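For illustration, a minimal sketch of this generation setup using the public diffusers API (the model identifier and prompt below are stand-ins, not our exact configuration):

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint from the Hugging Face Hub (identifier is an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16).to("cuda")

# Prompts would be taken from DiffusionDB; this one is a stand-in.
prompt = "a photograph of a cozy bedroom, golden hour lighting"
image = pipe(prompt).images[0]
image.save("sample.png")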
[10] https://stability.ai/blog/stable-diffusion-public-release
[11] https://huggingface.co/docs/diffusers/index
[12] https://www.midjourney.com
[13] https://laion.ai/blog/laion-aesthetics/