Adversarial Unsupervised Domain Adaptation Guided with Deep Clustering for Face Presentation Attack Detection

Yomna Safaa El-Din¹ (https://orcid.org/0000-0003-4959-4543), Mohamed N. Moustafa² (https://orcid.org/0000-0002-0017-3724) and Hani Mahdi¹ (https://orcid.org/0000-0002-7442-9948)

¹ Computer and Systems Engineering Department, Ain Shams University, Cairo, Egypt
² Department of Computer Science and Engineering, The American University in Cairo, New Cairo, Egypt

Keywords: Biometrics, Face Presentation Attack Detection, Domain Adaptation, Deep Clustering, MobileNet.
Abstract: Face Presentation Attack Detection (PAD) has drawn increasing attention as a means of securing the face recognition systems that are widely used in many applications. Conventional face anti-spoofing methods assume that testing comes from the same domain used for training, and so cannot generalize well to unseen attack scenarios; the trained models tend to overfit to the acquisition sensors and attack types available in the training data. In light of this, we propose an end-to-end learning framework based on Domain Adaptation (DA) to improve PAD generalization capability. Labeled source-domain samples are used to train the feature extractor and classifier via a cross-entropy loss, while unsupervised data from the target domain are utilized in an adversarial DA approach, causing the model to learn domain-invariant features. Using DA alone in face PAD fails to adapt well to a target domain acquired under different conditions, with different devices and attack types than the source domain. Therefore, in order to keep the intrinsic properties of the target domain, deep clustering of target samples is performed. Training and deep clustering are performed end-to-end, and experiments on several public benchmark datasets validate that our proposed Deep Clustering guided Unsupervised Domain Adaptation (DCDA) learns more generalized features and achieves lower classification error on the target domain compared with the state of the art.
1 INTRODUCTION
Face detection and recognition is an important topic in computer vision; it is used in many applications, of which authentication is the most sensitive. With the widespread adoption of smart mobile devices and the incorporation of the latest vision technologies in these devices, end users find it more convenient to use their biometric data for authentication instead of typing classic passwords. On the other hand, this ease of use makes it easier for an attacker to spoof the authentication system using pre-recorded biometric samples of the device user. Hence, the interest in developing reliable anti-spoofing or Presentation Attack Detection (PAD) techniques is increasing. Over the past years, several approaches have been developed in the literature (El-Din et al., 2020b), ranging from basic methods relying on image processing and hand-engineered features to approaches depending on features learnt automatically by deep learning.
These approaches have succeeded in obtaining near-perfect attack detection results in intra-dataset scenarios, where the dataset is split into training and testing subsets, so that both subsets come from the same sensor model and acquisition environment. However, the main drawback of such methods is their lack of generalization to different environments and attack scenarios. The performance of the learnt representations in classifying attack versus bona-fide (real) presentations degrades significantly when test data is captured by a different sensor or under different settings or illumination conditions. In view of this, Domain Adaptation (DA) (Ganin et al., 2016) and Domain Generalization (DG) (Li et al., 2018c) were recently introduced in the PAD field. The target of DG is to learn representations that are robust across different domains, given samples from several source domains, as in (Li et al., 2018a), (Shao et al., 2019), (Jia et al., 2020), while DA aims at adapting a model trained on a labeled source domain to a different target domain. Unsupervised DA (UDA) uses labeled samples from a source domain and unlabeled samples from a target domain, with the goal of achieving low
classification error on the target domain, even though its samples are unlabeled, by learning domain-invariant features.
For example, (Li et al., 2018b) experimented with both hand-crafted and deeply learnt features in DA; however, their approach was not end-to-end and the deep features did not generalize well. They achieved their best results using a combination of hand-crafted features. Adversarial training was used in DA for face PAD in (Wang et al., 2019) to learn an embedding space shared by both the source and target domain models. The training process is still not end-to-end, since source pre-training, embedding adaptation and target classification are performed separately.
In this paper, we develop an end-to-end trainable solution for PAD based on DA, which improves the generalization of the model for cross-dataset testing without needing several labeled source domains as in DG. Existing DA-based solutions solely aim to align the distribution of an unlabeled target domain to that of a different source domain, neglecting the specific nature of the target domain. The target domain in face PAD is a different PAD dataset, usually captured with a different authentication device, in addition to different attack types and illumination conditions. Simply aligning the distribution of such different attack scenarios to the distribution of attack scenarios in the labeled source dataset would not succeed, especially when the device used for authentication in one domain is close to the one used for attack in the other domain, e.g. a mobile device. We therefore propose an approach that utilizes DA for PAD generalization to a different domain without neglecting the intrinsic properties of this target domain. We incorporate clustering based on deeply extracted features to guide the feature extraction network into generating features that are domain-invariant yet maintain the class-wise separability of the target dataset.
The main contributions of this work are: (1) proposing a novel end-to-end DA-based training architecture for the generalization of face PAD; (2) utilizing deep embedding clustering of the target domain to guide the DA process; (3) showing substantial improvement over the state of the art (SOTA) in cross-dataset evaluation on public benchmark face PAD datasets, with close to 0% cross-dataset error. The rest of the paper is organized as follows: Section 2 reviews the latest literature in face PAD and domain adaptation. Our proposed algorithm is explained in Section 3, followed by the experiments, benchmark datasets used and results in Section 4, then conclusions in Section 5.
2 RELATED WORK
2.1 CNN-based Face PAD
Recent software-based face presentation attack detection methods can be mainly categorized into texture-based and temporal-based techniques. Texture-based methods rely on extracting features from the frames that identify whether the presented image is fake or bona-fide. Features can be hand-crafted, such as color texture (Boulkenafet et al., 2016), SIFT (Patel et al., 2016b) or SURF (Boulkenafet et al., 2017), which obtained good results in differentiating real from fake presentations. However, they are often sensitive to varying acquisition conditions, such as camera devices, lighting conditions and Presentation Attack Instruments (PAIs). Hence the need to automatically learn and extract meaningful features directly from the data using deep representations, as in (Nagpal and Dubey, 2018; El-Din et al., 2020b).
In addition to texture-based features, temporal-based models utilize the temporal information in face videos for better detection of attack presentations. Frame difference was combined with deep features in (Patel et al., 2016a). In (Feng et al., 2016), image quality information and motion information from optical flow were combined with a neural network for classification. An LSTM-CNN architecture was used in (Xu et al., 2015), while in (Wang et al., 2018) multiple RGB frames were used to estimate face depth information, and two modules were then used to extract short- and long-term motion.
These methods obtain excellent results in intra-
dataset testing, yet still fail to generalize to unseen
environments and acquisition conditions. They show
high cross-dataset evaluation errors, hence the need to incorporate domain adaptation techniques to decrease the discrepancy between the distributions of the domain used for training and that used for deployment.
2.2 Unsupervised Domain Adaptation
Recently, Domain Adaptation (DA) has been intro-
duced in computer vision, to tackle the problem of
domain shift when applying models trained on a cer-
tain (source) domain to another (target) domain. Sev-
eral methods, such as (Ganin et al., 2016), rely on ad-
versarial training (Goodfellow et al., 2014) to guide
the feature extraction module to generate domain-
invariant features that make it harder for a domain dis-
criminator to decide the original domain of the sam-
ple. Specifically, unsupervised DA uses labeled samples from the source domain in addition to unlabeled samples from the target domain, to train a model that reduces the classification error on the unlabeled target domain.
Inspired by the success of DA in image classifica-
tion (Pei et al., 2018), (Long et al., 2018), (Saito et al.,
2018b), (Saito et al., 2018a), (Kurmi and Nambood-
iri, 2019), (Zhang et al., 2019), (Tang and Jia, 2020),
(Kang et al., 2020), we believe that it can be used to
address the problem of generalization in face PAD.
A model fine-tuned on a certain small-sized face PAD dataset fails to generalize when tested on a different PAD domain: the learnt features become specific to the subjects or sensors available in the source dataset. Hence, by using domain adaptation in face PAD, the model will be guided to learn domain-invariant features that can differentiate between bona-fide and attack face videos regardless of the instance origin. However, learning domain-invariant features can hurt classification on the target face PAD dataset by ignoring the fine-level class-wise structure of this target, since the attack samples are generated with different instruments, and bona-fide samples may be captured by different sensors. Hence, we propose to incorporate deep clustering of target samples to constrain the model to keep the discriminative structure of both classes in the target dataset.
2.3 Deep Unsupervised Clustering
Deep learning has been adopted for clustering of deep visual features since Deep Embedded Clustering (DEC) (Xie et al., 2016). Clustering aims at categorizing unlabeled data into groups (clusters). DEC jointly learns feature representations and cluster assignments: a neural network is first pre-trained by means of an autoencoder and then fine-tuned by jointly optimizing cluster centroids in the output space and the underlying feature representation using Kullback-Leibler (KL) divergence minimization. Later, variants of DEC emerged, such as (Guo et al., 2018), which adds data augmentation.
Unlike DEC, which requires layer-wise pretraining as well as non-joint embedding and clustering learning, DEeP Embedded RegularIzed ClusTering (DEPICT) (Dizaji et al., 2017) uses an end-to-end optimization that trains all network layers simultaneously with unified clustering and reconstruction loss functions. DEPICT consists of a multi-layer convolutional autoencoder followed by a multinomial logistic regression function. The clustering objective uses relative entropy (KL divergence) minimization, regularized by a prior on the frequency of cluster assignments. An alternating strategy is followed to optimize the objective, updating parameters and estimating cluster assignments in turn. A reconstruction loss is employed in the autoencoder to prevent the deep embedding function from overfitting.
Recently, clustering has been introduced in several domain adaptation methods. (Wang et al., 2019) proposed a method to alleviate the effects of negative transfer in adversarial domain matching between source and target representations, simultaneously learning tightly clustered target representations while encouraging each cluster to be assigned to a unique and different class from the source. In (Tang et al., 2020), structural domain similarity is assumed and the clustering solution is constrained using structural source regularization: the KL divergence between the predictive label distribution of the network and an introduced auxiliary distribution is minimized, and replacing the auxiliary distribution with one formed from the ground-truth labels of the source data implements the structural source regularization via a simple strategy of joint network training.
2.4 DA in Face PAD
Domain Adaptation (DA) and Domain Generalization (DG) have recently been utilized to reduce the gap between the target domain and the source domain in face PAD. (Shao et al., 2019) focuses on improving the generalization ability of face PAD methods from the perspective of domain generalization: adversarial learning is used to train multiple feature extractors to learn a generalized feature space, with auxiliary face depth supervision incorporated to further enhance the generalization ability. Later, an end-to-end Single-Side Domain Generalization (SSDG) framework was proposed in (Jia et al., 2020), which learns a generalized feature space where the feature distribution of the real faces is compact while that of the fake ones is dispersed among domains but compact within each domain.
One of the first works exploring DA for face PAD is (Li et al., 2018b), where both hand-crafted and deep-neural-network-learned features are adopted and compared in DA. (Li et al., 2018b) found that deep-learning-based methods may not generalize well under cross-database testing scenarios, and their best results were achieved using concatenated CoALBP and LPQ features in the HSV and YCbCr color spaces. A 3D CNN architecture tailored for spatial-temporal input is proposed by (Li et al., 2018a) for enhancing the generalization capability of the network.
Figure 1: Architecture of the proposed Deep Clustering-guided Domain Adaptation (DCDA) for face PAD. F: feature extraction network, D: domain discriminator, GRL: gradient reverse layer, C: categories classifier, S: source, T: target. Bona-fide images are highlighted with a green border, attack images with a red border. Deep Features Clustering: predicts target pseudo-labels ỹ and cluster centers Z^k. Cluster Assignment: assigns target features to clusters based on Student's t-distribution.
A robust representation across different face spoofing domains is obtained by introducing a generalization loss as the regularization term: given training samples from several domains, the network is optimized such that the Maximum Mean Discrepancy (MMD) distances among different domains are minimized. They performed experiments by combining three publicly available face PAD datasets to create 10 protocols. In each protocol, data from one camera is set aside as the unseen target domain, and a subset of the remaining cameras is used as source domains.
ADA (Wang et al., 2019) was the first to incorporate adversarial domain adaptation in a learning approach to improve face PAD generalization capability. A source model optimized with a triplet loss is first pre-trained on the source domain, and adversarial adaptation is then used to train a target model so that the source and target domain models learn a shared embedding space. Finally, target images are mapped with the target model to the embedding space and classified with a k-nearest-neighbors classifier. However, as the first attempt to use adversarial training for domain adaptation, the training is not performed end-to-end. In (Mohammadi et al., 2020), the authors relied only on bona-fide samples of the target domain for DA. They hypothesize that, in a CNN trained for PAD on a given source domain, some of the filters learned in the initial layers are robust filters that generalize well to the target dataset, whereas others are more specific to the source dataset. They propose to prune the filters that do not generalize well from one dataset to another in order to improve the performance of the network on the target dataset; a Feature Divergence Measure (FDM) is computed to quantify the level of domain shift at a given layer in the CNN.
(Wang et al., 2020) proposed disentangled repre-
sentation learning for cross-domain face PAD. Their
approach consists of Disentangled Representation
learning (DR-Net) and Multi-Domain feature learn-
ing (MD-Net). DR-Net learns a pair of encoders via
generative models that can disentangle PAD informa-
tive features from subject discriminative features. The
disentangled features from different domains are fed
to MD-Net which learns domain-independent features
for the final cross-domain face PAD task. They tested
single-source to single-target as well as multi-source to multi-target cross-domain PAD, and obtained state-of-the-art results on four public datasets. Their later
work (DR-UDA) (Wang et al., 2021) consists of three
modules, ML-Net, UDA-Net and DR-Net. ML-Net
uses the labeled source domain face images to learn a
discriminative feature representation. UDA-Net per-
forms unsupervised adversarial domain adaptation in
order to optimize the source domain and target do-
main encoders jointly, and obtain a common feature
space shared by both domains. Furthermore, DR-
Net disentangles the features irrelevant to specific do-
mains by reconstructing the source and target domain
face images from the common feature space.
3 METHODOLOGY
In this section, we introduce the frameworks of unsu-
pervised DA and unsupervised clustering. Then, we
present our proposed model for UDA in face PAD.
Figure 1 shows a brief overview of the proposed ar-
chitecture.
Since the most common target platform is mobile devices, we follow (El-Din et al., 2020a) and use the latest MobileNet architecture, MobileNetV3 (Howard et al., 2019), instead of the commonly used ResNet-50 (He et al., 2016). MobileNet is tuned for mobile-phone CPUs, which helps preserve mobile battery life by reducing power consumption. With 80% fewer parameters, MobileNetV3 achieves ImageNet accuracy comparable to ResNet-50 with reduced inference time.
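As an illustration, a minimal sketch of how such a backbone could be instantiated in PyTorch, assuming torchvision's mobilenet_v3_large; the paper does not specify this exact wiring, and the 960-dimensional embedding size follows torchvision's implementation:

```python
import torch.nn as nn
from torchvision import models

def build_feature_extractor():
    # ImageNet-pretrained MobileNetV3-Large; the classification head is
    # dropped so the network outputs pooled embedding features F(x).
    backbone = models.mobilenet_v3_large(pretrained=True)
    backbone.classifier = nn.Identity()  # output: 960-dim feature vector
    return backbone
```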
3.1 Deep Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) depends on having a set of labeled source samples $S = \{(x_i, y_i)\}_{i=1}^{N_s}$ and another set of unlabeled samples from a target domain $T = \{x_j\}_{j=1}^{N_t}$. The goal is to train a model that is capable of achieving low classification error on the unlabeled target domain, guided by the labeled source samples. The feature extraction module is trained to extract features that benefit the categories classification without differentiating the domain origin of the sample.
As in DANN (Ganin et al., 2016), adversarial training is incorporated to guide the feature extraction module, $F$, to generate features that confuse a domain discriminator, $D$, so that it cannot determine the domain of the input features. The categories (task) classifier, $C$, is then trained on top of these generated domain-invariant features, using the labeled source samples, to decide the final classification label.
The task classification loss is calculated as
$$L_y^s = \frac{1}{N_s} \sum_{i=1}^{N_s} L_y\big(C(F(x_i)),\, y_i\big), \qquad (1)$$
where $L_y$ is the categorical cross-entropy loss, $F$ is the feature extractor network and $L_y^s$ is the task classification loss over all source samples. Similarly, the
domain discrimination loss is calculated as
$$L_d = \frac{1}{N_s + N_t} \sum_{m=1}^{N_s + N_t} L_d\big(D(F(x_m)),\, d_m\big), \qquad (2)$$
where $L_d$ is the categorical cross-entropy loss and $d_m$ is the domain label, zero for source samples and one otherwise. This loss is minimized over the parameters of $D$ while maximized over the parameters of $F$ via the gradient reverse layer (GRL).
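For concreteness, a minimal PyTorch sketch of a gradient reverse layer realizing this min-max optimization; the lambda scaling factor and its schedule are assumptions here, not values specified by the paper:

```python
import torch

class GradientReverseLayer(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows into the feature extractor F, so
        # minimizing L_d over D simultaneously maximizes it over F.
        return -ctx.lambd * grad_output, None

def grl(x, lambd=1.0):
    return GradientReverseLayer.apply(x, lambd)
```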
3.2 Proposed DC-guided UDA for Face PAD

To handle the problem of generalization in face PAD, we propose to use UDA in combination with Deep Embedding Clustering (DEC) of the unlabeled target samples during training. The motivation for UDA is to alleviate the shift between the source and target domains; however, we do not want to lose the target properties of each class.

Aligning source and target domains in face PAD, with the two domains coming from different sensors and attack instruments, might lead to target samples being misclassified and shifted towards the wrong class. For example, a target mobile attack instance can be assigned to the closest source sample, which might belong to the bona-fide class if bona-fide samples of the source dataset are captured with the same instrument (a mobile device). The motivation for adding target clustering is therefore to preserve the class-wise separation of target domain samples, which, together with adversarial DA, guides $F$ to generate features that reduce domain shift without corrupting the class-wise separability of the target domain.
Algorithm 1: Training of DCDA: Deep Clustering-guided Domain Adaptation for face PAD.

Let {θ_F, θ_C, θ_D} be the learnable parameters for each model component.
Let {Z^k_BF, Z^k_A} be the learnable cluster centers for the bona-fide and attack classes, respectively.
Input: labeled source videos S: (X^s, Y^s), unlabeled target videos T: (X^t), batch size B.
Output: feature extractor F(·), classifier C(·).

Deep Discriminative Clustering (model parameters fixed):
    {z^s_i} = F(x^s_i) for all x^s_i in X^s
    {z^t_j} = F(x^t_j) for all x^t_j in X^t
    Z^{k,s}_c = avg({z^s_i : y^s_i = c}), c in {BF, A}
    ỹ^t, Z^{k,t}_BF, Z^{k,t}_A ← k-means clustering of {z^t_j}, using Z^{k,s}_BF, Z^{k,s}_A as initial centers
    for c in {BF, A} do
        Z^k_c = avg(Z^{k,s}_c, Z^{k,t}_c)
    end for

ep = 0
while ep < max_epochs do
    for b = 0 to iter_per_epoch do
        Draw a random batch {(x^s_i, y^s_i)}_{i=1..B}, {(x^t_j, ỹ^t_j)}_{j=1..B}
        θ_F = θ_F − ∇_{θ_F}(L^s_y + L^t_ỹ + L_d + L^t_cl)
        θ_C = θ_C − ∇_{θ_C}(L^s_y + L^t_ỹ)
        θ_D = θ_D − ∇_{θ_D} L_d
        Z^k_c = Z^k_c − ∇_{Z^k_c} L^t_cl, c in {BF, A}
    end for
    Update target pseudo-labels ỹ^t based on the distance of {z^t_j} to Z^k_BF and Z^k_A
    ep = ep + 1
end while
3.2.1 Deep Clustering for DA
Our training follows the unsupervised deep clustering methods (Xie et al., 2016), (Dizaji et al., 2017), which alternate between cluster assignment while fixing the model parameters, and model updates while fixing these cluster assignments. At the start of each epoch, k-means clustering is performed on the deep features generated by $F$ to produce pseudo-labels, $\tilde{Y}^t$, for the unlabeled target samples. Then, during the epoch iterations, two losses based on Kullback-Leibler (KL) divergence (Xie et al., 2016) are minimized to update the parameters of $F$, $C$ and the cluster centroids $Z^k$ via back-propagation.

These learnable centroids $Z^k = \{Z^k_{BF}, Z^k_{A}\}$ for the bona-fide and attack classes are re-updated at the start of each epoch, while fixing the model parameters. Guided by the labels of the source samples and the source features generated by the current $F$, cluster centers for the source domain, $Z^{k,s}_c$, can be obtained in the embedding space. On the other hand, for the unlabeled target samples, k-means clustering is applied to the generated latent features of all target samples; this yields both pseudo-labels for all target instances in training, $\tilde{Y}^t$, and cluster centers for the target domain, $Z^{k,t}_c$. Finally, the learnable cluster center for each class, $Z^k_c$, is updated to be the mean of $Z^{k,t}_c$ and $Z^{k,s}_c$.
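As an illustration, a sketch of this per-epoch initialization step using scikit-learn's KMeans; the function and variable names are hypothetical, and the 0/1 label encoding for bona-fide/attack is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_target_clusters(z_t, z_s, y_s):
    """k-means on target embeddings, initialized from class-wise source centers.

    z_t: (N_t, d) target embeddings; z_s: (N_s, d) source embeddings;
    y_s: (N_s,) source labels (0 = bona-fide, 1 = attack).
    Returns target pseudo-labels and the averaged learnable centers Z^k.
    """
    centers_s = np.stack([z_s[y_s == c].mean(axis=0) for c in (0, 1)])
    km = KMeans(n_clusters=2, init=centers_s, n_init=1).fit(z_t)
    pseudo_labels = km.labels_           # ỹ^t, aligned with the source classes
    centers_t = km.cluster_centers_      # Z^{k,t}
    return pseudo_labels, 0.5 * (centers_s + centers_t)
```

Initializing k-means from the source class centers is what keeps the target cluster indices aligned with the bona-fide/attack semantics of the source labels.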
During the training iterations of an epoch, target samples are used to minimize the KL divergence in two ways. The loss to be minimized can be written as
$$L_{dec} = KL(Q\,\|\,P) + L_{reg} = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} q_{jk} \log \frac{q_{jk}}{p_{jk}} + \sum_{k=1}^{K} \hat{q}_k \log \hat{q}_k, \qquad (3)$$
where $P^t$ is the cluster-assignment distribution for the target samples, $Q^t$ is an auxiliary target distribution, and the purpose of the KL divergence minimization is to decrease the distance between the model-predicted $P^t$ and the distribution $Q^t$. The second term follows (Krause et al., 2010) in incorporating class balance to avoid degenerate solutions, where $\hat{q}_k = \frac{1}{N_t}\sum_{j=1}^{N_t} q^t_{jk}$.
As in (Dizaji et al., 2017), optimization of the loss in Equation 3 alternates between updating the auxiliary distribution $Q^t$ and using $Q^t$ to update the model parameters. $Q^t$ is calculated in closed form as
$$q^t_{jk} = \frac{p^t_{jk} \big/ \big(\sum_{j'} p^t_{j'k}\big)^{\frac{1}{2}}}{\sum_{k'} p^t_{jk'} \big/ \big(\sum_{j'} p^t_{j'k'}\big)^{\frac{1}{2}}}. \qquad (4)$$
For further regularization of the target clustering, we use the previously estimated target pseudo-labels as part of $Q^t$ by setting $q^t_j = 0.5\, q^t_j + 0.5\, \tilde{y}^t_j$.
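A sketch of how Equation 4 and this pseudo-label mixing could be computed in PyTorch; the function and parameter names are hypothetical:

```python
import torch

def auxiliary_distribution(p, pseudo_onehot=None):
    """DEPICT-style auxiliary distribution Q from predictions P, Eq. (4).

    p: (N_t, K) predicted probabilities; pseudo_onehot: optional (N_t, K)
    one-hot pseudo-labels mixed in for the extra regularization step.
    """
    weight = p / p.sum(dim=0, keepdim=True).sqrt()  # p_jk / sqrt(sum_j' p_j'k)
    q = weight / weight.sum(dim=1, keepdim=True)    # normalize over clusters k
    if pseudo_onehot is not None:
        q = 0.5 * q + 0.5 * pseudo_onehot           # q_j = 0.5 q_j + 0.5 ỹ_j
    return q
```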
Then, using the calculated $P^t$ and $Q^t$, the parameters of $F$ and $C$ are updated by minimizing
$$L^t_{cl} = -\frac{1}{N_t}\sum_{j=1}^{N_t}\sum_{k=1}^{K} q^t_{jk} \log p^t_{jk}. \qquad (5)$$
As mentioned earlier, we use KL divergence minimization with the target domain samples for two losses, both of which update the parameters of the feature extraction module $F$ via backpropagation. The first loss additionally updates the classifier $C$ as well, and the second loss updates the cluster centroids $Z^k$. For the first loss ($L^t_{\tilde{y}}$), we set $P^t$ to the classifier prediction probabilities after softmax, $p^t_j = \mathrm{softmax}(C(F(x^t_j)))$, so that it becomes a cross-entropy classification loss using pseudo-labeled target samples.
For the second loss ($L^t_{cl}$), $P^t$ is estimated using the Student's t-distribution to measure the similarity between target features $Z^t$ and cluster centroids $Z^k$, as in (Xie et al., 2016):
$$p^t_{jc} = \frac{\big(1 + \|z^t_j - Z^k_c\|^2/\alpha\big)^{-\frac{\alpha+1}{2}}}{\sum_{c'} \big(1 + \|z^t_j - Z^k_{c'}\|^2/\alpha\big)^{-\frac{\alpha+1}{2}}}.$$
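A corresponding sketch of the Student's t soft assignment in PyTorch; using α = 1, as is the default in DEC, is an assumption here:

```python
import torch

def soft_assignment(z, centers, alpha=1.0):
    """Student's t similarity between embeddings and cluster centers (DEC).

    z: (N, d) target features; centers: (K, d); returns (N, K) probabilities.
    """
    dist_sq = torch.cdist(z, centers).pow(2)                 # ||z_j - Z^k_c||^2
    num = (1.0 + dist_sq / alpha).pow(-(alpha + 1.0) / 2.0)
    return num / num.sum(dim=1, keepdim=True)
```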
Finally, the estimated pseudo-labels for the target samples are used to update the parameters of both the feature extractor $F$ and the classifier $C$ by minimizing the following task classification loss:
$$L^t_{\tilde{y}} = \frac{1}{N_t}\sum_{j=1}^{N_t} L_{\tilde{y}}\big(C(F(x_j)),\, \tilde{y}_j\big), \qquad (6)$$
where $L_{\tilde{y}}$ is the categorical cross-entropy loss.
3.2.2 Complete Model Learning
The complete end-to-end training methodology of our
proposed DC-guided-DA for face PAD is listed in Al-
gorithm 1. We use only one frame per video.
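For concreteness, a condensed PyTorch-style sketch of one iteration of Algorithm 1, reusing the grl, soft_assignment and auxiliary_distribution helpers sketched above; treating the cluster centers as an nn.Parameter registered with the optimizer is one possible realization of the centroid update, not necessarily the paper's exact implementation:

```python
import torch
import torch.nn.functional as TF

def train_step(F_net, C_net, D_net, centers, batch_s, batch_t, optimizer, lambd=1.0):
    # One DCDA iteration. `centers` is assumed to be an nn.Parameter of shape
    # (2, d) registered with `optimizer`, so Z^k is updated by the gradient of
    # the clustering loss, matching the last update line of Algorithm 1.
    (x_s, y_s), (x_t, y_t_pseudo) = batch_s, batch_t
    z_s, z_t = F_net(x_s), F_net(x_t)

    loss_src = TF.cross_entropy(C_net(z_s), y_s)         # L_y^s, Eq. (1)
    loss_tgt = TF.cross_entropy(C_net(z_t), y_t_pseudo)  # L_y~^t, Eq. (6)

    # Domain loss through the GRL: minimized over D, maximized over F.
    z_all = grl(torch.cat([z_s, z_t]), lambd)
    d_lbl = torch.cat([torch.zeros(len(z_s)),
                       torch.ones(len(z_t))]).long().to(z_all.device)
    loss_dom = TF.cross_entropy(D_net(z_all), d_lbl)     # L_d, Eq. (2)

    # Clustering loss L_cl^t, Eq. (5): cross-entropy between a fixed Q and P.
    p_t = soft_assignment(z_t, centers)
    q_t = auxiliary_distribution(p_t.detach(),
                                 TF.one_hot(y_t_pseudo, num_classes=2).float())
    loss_cl = -(q_t * p_t.clamp_min(1e-8).log()).sum(dim=1).mean()

    optimizer.zero_grad()
    (loss_src + loss_tgt + loss_dom + loss_cl).backward()
    optimizer.step()
```

Because each loss only involves the components it should train, a single backward pass reproduces the per-component updates of Algorithm 1: C receives gradients only from the two classification losses, D only from the domain loss, the centers only from the clustering loss, and F from all four.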
4 EXPERIMENTS AND RESULTS
4.1 Face PAD Datasets
Table 1 summarizes the total number of samples
present in each subset of the datasets used, in addition
to the Presentation Attack Instruments (PAI) used and
the sensors used in recording videos for authentica-
tion.
Replay-Attack (Chingovska et al., 2012) is one of the earliest datasets presented in the literature for the problem of face spoofing. It consists of 1200 short videos with resolution 320 × 240 from 50 different subjects. Attack scenarios include
Table 1: Number of samples per class per subset for each used PAD dataset.

| Database | PAI | Sensor used for authentication | Subset | Bona-fide | Attack | Total |
| Replay-Attack | 1) PR (A4); 2) VR on iPhone; 3) VR on iPad | 1) Webcam in MacBook laptop | train | 60 | 300 | 360 |
| | | | devel | 60 | 300 | 360 |
| | | | test | 80 | 400 | 480 |
| MSU-MFSD | 1) PR (A3); 2) high-def VR on iPad; 3) VR on iPhone | 1) Webcam in MacBook Air; 2) FC of Google Nexus5 Mob | train | 30 | 90 | 120 |
| | | | test | 84 | 120 | 204 |
| Replay-Mobile | 1) PR (A4); 2) VR on matte-screen | 1) FC of iPad Mini2 Tablet; 2) FC of LG-G4 Mobile | train | 120 | 192 | 312 |
| | | | devel | 160 | 256 | 416 |
| | | | test | 110 | 192 | 302 |

FC: Front-Camera, PR: Hard-copy print of high-res photo, VR: Video replay
Table 2: Results of the proposed DC-guided-DA for face PAD in ACER% at threshold 0.5.

| train→test | RA→M | RA→RM | M→RA | M→RM | RM→RA | RM→M | Average |
| Source-only | 34 | 49.8 | 39.4 | 15.6 | 42.3 | 42 | 37.18 |
| DA w/o clustering | 29.6 | 47.2 | 49.25 | 11.35 | 45 | 2.9 | 30.88 |
| DCDA w/o L^t_ỹ | 18.35 | 49.2 | 10.40 | 2.25 | 11.65 | 37.80 | 19.94 |
| DCDA | 0 | 0 | 0.15 | 1.6 | 1.15 | 1.65 | 0.76 |

RA: Replay-Attack, M: MSU-MFSD, RM: Replay-Mobile
Figure 2: t-SNE visualization analysis for the six train→test pairs (RA→M, RA→RM, M→RA, M→RM, RM→RA, RM→M). Panels (a), (b), (e), (f), (i), (j): DA without clustering; panels (c), (d), (g), (h), (k), (l): proposed DC-guided-DA. Blue: source, green: target; ◦: bona-fide, ×: attack. Best viewed in color.
Table 3: Comparison with SOTA in HTER%.

| Method | RA→M | M→RA | Average |
| KSA (Li et al., 2018b) | 18.6* | 23.3* | 20.95 |
| ADA (Wang et al., 2019) | 30.5 | 5.1 | 17.8 |
| PAD-GAN (Wang et al., 2020) | 23.2 | 8.7 | 15.95 |
| SSDG (Jia et al., 2020) | 7.38** | 11.7** | 9.54 |
| DCDA (Proposed) | 0 | 0.15 | 0.08 |

* On concatenated CoALBP and LPQ features in HSV and YCbCr color spaces.
** Source domain includes two other datasets.
"hard-copy print-attack", "mobile-photo attack" and "high-definition screen attack". Attacks are presented to the sensor (a regular webcam) either on a "fixed" tripod, or by an attacker holding the presentation device (printed paper or replay device) in his/her "hand".
The MSU Mobile Face Spoofing Database (MSU-MFSD) (Wen et al., 2015) targets the problem of face spoofing on smartphones. The dataset includes real and spoofed videos from 35 subjects. Two devices were used: the webcam of a MacBook Air with resolution 640 × 480 and the front-facing camera of a smartphone with 720 × 480 resolution. Three attack scenarios are used: a print attack on A3 paper, a video replay attack on the screen of an iPad, and a video replay attack on a smartphone.
Replay-Mobile (Costa-Pazo et al., 2016) was re-
leased by the same research institute that released
Replay-Attack. It has 1200 short videos from 40
subjects captured by two mobile devices at resolu-
tion 720 × 1280. Each subject has ten bona-fide ac-
cesses and 16 attack videos under different attack
modes. Two types of attack are present: photo-print
and matte-screen attack displaying digital-photo or
video.
4.2 Experimental Setup
Our experiments were performed on NVIDIA
GeForce 840m GPU with CUDA version 11.0. Bob
package (Anjos et al., 2012) was used for dataset management, and PyTorch was used for models and training. The evaluation metrics for PAD are the ISO/IEC 30107-3:2017 metrics (https://www.iso.org/standard/67381.html): the Attack Presentation Classification Error Rate (APCER), the Bona-fide Presentation Classification Error Rate (BPCER), and their average, the Average Classification Error Rate ACER = (APCER + BPCER)/2, which is used for reporting results in the tables.
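A small sketch of these metrics at a fixed threshold; the convention that higher scores mean attack is an assumption, and the per-PAI maximization of APCER prescribed by the standard is omitted for brevity:

```python
import numpy as np

def pad_metrics(scores, labels, threshold=0.5):
    # scores: attack-likelihood per sample; labels: 1 = attack, 0 = bona-fide.
    attack = scores[labels == 1]
    bona_fide = scores[labels == 0]
    apcer = np.mean(attack < threshold)      # attacks accepted as bona-fide
    bpcer = np.mean(bona_fide >= threshold)  # bona-fide rejected as attack
    acer = (apcer + bpcer) / 2.0
    return apcer, bpcer, acer
```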
4.3 Results and Discussion
Table 2 presents results of our proposed DC-guided
UDA for face PAD on the 3 benchmark face datasets
used. Results are reported as the average ACER% of three runs; ACER is calculated on the test subset of the target dataset. The first row represents
the results obtained by fine-tuning a MobileNetV3 classification network on the source dataset only, without domain adaptation. We performed experiments to study the influence of each model component on the overall performance of the algorithm. When the clustering components and losses were removed and only domain adaptation was performed, the results in the second row of Table 2 show only a slight improvement over the source-only trained models. Then, adding the clustering components, with target pseudo-label estimation and the target clustering loss $L^t_{cl}$ but without updating the classifier $C$ with the target classification loss $L^t_{\tilde{y}}$, yielded a significant decrease in the target classification error on most datasets, as shown in the third row. However, although the feature extraction network learns domain-invariant features, the classifier trained on source samples only still fails in some cases to achieve low errors on some target datasets. For example, the classifier trained on the Replay-Attack dataset fails to discriminate the attack and bona-fide samples of the Replay-Mobile dataset.
Finally, the last row shows the results obtained by our full proposed DCDA framework, which achieves near-perfect classification of the unlabeled target samples. A comparison with state-of-the-art DA-based face PAD solutions is provided in Table 3, showing the superiority of our proposed DC-guided-DA framework. Furthermore, a t-SNE visualization analysis is presented in Figure 2, comparing our proposed architecture with models trained using domain adaptation only. The visualizations show that our proposed framework aligns the classification boundaries for both source and target datasets; they also show the diversity of attack and sensor types present in the same dataset, which form clusters within the same class of the same dataset, for example Replay-Attack in Figure 2, panels (c), (d), (g) and (k).
5 CONCLUSION AND FUTURE WORK

In this paper, we proposed an approach that exploits unsupervised adversarial domain adaptation guided by target clustering, in order to improve the generalization ability of face PAD. Specifically, our framework utilizes UDA to learn domain-invariant features that leverage the labeled source samples to classify the unlabeled samples from the target domain. At the same time, the approach preserves the intrinsic properties of the target domain via deep clustering of the target embedding features. Our approach is trained in an end-to-end fashion and reaches near-perfect adaptation to the target domain when evaluated on public benchmark datasets, with only 0 - 2% cross-dataset error. Our future work will focus on evaluating on more variable datasets, in addition to reducing the model's dependency during training on target domain samples from both classes, letting the model focus on learning from bona-fide samples with minimal contribution from attack samples.
REFERENCES
Anjos, A., Shafey, L. E., Wallace, R., Günther, M., McCool, C., and Marcel, S. (2012). Bob: a free signal processing and machine learning toolbox for researchers. In 20th ACM Conference on Multimedia Systems (ACMMM), Nara, Japan.
Boulkenafet, Z., Komulainen, J., and Hadid, A. (2016).
Face spoofing detection using colour texture analysis.
IEEE Transactions on Information Forensics and Se-
curity, 11(8):1818–1830.
Boulkenafet, Z., Komulainen, J., and Hadid, A. (2017).
Face antispoofing using speeded-up robust features
and fisher vector encoding. IEEE Signal Processing
Letters, 24(2):141–145.
Chingovska, I., Anjos, A., and Marcel, S. (2012). On
the effectiveness of local binary patterns in face anti-
spoofing. In 2012 BIOSIG - Proceedings of the In-
ternational Conference of Biometrics Special Interest
Group (BIOSIG), pages 1–7.
Costa-Pazo, A., Bhattacharjee, S., Vazquez-Fernandez, E.,
and Marcel, S. (2016). The replay-mobile face
presentation-attack database. In 2016 International
Conference of the Biometrics Special Interest Group
(BIOSIG), pages 1–7.
Dizaji, K. G., Herandi, A., Deng, C., Cai, W., and Huang,
H. (2017). Deep clustering via joint convolutional au-
toencoder embedding and relative entropy minimiza-
tion. In IEEE international conference on Computer
Vision, pages 5747–5756.
El-Din, Y. S., Moustaf, M. N., and Mahdi, H. (2020a). On
the effectiveness of adversarial unsupervised domain
adaptation for iris presentation attack detection in mo-
bile devices. In ICMV’20.
El-Din, Y. S., Moustafa, M. N., and Mahdi, H. (2020b).
Deep convolutional neural networks for face and iris
presentation attack detection: survey and case study.
IET Biometrics, 9:179–193(14).
Feng, L., Po, L.-M., Li, Y., Xu, X., Yuan, F., Cheung, T.
C.-H., and Cheung, K.-W. (2016). Integration of im-
age quality and motion cues for face anti-spoofing: A
neural network approach. Journal of Visual Commu-
nication and Image Representation, 38:451 – 460.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P.,
Larochelle, H., Laviolette, F., Marchand, M., and
Lempitsky, V. (2016). Domain-adversarial train-
ing of neural networks. J. Mach. Learn. Res.,
17(1):2096–2030.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In Ghahra-
mani, Z., Welling, M., Cortes, C., Lawrence, N. D.,
and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 27, pages 2672–
2680. Curran Associates, Inc.
Guo, X., Zhu, E., Liu, X., and Yin, J. (2018). Deep em-
bedded clustering with data augmentation. volume 95
of Proceedings of Machine Learning Research, pages
550–565. PMLR.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.,
Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R.,
Adam, H., and Le, Q. (2019). Searching for mo-
bilenetv3. In 2019 IEEE/CVF International Confer-
ence on Computer Vision (ICCV), pages 1314–1324.
Jia, Y., Zhang, J., Shan, S., and Chen, X. (2020). Single-
side domain generalization for face anti-spoofing. In
Proc. IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Kang, G., Jiang, L., Wei, Y., Yang, Y., and Hauptmann,
A. G. (2020). Contrastive adaptation network for
single-and multi-source domain adaptation. IEEE
transactions on pattern analysis and machine intelli-
gence.
Krause, A., Perona, P., and Gomes, R. G. (2010). Discrim-
inative clustering by regularized information maxi-
mization. In Lafferty, J. D., Williams, C. K. I., Shawe-
Taylor, J., Zemel, R. S., and Culotta, A., editors, Ad-
vances in Neural Information Processing Systems 23,
pages 775–783. Curran Associates, Inc.
Kurmi, V. K. and Namboodiri, V. P. (2019). Looking back
at labels: A class based domain adaptation technique.
In International Joint Conference on Neural Networks
(IJCNN).
Li, H., He, P., Wang, S., Rocha, A., Jiang, X., and Kot,
A. C. (2018a). Learning generalized deep feature rep-
resentation for face anti-spoofing. IEEE Transactions
on Information Forensics and Security, 13(10):2639–
2652.
IMPROVE 2021 - International Conference on Image Processing and Vision Engineering
44
Li, H., Li, W., Cao, H., Wang, S., Huang, F., and Kot,
A. C. (2018b). Unsupervised domain adaptation for
face anti-spoofing. IEEE Transactions on Information
Forensics and Security, 13(7):1794–1809.
Li, H., Pan, S. J., Wang, S., and Kot, A. C. (2018c). Do-
main generalization with adversarial feature learning.
In 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 5400–5409.
Long, M., CAO, Z., Wang, J., and Jordan, M. I. (2018).
Conditional Adversarial Domain Adaptation. In Ben-
gio, S., Wallach, H., Larochelle, H., Grauman, K.,
Cesa-Bianchi, N., and Garnett, R., editors, Advances
in Neural Information Processing Systems 31, pages
1640–1650. Curran Associates, Inc.
Mohammadi, A., Bhattacharjee, S., and Marcel, S. (2020). Domain adaptation for generalization of face presentation attack detection in mobile settings with minimal information. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1001–1005.
Nagpal, C. and Dubey, S. R. (2018). A performance eval-
uation of convolutional neural networks for face anti
spoofing. CoRR.
Patel, K., Han, H., and Jain, A. K. (2016a). Cross-database
face antispoofing with robust feature representation.
In You, Z., Zhou, J., Wang, Y., Sun, Z., Shan, S.,
Zheng, W., Feng, J., and Zhao, Q., editors, Biometric
Recognition, pages 611–619, Cham. Springer Interna-
tional Publishing.
Patel, K., Han, H., and Jain, A. K. (2016b). Secure
face unlock: Spoof detection on smartphones. IEEE
Transactions on Information Forensics and Security,
11(10):2268–2283.
Pei, Z., Cao, Z., Long, M., and Wang, J. (2018). Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence.
Saito, K., Ushiku, Y., Harada, T., and Saenko, K. (2018a).
Adversarial dropout regularization. In International
Conference on Learning Representations.
Saito, K., Watanabe, K., Ushiku, Y., and Harada, T.
(2018b). Maximum classifier discrepancy for unsu-
pervised domain adaptation. In 2018 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 3723–3732.
Shao, R., Lan, X., Li, J., and Yuen, P. C. (2019). Multi-
adversarial discriminative deep domain generaliza-
tion for face presentation attack detection. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 10015–10023.
Tang, H., Chen, K., and Jia, K. (2020). Unsupervised do-
main adaptation via structurally regularized deep clus-
tering. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Tang, H. and Jia, K. (2020). Discriminative adversarial do-
main adaptation. ArXiv, abs/1911.12036.
Wang, G., Han, H., Shan, S., and Chen, X. (2019). Improv-
ing cross-database face presentation attack detection
via adversarial domain adaptation. In 2019 Interna-
tional Conference on Biometrics (ICB), pages 1–8.
Wang, G., Han, H., Shan, S., and Chen, X. (2020). Cross-
domain face presentation attack detection via multi-
domain disentangled representation learning. In 2020
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 6677–6686.
Wang, G., Han, H., Shan, S., and Chen, X. (2021). Unsuper-
vised adversarial domain adaptation for cross-domain
face presentation attack detection. IEEE Transactions
on Information Forensics and Security, 16:56–69.
Wang, R., Wang, G., and Henao, R. (2019). Discriminative
clustering for robust unsupervised domain adaptation.
ArXiv, abs/1905.13331.
Wang, Z., Zhao, C., Qin, Y., Zhou, Q., and Lei, Z. (2018).
Exploiting temporal and depth information for multi-
frame face anti-spoofing. CoRR.
Wen, D., Han, H., and Jain, A. K. (2015). Face spoof
detection with image distortion analysis. IEEE
Transactions on Information Forensics and Security,
10(4):746–761.
Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised
deep embedding for clustering analysis. In Balcan,
M. F. and Weinberger, K. Q., editors, International
conference on machine learning, volume 48 of Pro-
ceedings of Machine Learning Research, pages 478–
487, New York, New York, USA. PMLR.
Xu, Z., Li, S., and Deng, W. (2015). Learning tempo-
ral features using lstm-cnn architecture for face anti-
spoofing. In 2015 3rd IAPR Asian Conference on Pat-
tern Recognition (ACPR), pages 141–145.
Zhang, Y., Tang, H., Jia, K., and Tan, M. (2019). Domain-symmetric networks for adversarial domain adaptation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).