Privacy-Preserving Face Recognition in Hybrid Frequency-Color Domain

Dong Han 1,2,a, Yong Li 1,b and Joachim Denzler 2,c
1 Huawei European Research Center, Riesstraße 25, 80992 München, Germany
2 Computer Vision Group, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
a https://orcid.org/0000-0002-7782-3457
b https://orcid.org/0000-0002-6920-0663
c https://orcid.org/0000-0002-3193-3300

Keywords:
Privacy-Preserving, Face Recognition, Frequency Information, Color Information, Face Embedding Protection.
Abstract:
Face recognition technology has been deployed in various real-life applications. The most sophisticated deep
learning-based face recognition systems rely on training millions of face images through complex deep neural
networks to achieve high accuracy. It is quite common for clients to upload face images to the service provider
in order to access the model inference. However, the face image is a type of sensitive biometric attribute tied
to the identity information of each user. Directly exposing the raw face image to the service provider poses
a threat to the user’s privacy. Current privacy-preserving approaches to face recognition focus on either con-
cealing visual information on model input or protecting model output face embedding. The noticeable drop in
recognition accuracy is a pitfall for most methods. This paper proposes a hybrid frequency-color fusion approach to reduce the input dimensionality of face recognition in the frequency domain. Moreover, sparse color information is introduced to alleviate the significant accuracy degradation caused by adding differential privacy noise. In addition, an identity-specific embedding mapping scheme is applied to protect the original face embeddings by enlarging the distance among identities. Lastly, secure multiparty computation is implemented to safely compute the embedding distance during model inference. The proposed method performs well on multiple widely used verification datasets and achieves around 2.6% to 4.2% higher accuracy than the state of the art in the 1:N verification scenario.
1 INTRODUCTION
With the development of computational power and
advanced algorithms, state-of-the-art (SOTA) face
recognition (FR) models have achieved quite high ac-
curacy on public open-source datasets. However, privacy concerns have drawn increasing attention with the advance of artificial intelligence (AI). Since deep learning-based methods need enormous amounts of facial data, they carry a higher risk of sensitive information leakage. Therefore, it is necessary to develop mechanisms that protect private information while maintaining the high utility of the FR system.
The acquisition of large-scale face images from
the public through various service providers or or-
ganizations is becoming an important concern. The
storage of face images is restricted, especially for the
original ones, because of the potential misuse of such data for inferring sensitive personal attributes such as ethnicity, religious beliefs, health status, and social status. Hence, since most face recognition systems require access to the raw face image, these concerns restrict traditional FR usage to a certain extent.
Another risk of FR is in face embeddings since
they can be considered as one type of biometric data.
The embedding contains the information that is ex-
tracted from the face, which can be used for identi-
fication. In a scenario where multiple face embedding datasets computed by the same FR system are stored and the same identities exist in different datasets, the FR model can be used to identify the common identities (1:N) or for simple cross-authentication. Demographic information (e.g., sex, age and race) of a target face embedding dataset can also be inferred by re-identification attacks with the help of a corresponding publicly accessible database (Fábián and Gulyás, 2020). Most convolutional neural networks
(CNNs) FR models rely on output embeddings to distinguish faces of different identities. By restricting the solution space of the embedding vector to a normalized face space, an end-to-end decoder can be trained on the texture and landmarks of a face image to recover a predicted normalized face (a front-facing, neutral-expression face) (Cole et al., 2017). Intuitively, face embeddings of the same face computed by different models should be independent and uncorrelated. However, researchers (McNeely-White et al., 2022) show that it is possible to find a linear transformation that maps face embeddings between two models. This poses a threat: one can use the mapping function between two networks to infer an identity of interest in another embedding database.
In this paper, we focus on reducing privacy-sensitive information at two stages of the FR system: the input of the feature extractor and the output face embedding. Most frequency-based FR methods have high-dimensional inputs and do not utilize any color information (Ji et al., 2022; Wang et al., 2022; Mi et al., 2023). We propose a frequency fusion method that selects the most important coefficients across channels to discard redundant information. Sparse visual color information combined with the fused frequency information in the proposed hybrid frequency-color domain forms the raw input to the feature extractor before perturbation is added. Then, the output face embeddings are projected into an identity-protected space, and the separation between embeddings of different identities is enlarged during the mapping process. Unlike most previous FR works, which focus on 1:1 verification as the main performance evaluation, we also report accuracy in the 1:N verification scenario.
In summary, the main contributions of this research work are the following:
- We propose a frequency fusion approach that enables dimensionality reduction for existing frequency domain-based face recognition models.
- A hybrid frequency-color information fusion method is designed to improve recognition accuracy by combining sparse color and frequency information without revealing much visual information.
- The identity-specific separation characteristic of the face embedding protection method is extensively investigated in 1:N face verification.
- Secure multiparty computation (SMPC) is applied to the embedding distance calculation to further enhance robustness against reverse attacks on face embeddings.
2 RELATED WORK
2.1 Privacy-Preserving Face
Recognition
The privacy-preserving face recognition (PPFR) technique has attracted increasing attention from both academia and industry. In general, users with limited computation power access an FR service by uploading their face image to a service provider that holds a model pre-trained on similar datasets. In this scenario, the user's face image is directly exposed to the service provider during the inference stage. Therefore, when the privacy of facial information is a concern or the user is simply not willing to share the original face image, the utility of such face recognition is affected. For the privacy protection of face recognition, the traditional way is to add certain distortions such as blur, noise and masks to face images to reduce the privacy risk (Korshunov and Ebrahimi, 2013). These naive distortions yield unsatisfactory recognition performance, and the original face is relatively easy to reconstruct. Homo-
morphic encryption is an encryption technique that
ensures data privacy by encrypting the original data
and enabling computations to be performed on the
encrypted data (Ma et al., 2017). Nevertheless, the
encryption method necessitates significant additional
computational resources. Differential privacy (DP) is
another typical way to protect privacy by adding per-
turbation to the original or preprocessed face images
(Chamikara et al., 2020). To strengthen information privacy for face recognition, researchers (Wang et al., 2022) propose privacy-preserving FR in the frequency domain (PPFR-FD), which selects pre-fixed subsets of frequency channels and applies operations such as channel shuffling, channel mixing, and discarding the lowest frequency channel. Learnable DP noise has also been introduced to reduce the visual information in the frequency domain while maintaining a high recognition utility (Ji et al., 2022).
2.2 Frequency Domain Learning
Discrete cosine transform (DCT) is a powerful trans-
formation technique in image processing, commonly
used in JPEG encoding (Wallace, 1991). DCT repre-
sents images in the form of cosine waves. For human
observers, the major visual information inside the im-
age is contributed by the low frequency, while the
high frequency only contains subtle visual informa-
tion. Image data is the major input format for most CNNs. To accelerate neural network training, the traditional RGB image inputs can be replaced by the corresponding DCT coefficients (Gueguen et al., 2018). The frequency representation has been used for different image tasks, including image classification (Ulicny and Dahyot, 2017) and segmentation (Lo and Hang, 2019). Avoiding the use of all frequency channels of the DCT representation and selecting only a small number of low-frequency channels for training can also maintain relatively high accuracy (Xu et al., 2020).
Figure 1: Overview of the proposed privacy-preserving FR framework in the hybrid frequency-color domain.
2.3 Color Domain Learning
Color information is considered to be the primary
and significant element within an image since it is
strongly associated with objects or scenes. Conse-
quently, it is widely recognized and utilized as a fun-
damental feature in the fields of image recognition
and retrieval. The local binary patterns (LBP) tech-
nique is designed for texture description (Ojala et al.,
1996) and it has proven to be highly discriminating
for FR due to different levels of locality. The face
image is partitioned into several patches, and tex-
ture descriptors are retrieved from each region sep-
arately. Then the descriptors are combined to pro-
vide a comprehensive depiction of the facial features.
The corresponding histograms from descriptors can
be used as feature vectors for the FR model (Aho-
nen et al., 2006). However, LBP only extracts texture information from the image; color information is ignored. The goal of the color-related local binary pattern (cLBP) (Xiao et al., 2020) is to learn the most important color-related patterns from the decoded LBP (DLBP) so that color images can be recognized. In their work, the LBPs are computed from the proposed relative similarity space (RSS) color space in addition to the RGB channels. Then the LBP is converted into the DLBP by the decoder map.
2.4 Face Embedding Protection
There are two main types of methods for face em-
bedding protection: handcrafted-based and learning-
based algorithms. Handcrafted-based methodologies
employ algorithmically defined transformations to
convert face embeddings into more secure represen-
tations (Pandey and Govindaraju, 2015; Drozdowski
et al., 2018). Normally, a learning-based method is associated with the feature extractor in the network. A CNN-based protection method learns a mapping function to convert the extracted feature vector into maximum entropy binary (MEB) codes (Kumar Pandey et al., 2016). The bioconvolving method (Abdellatef et al., 2019) can generate cancelable biometric embeddings directly from the deep features of a CNN.
PolyProtect (Hahn and Marcel, 2022) is one type
of handcrafted approach that is able to convert origi-
nal face embeddings to protected ones by using mul-
tivariate polynomials. It can be directly used as an
independent module after the feature extractor. It is
incorporated into our proposed framework.
3 METHODOLOGY
In this section, the proposed FR framework is illus-
trated in Figure 1.
The whole framework consists of three stages in-
cluding color learning, training and deployment. In
the color learning stage, the DLBP and LBP are ex-
tracted to represent the local color details of the origi-
nal image, which are later mixed with fused frequency
information. In the training stage, the size of the DCT image is [H, W, C], where C is the number of frequency channels. The frequency fusion module then reduces the channel number to C/3 before pixel-wise DP noise is added through the noise model. In the deployment stage, the client uploads a perturbed frequency-color hybrid representation to the server side for feature extraction. In addition, compared to a traditional FR system, an embedding mapping based on PolyProtect is incorporated to protect the original face embeddings and to enlarge the distance among different identities via the identity-specific pairs C, E defined by the client. Lastly, SMPC serves as extra protection to safely compute the distance during the verification stage.
3.1 Block Discrete Cosine Transform
(BDCT)
Figure 2: BDCT operation on a face image.
BDCT splits images into different blocks and con-
verts them into frequency representations, and it can
be used for both color and grayscale images. As de-
picted in Figure 2, the original RGB image is upsam-
pled by 8 times and transformed into YCbCr color
space before the BDCT operation in order to keep the
global face structure in each frequency channel. In this case, the DCT block size is 8 × 8 pixels, and therefore each channel yields 64 DCT images. The frequency coefficients range from -1024 to 1023. The Y component contains the most perceptually obvious grayscale information about the image content, while Cb and Cr carry the color information. In addition, the output DCT images in the first column are the direct current (DC) components, which represent the lowest frequency information.
The DCT coefficients in the top-left channel correspond to the lowest frequency channel, shown in Figure 3a. Moving away from this channel in any direction (horizontal, vertical or diagonal), the coefficients correspond to higher frequencies, and the bottom-right channel corresponds to the highest frequency. The low-frequency channels contain more visual structure than the high-frequency ones.
Figure 3: (a) DCT face images in the Y channel and (b) the DCT energy distribution.
The frequency image energy G is defined as

G = \sum_i \sum_j I(i, j)^2    (1)

where I is the DCT representation and (i, j) denotes the position of each coefficient.
According to Equation 1, the energy of each fre-
quency is calculated and the percentage of energy dis-
tribution among all channels is shown in Figure 3b.
The lowest frequency channel accounts for around
99% of energy among all 64 channels.
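The following is a minimal sketch, assuming a single color plane that has already been upsampled to 896 × 896 pixels, of how the 64 frequency channels and the per-channel energy of Equation 1 can be obtained; the input array is a random stand-in for the Y plane, and scipy is used purely for illustration rather than the paper's actual pipeline.

```python
import numpy as np
from scipy.fft import dctn

H = W = 896          # 112 x 8 after the x8 upsampling
B = 8                # DCT block size
y = np.random.rand(H, W)            # stand-in for the Y plane of the YCbCr image

# Split into 8x8 blocks and apply a 2D DCT-II to each block.
blocks = y.reshape(H // B, B, W // B, B).transpose(0, 2, 1, 3)   # (112, 112, 8, 8)
coeffs = dctn(blocks, axes=(-2, -1), norm="ortho")

# Regroup so that coefficient (u, v) of every block forms one 112x112
# "frequency channel"; 64 such channels per color plane.
dct_images = coeffs.transpose(2, 3, 0, 1).reshape(B * B, H // B, W // B)

# Per-channel energy G = sum_ij I(i, j)^2 (Equation 1) and its percentage.
energy = (dct_images ** 2).sum(axis=(1, 2))
print((energy / energy.sum() * 100).round(2))   # for real faces, the DC channel dominates (~99%)
```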
3.2 Frequency Fusion (FF)
The dimensionality of the BDCT output is 189 even after dropping the DCs. This requires a large amount of storage space if the BDCT images need to be stored, and it also makes model convergence in the training stage more difficult. To address this problem, we propose a frequency fusion scheme to reduce the dimensionality.
Figure 4: Cross-channel frequency fusion on BDCT im-
ages.
Inspired by previous studies, a DCT coefficient with a high absolute value indicates high importance for the visual structure. Therefore, it is possible to combine BDCT images across channels. As illustrated in Figure 4, for the BDCT images of the Y, Cb and Cr channels at the same frequency level, the value with the highest absolute magnitude at each pixel position is selected as the final coefficient; the sign of the selected value is kept. After frequency fusion, the fused BDCT image contains only 63 channels, one third of the input dimension used in (Ji et al., 2022).
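A minimal sketch of this cross-channel fusion rule is given below, assuming the 64-channel BDCT output per color plane from Section 3.1; the tensor is a random stand-in rather than the output of the actual preprocessing pipeline.

```python
import torch

bdct = torch.randn(3, 64, 112, 112)       # stand-in BDCT output: (Y, Cb, Cr) x 64 DCT images
bdct = bdct[:, 1:, :, :]                   # drop the three DC channels -> (3, 63, H, W)

# For each frequency level and pixel, keep the Y/Cb/Cr coefficient with the
# largest magnitude; torch.gather preserves its sign.
idx = bdct.abs().argmax(dim=0, keepdim=True)       # (1, 63, H, W)
fused = torch.gather(bdct, 0, idx).squeeze(0)       # (63, H, W)
print(fused.shape)                                  # torch.Size([63, 112, 112])
```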
3.3 Color Information Descriptor
As only frequency information is used for recognition in the frequency domain, the color information of the original RGB image is completely discarded, which hinders recognition accuracy. The goal of the proposed color information descriptor is to extract and transfer color information into the representation without preserving much visual structure. Our implementation directly uses the decoded local binary pattern (DLBP) as a sparse representation of color information, drawing inspiration from (Xiao et al., 2020). Additionally, classical LBPs are also computed for comparison.
Figure 5: DLBP features. They are computed based on the
same example RGB image used in Figure 2.
As shown in Figure 5, only a few images contain some ambiguous visual information. Most facial details and contours are barely perceptible, in contrast with traditional LBP features.
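For illustration, a minimal per-channel LBP computation along the lines described in Section 4.2 is sketched below; the face crop is a random stand-in and skimage is used only as an example implementation. The DLBP decoding of (Xiao et al., 2020) is not reproduced here.

```python
import numpy as np
from skimage.feature import local_binary_pattern

rgb = (np.random.rand(112, 112, 3) * 255).astype(np.uint8)   # stand-in aligned face crop

# 8 neighbors at radius 1 give an 8-bit code per pixel, computed per channel.
lbp_per_channel = np.stack(
    [local_binary_pattern(rgb[..., c], P=8, R=1, method="default") for c in range(3)],
    axis=0,
)
print(lbp_per_channel.shape, lbp_per_channel.max())   # (3, 112, 112), codes in [0, 255]
```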
3.4 Hybrid Frequency-Color Fusion
(HFCF)
Figure 6: Frequency-color sorting.
To the best of our knowledge, there is no established way to combine or mix frequency and color information. For the frequency information, even though it is known that the frequency increases from the upper-left corner to the lower-right corner, the exact ordering cannot be easily observed. Our method is quite intuitive, as shown in Figure 6. First, the fused BDCT images are sorted according to the DCT image energy calculated by Equation 1; the output BDCT images are sorted in descending order and arranged row by row. Second, the DLBP features are sorted by their similarity (e.g., Euclidean distance) to the DC component of the fused BDCT images, also in descending order.
For the sorted frequency and color information, we present several simple fusion schemes (addition, multiplication and concatenation). Note that the LBP features are only used in the concatenation variant, since LBP retains a relatively strong visual structure and is therefore less privacy-preserving than the DLBP features. The hybrid frequency-color representations shown in Figure 7 are the different input options for the model backbone before the DP noise is added.
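A minimal sketch of the frequency-color sorting and the concatenation variant is shown below, assuming 63 fused frequency channels and 63 DLBP maps (with the first DLBP feature already excluded), which together give the 126-channel HFCF-DLBP (concat) input listed in Table 1; the tensors are random stand-ins and the most energetic fused channel is used here as a stand-in for the DC reference.

```python
import torch

fused_bdct = torch.randn(63, 112, 112)   # fused frequency channels (DCs dropped)
dlbp = torch.randn(63, 112, 112)          # DLBP maps (first one already excluded)

# Sort frequency channels by descending image energy (Equation 1).
energy = (fused_bdct ** 2).sum(dim=(1, 2))
freq_sorted = fused_bdct[energy.argsort(descending=True)]

# Sort DLBP maps by their Euclidean distance to a DC reference; the most
# energetic fused channel stands in for the DC component in this sketch.
ref = freq_sorted[0]
dist = ((dlbp - ref) ** 2).sum(dim=(1, 2)).sqrt()
dlbp_sorted = dlbp[dist.argsort()]        # most similar (smallest distance) first

# HFCF-DLBP (concat): stack sorted frequency and color maps channel-wise.
hfcf = torch.cat([freq_sorted, dlbp_sorted], dim=0)
print(hfcf.shape)   # torch.Size([126, 112, 112])
```

The addition and multiplication variants would combine the two sorted stacks element-wise instead of concatenating them along the channel dimension.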
To analyze the privacy-preserving quality, the visual similarity between the proposed hybrid frequency-color representations and the original image is compared. Besides quantifying image compression quality, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) are useful metrics of image similarity; SSIM also reflects human visual perceptual quality to a certain extent.
Figure 7: Hybrid frequency-color information. The DC as well as the first DLBP feature are excluded. The four different fusions provide options for the model backbone input.
Figure 8: PSNR (dB) and SSIM between the original RGB image (the luma channel in YCbCr space is used for the calculation) and the representations in the frequency and color domains. The lower the value, the better the privacy preservation. The first three sorted BDCT and DLBP images are shown in the first and second rows; the LBPs of the R, G and B channels are shown in the last row.
As shown in Figure 8, the DLBP and LBP features have a lower PSNR than the frequency-domain representations, which means they are more privacy-preserving. However, PSNR is less suited to evaluating image quality as perceived by humans. It is evident that the DLBP images, especially the last two, contain less visual information than the LBP ones. According to the SSIM comparison, the DCT representations contain more structural information than DLBP and LBP, and DLBP captures less information than LBP. This shows that DLBP is more difficult to interpret in terms of visual perception.
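A minimal sketch of this PSNR/SSIM comparison is given below, using the skimage metrics; the two arrays are stand-ins for the luma channel of the original image and one candidate representation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

luma = np.random.rand(112, 112)          # stand-in for the Y channel of the original image
candidate = np.random.rand(112, 112)     # stand-in for one DLBP / LBP / BDCT image

psnr = peak_signal_noise_ratio(luma, candidate, data_range=1.0)
ssim = structural_similarity(luma, candidate, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}")   # lower values mean less visual leakage
```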
3.5 Embedding Mapping
Assume the original face embedding is V = [v_1, v_2, ..., v_n] and denote the protected face embedding as P = [p_1, p_2, ..., p_n]. As stated in PolyProtect, the mapping operation is achieved by the following formula; for the first value p_1:

p_1 = c_1 v_1^{e_1} + c_2 v_2^{e_2} + ... + c_m v_m^{e_m}    (2)
where C = [c_1, c_2, ..., c_m] and E = [e_1, e_2, ..., e_m] are vectors that contain non-zero integer coefficients. The first m values of V are mapped into p_1, and p_2 is then calculated from another m consecutive values
after m. There is no obvious criterion for choosing the range of C when the cosine distance metric is used, since it is not sensitive to magnitude changes. For the range of E, it is reasonable to avoid large values, since face embeddings consist of small floating-point values and large powers wipe out certain embedding elements. We keep m = 5 and E in the range [1, 5], as the authors suggest. Since we aim to generate a unique C vector for each identity, a large C range of [-100, 100] is used in our setting.
Another important parameter is the overlap, which indicates the number of values from V that are shared between the computations of consecutive values of P. For instance, v_6 to v_10 are used for the calculation of p_2 when overlap = 0, whereas v_5 to v_9 are selected when overlap = 1, reusing v_5, which was already used in the computation of p_1. Regarding reversibility, when the overlap is 4, there is a high probability of reversing a target P to an approximated V' if the formula and all parameters are known (Hahn and Marcel, 2022).
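The sketch below illustrates a mapping in the style of PolyProtect with the parameter choices above (m = 5, E in [1, 5], a large C range of [-100, 100], configurable overlap); the function and variable names are illustrative and this is not the authors' exact implementation.

```python
import numpy as np

def polyprotect_map(v, C, E, overlap=4):
    """Map an embedding v to protected values using sliding windows of m entries."""
    m = len(C)
    step = m - overlap                      # `overlap` entries are shared between windows
    protected = []
    for start in range(0, len(v) - m + 1, step):
        window = v[start:start + m]
        protected.append(np.sum(C * window ** E))   # Eq. (2) applied to one window
    return np.array(protected)

rng = np.random.default_rng(0)
v = rng.standard_normal(512)                                     # original face embedding (stand-in)
E = rng.integers(1, 6, size=5)                                    # powers drawn from [1, 5]
C = rng.choice([c for c in range(-100, 101) if c != 0], size=5)   # non-zero identity-specific coefficients

p = polyprotect_map(v, C.astype(np.float64), E)
print(p.shape)   # with overlap=4 every window shifts by one entry -> 508 protected values
```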
We address this issue with our SMPC-based method in Section 3.6. Moreover, the experiments in the original PolyProtect paper are mainly performed for 1:1 verification; in our work, PolyProtect is tested in 1:N verification in Section 4.3.
3.6 Secure Multiparty Computation
(SMPC)
Through the integration of SMPC with other cryptog-
raphy methodologies (Evans et al., 2018), it becomes
possible to secretly verify the encrypted biometric at-
tributes of a user with the previously provided data.
In order to tackle the issue of using overlap = 4
in PolyProtect, we propose SMPC-based similarity
computation on protected embeddings, as shown in
Figure 9.
During the verification stage, the secure sharing protocol Π_MULTI is established when comparing the protected embedding P_a with the enrolled embedding P_A from the database.
Figure 9: SMPC for embedding distance computation.
The dot product result Y = P_a · P_A is split into [Y]_1 and [Y]_2 for the client and the server, respectively. Then [Y]_2 and the L2 norm of the enrolled embedding P_A are sent to the client. Lastly, the dot product can be recovered as [Y]_1 + [Y]_2. With all these elements, the client is able to compute the cosine similarity between the protected embedding and each enrolled embedding in the database; the corresponding cosine distance can easily be derived from the cosine similarity. By avoiding sharing the raw protected embedding with the server, the risk of overlap 4 in PolyProtect is resolved. Our implementation is
based on CrypTen (Knott et al., 2021) in PyTorch. For
detailed information about secure sharing, we recommend referring to (Damgård et al., 2012).
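The toy sketch below illustrates only the share arithmetic of Figure 9 with plain additive shares: the dot product Y is split into [Y]_1 and [Y]_2 so that the client can reconstruct the cosine similarity locally. It is not a secure protocol by itself (Y is computed in the clear here); the actual implementation relies on CrypTen, and the embeddings are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
P_a = rng.standard_normal(508)      # protected query embedding (held by the client)
P_A = rng.standard_normal(508)      # enrolled protected embedding (held by the server)

Y = P_a @ P_A                        # dot product that the sharing protocol would compute
Y_2 = rng.standard_normal()          # random additive share kept by the server
Y_1 = Y - Y_2                        # client's share: Y = [Y]_1 + [Y]_2

# The server sends [Y]_2 and ||P_A|| to the client, which reconstructs the
# cosine similarity (and distance) locally without revealing P_a in the clear.
cos_sim = (Y_1 + Y_2) / (np.linalg.norm(P_a) * np.linalg.norm(P_A))
print(1.0 - cos_sim)                 # cosine distance used for verification
```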
4 EXPERIMENTS
4.1 Dataset
The VGGFace2 (Cao et al., 2018) dataset is utilized
as a training dataset; it comprises 3.31 million pho-
tos of 9131 identities, with over 300 images for each
identity and a wide range of posture, age, lighting
and ethnicity. For the 1:1 verification, we explore
efficient face verification benchmark datasets includ-
ing Labeled Faces in the Wild (LFW) (Huang et al.,
2008), Celebrities in Frontal Profile (CFP) (Sengupta
et al., 2016) In-the-Wild Age Database (AgeDB)
(Moschoglou et al., 2017) to check the model perfor-
mance. Besides the most widely used, we also report
the performance of different models on datasets (e.g.,
Cross-Age LFW (CALFW) (Zheng et al., 2017) and
Cross-Pose LFW (CPLFW) (Zheng and Deng, 2018))
with large pose and age variations to test the model
generalization and robustness. For the 1:N verification, a customized dataset constructed from MS-Celeb-1M (Guo et al., 2016) is used. It consists of two parts: a gallery dataset and a query dataset. The former contains only one image per identity; the selected images are used to generate the enrolled embedding database. The latter has five images per identity, and the image used in the gallery dataset is excluded during the selection process.
4.2 Implementation Details
The input RGB face image is aligned (using MTCNN (Zhang et al., 2016)) and resized to 112 × 112 pixels. The image is then upsampled by 8 times (through bilinear interpolation) in order to keep the global visual structure in the frequency domain. The upsampled image is converted into the YCbCr color space before performing BDCT with a block size of 8 × 8 pixels. Part of the implementation is based on TorchJPEG (Ehrlich et al., 2020).
The initial BDCT image has 192 channels in total.
After dropping the DCs and applying the proposed
frequency fusion scheme illustrated in Figure 4, the
fused BDCT image only contains 63 channels. Then
the LBP and DLBP features are computed based on
the original RGB face using the proposed color in-
formation descriptor. The LBP is computed by con-
verting each pixel into a binary number (8 digits) in
comparison with its 8 neighbors. We calculate the
LBP feature separately on each channel. The DLBP
is computed by our implementation of LBP decoding
(Xiao et al., 2020). Then, the proposed frequency-
color sorting and hybrid frequency-color information
scheme combine the fused BDCT image with LBP
and DLBP features through addition, multiplication
and concatenation; the detailed operations are shown in Figure 6 and Figure 7. Before feeding the hybrid frequency-color information to the backbone, the learnable DP noise (implemented following (Ji et al., 2022)) is added. The baseline model
is based on the ResNet-34 (He et al., 2016) back-
bone. The same random seed is set in all experiments.
The model is trained for 24 epochs with a batch size
of 128. The stochastic gradient descent (SGD) optimizer is used with a momentum of 0.9 and a weight decay of 0.0005. For the loss function Ar-
cFace (Deng et al., 2019), s is set to 64 and m is set
to 0.3. All experiments are conducted on 2 NVIDIA
Tesla 100 GPUs with the PyTorch framework.
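A minimal sketch of this training setup is given below, assuming torchvision's ResNet-34 with its first convolution widened to the 126-channel HFCF-DLBP (concat) input, a 512-dimensional embedding, and a standard additive-angular-margin (ArcFace) head with s = 64 and m = 0.3; the embedding size and the learning rate are assumptions not stated in the paper, and this is not the authors' exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet34

class ArcFaceHead(nn.Module):
    """Standard additive angular margin loss head (s=64, m=0.3)."""
    def __init__(self, emb_dim, n_classes, s=64.0, m=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        onehot = F.one_hot(labels, cos.size(1)).bool()
        # Add the angular margin m only to the target-class logits, then scale by s.
        logits = self.s * torch.where(onehot, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

backbone = resnet34(num_classes=512)                       # 512-d embedding (assumed size)
backbone.conv1 = nn.Conv2d(126, 64, kernel_size=7,         # widen input to 126 HFCF channels
                           stride=2, padding=3, bias=False)
head = ArcFaceHead(emb_dim=512, n_classes=9131)            # 9131 VGGFace2 identities
opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()),
                      lr=0.1, momentum=0.9, weight_decay=5e-4)   # lr is a placeholder

x = torch.randn(4, 126, 112, 112)                          # stand-in HFCF-DLBP (concat) batch
y = torch.randint(0, 9131, (4,))
opt.zero_grad()
loss = head(backbone(x), y)
loss.backward()
opt.step()
```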
4.3 Experimental Results
4.3.1 1:1 Verification
We compare the results with the SOTA baseline mod-
els: ArcFace is trained with unprotected RGB images
and DCTDP is trained in a frequency domain pro-
tected by perturbation of DP noise. We test our pro-
posed frequency fusion and hybrid frequency-color
method on popular 1:1 verification datasets. The
recognition accuracy is shown in Table 1. It is worth noting that our DCTDP-FF method maintains high accuracy with only 63 frequency channels, compared with DCTDP, which keeps 189.
Table 1: Comparison of 1:1 verification accuracy of different methods. DCTDP-FF denotes the fused frequency domain,
which is applied to the baseline DCTDP. Concat, add and multi denote the different options in HFCF.
Method (%) # Channels LFW CFP-FP AgeDB CALFW CPLFW
ArcFace (Deng et al., 2019) 3 99.70 98.14 95.62 94.28 93.10
DCTDP (Ji et al., 2022) 189 99.64 97.69 95.10 93.87 91.77
DCTDP-FF 63 99.60 97.69 94.95 93.25 91.83
HFCF-LBP (concat) 66 99.58 97.76 94.63 93.60 91.87
HFCF-DLBP (add) 63 99.37 97.77 94.52 92.93 90.55
HFCF-DLBP (concat) 30 96.03 88.57 83.70 83.92 80.87
HFCF-DLBP (concat) 126 99.57 97.69 95.03 92.95 91.70
HFCF-DLBP (multi) 63 99.25 97.50 94.43 93.16 90.87
Besides, the color information from LBP and DLBP is helpful for im-
proving accuracy in some cases. However, there are
no obvious changes since LBP and DLBP are quite
sparse. When only the first 15 channels are selected
from frequency and DLBP in HFCF-DLBP (concat),
the accuracy drops significantly. Owing to their relatively high performance, HFCF-LBP (concat) and HFCF-DLBP (concat) with full channels are selected as the main methods for further investigation.
4.3.2 1:N Verification
The majority of the experiments in previous works
were mainly conducted in a 1:1 verification setting to
evaluate FR performance. However, 1:N verification
is the most common situation in real-world applica-
tion. Therefore, we also test our methods in a 1:N ver-
ification scenario. The enrolled embedding database
is computed based on the images from the gallery
dataset. Each image represents a unique identity and
there are 85742 in total. In the inference stage, an im-
age from a random identity is picked to compare the
distance with the embeddings in the enrolled embed-
ding database.
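The sketch below illustrates how such a rank-k retrieval rate can be measured: each query embedding is compared with every enrolled gallery embedding by cosine similarity, and a query counts as correct at rank k if its true identity appears among the k nearest gallery entries. The gallery size, query count and embeddings are stand-ins (the paper's gallery holds 85742 identities).

```python
import torch
import torch.nn.functional as F

n_ids, dim = 10000, 512                             # stand-in gallery (85742 identities in the paper)
gallery = F.normalize(torch.randn(n_ids, dim))      # one enrolled embedding per identity
queries = F.normalize(torch.randn(1000, dim))       # a batch of query embeddings
true_id = torch.randint(0, n_ids, (1000,))          # ground-truth identity of each query

sims = queries @ gallery.T                          # cosine similarities (embeddings are L2-normalized)
for k in (1, 5, 10):
    topk = sims.topk(k, dim=1).indices              # indices of the k nearest gallery identities
    hit = (topk == true_id.unsqueeze(1)).any(dim=1)
    print(f"rank={k}: retrieval rate {hit.float().mean().item():.3f}")
```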
For a fairer 1:N verification accuracy measurement across different models, the randomness of the selection of query images from the query dataset is fixed. The mean accuracy is calculated over every 1000 query images. As shown in Table 2, in the original embedding case, the model with only frequency fusion applied (DCTDP-FF) has lower accuracy than DCTDP for all three top-rank predictions. The model with LBP features concatenated, HFCF-LBP (concat), performs better than the pure frequency fusion model and also achieves higher accuracy than DCTDP. It is worth noting that the method with DLBP features concatenated has the best performance among all models, achieving 2.6%, 4.3% and 4.2% higher accuracy than the SOTA baseline DCTDP. Therefore, our proposed hybrid frequency-color domain, especially the variant based on DLBP features, HFCF-DLBP (concat), carries more useful information for recognition even though it provides only a trivial visual representation of the original face image. Such a characteristic is suitable for privacy preservation, since most of the facial structure is concealed from the backbone inputs. As shown in Figure 8, the DLBP image has a low SSIM value and only an ambiguous contour is presented.
Apart from the accuracy improvement brought by
the proposed hybrid frequency-color information, the
1:N performance is further enhanced by the identity-
specific embedding mapping. Based on empirical experiments with different C, E parameter settings, a large range of E can degrade recognition accuracy, while the range of C should be large enough to generate distinct combinations (at least more than the number of identities). In terms of privacy preservation, since the proposed FR system requires the user-specific parameters C and E, it is difficult to access the FR system even if the identity image is leaked. In the
rank 1 prediction scenario, the accuracy is increased
by 7.4% for ArcFace, 9.4% for DCTDP, 13.3% for
DCTDP-FF, 8.1% for HFCF-LBP (concat) and 9.3%
for HFCF-DLBP (concat). Furthermore, the accuracy
in the other two rank cases is raised to a considerable
extent when compared to the accuracy computed us-
ing the original embedding setting.
4.3.3 Hard-Case Query Performance
To evaluate the identity-specific embedding mapping,
a challenging query image is chosen for inference on
various models. The results of the query are displayed
in Figure 10.
The selected query image has a large dissimilar-
ity compared with the image used for computing the
enrolled embedding in the database. From Figure
10, DCTDP and DCTDP-FF both failed to recognize
the query identity in the top 5 predictions on orig-
inal embeddings. It shows that frequency informa-
tion might not be enough for the model to extract dis-
criminated information in the hard-case query image.
While HFCF-LBP (concat) as well as HFCF-DLBP
(concat) successfully recognize the query image even
in the top 1 prediction. It is obvious that the hybrid
Table 2: Comparison of 1:N verification accuracy on original embeddings (the direct output of the backbone) and protected embeddings (mapped by PolyProtect with overlap 4) for different methods. The retrieval rates are calculated for different top-rank predictions.
Method (%) | # Channels | Original Embedding: Rank=1, Rank=5, Rank=10 | Protected Embedding: Rank=1, Rank=5, Rank=10
ArcFace (Deng et al., 2019) 3 87.8 93.8 95.3 95.2 97.3 98.1
DCTDP (Ji et al., 2022) 189 79.3 86 88.3 88.7 93.9 95.3
DCTDP-FF 63 75 84 84.9 88.3 93.3 95.1
HFCF-LBP (concat) 66 80.9 88.4 89.9 89.0 94.8 96.4
HFCF-DLBP (concat) 126 81.9 90.3 92.5 91.2 95.7 96.8
Figure 10: Top 5 query results of the hard-case in 1:N verification. The first three rows show the results based on original
embeddings and the last three rows present results based on protected embeddings. From left to right, the predictions are from ArcFace, DCTDP, DCTDP-FF, HFCF-LBP (concat) and HFCF-DLBP (concat), respectively. The query image is highlighted by a yellow box, while a green box indicates a successful prediction.
in the top 1 prediction. It is obvious that the hybrid frequency-color information captures useful visual information even though it is in the form of a sparse representation. Here, we recommend HFCF-DLBP (concat) over HFCF-LBP (concat), since LBP features still contain much more visual structure than DLBP features. For the query results based on protected embeddings, all methods rank the correct query identity first. Besides, it is interesting to observe that the other four predictions show quite high visual dissimilarity, especially for ArcFace and DCTDP. The reason can be that the embedding mapping enlarges the distance among different identities. Another good aspect of this property is that it protects the privacy of the query image by avoiding the return of other visually similar identities. For example, in query results based on original embeddings, even if the correct identity is not among the top 5 predictions, one can still form a rough impression of the appearance by observing the first few predicted identities, because the predictions are correlated with the visual information of the original RGB image. However, protected embeddings are computed by the identity-specific embedding mapping, which enlarges the distance among embeddings based on identity rather than visual similarity. Therefore, it is more difficult to infer or "guess" an approximate appearance by observing the query results when the system fails to recognize the query image.
4.3.4 Embedding Distributions
In order to further verify the separation ability of pro-
tected embeddings, we select 3000 identities from
the gallery dataset and compute the corresponding
protected embeddings as well as the original em-
beddings. To visualize the distribution of high-
dimensional embeddings, dimensionality reduction
has to be applied. Uniform Manifold Approxima-
tion and Projection (UMAP) is chosen because it pre-
serves both the local and global structures of the initial embeddings better than t-distributed Stochas-
tic Neighbor Embedding (t-SNE) (van der Maaten
and Hinton, 2008). Basically, two groups that are sep-
arated in the embedded UMAP space are likewise far
away in the original data. The visualization of em-
beddings is shown in Figure 11 by applying UMAP to
reduce embeddings into two-dimensional representa-
tions.
Figure 11: Distributions of (a) original and (b) protected face embeddings. From left to right, the results are from ArcFace, DCTDP, DCTDP-FF, HFCF-LBP (concat) and HFCF-DLBP (concat). Overlapping regions are highlighted by red circles for better visualization.
For the original embedding distribution in Figure
11a, there are several heavily overlapped regions from
DCTDP, DCTDP-FF and HFCF-LBP (concat) while
such regions are not observed in ArcFace and HFCF-
DLBP (concat). Besides, it is interesting to notice that
the distribution from HFCF-DLBP (concat) is quite
scattered in comparison with the purely frequency-based method, which means that the sparse color information introduced by DLBP can be learned by the feature extractor of the FR model. In the protected embedding distribution shown in Figure 11b, the previously large overlapped regions have shrunk, even though there are still small congested areas. Therefore, the separation capability of the identity-specific embedding mapping scheme is evident.
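For reference, a minimal sketch of the projection used for Figure 11 is shown below, assuming the umap-learn package; the embeddings are random stand-ins for the 3000 per-identity embeddings described above.

```python
import numpy as np
import umap   # from the umap-learn package

embeddings = np.random.randn(3000, 512).astype(np.float32)   # one (protected) embedding per identity
proj = umap.UMAP(n_components=2, metric="cosine", random_state=0).fit_transform(embeddings)
print(proj.shape)   # (3000, 2), ready for a scatter plot colored by identity
```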
5 CONCLUSION
In this paper, we have proposed a hybrid frequency-color fusion scheme named HFCF to convert RGB image data into a hybrid domain. There are two main features of HFCF: dimensionality reduction of the frequency information and a sparse visual representation of the color information. HFCF can speed up the CNN training process and improve recognition accuracy with only negligible visual color information. For face embedding protection, the identity-specific embedding mapping scheme together with SMPC converts the embeddings and securely computes the embedding distance. Experimental results show that the proposed FR framework yields excellent performance in both 1:1 and 1:N verification with good privacy preservation for the input data as well as the face embeddings. In future work, we would like to investigate model robustness by performing black-box attacks with an image reconstruction model. It would also be interesting to see whether the proposed method generalizes to other image content in different tasks.
REFERENCES
Abdellatef, E., Ismail, N. A., Abd Elrahman, S. E. S., Is-
mail, K. N., Rihan, M., and Abd El-Samie, F. E.
(2019). Cancelable fusion-based face recognition.
Multimedia Tools and Applications, 78:31557–31580.
Ahonen, T., Hadid, A., and Pietikainen, M. (2006). Face
description with local binary patterns: Application to
face recognition. IEEE transactions on pattern analy-
sis and machine intelligence, 28(12):2037–2041.
Cao, Q., Shen, L., Xie, W., Parkhi, O. M., and Zisserman,
A. (2018). Vggface2: A dataset for recognising faces
across pose and age. In 2018 13th IEEE international
conference on automatic face & gesture recognition
(FG 2018), pages 67–74. IEEE.
Chamikara, M. A. P., Bertok, P., Khalil, I., Liu, D., and
Camtepe, S. (2020). Privacy preserving face recogni-
tion utilizing differential privacy. Computers & Secu-
rity, 97:101951.
Cole, F., Belanger, D., Krishnan, D., Sarna, A., Mosseri, I.,
and Freeman, W. T. (2017). Synthesizing normalized
faces from facial identity features. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 3703–3712.
Damgård, I., Pastro, V., Smart, N., and Zakarias, S. (2012).
Multiparty computation from somewhat homomor-
phic encryption. In Annual Cryptology Conference,
pages 643–662. Springer.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Ar-
cface: Additive angular margin loss for deep face
recognition. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition,
pages 4690–4699.
Drozdowski, P., Struck, F., Rathgeb, C., and Busch, C.
(2018). Benchmarking binarisation schemes for deep
face templates. In 2018 25th IEEE International Con-
ference on Image Processing (ICIP), pages 191–195.
IEEE.
Ehrlich, M., Davis, L., Lim, S.-N., and Shrivastava, A.
(2020). Quantization guided jpeg artifact correc-
tion. In Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part VIII 16, pages 293–309. Springer.
Evans, D., Kolesnikov, V., Rosulek, M., et al. (2018). A
pragmatic introduction to secure multi-party compu-
tation. Foundations and Trends® in Privacy and Se-
curity, 2(2-3):70–246.
Fábián, I. and Gulyás, G. G. (2020). De-anonymizing facial
recognition embeddings. Infocommunications Jour-
nal, 12(2):50–56.
Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosinski,
J. (2018). Faster neural networks straight from jpeg.
Advances in Neural Information Processing Systems,
31.
Guo, Y., Zhang, L., Hu, Y., He, X., and Gao, J. (2016). Ms-
celeb-1m: A dataset and benchmark for large-scale
face recognition. In Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Nether-
lands, October 11-14, 2016, Proceedings, Part III 14,
pages 87–102. Springer.
Hahn, V. K. and Marcel, S. (2022). Towards protecting
face embeddings in mobile face verification scenar-
ios. IEEE Transactions on Biometrics, Behavior, and
Identity Science, 4(1):117–134.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Huang, G. B., Mattar, M., Berg, T., and Learned-Miller,
E. (2008). Labeled faces in the wild: A database
for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images:
detection, alignment, and recognition.
Ji, J., Wang, H., Huang, Y., Wu, J., Xu, X., Ding, S., Zhang,
S., Cao, L., and Ji, R. (2022). Privacy-preserving
face recognition with learnable privacy budgets in fre-
quency domain. In European Conference on Com-
puter Vision, pages 475–491. Springer.
Knott, B., Venkataraman, S., Hannun, A., Sengupta, S.,
Ibrahim, M., and van der Maaten, L. (2021). Crypten:
Secure multi-party computation meets machine learn-
ing. Advances in Neural Information Processing Sys-
tems, 34:4961–4973.
Korshunov, P. and Ebrahimi, T. (2013). Using warping for
privacy protection in video surveillance. In 2013 18th
International Conference on Digital Signal Process-
ing (DSP), pages 1–6. IEEE.
Kumar Pandey, R., Zhou, Y., Urala Kota, B., and Govin-
daraju, V. (2016). Deep secure encoding for face tem-
plate protection. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition
workshops, pages 9–15.
Lo, S.-Y. and Hang, H.-M. (2019). Exploring semantic seg-
mentation on the dct representation. In Proceedings of
the ACM Multimedia Asia, pages 1–6.
Ma, Y., Wu, L., Gu, X., He, J., and Yang, Z. (2017). A se-
cure face-verification scheme based on homomorphic
encryption and deep neural networks. IEEE Access,
5:16532–16538.
McNeely-White, D., Sattelberg, B., Blanchard, N., and
Beveridge, R. (2022). Canonical face embeddings.
IEEE Transactions on Biometrics, Behavior, and Iden-
tity Science, 4(2):197–209.
Mi, Y., Huang, Y., Ji, J., Zhao, M., Wu, J., Xu, X., Ding, S.,
and Zhou, S. (2023). Privacy-preserving face recog-
nition using random frequency components. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 19673–19684.
Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J.,
Kotsia, I., and Zafeiriou, S. (2017). Agedb: the first
manually collected, in-the-wild age database. In pro-
ceedings of the IEEE conference on computer vision
and pattern recognition workshops, pages 51–59.
Ojala, T., Pietikäinen, M., and Harwood, D. (1996). A com-
parative study of texture measures with classification
based on featured distributions. Pattern recognition,
29(1):51–59.
Pandey, R. K. and Govindaraju, V. (2015). Secure face tem-
plate generation via local region hashing. In 2015
international conference on biometrics (ICB), pages
299–304. IEEE.
Sengupta, S., Chen, J.-C., Castillo, C., Patel, V. M., Chel-
lappa, R., and Jacobs, D. W. (2016). Frontal to profile
face verification in the wild. In 2016 IEEE winter con-
ference on applications of computer vision (WACV),
pages 1–9. IEEE.
Ulicny, M. and Dahyot, R. (2017). On using CNN with DCT based image data.
van der Maaten, L. and Hinton, G. (2008). Visualizing data
using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
Wallace, G. K. (1991). The jpeg still picture compression
standard. Communications of the ACM, 34(4):30–44.
Wang, Y., Liu, J., Luo, M., Yang, L., and Wang, L. (2022).
Privacy-preserving face recognition in the frequency
domain. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 36, pages 2558–2566.
Xiao, B., Geng, T., Bi, X., and Li, W. (2020). Color-
related local binary pattern: A learned local de-
scriptor for color image recognition. arXiv preprint
arXiv:2012.06132.
Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y.-K., and Ren,
F. (2020). Learning in the frequency domain. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 1740–1749.
Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint
face detection and alignment using multitask cascaded
convolutional networks. IEEE signal processing let-
ters, 23(10):1499–1503.
Zheng, T. and Deng, W. (2018). Cross-pose lfw: A database
for studying cross-pose face recognition in uncon-
strained environments. Beijing University of Posts and
Telecommunications, Tech. Rep, 5(7).
Zheng, T., Deng, W., and Hu, J. (2017). Cross-age
lfw: A database for studying cross-age face recogni-
tion in unconstrained environments. arXiv preprint
arXiv:1708.08197.