Proxy Embeddings for Face Identification among Multi-Pose Templates
Weronika Gutfeter (https://orcid.org/0000-0001-6359-8220) and Andrzej Pacut (https://orcid.org/0000-0003-3489-8990)
NASK - Research and Academic Computer Network, Warsaw, Poland
Keywords:
Biometrics, Face Identification, Proxy Embeddings, Multi-view Image Recognition.
Abstract:
Many large-scale face identification systems operate on databases containing images showing heads in multiple poses (from frontal to full profile). However, as shown in this paper, off-the-shelf methods are not able to take advantage of this particular data structure. The main idea behind our work is to adapt methods proposed for multi-view and semi-3D object classification to the multi-pose face recognition problem. The proposed approach involves neural network training with proxy embeddings and building the gallery templates out of aggregated samples. A benchmark testing scenario is proposed for this problem, based on linked gallery and probes databases. The gallery database consists of multi-pose face images taken under controlled conditions, and the probes database contains samples of the in-the-wild type. Both databases must be linked, having at least partially common labels. Two variants of the proposed training procedure were tested, namely neighbourhood component analysis with proxies (NCA-proxies) and the triplet margin loss with proxies (triplet-proxies). It is shown that the proposed methods perform better than models trained with the cross-entropy loss and than off-the-shelf methods: Rank-1 accuracy was improved from 48.82% for the off-the-shelf baseline to 86.86% for NCA-proxies. In addition, the transfer of proxy points between two independently trained models is discussed, in analogy to hyper-parameter transfer methodology. Proxy embedding transfer opens the possibility of training two domain-specific networks that share a single identification schema across two datasets.
1 INTRODUCTION
In many face identification applications there is a need to identify low-quality in-the-wild samples against a high-quality gallery with face images in several poses. While this setup is typical for various surveillance systems, police identification engines, etc., the problem is not often distinguished from the standard in-the-wild identification scenario. Here we assume that at least one part of the data can be more controlled and cover a wider range of facial views. In fact, testing the behaviour of such systems requires two specific linked databases. One, which we call the gallery database, consists of several face images in various poses for each individual. The second database, namely the probes database, typically contains many low-quality images. Importantly, both databases must be linked in a way that some individuals are represented in both parts. Databases that meet all these requirements are hardly available, so it was necessary to put some work into adjusting existing sets.
The gallery face images should be of higher quality and obtained under controlled conditions. The acquisition procedure is usually designed by specialists who define the lighting, background, head positions and camera settings. These datasets usually include a fixed set of multiple poses for each subject, like the mug shots taken in police stations or registries of people in administrative databases. The second type of images, namely faces in-the-wild, is the most common type of images in real-world recognition. They can be taken in varying lighting conditions, with noisy backgrounds and using different acquisition devices. Both types of images need to be compared in face identification applications. Two identification problems of the above type can be considered. First, given a watch-list of gallery images, one wants to identify faces within the in-the-wild set. Second, given an in-the-wild image of a face, one wants to identify its corresponding gallery-set label. We are interested here in the latter problem, namely identification of in-the-wild samples in the multiple-pose gallery database. The most important point is to make full use of the gallery resources.
Figure 1: Sample images from the Quis-campi dataset. First row: gallery images of a person in 3 predefined poses (images were cropped to show only the face areas). Second row: the same person in frames from an outdoor surveillance system. Third row: a few frames sampled from a video of a rotating head against a uniform green background.
There exist plenty of very large datasets containing face images in-the-wild, for example Labeled Faces in the Wild (Huang et al., 2007) or the datasets belonging to the NIST face recognition challenges, like the IJB-C IARPA Janus Benchmark (Maze et al., 2018). We are aware that the top algorithms submitted to solve the problems stated in these benchmarks should also perform well in our task. However, it is hard to construct linked databases for in-the-wild to multi-view identification, as defined here, using only unconstrained datasets.
The methods proposed here are based on generating face descriptors with deep convolutional networks. The multi-pose gallery properties enable aggregation of the face descriptors belonging to the same subject. Aggregation can be realized at various processing stages: either by pooling descriptors or by altering the structure of the network and merging the features inside the model. The descriptor-averaging approach can also be applied to off-the-shelf methods, referenced below as baseline methods. However, in Sec. 4.1 it is shown that a solution based on a watch-list created with aggregated baseline gallery templates only slightly outperforms the accuracy of the baseline system that uses frontal images. In other words, the multi-pose properties do not boost the identification quality without re-training the underlying networks. We therefore decided to focus on improving algorithms for generating multi-view face templates.
2 DATASETS PREPARATION
In the proposed identification scenario we require the face image database to consist of two datasets. One dataset, called here the gallery, includes controlled images, where each person is represented by multiple images with various head poses. Controlled images are understood as images having a uniform background, good lighting conditions, no occlusions, and showing a face in one of the predefined views. The second dataset, called here the probes set, contains in-the-wild images of the same individuals as those present in the gallery subset. Assembling these datasets was quite challenging: aside from the widely available separate multi-view face databases and large in-the-wild databases, it is difficult to find face databases with two linked datasets, gallery and in-the-wild, with a non-empty set of common individuals.
A publicly available database of images closest to our requirements is the Quis-campi database (Quis-campi dataset, 2015). Its size is, however, not sufficient considering current trends in training and testing face recognition algorithms. The unprocessed version of this database contains images of 280 users, but the common part of the gallery and probes subsets consists of images of only 170 different persons.
2.1 Quis-Campi - Face and Silhouette Database

Quis-campi is a database for biometric recognition in surveillance environments. It was collected in the SOCIA Lab at the University of Beira Interior in Covilhã, Portugal. A part of it was utilised in a facial identification competition (Proença et al., 2018). It is composed of various types of imaging data divided into two subsets which reflect the division introduced before: a gallery set and a probes set (faces in-the-wild).
In the gallery dataset every person is represented by three high-quality photographs of the full silhouette in a frontal, left-profile and right-profile pose. Additionally, the gallery set includes video sequences showing the rotating head of the same person. The video frames have high resolution, do not contain any additional objects occluding the faces, and are captured indoors against a green uniform background. The second dataset, namely the probes set, was acquired using an outdoor surveillance system. Frames containing faces were extracted automatically by the system and then manually labeled by the authors of the database. These images show people from a distance, not looking into a camera, and in poor lighting conditions.
Data from Quis-campi required many processing steps to be usable in the neural network training procedure. We needed to merge some parts of it to create coherent subsets. Faces were detected automatically using the dlib face detector (King, 2009) and then cropped to contain only the face areas. The resulting 3 face images per person in the gallery are not enough for further processing, so the gallery was extended by selected frames from the rotating-head videos. We hand-picked the frames that contain the head in a certain pose and cropped the images. From each video sequence, 5 frames were extracted: 1 with a frontal face, 2 with profiles (left and right), and 2 with half-profiles (left and right). We therefore ended up with 10 gallery samples per person with predefined labelled poses. The pose labels were saved and known during training. The probes subset was processed using the same face detector, but the pose remained unknown. Some identity labels were manually adjusted, especially in cases where the probe image contained more than one face. For the purpose of our experiment we excluded individuals that were present only in a single dataset (gallery or probes). Eventually, we gathered 1700 images in the gallery subset with 170 individuals, and 2806 images in the probes subset with the same 170 individuals. Every person was represented by exactly 10 gallery samples and, on average, 16.5 probe samples.
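For illustration, a minimal sketch of this detection-and-crop step using the dlib frontal face detector mentioned above; the margin value and the helper name are our assumptions, not part of the original pipeline:

```python
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()  # HOG-based frontal face detector

def detect_and_crop(image_path, margin=0.2):
    """Detect the largest face in the image and return the cropped face area,
    enlarged by a relative margin around the detection box."""
    img = np.array(Image.open(image_path).convert("RGB"))
    dets = detector(img, 1)                   # upsample once to catch small faces
    if not dets:
        return None                           # no face found; sample is skipped
    box = max(dets, key=lambda d: d.width() * d.height())
    dx, dy = int(margin * box.width()), int(margin * box.height())
    top, left = max(box.top() - dy, 0), max(box.left() - dx, 0)
    bottom = min(box.bottom() + dy, img.shape[0])
    right = min(box.right() + dx, img.shape[1])
    return Image.fromarray(img[top:bottom, left:right])
```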
3 MULTI-POSE FACE
TEMPLATES WITH PROXY
EMBEDDINGS
The main idea behind the proposed method is to adapt some of the novel algorithms for multi-view image recognition to the considered task of multi-pose face identification.
The basic approach to working with face images of high pose variation is to apply pose-correction techniques, in which images are transformed to show faces in one common view. In the case of face frontalization, every face is turned into a frontal portrait. Frontalization improves recognition performance, as shown in (Banerjee et al., 2018), yet it causes some information loss due to the uniformity of such a transformation.
To address the problem of effective use of multi-pose images, many other solutions were proposed. They are predominantly based on specialized deep network models that accept input in the form of a multi-view image sequence, like the mv-cnn approach (Su et al., 2015). In mv-cnn identification, features from distinct samples presenting the same subject are pooled at one of the network layers to generate a single aggregated descriptor. The aggregated descriptor is capable of capturing 3D characteristics of a subject shown from various perspectives. However, in the case of a purely convolutional network architecture, the number of views representing one subject is often expected to be fixed.
Since in our database the number of multi-view samples is not as large as it might be in other multi-view object datasets, which can have hundreds of views per object, it was hard to train a new multi-view network. Therefore the multi-view descriptors [MV] are realized as averages of the single-sample descriptors, as sketched below. The probes images are always encoded in a single-view mode. Nevertheless, we point to the mv-cnn architecture as the next step in the method development, once we collect data that will allow us to train the model from scratch. Still, the aggregated descriptors are found to perform better than the single-view ones [SV], see Sec. 4.1.
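A minimal sketch of this [MV] aggregation, assuming a hypothetical `encoder` network that maps a batch of face crops to descriptors (all names here are ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def multi_view_template(encoder, views):
    """Aggregate several single-view descriptors of one subject into a
    single multi-view [MV] template by averaging.

    views: tensor of shape (n_views, C, H, W) holding the gallery images
    of one person (e.g. frontal, left and right profiles).
    """
    with torch.no_grad():
        descriptors = encoder(views)            # (n_views, d)
    template = descriptors.mean(dim=0)          # (d,)
    return F.normalize(template, dim=0)         # l2-normalize the template

def cosine_distance(probe_descriptor, template):
    # Probes are encoded in single-view mode and matched by cosine distance.
    probe = F.normalize(probe_descriptor, dim=0)
    return 1.0 - torch.dot(probe, template)
```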
An idea that we found particularly interesting for the task of managing multi-view data was introduced in (Ho et al., 2019). The authors proposed Pose Invariant Embeddings, which allow training the network simultaneously against multi-view and single-view object representations by introducing a novel similarity metric. The training procedure was presented for 3 image classification and retrieval systems: pose-invariant CNNs, pose-invariant proxies and pose-invariant triplet centers. The first is based on multi-view networks and the latter two utilize the concept of proxy learning proposed in (Movshovitz-Attias et al., 2017). We follow some of the concepts proposed in the last two papers and train our network with two loss functions: the neighbourhood component analysis-based function (NCA) and the triplet margin loss.
3.1 Facial Features Embedding using Neural Networks
The core part of our template modeling method is realized using convolutional networks. We build on the VGG-Face model described in (Parkhi et al., 2015). The network transforms the input samples X = {x_1, x_2, ..., x_N} into the feature embedding space g(X), in which they can be compared. The values of the embeddings g(X) are obtained from the descriptor layer L_D, a fully connected layer of predefined size. In the VGG-Face model the descriptor layer L_D has size 4096 and is the penultimate layer of the whole structure. In our models the size of this layer is set to 2048 (obviously excluding the case when we test off-the-shelf methods). The cross-entropy loss function requires an additional classification layer L_C to compute class-related probabilities. The number of neurons in L_C corresponds to the number of classes in the training dataset; in our case it is equal to the number of users in the Quis-campi dataset, which is 170. The true label of a user for a sample x is denoted c(x). Even if the loss is calculated on the output of the L_C layer (the cross-entropy approach), at the validation stage sample similarities are computed on the output of the descriptor layer L_D by employing a nearest neighbor classifier.

Figure 2: Toy example of learning with two-dimensional proxy embeddings (the network's descriptor layer has size 2) for a 7-class dataset. Left: embeddings after initialization. Colors represent the true class labels. For clarity, the 7 class proxies (star-marked) are distributed uniformly on the unit hypersphere. Right: positions of the embeddings at the final stage. Here the embeddings are not l2-normalized (which is normally done in all other experiments). Embeddings of samples belonging to the same class are well separated and enclose their associated proxies.
Distance-learning techniques, like triplet learning with a max-margin loss, operate directly on descriptors and do not need the classification layer L_C. However, triplet learning requires forming the input samples in a specific way. Triplets are formed from 3 input samples (x, y, z). The first sample x is called the anchor, the second is a sample belonging to the same class as the anchor, c(x) = c(y), and the third belongs to an impostor class, c(x) ≠ c(z). Because of the large number of possible triplet combinations, the training procedure requires mining techniques to find the examples carrying the most meaningful information.
3.2 Application of Proxies
One of the solutions to accelerate the training and to limit the number of triplets is the idea of proxies proposed in (Movshovitz-Attias et al., 2017). Proxies are points embedded in the same space as the descriptors and are intended to represent the original training set. They can play a significant role as a modification of the loss functions used for neural network training.
Each point in the training set has one proxy associated with it. As we decided to use the static proxy assignment, the proxies are directly associated with class labels, and this association does not change during execution of the algorithm. Having a set of proxies P = {p_1, p_2, ..., p_K} and knowing the label c(x) for each input point, we define the proxy of an input point x as the proxy of its class: p(x) = p_{c(x)}. In our method the number of proxies is equal to the number of class labels K; in general they do not have to be equal, but in the source paper this variant gives the best results.
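In code, the static proxy assignment reduces to a fixed lookup table; a minimal sketch with sizes taken from our setup (variable names are ours):

```python
import torch
import torch.nn.functional as F

K, d = 170, 2048   # number of classes (Quis-campi users) and descriptor size

# One proxy per class, randomly initialized and l2-normalized;
# row c holds the proxy p_c of class label c. The proxies are kept
# constant during training, so no gradient is required.
proxies = F.normalize(torch.randn(K, d), dim=1)

def proxy_of(labels):
    """Static assignment: p(x) = p_{c(x)}, a pure lookup by class label."""
    return proxies[labels]
```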
The idea of proxies is applicable to a wide range of training methods. We test it with two types of loss functions: the triplet loss (max-margin loss) and one based on the neighbourhood component analysis (NCA) described in (Goldberger et al., 2005).

Following the training procedure proposed in (Movshovitz-Attias et al., 2017), the standard triplet formulation (x, y, z) is replaced by triplets using proxy embeddings (x, p(x), p(Z)), where p(x) is the proxy representative of the positive example x, and p(Z) is the set of representatives of all negative comparisons. Note that the number of negative comparisons is equal to K − 1, where K is the number of class labels (and of proxy points as well). It is dramatically smaller than in the case of standard triplets.
The triplet loss function with proxy embeddings used in this work will be referred to further as triplet-proxies. It is based on the standard max-margin loss function with the negative and positive embeddings replaced by their proxies:

L_TRI(x, p(x), p(z)) = max( d(g(x), p(x)) − d(g(x), p(z)) + M, 0 )    (1)

where M is the margin parameter, d(g(x), p(x)) is the Euclidean distance between the embedding g(x) of sample x and its proxy embedding p(x), while d(g(x), p(z)) is the distance between the sample embedding g(x) and the proxy embedding of a negative class, c(x) ≠ c(z). The triplet margin loss is one of the loss functions that represent the distance-learning approach in deep neural networks.
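A sketch of loss (1) in PyTorch, under the paper's setting of fixed, l2-normalized proxies; the function name and its arguments are our naming:

```python
import torch
import torch.nn.functional as F

def triplet_proxy_loss(embeddings, labels, proxies, margin=0.1):
    """Max-margin loss (1) with proxy triplets (x, p(x), p(z)): the positive
    is the proxy of the sample's own class, and each of the remaining K-1
    proxies serves as a negative."""
    emb = F.normalize(embeddings, dim=1)                   # g(x), l2-normalized
    dists = torch.cdist(emb, proxies)                      # (B, K) Euclidean distances
    idx = torch.arange(len(labels))
    pos = dists[idx, labels]                               # d(g(x), p(x))
    neg = dists.clone()
    neg[idx, labels] = float("inf")                        # exclude the positive column
    # one triplet per (sample, negative proxy) pair; the inf entries clamp to 0
    losses = torch.clamp(pos.unsqueeze(1) - neg + margin, min=0.0)
    return losses.sum(dim=1).mean()                        # average over the batch
```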
Proxy embeddings can be applied to classification-style learning as well. Neighbourhood component analysis (NCA) with proxy embeddings is an example of such an application. Identification made by a model trained with the NCA loss will be referred to further as NCA-proxies:

L_NCA(x, p(x), p(Z)) = −log [ exp(−d(g(x), p(x))) / Σ_{p(z) ∈ p(Z)} exp(−d(g(x), p(z))) ]    (2)
where the meanings of x, p(x) and p(Z) are similar to those in (1). Both the embeddings of the training set g(X) and the proxy embeddings P are l2-normalized. Proxy points are randomly initialized at the beginning of each training fold and kept constant during the training session.
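Loss (2) can be sketched analogously; note that, following (2), the denominator runs over the negative proxies only (the function name is ours):

```python
import torch
import torch.nn.functional as F

def nca_proxy_loss(embeddings, labels, proxies):
    """NCA loss (2) with proxies: -log of the ratio between exp(-distance)
    to the sample's own proxy and the sum of exp(-distance) over all
    negative proxies."""
    emb = F.normalize(embeddings, dim=1)       # g(x), l2-normalized
    d = torch.cdist(emb, proxies)              # (B, K) Euclidean distances
    idx = torch.arange(len(labels))
    pos = -d[idx, labels]                      # -d(g(x), p(x))
    neg = -d.clone()
    neg[idx, labels] = float("-inf")           # exclude p(x) from the denominator
    # L = -log( exp(pos) / sum_z exp(neg_z) ) = logsumexp(neg) - pos
    return (torch.logsumexp(neg, dim=1) - pos).mean()
```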
In the multi-view variant [MV] of network training we use aggregated embeddings g(X_n) in place of the single-view embeddings g(x). The triplets must be formed according to this modification of the loss function, containing a set of samples instead of a single sample: (X_n, p(X_n), p(Z)).
3.3 Evaluation Scenario and Batch Sampling
For identification evaluation, we divided the database into training and validation parts. As our dataset has two components, we obtained four sets: the training gallery set, the validation gallery set, the training probes set and the validation probes set. This specific structure needs to be considered in the cross-validation division. In each fold, we take N_v samples per person from the gallery set and N_q samples per person from the probes set to be included in the validation part. For that reason, it is required that each user is represented by at least 2·N_v samples in the gallery set and by at least 2·N_q samples in the probes set. In our experiments employing the Quis-campi database, N_v and N_q are both set to 3. A single gallery image of a particular view was drawn randomly when many samples of the same view of an individual were available. Each result discussed in our work is the average over 5 random folds created according to the above guidelines. All of the data in the training set was augmented using random crops, rotation and small colour modifications.
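A schematic of this fold construction, assuming the samples are grouped by person id; this is a simplified sketch that omits the per-view drawing of gallery images:

```python
import random
from collections import defaultdict

def split_fold(gallery, probes, n_v=3, n_q=3, seed=0):
    """Draw one cross-validation fold: per person, move n_v gallery samples
    and n_q probe samples into the validation part, keep the rest for training.

    gallery, probes: dicts mapping person id -> list of sample paths;
    each person needs at least 2*n_v gallery and 2*n_q probe samples.
    """
    rng = random.Random(seed)
    train, val = defaultdict(dict), defaultdict(dict)
    for person in gallery:
        g, p = gallery[person][:], probes[person][:]
        rng.shuffle(g)
        rng.shuffle(p)
        val["gallery"][person], train["gallery"][person] = g[:n_v], g[n_v:]
        val["probes"][person], train["probes"][person] = p[:n_q], p[n_q:]
    return train, val
```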
The code used in our experiments was prepared in the PyTorch framework. VGG-Face with pre-trained weights is used as the baseline model in comparisons. At the validation stage the descriptors are compared using the cosine distance metric. Ranking lists are then formed and the cumulative match curves (CMC) are calculated. We evaluate the algorithms using the Rank-1 and Rank-3 identification metrics. In Table 1 a verification metric is also presented to give the full picture of the algorithms' performance; for this purpose we use the area under the receiver-operator curve (ROC-AUC).
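For reference, a minimal sketch of the Rank-k computation from cosine distances, as used to build the CMC curves (function name ours):

```python
import torch
import torch.nn.functional as F

def rank_k_accuracy(probe_emb, probe_labels, gallery_emb, gallery_labels, k=1):
    """Rank-k identification: a probe counts as a hit if any of its k nearest
    gallery templates (by cosine distance) carries the correct label."""
    probes = F.normalize(probe_emb, dim=1)
    gallery = F.normalize(gallery_emb, dim=1)
    dists = 1.0 - probes @ gallery.t()                   # (n_probes, n_gallery)
    topk = dists.topk(k, dim=1, largest=False).indices   # k nearest gallery entries
    hits = (gallery_labels[topk] == probe_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```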
4 RESULTS AND DISCUSSION
4.1 Baseline: Off-the-shelf Method Evaluation
Identification of the probes images against the multi-pose gallery set can be approached in various ways. One of the simplest, the baseline, is to apply an off-the-shelf facial recognition algorithm. Along this line, we used VGG-Face with pre-trained weights and built several baseline tests. We used this model without any modification: the penultimate layer of the convolutional network produced an output of size 4096 to be used as the face descriptor g(x).
Figure 3: Multi-pose face identification using off-the-shelf methods. Cumulative match score curves computed for the Quis-campi dataset for 5 cases: all, with the gallery samples including all 3 poses and the most similar retrieved; frontal, left, and right, with gallery images only of certain poses; and averaged, with the pose descriptors averaged for each gallery sample. The shaded areas are within 1σ limits. While all the results are inferior, the averaged case gives the best results, frontal is the second, then all. The left and right galleries are the worst.
First, we used the baseline approach to encode all facial templates, regardless of pose (the all test). In the next 3 baseline tests, we compare the probe with only single-pose gallery images. For compatibility, we follow the same cross-validation procedure (Sec. 3.3) for all tests. At every fold, three gallery images are drawn: one frontal image, one left profile and one right profile. We thus tested the probes images against a frontal-face-only watch-list (the frontal test) and repeated the testing with left-side-only (the left test) and right-side-only gallery lists (the right test). The last option we considered uses a gallery with the aggregated descriptor, namely with all three gallery samples averaged (the averaged test). Note that in the all test the gallery set is 3 times larger than in the other four cases. The first image with a correct label is counted as a hit when determining the cumulative match score (minimal distance strategy).
The results shown in Fig. 3 are averages over 5 cross-validation folds.
The highest Rank-1 accuracy is obtained for the watch-list built with the template aggregation approach and is equal to 48.78%. The next best results belong to the frontal-images case (Rank-1 = 43.02%) and to the all-images gallery (Rank-1 = 41.18%). Not surprisingly, watch-lists constructed from profile pictures dramatically decrease the accuracy (23.88% for left profiles and 25.29% for right). The fact that the aggregated descriptors give the best results is promising. They are, however, very close to those obtained with the frontal-faces-only gallery. The profile images seem to degrade identification, hence without retraining the network it is not very beneficial to include them in the gallery watch-list at all.
4.2 Results: Proxy Embeddings in Multi-Pose Face Identification
Evaluation of identification with proxy embeddings is made for models trained using 3 methods: cross-entropy with respect to class labels (called shortly cross-entropy), NCA with proxy embeddings (NCA-proxies) and triplet loss with proxy embeddings (triplet-proxies). As the baseline, we used the best results obtained in the previous experiment, computed with the VGG-Face model where the gallery is created only from frontal views (baseline frontal) and from averaged descriptors (baseline averaged). In every case, the probes samples are identified within the gallery samples.
Two types of matching approaches are considered: the first is to match a probe sample against a single-view (SV) gallery sample, and the second is to match a probe sample against a multi-view (MV) gallery sample. When matching against multi-view gallery samples, the view set is created from 3 gallery samples drawn as a frontal, a left-side and a right-side sample. Validation is performed using the cumulative match score curve; in the final comparison we consider the Rank-1 and Rank-3 metrics. All methods are based on the same network and differ only in the training loss function.
Figure 4: Multi-pose face identification with proxy embeddings. The CMC curves for identification of probes samples in the multi-pose Quis-campi dataset gallery. Each classifier was trained with NCA-proxies and the cross-entropy loss in two versions: with the list created from single-view images (SV) and from multi-view images (MV). For comparison, the best baseline result for the untrained classifier using the VGG-Face model is also shown.

Detailed information about the identification and recognition rates is summarized in Tab. 1. As can be observed in Fig. 4, the best results were obtained for the model trained with NCA-proxies. The NCA-proxies method achieved Rank-1 of 86.67% for the single-view gallery and 86.86% for the multi-view gallery. The triplet-proxies method is slightly worse in this case: its Rank-1 rate was 74.00% for the single-view and 76.35% for the multi-view gallery. A gallery database built of multi-view descriptors gives in all cases better results than the single-view galleries. The relative Rank-1 gain for NCA-proxies is about 78% over the off-the-shelf method and about 32% over the model trained with cross-entropy.
The results for face identification with proxy embeddings are very favorable, especially considering that the method allows us to manipulate the classifier behaviour directly at the embedding layer. This conclusion raises the question whether it is possible to train two models with the same set of proxy points and whether the descriptors from these two models can be used to match samples from different domains.
4.3 Proxy Transfer
As well as the weights of the deep neural network,
the proxy points can be transferred between the train-
ing sessions. The main goal of the proxy-related loss
functions is to establish the embeddings of samples
that belong to a given class to be near the associated
proxy points. Consequently, the embeddings should
maintain this property when trained on two distinct
datasets. The embeddings created by two different
models, but estimated basing on the same set of proxy
points and also the same set of associations, can be
used in a single evaluation scenario. To check this
Table 1: Summary of evaluation metrics for the probe samples identification against the multi-pose gallery samples from the Quis-campi dataset. Rank-1 and Rank-3 identification accuracies are presented along with the area under (AUC) the receiver-operator curve (ROC). The table covers the results of the 3 experiments discussed in this work: identification with off-the-shelf methods (Fig. 3), training with proxy embeddings (Fig. 4) and the application of proxy transfer (Fig. 5).

Method                        Variant                  Rank-1   Rank-3   ROC-AUC
NCA-proxies                   single-view [SV]         86.67%   90.49%   0.9930
                              multi-view [MV]          86.86%   93.43%   0.9941
Triplet-proxies               single-view [SV]         74.00%   83.33%   0.9817
                              multi-view [MV]          76.35%   87.69%   0.9842
Cross-entropy                 single-view [SV]         60.47%   74.98%   0.9708
                              multi-view [MV]          65.88%   81.29%   0.9774
Baseline                      all images gallery       41.18%   56.20%   0.8955
                              frontal images gallery   43.02%   60.12%   0.9033
                              left images gallery      30.78%   46.00%   0.8942
                              right images gallery     32.86%   47.65%   0.8964
                              averaged gallery         48.82%   64.20%   0.9259
Proxy transfer (NCA-proxies)  single-view [SV]         70.13%   83.53%   0.9827
                              multi-view [MV]          72.47%   89.47%   0.9869
To check this idea, we set up an experiment in which we train two independent models that share the architecture and the dictionary of proxy points. By dictionary, we mean the values of the proxy embeddings and the labels of the related classes. The models were trained on two disjoint subsets of the Quis-campi database: one on the gallery set and one on the in-the-wild probes set. The evaluation is performed in the same way as in the previous experiments. Both models were trained employing NCA with proxy embeddings (NCA-proxies). The gallery watch-list is built using the gallery model, either in the form of single-view or multi-view descriptors. The probes are encoded using the probes model and matched with the gallery descriptors during evaluation.
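Schematically, the setup can be sketched as two models sharing one fixed proxy dictionary; `make_backbone` is a hypothetical placeholder for the shared architecture, not the actual VGG-Face code:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared, fixed dictionary of proxy points: embedding values plus the
# class-label association, identical for both training sessions.
K, d = 170, 2048
proxies = F.normalize(torch.randn(K, d), dim=1)

def make_backbone():
    # Hypothetical stand-in for the shared VGG-style network; in the real
    # setup this would be the architecture with a 2048-d descriptor layer.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, d))

gallery_model = make_backbone()              # trained only on gallery images
probes_model = copy.deepcopy(gallery_model)  # same architecture, trained only on probes

# Both training loops minimize the NCA-proxies loss against the SAME
# `proxies` tensor and label mapping, so embeddings of class c from either
# model are pulled toward the same point p_c. At evaluation the gallery
# watch-list is encoded with gallery_model, the probes with probes_model,
# and matching uses cosine distance as before.
```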
Figure 5: Proxy transfer. CMC curves for the identification of probe samples in the multi-pose Quis-campi dataset gallery. For the proxy transfer, two distinct models were trained, one only on the gallery samples and the second only on the probe samples, with the same set of proxy points. All models were trained using the NCA-proxies loss.
Our preliminary results for proxy transfer do not surpass the single-model approach, but they are close despite the independent training. Rank-1 for multi-view gallery identification with proxy transfer is 72.47% and Rank-3 reaches 89.47%. However, the validation results need to be improved because of the relatively high variance of Rank-1 (visible in Fig. 5): its σ is about 5%. The results obtained with proxy transfer are still much better than for the off-the-shelf method.

Our intuition behind this approach is to create the possibility of building two domain-specific models: one for the multi-pose controlled images (possibly based on some modification of the mv-cnn architecture) and the second for the in-the-wild data. The second model can be adapted to images acquired from a particular source, e.g. for certain types of surveillance systems.
5 CONCLUSIONS
In this work, a new method for multiple-pose face identification was proposed, based on proxy embeddings in combination with two loss functions: the triplet loss and the neighbourhood component analysis loss. A benchmark scenario was introduced for training and testing image recognition from uncontrolled in-the-wild probes against a multi-view gallery. The application of the new method results in a large increase of the identification rates. Alongside, it was demonstrated that one cannot benefit from multi-pose image databases without changing the model structure or retraining the network for the specific task.

There is still space for realizing the full potential of the presented methodology. The most important step would consist of creating larger and more diversified datasets. It would then be possible to perform parallel training of two models with different architectures: one specialized in multi-pose controlled face photographs and the second for uncontrolled images or for images coming from a predefined type of capture device. It was shown here that the network training can be set up in a way that forces the embeddings in both models to cluster around the same set of proxy points.
ACKNOWLEDGEMENTS
This work was supported by NCBiR grant DOB-BIO7/18/02/2015. Computations made in this paper would not be possible without the support of NVIDIA Corporation, which donated the GPU used by the authors.
REFERENCES
Banerjee, S., Brogan, J., Krizaj, J., Bharati, A., Webster, B. R., Struc, V., Flynn, P. J., and Scheirer, W. J. (2018). To frontalize or not to frontalize: Do we really need elaborate pre-processing to improve face recognition? In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 20-29.

Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. (2005). Neighbourhood components analysis. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 513-520. MIT Press.

Ho, C.-H., Morgado, P., Persekian, A., and Vasconcelos, N. (2019). PIEs: Pose invariant embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.

King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755-1758.

Maze, B., Adams, J., Duncan, J. A., Kalka, N., Miller, T., Otto, C., Jain, A. K., Niggel, W. T., Anderson, J., Cheney, J., and Grother, P. (2018). IARPA Janus Benchmark - C: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158-165.

Movshovitz-Attias, Y., Toshev, A., Leung, T. K., Ioffe, S., and Singh, S. (2017). No fuss distance metric learning using proxies. CoRR, abs/1703.07464.

Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep face recognition. In British Machine Vision Conference.

Proença, H., Nixon, M., Nappi, M., Ghaleb, E., Özbulak, G., Gao, H., Ekenel, H. K., Grm, K., Struc, V., Shi, H., Zhu, X., Liao, S., Lei, Z., Li, S. Z., Gutfeter, W., Pacut, A., Brogan, J., Scheirer, W. J., Gonzalez-Sosa, E., Vera-Rodriguez, R., Fierrez, J., Ortega-Garcia, J., Riccio, D., and De Maio, L. (2018). Trends and controversies. IEEE Intelligent Systems, 33(3):41-67.

Quis-campi dataset (2015). Biometric Recognition in Surveillance Environments. http://quiscampi.di.ubi.pt/. Accessed: 2020-01-01.

Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. G. (2015). Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV.