Nearest Neighbor-Based Data Denoising for Deep Metric Learning
George Galanakis 1,2, Xenophon Zabulis 2 and Antonis A. Argyros 1,2
1 Computer Science Department, University of Crete, Greece
2 Institute of Computer Science, FORTH, Greece
Keywords:
Label Noise, Data Denoising, Deep Metric Learning, KNN, Image Retrieval.
Abstract:
The effectiveness of supervised deep metric learning relies on the availability of a correctly annotated dataset,
i.e., a dataset where images are associated with correct class labels. The presence of incorrect labels in a
dataset disorients the learning process. In this paper, we propose an approach to combat the presence of such
label noise in datasets. Our approach operates online, during training and on the batch level. It examines the
neighborhood of samples, considers which of them are noisy and eliminates them from the current training
step. The neighborhood is established using features obtained from the entire dataset during previous training
epochs and therefore is updated as the model learns better data representations. We evaluate our approach using
multiple datasets and loss functions, and demonstrate that it performs better or comparably to the competition.
At the same time, in contrast to the competition, it does not require knowledge of the noise contamination rate
of the employed datasets.
1 INTRODUCTION
Two of the most common tasks related to analysis of
images are classification and retrieval. In their sim-
pler form, they both operate on the image level. In
classification the task is to associate the image with a
class/label from a predefined set of classes, while in
retrieval the goal is to identify a set of similar images
from an existing database, in ranked order. In many
practical scenarios, classes are not known beforehand; in other situations, the lack of existing data may result in inaccuracies. Therefore, learning a classifier on a predefined set of classes is not possible. Instead of learning a classifier, retrieval approaches, usually posed as metric learning, operate on the relative similarity of images and aim to construct embedding spaces where similar images lie in nearby areas, while dissimilar images are far apart. In this work we focus on retrieval and, more specifically, on the case in which there is label noise in the training dataset.
Despite the emergence of off-the-shelf pre-trained
models and the availability of self-supervised and
few-shot learning approaches, the majority of ma-
chine learning algorithms are still data-hungry and re-
quire excessive annotation efforts. Annotation can be
performed either by humans or by some (semi-) auto-
matic approaches. In both cases, annotated data might
contain errors. Causes of these errors include misunderstanding, ambiguity, or other issues inherent to humans and machines that arise during the annotation process.
In the context of this work we are interested in
annotation errors with respect to labels that are as-
sociated with images. For example, an image might
be assigned a label that does not correspond to the
correct object class, e.g. an image is associated with
a dog class but it actually depicts a cat. Our pro-
posed methodology, called Nearest Neighbor-based Data Denoising (N2D2), deals with such and similar types of noise in an online fashion. It identifies the
noisy samples that are contained in the current train-
ing batch and eliminates them from the computation
of the loss, such that they do not negatively affect the
training of the model.
The basic idea behind our approach is depicted in
Fig. 1. The left part of the figure illustrates a clean
dataset comprised of 3 classes, “a”, “b” and “c”, while the right part shows the same dataset where 2 samples, (1) and (3), are mislabeled. Suppose that we want to ex-
amine samples (1)-(3) that are contained in the current
training batch. The samples are assigned with scores
and are classified as noisy or clean - see the method-
ology section for details. In this case, (1) is correctly
identified as noisy; a fairly simple task. Sample (2)
Figure 1: A simple example of the identification of samples as noisy versus clean using our method. See text for details.
is correctly identified as clean despite the presence of
the other samples among the neighbors. (3) is a corner
case, where a noisy sample is identified as clean.
2 RELATED WORK
Noise Types. The existence of falsely labeled train-
ing samples is usually referred to as label noise. A common labeling mistake, especially prevalent in large datasets, is the assignment of an image to a wrong
class. Other types of disturbances in the appearance
of the object can also be considered as noise. Such
disturbances include occlusions of the dominant ob-
ject, co-existence of multiple objects corresponding
to either known or unknown classes, random sensor
noise, etc. The above types of noise have been also
reported as label noise, in the sense that the single la-
bel that is assigned does not represent the entire story
told by the image.
The above types of noise can be found in both syn-
thetic and real-world noisy datasets that have been ex-
ploited by recent noise-aware methodologies. A syn-
thetic dataset can be created by random mutation of
some of the correct labels of an existing clean dataset,
with the mutation being either symmetric - same num-
ber of images from all classes mutate to noisy - or
asymmetric. Another type of synthetic noise, called
small-cluster noise, was introduced by (Liu et al.,
2021). This type of noise eliminates some of the
known classes by randomly assigning corresponding
images to the rest of the classes. This process imitates an open-set scenario, where images from unknown classes are present in the data. On the other
hand, real-world datasets are usually crawled without
supervision from random internet sources, and there-
fore noise is inherent.
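As a concrete illustration of the symmetric case, the sketch below randomly mutates a fraction of labels to other classes; it is a simplified, globally uniform variant (the function name and arguments are our own), whereas the symmetric noise described above mutates the same number of images from each class.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Randomly mutate a fraction of labels to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    noisy_labels = np.asarray(labels).copy()
    n_noisy = int(round(noise_rate * len(noisy_labels)))
    noisy_idx = rng.choice(len(noisy_labels), size=n_noisy, replace=False)
    for i in noisy_idx:
        # draw until the new label differs from the original one
        new_label = rng.integers(num_classes)
        while new_label == noisy_labels[i]:
            new_label = rng.integers(num_classes)
        noisy_labels[i] = new_label
    return noisy_labels, noisy_idx
```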
Noise in Classification Problems. Combating la-
bel noise during training of neural networks has been
studied mostly for classification problems. A popular
family of such methods is called Co-teaching (Han
et al., 2018), which suggested the paradigm of simul-
taneously training two identical networks. The main idea is to update one model based on observations of the other. More specifically, the batch samples that have small loss when passed through one network are used to update the weights of the companion network. This approach aims to find
noisy samples by reducing the bias that each indi-
vidual network has on the samples that it has pre-
viously seen. An extension of this approach, called
Co-teaching+ (Yu et al., 2019), first considers sam-
ples from which the two classifiers disagree on the
predicted label and then selects the subset of samples
that have small loss. Disagreement helps the two net-
works stay divergent during training, else they con-
verge to the same solution and essentially degener-
ate to the same network. The approach in (Xia et al.,
2020) proposed a method to distinguish the critical
parameters and non-critical parameters of the network
and applied different update rules for different types
of parameters. This is based on the observation that
non-critical ones tend to overfit noisy labels and their
gradients are misleading for generalization. The work
in (Yao et al., 2021) proposed the Jensen-Shannon divergence as a measure between the distribution of correct labels and the predicted one in order to identify the noisy samples. They also employ a separate process for distinguishing in-distribution (ID) and out-of-distribution (OOD) noisy samples, and they re-label the ID samples according to a proposed label that is obtained from a mean-teacher network.
A more recent method called SSR (Feng et al.,
2022) combines classification prediction and inspec-
tion of the feature space neighborhood of the samples.
First, samples whose classification confidence is above a threshold but whose predicted class differs from their label are relabeled accordingly. Afterwards, a sample is considered clean if the most popular label among its neighbors is the same as that of the sample. Both the sample relabeling and the noise removal processes are applied to the entire dataset after each training epoch. Our method is similar to SSR in its use of the K Nearest Neighbors (KNN) algorithm for determining noisy samples. However, we apply our method in the context of retrieval, we do not apply any re-labeling mechanism and we also utilize an alternative scoring methodology for determining noisy samples.
Noise in Metric Learning Problems. Deep met-
ric learning approaches are closely related to the
loss function that is used for learning the embedding
space. Some of the loss functions are purposefully designed to deal with label noise. For example, the Sub-Center ArcFace loss (Deng et al., 2020) extends the ArcFace loss (Deng et al., 2019) with multiple centers per class, such that the presence of noisy samples does not considerably affect any single learned centroid. SoftTriple loss (Qian et al., 2019) shares the same idea of multiple centers and is inherently more robust against label noise; however, it was not originally evaluated on noisy datasets. Most recently, CCL
(Cai et al., 2023) adopts the same idea about maintain-
ing a center bank. However, it utilizes a contrastive
loss. We evaluate our method using both robust and
non-robust losses and report the impact in each case.
Since CCL was only recently proposed and its imple-
mentation was not available, we did not include it in
our experiments.
Explicit consideration and elimination of noisy
samples is a less studied topic. Two recent approaches
are PRISM (Liu et al., 2021) and T-SINT (Ibrahimi
et al., 2022). PRISM maintains a memory queue
of the features of all samples and one centroid per
class. Samples are compared using the cosine sim-
ilarity metric against their corresponding class cen-
troid and those with low similarity that fall under a
filtering ratio are considered as noisy. The memory
queue is filled up with samples that are considered
clean. T-SINT proposed a student-teacher architec-
ture, where the teacher model identifies noisy sam-
ples in the pairwise distance matrix and appropriately
guides the student network to update its weights. T-
SINT operates directly on the distance matrix and ex-
cludes only the false positive pairs, i.e. positive pairs
that include the noisy samples. A common shortcom-
ing of PRISM and T-SINT is that they require knowl-
edge of the noise rate of the dataset, which is a strong
requirement.
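For intuition only, a minimal sketch of the centroid-based filtering rule used by PRISM follows; this is our simplified reading of the mechanism rather than the authors' implementation, and the filter_rate argument corresponds to the noise-rate hyperparameter discussed above.

```python
import torch
import torch.nn.functional as F

def centroid_filter(embeddings, labels, class_centroids, filter_rate):
    """Flag the filter_rate fraction of batch samples that are least similar
    to the centroid of their own class (simplified PRISM-like rule)."""
    emb = F.normalize(embeddings, dim=1)
    cen = F.normalize(class_centroids[labels], dim=1)   # centroid of each sample's class
    sims = (emb * cen).sum(dim=1)                       # cosine similarity per sample
    noisy = torch.zeros(len(labels), dtype=torch.bool, device=embeddings.device)
    num_remove = int(filter_rate * len(labels))
    if num_remove > 0:
        noisy[torch.topk(-sims, num_remove).indices] = True
    return noisy
```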
In practice, T-SINT operates very similarly to PRISM on the datasets which we consider for benchmarking, while its performance gain is attributed to the better network architecture it utilizes. Furthermore, it can be considered a more intrusive method with respect to current implementations, as it depends on two concurrent networks (student-teacher), and the source code needed for a fair comparison is not provided. Hence, we chose PRISM as the baseline competitor.
Our method shares some similarities and differ-
ences with PRISM. Both methods keep a memory
bank of the features during training. PRISM does so
for updating the class centroids during training. In-
stead, we use the entire bank once, at the end of each
epoch, to establish the similarities between all dataset
samples as required for the inspection of batch sam-
ples’ neighborhood in subsequent steps. Also, both methods are not directly dependent on the neural network and the loss, and can thus be applied as an extension to already robust networks and losses. A core difference
of our method is that it does not require the knowl-
edge of the filtering rate to determine noisy samples
as PRISM does.
The contributions of the present work can be sum-
marized as follows.
- This work presents a method for accounting for label noise during the supervised training of deep metric learning models, through exploration of the neighborhood of the batch samples using the KNN algorithm.
- Our method is evaluated against baselines and the competing PRISM approach on a range of datasets, noise rates and noise types. It is demonstrated that, in most cases, it performs better than PRISM even though it uses less information than that method requires, i.e., the noise rate of the dataset.
- The method is decoupled from the neural network architecture and the loss function; thus, in contrast to other proposed methodologies, it can be plugged into any existing training pipeline as a step preceding the computation of the loss and/or the mining processes.
3 METHODOLOGY
The general training pipeline of deep metric models
includes the following steps (forward pass): (a) pop-
ulation of samples of a batch through a sampling
method, (b) feeding the samples through the net-
work and extraction of features, (c) mining of pairs or
triplets and, (d) computation of the loss.
input : training batch B containing features X of size N × D and labels y of size N; noise predictor P_n
output: noise prediction pred of size N

assign pred with all false
if P_n has not been trained then
    return pred
else
    foreach x_i, y_i in B do
        /* P_n pipeline */
        find the k nearest neighbors of x_i
        calculate scores s
        if s(y_i) < θ then
            pred(i) ← true
    end
return pred

Algorithm 1: Overview of the proposed algorithm for the identification of noisy samples.
Let X denote the features of the batch samples, of size N × D, where N is the batch size and D the dimensionality of the features. Let also y denote the corresponding labels, drawn from a total of C classes contained in the dataset. We propose extending this pipeline by introducing an extra step, after feature extraction, which determines whether the samples are noisy and, quite similarly to mining, reduces the set of samples that will be considered for computing the loss. We denote this noise predictor module as P_n.
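A minimal sketch of how such a P_n step could be inserted into a standard training step is given below; the noise_predictor object and its predict/remember methods are illustrative assumptions, not the released implementation.

```python
import torch

def training_step(model, loss_fn, noise_predictor, images, labels):
    embeddings = model(images)                                   # (b) feature extraction, N x D
    noisy = noise_predictor.predict(embeddings.detach().cpu().numpy(),
                                    labels.cpu().numpy())        # extra step: flag noisy samples
    keep = torch.as_tensor(~noisy, device=embeddings.device)     # drop flagged samples, like mining
    if keep.sum() == 0:
        return None                                              # nothing left to learn from
    loss = loss_fn(embeddings[keep], labels[keep])               # (c)/(d) mining and loss on the clean subset
    # accumulate features/labels in the epoch-level memory bank for re-indexing at epoch end
    noise_predictor.remember(embeddings.detach().cpu(), labels.cpu())
    return loss
```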
3.1 Identification of Noisy Samples
Our methodology for identifying noisy samples relies
on the realization that the neighborhood of a sample
in the embedding space should mostly contain sam-
ples which correspond to the same class. From another perspective, noisy samples, i.e., samples that are mislabelled, are expected to be surrounded by neighbors that mostly correspond to other classes. However, the above assumption holds only when the embedding space has been sufficiently formed, meaning that samples from the same class are close to each other, while samples from different classes lie further apart.
In order to quantify the purity of the neighborhood of a sample, we obtain statistics of its nearest neighbors. More specifically, we count the class occurrences among the k nearest neighbors and normalize by k. We thereby construct a C-dimensional vector containing the frequency of each class in the neighborhood of the sample. Finally, we scale the vector into the range [0, 1], to compensate for varying neighborhood configurations. Classifying a sample as noisy is then as simple as comparing the score of its corresponding class against a threshold θ; we choose θ = 0.5. If the score of the sample is lower than the threshold, the neighborhood does not support the label of the class and we therefore consider the sample noisy.
The above steps are summarized in Algorithm 1.
At every training iteration, X and y are kept in a memory bank. At the end of all iterations (epoch end), we index the training data using the FAISS library (Johnson et al., 2019), which permits fast computation of the k nearest neighbors in subsequent steps.
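The following sketch illustrates the scoring rule of Algorithm 1 on top of a FAISS index built from the memory bank at the end of an epoch; the class name and the exact normalization are our own choices and may differ in detail from the released code.

```python
import faiss
import numpy as np

class KNNNoisePredictor:
    def __init__(self, num_classes, k=200, theta=0.5):
        self.num_classes, self.k, self.theta = num_classes, k, theta
        self.index, self.bank_labels = None, None

    def fit(self, bank_features, bank_labels):
        """Index the features gathered in the memory bank over the previous epoch."""
        feats = np.ascontiguousarray(bank_features, dtype=np.float32)
        self.index = faiss.IndexFlatL2(feats.shape[1])
        self.index.add(feats)
        self.bank_labels = np.asarray(bank_labels)

    def predict(self, features, labels):
        """Return a boolean mask marking which batch samples are considered noisy."""
        if self.index is None:                       # predictor not trained yet: keep everything
            return np.zeros(len(labels), dtype=bool)
        feats = np.ascontiguousarray(features, dtype=np.float32)
        _, nn_idx = self.index.search(feats, self.k)
        noisy = np.zeros(len(labels), dtype=bool)
        for i, (neighbors, y) in enumerate(zip(nn_idx, labels)):
            counts = np.bincount(self.bank_labels[neighbors], minlength=self.num_classes) / self.k
            scores = counts / counts.max() if counts.max() > 0 else counts   # scale to [0, 1]
            noisy[i] = scores[y] < self.theta
        return noisy
```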
3.2 Loss Functions
A key aspect in metric learning is the loss function
which guides the evolution of the embedding space
through training. The loss function encourages small distances between same-class samples and large distances between samples that correspond to different classes. As has been demonstrated, some losses are more resistant than others when they encounter label noise, while others work better on different types of noise. In this work we explore different losses in order to demonstrate the effectiveness of our approach in such different scenarios. We briefly describe the losses below.
Contrastive Loss. The baseline of metric learning
losses is the Contrastive Loss, given by Eq.1.
L_{contrastive} = [d_p − m_{pos}]_+ + [m_{neg} − d_n]_+    (1)
The loss penalizes large intra-class distances (d_p) and small inter-class distances (d_n) of the samples that appear in the training batch, given margins that regulate the amount of penalty that is applied; by default m_{pos} = 0.0 and m_{neg} = 1.0.
Cross-Batch Memory Loss. Cross-Batch Memory
Loss (XBM) (Wang et al., 2020) was proposed as an
extension to Contrastive Loss, as it aims to increase
the number of pairwise distances, by not only com-
paring samples within batch but also across batches.
This approach essentially provides more appropriate
negative examples which are crucial for learning. For
this purpose, it keeps a memory bank containing em-
beddings and corresponding labels from the previous
batches. The memory bank is implemented as a FIFO
queue, containing n most recent embeddings, where n
is a hyperparameter.
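One library that provides such a wrapper is PyTorch Metric Learning, which we also use in our experiments; a possible instantiation around a contrastive loss is sketched below, with the memory size chosen arbitrarily for illustration.

```python
from pytorch_metric_learning import losses

base_loss = losses.ContrastiveLoss(pos_margin=0.0, neg_margin=1.0)
xbm_loss = losses.CrossBatchMemory(loss=base_loss, embedding_size=512, memory_size=1024)
# per-batch usage: loss_value = xbm_loss(embeddings, labels)
```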
Multi-Similarity Loss. Multi-Similarity Loss (Wang
et al., 2019) is formulated as
L_{MS} = (1/m) Σ_{i=1}^{m} { (1/α) log[1 + Σ_{k∈P_i} e^{−α(S_{ik} − λ)}] + (1/β) log[1 + Σ_{k∈N_i} e^{β(S_{ik} − λ)}] },    (2)
where α, β, λ are hyperparameters. For each anchor-negative pair, this loss considers both the similarity of the samples in this pair (self-similarity) and a relative similarity with respect to other negative pairs. The same holds for positive pairs, too. In this way, the impact of each pair on the loss is weighted according to both its self-similarity and its relative similarity with respect to other pairs of samples.

Table 1: Statistics of the datasets.
Dataset          Training set (# classes / samples)   Testing set (# classes / samples)
MNIST / KMNIST   10 / 60,000                          10 / 10,000
Cars             98 / 8,054                           98 / 8,131
CUB              100 / 5,864                          100 / 5,924
SOP              11,318 / 59,551                      11,316 / 60,502
SoftTriple Loss. SoftTriple Loss (Qian et al., 2019),
is a proxy-based loss inspired by the Softmax loss,
which is common in classification, and the triplet loss.
It maintains multiple learnable centers for each class
(proxies) and replaces the standard sample-sample
similarity with a weighted sample-proxies similarity.
Essentially, the proxies can be thought of as learnable
representatives of each class.
L_{ST}(x_i) = −log [ exp(λ(S_{i,y_i} − δ)) / ( exp(λ(S_{i,y_i} − δ)) + Σ_{j≠y_i} exp(λ S_{i,j}) ) ],    (3)
where
S_{i,c} = Σ_k [ exp((1/γ) x_i^T w_c^k) / Σ_{k'} exp((1/γ) x_i^T w_c^{k'}) ] x_i^T w_c^k.    (4)
In the above equations, x_i is the i-th batch sample and w_c^k represents the k-th proxy of class c; γ, δ are predefined hyperparameters.
Sub-center ArcFace Loss. Sub-center ArcFace Loss (SCAF) (Deng et al., 2020) is an extension of the popular ArcFace Loss (Deng et al., 2019) and shares a similar idea with SoftTriple by utilizing multiple centers per class, hence the name “Sub-center”.
L_{SCAF} = −log [ e^{s·cos(θ_{i,y_i} + m)} / ( e^{s·cos(θ_{i,y_i} + m)} + Σ_{j=1, j≠y_i}^{N} e^{s·cos θ_{i,j}} ) ],    (5)
where
θ_{i,j} = arccos( max_k (W_{j,k}^T x_i) ),    k ∈ {1, . . . , K}.    (6)
In the above equations, K represents the number of centers per class and θ_{i,j} the angle between the feature vector of the i-th sample and the closest center of the j-th class; m and s are hyperparameters. As stated in (Deng et al., 2020), the loss can compensate for possible noise in datasets due to the multiple centers per class.
Some of the loss functions above might benefit from reasoning about which positive and negative samples should be taken into account when computing the loss of a given batch sample. This operation is commonly referred to as online pair/triplet mining. We did not adopt any mining algorithm for two reasons: first, to retain a common basis of comparison and, second, to reduce the overhead of comparing with and without an additional algorithmic component. We note that our scope is not an exhaustive evaluation of metric losses, but to demonstrate that our method benefits several recent ones.
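Assuming the PyTorch Metric Learning implementations adopted in our experiments, the remaining losses can be instantiated roughly as follows; the hyperparameter values shown are library defaults or illustrative choices, not necessarily the exact ones used in our runs.

```python
from pytorch_metric_learning import losses

embedding_size, num_classes = 512, 98          # e.g., the Cars training split

ms_loss = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)     # base plays the role of lambda
st_loss = losses.SoftTripleLoss(num_classes=num_classes,
                                embedding_size=embedding_size,
                                centers_per_class=10)
scaf_loss = losses.SubCenterArcFaceLoss(num_classes=num_classes,
                                        embedding_size=embedding_size,
                                        sub_centers=3)
# the proxies/centers of st_loss and scaf_loss are trainable and need their own optimizer entry
```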
4 EXPERIMENTS
4.1 Datasets
We evaluate our method on four datasets, MNIST
(LeCun and Cortes, 2010), Stanford Cars (Krause
et al., 2013), Caltech-UCSD Birds-200-2011 (CUB)
(Wah et al., 2011) and Stanford Online Products
(SOP) (Oh Song et al., 2016). Cars and CUB datasets
were originally created for the classification task,
meaning that training and testing set share the same
classes. We use the re-organized splits provided
by (Liu et al., 2021). For MNIST, rather than re-organizing the dataset, we used an alternative one called Kuzushiji-MNIST (KMNIST) (Clanuwat et al., 2018) to serve as the testing set. KMNIST comprises 10 characters of the Japanese cursive writing style and was proposed as a drop-in replacement for MNIST.
Tab. 1 shows the dataset statistics and Fig. 2 shows indicative images from each of them. Each dataset is evaluated for varying synthetic noise rates.
4.2 Implementation & Training Settings
We implemented our method using PyTorch Light-
ning framework (Falcon, 2019). We also adopted im-
plementations of loss functions from PyTorch Metric
Learning library (Musgrave et al., 2020b). Comple-
mentary to our algorithm, we re-implemented PRISM
(Liu et al., 2021), in order to have a common ex-
perimental framework and maintain the same settings
across multiple runs. We validated our implementation using the Cars dataset and a 50% noise rate.
As it is stated in (Musgrave et al., 2020a), the
curse of metric learning approaches is that they are
not compared under the same settings and, most im-
portantly, that performance improvements are actu-
ally due to different training settings and modeling
rather than the proposed methodologies themselves.
With this in mind, our experimentation was conducted in a way that guarantees an accurate comparison of the noise reduction methodologies, and we did not optimize or fine-tune the settings for each dataset. We also fixed the random seed for all experiments.
Figure 2: Indicative samples from the evaluated datasets. From top to bottom: MNIST & KMNIST, Cars, CUB and SOP.
For the experiments which include Cars, CUB and
SOP datasets, our training settings mirror those spec-
ified in (Liu et al., 2021). More specifically, for Cars
and CUB we utilize the BN-Inception CNN architec-
ture for learning 512-d embeddings while for SOP we
utilize ResNet-50 with an additional 128-d linear pro-
jection head. Each model is trained for 30 epochs us-
ing Adam optimizer with initial learning rate 3
5
and
weight decay 5
4
. For the losses which include train-
able parameters, namely SoftTriple and Sub-center
ArcFace, we utilized the same optimizer as the base
model however with higher learning rate as suggested
in (Qian et al., 2019). At each training step the learn-
ing rate is updated using the Cosine Annealing ap-
proach (Loshchilov and Hutter, 2016). During train-
ing, each sample is randomly resized and flipped on
the horizontal axis. The batch size is set to 64, while
each batch is formed using at least 4 samples of the
same class, such that there are enough positive exam-
ples to be used by the loss.
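For concreteness, a sketch of these training settings follows; the placeholder model, loss and labels are illustrative stand-ins, and only the optimizer, scheduler and sampler configuration mirrors the description above.

```python
import torch
from torch import nn
from pytorch_metric_learning import losses, samplers

# placeholders standing in for the actual backbone, loss and dataset labels
model = nn.Linear(512, 512)
loss_fn = losses.SoftTripleLoss(num_classes=20, embedding_size=512)   # has trainable proxies
train_labels = [c for c in range(20) for _ in range(16)]              # 20 classes, 16 samples each
total_steps = 30 * (len(train_labels) // 64)                          # 30 epochs, batch size 64

# separate parameter group with a higher learning rate for the trainable loss parameters
optimizer = torch.optim.Adam(
    [{"params": model.parameters(), "lr": 3e-5},
     {"params": loss_fn.parameters(), "lr": 1e-2}],   # illustrative higher rate for the proxies
    weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# batches of 64 containing at least 4 samples per class
sampler = samplers.MPerClassSampler(train_labels, m=4, batch_size=64,
                                    length_before_new_iter=len(train_labels))
# scheduler.step() is called after each optimizer.step() during training
```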
For the experiments on MNIST, we utilize a much simpler model comprising 2 linear layers, where each layer is followed by ReLU and Dropout operations. The embeddings are 64-d and are obtained from
the top layer. All implementation details can be found in the provided code at github.com/ggalan87/nearest-neighbor-data-denoising.
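For reference, a model of the kind used for the MNIST experiments could look like the following sketch; the hidden width and dropout probability are assumptions, since the exact values are only given in the released code.

```python
import torch
from torch import nn

class SimpleEmbedder(nn.Module):
    """Two linear layers, each followed by ReLU and Dropout, producing 64-d embeddings."""
    def __init__(self, in_dim=28 * 28, hidden_dim=256, embed_dim=64, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):
        return self.net(x)

embeddings = SimpleEmbedder()(torch.rand(8, 1, 28, 28))   # -> shape (8, 64)
```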
4.3 Evaluation Criteria
For evaluating our method, we follow the standard
protocol that is applied in the domain of metric learn-
ing. We utilize a testing set which is disjoint from the training set with respect to class labels, i.e., no class of the training set appears in the test set. More specifically, we adopted the dataset splits that were provided by (Liu et al., 2021). After training, em-
beddings of the testing set are extracted and nearest
neighbors of each sample are computed and ranked.
Given the ranked list, we compute and report Preci-
sion@1 (P@1), as the evaluation metric. Following
(Liu et al., 2021), we report the result of the best per-
forming model.
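Under this protocol, Precision@1 amounts to checking whether each test sample's nearest other test sample shares its label; a minimal sketch follows, using cosine similarity as one possible choice of metric.

```python
import faiss
import numpy as np

def precision_at_1(embeddings, labels):
    """P@1: fraction of samples whose nearest neighbor (excluding itself) has the same label."""
    feats = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(feats)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(feats.shape[1])
    index.add(feats)
    _, idx = index.search(feats, 2)            # the top-1 hit is the query itself
    labels = np.asarray(labels)
    return float((labels[idx[:, 1]] == labels).mean())
```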
4.4 Results & Discussion
Results of the experimental evaluation using various
loss functions and noise rates are demonstrated in Ta-
bles 2-3. Both our approach and PRISM, being au-
tonomous noise reduction models, directly operate on
the current batch of training samples and decide if a
sample is noisy or not based on historical observations
and statistics. In order to have an upper bound on the effect of such an approach, we also ran the same experiments using a ground-truth noise reduction model. This model is given the truly noisy samples and therefore constructs a ground-truth filter.
We note that a core hyperparameter of the PRISM algorithm is the filtering rate, which is the fraction of samples that will be considered for removal; we stress that this is a hard requirement. Since the PRISM paper does not explicitly specify the filtering rates used in its experiments, we include results from (a) runs of PRISM with the filtering rate fixed to the ground-truth noise rate of the dataset and (b) runs where the filtering rate is set to an arbitrary value, i.e., 50%.
Small and Medium Scale Datasets. Table 2 shows
the results of the experiments that were run on
MNIST, Cars and CUB datasets. In the majority
of the experiments, our method obtained better P@1
than PRISM, while in the rest of the experiments,
it operated on par. We note that our method has a
great advantage in that it does not require the knowl-
edge of the noise rate as PRISM does. More specif-
ically, in the experiments with the Cars dataset, our
method obtained the highest P@1 for all noise rates
and noise types. On MNIST and CUB our method resulted in equivalent P@1, and in each case at least one combination of our method and a loss function is the second best, very close to the best approach.
Table 2: Label noise experiments on CARS, CUB, and MNIST datasets. In the case of PRISM, dual results correspond to a
correct and a 50% noise rate (hyperparameter of method). See text for the details. The reported result is Precision@1. The
best score excluding ground-truth noise removal is annotated with bold font, while the second best with underline.
Columns, left to right: MNIST→KMNIST symmetric noise (10%, 20%, 50%); CARS symmetric noise (10%, 20%, 50%); CARS small-cluster noise (25%, 50%); CUB symmetric noise (10%, 20%, 50%); CUB small-cluster noise (25%, 50%).
loss name   10%  20%  50%   10%  20%  50%   25%  50%   10%  20%  50%   25%  50%
without noise reduction
XBM 95.41 94.63 91.8 76.95 74.36 67.23 72.17 62.8 54.25 54.34 49.61 52.45 48.75
MS 96.42 95.77 94.13 76.29 69.98 43.02 69.72 46.03 56.03 53.48 33.46 51.23 37.73
SCAF 95.98 95.42 92.98 78.2 67.07 54.71 65.36 64.28 56.11 54.19 45.93 54.51 45.8
ST 96.23 95.86 94.35 76.22 72.85 61.14 74.74 68.43 54.32 52.87 46.3 53.44 49.73
ground-truth noise reduction
XBM 94.84 94.79 94.12 77.61 77.35 75.01 76.97 70.73 54.68 54.95 53.98 54.61 55.72
MS 96.22 96.28 96.02 78.35 78.39 73.81 76.72 72.46 55.72 55.44 54.27 51.16 50.03
SCAF 96.38 96.53 95.71 81.29 80.56 76.88 77.91 74.06 56.95 57.33 54.2 53.97 55.01
ST 97.16 97.06 96.63 80.3 79.26 74.08 78.7 75.07 57.26 56.55 53.83 51.23 52.84
PRISM (50% noise rate)
XBM 92.21 92.22 92.2 71.21 72.47 73.44 70.99 65.99 53.06 53.39 51.96 52.41 50.57
MS 91.88 89.63 86.81 67.99 69.98 69.57 68.9 68.55 50.56 52.18 52.79 51.52 49.65
SCAF 91.99 91.71 94.22 73.78 75.06 53.29 72.15 70.64 54.12 53.81 53.29 53.17 49.54
ST 93.34 92.98 93.91 75.29 75.65 69.14 75.48 71.7 55.62 55.86 51 53.81 51.4
PRISM (correct noise rate)
XBM 94.44 94.16 92.2 77.57 76.58 73.44 75.19 65.99 54.71 55.64 51.96 53.87 50.57
MS 95.71 94.06 86.81 78.49 76.5 69.57 75.03 68.55 55.98 55.55 52.79 53.71 49.65
SCAF 96.14 95.9 94.22 79.84 78.86 53.29 77.64 70.64 56.95 57.14 53.29 54.71 49.54
ST 96.58 96.27 93.91 79.13 77.07 69.14 76.9 71.7 56.4 54.95 51 54.66 51.4
N2D2 (ours)
XBM 94.22 93.82 91.8 76.05 74.1 71.8 73.83 64.8 55.3 55.5 54.19 54.61 51.35
MS 94.17 93.58 92.56 74.08 74.8 70.08 72.49 66.07 53.75 54.22 52.45 53.39 49.48
SCAF 95.04 94.66 93.89 79.15 79.14 73.73 75.61 70.65 56.57 56.65 53.7 54.78 49.32
ST 96.23 96.16 95.27 80.39 79.13 72.55 78.07 72.04 56.72 56.36 52.14 54.81 50.96
Table 3: Label noise experiments on SOP dataset using
SoftTriple loss. In the case of PRISM, dual results corre-
spond to a correct and a 50% noise rate (hyperparameter
of method). See text for the details. The reported result is
Precision@1. The best score excluding ground-truth noise
removal is annotated with bold font, while the 2nd best is
underlined.
SOP
symmetric noise small cluster noise
10% 20% 50% 25% 50%
without noise reduction 68.93 67.24 63.46 70.81 70.2
ground-truth noise reduction 70.28 69.2 67.23 71.11 70.83
PRISM (50% noise rate) 69.28 69.15 65.85 71.5 71.67
PRISM (correct noise rate) 70.16 69.17 65.85 71.69 71.67
N2D2 (ours) 72.37 71.35 68.06 73.29 72.97
Large Scale Dataset. For the Stanford Online Products dataset (SOP) we chose the SoftTriple loss, which on average performed better than the other losses on the previous datasets, for both our method and PRISM. The corresponding results are shown in Table 3. This dataset differs significantly from the rest in that it has two orders of magnitude more classes, while each class comprises very few samples - 5 on average. Apart from the different model (ResNet-50 vs. BN-Inception), we kept the same hyperparameter and training settings as for the rest of the datasets. Here, our method performed better than PRISM in all cases.
Most interestingly, it also surpassed the ground-truth noise removal approach. We attribute this surprising result to the following.
Figure 3: Indicative samples from SOP dataset which show
the diversity in intra-class appearance. The left and the mid-
dle image correspond to the same class while the right im-
age corresponds to a different class.
The SOP dataset has very few samples per class, and our algorithm might remove some additional non-noisy samples (false positives) besides those that are noisy. We believe that these additional samples were less meaningful for training, in the following sense. The goal
is to minimize the intra-class distance and increase
the inter-class distance by examining the similarities
or the distances of positive and negative samples that
are contained in the batch. In the case of SOP, the average number of samples per class is very small, while the annotated images might be very diverse. For example, in the case shown in Fig. 3, even if the samples are associated with their correct labels, the loss function would guide the model to bring the first two samples close and push the third further apart in the embedding space, although from a human perspective this is counterintuitive. Removing such samples before the loss computation, in the sense of mining, helped the learning of better representations.
Reproducibility. We note that some of the presented results differ from those reported in the PRISM
paper, being either higher or lower. For example, we obtained much higher P@1 for the baseline models, i.e., those trained without noise removal. This suggests that the base models are more robust to label noise than reported. We also obtained higher P@1 using PRISM on the Cars dataset at 50% noise rates. In contrast, we observed lower performance for the CUB and SOP datasets. We attribute this to slightly different training settings that were not fully documented, and we believe that further hyperparameter tuning would reproduce their results. All methods were run using the same settings, therefore we are confident that the comparison is fair.
Datasets with Inherent Noise. In the introduction we mentioned another noise type, namely the inherent noise introduced by the automatic scraping of images from the internet. Examples of real-world noisy datasets are Cars98N (Liu et al., 2021) and Food101N (Lee et al., 2018), which are also evaluated by PRISM. In our preliminary experiments, both our method and the ground-truth noise removal performed worse than the baseline models. For both datasets, PRISM reports a performance gain of less than 1% compared to the baseline models. We believe that both methods are not yet capable of combating this kind of noise, and we plan to explore it in future work in conjunction with alternative baseline models and losses.
Precision over Training Epochs. Figure 4 reports a detailed comparison of the effect of predicting and excluding noisy samples as the model is being trained. This particular comparison is run on the Cars dataset with 50% symmetric noise and the SoftTriple loss. For each of the evaluated methods, we report P@1 on the test dataset. We observe that the impact of noise removal is crucial even in the first epochs, which are the ones that establish the good performance of the model in later epochs.
4.5 Hyperparameter k
We experimented with the hyperparameter k of the nearest neighbors algorithm. More specifically, we evaluated our method using the Cars dataset, a 50% symmetric noise rate and the SoftTriple loss function. We performed an “offline” evaluation as follows. We first ran an experiment with the above settings using the ground-truth noise removal approach and stored the features per epoch. Afterwards, we ran our methodology offline, using each epoch's features as the reference for the batches of the subsequent epoch, as would be encountered in the online experiment, e.g., the KNN is trained on the features of the 0th epoch and noise is measured on the batches of the 1st epoch.
Figure 4: P@1 over training epochs for the compared meth-
ods.
We repeated the same experiment for different parameterizations of the nearest neighbors algorithm and evaluated using the F1-score of the classification of a sample as noisy vs. clean. We chose an offline approach so that all possible options are evaluated using the exact same set of features, rather than features that are affected by model updates guided by the noise removal. Figure 5 shows the results. Very low values, e.g., k = 1-5, were evaluated only for completeness, as they were not expected to capture enough information about the neighborhood. We observe that values between 50 and 400 provide similarly high F1-scores. We chose k = 200, which resulted in the top average F1-score, for all the formal experiments and all datasets.
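The offline sweep described above can be summarized by the following sketch, which reuses the KNNNoisePredictor sketch from the methodology section and computes the F1-score with scikit-learn; the stored per-epoch features, labels and ground-truth noise masks are assumed inputs.

```python
from sklearn.metrics import f1_score

def sweep_k(epoch_features, epoch_labels, epoch_noise_masks, k_values, num_classes):
    """For each k, fit on the features of epoch e and score the samples of epoch e + 1."""
    results = {}
    for k in k_values:
        predictor = KNNNoisePredictor(num_classes=num_classes, k=k, theta=0.5)
        scores = []
        for e in range(len(epoch_features) - 1):
            predictor.fit(epoch_features[e], epoch_labels[e])
            pred_noisy = predictor.predict(epoch_features[e + 1], epoch_labels[e + 1])
            scores.append(f1_score(epoch_noise_masks[e + 1], pred_noisy))
        results[k] = sum(scores) / len(scores)   # average F1 over epoch transitions
    return results
```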
An observation that can be drawn from the results is that a relatively low k = 20 does not perform well during the initial epochs (F1-score: 76.6%), but reaches an F1-score similar to that of a much higher k = 1000 during the final epochs. An explanation is that larger values of k compensate for the fuzzy embedding space that is encountered during the first epochs, where similar samples are not yet close enough. With this in mind, one could consider the possibility of a variable k that decreases as the model stabilizes. Another approach could be to combine the scores obtained from different k values, such that the noise reduction model observes both the local and the global neighborhood.
5 CONCLUSIONS
In this work we proposed a simple yet effective ap-
proach to cope with noisy samples during training of
deep metric models. The approach is based on the inspection of the neighborhood of the samples using the KNN algorithm and the classification of samples as noisy according to the presence of neighbors that are of the same class as the examined sample.
Figure 5: Evaluation of the classification performance of
samples to noisy vs clean using variable k. The reported
metric is the F1-score.
Our approach can be easily plugged into any training pipeline as a step preceding the loss computation and possibly triplet mining, without requiring, e.g., architectural changes or complex schemes. The evaluation over a range of experimental settings demonstrated that our approach performs better than or on par with the competition, even without the hard requirement of knowing the noise rate of the dataset.
In future work we will examine more modern model architectures and other types of noise. We will also examine re-labeling approaches, so that we can take advantage of all existing samples instead of discarding them, and possibly noise elimination at the level of the dataset instead of the level of the batch.
ACKNOWLEDGEMENTS
This work was implemented under the project Craeft
which received funding from the European Union’s
Horizon Europe research and innovation program un-
der grant agreement No 101094349 and was sup-
ported by the Hellenic Foundation for Research and
Innovation (HFRI) under the “1st Call for H.F.R.I Re-
search Projects to support Faculty members and Re-
searchers and the procurement of high-cost research
equipment”, project I.C.Humans, number 91.
REFERENCES
Cai, B., Xiong, P., and Tian, S. (2023). Center con-
trastive loss for metric learning. arXiv preprint
arXiv:2308.00458.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A.,
Yamamoto, K., and Ha, D. (2018). Deep learn-
ing for classical japanese literature. arXiv preprint
arXiv:1812.01718.
Deng, J., Guo, J., Liu, T., Gong, M., and Zafeiriou, S. (2020). Sub-center arcface: Boosting face recognition by large-scale noisy web faces. In ECCV.
Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019). Ar-
cface: Additive angular margin loss for deep face
recognition. In CVPR.
Falcon, W. A. (2019). Pytorch lightning. GitHub, 3.
Feng, C., Tzimiropoulos, G., and Patras, I. (2022). Ssr:
An efficient and robust framework for learning with
unknown label noise. In BMVC.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang,
I., and Sugiyama, M. (2018). Co-teaching: Robust
training of deep neural networks with extremely noisy
labels. In NIPS.
Ibrahimi, S., Sors, A., de Rezende, R. S., and Clinchant, S.
(2022). Learning with label noise for image retrieval
by selecting interactions. In WACV.
Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
Krause, J., Stark, M., Deng, J., and Fei-Fei, L. (2013). 3d
object representations for fine-grained categorization.
In ICCVW.
LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit
database.
Lee, K.-H., He, X., Zhang, L., and Yang, L. (2018). Clean-
net: Transfer learning for scalable image classifier
training with label noise. In CVPR.
Liu, C., Yu, H., Li, B., Shen, Z., Gao, Z., Ren, P., Xie, X.,
Cui, L., and Miao, C. (2021). Noise-resistant deep
metric learning with ranking-based instance selection.
In CVPR.
Loshchilov, I. and Hutter, F. (2016). Sgdr: Stochastic
gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983.
Musgrave, K., Belongie, S. J., and Lim, S.-N. (2020a). A
metric learning reality check. In ECCV.
Musgrave, K., Belongie, S. J., and Lim, S.-N. (2020b). Py-
torch metric learning. ArXiv, abs/2008.09164.
Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S.
(2016). Deep metric learning via lifted structured fea-
ture embedding. In CVPR.
Qian, Q., Shang, L., Sun, B., Hu, J., Li, H., and Jin, R.
(2019). Softtriple loss: Deep metric learning without
triplet sampling. In ICCV.
Wah, C., Branson, S., Welinder, P., Perona, P., and Be-
longie, S. (2011). The caltech-ucsd birds-200-2011
dataset.
Wang, X., Han, X., Huang, W., Dong, D., and Scott, M. R.
(2019). Multi-similarity loss with general pair weight-
ing for deep metric learning. In CVPR.
Wang, X., Zhang, H., Huang, W., and Scott, M. R. (2020).
Cross-batch memory for embedding learning. In
CVPR.
Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and
Chang, Y. (2020). Robust early-learning: Hindering
the memorization of noisy labels. In ICLR.
Yao, Y., Sun, Z., Zhang, C., Shen, F., Wu, Q., Zhang, J.,
and Tang, Z. (2021). Jo-src: A contrastive approach
for combating noisy labels. In CVPR.
Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., and Sugiyama,
M. (2019). How does disagreement help generaliza-
tion against label corruption? In ICML.