Class-conditional Importance Weighting for Deep Learning with Noisy Labels
Bhalaji Nagarajan¹, Ricardo Marques¹, Marcos Mejia¹ and Petia Radeva¹,²
¹ Dept. de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain
² Computer Vision Center, Cerdanyola (Barcelona), Spain
Keywords:
Noisy Labeling, Loss Correction, Class-conditional Importance Weighting, Learning with Noisy Labels.
Abstract:
Large-scale, accurately labelled data are very important for training Deep Neural Networks and assuring high performance. However, it is very expensive to create a clean dataset, since this usually relies on human interaction. For this reason, the labelling process is often made cheaper with the trade-off of obtaining noisy labels. Learning with Noisy Labels is an active, and at the same time very challenging, area of research. Recent advances in self-supervised learning and robust loss functions have helped to advance noisy-label research. In this paper, we propose a loss correction method that relies on dynamic weights computed during model training. We extend the existing Contrast to Divide algorithm coupled with DivideMix using a new class-conditional weighting scheme. We validate the method using the standard noise experiments and achieve encouraging results.
1 INTRODUCTION
Deep Neural Networks (DNNs) tend to show a remarkable boost in performance when trained with large-scale labeled data in supervised settings (Krizhevsky et al., 2012). The strong and implicit assumption when training any DNN is that the dataset is clean and reliable. However, in the real world it is difficult to meet this assumption owing to the cost and time required to create such large high-quality datasets (Liao et al., 2021). The labelling cost is reduced substantially by crowd-sourcing the labelling process or by using an automated labelling system. However, this inherently leads to errors in the labels.
Recent advances in DNNs show that it is possible to create learning algorithms that tolerate less accurate training data (Sun et al., 2017; Pham et al., 2021; Ghiasi et al., 2021). However, DNNs have a tendency to overfit on the label noise (Zhang et al., 2021a). There are two common approaches to tackle the problem of overfitting on noisy labels:
Semi-Supervised Learning (SSL) and Learning with Noisy Labels (LNL) (Zheltonozhskii et al., 2021). SSL uses scarce high-quality labelled data to learn representations of a large amount of unlabelled data (Hendrycks et al., 2019). The LNL approach uses less expensive annotations, but accepts noisy labels as a trade-off (Natarajan et al., 2013). Both approaches are closely related to each other and are often used in combination to help DNNs learn from less accurate samples (Zheltonozhskii et al., 2021; Li et al., 2020; Chen et al., 2021).
LNL has already been studied both in machine and deep learning (Frénay and Verleysen, 2013; Frénay et al., 2014; Nigam et al., 2020; Cordeiro and
Carneiro, 2020). The objective of any LNL algorithm
is to find the best estimator for a dataset distribution
learnt from the original distribution with noise. It is
necessary for the DNN to learn the noise structure and
estimate the parameters accordingly. In many LNL
approaches, there is a short ‘warm-up’ phase where su-
pervised learning or self-learning is used before deal-
ing with the label noise. By using the warm up, it
is possible to model the loss into a Mixture Model
(Arazo et al., 2019). The main reasoning for using
this phase is based on the behaviour of DNNs to learn
the clean samples faster than the noisy samples (Arpit
et al., 2017).
The next phase of LNL deals with adapting to the noise of the distribution and achieving robust classi-
fiers. Several strategies have been proposed to make
the LNL network learn the dataset distribution with-
out the noise (Algan and Ulusoy, 2021). The com-
monly used Cross-Entropy Loss and Mean Absolute
Error are not robust to the underlying noise (Ma et al.,
2020) and it is important for the objective functions to
be robust to the noise of the underlying distribution.
Loss correction methods help in increasing the robust-
ness of losses by modifying the loss functions based
on the weights of the labels. In this paper, we pro-
pose a class-conditional loss correction method based
on the importance of classes. The loss is adapted dur-
ing each step of the training using weights computed
from the classifier scores. This adjustment is carried
out such that the classes that are weakly learned are emphasized more strongly during the learning process. To
validate the proposed method, we use the Contrast to
Divide framework (Zheltonozhskii et al., 2021) and
correct the loss during the training phase. Below, we
outline the main contributions of this work.
First, we propose a weighted version of the loss function for LNL. By weighting the unlabeled part of the training data, we can give more importance to the less-learnt or hard-to-learn classes.
Second, we do an extensive analysis of various
components in the loss function and study the pro-
gression of the LNL framework. Moreover, we
show an improvement with respect to the state of the art
on LNL on a public dataset.
The rest of the paper is organized as follows. In
Section 2, we briefly discuss the related work. We
present the details of the proposed technique in Sec-
tion 3. The experiments and evaluations used to validate the proposed method are explained in Section 4, followed by the conclusions in Section 5.
2 RELATED WORK
There are several works in the literature on learning
with noisy labels. In this section, we briefly review
the recent literature that is relevant to our proposed
method.
2.1 Learning with Noisy Labels
There are several classes of LNL algorithms, broadly
falling into loss modifications and noise detection
schemes. Some methods use label correction (Xiao
et al., 2015; Li et al., 2017), where the noisy labels
are corrected using inferences made by DNNs, which
are in turn trained only on clean labels, while other
methods use loss correction schemes. In this class
of algorithms, the network aims at increasing its ro-
bustness towards noise by modifying the loss function
(Han et al., 2018; Ma et al., 2020). A computationally
efficient method based on noise similarity labels was
used instead of learning from noisy class labels and
was able to reduce the noise rate (Wu et al., 2021).
In general, similarity-based approaches have been ef-
fective in many LNL algorithms where using a noise
transition matrix serves as a bridge between the clean
and noisy samples (Hsu and Kira, 2015; Hsu et al.,
2019; Wu et al., 2020).
The loss correction methods are based on modify-
ing the loss function with weights during the training
of DNNs. Common problems with the existing loss
functions are over-fitting of noise and under-learning.
Importance weighting schemes have been effective in
making the losses more robust to noise (Liu and Tao,
2015; Zhang and Sabuncu, 2018; Yu et al., 2019;
Zhang and Pfister, 2021). The Symmetric Cross En-
tropy loss was created using a Reverse Cross Entropy
term along with the Cross Entropy term to make the
loss more robust to noise and achieve better learn-
ing of the samples (Wang et al., 2019). Normalization techniques were shown to make the commonly used loss functions more robust to noise, and combining two robust loss functions into an Active Passive Loss helped the two losses to boost each other’s performance (Ma et al., 2020). Backward and forward noise transition matrices, which are based on matrix inversion and multiplication, were pre-computed and shown to increase the robustness of the loss function (Patrini et al., 2017). In the methods discussed above, the basic assumption of a relabeling approach is the availability of clean labels, which is also a limitation of these algorithms.
Another variation of LNL algorithms focuses
on new learning schemes adapted to noisy labels
(Malach and Shalev-Shwartz, 2017; Yu et al., 2019).
DivideMix (Li et al., 2020) uses a co-teaching strat-
egy to learn two networks simultaneously, so that one
network learns from the other network’s confident
samples. This algorithm uses a loss to fit a Gaus-
sian Mixture Model in order to divide the samples
into labeled and unlabeled set. A Beta-mixture model
was also used to model the losses for learning the
noise in an unsupervised manner (Arazo et al., 2019).
Selective Negative Learning and Positive Learning
were used to selectively apply positive learning on
expected-to-be-clean data, which is obtained by Neg-
ative Learning, where complementary labels were
used instead of the actual labels (Kim et al., 2019).
This approach proved to be very effective compared
to the normal positive selection of samples. Early
Learning Regularization (Liu et al., 2020) learned the
clean samples first, followed by noisy samples in later
epochs. This method was beneficial as it prevented
the network from memorizing the noisy samples.
Data augmentation is also an effective means to com-
bat the noisy label problem (Berthelot et al., 2019b;
Li et al., 2020; Berthelot et al., 2019a; Sohn et al.,
2020). AugDesc (Nishi et al., 2021) used weak aug-
mentations to learn the loss and strong augmentations
to improve the generalizations.
Most of the literature presented above uses a com-
bination of different LNL schemes to make the net-
work robust to noise. In this paper, we propose a loss
correction scheme on top of the already effective Di-
videMix and Contrast to Divide learning schemes to
enhance the learning of models.
2.2 Self-supervised and
Semi-supervised Learning
Semi-supervised learning algorithms utilize the unlabeled data by providing pseudo-labels for it and by adding constraints to the objective functions. The regularization can be consistency regularization or entropy minimization. MixMatch (Berthelot et al., 2019b) combined both regularization methods to produce labels for the unlabeled samples. ReMixMatch (Berthelot et al., 2019a)
and FixMatch (Sohn et al., 2020) were adaptations of
MixMatch, which used weakly augmented images to
produce labels and predict against the strongly aug-
mented images. It is also beneficial to remove wrong
labels that have high levels of noise. By using only a
portion of the training set which is correct, the same
performance could be achieved (Ding et al., 2018;
Kong et al., 2019).
Self-supervised learning algorithms learn representations in a task-agnostic environment so that the representations are meaningful irrespective of the labels. Contrastive loss, which clusters data points based on the (dis-)similarity of classes, has been vital in the recent success of self-supervised algorithms (Wang and Liu, 2021). By using these representations, any downstream task can be well learned by DNNs. Self-supervised algorithms have been widely used in solving noisy label problems. Since the networks are learned with-
out labels, they are able to produce features that are
robust to noise (Cheng et al., 2021). Data re-labeling
helps in increasing the effectiveness of DNNs. The
performance was boosted by using a parallel network
to learn the portion of clean labels (Mandal et al.,
2020). Supervised learning and self-supervised learn-
ing can also be used together as a co-learning scheme
as this could maximize the learning behaviour using
both the constraints (Tan et al., 2021; Huang et al.,
2021). Contrastive DivideMix (Zhang et al., 2021b)
fuses the contrastive and semi-supervised learning al-
gorithms.
DivideMix (Li et al., 2020) uses a semi-supervised
training phase. It uses the MixMatch algorithm to per-
form label co-refinement and co-guessing on labeled
and unlabeled samples. This works on per-sample
loss behaviour and has been an effective technique
to model the noise. One of the bottlenecks of this method is the warm-up phase. This was overcome by the Contrast to Divide (Zheltonozhskii et al., 2021) method. Instead of the supervised learning used in DivideMix, this algorithm uses a self-supervised learning method. In our proposed approach, we add
an importance weighting scheme that would enable
the algorithms to focus selectively on the classes.
3 IMPORTANCE WEIGHTING
In this section, we first describe the rationale behind the approach. We then provide background information, followed by the proposed weighting scheme.
3.1 Rationale
Our approach is motivated by the observation that
learning is unbalanced across classes, that is, after a
given number of epochs, the accuracy of the model
tends to vary significantly over different classes.
Our hypothesis is that, by focusing the learning ef-
fort in those classes for which the model is cur-
rently less efficient, the overall accuracy of the model
can be improved. To test this hypothesis, we pro-
pose to enhance the DivideMix algorithm (Li et al.,
2020) with a class-conditional importance weighting
scheme which assigns a larger weight to the classes
for which the model has a poorer performance.
3.2 Background
At each epoch, the DivideMix algorithm, on which
we build, separates the training set into two disjoint
sets: a set X containing potentially clean data, and a
set U containing potentially noisy data. This separa-
tion between clean and noisy data is made by fitting
a Gaussian mixture model to the softmax output of a
pretrained network (Li et al., 2020). The loss function
used for training thus combines the losses on both the
potentially clean and noisy sets, and is given by (Li
et al., 2020):
L = L_X + λ_u L_U + λ_r L_reg ,    (1)
where L_X is the cross-entropy loss over the augmented and mixed clean data X′; L_U is a mean squared error loss over the augmented and mixed noisy data U′; and finally, L_reg is a regularization term used to encourage the model to evenly distribute its predictions across all classes.
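As an aside on the co-divide step described above, a minimal sketch of how such a per-epoch split into the sets X and U could be obtained with scikit-learn is shown below. This is our own illustration rather than the reference implementation; the use of per-sample losses as the one-dimensional statistic and the 0.5 threshold on the clean probability are assumptions on our part.

import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(per_sample_stat, clean_threshold=0.5):
    """Split sample indices into a potentially clean set X and a
    potentially noisy set U by fitting a two-component Gaussian
    mixture to a per-sample statistic (e.g. the per-sample loss)."""
    stat = np.asarray(per_sample_stat, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(stat)
    # The posterior of the low-mean component is interpreted as the
    # probability that a sample is clean.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    p_clean = gmm.predict_proba(stat)[:, clean_component]
    clean_idx = np.where(p_clean >= clean_threshold)[0]   # set X
    noisy_idx = np.where(p_clean < clean_threshold)[0]    # set U
    return clean_idx, noisy_idx, p_clean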
The loss L_U of the noisy data is defined as:

L_U = (1/|U′|) Σ_{(x,p)∈U′} ||p − p_θ(x)||_2^2 ,

where |U′| is the number of noisy samples at the current epoch, p is the label assigned to each noisy sample x through co-guessing (Li et al., 2020), and p_θ(x) is the model prediction for x given the current model parameters θ. The regularization term L_reg, in its turn, is given by:
L_reg = Σ_c π_c log( π_c / ( (1/|S|) Σ_{x∈S} p_θ^c(x) ) ) ,    (2)

where S = X′ ∪ U′, and π_c = 1/C is a uniform prior distribution over the probability of each class in S. Providing a uniform prior distribution π_c = 1/C in Equation (2) causes the loss to be minimal when the model yields exactly the same number of predictions for all classes in the data set.
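For concreteness, a minimal PyTorch-style sketch of the combined loss of Equations (1) and (2) could look as follows. This is only an illustration under our own assumptions (tensor names, soft labels of shape (batch, C), and the exact reductions), not the reference implementation of (Li et al., 2020).

import torch
import torch.nn.functional as F

def dividemix_loss(logits_x, targets_x, logits_u, targets_u,
                   lambda_u, lambda_r=1.0):
    """Combined loss of Eq. (1): cross-entropy on the clean set X',
    squared L2 distance on the noisy set U', and the uniform-prior
    regularizer of Eq. (2). targets_* are soft labels of shape (B, C)."""
    num_classes = targets_x.size(1)

    # L_X: cross-entropy between soft targets and predictions on X'
    loss_x = -(targets_x * F.log_softmax(logits_x, dim=1)).sum(dim=1).mean()

    # L_U: squared L2 distance between guessed labels and predictions on U'
    probs_u = torch.softmax(logits_u, dim=1)
    loss_u = ((probs_u - targets_u) ** 2).sum(dim=1).mean()

    # L_reg (Eq. 2): pull the average prediction over S = X' u U'
    # towards the uniform prior pi_c = 1 / C
    prior = torch.full((num_classes,), 1.0 / num_classes,
                       device=logits_x.device)
    mean_pred = torch.softmax(torch.cat([logits_x, logits_u]), dim=1).mean(dim=0)
    loss_reg = (prior * torch.log(prior / mean_pred)).sum()

    # Eq. (1): total loss
    return loss_x + lambda_u * loss_u + lambda_r * loss_reg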
3.3 Class-conditional Importance
Weighted Loss
We now describe the approach taken to assign a weight for each class. Let f be a vector of C elements, C being the number of classes in the data set. Each element f_c ∈ f is given by f_c = 1 − F_1^c, where F_1^c represents the F_1 score for a particular class c at the current epoch. The vector f is then smoothed over a window of n_e epochs, yielding f′. Then, the weight vector w is computed as:

w = max(λ_w, f′) / |max(λ_w, f′)| × C ,    (3)
where λ_w is a hyperparameter which has the role of limiting how far the resulting weights can deviate from the value 1. Hence, the resulting weights w = {w_1, ..., w_C} of Equation (3) take a larger value for those classes for which the F_1 score is smaller.
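As an illustration, a small NumPy sketch of this weight computation is given below. It assumes that the smoothing is a simple moving average over the last n_e epochs and that |·| in Equation (3) denotes the sum of the (non-negative) vector entries, so that the weights sum to C and average to 1; the class and variable names are ours.

import numpy as np
from collections import deque

class ClassWeights:
    """Class-conditional importance weights from per-class F1 scores,
    following Eq. (3): poorly learned classes get weights above 1."""

    def __init__(self, num_classes, lambda_w=0.1, window=5):
        self.num_classes = num_classes
        self.lambda_w = lambda_w
        self.history = deque(maxlen=window)   # f vectors of the last n_e epochs

    def update(self, f1_per_class):
        # f_c = 1 - F1_c: large for classes the model currently learns poorly
        f = 1.0 - np.asarray(f1_per_class, dtype=np.float64)
        self.history.append(f)
        # smooth f over the last n_e epochs, yielding f'
        f_smooth = np.mean(np.stack(list(self.history)), axis=0)
        # clamp from below by lambda_w, then normalise so the weights sum to C
        clamped = np.maximum(self.lambda_w, f_smooth)
        return clamped / clamped.sum() * self.num_classes

With λ_w = 0.1 and n_e = 5 (the values used later in Section 4.1), classes that are already well learned are clamped at λ_w, which keeps the remaining weights from drifting too far from 1.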
Furthermore, to account for the importance of each class in the learning phase, we introduce the weights w (Equation (3)) in the loss function L_U of the unlabeled set (Section 3.2), yielding:
L_U = (1/|U′|) Σ_{(x,p)∈U′} w_c ||p − p_θ(x)||_2^2 ,    (4)
where w_c is the weight of class c, and c is the class of the image x. Finally, to apply the weights to the regularization loss L_reg, we simply replace the uniform prior distribution π_c = 1/C used in DivideMix (Equation (2)) by a non-uniform prior based on the weights w, such that:

π_c = w_c / C ,    (5)

where w_c is the class weight according to Equation (3), and C is the number of classes in the dataset. This way, when the model performs poorly for a given class because it does not choose that class as many times as it should, using the prior specified in Equation (5) will encourage the model to increase the number of predictions for that class. Figure 1 shows the pipeline of our proposed approach.
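Continuing the earlier sketch, the two modifications of Equations (4) and (5) amount to a per-sample class weight inside L_U and a non-uniform prior inside L_reg. The sketch below is again only illustrative: the way the class index c of each unlabeled sample is obtained (passed in here as classes_u, e.g. the argmax of its co-guessed label) and the variable names are our assumptions.

import torch
import torch.nn.functional as F

def weighted_dividemix_loss(logits_x, targets_x, logits_u, targets_u,
                            classes_u, class_weights, lambda_u,
                            lambda_r=1.0):
    """Class-conditionally weighted loss: Eq. (4) for the noisy set U'
    and the non-uniform prior pi_c = w_c / C of Eq. (5) in the
    regularizer. class_weights is the vector w of Eq. (3); classes_u
    holds the class index c assigned to each sample of U'."""
    num_classes = targets_x.size(1)
    w = class_weights.to(logits_x.device)

    # L_X is unchanged: cross-entropy with soft targets on X'
    loss_x = -(targets_x * F.log_softmax(logits_x, dim=1)).sum(dim=1).mean()

    # Eq. (4): weight the squared L2 distance of each noisy sample by w_c
    probs_u = torch.softmax(logits_u, dim=1)
    w_c = w[classes_u]                                     # shape (B_u,)
    loss_u = (w_c * ((probs_u - targets_u) ** 2).sum(dim=1)).mean()

    # Eq. (5): replace the uniform prior of Eq. (2) by pi_c = w_c / C
    prior = w / num_classes
    mean_pred = torch.softmax(torch.cat([logits_x, logits_u]), dim=1).mean(dim=0)
    loss_reg = (prior * torch.log(prior / mean_pred)).sum()

    return loss_x + lambda_u * loss_u + lambda_r * loss_reg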
4 EXPERIMENTS
We evaluate our proposed framework following the
common methodology in synthetic noise benchmarks.
We use CIFAR-10 (Krizhevsky and Hinton, 2009) to
validate the method, varying the amount of injected
noise. We measure the performance of the networks
using accuracy as an evaluation metric. We provide
the accuracy over five runs for each noise ratio, fol-
lowing the results presented in Contrast to Divide
(Zheltonozhskii et al., 2021).
4.1 Implementation Details
Similarly to (Li et al., 2020), we used a PreAct
ResNet-18 architecture (He et al., 2016) for Di-
videMix. Moreover, we coupled DivideMix with
Contrast to Divide (C2D), which has been shown to
considerably boost the original DivideMix algorithm
(Zheltonozhskii et al., 2021). As regards the injected
noise in the CIFAR-10 data set, we used two types
of label noise: symmetric and asymmetric. Given
a target noise ratio, the symmetric noise is gener-
ated by randomly substituting the original label by a
randomly selected new label chosen with a uniform
probability over the rest of the class labels. Regard-
ing the asymmetric noise, we follow (Zheltonozhskii et al., 2021), who designed the noise so as to mimic the structure of real-world label noise, substituting the labels with those of the most similar classes. In
each experiment, the networks are optimized during
360 epochs.
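As a small illustration of the symmetric case only (our own sketch, not the exact benchmark code; the function and argument names are hypothetical), labels can be corrupted as follows.

import numpy as np

def inject_symmetric_noise(labels, noise_ratio, num_classes, seed=0):
    """Replace a noise_ratio fraction of the labels by a label drawn
    uniformly from the remaining classes (symmetric label noise)."""
    rng = np.random.default_rng(seed)
    noisy_labels = np.asarray(labels).copy()
    n_noisy = int(noise_ratio * len(noisy_labels))
    flip_idx = rng.choice(len(noisy_labels), size=n_noisy, replace=False)
    for i in flip_idx:
        # choose uniformly among the other C - 1 classes
        candidates = [c for c in range(num_classes) if c != noisy_labels[i]]
        noisy_labels[i] = rng.choice(candidates)
    return noisy_labels, flip_idx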
Regarding the λ_r hyperparameter, we follow (Li et al., 2020) and set its value to 1. To compute the weights per class, we set λ_w = 0.1 in Equation (3), and use a smoothing window of 5 epochs (i.e., n_e = 5). The problem of selecting λ_u is discussed below.
Figure 1: Pipeline of our proposed approach. The symCLR component pre-trains networks A and B. Then, these undergo
the DivideMix warm-up phase where the two networks are trained on all the data set during a small number of epochs using
standard cross-entropy loss. Then, at each epoch, the co-divide is applied to divide the data set in two disjoint sets, yielding
the set of the clean and of the noisy labels (X and U, respectively). For each mini-batch, networks A and B are then trained
separately using MixMatch and our proposed weighting scheme.
Table 1: Study of the optimal λ_u value. The table shows the peak and final accuracy on CIFAR-10 for 5 ≤ λ_u ≤ 50.

Method                  λ_u =    5      10     15     20     25     30     35     40     45     50
Weighted C2D+DM (90%)  Peak    91.73  92.45  92.92  93.50  93.50  93.60  93.58  93.69  93.52  93.66
                       Final   91.45  92.35  92.73  93.33  93.48  93.53  93.47  93.28  93.42  93.57
4.1.1 Selection of λ_u
As in (Li et al., 2020) and (Zheltonozhskii et al., 2021), the performance of our proposed approach can vary significantly depending on the parameter λ_u, i.e., the hyperparameter specifying the weight of the unsupervised loss L_U in the final loss (see Equation (1)). Therefore, in Table 1, we provide a detailed analysis of the effect of this hyperparameter on the final accuracy reached by our method when considering 90% of symmetric noise. The results show that the optimal λ_u for our method with a noise ratio of 90% is 40. This value is roughly in line with the one (50) reported by (Li et al., 2020).
Following a similar approach, we collected a set of selected λ_u values for each considered noise ratio. We can observe that the optimal λ_u value found tends to increase with the noise ratio present in the dataset. This seems to indicate that, the larger the amount of noisy labels present in the used dataset, the more relevant the loss L_U of the unlabeled data becomes in the learning process. The selected λ_u values for each considered noise ratio are shown in Table 2. These values are used henceforth in all experiments for each corresponding noise value.
Table 2: Selected λ_u values for each noise ratio.

Noise ratio   20%   50%   80%   90%   40% (asym)
λ_u            0     25    30    40    0
Figure 2: Different terms of the loss function as a function of the epoch. The plot was generated for CIFAR-10 with 80% of symmetric noise and λ_u = 30.
At this stage, it is interesting to analyze the role of the λ_u hyperparameter in the total loss value of Equation (1). Figure 2 depicts the different terms of the used loss function. We can observe that the hyperparameter λ_u acts as a scaling factor which brings L_U (in red) up to a magnitude in which it can actually influence the final loss function shape (in orange the product λ_u × L_U, and in green the final loss, denoted by L_total in Figure 2). It is also apparent from the curves that the loss of the labeled data (L_X, in blue) is the one that dominates the total loss shape. Finally, the regularization loss L_reg seems to have a rather marginal role on the overall optimization process.
Table 3: Peak and final accuracy (%, mean ± std over five runs) on CIFAR-10. DivideMix and C2D+DM results are obtained from the literature.

Method                        20%          50%          80%          90%          40% (asym)
DivideMix           Peak      96.1         94.6         93.2         76.0         -
                    Final     95.7         94.4         92.9         75.4         -
C2D+DM              Peak      96.43±0.07   95.32±0.12   94.40±0.04   93.57±0.09   93.45±0.07
                    Final     96.23±0.09   95.15±0.16   94.30±0.12   93.42±0.04   90.75±0.35
Weighted C2D+DM     Peak      96.50±0.07   95.79±0.06   94.40±0.05   93.70±0.16   93.62±0.09
(ours)              Final     96.40±0.21   95.56±0.07   94.24±0.09   93.54±0.13   92.83±0.21
4.2 Results and Analysis
The results for the application of our proposed
method to the CIFAR-10 dataset are shown in Table 3,
where a comparison with the results of the original
method is provided. The results show that, when us-
ing our importance weighting scheme, the accuracy
results generally improve over that of C2D+DM, and
it never performs worse. Indeed, except for a noise
level of 80%, our method delivers consistent improve-
ments over its non-weighted counterpart. This con-
firms our hypothesis that the overall efficiency of the
algorithm can be improved by focusing the learning effort on those classes that the model has more difficulty learning. Moreover, it also validates our weighting strategy based on the F_1 score proposed in Equation (3).
A detailed illustration of the weight values throughout the learning process is provided in Figure 3. It shows that the weights for a given class remain coherent throughout the learning phase, since we are able to clearly identify each class (corresponding to a particular color) in the weights plot. They also show that the weights for each class converge to a particular value, which is determined by the F_1 score that the model is able to obtain for each particular class as the learning progresses.
4.3 Ablation Study
In this section, we study the effect of weights in the L_U and L_reg terms individually. We show the results of this ablation study in Table 4. They show that when the weights are applied to only one of the two considered terms (L_U and L_reg), the accuracy is inferior to the case in which the weights are included in both losses.
Figure 3: Illustration of the weights assigned to each class (y-axis) during 355 training epochs (x-axis; 5 warm-up epochs are followed by 355 epochs of DivideMix). The results are generated using the weights of a single network. Each color corresponds to a different class in the dataset (10 in total).

Table 4: Ablation study. The entries for C2D+DM and weighted C2D+DM are means over five runs, whereas the other two are means over two runs.

Method                          80% (λ_u = 30)   90% (λ_u = 40)
C2D+DM                Peak      94.40±0.04       93.57±0.09
                      Final     94.30±0.12       93.42±0.04
Weights in L_U only   Peak      94.23±0.12       93.68±0.09
                      Final     94.11±0.16       93.51±0.16
Weights in L_reg only Peak      94.28±0.06       93.58±0.09
                      Final     94.11±0.05       93.37±0.11
Weighted C2D+DM       Peak      94.40±0.05       93.70±0.16
                      Final     94.24±0.09       93.54±0.13

5 CONCLUSIONS
In this paper, we propose a class-conditional dynamically weighted Contrast to Divide algorithm, where
the weights emphasize the learning behaviour of individual classes. Here, we use a per-class importance weighting scheme based on the F_1 score obtained in each epoch. Our importance weighting approach proved to outperform the state of the art for the CIFAR-10 data set in all the noise rates. We studied the behavior of λ_u for different noise rates and also analysed the weights throughout the learning process. The results prove the effectiveness of the proposed scheme on an existing state-of-the-art LNL approach.
Although the algorithm has shown performance improvements, it is important to study its behaviour on more complex data sets such as CIFAR-100, Clothing1M and WebVision. In this paper, we have used the F_1 score to create the weights; however, other methods to compute the per-class weights should be studied, which could eventually improve the results presented here. This information regarding the per-class performance (i.e., the F_1 score or another metric) can also be used to improve other stages of the original DivideMix algorithm, such as, for example, the division between clean and noisy data.
ACKNOWLEDGEMENTS
This work was partially funded by TIN2018-095232-
B-C21, SGR-2017 1742, Greenhabit EIT Digital
program and CERCA Programme / Generalitat de
Catalunya. Bhalaji Nagarajan acknowledges the sup-
port of FPI Becas, MICINN, Spain. We acknowledge
the support of NVIDIA Corporation with the donation
of the Titan Xp GPUs.
REFERENCES
Algan, G. and Ulusoy, I. (2021). Image classification with
deep learning in the presence of noisy labels: A sur-
vey. Knowledge-Based Systems, 215:106771.
Arazo, E., Ortego, D., Albert, P., O’Connor, N., and
McGuinness, K. (2019). Unsupervised label noise
modeling and loss correction. In International Con-
ference on Machine Learning, pages 312–321. PMLR.
Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio,
E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville,
A., Bengio, Y., et al. (2017). A closer look at mem-
orization in deep networks. In International Confer-
ence on Machine Learning, pages 233–242. PMLR.
Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A.,
Sohn, K., Zhang, H., and Raffel, C. (2019a).
Remixmatch: Semi-supervised learning with distri-
bution alignment and augmentation anchoring. arXiv
preprint arXiv:1911.09785.
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N.,
Oliver, A., and Raffel, C. A. (2019b). Mixmatch:
A holistic approach to semi-supervised learning. Ad-
vances in Neural Information Processing Systems, 32.
Chen, Y., Shen, X., Hu, S. X., and Suykens, J. A. (2021).
Boosting co-teaching with compression regularization
for label noise. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 2688–2692.
Cheng, H., Zhu, Z., Sun, X., and Liu, Y. (2021). Demys-
tifying how self-supervised features improve training
from noisy labels. arXiv preprint arXiv:2110.09022.
Cordeiro, F. R. and Carneiro, G. (2020). A survey on
deep learning with noisy labels: How to train your
model when you cannot trust on the annotations? In
2020 33rd SIBGRAPI Conference on Graphics, Pat-
terns and Images (SIBGRAPI), pages 9–16. IEEE.
Ding, Y., Wang, L., Fan, D., and Gong, B. (2018). A semi-
supervised two-stage approach to learning from noisy
labels. In 2018 IEEE Winter Conference on Applica-
tions of Computer Vision (WACV), pages 1215–1224.
IEEE.
Frénay, B., Kabán, A., et al. (2014). A comprehensive introduction to label noise. In ESANN. Citeseer.
Frénay, B. and Verleysen, M. (2013). Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869.
Ghiasi, G., Zoph, B., Cubuk, E. D., Le, Q. V., and Lin, T.-
Y. (2021). Multi-task self-training for learning gen-
eral representations. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
8856–8865.
Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y.,
and Sugiyama, M. (2018). Masking: A new perspec-
tive of noisy supervision. Advances in Neural Infor-
mation Processing Systems, 31:5836–5846.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D.
(2019). Using self-supervised learning can improve
model robustness and uncertainty. Advances in Neural
Information Processing Systems, 32:15663–15674.
Hsu, Y.-C. and Kira, Z. (2015). Neural network-based
clustering using pairwise constraints. arXiv preprint
arXiv:1511.06321.
Hsu, Y.-C., Lv, Z., Schlosser, J., Odom, P., and Kira, Z.
(2019). Multi-class classification without multi-class
labels. arXiv preprint arXiv:1901.00544.
Huang, L., Zhang, C., and Zhang, H. (2021). Self-adaptive
training: Bridging the supervised and self-supervised
learning. arXiv preprint arXiv:2101.08732.
Kim, Y., Yim, J., Yun, J., and Kim, J. (2019). Nlnl: Neg-
ative learning for noisy labels. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 101–110.
Kong, K., Lee, J., Kwak, Y., Kang, M., Kim, S. G., and
Song, W.-J. (2019). Recycling: Semi-supervised
learning with noisy labels in deep neural networks.
IEEE Access, 7:66998–67005.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple
layers of features from tiny images. Technical report,
University of Toronto, Toronto, Ontario.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. Advances in neural information processing
systems, 25:1097–1105.
Li, J., Socher, R., and Hoi, S. C. (2020). Dividemix:
Learning with noisy labels as semi-supervised learn-
ing. arXiv preprint arXiv:2002.07394.
Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., and Li, L.-J.
(2017). Learning from noisy labels with distillation.
In Proceedings of the IEEE International Conference
on Computer Vision, pages 1910–1918.
Liao, Y.-H., Kar, A., and Fidler, S. (2021). Towards
good practices for efficiently annotating large-scale
image classification datasets. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 4350–4359.
Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-
Granda, C. (2020). Early-learning regularization pre-
vents memorization of noisy labels. Advances in Neu-
ral Information Processing Systems, 33.
Liu, T. and Tao, D. (2015). Classification with noisy labels
by importance reweighting. IEEE Transactions on
pattern analysis and machine intelligence, 38(3):447–
461.
Ma, X., Huang, H., Wang, Y., Romano, S., Erfani, S., and
Bailey, J. (2020). Normalized loss functions for deep
learning with noisy labels. In International Confer-
ence on Machine Learning, pages 6543–6553. PMLR.
Malach, E. and Shalev-Shwartz, S. (2017). Decoupling “when to update” from “how to update”. Advances in
Neural Information Processing Systems, 30:960–970.
Mandal, D., Bharadwaj, S., and Biswas, S. (2020). A novel
self-supervised re-labeling approach for training with
noisy labels. In Proceedings of the IEEE/CVF Win-
ter Conference on Applications of Computer Vision,
pages 1381–1390.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari,
A. (2013). Learning with noisy labels. Advances
in neural information processing systems, 26:1196–
1204.
Nigam, N., Dutta, T., and Gupta, H. P. (2020). Impact of
noisy labels in learning techniques: a survey. In Ad-
vances in data and information sciences, pages 403–
411. Springer.
Nishi, K., Ding, Y., Rich, A., and Hollerer, T. (2021). Aug-
mentation strategies for learning with noisy labels. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 8022–
8031.
Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and
Qu, L. (2017). Making deep neural networks robust
to label noise: A loss correction approach. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 1944–1952.
Pham, H., Dai, Z., Xie, Q., and Le, Q. V. (2021). Meta
pseudo labels. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 11557–11568.
Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N.,
Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C.
(2020). Fixmatch: Simplifying semi-supervised learn-
ing with consistency and confidence. arXiv preprint
arXiv:2001.07685.
Sun, C., Shrivastava, A., Singh, S., and Gupta, A. (2017).
Revisiting unreasonable effectiveness of data in deep
learning era. In Proceedings of the IEEE international
conference on computer vision, pages 843–852.
Tan, C., Xia, J., Wu, L., and Li, S. Z. (2021). Co-learning:
Learning from noisy labels with self-supervision. In
Proceedings of the 29th ACM International Confer-
ence on Multimedia, pages 1405–1413.
Wang, F. and Liu, H. (2021). Understanding the behaviour
of contrastive loss. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 2495–2504.
Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J.
(2019). Symmetric cross entropy for robust learning
with noisy labels. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
322–330.
Wu, S., Xia, X., Liu, T., Han, B., Gong, M., Wang, N.,
Liu, H., and Niu, G. (2020). Multi-class classifica-
tion from noisy-similarity-labeled data. arXiv preprint
arXiv:2002.06508.
Wu, S., Xia, X., Liu, T., Han, B., Gong, M., Wang, N., Liu,
H., and Niu, G. (2021). Class2simi: A noise reduc-
tion perspective on learning with noisy labels. In In-
ternational Conference on Machine Learning, pages
11285–11295. PMLR.
Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. (2015).
Learning from massive noisy labeled data for image
classification. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
2691–2699.
Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., and Sugiyama,
M. (2019). How does disagreement help generaliza-
tion against label corruption? In International Confer-
ence on Machine Learning, pages 7164–7173. PMLR.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals,
O. (2021a). Understanding deep learning (still) re-
quires rethinking generalization. Communications of
the ACM, 64(3):107–115.
Zhang, X., Liu, Z., Xiao, K., Shen, T., Huang, J., Yang, W.,
Samaras, D., and Han, X. (2021b). Codim: Learn-
ing with noisy labels via contrastive semi-supervised
learning. arXiv preprint arXiv:2111.11652.
Zhang, Z. and Pfister, T. (2021). Learning fast sample re-
weighting without reward data. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 725–734.
Zhang, Z. and Sabuncu, M. R. (2018). Generalized cross
entropy loss for training deep neural networks with
noisy labels. In 32nd Conference on Neural Informa-
tion Processing Systems (NeurIPS).
Zheltonozhskii, E., Baskin, C., Mendelson, A., Bronstein,
A. M., and Litany, O. (2021). Contrast to divide: Self-
supervised pre-training for learning with noisy labels.
arXiv preprint arXiv:2103.13646.