Deep Active Learning with Noisy Oracle in Object Detection
Marius Schubert^{1,a}, Tobias Riedlinger^{2,b}, Karsten Kahl^{1,c} and Matthias Rottmann^{1,d}
1 School of Mathematics and Natural Sciences, IZMD, University of Wuppertal, Germany
2 Institute of Mathematics, Technical University Berlin, Germany
a https://orcid.org/0000-0002-9410-8949, b https://orcid.org/0000-0002-1953-8607,
c https://orcid.org/0000-0002-3510-3320, d https://orcid.org/0000-0003-3840-0184
* Equal contribution.
Keywords:
Active Learning, Label Noise, Robustness, Label Error Detection, Object Detection.
Abstract:
Obtaining annotations for complex computer vision tasks such as object detection is an expensive and time-intensive endeavor involving numerous human workers or expert opinions. Reducing the amount of annotations required while maintaining algorithm performance is, therefore, desirable for machine learning practitioners and has been successfully achieved by active learning. However, it is not merely the amount of annotations which influences model performance but also the annotation quality. In practice, oracles that are queried for new annotations frequently produce significant amounts of noise. Therefore, cleansing procedures are oftentimes necessary to review and correct given labels. This process is subject to the same budget as the initial annotation itself since it requires human workers or even domain experts. Here, we propose a composite active learning framework including a label review module for deep object detection. We show that utilizing part of the annotation budget to partially correct the noisy annotations in the active dataset leads to early improvements in model performance, especially when coupled with uncertainty-based query strategies. The precision of the label error proposals significantly influences the measured effect of the label review. In our experiments, we achieve improvements of up to 4.5 mAP points by incorporating label reviews at equal annotation budget.
1 INTRODUCTION
In the previous decade, deep learning has revolu-
tionized computer vision models across many differ-
ent tasks like supervised object detection (Ren et al.,
2015; Redmon and Farhadi, 2018; Carion et al.,
2020). Object detection has various potential real-
world applications, many of which have not yet been
developed practically, in the sense that public datasets are oftentimes not available. When such a new field is to be developed, there are many practical challenges during dataset curation and creation. Oftentimes, data can be recorded with, e.g., cameras in large amounts at acceptable cost, while acquiring corresponding la-
bels might be comparatively costly and might require
expert knowledge. Active learning aims at maintain-
ing model performance while reducing the amount of
training data by leveraging data informativeness for
the label acquisition. The model is utilized in turn to
find the data, in our case from a large pool of unlabeled data, for which new labels would improve the model performance most efficiently (Settles, 2009; Brust et al., 2018; Riedlinger et al., 2022). The goal is to request as few annotations with human labor as possible while still receiving a well-performing model that makes accurate predictions.

Figure 1: Our active learning cycle consists of training on labeled data L, querying and labeling informative data points Q out of a pool of unlabeled data U by an oracle, and a review of the acquired data R = L ∪ Q.

When developing and simulating active learning models in a lab-
oratory setup, one typically assumes an oracle that
provides correct labels for queried data points (Brust
et al., 2018; Riedlinger et al., 2022). However, in
practice, such an oracle does not exist and any per-
son that labels data produces errors with some fre-
quency (Yan et al., 2014). Especially in complex
domains such as medical applications where expert
opinions are required for the annotation process, there
exists variability between different oracles (Schilling
et al., 2022). Some authors have considered active
learning with noisy oracles in image classification be-
fore (Younesian et al., 2020; Younesian et al., 2021;
Yan et al., 2016; Gupta et al., 2019). In the present
work, we consider active learning with a noisy ora-
cle, to the best of our knowledge for the first time,
in object detection. More precisely, we utilize re-
cent findings on label errors in state-of-the-art ob-
ject detection benchmarks (Schubert et al., 2023) to
simulate two types of predominantly occurring la-
bel errors in object detection oracles. On the one
hand we treat missed bounding box labels which do
not appear in the ground truth at all. On the other
hand, we consider bounding box labels with incor-
rect class assignment which are likely to induce un-
desired training feedback. We do so for the EMNIST-
Det dataset (Riedlinger et al., 2022) which is an ex-
tension of EMNIST (Cohen et al., 2017) to the ob-
ject detection and instance segmentation setting. We
complement this with the BDD100k dataset (Yu et al.,
2020) which has mostly clean bounding box annota-
tions of variable size. Both datasets have high-quality bounding box labels such that we can simulate label errors without significant interference from naturally occurring label noise. We introduce independent and iden-
tically distributed errors into the labels which have
been queried during the data-acquisition process. We
simulate a label reviewer as a second human in the
loop who has access to a label error detection mod-
ule (Schubert et al., 2023) which is integrated into the
active learning cycle, see fig. 1. We compare differ-
ent sources for label error proposals which are to be
considered after data acquisition. The efficiency of
the proposal method controls the frequency of jus-
tified review cases, i.e., the efficiency of the budget
utilization for label reviewing. The review oracle is
assumed to contain smaller amounts of noise since la-
bels do not have to be generated from scratch. Instead,
only individual proposals have to be reviewed.
In our experiments, we observe that a label er-
ror detection method applied to active learning with a
noisy oracle clearly outperforms active learning with
random label review and active learning without any
label review. We compare different query strategies
with and without review in terms of performance as a
function of annotation budget (split into labeling and
reviewing cost). Improvement of performance is ob-
served consistently for random queries as well as for
an uncertainty query based on the entropy of the ob-
ject detector’s softmax output. Furthermore, our find-
ings are consistent over two datasets, i.e., an artificial
one and a real world one, as well as across two dif-
ferent object detectors. We find that the success of
our method can likely be attributed to a strong perfor-
mance of the label error detection method.
Our contribution can be summarized as follows:
• We contribute the first method that performs partially automated label review and active label selection for object detection.
• We provide an environment for performing rapid prototyping of methods for active learning with noisy oracles.
• Our method outperforms manual and review-free active learning for different queries, datasets and underlying object detectors.
Our method can be used with humans in the loop for
labeling and label review to maximize model perfor-
mance at minimal annotation budget, thus aiding data
acquisition pipelines with partial automation.
2 RELATED WORK
Our contribution is located at the intersection of two
disciplines which both aim at reducing the tiresome
workload of repetitive image annotation by human
workers. Active learning aims at reducing the over-
all amount of annotations given while the goal of la-
bel reviewing is to control or improve the quality of
present annotations.
Label Error Detection in Object Detection. Previ-
ous work on the detection of label errors for object detectors tends to make use of a model which was trained on given, potentially error-prone data. (Hu et al.,
2022) compare a softmax probability-based measure
per prediction with the given annotations to obtain
proposals for label errors. (Schubert et al., 2023) used
an instance-wise loss computation to identify differ-
ent types of label errors. (Koksal et al., 2020), in con-
trast, use a template matching scheme to find annota-
tion errors in frame sequences for UAV detection.
Active Learning in Object Detection. Training in
the context of more refined computer vision tasks
such as object detection requires significant compu-
tational resources and the learning task itself comes
with an elevated degree of complexity. All the more important, then, is an efficient handling of expensive data annotations, which can be approached by active learning. The following accounts for the pi-
oneering successes that were accomplished in fully
supervised active learning for deep object detection.
(Brust et al., 2018) used margin sampling by ag-
gregating probability margin scores over predicted
bounding boxes in different ways where a class-
weighting scheme addresses class imbalances. (Roy
et al., 2018) similarly utilize softmax entropy and
committee-based scoring making use of the differ-
ent detection scales of object detection architectures.
(Schmidt et al., 2020) compare different deep en-
semble consensus-based selection strategies leverag-
ing model uncertainty. (Choi et al., 2021) utilize
Monte-Carlo dropout in conjunction with a Gaussian
mixture model to estimate epistemic and aleatoric un-
certainty, respectively. The utilized scoring function
for querying takes both kinds of uncertainty into ac-
count. (Haussmann et al., 2020) compare methodically diverse scoring functions, namely objectness entropy, mutual information, estimated learning gradient and confidence, coupled with different diversity selection mechanisms.
Active Learning with Noisy Oracle in Classifica-
tion. The intersection between active learning and
training under label noise has been addressed in the
context of classification tasks before. While (Kim,
2022) used an active query mechanism for cleaning
up labels, the proposed training algorithm itself is not
active. (Younesian et al., 2020) consider noisy binary
and categorical oracles by assigning different label
costs to both in an online, stream-based active learn-
ing setting. (Yan et al., 2016) treat the query complex-
ity of noisy oracles with a reject option in a theoretical
manner. (Gupta et al., 2019) consider batch-based ac-
tive learning with noisy oracles under the introduction
of the QActor framework by (Younesian et al., 2021)
which has a label cleaning module in its active learn-
ing cycle is most related to our approach. One of the
proposed quality models chooses examples to clean
via the cross-entropy loss which are then re-labeled
by the oracle.
3 ACTIVE LEARNING WITH
NOISY ORACLE
In this section, we describe the task of active learning
in object detection as well as the addition of a review
module to the generic active learning cycle. While
in the active learning setting, new labels are queried
on the basis of an informativeness measure, the pres-
ence of erroneous or misleading oracle responses can
counteract the benefit of the informed data selection.
In order to account for new data containing incorrect
labels, we introduce a review module that generates
proposals for label errors to review and to potentially
correct.
3.1 Active Learning with Review in
Object Detection
Most of the commonly used datasets in object detec-
tion, e.g., MS-COCO (Lin et al., 2014) and Pascal
VOC (Everingham et al., 2010), are also the most
commonly used datasets in active learning and con-
tain label errors (Schubert et al., 2023; Rottmann and
Reese, 2023). This means that active learning meth-
ods developed on these datasets are also evaluated
based on noisy labels. To consider label errors dur-
ing active learning experiments, we introduce a re-
view module. Active learning can be viewed as an
alternating process of training a model and labeling
data based on informativeness according to the model.
Starting with an initially labeled set of images L, once
a model is trained based on L, the test performance is
measured. Object detectors are usually evaluated in
terms of mean average precision (mAP, see (Evering-
ham et al., 2010)). New images selected to be labeled
(Q) are queried from a pool U of previously unlabeled images. After obtaining labels for the queried images Q by an oracle, we introduce a review step where an oracle reviews the combined set R = L ∪ Q.
The model is then trained on the reviewed data and the
cycle is repeated T times. The active learning cycle is
visualized in fig. 1 where acquisition and review of
data are two independent modules.
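To make the composition of the two modules concrete, the following minimal sketch outlines one possible realization of this cycle. The callables train, evaluate, query, annotate and review are placeholders for the respective components and do not represent the exact implementation used in our experiments.

```python
from typing import Any, Callable, Set

def active_learning_with_review(
    labeled: Set[Any],        # L: currently labeled images
    unlabeled: Set[Any],      # U: pool of unlabeled images
    train: Callable,          # trains an object detector on the labeled set
    evaluate: Callable,       # measures test mAP of a trained detector
    query: Callable,          # selects images from U according to a query strategy
    annotate: Callable,       # noisy oracle: returns (possibly incorrect) labels
    review: Callable,         # reviews label error proposals on the active set
    cycles: int,              # T: number of active learning steps
    budget: float,            # C: annotation budget per cycle
    lam: float,               # lambda: fraction of the budget spent on reviewing
):
    for _ in range(cycles):
        model = train(labeled)
        evaluate(model)
        budget_query = (1.0 - lam) * budget      # C_Q for new annotations
        budget_review = lam * budget             # C_R for label review
        queried = query(model, unlabeled, budget_query)
        labeled |= annotate(queried)             # Q joins the active set R = L u Q
        unlabeled -= queried
        review(model, labeled, budget_review)    # correct a portion of the label errors
    return train(labeled)
```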
Queries. Active learning research usually revolves
around the development of model architectures, loss
functions or selection strategies used in the query
step. Different query approaches are then com-
pared for different annotation budgets in terms of the
achieved test performance which is often measured in
terms of mAP in object detection. In the following, we
investigate two different query strategies: random se-
lection and selection of samples with large entropy of
the softmax output of the object detector. For the for-
mer, images are randomly chosen from U. For the lat-
ter, images are selected based on the predictive clas-
sification uncertainty according to the current model
in the following sense. On a given image x, a neural
network predicts a fixed number $N_0$ of bounding box predictions

$\hat{b}^i = \left(\hat{x}^i, \hat{y}^i, \hat{w}^i, \hat{h}^i, \hat{s}^i, \hat{p}^i_1, \ldots, \hat{p}^i_C\right), \qquad (1)$

where $i = 1, \ldots, N_0$. Here, $\hat{x}^i, \hat{y}^i, \hat{w}^i, \hat{h}^i$ represent the localization, $\hat{s}^i$ the objectness score and $\hat{p}^i_1, \ldots, \hat{p}^i_C$ the class probabilities for all classes $\{1, \ldots, C\}$. Only the set of boxes that remain after score-thresholding (with threshold $s_\varepsilon$) and non-maximum suppression (NMS) are used to compute prediction-wise entropies. The entropy of prediction $\hat{b}^i$ is given by $H(\hat{b}^i) = -\sum_{c=1}^{C} \hat{p}^i_c \log(\hat{p}^i_c)$. Moreover, we incorporate a class-
weighting (Brust et al., 2018) for computing instance-
wise uncertainty scores. Finally, the instance-wise en-
tropies of a single image are summed up to obtain an
image-wise query score and the images are sorted in
descending order by this query score. Note that both queries are independent of labels and, therefore, of any label errors: random selection is independent of predictions and labels, and the image-wise query score for the entropy method is determined based on the predictions only. Since the query algorithms operate on unlabeled data, oracle noise does not directly impact the selection of images; however, it does influence model training.
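As an illustration, the image-wise entropy query score can be computed as in the following sketch. It assumes that the class probabilities of the score-thresholded, post-NMS boxes are already available as arrays and omits the class-weighting for brevity; it is not the exact implementation used in our experiments.

```python
import numpy as np

def entropy_query_scores(class_probs_per_image):
    """Image-wise query score: sum of prediction-wise softmax entropies.

    class_probs_per_image: list over images; each entry is an array of shape
    (N_i, C) with the class probabilities of the N_i boxes remaining after
    score thresholding and NMS.  The class weighting of (Brust et al., 2018)
    is omitted for brevity.
    """
    scores = []
    for probs in class_probs_per_image:
        probs = np.clip(probs, 1e-12, 1.0)                # avoid log(0)
        entropies = -(probs * np.log(probs)).sum(axis=1)  # H per prediction
        scores.append(entropies.sum())                    # image-wise score
    return np.asarray(scores)

# Images are then sorted in descending order of their score and the top
# candidates are sent to the oracle, e.g.:
# ranking = np.argsort(-entropy_query_scores(per_image_probs))
```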
Review Module. In order to account for noisy ora-
cles, we allow for incorrect annotations given in re-
sponse to an active learning query. To counter-act
the negative influence of noisy annotations, we intro-
duce a review module into the active learning cycle in
which proposals for label reviews are given and part
of the annotation budget can be used to clean up a
portion of annotation errors. In the following, we in-
troduce the detection of two different kinds of label
errors: missing labels (misses) and labels with incor-
rect class assignments (flips).
For one active learning cycle, we allow for the consumption of a fixed annotation budget C. This budget C is split up into a fraction C_Q = (1 − λ)C used for querying new annotations and C_R = λC used for reviewing data.
After the query, Q is automatically labeled and, together with L, forms the set of active images for the next cycle. Before the next training cycle starts, we regard R = L ∪ Q as the set of annotations which are potentially reviewed. Inspired by (Schubert et al., 2023), we introduce a post-processing label error detection method where the detection of misses and flips are two independent tasks. When both types of label errors are simultaneously present, we use a parameter α ∈ [0,1] to distribute the review budget C_R between reviewing misses (αC_R) and flips ((1 − α)C_R).
In the following, we introduce two different review
functions: a random review and a review (Schubert
et al., 2023) based on the computed loss function.
An illustration of our label error detection method can be found in fig. 2. We consider the set of all predictions on images from R and the corresponding noisy labels Ỹ. To detect misses, we select those predictions that are identified as false positive predictions according to the noisy ground truth Ỹ. To get an order for the review, we sort the false positives in descending order by the objectness score ŝ for the highest loss based review and in random order for the random review. Large values of ŝ on false positives (where the objectness is supposed to be small) amount to a large objectness loss.
For the flips, every label from Ỹ is assigned to the most overlapping prediction if the IoU of the two boxes is greater than or equal to IoU_ε. Then, the cross-entropy loss of the possibly incorrect label and the predicted probability distribution is used as a review score for every given label. In case there is no sufficiently overlapping prediction, the respective label is not considered for review. Note that for a given label class $\tilde{c}$, the cross-entropy loss of the assigned prediction $\hat{b}$ is $\mathrm{CE}(\hat{b} \,|\, \tilde{c}) = -\sum_{c=1}^{C} \delta_{c\tilde{c}} \log(\hat{p}_c) = -\log(\hat{p}_{\tilde{c}})$, where $\delta_{ij}$ is the Kronecker symbol, i.e., $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. That is, the label whose assigned prediction has the lowest corresponding class probability $\hat{p}_{\tilde{c}}$ generates the highest loss. In case of the random review method, we randomly select assigned labels for review by uniformly sampling over all labels.
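To make the two review rankings concrete, the sketch below scores miss proposals by the objectness score of false-positive predictions and flip proposals by the cross-entropy between a matched prediction and its possibly noisy label class. The matching against Ỹ (IoU ≥ IoU_ε) is assumed to have been done beforehand, and the sketch is an illustration rather than our exact implementation.

```python
import numpy as np

def rank_miss_proposals(fp_objectness):
    """Order review proposals for missing labels.

    fp_objectness: objectness scores of predictions that are false positives
    w.r.t. the noisy ground truth.  A high objectness score on a false
    positive corresponds to a high objectness loss, so it is reviewed first.
    """
    return np.argsort(-np.asarray(fp_objectness))          # descending in s_hat

def rank_flip_proposals(matched_class_probs, noisy_classes):
    """Order review proposals for class flips.

    matched_class_probs: (M, C) class probabilities of predictions matched to
    a label with IoU >= IoU_eps; noisy_classes: (M,) given (possibly flipped)
    label classes.  The review score is -log(p_hat of the label class), so the
    lowest matched class probability is reviewed first.
    """
    p_label = matched_class_probs[np.arange(len(noisy_classes)), noisy_classes]
    cross_entropy = -np.log(np.clip(p_label, 1e-12, 1.0))
    return np.argsort(-cross_entropy)                      # descending in CE

# The review budget C_R is then spent on the top alpha * C_R miss proposals
# and the top (1 - alpha) * C_R flip proposals of the two rankings.
```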
4 NUMERICAL EXPERIMENTS
In this section, we describe our active learning setup,
label noise generation and the automated data review
as well as involved hyperparameters. Afterwards, we
show results for both query functions with and with-
out review for two different datasets and object detec-
tors in terms of mAP. We also measure the perfor-
mance of the review proposal mechanism in terms of
precision.
4.1 Experimental Setup
For our active learning setup, automated labeling and
reviewing is desirable. Therefore, we simulate la-
bel errors for all training images of the underlying
datasets. We do not include evaluations of the active
learning experiments on the widely used MS-COCO
or Pascal VOC datasets for the following reasons. For
an automated review procedure, the frequently oc-
curring label errors in both datasets (Schubert et al.,
2023) would lead to strongly biased results. Evalu-
ations on either dataset would require manual anno-
tation review after each active learning cycle for sev-
eral repetitions of the same experiment. This manual
review after each cycle is necessary in practice, how-
ever, out of scope for an experimental evaluation of
the proposed method.
Datasets and Models. We make use of the EMNIST-
Det dataset (Riedlinger et al., 2022) with 20,000 training images and 2,000 test images as well as BDD100k (Yu et al., 2020) (short: BDD), where we filter the training and validation split such that we only use daytime images with clear weather conditions, resulting in 12,454 training images. Furthermore, the validation set is split into equally sized test and validation sets, each consisting of 882 images.

Figure 2: Schematic illustration of the label error detection mechanism using highest loss for missed labels (top, blue) and label flips (bottom, green). The detection of misses considers false positive predictions of the object detector while the detection of flips considers ground truth boxes, each of them matching at least one of the predictions in localization.
Since EMNIST-Det is a simpler task to learn com-
pared to BDD, we apply a RetinaNet (Lin et al., 2017)
and a Faster R-CNN (Ren et al., 2015) with a ResNet-
18 (He et al., 2016) backbone for EMNIST-Det, as
well as a Faster R-CNN with a ResNet-101 backbone
for BDD. Note, that this setup was introduced and
used in related work (Riedlinger et al., 2022; Schu-
bert et al., 2023).
Based on clean training data, the test performance
for EMNIST-Det in terms of mAP is 91.2% for Faster
R-CNN and 90.9% for RetinaNet. For Faster R-
CNN, the test performance decreases to 90.2% with
simulated misses in the training data and to 89.2%
with simulated flips. Simulating misses and flips si-
multaneously yields a test performance of 89.4% for
Faster R-CNN and 89.3% for RetinaNet. For BDD
and Faster R-CNN, a test performance of 50.0% is
obtained for unmodified training data and 48.9% for
training data including misses and flips.
Simulation of Label Errors. For the simulation of
misses and flips, we follow (Schubert et al., 2023). An
illustration of the label error injection scheme can be
found in fig. 3. Any dataset is equipped with a set of G labels, i.e.,

$Y = \{\, b_i = (x_i, y_i, w_i, h_i, c_i) : i = 1, \ldots, G \,\}. \qquad (2)$

Let I = {1, ..., G} be the set of indices of all boxes $b_i \in Y$, i = 1, ..., G. Furthermore, we choose a parameter γ_l ∈ [0,1] representing the relative frequency of label errors during data acquisition.

Figure 3: Schematic illustration of noise injection into a clean dataset. Misses are generated by randomly discarding γ_l/2 · G of the present annotations. Afterwards, γ_l/2 · G of the remaining annotations receive random class flips to one of the C − 1 incorrect classes. This results in the noisy ground truth Ỹ used in our experiments.
For generating label misses, we randomly choose a subset $I_m \subset I$ of size $\frac{\gamma_l}{2} \cdot G$, representing missed ground truth boxes which are discarded from Y. The number of remaining annotations that receive a class flip is again $\frac{\gamma_l}{2} \cdot G$, where the class flip is determined uniformly over the C − 1 incorrect class assignments. The potentially flipped class for each label $b_i$ is denoted by $\tilde{c}_i$. Finally, we denote the training set including label errors by $\tilde{Y} = \{(x_i, y_i, w_i, h_i, \tilde{c}_i) : i \in I \setminus I_m\}$.
Note, that a single label is perturbed by only one type
of label error at most. In addition, label errors are not
simulated on the test dataset to ensure an unbiased
evaluation of test performance.
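As a concrete illustration of this injection scheme, the following sketch perturbs a list of clean labels: it discards γ_l/2 · G boxes as misses and flips the class of another γ_l/2 · G boxes uniformly to one of the C − 1 incorrect classes. This is a simplified stand-in for the procedure of (Schubert et al., 2023), not the exact code used in our experiments.

```python
import random

def inject_label_noise(labels, num_classes, gamma_l, seed=0):
    """Return the noisy label set Y_tilde.

    labels: list of (x, y, w, h, c) tuples with integer class c.
    gamma_l/2 * G labels are discarded (misses); another gamma_l/2 * G of the
    remaining labels receive a uniform flip to one of the C - 1 wrong classes.
    Each label is perturbed by at most one error type.
    """
    rng = random.Random(seed)
    G = len(labels)
    k = int(gamma_l / 2 * G)
    indices = list(range(G))
    rng.shuffle(indices)
    missed = set(indices[:k])            # I_m: indices of discarded boxes
    flipped = set(indices[k:2 * k])      # indices receiving a class flip

    noisy = []
    for i, (x, y, w, h, c) in enumerate(labels):
        if i in missed:
            continue                     # simulate a missing bounding box label
        if i in flipped:
            c = rng.choice([cls for cls in range(num_classes) if cls != c])
        noisy.append((x, y, w, h, c))
    return noisy
```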
Automated Review of Label Errors. Since the oracle is noisy with error frequency γ_r, the review is also error-prone, i.e., misses are detected with probability 1 − γ_r and still overlooked with probability γ_r. The flips, whether the label error proposal was a false alarm or not, are corrected with probability 1 − γ_r and randomly misclassified with probability γ_r.

Table 1: Overview of important training, review and active learning hyperparameters for all datasets and networks. Columns are grouped as Active Learning (|U_init|, C, s_ε, T), Review (γ_l, γ_r, IoU_ε, α) and Training (batch size, training iters).
- Faster R-CNN, EMNIST-Det: 150, 200, 0.7, 20; 0.2, 0.05, 0.3, 0.5; 4, 25,000
- RetinaNet, EMNIST-Det: 150, 200, 0.5, 20; 0.2, 0.05, 0.3, 0.5; 4, 38,000
- Faster R-CNN, BDD100k: 625, 10,000, 0.7, 7; 0.2, 0.05, 0.3, 0.5; 4, 170,000

Figure 4: EMNIST-Det ablation study of the ratio between labeled and reviewed bounding boxes for Faster R-CNN where both label error types are present. We compare the random query without review with the random query and highest loss review (RHL) with the chosen ratio λ in brackets.
Implementation Details. We implemented our active
learning methods in the open source MMDetection
toolbox (Chen et al., 2019). For the label error simula-
tion, we choose the relative frequency of label errors γ_l = 0.2, the relative frequency of label errors during review γ_r = 0.05, the value for score-thresholding s_ε = 0.7 for Faster R-CNN and s_ε = 0.5 for RetinaNet as well as the IoU value IoU_ε = 0.3 that assigns predictions to labels. We choose γ_r < γ_l, assuming that the oracle is more engaged in viewing and evaluating single boxes during the review compared to labeling from scratch, where all boxes have to be located and classified on a new image. As hyperparameters for the active learning cycle, the initially labeled set consists of 150 randomly picked images for EMNIST-Det and of
625 randomly picked images for BDD. The budget for
a single active learning step C is 200 for EMNIST-
Det and C = 10,000 for BDD. Labeling a single box
has cost 1, as does reviewing a label error proposal,
whether miss or flip and also whether a label error
was identified or not. If misses and flips are simul-
taneously present in the experiment, we set the ratio
between reviewing misses and flips α = 0.5. Finally,
the number of active learning steps for EMNIST-Det
is T = 20 and for BDD T = 7. For an overview of
training, review and active learning hyperparameters,
see table 1.
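For concreteness, a single EMNIST-Det cycle with C = 200, λ = 0.2 (the value selected in the ablation of section 4.2) and α = 0.5 splits the budget as in the following worked example.

```python
# Budget accounting for one EMNIST-Det cycle (C = 200, lambda = 0.2, alpha = 0.5).
C, lam, alpha = 200, 0.2, 0.5
C_Q = (1 - lam) * C                # 160 newly labeled bounding boxes
C_R = lam * C                      # 40 review actions
miss_reviews = alpha * C_R         # 20 proposals for missing labels
flip_reviews = (1 - alpha) * C_R   # 20 proposals for class flips
```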
4.2 Results
In the following, we show active learning results for
EMNIST-Det and BDD. To this end, we compare six different methods: the random and entropy queries, each without review, with random review, and with review by highest loss. Furthermore, we present per-
formance results for both review methods in terms of
precision over the whole active learning course.
Ablation for the Ratio of Queried and Reviewed
Bounding Boxes. For the methods with review, the fraction λ, which balances the amount of new data queries against the amount of bounding box reviews, plays a significant role. Therefore, we repeat the same experiment for
EMNIST-Det and Faster R-CNN with different val-
ues for λ, see fig. 4. These results are based on train-
ing data with simulated flips and misses. The gray
and yellow lines indicate the 100% and 90% refer-
ence performance mark of the model trained with the
entire (noisy) dataset. The random query with high-
est loss review (RHL) is visually almost identical for λ = 0.1, λ = 0.16 and λ = 0.2. All three of these methods outperform the random query without review. The random query without review performs similarly to the random query with highest loss review with λ = 0.3 and outperforms the random query with λ = 0.4, i.e., λ ≈ 0.3 is the break-even point at which it is no longer worthwhile to review more bounding boxes instead of labeling new ones. We hypothesize that this tipping point is strongly dependent on our chosen setup with a relative frequency of label errors of γ_l = 0.2. Since the red curve seems to be most favorable, we set the fraction between queries and reviews to λ = 0.2 in all of the following experiments.
Active Learning with Different Label Error Types.
We first investigate active learning curves for
EMNIST-Det and Faster R-CNN. We consider active
learning curves where (a) we simulated only misses,
(b) only flips, and (c) both label error types occur
equally often in the training dataset, each with noise
rate γ
l
= 0.1 (recall fig. 3). We compare both query
strategies, random and entropy, without review, with
highest loss (HL) review and with random (R) review,
Figure 5: EMNIST-Det active learning curves, where only flips are present (a, left), only misses (b, center), as well as where flips and misses are simultaneously present (c, right).

Figure 6: Review quality results for random highest loss (top row) and entropy highest loss (bottom row) for Faster R-CNN + EMNIST-Det (a, d), RetinaNet + EMNIST-Det (b, e) and Faster R-CNN + BDD (c, f). Misses and flips are simultaneously present in all experiments.

Figure 7: Comparison of EMNIST-Det active learning curves for Faster R-CNN (a, left) and RetinaNet (b, right) where both label error types are present.
respectively. The obtained active learning curves are
averaged over four random initializations and evalu-
ated in terms of the total annotation budget consumed.
Note, that for the active learning methods without re-
view the total amount of annotation budget is equal to
the amount of (possibly incorrect) labeled bounding
boxes. For those methods incorporating a review step
the total amount of annotation budget represents the
sum of labeled and reviewed bounding boxes. Fig-
ure 5 shows active learning curves in terms of test
performance with point-wise standard deviations. For
all three active learning curves, we observe that the
entropy method outperforms random at every point.
Furthermore, the queries without review perform better than the respective queries with random review. Entropy HL and random HL, i.e., both methods with highest loss review, clearly outperform the strategies without review and with random review. We conclude
that the success of reviewing queries strongly depends
on the performance of the review methods and that
random review is too expensive in terms of annota-
tion budget. From this we conclude that it is more
worthwhile to acquire new (noisy) labels than to ran-
domly review the active labels, at least for the given
amount of noise we studied.
All in all, the distance between the active learning
curves of the six methods is significantly larger for
label flips as compared to misses. Moreover, the max-
imum performance with simulated label errors is also
inferior for the flips compared to the misses. We hy-
pothesize that the reason for this is the sub-sampling
from the negatively associated anchors (Ren et al.,
2015) during training. This mechanism leads to only
partial learning from the misses, whereas an incorrect
foreground class induced from flips has a negative im-
pact on every gradient step.
The significant difference of the active learning
curves of the respective queries with random review
and the highest loss review can be attributed to the
high precision of the highest loss review. The ran-
dom review has an expected precision of γ_l. Figure 6
shows the precision for the highest loss review applied
after random query in (a) and after entropy query in
(d) across the span of all active learning cycles. In
both plots, (a) and (d), flips and misses are simulta-
neously present, i.e., both plots correspond to the re-
spective method from fig. 5 (c). The blue lines vi-
sualize the precision for the review identifying a flip
and the orange lines analogously for the misses. Here,
the precision for detecting flips is always above 50%
and tends to improve as the active learning exper-
iment progresses, whereas the precision for the de-
tection of misses is even consistently above 90%. In
general, flips are more difficult to detect compared to
Table 2: Mean average precision values in % with standard deviations in brackets for 2000 and 4000 queried and reviewed annotations for Faster R-CNN (top) and RetinaNet (bottom) for EMNIST-Det. Note that in every experiment both label error types are present; the upper half represents fig. 7 (a) and the bottom half fig. 7 (b). Columns: mAP@2000, mAP@4000.
- Faster R-CNN, Random: 68.97 (±1.09), 77.41 (±0.28)
- Faster R-CNN, Random HL: 72.13 (±0.64), 80.92 (±0.23)
- Faster R-CNN, Entropy HL: 75.13 (±0.79), 82.87 (±0.29)
- RetinaNet, Random: 68.38 (±0.52), 78.04 (±0.51)
- RetinaNet, Random HL: 72.94 (±1.06), 80.40 (±0.34)
- RetinaNet, Entropy HL: 74.39 (±0.28), 81.86 (±0.27)
misses due to the different construction of the detec-
tion methods of either label error type, recall fig. 2.
Comparing Faster R-CNN with RetinaNet on
EMNIST-Det. For the following results, we com-
pare only the random query without review with the
random and entropy query, both with highest loss re-
view. Figure 7 shows active learning curves for these
methods for Faster R-CNN in (a) and for RetinaNet
in (b). Note, that in both cases both label error types
are present. Moreover, (a) is a trimmed version of
fig. 5 (c) to make it more convenient to compare the
results from both detectors visually. We observe that
the curves for Faster R-CNN and RetinaNet look very
similar over the entire active learning course. All
curves start at just below 40% mAP and the respec-
tive methods end at similar test performances. The
ranking of the methods is always the same: entropy
HL outperforms random HL and random without re-
view. Also, random HL outperforms random without
review. For RetinaNet, random HL seems to be closer
to entropy HL as compared to Faster R-CNN.
These observations are also supported by table 2,
wherein we state the mAP values with standard deviations in parentheses. We compare performance for a total consumed annotation budget of 2000 and 4000 from the active learning curves shown in fig. 7. In particular, for entropy with highest loss review, the mAP@2000 for Faster R-CNN is 0.74 percent points (pp) higher and the mAP@4000 even 1.01 pp higher. For Faster R-CNN, the difference between entropy HL and random HL is 3 pp for mAP@2000 and 1.95 pp for mAP@4000. For RetinaNet, the difference is only 1.45 pp for mAP@2000 and 1.46 pp for mAP@4000.
Comparing the quality of the highest loss review,
the results for RetinaNet are highly correlated to the
results of Faster R-CNN, see fig. 6. For RetinaNet,
the precision for the highest loss review in combina-
tion with random query is visualized in (b) and with
the entropy query in (e). Here, the precision for de-
tecting the misses is at or above 90%. The precision
for the detection of flips is always greater than or equal to
60% and from active learning cycle 7 onward even
Figure 8: BDD active learning curves for Faster R-CNN
where both label error types are present.
always above 80%. We observe that the precision
of the highest loss review increases as the experiments progress, i.e., object detectors trained with
more data generate better label error proposals. We
hypothesize that with more data, overfitting can be
more effectively prevented and that the object detec-
tors will generalize better, thus label errors in the ac-
tive labels will not be as significant when sufficient
data is available.
Results for BDD with Faster R-CNN. Figure 8
shows active learning curves for the random query
without review, as well as for the random and en-
tropy query both with highest loss review. Compa-
rable to the results for EMNIST-Det, the ranking of
the methods is identical over the entire active learning
course. The random query without review is inferior
to both queries with review and entropy HL is supe-
rior to random HL. Note, that the distance between the
two queries with review is marginal. In contrast, there
is a significant difference between either one and the
random query without review.
The review quality of the highest loss review for
the random query is shown in fig. 6 (c) and for the
entropy query in (f). Again, the misses are detected at
all times with a precision of nearly 100%. Starting at
just under 50%, the precision for identifying flips in-
creases steadily over the active learning course ending
at close to 80%. We conclude that involving a label
review in the active learning cycle is also highly ben-
eficial in the more complex BDD real world dataset.
Analogous to the results for EMNIST-Det, the highest
loss review becomes more precise as the experiments
progress and the number of active labels increases.
5 CONCLUSION
In this work, we considered label errors in active
learning cycles for object detection for the first time,
where we assumed a noisy oracle during the annota-
tion process. We realized this assumption by simu-
lating two types of label errors for the training data
of datasets which are reasonably free of intrinsic la-
bel errors. These types of label errors are missing
bounding box labels as well as bounding box labels
with an incorrect class assignment. We introduce a review module into the active learning cycle that takes as input the currently labeled images and the corresponding predictions of the most recently trained object detector. Furthermore, we detect both types of label errors by a random review method and a method based on the highest loss computed from the model's predictions and the corresponding noisy labels. We ob-
serve that the incorporation of random review leads to
an even worse test performance compared to the cor-
responding query without review. Nevertheless, we
show that the combination of query strategies, like
random selection or instance-wise entropy, with an
accurate review yields a significant performance in-
crease. For both query strategies, the improvement
obtained by including the highest loss review per-
sists during the whole active learning course for dif-
ferent dataset-network-combinations. We make our
code for reproducing results and further development
publicly available at https://github.com/mschubert95/active_learning_with_label_errors.
ACKNOWLEDGEMENTS
We gratefully acknowledge financial support by the
German Federal Ministry of Education and Re-
search in the scope of project “UnrEAL”, grant no.
01IS22069. The authors gratefully acknowledge the
Gauss Centre for Supercomputing e.V. for funding
this project by providing computing time through the
John von Neumann Institute for Computing on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre.
REFERENCES
Brust, C.-A., Käding, C., and Denzler, J. (2018). Active
learning for deep object detection. arXiv preprint
arXiv:1809.09875.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X.,
Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng,
D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu,
R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy,
C. C., and Lin, D. (2019). MMDetection: Open mm-
lab detection toolbox and benchmark. arXiv preprint
arXiv:1906.07155.
Choi, J., Elezi, I., Lee, H.-J., Farabet, C., and Alvarez,
J. M. (2021). Active Learning for Deep Object De-
tection via Probabilistic Modeling. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 10244–10253, Montreal, QC, Canada.
IEEE.
Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A.
(2017). Emnist: Extending mnist to handwritten let-
ters. In 2017 international joint conference on neural
networks (IJCNN), pages 2921–2926. IEEE.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The pascal visual object
classes (voc) challenge. International journal of com-
puter vision, 88:303–338.
Gupta, G., Sahu, A. K., and Lin, W.-Y. (2019). Noisy Batch
Active Learning with Deterministic Annealing. arXiv
preprint arXiv:1909.12473.
Haussmann, E., Fenzi, M., Chitta, K., Ivanecky, J., Xu, H.,
Roy, D., Mittel, A., Koumchatzky, N., Farabet, C., and
Alvarez, J. M. (2020). Scalable Active Learning for
Object Detection. In 2020 IEEE Intelligent Vehicles
Symposium (IV), pages 1430–1435.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hu, Z., Gao, K., Zhang, X., Wang, J., Wang, H., and Han,
J. (2022). Probability differential-based class label
noise purification for object detection in aerial images.
IEEE Geoscience and Remote Sensing Letters, 19:1–
5.
Kim, K. I. (2022). Active Label Correction Using Robust
Parameter Update and Entropy Propagation. In Eu-
ropean Conference on Computer Vision, pages 1–16.
Springer.
Koksal, A., Ince, K. G., and Alatan, A. (2020). Ef-
fect of annotation errors on drone detection with
YOLOv3. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 1030–1031.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. In
Proceedings of the IEEE international conference on
computer vision, pages 2980–2988.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Com-
puter Vision–ECCV 2014: 13th European Confer-
ence, Zurich, Switzerland, September 6-12, 2014, Pro-
ceedings, Part V 13, pages 740–755. Springer.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Riedlinger, T., Schubert, M., Kahl, K., Gottschalk, H., and
Rottmann, M. (2022). Towards Rapid Prototyping
and Comparability in Active Learning for Deep Ob-
ject Detection. arXiv preprint arXiv:2212.10836.
Rottmann, M. and Reese, M. (2023). Automated detection
of label errors in semantic segmentation datasets via
deep learning and uncertainty quantification. In Pro-
ceedings of the IEEE/CVF Winter Conference on Ap-
plications of Computer Vision, pages 3214–3223.
Roy, S., Unmesh, A., and Namboodiri, V. P. (2018). Deep
active learning for object detection. In BMVC, volume
362, page 91.
Schilling, M. P., Scherr, T., Munke, F. R., Neumann, O.,
Schutera, M., Mikut, R., and Reischl, M. (2022). Au-
tomated Annotator Variability Inspection for Biomed-
ical Image Segmentation. IEEE access, 10:2753.
Schmidt, S., Rao, Q., Tatsch, J., and Knoll, A. (2020).
Advanced active learning strategies for object detec-
tion. In 2020 IEEE Intelligent Vehicles Symposium
(IV), pages 871–876. IEEE.
Schubert, M., Riedlinger, T., Kahl, K., Kröll, D., Schoenen, S., Šegvić, S., and Rottmann, M. (2023). Identifying
label errors in object detection datasets by loss inspec-
tion. arXiv preprint arXiv:2303.06999.
Settles, B. (2009). Active learning literature survey.
Yan, S., Chaudhuri, K., and Javidi, T. (2016). Active learn-
ing from imperfect labelers. Advances in Neural In-
formation Processing Systems, 29.
Yan, Y., Rosales, R., Fung, G., Subramanian, R., and Dy, J.
(2014). Learning from multiple annotators with vary-
ing expertise. Machine Learning, 95(3):291–327.
Younesian, T., Epema, D., and Chen, L. Y. (2020). Active
learning for noisy data streams using weak and strong
labelers. arXiv preprint arXiv:2010.14149.
Younesian, T., Zhao, Z., Ghiassi, A., Birke, R., and Chen,
L. Y. (2021). Qactor: Active learning on noisy la-
bels. In Asian Conference on Machine Learning,
pages 548–563. PMLR.
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F.,
Madhavan, V., and Darrell, T. (2020). Bdd100k: A
diverse driving dataset for heterogeneous multitask
learning. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages
2636–2645.