Web based Object Annotation Tool using a Triplet-ReID Sorting Approach

Afonso Costa¹,², André L. Ferreira¹,² (https://orcid.org/0000-0002-7028-426X) and João M. Fernandes¹ (https://orcid.org/0000-0003-1174-1966)

¹University of Minho, Braga, Portugal
²Bosch Car Multimedia Portugal S.A., Braga, Portugal
Keywords: Labelling Tool, Triplet Loss, Object Detection, Computer Vision, Deep Learning.
Abstract: The robustness of object detection methods has received increasing attention, which leads to a desire for more control over the training and testing phases. In practice, labelling the unique objects present in a dataset can help. However, manually labelling datasets of considerable size can be impractical. This paper describes an approach to improve the labelling information of a dataset by supporting an object re-identification task. The primary objective is to find repeated objects in the dataset. The proposed solution relies on a web-based application that allows the user to choose which of the similar objects returned by the Triplet-ReID method are in fact the same as the query object. The effectiveness of the method was tested on a dataset with considerable object variability. Experimental results show a viable sorting performance associated with a considerable speed improvement when compared to a traditional labelling approach. In fact, a dataset with 55 unique objects in a total of 1098 images would take 18 hours with a traditional tool and 12 hours with the proposed one. Moreover, given the generic architecture of the developed framework, it can certainly be applied to a wide range of use cases.
1 INTRODUCTION
Object detection methods have seen substantial improvement over the last few years. In fact, traditional object detection frameworks, which were based on handcrafted features and shallow architectures, evolved into methods capable of learning high-level feature representations through deeper architectures. Alongside the models, the existing object detection challenges have become more demanding and complex. This change is naturally portrayed in the several competition datasets, such as PASCAL VOC (Everingham et al., 2010) (Everingham et al., 2014), ImageNet (Deng et al., 2010), MS COCO (Lin et al., 2014), and Open Images (Krasin et al., 2017). In detail, the more recent ones tend to have more images, more objects per image, and a higher degree of variability in terms of object position, size and placement (Liu et al., 2020). As a result, the newer evaluation process, introduced with MS COCO, includes metrics that account for object localization and give a more detailed overview of the model's
performance across different object sizes. As might be expected, all these changes reward models which perform best in all the different scenarios and not just in a particular well-defined use case. Thus, greater importance is being given to the robustness of the proposed models, hence highlighting those that can perform effectively not only in controlled environments, but also in more difficult and diverse real-world situations.
In order to achieve that goal, it is important to have control over the training and testing phases, especially when it comes to the objects used. In fact, such control enables a more accurate understanding of the performance during the development of an object detection method. In practice, labelling the unique objects in a dataset can help in such circumstances. However, when dealing with datasets of considerable size, the labelling task can be impractical to perform manually.
The primary objective is to label the objects of a dataset while keeping track of their uniqueness, thus enabling finer control over the objects in the train and test sets, for instance, by increasing the confidence that the train and test sets are disjoint or by excluding objects in certain configurations.
The existing labelling tools are not suited for the use case in question since they all demand that the user traverse every single unlabelled object each time the labelling occurs for a specific object, a clearly unoptimised workflow that can even make the process unfeasible.
The present work has the goal of developing a new tool that improves this workflow by minimising the time needed to label the objects of a given dataset. Therefore, this paper's contribution consists of a web-based tool which uses the Triplet-ReID deep learning model (Hermans et al., 2017) to sort the unlabelled objects in relation to a query object so that similar ones appear first, thus reducing the need to iterate over the entire dataset every time.
The applicability of the tool was tested on a real dataset with considerable object variability, both in terms of lighting and perspective. Additionally, a time consumption comparison between the optimised and manual approaches was conducted in order to assess the viability of the solution.
The remainder of this paper is organised as follows. The next section gives background and discusses related work. Section 3 presents the architecture, modules and functionality of the developed labelling tool. Section 4 discusses the evaluation process and results. Finally, conclusions are presented in Section 5.
2 BACKGROUND AND RELATED WORK
2.1 Existing Labelling Tools
There is a wide variety of existing labelling tools, for instance, LabelMe (Russell et al., 2008), the Image Labeler App, RectLabel (R. Kawamura, 2020), and LabelImg (Tzutalin). They all mainly focus on labelling (e.g. segmentation or bounding box annotations) one image at a time. Since the goal is to label every object image with a given ID, analysing them repeatedly one by one leads to a cumbersome workflow. Additionally, since the images do not appear in any useful order, the user needs to iterate over every single image when labelling each of the existing unique objects.
2.2 Object Re-identification
The task of object re-identification (re-ID) can be described as seeking the occurrences of a query object in a set of candidate objects (gallery). The research in this field has not been very active. In contrast, the person re-identification subtask has attracted increasing attention in recent years, mainly due to its practical applications in automated video surveillance (Mathur et al., 2020). Naturally, this research area can be seen as a particular use case of the object re-identification task as a whole, thus enabling several of its solutions to also be used in the latter task.
The person re-ID task matches people across a monitoring system of multiple non-overlapping cameras. In practice, given a query image, the goal is to find in the gallery the images in which the same person appears.
Before 2014, the existing approaches commonly used hand-designed features to represent and match the identity of different objects. The features can be mainly separated into colour-based and texture-based features, ranging from HSV colour histograms (Li and Wang, 2013) (Farenzena et al., 2010), LAB colour histograms (Zhao et al., 2013), SIFT (Zhao et al., 2013), LBP histograms (Li and Wang, 2013) and HOG features (Li and Wang, 2013) to Gabor features (Li and Wang, 2013). After the feature extraction process, most models use a simple standard distance measure (e.g. $\ell_1$-norm, $\ell_2$-norm, Bhattacharyya distance) to evaluate the matching performance between the query and gallery images.
As might be expected, in realistic conditions a person's appearance usually undergoes dramatic visual changes regarding view angle, resolution, lighting, background clutter and occlusion. Consequently, it becomes challenging to achieve high performance using hand-crafted feature-based techniques. Additionally, depending on the image conditions, there is usually a subset of features that better represents the identity of the given person. However, the distance measures mentioned above give the same weight to every feature in every use case, thus potentially harming the overall method performance. In 2011, papers such as Dikmen et al. (Dikmen et al.) and Zheng et al. (Zheng et al., 2011) addressed the aforementioned problems by viewing person re-identification as a distance learning problem. As a result, the goal shifts from extracting robust appearance features in order to better represent the identity of an individual to learning the optimal distance that maximises the matching accuracy regardless of the choice of features. The end goal is to maximise the probability of true matching pairs having a smaller distance than two images of different individuals. Nowadays, this is still an active approach (Hermans et al., 2017) (Wojke and Bewley, 2018) (Yang et al., 2019).
From 2014 onwards, the person re-identification field became increasingly popular and has seen great improvement, mostly due to the deep learning breakthrough. In practice, the methods went from hand-designed features to automatic feature extraction, which opened the field to new possibilities. Yi et al. (Yi et al., 2014) was one of the first papers to apply a deep learning model to the person re-identification problem. The paper proposed a deep metric learning method using a Siamese convolutional neural network, which allowed the transition from the previously mentioned hand-crafted features and discrete workflow to jointly learning features and metrics in a unified framework.
Currently, most person re-identification models tend to use very specific features, such as human body part segmentation (Quan et al., 2019) and spatial-temporal information (Wang et al., 2018), to improve the models' performance. However, these modules make the techniques unsuitable for the object re-identification task at hand, since there is no temporal information and the objects do not always have a human body shape.
3 TOOL
Figure 1 describes the architecture of the labelling tool. In detail, it encompasses four main components: the objects dataset, the Triplet-ReID module, the Flask web framework, and the React JavaScript library.
Figure 1: Tool architecture.
Firstly, the objects dataset component represents the directory in which one can find the images as well as the metadata associated with each object. The tool allows one image to have multiple objects. In this case, the metadata must specify each object's bounding box, thus enabling their extraction for future processing, for instance by the Triplet-ReID module. All the information needed to represent the dataset can be defined in a configuration file, which decouples the dataset from the actual tool implementation, hence increasing the applicability of the tool.
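As a rough illustration, such a configuration could be a small JSON file loaded by the backend; the field names below are hypothetical, since the paper does not specify the configuration format:

    import json

    # Hypothetical dataset configuration; the actual keys used by the tool
    # are not documented in the paper.
    config = {
        "images_dir": "data/images",               # directory with the raw images
        "metadata_file": "data/annotations.json",  # per-object metadata (IDs, labels)
        "bbox_format": "xywh",                     # bounding boxes as (x, y, width, height)
    }

    with open("dataset_config.json", "w") as f:
        json.dump(config, f, indent=2)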
Secondly, there is the backend, which includes the Triplet-ReID Python module and Flask. The former is a Triplet-ReID person re-identification model that is imported by Flask and used to sort the results of a given object query. The latter is the bridge between all the tool components. Through a REST API, it receives the HTTP requests sent by the frontend component and updates its internal state accordingly. In detail, it reads and updates the dataset as the labelling process is fulfilled, while also managing the state of the Triplet-ReID embedding space throughout its lifespan (initialisation, population, update, and querying).
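A minimal Flask sketch of this component follows; the route names, payloads and in-memory state are hypothetical, since the paper does not document the actual API:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In-memory stand-ins for the dataset state and the similarity rankings;
    # in the real tool these are backed by the dataset directory and the
    # embedding space managed by the Triplet-ReID module.
    labels = {}    # object_id -> assigned label
    rankings = {}  # object_id -> unlabelled object ids, most similar first

    @app.route("/query/<int:object_id>", methods=["GET"])
    def query(object_id):
        # Return the unlabelled objects sorted by similarity to the query object.
        return jsonify(rankings.get(object_id, []))

    @app.route("/label", methods=["POST"])
    def label():
        # Persist the user's selection for the current query object.
        payload = request.get_json()
        for oid in payload["object_ids"]:
            labels[oid] = payload["label"]
        return jsonify({"status": "ok"})

    if __name__ == "__main__":
        app.run(port=5000)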
Finally, the frontend uses the React JS library in conjunction with the React Bootstrap framework. The latter allows for a fast development process since most of the components already exist. Moreover, by using pre-existing components and a minimalist interface, the user experience becomes familiar and more fluid.
3.1 Triplet-ReID Module
The Triplet-ReID module is used as a sorting algorithm, hence presenting the most similar unlabelled objects first.
The module mainly uses a ResNet-50 architecture with the last layer replaced by two fully connected layers that generate an output of 128 units, which is the final embedding dimension. The training batch encompasses PK images: K randomly sampled instances of each of P randomly sampled objects. In all experiments, both K and P are set to their default values, 4 and 32 respectively. Afterwards, the batch images are used to generate the training triplets, each a threefold of the current object instance being processed (the anchor), the hardest positive (the most dissimilar instance of the same object) and the hardest negative (the most similar instance of a different object) within the PK subset. The triplets are then used to compute the Batch Hard loss. As portrayed in Figure 2, the module uses the Batch Hard loss to project the objects into an embedding space where, eventually, all instances of the same object are closer to each other (lower Euclidean distance) than any instance of a different object (higher Euclidean distance).
Figure 2: Triplet loss learning workflow. Source: (Schroff
et al., 2015).
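To make the loss concrete, a compact PyTorch sketch is shown below; this is an independent rendition of the Batch Hard formulation of Hermans et al. (2017), not the tool's actual implementation, and the margin value is only illustrative (the original work also proposes a soft-margin variant):

    import torch

    def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
        # Pairwise Euclidean distances between all PK embeddings in the batch.
        dists = torch.cdist(embeddings, embeddings, p=2)
        same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-object mask (PK x PK)

        # Hardest positive: the most distant instance of the same object.
        pos = dists.clone()
        pos[~same] = 0.0
        hardest_pos = pos.max(dim=1).values

        # Hardest negative: the closest instance of a different object.
        neg = dists.clone()
        neg[same] = float("inf")
        hardest_neg = neg.min(dim=1).values

        # Hinge on the margin between hardest positive and hardest negative.
        return torch.relu(hardest_pos - hardest_neg + margin).mean()

    # Toy usage: a batch of PK = 8 embeddings for P = 2 objects with K = 4 instances.
    emb = torch.randn(8, 128)
    ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
    print(batch_hard_triplet_loss(emb, ids))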
In practice, in order to sort the unlabelled objects, one first needs to project them into the embedding space. Later, an object can be used as a query and the module returns the unlabelled objects sorted by similarity.
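In code, this querying step amounts to a nearest-neighbour sort in the embedding space; a minimal NumPy sketch, assuming the embeddings have already been computed:

    import numpy as np

    def sort_by_similarity(query_emb, gallery_embs):
        # Euclidean distance from the query embedding to every unlabelled
        # object embedding, then indices sorted most similar first.
        d = np.linalg.norm(gallery_embs - query_emb, axis=1)
        return np.argsort(d)

    # Toy usage with 128-dimensional embeddings.
    gallery = np.random.rand(1000, 128)
    order = sort_by_similarity(gallery[0], gallery)
    print(order[:10])  # the ten most similar objects, shown first in the tool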
3.2 Functionality / Workflow
The labelling workflow starts with the random selection of an unlabelled object, in other words, the query object. The random approach allows the user to quickly iterate through the dataset and label an object of their choice. After selecting the object, the remainder of the dataset is sorted with the Triplet-ReID module, hence based on the similarity to the query object. Figure 3 shows the default view when engaging in the labelling process. At the top of the page the user can visualise the query object, alongside some metadata, so that they have a reference for the object they are trying to identify. Below it, there are two buttons responsible for navigating backward ('Prev page') and forward ('Next page') in the results list. A third button opens a popup dialog with a list of the objects selected so far, where one can remove an object from the selected list just by clicking on it and finish the labelling process by clicking 'Save Changes'. Finally, back on the default viewing page, one can find the list of proposed unlabelled objects. By default, the list presents ten objects at once. The task is to select the images where the query object appears. During the selection process, the selected objects are replaced by the next object on the list without forcing the user to move to the next page, hence reducing the friction in the process since new objects automatically appear on the screen.
Figure 3: Layout of the page where the labelling process
occurs.
Depending on the object dataset in question, it can be hard to visually understand an object depending on its environment (e.g. low light and occlusion). To this end, there is an option to further inspect an object image at any given point. By using the control key and left mouse click combination, the user has access to the full image where the object is placed. In detail, it is possible to zoom in on any specific point of the image. Moreover, it is possible to re-query with a specific object by using the alt key and left mouse click combination. In some specific cases this feature might be useful since re-querying can re-rank the unlabelled objects in a more useful order.
In the end, once all matching objects have been selected, the labelling process can be completed with the aforementioned 'Save Changes' button, which redirects the user to the random object selection page, restarting the process all over again.
4 EXPERIMENTAL RESULTS
Table 1 summarises the three datasets, namely A, B and C, used in the experiments with the Triplet-ReID model. All objects were cropped out using their bounding box annotations and none was discarded.
Table 1: Metadata describing the datasets used for training and testing the Triplet-ReID model.

Dataset   Total objs (unique IDs)   Training objs (unique IDs)   Testing objs (unique IDs)
A         2451 (96)                 1963 (96)                    488 (93)
B         2451 (96)                 2231 (83)                    220 (13)
C         3684 (145)                2586 (90)                    1098 (55)
Firstly, the A and B datasets were generated from the same collection of objects. The splitting process used to obtain the train and test sets is the main difference between the two. The former was accomplished by iterating through every object ID and taking 90% of the object instances for training and the remaining 10% for testing. As might be expected, despite having different instances of the same object in the training and testing sets, the same object appears in both, which hinders a proper analysis of the generalisation and robustness of the model. Consequently, in order to evaluate the aforementioned impact, the B dataset uses a default splitting technique, splitting the dataset so that the sets of object IDs in the train and test sets are disjoint. The C dataset uses the B dataset splitting technique with 70% of the objects going to training and 30% to testing. Furthermore, the C dataset is the union of all the objects present in the B dataset and some novel ones.
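A minimal sketch of the ID-disjoint split used for B and C, assuming the objects are grouped by their unique ID (the exact splitting code is not given in the paper):

    import random

    def disjoint_id_split(objects_by_id, train_frac=0.7, seed=0):
        # Split so that the sets of object IDs in train and test are disjoint,
        # as done for datasets B and C (dataset A instead splits the instances
        # of every ID 90/10, so the same IDs appear in both sets).
        ids = sorted(objects_by_id)
        random.Random(seed).shuffle(ids)
        cut = int(len(ids) * train_frac)
        train = [o for i in ids[:cut] for o in objects_by_id[i]]
        test = [o for i in ids[cut:] for o in objects_by_id[i]]
        return train, test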
In order to get an accurate understanding of the tool's performance, the object instances present in the datasets have high variability, which ultimately increases the difficulty of the sorting task. Figure 4 shows that this variability occurs in terms of size, perspective, light and colour. Furthermore, there are some cases where the object instances are blurry due to low image quality and some are extremely occluded by random components.
Figure 4: Example of variability within the same object.
Overall, the datasets include objects with high variability obtained in a real-world use case, which allows the performance of the labelling tool to be evaluated in a realistic environment.
4.1 Evaluation Metrics
The evaluation encompasses two different perspectives of the tool: the labelling time and the sorting quality of the unlabelled objects.
On the one hand, the labelling time is evaluated by comparing the same tool with and without the objects sorted by the Triplet-ReID model, hence giving a sense of the impact that the sorting can have on the labelling process. In detail, the random approach can be formulated with the following equation:
$t_r = \sum_{i=0}^{N} (T - M\,i)\,\alpha \qquad (1)$
where, for each of the N unique objects, the user labels the existing M repetitions by spending α seconds analysing each remaining image in order to decide whether it is the query object or not. This process is looped until the user labels all T images in the dataset. It should be highlighted that the equation assumes that all unique objects have the same number of repetitions.
The sorted approach can be approximated by the
following equation:
$t_s = N\,(M + E + C)\,\alpha \qquad (2)$
where, for each of the N unique objects, the user labels the existing M repetitions by spending α seconds analysing each remaining image in order to decide whether it is the query object or not. However, as the model is not capable of placing all matching objects in the first M positions of the results, the value E is the overhead needed to account for this factor. For instance, if there are 5 matching objects and the overhead is 10, then the user can find the 5 matching results within the first 15 positions. Additionally, the constant C represents an interval in which no matching object appears; normally, this value is equivalent to two pages of objects (20 objects) with no match. Afterwards, the formulation assumes that the user saves the selected objects and proceeds to the next query.
On the other hand, there is the sorting performance over the unlabelled objects. The Triplet-ReID method proposes a Cumulative Matching Characteristics (CMC) metric (Bolle et al., 2005), which is very popular for person re-identification methods. However, the aforementioned metric does not portray the sorting performance well, since the goal is not to evaluate how soon a correct object appears in the ranking but how many are placed in the early positions of the ranking.
In order to solve the previous issue, the CMC metric was replaced to encompass the new perspective in which the model is being used. To this end, three new metrics were introduced: mean average precision (mAP), Rank-K accuracy and average distance (AD).
Firstly, the area under the Precision-Recall curve is used to compute the average precision (AP) of each query object. Afterwards, the mean of all AP values produces the mAP metric (Zheng et al., 2015). In practice, a lower mAP value means that the user needs to cycle through more objects than needed to label all instances of an object. In contrast, a perfect mAP would mean that the user only needs to iterate over the exact number of correct objects to label them all.
Secondly, the Rank-K accuracy is the result of averaging, over all N queries, the number of correct objects (c) found within the first K positions of the sorted objects, divided by K. Therefore, the Rank-K accuracy metric can be summarised as:
$\text{Rank-}K\ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{K} \qquad (3)$

where $c_i$ is the number of correct objects within the first K positions for query $i$.
For instance, if one is trying to label an object with 100 unlabelled instances, a lower Rank-100 accuracy means that a low number of those instances are within the first 100 positions. In contrast, a higher Rank-100 accuracy portrays a sorting order where a higher number of correct objects are within the first 100 positions.
Lastly, there is the AD, which directly portrays the average distance between the positions of the correct objects after sorting. For instance, an AD of 2.5 for a given query object means that the unlabelled instances of that object have an average distance of 2.5 objects between them. In other words, a lower AD results in correct objects being closer to one another, and a higher AD the opposite. Ideally, one wants an AD below two pages of objects so that correct objects are continuously seen throughout the labelling, which results in a more fluid experience.
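A small sketch of the last two metrics over a single ranked result list, under one plausible reading of the definitions above (a binary vector marks which positions hold correct objects; the exact implementation is not given in the paper):

    import numpy as np

    def rank_k_accuracy(rankings, K):
        # Eq. 3: average, over all queries, of the number of correct objects
        # within the first K sorted positions, divided by K.
        return np.mean([np.sum(r[:K]) / K for r in rankings])

    def average_distance(ranking):
        # AD: mean gap, in objects, between consecutive correct positions
        # of one sorted result list.
        pos = np.flatnonzero(ranking)
        return np.mean(np.diff(pos)) if len(pos) > 1 else 0.0

    # Toy example: one query whose 3 matches land at positions 0, 2 and 7.
    r = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
    print(rank_k_accuracy([r], K=5))  # 2 correct in the first 5 -> 0.4
    print(average_distance(r))        # gaps of 2 and 5 positions -> 3.5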
Table 2: Experiments with the A dataset. All networks are pre-trained on ImageNet and use a fully connected (1024x1024) head.

Experiment   mAP (%)   Rank-K accuracy (%)   AD (objects)   Train aug.   Embedding aug.   Net
EA3          88.77     84.52                 1.5784         flip         -                ResNet50 v1
EA5          88.91     84.20                 2.5468         flip+crop    -                ResNet50 v1
EA6          88.36     84.25                 2.3895         -            flip             ResNet50 v1
EA7          90.78     88.05                 3.1601         -            crop             ResNet50 v1
EA9          86.78     81.83                 1.8556         flip         flip             ResNet50 v1
EA12         89.51     85.72                 2.3761         flip+crop    flip+crop        ResNet50 v1
EA15         88.38     84.74                 3.1141         flip+crop    flip+crop        ResNet101 v1
EA16         90.95     88.00                 1.8446         flip         flip             ResNet101 v1
Table 3: Experiments with the B dataset. All networks are pre-trained on ImageNet and use a fully connected (1024x1024) head.

Experiment   mAP (%)   Rank-K accuracy (%)   AD (objects)   Train aug.   Embedding aug.   Net
EB3          82.52     76.52                 5.9809         flip         -                ResNet50 v1
EB5          79.26     73.01                 5.4948         flip+crop    -                ResNet50 v1
EB6          81.94     75.92                 4.5859         -            flip             ResNet50 v1
EB7          86.08     80.86                 4.3681         -            crop             ResNet50 v1
EB9          79.75     73.60                 6.0326         flip         flip             ResNet50 v1
EB12         78.00     71.90                 5.2936         flip+crop    flip+crop        ResNet50 v1
EB15         79.04     72.67                 5.8664         flip+crop    flip+crop        ResNet101 v1
EB16         82.11     75.37                 4.9387         flip         flip             ResNet101 v1
4.2 Results
Analysing the time consumption metrics from a more practical point of view, the random approach (Eq. 1) forces the user to traverse every remaining object image for each query object. In contrast, the sorted version (Eq. 2) gives the user the control to stop earlier. Consequently, depending on the dataset, one approach can be better than the other. If the dataset has few unique objects (N) but each with a high number of repetitions, then the random approach can actually be better. Whereas, if the dataset has a higher number of unique objects, each with a low number of repetitions, then the sorted version can achieve better results, hence a faster labelling process.
An analysis was performed using the fictitious datasets FB and FC, which are loosely based on the B and C datasets, respectively. The FB dataset can be described by the following values: T = 220, N = 13, M = 16, C = 20, E = 32, and α = 2. Similarly, the FC dataset is defined by the following constants: T = 1098, N = 55, M = 19, C = 20, E = 357, and α = 2.
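The reported labelling times follow directly from Eq. 1 and Eq. 2; a minimal sketch that reproduces them from these constants:

    def t_random(T, N, M, alpha):
        # Eq. 1: at query i, T - M*i unlabelled images remain,
        # each inspected for alpha seconds.
        return sum((T - M * i) * alpha for i in range(N + 1))

    def t_sorted(N, M, E, C, alpha):
        # Eq. 2: per query, the user inspects the M matches plus the
        # overhead E and the two-page stopping margin C.
        return N * (M + E + C) * alpha

    FB = dict(T=220, N=13, M=16, E=32, C=20, alpha=2)
    FC = dict(T=1098, N=55, M=19, E=357, C=20, alpha=2)
    for name, d in (("FB", FB), ("FC", FC)):
        tr = t_random(d["T"], d["N"], d["M"], d["alpha"])
        ts = t_sorted(d["N"], d["M"], d["E"], d["C"], d["alpha"])
        print(f"{name}: random {tr / 60:.0f} min, sorted {ts / 60:.0f} min")
    # FB: random ~54 min, sorted ~29 min (the paper rounds to 30)
    # FC: random ~1074 min (~18 h), sorted ~726 min (~12 h)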
When using the FB dataset, the random approach ($t_r$) would take about 54 minutes to complete the labelling, while the sorted approach ($t_s$) would take only about 30 minutes. Furthermore, when simulating with a bigger dataset such as FC, the $t_r$ time increases significantly, reaching 18 hours, while the $t_s$ time stays at about 12 hours. Naturally, bigger datasets yield larger differences, making the random approach quickly unfeasible in a business context.
Regarding the Triplet-ReID module, the same experiments were performed with each of the three previously mentioned datasets. In detail, each experiment started by training the model on the appropriate images of each dataset with a given set of parameters, such as the network architecture, the use of pre-trained weights and crop/flip augmentations. The rest of the parameters kept their default values, as changing them did not improve the model performance. Next, the trained model is used to project the test images into an embedding space. As in training, the embedding stage enables crop and flip augmentations as well. Lastly, the evaluation phase computes the mAP, Rank-K accuracy and AD metrics by querying all the test images over the populated embedding space.
Table 2 shows the results yielded by the A dataset. The highest performance increase is due to the use of the pre-trained weights. In contrast, there is no meaningful performance difference between the experiments done with the ResNet50 v1 and ResNet101 v1 networks. The best results were achieved in the EA7 experiment, with a mAP of 90.78%, Rank-K accuracy of 88.05% and AD of 3.1601, and in the EA16 experiment, with a mAP of 90.95%, Rank-K accuracy of 88% and an AD of 1.8446.
The same experiments were conducted using the B dataset and the results are reported in Table 3. The best results were achieved in the EB3 and EB7 experiments. The main difference between the EB and EA sets of experiments is the splitting algorithm used to separate the training and testing data. As might be expected, all metrics decreased.
Table 4: Experiments with the C dataset. All networks are pre-trained on ImageNet and use a fully connected (1024x1024) head.

Experiment   mAP (%)   Rank-K accuracy (%)   AD (objects)   Train aug.   Embedding aug.   Net
EC3          45.01     43.07                 20.1552        flip         -                ResNet50 v1
EC5          44.62     42.57                 20.7063        flip+crop    -                ResNet50 v1
EC6          49.27     45.91                 16.855         -            flip             ResNet50 v1
EC7          45.81     43.54                 21.3495        -            crop             ResNet50 v1
EC9          48.66     45.79                 17.9821        flip         flip             ResNet50 v1
EC12         46.50     44.06                 20.072         flip+crop    flip+crop        ResNet50 v1
EC15         47.01     44.36                 19.4776        flip+crop    flip+crop        ResNet101 v1
EC16         50.82     47.47                 15.7853        flip         flip             ResNet101 v1
The main reason is the novelty of the testing objects, which the model had never seen. The 7.97% decrease in mAP, the 10.18% decrease in Rank-K accuracy and the increase of 2.9620 objects in the AD metric show that the correct objects are further away from the first positions as well as more distant from each other.
The results of the experiments with the C dataset are reported in Table 4. The main difference is the significantly lower metric values, with a decrease of 33.88% in mAP, 30.39% in Rank-K accuracy, and an increase of 13.7278 objects in AD. Despite the goal of generating a more discriminative model by using more training samples, the performance impact suggests that the new objects added to the C dataset are in fact harder to distinguish, thus resulting in lower metrics overall.
Lastly, after querying the system with a variety of objects, one can notice a pattern regarding the real size of the objects. In fact, all objects are cropped by their bounding box and resized to a height of 256 pixels and a width of 128 pixels. Consequently, it becomes impossible to compare their actual size, since objects with major size differences can fill the same percentage of the final image. As a result, objects which have a similar appearance but a noticeable size difference are usually challenging for the system to distinguish. Figure 5 shows some practical examples where the problem appears.
Figure 5: Examples where the size problem is evident. The first column shows a query object and the second column shows an object that appears within the first 10 positions of the results.
5 CONCLUSION
Due to the importance of robustness in object detection models, having control over the objects in both the training and testing phases is crucial. The identification of all unique objects using existing labelling tools is an inefficient workflow, since they only allow the analysis of one image at a time, which could even make the labelling process unfeasible for large datasets.
This paper proposes a web-based tool coupled with the Triplet-ReID model to improve the labelling process. The addition of the machine learning module allows the sorting of unlabelled objects by similarity. Furthermore, being a web-based application enables its deployment locally or on external servers, hence allowing for both private and crowdsourced labelling strategies.
The experiments demonstrate the viability of the solution, both in terms of sorting performance and labelling time, when compared to a random approach, especially when using a dataset with a high number of different objects, each with a low number of repetitions. Moreover, given the generic architecture of the developed framework, it can certainly be applied to a wide range of use cases.
As future work, the experiments should be replicated with larger datasets, as the size of the ones used can influence the reported results. Also, the addition of visual attention learning mechanisms to re-weight the learned feature maps will be explored to assess their impact on the object similarity sorting performance (Li et al., 2018) (Song et al., 2018) (Si et al., 2018) (Chen et al., 2020). Moreover, several techniques to maintain the scale between objects will be experimented with, in order to make their size a valuable feature in the object re-identification task.
ACKNOWLEDGEMENTS
This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039334; Funding Reference: POCI-01-0247-FEDER-039334].
REFERENCES
Bolle, R. M., Connell, J. H., Pankanti, S., Ratha, N. K., and
Senior, A. W. (2005). The Relation Between the ROC
Curve and the CMC. Technical report.
Chen, Y., Wang, H., Sun, X., Fan, B., and Tang, C. (2020).
Deep Attention Aware Feature Learning for Person
Re-Identification.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2010). ImageNet: A large-scale hierarchical image database. pages 248–255. Institute of Electrical and Electronics Engineers (IEEE).
Dikmen, M., Akbas, E., Huang, T. S., and Ahuja, N. Pedestrian Recognition with a Learned Metric. Technical report.
Everingham, M., Eslami, S. M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2014). The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98–136.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.
Farenzena, M., Bazzani, L., Perina, A., Murino, V., and Cristani, M. (2010). Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2360–2367.
Hermans, A., Beyer, L., and Leibe, B. (2017). In Defense
of the Triplet Loss for Person Re-Identification.
Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-
Haija, S., Kuznetsova, A., Rom, H., Uijlings, J.,
Popov, S., Kamali, S., Malloci, M., Pont-Tuset,
J., Veit, A., Belongie, S., Gomes, V., Gupta, A.,
Sun, C., Chechik, G., Cai, D., Feng, Z., Narayanan,
D., and Murphy, K. (2017). OpenImages: A
public dataset for large-scale multi-label and multi-
class image classification. Dataset available from
https://storage.googleapis.com/openimages/web/index.html.
Li, W. and Wang, X. (2013). Locally aligned feature transforms across views. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3594–3601.
Li, W., Zhu, X., and Gong, S. (2018). Harmonious Attention Network for Person Re-Identification.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2014). Microsoft COCO: Common Objects in Context.
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., and Pietikäinen, M. (2020). Deep Learning for Generic Object Detection: A Survey. International Journal of Computer Vision, 128(2):261–318.
Mathur, N., Mathur, S., Mathur, D., and Dadheech, P. (2020). A Brief Survey of Deep Learning Techniques for Person Re-identification. In Proceedings of 3rd International Conference on Emerging Technologies in Computer Engineering: Machine Learning and Internet of Things, ICETCE 2020, pages 129–138. Institute of Electrical and Electronics Engineers Inc.
Quan, R., Dong, X., Wu, Y., Zhu, L., and Yang, Y. (2019).
Auto-ReID: Searching for a Part-aware ConvNet for
Person Re-Identification.
R. Kawamura (2020). RectLabel.
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman,
W. T. (2008). LabelMe: A database and web-based
tool for image annotation. International Journal of
Computer Vision, 77(1-3):157–173.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
FaceNet: A Unified Embedding for Face Recognition
and Clustering.
Si, J., Zhang, H., Li, C. G., Kuen, J., Kong, X., Kot, A. C., and Wang, G. (2018). Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 5363–5372. IEEE Computer Society.
Song, C., Huang, Y., Ouyang, W., and Wang, L. (2018). Mask-Guided Contrastive Attention Model for Person Re-identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1179–1188. IEEE Computer Society.
Tzutalin. LabelImg.
Wang, G., Lai, J., Huang, P., and Xie, X. (2018). Spatial-
Temporal Person Re-identification.
Wojke, N. and Bewley, A. (2018). Deep Cosine Metric
Learning for Person Re-Identification.
Yang, X., Wang, M., and Tao, D. (2019). Person Re-
identification with Metric Learning using Privileged
Information.
Yi, D., Lei, Z., and Li, S. Z. (2014). Deep Metric Learning
for Practical Person Re-Identification.
Zhao, R., Ouyang, W., and Wang, X. (2013). Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3586–3593.
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015). Scalable Person Re-identification: A Benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1116–1124. IEEE.
Zheng, W. S., Gong, S., and Xiang, T. (2011). Person re-identification by probabilistic relative distance comparison. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 649–656. IEEE Computer Society.