Unsupervised Annotation and Detection of Novel Objects Using Known Objectness

Harsh Singh Jadon, Jagdish Deshmukh, Kamakshya Prasad Nayak (a), Kamalakar Vijay Thakare (b) and Debi Prosad Dogra (c)
Indian Institute of Technology Bhubaneswar, Odisha, PIN 752050, India
Keywords:
Novel Object Detection, Unsupervised Learning, Weakly Annotated Dataset, Clustering.
Abstract:
The paper proposes a new approach to detecting and annotating novel objects, i.e., objects that are not part of the training dataset. The ability to detect novel objects is essential in computer vision, enabling machines to recognize objects that have not been seen before. Current models often fail to detect novel objects because they rely on the predefined categories of the training data. Our approach overcomes this limitation by leveraging a large and diverse dataset of objects obtained through web scraping. We extract features using a backbone network and perform clustering to remove redundant data. The resulting dataset is then used to retrain object detection models. We also provide insights into the effect of clustering and data-redundancy removal on performance. Overall, the work contributes to the field of object detection by providing a new approach for detecting novel objects, with the potential to be applied to a variety of real-world computer vision applications.
1 INTRODUCTION
Region proposals are vital building blocks in object
detection tasks. An improved pre-processing of pro-
posals usually leads to significant improvements in
various applications, such as object recognition (Li
et al., 2022b; Li et al., 2022a; Yang et al., 2022;
Cheng et al., 2022; Hou et al., 2022), video track-
ing (Qin et al., 2022; Tang and Ling, 2022; Blatter
et al., 2023), object discovery (Hénaff et al., 2022;
Wang et al., 2022; Bao et al., 2022) and segmentation (Park et al., 2022; Xu et al., 2022). Owing to such advancements, object detection using region proposals has attracted tremendous attention. These tasks aim to generate region proposals and predict labels at an acceptable confidence threshold. However, one major limitation
of current object detection models is their inability to
detect novel objects that are not present in the training
data. This limitation hinders the ability of the models
to recognize unknown objects, which may be a criti-
cal requirement for various real-world applications.
a https://orcid.org/0000-0002-0356-8377
b https://orcid.org/0000-0003-4587-4126
c https://orcid.org/0000-0002-3904-732X
These authors contributed equally to this work.
Corresponding author
Several attempts (LaBonte et al., 2023; Xu et al.,
2020; Wu et al., 2021; Zhu et al., 2021; Fan et al.,
2020) have been made in recent years to detect novel
objects. Kuo et al. (Kuo et al., 2015) have developed
DeepBox that generates bottom-up proposals and re-
ranks them using CNNs, given any RGB image. Kang
et al. (Kang et al., 2019) have designed a novel few-
shot detection model that 1) learns generalized meta-
features and 2) automatically re-weights the features
for novel class detection by producing class-specific
activating coefficients from a few support samples.
However, unlike their learning-free counterparts (Bao et al., 2022; Hénaff et al., 2022; Kuo et al., 2015), these methods tend to over-fit to the annotated categories and struggle with novel objects.
To overcome this limitation, this paper proposes a
new approach to detect novel objects in images that
are not precisely part of a training dataset. Our ap-
proach leverages a large and diverse dataset of objects
obtained through web scraping. It enables the detec-
tion of novel objects with higher accuracy as com-
pared to recent methods. We extract features using
well-known backbone networks and perform cluster-
ing to remove redundant data. The resulting dataset is
then used to retrain the object detection model to ob-
tain final results. Existing models rely on predefined
categories and context-free learning in the training
data and they often fail to detect novel objects. Our
approach has been evaluated on several benchmarks,
demonstrating its effectiveness in detecting novel ob-
jects with high accuracy. Moreover, we provide in-
sights into the effect of clustering and data redun-
dancy removal on the performance of the proposed
approach. The paper offers the following technical contributions:
• We propose a systematic, easy-to-follow approach to accumulate relevant data samples for novel object discovery. With sufficient related keywords, the proposed approach can generate a large pool of auto-annotated images that can be used for training object detectors.
• We comprehensively evaluate pre-trained object detection models on a newly generated dataset. This evaluation encompasses various aspects such as threshold tuning, accuracy on novel classes, and the effect of the number of images on the model’s accuracy.
The rest of the paper is organized as follows. Sec-
tion 2 provides a review of related work in the field of
object detection and detection of novel objects. Sec-
tion 3 describes the proposed approach in detail, in-
cluding data collection, feature extraction, clustering,
and retraining. Section 4 presents the experimental setup and results. Finally, Section 5 concludes the paper and discusses future work.
2 RELATED WORK
2.1 Object Detection
Traditionally, object detection algorithms relied heavily on handcrafted features. However, the advent of R-CNN-style detectors (Girshick, 2015) has significantly boosted the progress of object detection. Subsequently, Fast R-CNN and Faster R-CNN improved the R-CNN model by jointly training a detector and a bounding-box regressor within the same network configuration. Building on Faster R-CNN (Ren et al., 2015), Feature Pyramid Networks (FPN) were proposed. The FPN architecture uses a top-down pathway with lateral connections to generate high-level semantics at all scales and has shown significant progress in detecting objects at varying scales. Another improvement is Mask R-CNN (He et al., 2017), which incorporates pixel-level masks into Faster R-CNN. However, the majority of object detection algorithms demand fully annotated data for every object class they aim to detect. With the remarkable progress in object detection, the focus of recent object proposal research has transitioned from object discovery to detection. Object discovery proposals aim to propose all objects in an image, whereas detection proposals are designed to propose only the labelled categories for downstream classification.
2.2 Class Agnostic Object Detection
Class-agnostic object detection is a subfield of com-
puter vision that aims to detect objects without rely-
ing on pre-defined classes. It focuses on detecting and
localizing objects in an image regardless of their cat-
egory. Class-agnostic object detection has numerous
applications, including object tracking, scene under-
standing, and robotics. Recent advances in deep learning models, such as one-stage detectors, have significantly improved the accuracy and speed of class-agnostic object detection. However, the field still faces several challenges, including object occlusion, scale variation, and background clutter. Class-agnostic object detection was previously tackled using traditional methods such as Selective Search (Uijlings et al., 2013), EdgeBox (Zitnick and Dollar, 2014), DeepMask (Pinheiro et al., 2015), and MCG (Pont-Tuset et al., 2017). More recently, approaches such as the Object Localisation Network (Kim et al., 2021) and the Multiscale Attention Vision Transformer with Late Fusion (Maaz et al., 2022) have emerged. While existing approaches have made significant advances, they still have limitations that need to be addressed, such as the following:
Limited Generalization to Novel Objects. Pre-
trained object detection models often struggle to gen-
eralize well to novel objects not present in their train-
ing dataset. When faced with unseen object cate-
gories, these models may exhibit reduced accuracy
and reliability, hindering their performance in real-
world scenarios where object variability is high.
Dependency on Annotated Training Data. Most
state-of-the-art object detection models rely heavily
on large-scale annotated training data. The manual
annotation process is time-consuming, expensive, and
subject to human error, limiting the scalability and ac-
cessibility of these approaches.
3 METHOD
The pipeline of the proposed approach is summarized in Fig. 1. It consists of three stages: accumulation of a novel object dataset using an image search engine, the RP-FE block, and refinement of the most relevant data using clustering, followed by custom training of the detection model with the new data. Each stage is explained in detail in the following sections.

Figure 1: The architecture of the proposed framework. It has three important stages: (i) extraction of a relevant object dataset using a web scraper tool that leverages a large pool of keywords; (ii) the RP-FE block, which extracts region proposals using class-agnostic object detectors and feeds them to a backbone network for feature extraction; (iii) clustering to remove redundant image samples, after which the newly generated dataset is used for custom training of an existing object detection model.
3.1 Dataset Generation
The critical component of a typical object detec-
tion method is obtaining the labelled training data
that contains examples of well-known object classes.
However, there is an absence of labelled data accessi-
ble for training regarding new things. It is challenging
to build supervised learning techniques for identify-
ing new objects due to the scarcity of data. Moreover,
generating labelled data is a tedious task.
Figure 2: A depiction of a few image samples retrieved using web scraping. The keywords are mango, ripe mango, mango with white background, and so on.
To mitigate the aforementioned problem, we have employed a web scraper. This tool directly searches the World Wide Web for a given keyword using the Hypertext Transfer Protocol and extracts image files from the embedded HTML of each web page. The fully automated tool retrieves non-copyrighted images from websites based on a particular keyword. Thus, we can obtain novel object data directly from the internet without supervision. Fig. 2 depicts a collection of image samples for the object mango.
Let the set of n (similar) keywords for an unknown or novel object be given in (1). For example, if K is the set of keywords related to mango, then k_1 = ripe mango, k_2 = green mango, k_3 = mango with knife, etc. We have employed various keywords for individual objects based on natural observations, brainstorming sessions, and suggestions from search engines. For instance, mango with leaf, tennis ball with racket, etc., have been used to search for images on the web.

K = {k_1, k_2, ..., k_n}    (1)
We also assume that C = {C_1, C_2, ..., C_n} is the set of object images retrieved from the web using the keyword set K. The dataset C is an extensive collection of non-copyrighted images related to the particular object. The images of the dataset are then processed through an existing object detection model, as discussed in the next section.
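To make the dataset-generation step concrete, the following is a minimal Python sketch of how such a scraper can be realized, assuming a results page whose images appear as plain <img> tags. SEARCH_URL is a hypothetical endpoint standing in for whichever search engine the scraper queries, and the snippet ignores pagination, licensing filters, and de-duplication.

```python
# Minimal sketch of the dataset-generation stage (Sec. 3.1): for each
# keyword k_i in K, fetch a results page and save the embedded images.
# SEARCH_URL is a placeholder; the actual scraping tool and image source
# used in the paper are not specified here.
import os
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://example-image-search.org/search?q={}"  # hypothetical endpoint

def scrape_keyword(keyword: str, out_dir: str, limit: int = 500) -> int:
    """Download up to `limit` images embedded in the result page for `keyword`."""
    os.makedirs(out_dir, exist_ok=True)
    page = requests.get(SEARCH_URL.format(requests.utils.quote(keyword)), timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    saved = 0
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src or not src.startswith("http"):
            continue
        try:
            data = requests.get(src, timeout=10).content
        except requests.RequestException:
            continue  # skip unreachable images
        with open(os.path.join(out_dir, f"{keyword}_{saved}.jpg"), "wb") as f:
            f.write(data)
        saved += 1
        if saved >= limit:
            break
    return saved

# Build the pseudo-object dataset C from the keyword set K.
K = ["mango", "ripe mango", "mango with white background"]
for i, k in enumerate(K):
    scrape_keyword(k, out_dir=f"dataset/C_{i+1}")
```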
3.2 RP-FE Block
The approach uses a class-agnostic object detection model to detect all potential objects present in the images of the dataset. By employing a class-agnostic model, we are able to capture a wide range of objects without being limited to a predefined set of object classes. The detection process generates region-of-interest proposals (RP) that encompass the spatial locations of these objects. Subsequently, we extract discriminative features (FE) from these object proposals, enabling us to capture meaningful information specific to each object.
Region Pooling (RP) Extraction. Let M be the class-agnostic object detection model with pre-trained weights W_M, as given in (2), where p_M represents the parameters.

M = (W_M, p_M)    (2)
Also, we assume that F = (W_F, p_F) is a backbone network that extracts features from a given image. We first employ the class-agnostic object detector M and obtain the set of bounding boxes B = {B_1, B_2, ..., B_n}, where B_i is the set of coordinates of the objects present in C_i. Since C is a collection of novel object images, we need a refinement strategy to filter the most relevant data that accurately represent K. To accomplish this, we have employed well-known clustering techniques and obtained the curated data.
Feature Extraction (FE). First, we extract features by employing a set of backbone networks, F = {F_1, F_2, F_3}. This extraction is shown in (3), where the features D_1 are extracted from the proposals B_1 of the image set C_1 using the i-th backbone with pre-trained weights W_{F_i}.

D_1 = F_i(B_1, W_{F_i})    (3)
Here, F_1, F_2, and F_3 are the popular ResNet-50, InceptionV3, and EfficientNet networks. In the later stage, D = {D_1, D_2, ..., D_n} is the set of extracted features fed to the clustering algorithm. The clustering step is
to the clustering algorithm. The clustering step is
important in eliminating redundant data and organiz-
ing the extracted features into distinct groups or clus-
ters. This process helps us to identify and retain the
novel object’s most relevant and representative data
samples. Since the unlabelled images are obtained
through keyword searching, clustering enables us to
group similar features, thereby facilitating the selec-
tion of the most meaningful and discriminative rep-
resentations for further analysis. To capture the most
discriminative features with minimal intra-class vari-
ations, we have selected the smallest cluster for ex-
perimentation.
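A condensed sketch of the FE and refinement steps follows, assuming the proposal boxes are already available from a class-agnostic detector. ResNet-50 (via torchvision 0.13+) stands in for F_1, and the smallest K-Means cluster is retained as described above; the helper names are illustrative, not part of any cited model's API.

```python
# Sketch of the FE + clustering refinement (Sec. 3.2): embed every region
# proposal with a pre-trained ResNet-50, cluster the embeddings with
# K-Means, and keep the smallest cluster as the most discriminative subset.
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models

# Backbone F_1 with pre-trained weights W_F1; the classifier head is
# dropped so the network emits a 2048-d pooled feature per crop.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_proposal(image: Image.Image, box) -> np.ndarray:
    """Crop one proposal box (x1, y1, x2, y2) and return its feature vector."""
    crop = image.convert("RGB").crop(tuple(box))
    return backbone(preprocess(crop).unsqueeze(0)).flatten().numpy()

def smallest_cluster_indices(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Run K-Means on the stacked features D and return the indices of the
    smallest cluster, retained here as the curated training subset."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    return np.where(labels == np.argmin(np.bincount(labels)))[0]
```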
3.3 Custom Training
Once we obtain the most relevant data through clus-
tering, we create a new weakly labelled dataset. The
dataset is the foundation for the custom training of an
existing object detection model. The objective is to
fine-tune an existing trained model using the weakly
generated dataset, focusing on the targeted class of
objects. During the class-agnostic detection and ROI
pooling stages, we not only detect objects but also
preserve their spatial locations within the image. This
spatial information is important for accurate annota-
tion preparation and training. By maintaining the spa-
tial context of each object, we ensure that the model
learns to recognize and localize the targeted class ef-
fectively.
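To illustrate how the preserved spatial locations translate into weak annotations, the sketch below converts one box into the normalized (class, cx, cy, w, h) text format used by YOLO-style trainers such as YOLOv7; the class id and box values are placeholders.

```python
# Sketch: convert an absolute box (x1, y1, x2, y2) into one line of a
# YOLO-format label file (class cx cy w h, all normalized to [0, 1]).
def to_yolo_line(box, img_w: int, img_h: int, class_id: int = 0) -> str:
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # normalized box centre
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # normalized box size
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# e.g. a 200x200 box in a 640x480 image of the targeted class (id 0)
print(to_yolo_line((100, 50, 300, 250), img_w=640, img_h=480))
```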
Training. The annotation process involves labelling
the weakly generated dataset and providing class-
specific annotations for the targeted objects. These
annotations serve as ground truth labels for training
the pre-trained object detection model and refining its
ability to detect and classify the desired class of ob-
jects accurately. To learn the parameters of the model, we have incorporated the IoU loss L_IoU given in (4), where G_T is the set of four coordinates serving as the ground-truth label of an object and P is the set of coordinates predicted by the model.

L_IoU = -ln( |G_T ∩ P| / |G_T ∪ P| )    (4)

This IoU loss helps the model learn intrinsic features, resulting in better localization.
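A worked sketch of Eq. (4) for a single axis-aligned box pair is given below, assuming the standard -ln(IoU) formulation; during training this would be computed over batched tensors rather than Python floats.

```python
# Sketch of Eq. (4) for one axis-aligned box pair (x1, y1, x2, y2):
# L_IoU = -ln(|G_T ∩ P| / |G_T ∪ P|); the loss vanishes as IoU -> 1.
import math

def iou_loss(gt, pred, eps: float = 1e-7) -> float:
    ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # |G_T ∩ P|
    union = ((gt[2] - gt[0]) * (gt[3] - gt[1])
             + (pred[2] - pred[0]) * (pred[3] - pred[1]) - inter)
    return -math.log(inter / union + eps)               # eps guards IoU = 0

print(iou_loss((0, 0, 10, 10), (1, 1, 10, 10)))  # high overlap, small loss
```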
Evaluation. We have comprehensively evaluated the
pre-trained object detection models using a newly
available dataset. As the newly acquired dataset is
utilized for training purposes, we have derived the
test set from this newly constructed data. To assess
the performance of the models, we have designed and
conducted four distinct experiments. Firstly, we have
performed threshold tuning, systematically varying
the threshold values to investigate their impact on the
models’ detection accuracy. Secondly, we have eval-
uated the models’ accuracy on novel classes, which
are not included in the original training set. This has
allowed us to measure the ability to generalize to un-
seen object categories. Thirdly, after fine-tuning, we
have quantified the accuracy improvement achieved
by the pre-trained models on these novel classes. This
analysis has provided us insights into the transferabil-
ity of the learned representations. Finally, we have
explored the effect of varying the number of train-
ing images on the models’ performance, investigating
how the scale of the dataset influenced their detection
capabilities.
4 EXPERIMENTS AND RESULTS
In this section, we present experimental results for the following four categories: (i) threshold tuning, (ii) accuracy on novel classes, (iii) accuracy on existing classes, and (iv) the effect of the number of training samples.
4.1 Implementation Details
We have utilised two class-agnostic object detection models: OLN (Kim et al., 2021) and MAVL (Maaz et al., 2022). OLN (Kim et al., 2021) is a two-stage object proposer similar to Faster R-CNN (Ren et al., 2015). It consists of a fully convolutional (FCN) stage and a region-based (ROI) stage, followed by localization quality predictions. Meanwhile, MAVL (Maaz et al., 2022) has been trained on LMDet (Peng et al., 2020), a large-scale object detection dataset with 478,000 images and 10.5 million object instances. For the feature extraction phase of the pipeline, we have considered three models: ResNet50, InceptionV3, and EfficientNetB0, selected for their excellent performance on image classification. We have used the scikit-learn implementations of the K-Means and DBSCAN clustering approaches. In addition, we have trained and evaluated YOLOv7, currently the best-performing model in the YOLO family.
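The two clustering back-ends are used roughly as follows; a small sketch with scikit-learn, where the feature matrix and the DBSCAN eps/min_samples values are illustrative placeholders rather than the settings used in our experiments.

```python
# Sketch of the two clustering back-ends compared later (Tables 4-5),
# using scikit-learn; features and hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

features = np.random.rand(500, 2048)  # stand-in for the extracted features D

km_labels = KMeans(n_clusters=3, n_init=10).fit_predict(features)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(features)  # -1 = noise

print("K-Means cluster sizes:", np.bincount(km_labels))
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```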
Evaluation Metrics. We have used recall (R), precision (P), and mean Average Precision (mAP) for evaluation. Recall measures the ability to identify all relevant instances, precision evaluates the correctness of the detected instances, and mAP assesses overall detection performance by averaging precision across all classes or categories. Together, these metrics offer insight into a model’s object detection capability and accuracy.
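For clarity, the sketch below shows one common way P and R are computed for detections at a fixed IoU threshold (greedy matching in descending confidence order); mAP additionally averages precision over recall levels and classes, which is omitted here.

```python
# Sketch of detection P/R: a prediction is a true positive if it matches
# an unused ground-truth box with IoU >= iou_thr.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """preds: (x1, y1, x2, y2, confidence) tuples; gts: (x1, y1, x2, y2)."""
    matched, tp = set(), 0
    for p in sorted(preds, key=lambda d: d[4], reverse=True):
        best, best_iou = None, iou_thr
        for j, g in enumerate(gts):
            if j not in matched and iou(p[:4], g) >= best_iou:
                best, best_iou = j, iou(p[:4], g)
        if best is not None:
            matched.add(best)  # each ground-truth box matches at most once
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)
```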
4.2 Threshold Tuning
Threshold tuning is important in object detection as it adjusts the detection sensitivity. By selecting an appropriate threshold, the trade-off between false positives and false negatives can be controlled, thereby influencing the overall precision and recall of the system. Optimal threshold tuning lets the model balance accurate object detection against erroneous detections, leading to improved performance and more reliable results. We have conducted several experiments using different thresholds on the OLN (Kim et al., 2021) and MAVL (Maaz et al., 2022) models. Tab. 1 reports the accuracy metrics at varying threshold values.
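Operationally, the sweep amounts to discarding detections below a confidence threshold before scoring, illustrated below on dummy detections; the thresholds mirror the OLN rows of Tab. 1.

```python
# Sketch of the sweep behind Tab. 1: detections below the confidence
# threshold t are discarded before P, R, and mAP are computed on the rest.
detections = [  # (x1, y1, x2, y2, confidence) - illustrative model outputs
    (10, 10, 50, 50, 0.93),
    (60, 20, 90, 80, 0.71),
    (5, 5, 25, 30, 0.48),
]

for t in (0.5, 0.6, 0.66, 0.7, 0.75):  # the OLN thresholds of Tab. 1
    kept = [d for d in detections if d[4] >= t]
    print(f"t={t:.2f}: kept {len(kept)}/{len(detections)} detections")
```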
4.3 Experiments on Pre-Trained Classes
Even though the COCO dataset includes the sports ball class and YOLOv7 has been trained on it, the model often identifies tennis balls as instances of oranges, resulting in reduced precision on the testing partition. Predictions for the tennis ball are shown in Fig. 5. We have utilized OLN, ResNet50, and K-means clustering with the search query “tennis ball” to create a tennis ball dataset in which the tennis ball instances are labelled as sports balls. We created two partitions, one for training and the other for testing. Tables 2 and 3 present the prediction results, showing the improvement in accuracy when the model is retrained with the proposed approach. As can be seen, the accuracy gain is higher for sports ball, because the number of tennis ball samples in the COCO dataset is low and the class often gets predicted as orange. After retraining with the new dataset, the model successfully detects tennis balls. The gain is smaller for suitcase, as that class is usually classified correctly even without retraining.
Table 1: Accuracy metrics at various thresholds for the OLN (Kim et al., 2021) and MAVL (Maaz et al., 2022) models.

Model  Threshold  P      R      mAP@.5  mAP@.5:.95
OLN    0.5        0.819  0.964  0.793   0.664
OLN    0.6        0.906  0.96   0.902   0.74
OLN    0.66       0.96   0.96   0.941   0.78
OLN    0.7        0.96   0.938  0.937   0.755
OLN    0.75       0.963  0.88   0.921   0.728
MAVL   0.7        0.41   0.81   0.55    0.389
MAVL   0.75       0.44   0.809  0.6     0.44
MAVL   0.8        0.49   0.806  0.613   0.471
MAVL   0.85       0.49   0.756  0.606   0.436
MAVL   0.9        0.49   0.72   0.577   0.38
Figure 3: Object detection results using YOLOv7 on the novel categories DSLR camera and Rubik’s cube. The top row shows the YOLOv7 predictions before custom training; the bottom row shows the predictions of the model retrained with the proposed approach.
Figure 4: Object detection results using YOLOv7 on two novel categories, dustbin and headphones. The top row shows the YOLOv7 predictions before custom training; the bottom row shows the predictions of the model retrained with the proposed approach.

Figure 5: Top row: the tennis ball is part of the COCO dataset, but YOLOv7 predicts its label incorrectly. Bottom row: prediction results after retraining on the new dataset generated by the proposed pipeline.
Table 2: Improvement in accuracy metrics after additional training on the “sports ball” class.

Sports Ball      P               R               mAP@.5          mAP@.5:.95
Before Training  0.669           0.806           0.705           0.655
After Training   0.923 (+0.254)  0.854 (+0.048)  0.902 (+0.197)  0.775 (+0.12)
Table 3: Improvement in accuracy metrics after additional training on the “suitcase” class.

Suitcase         P               R               mAP@.5          mAP@.5:.95
Before Training  0.917           0.834           0.902           0.724
After Training   0.941 (+0.024)  0.857 (+0.023)  0.919 (+0.017)  0.739 (+0.015)
4.4 Accuracy on Novel Classes

Evaluating the accuracy of pre-trained object detection models on novel classes is essential to assess their generalization capability. We have trained and evaluated YOLOv7 on the following novel categories: Rubik’s cube, dustbin, headphones, DSLR camera, and tennis ball. We report the accuracy of YOLOv7 under different metrics for the Rubik’s cube in Table 4 and for the DSLR camera in Table 5. A few resulting predictions are shown in Figs. 3 and 4.
4.5 Results Using Varying Number of Images

As we scrape images from the web without using any paid APIs, we can fetch a maximum of 400-500 images in a single run. This number can further vary with the quality of the internet connection and the server load. Hence, we have studied the effect of the number of fetched images on novel-class training accuracy. The results presented in Tab. 6 show that a higher mAP can be achieved with a larger number of samples.
Table 4: Accuracy metrics of YOLOv7 when trained on the dataset produced by the proposed pipeline using the search term “Rubix Cube”, considering all possible combinations of the pipeline’s three stages.

Class-Agnostic OD  Feature Extraction  Clustering  P      R      mAP@.5  mAP@.5:.95
OLN                ResNet50            K-Means     0.96   0.96   0.941   0.78
OLN                InceptionV3         K-Means     0.842  0.95   0.869   0.74
OLN                EfficientNet        K-Means     0.889  0.96   0.922   0.788
MAVL               ResNet50            K-Means     0.45   0.731  0.512   0.39
MAVL               InceptionV3         K-Means     0.47   0.721  0.517   0.4
MAVL               EfficientNet        K-Means     0.49   0.806  0.613   0.471
OLN                ResNet50            DBSCAN      0.72   0.69   0.702   0.541
OLN                InceptionV3         DBSCAN      0.641  0.604  0.647   0.48
OLN                EfficientNet        DBSCAN      0.556  0.62   0.57    0.39
MAVL               ResNet50            DBSCAN      0.69   0.68   0.67    0.5
MAVL               InceptionV3         DBSCAN      0.56   0.54   0.547   0.372
MAVL               EfficientNet        DBSCAN      0.492  0.523  0.48    0.31
Table 5: Accuracy metrics of YOLOv7 when trained on the dataset produced by the proposed pipeline using the search term “DSLR Camera”, considering all possible combinations of the pipeline’s three stages.

Class-Agnostic OD  Feature Extraction  Clustering  P      R      mAP@.5  mAP@.5:.95
OLN                ResNet50            K-Means     0.958  0.958  0.985   0.826
OLN                InceptionV3         K-Means     0.833  0.95   0.86    0.737
OLN                EfficientNet        K-Means     0.88   0.955  0.917   0.779
MAVL               ResNet50            K-Means     0.552  0.709  0.518   0.464
MAVL               InceptionV3         K-Means     0.556  0.711  0.518   0.498
MAVL               EfficientNet        K-Means     0.589  0.83   0.54    0.523
OLN                ResNet50            DBSCAN      0.57   0.713  0.532   0.499
OLN                InceptionV3         DBSCAN      0.548  0.719  0.51    0.49
OLN                EfficientNet        DBSCAN      0.536  0.808  0.504   0.482
MAVL               ResNet50            DBSCAN      0.432  0.607  0.41    0.343
MAVL               InceptionV3         DBSCAN      0.433  0.619  0.403   0.33
MAVL               EfficientNet        DBSCAN      0.459  0.64   0.429   0.37
Table 6: Effect of the number of scraped images on accuracy metrics after training.

No. of images  P      R      mAP@.5  mAP@.5:.95
50             0.378  0.4    0.371   0.2
100            0.457  0.44   0.487   0.338
200            0.579  0.48   0.553   0.378
300            0.823  0.56   0.706   0.544
400            0.923  0.949  0.929   0.72
500            0.96   0.96   0.941   0.78
5 CONCLUSIONS
In conclusion, the research paper presents a compre-
hensive approach to unsupervised novel object detec-
tion. We have detected a few potential objects in the
image using a class-agnostic object detection model
and obtained ROI proposals. Extracting features from
these proposals has facilitated the representation of
each object in a discriminative manner. Through clus-
tering, we have reduced redundancy and obtained a
refined dataset generated from data obtained via key-
word searching on the web. Custom training has been
performed using the weakly generated dataset to en-
Unsupervised Annotation and Detection of Novel Objects Using Known Objectness
699
hance the object detection model further. Spatial in-
formation of the objects has been preserved during the
-agnostic detection and ROI pooling stages, ensuring
accurate annotation preparation. The model has been
fine-tuned on weakly generated datasets, focusing on
the targeted class, resulting in improved object detec-
tion capabilities.
Evaluation of the custom-trained model has
demonstrated its effectiveness in detecting and local-
izing the targeted class of objects. The integration
of clustering, weakly generated data, spatial preser-
vation, and custom training has contributed to the
overall success of the proposed approach. This re-
search provides new insights into unsupervised novel
object detection, addressing the challenges of limited
labelled data for novel objects. The methodology pre-
sented in this paper offers a practical framework for
detecting and localizing novel objects in various do-
mains, paving the way for advancements in computer
vision and object detection research. Future work can
focus on extending this approach to real-time applica-
tions and exploring additional techniques to enhance
the accuracy and efficiency of unsupervised novel ob-
ject detection systems.
REFERENCES
Bao, Z., Tokmakov, P., Jabri, A., Wang, Y.-X., Gaidon, A.,
and Hebert, M. (2022). Discovering objects that can
move. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
11789–11798.
Blatter, P., Kanakis, M., Danelljan, M., and Van Gool,
L. (2023). Efficient visual tracking with exemplar
transformers. In Proceedings of the IEEE/CVF Win-
ter Conference on Applications of Computer Vision,
pages 1571–1581.
Cheng, G., Wang, J., Li, K., Xie, X., Lang, C., Yao, Y., and
Han, J. (2022). Anchor-free oriented proposal gener-
ator for object detection. IEEE Transactions on Geo-
science and Remote Sensing, 60:1–11.
Fan, Q., Zhuo, W., Tang, C.-K., and Tai, Y.-W. (2020). Few-
shot object detection with attention-rpn and multi-
relation detector. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recogni-
tion, pages 4013–4022.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages
1440–1448.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In Proceedings of the IEEE international
conference on computer vision, pages 2961–2969.
Hénaff, O. J., Koppula, S., Shelhamer, E., Zoran, D., Jaegle, A., Zisserman, A., Carreira, J., and Arandjelović, R. (2022). Object discovery and representa-
tion networks. In Computer Vision–ECCV 2022: 17th
European Conference, Tel Aviv, Israel, October 23–
27, 2022, Proceedings, Part XXVII, pages 123–143.
Springer.
Hou, L., Lu, K., Xue, J., and Li, Y. (2022). Shape-adaptive
selection and measurement for oriented object detec-
tion. In Proceedings of the AAAI Conference on Arti-
ficial Intelligence, volume 36, pages 923–932.
Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., and Dar-
rell, T. (2019). Few-shot object detection via feature
reweighting.
Kim, D., Lin, T., Angelova, A., Kweon, I. S., and Kuo, W.
(2021). Learning open-world object proposals without
learning to classify. CoRR, abs/2108.06753.
Kuo, W., Hariharan, B., and Malik, J. (2015). Deepbox:
Learning objectness with convolutional networks.
LaBonte, T., Song, Y., Wang, X., Vineet, V., and Joshi, N.
(2023). Scaling novel object detection with weakly
supervised detection transformers. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 85–96.
Li, W., Chen, Y., Hu, K., and Zhu, J. (2022a). Oriented
reppoints for aerial object detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 1829–1838.
Li, Y., Mao, H., Girshick, R., and He, K. (2022b). Ex-
ploring plain vision transformer backbones for object
detection. In Computer Vision–ECCV 2022: 17th Eu-
ropean Conference, Tel Aviv, Israel, October 23–27,
2022, Proceedings, Part IX, pages 280–296. Springer.
Maaz, M., Rasheed, H., Khan, S., Khan, F. S., Anwer,
R. M., and Yang, M.-H. (2022). Class-agnostic ob-
ject detection with multi-modal transformer. In 17th
European Conference on Computer Vision (ECCV).
Springer.
Pinheiro, P. O., Collobert, R., and Dollar, P. (2015).
Learning to segment object candidates. In Cortes, C.,
Lawrence, N., Lee, D., Sugiyama, M., and Garnett,
R., editors, Advances in Neural Information Process-
ing Systems, volume 28. Curran Associates, Inc.
Park, K., Woo, S., Oh, S. W., Kweon, I. S., and Lee, J.-Y.
(2022). Per-clip video object segmentation. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1352–1361.
Peng, J., Bu, X., Sun, M., Zhang, Z., Tan, T., and Yan, J.
(2020). Large-scale object detection in the wild from
imbalanced multi-labels. CoRR, abs/2005.08455.
Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., and
Malik, J. (2017). Multiscale combinatorial grouping
for image segmentation and object proposal genera-
tion. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 39(1):128–140.
Qin, H., Yu, C., Gao, C., and Sang, N. (2022). D2t: A
framework for transferring detection to tracking. Pat-
tern Recognition, 126:108544.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Tang, F. and Ling, Q. (2022). Ranking-based siamese visual
tracking. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
8741–8750.
Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., and
Smeulders, A. W. M. (2013). Selective search for ob-
ject recognition. International Journal of Computer
Vision, 104(2):154–171.
Wang, Y., Shen, X., Hu, S. X., Yuan, Y., Crowley, J. L., and
Vaufreydaz, D. (2022). Self-supervised transform-
ers for unsupervised object discovery using normal-
ized cut. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
14543–14553.
Wu, A., Han, Y., Zhu, L., and Yang, Y. (2021). Universal-
prototype enhancing for few-shot object detection. In
Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pages 9567–9576.
Xu, Q., Fang, F., Gauthier, N., Li, L., and Lim, J.-H. (2020).
Active image sampling on canonical views for novel
object detection. In 2020 IEEE International Confer-
ence on Image Processing (ICIP), pages 2241–2245.
IEEE.
Xu, X., Wang, J., Li, X., and Lu, Y. (2022). Reliable
propagation-correction modulation for video object
segmentation. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 36, pages 2946–
2954.
Yang, J., Liu, S., Li, Z., Li, X., and Sun, J. (2022). Real-time
object detection for streaming perception. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 5385–5395.
Zhu, C., Chen, F., Ahmed, U., Shen, Z., and Savvides,
M. (2021). Semantic relation reasoning for shot-
stable few-shot object detection. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 8782–8791.
Zitnick, L. and Dollar, P. (2014). Edge boxes: Locating
object proposals from edges. In European Conference
on Computer Vision.