An Unsupervised IR Approach Based Density Clustering Algorithm

Achref Ouni

Laboratoire LIMOS, CNRS UMR 6158, Universite Clermont Auvergne, 63170 Aubiere, France

Keywords:

Image Retrieval, BoVW, Descriptors, Classiﬁcation.

Abstract:

Finding the most similar images to an input query in the database is an important task in computer vision.

Many approaches have been proposed from visual content have proven its effectiveness in retrieving the most

relevant images. Bag of visual words model (BoVW) is one of the most algorithm used for image classiﬁcation

and recognition. Even the discriminative power of BoVW, the problem of retrieving the relevant images from

the dataset is still a challenge.

In this paper, we propose an efﬁcient method inspired by the BoVW algorithm. Our key idea is to convert

the standard BoVW model into a BoVP (Bag of Visual Phrase) model based on a density-spatial clustering

algorithm. We show experimentally that the proposed model is able to perform better than classical methods.

We examine the performance of the proposed method on four different datasets.

1 INTRODUCTION

Content based image retrieval (CBIR) is the problem

of ﬁnding from a database the images that are similar

to a query image. This ﬁeld is a key step for many ap-

plications in computer vision as pose estimation, vir-

tual reality, medical analysis. CBIR is based on the

robustness of the signature image. The CBIR system

is composed by three key steps: (1) Features extrac-

tion (2) Signature construction (3) Retrieved images.

Two major state-of-the-art works proposed the best

performing signature images: Bag of Visual Words

(BoVW) and deep learning based convolutional neu-

ral network (CNN). In a CNN, the signature image

consists of N ﬂoating-point vectors extracted from

feature layers within the architectural network. In

BoVW, signature images are computed based on the

frequency of visual words in the image. After signa-

ture images are created for a query and dataset, the

selection of candidates considered similar to the in-

put query is determined by the spacing between sig-

natures using a speciﬁc metric.

We focus on this work to enhance BoVW because

it is not necessary to train the algorithm on a set of im-

ages. In addition, BoVW is a robust algorithm with

low complexity and signature creation is faster than

CNN approaches. So, our aim in this paper is to in-

crease the BoVW accuracy by transforming it to bag

of visual phrase (BoVP).A visual phrase is a set of re-

https://orcid.org/0000-0002-1197-253X

lated visual words that aim to more robustly encode

the visual features in image. The transformation can

be applied with different ways (Pedrosa and Traina,

2013) (Ren et al., 2013) (Chen et al., 2014). We pro-

posed in this paper a robust transformation based den-

sity clustering. In this case the image signature is de-

ﬁned by a matrix with size M ∗ M where M is the num-

ber of visual words. After building the images’ signa-

tures for both queries and datasets, the candidates will

be selected based on the euclidean distance. We con-

sider the images with low scores to be similar to an

input query. We test this approach on three different

datasets and two different visual descriptors (SURF,

KAZE). We show experimentally that the proposed

approach achieve a better results in terms of accuracy

compared to the state of the art methods.

The paper is structured as follows: We give a brief

overview of the work related to metaphor words in

Section 2. We detail our proposition in Section 3. We

show the experimental part on 4 datasets in Section 4.

Conclusion in Section 5.

We ﬁrst discuss the state-of-the-art in both of our

contribution ﬁelds: Bag of Visual Words model and

Density-Based Spatial Clustering approach.

1.1 Bag of Visual Words Model

Bag of Visual Words proposed by (Csurka et al.,

2004) is one of the most common model used to clas-

sify the images by content. This approach is com-

Ouni, A.

An Unsupervised IR Approach Based Density Clustering Algorithm.

DOI: 10.5220/0011892800003417

In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 4: VISAPP, pages

491-496

ISBN: 978-989-758-634-7; ISSN: 2184-4321

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

491

posed of three main steps: (i) Detection and Fea-

ture extraction (ii) Codebook construction (iii) Vec-

tor quantization. Detecting and extracting features

in an image can be performed using extractor algo-

rithms. Many descriptors have been proposed to en-

codes the images into a vector. Scale Invariant Feature

Transform (SIFT) (Lindeberg, 2012) and Speeded-

up Robust Features (SURF) (Bay et al., 2006) are

the most used descriptors in CBIR. Interesting work

from Arandjelovi

c and Zisserman (Arandjelovic and

Zisserman, 2013) introduces an improvement by up-

grading SIFT to RootSift. On the other side, binary

descriptors have proven useful (Rublee et al., 2011)

proposes ORB (Oriented FAST and Rotated BRIEF)

to speed up the search. An other work (Leuteneg-

ger et al., 2011) combines two aspects: precision

and speed thanks to BRISK (Binary Robust Invari-

ant Scalable Keypoints) descriptor. (Iakovidou et al.,

2019) present a discriminative descriptor for image

dependent on both contour and color information.

In (Chatzichristoﬁs and Boutalis, 2008), the authors

present descriptors both color and edge information.

Due to the limit of bag of visual words model many

improvement have been proposed for more precision.

Bag of visual phrases (BoVP) (Pedrosa and Traina,

2013) is a high-level description using a more than

word for representing an image. formed phrases us-

ing a sequence of n-consecutive words regrouped by

L2 metric. (Ren et al., 2013) Build an initial graph

then split the graph into a ﬁxed number of sub-graphs

using the N-Cut algorithm. Each histogram of visual

words in a sub graph forms a visual phrases. (Ouni

et al., 2021) present a method based on grid cluster-

ing algorithm for linking the visual words. (Chen

et al., 2014) Groups visual words in pairs using the

neighbourhood of each point of interest. The pair

words are chosen as visual phrases. Perronnin and

Dance (Perronnin and Dance, 2007) applies Fisher

Kernels to visual words represented by means of a

Gaussian Mixture Model(GMM). Similar approach,

introduced a simpliﬁcation for Fisher kernel. Simi-

lar to bag of visual words, vector of locally aggre-

gated descriptors (VLAD) (J

egou et al., 2010) assign

each feature or keypoint to its nearest visual word

and accumulate this difference for each visual word.

(Mehmood et al., 2016) proposes a framework be-

tween local and global histograms of visual words.

in other side, CNN based approach are widely used

in CBIR context (Krizhevsky et al., 2012) (Szegedy

et al., 2017) (Simonyan and Zisserman, 2014). In

many instances, CNN supersedes local detectors and

descriptors. The main idea is to train the network on

set of images. The next step use the pretrained CNN

for extracting the signatures image.

1.2 Density-Based Spatial Clustering

The notion of the ” density of a cell”, deﬁned relative

to the number of objects contained in the cell. Density

clustering algorithms are based on a similar notion,

complemented by other fundamental concepts such

as the neighborhood of an object, kernel object, ac-

cessibility, or connection between objects. The com-

plexity of this algorithm is O(nlogn) which makes it

a rather inexpensive method. In addition, the clusters

obtained can be of various forms. However the algo-

rithm has a major disadvantage: the choice of param-

eters ε and M. Even if the authors of the algorithm

propose a heuristic to automatically determine these

parameters, this choice remains difﬁcult in practice.

The data are not generally distributed identically and

these parameters should be able to vary according to

the regions of the space.

DBSCAN (Schubert et al., 2017) is an algorithm

of O(nlogn) complexity which makes it a fairly in-

expensive method. It is a density-based algorithm in

the measure that relies on the estimated density of

the clusters to perform partitioning. DBCLASD (Xu

et al., 1998)(Distribution-Based Clustering of Large

Spatial Databases) algorithm proposes a distributional

approach to manage this problem of variation in local

densities. Same authors propose OPTICS (Ankerst

et al., 2008) algorithm (Ordering Points To Identify

the Clustering Structure). OPTICS deﬁnes an order

on the objects which can then be used by DBSCAN,

in the cluster expansion phase.

2 BAG OF VISUAL WORDS

BASED DENSITY

CLUSTERING: DBoVW

We aim in this section to improve the image repre-

sentation. As described in previous section, bag of

visual words is the model that can describe the visual

features in image in a robust way. However, even the

new representation, the given outcomes do not fulﬁll

the ideal need. Therefore we attempt to improve the

BoVW model by changing to a bag of visual phrases

based on density clustering approaches.

The goal of using a clustering strategy is to rep-

resent an image as a set of clusters, each cluster con-

taining at least two keypoints and representing one or

more visual sets. For this case, each cluster takes on a

visual representation when its size is >2. Then, from

the acquired groups, we consolidated the visual words

inside each cluster to fabricate a discriminative signa-

ture denoted ”bag of visual phrase”. Also, we test

and study the distinction between the density cluster-

VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications

492

Figure 1: Bag of visual words model.

Figure 2: Density-based spatial clustering (Schubert et al.,

2017) (DBSCAN).

ing techniques (DBSCAN, Optics) at the degree of

execution, productivity and which one has better im-

provement the image representation.

In the ﬁrst part of our framework, we start by rec-

ognizing and extracting of the keypoints from the in-

put query. After using the ofﬂine-created visual words

or vocabulary, we use the L2 metric to match each

keypoint detected in the image with the correspond-

ing visual vocabulary.

Figure 3: Clustering Key-points using Density cluster-

ing(DBSCAN).

Figure 3 presents the clusters obtained by density

clustering approach. A cluster is the maximum set of

connected points(Key-points in our case). The algo-

rithm has a major disadvantage: the choice of parame-

ters ε and Min ε. In our case we propose a heuristic to

automatically determine these parameters. We ﬁxed

Minε= 1 and we begin from ε=0.001 and in each it-

eration we increase the value of ε until the number of

cluster will be equal or close to the half size of key-

points. As shown in ﬁgure 3 we obtained 9 clusters

using 20 key-point. We succeeded in ﬁnding the ideal

number of clusters in relation to the number of keys-

points. The number of key-points in each cluster is

different to others. We consider each cluster as a set

of visual phrases.

The image signature size is M ∗ M where M is the

number of visual words. We initialize the matrix to

zero. Next, we ﬁll the matrix with the indices of the

visual phrases. For example, if the pair visual words

are VW

and VW

then we increment the element at

index (5,30). Finally, the similarity is computed by

the euclidean distance between the matrix query and

the matrix from dataset.

3 RESULTS

In this section, we present the potential of our ap-

proach on large datasets. Our goal is to increase

the CBIR accuracy and reduce the execution time.

To evaluate our proposition we test on the following

datasets:

• Corel 1K or Wang (Wang et al., 2001) is a dataset

of 1000 images divided into 10 categories and each

category contains 100 images. The evaluation is com-

puted by the average precision of the ﬁrst 100 nearest

neighbors among 1000.

• The University of Kentucky Benchmark proposed

by Nister, abbreviated as UKB for ease of read-

ing. UKB contains 10200 images divided into 2550

groups, each group contains 4 images of the same

object with different conditions (rotation, out-of-

focus...). The score is the average accuracy across all

images of the 4 nearest neighbors.

• INRIA Holidays, referred to as Holidays, is a col-

lection of 1491 candidates, of which 500 are query

pictures and the remaining 991 are corresponding re-

lated candidates. Public holidays are evaluated based

on mean precision (mAP) [29].

• MSRC v1 (Microsoft Research in Cambridge) sug-

An Unsupervised IR Approach Based Density Clustering Algorithm

493

Figure 4: The 10 categories of Corel-1K dataset.

Figure 5: Example of images from UKB dataset.

Figure 6: Example of images from holidays dataset.

gested by Microsoft Research Team. MSRC v1 con-

tains 241 images divided into 9 categories. Scoring

for MSRC v1 is based on MAP score (mean preci-

sion)

In this section we present the experiments of our

approach based density clustering approach. To test

the proﬁciency of our proposed strategies, we led the

experimentation on four datasets(MSRC V1, UKB,

Figure 7: Example of images from MSRC v1 dataset.

Holidays, Wang) and two descriptors(SURF, KAZE).

In tables 1, 2, we separated the outcomes into two

principle parts. Above, we build the signature utiliz-

ing DBSCAN approach. Down, the signature con-

structed using Optics algorithm. The outcomes got

utilizing DBSCAN approach are superior to the out-

comes got utilizing Optics. We perform the results by

concatenating the GBOP

sur f

and GBOP

kaze

Table 1: Bag of visual phrase based DBSCAN algorithm.

Datasets GBoVP

sur f

GBoVP

kaze

GBoVP

sur f

kaze

MSRC v1 0.53 0.57 0.68

UKB 3.09 3.11 3.55

Holidays 0.55 0.56 0.68

Wang 0.55 0.52 0.63

Table 2: Bag of visual phrase based OPTICS algorithm.

Datasets GBoVP

sur f

GBoVP

kaze

GBoVP

sur f

kaze

MSRC v1 0.57 0.6 0.69

UKB 3.1 3.18 3.61

Holidays 0.56 0.56 0.65

Wang 0.59 0.55 0.65

Finally, we compare our approach against two dif-

ferent state-of-the-art methods in Table3. The ﬁrst is

methods based on keypoints for building the image

signatures and the second is methods based on deep

leaning approaches. As show the aftereffects of our

proposed present great presentation for all datasets

compared to methods based on keypoints.

VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications

494

Table 3: Comparison of the accuracy of our approach with methods from the state of the art.

Methods MSRC v1 Wang Holidays UKB

BoVW (Csurka et al., 2004) 0.48 0.41 0.50 2.95

SaCoCo(Iakovidou et al., 2019) - 0.51 0.76 3.33

CEDD (Chatzichristoﬁs and Boutalis, 2008) - 0.54 0.72 3.24

VLAD (J

egou et al., 2010) - - 0.53 3.17

N-Gram (Pedrosa and Traina, 2013) - 0.34 - -

Grid (Ouni et al., 2021) 0.67 0.62 0.64 3.57

Fisher (Perronnin and Dance, 2007) - - 0.69 3.07

Ours 0.69 0.65 0.68 3.61

4 CONCLUSION

In this paper, we present an effective bag of visual

phrase model dependent on grouping approach. We

show that the utilization of density clustering ap-

proach joined with BoVW model increment the CBIR

precision. Utilizing two descriptors (KAZE, SURF),

our methodology accomplish a superior outcomes

compared to the state of the art methods.

REFERENCES

Ankerst, M., Breunig, M., Kriegel, H., Ng, R., and Sander,

J. (2008). Ordering points to identify the clustering

structure. In Proc. ACM SIGMOD, volume 99.

Arandjelovic, R. and Zisserman, A. (2013). All about vlad.

In Proceedings of the IEEE conference on Computer

Vision and Pattern Recognition, pages 1578–1585.

Bay, H., Tuytelaars, T., and Van Gool, L. (2006). Surf:

Speeded up robust features. In European conference

on computer vision, pages 404–417. Springer.

Chatzichristoﬁs, S. A. and Boutalis, Y. S. (2008). Cedd:

Color and edge directivity descriptor: A compact de-

scriptor for image indexing and retrieval. In Interna-

tional conference on computer vision systems, pages

312–322. Springer.

Chen, T., Yap, K.-H., and Zhang, D. (2014). Discriminative

soft bag-of-visual phrase for mobile landmark recog-

nition. IEEE Transactions on Multimedia, 16(3):612–

622.

Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray,

C. (2004). Visual categorization with bags of key-

points. In Workshop on statistical learning in com-

puter vision, ECCV, volume 1, pages 1–2. Prague.

Iakovidou, C., Anagnostopoulos, N., Lux, M.,

Christodoulou, K., Boutalis, Y., and Chatzichristoﬁs,

S. A. (2019). Composite description based on salient

contours and color information for cbir tasks. IEEE

Transactions on Image Processing, 28(6):3115–3129.

egou, H., Douze, M., Schmid, C., and P

erez, P. (2010).

Aggregating local descriptors into a compact image

representation. In 2010 IEEE computer society con-

ference on computer vision and pattern recognition,

pages 3304–3311. IEEE.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105.

Leutenegger, S., Chli, M., and Siegwart, R. Y. (2011).

Brisk: Binary robust invariant scalable keypoints. In

Computer Vision (ICCV), 2011 IEEE International

Conference on, pages 2548–2555. IEEE.

Lindeberg, T. (2012). Scale invariant feature transform.

Mehmood, Z., Anwar, S. M., Ali, N., Habib, H. A., and

Rashid, M. (2016). A novel image retrieval based on

a combination of local and global histograms of visual

words. Mathematical Problems in Engineering, 2016.

Ouni, A., Royer, E., Chevaldonn

e, M., and Dhome, M.

(2021). Robust visual vocabulary based on grid clus-

tering. In Intelligent Decision Technologies, pages

221–230. Springer.

Pedrosa, G. V. and Traina, A. J. (2013). From bag-of-

visual-words to bag-of-visual-phrases using n-grams.

In 2013 XXVI Conference on Graphics, Patterns and

Images, pages 304–311. IEEE.

Perronnin, F. and Dance, C. (2007). Fisher kernels on visual

vocabularies for image categorization. In 2007 IEEE

conference on computer vision and pattern recogni-

tion, pages 1–8. IEEE.

Ren, Y., Bugeau, A., and Benois-Pineau, J. (2013). Visual

object retrieval by graph features.

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.

(2011). Orb: An efﬁcient alternative to sift or surf.

In Computer Vision (ICCV), 2011 IEEE international

conference on, pages 2564–2571. IEEE.

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu,

X. (2017). Dbscan revisited, revisited: why and how

you should (still) use dbscan. ACM Transactions on

Database Systems (TODS), 42(3):1–21.

Simonyan, K. and Zisserman, A. (2014). Very deep con-

volutional networks for large-scale image recognition.

arXiv preprint arXiv:1409.1556.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A.

(2017). Inception-v4, inception-resnet and the impact

of residual connections on learning. In Thirty-ﬁrst

AAAI conference on artiﬁcial intelligence.

Wang, J. Z., Li, J., and Wiederhold, G. (2001). Simplicity:

An Unsupervised IR Approach Based Density Clustering Algorithm

495

Semantics-sensitive integrated matching for picture li-

braries. IEEE Transactions on pattern analysis and

machine intelligence, 23(9):947–963.

Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. (1998). A

distribution-based clustering algorithm for mining in

large spatial databases. In Proceedings 14th Interna-

tional Conference on Data Engineering, pages 324–

331. IEEE.

VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications

496