Rotation Equivariance for Diamond Identification
Floris De Feyter¹ (https://orcid.org/0000-0003-2690-0181), Bram Claes² and Toon Goedemé¹ (https://orcid.org/0000-0002-7477-8961)
¹EAVISE—PSI—ESAT, KU Leuven, Sint-Katelijne-Waver, Belgium
²Antwerp Labs, Antwerp, Belgium
Keywords: Diamond Identification, Rotational Equivariance, Polar Warping.
Abstract: To guarantee integrity when trading diamonds, a certified company can grade the diamonds and give them a unique ID. While this is often done for high-valued diamonds, it is economically less attractive for lower-valued stones. Integrity could also be checked manually, but this involves a high labour cost. Instead, we present a computer vision-based technique for diamond identification. We propose to apply a polar transformation to the diamond image before passing the image to a CNN. This makes the network equivariant to rotations of the diamond. With this set-up, our best model achieves an mAP of 100% under a stringent evaluation regime. Moreover, we provide a custom implementation of the polar warp that is multiple orders of magnitude faster than the frequently used implementation of OpenCV.
1 INTRODUCTION
To securely trade high-valued diamonds, the stones
are graded by a certified company like GIA and are
individually packed with a unique barcode. In some
cases, the unique ID is even engraved in the girdle
of the diamond. For smaller and lower-valued di-
amonds, this grading is economically less interest-
ing. Therefore, such diamonds are often not provided
with a unique ID. To make the trade of these smaller
diamonds—which make up the vast majority of di-
amonds sold—more secure, while keeping it cost-
effective, we propose to employ current computer vi-
sion techniques.
More specifically, we propose to train a Convolu-
tional Neural Network (CNN) to transform the im-
age of a diamond into a descriptive embedding—a
fingerprint—that can be used to compute a similarity
score for pairs of diamond images. Such an approach
has shown promising results for diamond identifica-
tion before (De Feyter et al., 2019). Instead of simply
training a CNN on our data, however, we propose to
modify the input images in such a way that the CNN
becomes rotation equivariant.
Figure 1: Polar warping applied to some random samples from our dataset (left: original; right: polar warp).

All diamonds in our dataset are photographed from a top view, i.e., with their tables (the flat part on top of the diamond) parallel to the camera sensor (see Fig. 3). Of course, apart from any horizontal and vertical translation, a diamond is free
to have any rotation around the central axis perpen-
dicular to its table. A model that can match mul-
tiple images of the same diamond (each with a dif-
ferent orientation), therefore, must be insensitive to
these rotations. We propose to solve this by apply-
ing a polar warping operation to the diamond images
(see Fig. 1). Due to the nature of the polar warping,
a rotation of an object in an image results in a trans-
lation in the warped version of that image, as shown
in Fig. 4. As the convolution operation is equivariant
to translations (Esteves et al., 2018), rotated versions
of the same diamond should yield similar outputs and
it should be easier to train a CNN for the task of dia-
mond identification. In Section 6.2, we show that the
addition of polar warping leads to models that achieve
100% mean Average Precision (mAP).
While OpenCV (Bradski, 2000) contains an im-
plementation for such a polar transformation, it is not
suited for a deep learning pipeline. Therefore, in
Sec. 4.2, we develop our own implementation. Our
implementation supports GPU acceleration and can
perform the polar warp in batch. Moreover, the imple-
mentation allows for part of the transformation to be
precomputed. As we show in Sec. 6.3, this all leads to
a polar warp that—when executed on GPU—can run
750 times faster than OpenCV.
To summarize, the main contributions of this paper are:
• the application of a polar transformation to diamond identification, with models reaching 100% mAP for the identification of unseen diamonds;
• the implementation of a polar warp function that is 750 times faster than the implementation of OpenCV.
2 RELATED WORK
In this work, we employ rotation equivariance to improve CNN-based diamond identification. The current section discusses previous research on relevant topics.
2.1 Gemstone Classification and Identification

In (Chow and Reyes-Aldasoro, 2022), the authors experimented with multiple feature extraction techniques and machine learning algorithms, along with ResNet-18 and ResNet-50 models (He et al., 2016), to classify gemstone images of the Kaggle Gemstones Images dataset (Chemkaeva, 2020). This dataset contains more than 3200 images of 87 gemstone classes (diamond, but also emerald, ruby, amethyst...). In their set-up, the combination of a Random Forest algorithm with an RGB eight-bin colour histogram and local binary features performed best, with an accuracy of 69.4%. On this same dataset, (Freire et al., 2022) finetuned an Inception-v3 (Szegedy et al., 2015), achieving an accuracy of 72%. The models from these works, however, are only capable of classifying into one of the 87 predefined classes. We focus on a single type of gemstone, i.e., diamonds, and want the model to identify individual stones. This implies that the model should generalize to identifying diamonds it has not encountered during training.
In the literature, we only found (De Feyter
et al., 2019) to employ computer vision tech-
niques for gemstone—or, more specifically dia-
mond—identification. Here, the authors finetune a
Darknet-19 backbone (Redmon and Farhadi, 2016)
that was pretrained on ImageNet (Deng et al., 2009).
They report a top-1 accuracy of 99.7%. Their validation set, however, contains different images of the same diamonds that were used during training. We
believe that the only way to truly evaluate the gen-
eralization of an identification system is by validat-
ing on different identities than those that were used
during training. Similar to (De Feyter et al., 2019),
we finetune a CNN that was pretrained on ImageNet.
We use a different model, however, that outputs more
compact embeddings with 512 floating point numbers
instead of 1024. Additionally, we demonstrate that a
polar warping operation notably improves the CNN
baseline.
2.2 Rotation Equivariance for CNNs
An important aspect of our proposed solution to diamond identification is to make the model equivariant to rotations of the diamond. Making a CNN ro-
tation equivariant has been studied before. In (Hen-
riques and Vedaldi, 2017), the authors present a gen-
eral approach to make CNNs equivariant to a set of
two-parameter transformations, among which rota-
tion (and scale). The equivariance is attained by warp-
ing the input image, based on a flow grid that is de-
fined by the nature of the two-parameter transforma-
tion. The Polar Transform Network (PTN) introduced
by (Esteves et al., 2018) focuses on rotation equiv-
ariance only. Their network predicts a polar origin
and uses this to warp the input image around that ori-
gin. The warped image is passed to a conventional
CNN classifier. In our application, however, it is rela-
tively easy to find a good polar origin and as such, we
can avoid the extra complexity of an origin predictor
network. Instead, similar to (Kim et al., 2020), we directly apply a polar transform to the input image, without first passing it through an origin predictor.
The polar origin used by (Kim et al., 2020), however,
is defined as the image center. To limit the changes
in the appearance of a diamond in the warped view,
we choose to use the diamond centroid as polar origin
instead. In their implementation, (Kim et al., 2020)
make use of the OpenCV function warpPolar(). We
have implemented our own polar transformation func-
tion that can process image batches on GPU and as
such is more than 750 times faster than the OpenCV
implementation (see Sec. 6.3).
Figure 2: Example of how we arranged a batch of diamonds
during data collection.
3 DATASET
We developed a large dataset suitable for training
a CNN for diamond identification. Our dataset is
an order of magnitude larger than the one used by
(De Feyter et al., 2019). Apart from better training,
this allows for a more stringent and more accurate
evaluation. Due to the high value of diamonds, er-
roneously matching diamonds might lead to signifi-
cant losses. Therefore, an accurate evaluation of our
model is crucial in this application.
3.1 Data Collection
In order to collect a large dataset, we have built a set-
up that can automatically process diamonds in batch.
The set-up consists of a robot arm, a rotating glass disk, a light source and a camera mounted at a fixed height above the glass plate. The whole set-up is enclosed such that no light can enter from outside. A batch of diamonds is manually prepared in an array as shown in Fig. 2. A camera mounted on the robot arm scans this array and detects where the diamonds are positioned. The robot arm then grabs the diamonds one by one and lays them on the rotating disk. The disk stops rotating once a diamond passes under the fixed camera, after which the light source projects a colour pattern through the diamond and a picture is taken. Inspired by (De Feyter et al., 2019), we
have designed this colour pattern to produce specif-
ically colour-coded diamond images, as can be seen
in Fig. 3. Once a diamond has been photographed,
the robot arm picks it up from the glass plate and puts
it back in the array. When all diamonds in the array
are photographed, the whole procedure is repeated N
times to have N photographs per diamond.
Figure 3: Some random samples from our dataset.
3.2 Dataset Properties
Via the procedure described in Sec. 3.1, we were able to compose a dataset with 96 385 RGB images of 1021 diamonds. The images have a resolution of 2448×2048 pixels. We split this dataset into a training dataset with 77 060 images of 816 diamonds and a validation dataset with 19 325 images of 205 diamonds. Note that not only the set of training images and the set of validation images, but also the sets of diamond identities are disjoint. Fig. 3 shows some samples from our dataset. For our experiments (Sec. 6),
the validation dataset is split up further into a query
and a gallery set. The gallery set contains 10 images
of each diamond, the query set contains the rest of the
validation images.
4 POLAR WARPING
In our application, the orientation of the diamond in
an image is meaningless. Hence, our model should be
robust against diamond rotations. By applying a po-
lar warp to the input image, rotations of the diamond
can be transformed to translations, to which a CNN
is equivariant (Esteves et al., 2018; Kim et al., 2020).
Therefore, with polar warping, the CNN can be made equivariant to diamond rotations by design.
For their CyCNN, (Kim et al., 2020) make
use of the polar warp function implemented in
OpenCV (Bradski, 2000), i.e., warpPolar(). This
implementation, however, is not suited for a typical
deep learning set-up. It cannot perform the polar warp
on GPU, let alone apply the transformation to a batch
of images. So, to process a batch of images that is
loaded into GPU memory, we would need to move the batch to CPU, pass each image individually through warpPolar(), put the results into a batch and move that batch back to GPU memory. This clearly is a wasteful round trip. One way to avoid this would be to apply the polar warp to the entire dataset offline and load the warped images directly into GPU memory. This is undesirable, however, as it makes some data augmentations difficult to perform during training, e.g., random cropping, random rotation or randomly offsetting the polar origin. Also, this would require considerably more storage, as each image would have both a regular and a warped version.

Figure 4: Toy example of applying a polar warp to rotated versions of the same object (left: Cartesian; right: polar). The center of the emoji is chosen as polar origin and the polar radius is chosen equal to the radius of the emoji's head. As the emoji rotates, the polar warp shifts horizontally.
4.1 Formal Definition of Polar Warping
Let p be the position of a point in an image. The polar coordinates (φ, ρ) of this point with respect to some point at c (which we call the polar origin) are then defined as

    φ = ∠(p − c)
    ρ = ‖p − c‖,                                                (1)

with φ ∈ [0, 2π[ and ρ ∈ ℝ⁺. This definition can be reformulated to express the Cartesian coordinates of points in the input image as a function of their polar coordinates (φ, ρ). Let (x, y) and (c_x, c_y) be the Cartesian coordinates of p and c, respectively. Then, from Eqn. 1 it follows that

    cos φ = (x − c_x)/ρ   ⟺   x = c_x + ρ · cos φ
    sin φ = (y − c_y)/ρ   ⟺   y = c_y + ρ · sin φ.              (2)

It is customary to limit ρ ∈ [0, R], with polar radius R ∈ ℝ⁺, such that the mapped points all lie in a circle with center at c and radius R, and the output of the transformation has a rectangular shape.
4.2 Implementation
A standard way to implement geometric transformations in software is by defining a flow field grid G. Let I be the input image of shape (W, H, C) and I′ the warped output image of shape (W′, H′, C). Then, the flow field grid G has shape (W′, H′, 2), i.e., it has the same width and height as I′, but instead of color channels, G consists of an x and a y channel. The (x, y) value stored at a certain location (u, v) in G describes the coordinates of the point in I that should end up at location (u, v) in I′. Note that the coordinates stored in G are typically non-integer, and an interpolation of multiple pixel values is used to compute an output pixel value.
To compose G for a polar mapping, we first create two arrays of evenly spaced numbers. For φ, this array contains numbers in the interval [0, 2π[ and has length W′, i.e., the output width. For ρ, the numbers are in [0, R] and the array's length is equal to the output height H′. Now let ∆φ and ∆ρ be the step sizes used in the arrays for φ and ρ, respectively. Then, from Eqn. 2, we find that the value at location (u, v) in G, with u ∈ {0, 1, ..., W′ − 1} and v ∈ {0, 1, ..., H′ − 1}, should be equal to

    G_x(u, v) = c_x + v∆ρ · cos(u∆φ)
    G_y(u, v) = c_y + v∆ρ · sin(u∆φ),                           (3)
where G_x and G_y are the x and y channels of G, respectively. By passing G to a function like grid_sample() in PyTorch (Paszke et al., 2019), along with the input image I, we apply the polar mapping to I and obtain I′.
4.3 Improved Implementation with Fixed Polar Radius

From Eqn. 3, we can see that the flow field grid G used to perform the polar warp consists of two terms: one depends on the polar origin c and one depends on the arrays of values used for ρ and φ. Hence, when applying a polar warp multiple times with the same values for ρ, the second term can be precomputed (the values used for φ are assumed to be constant). We define a base grid Ĝ such that the value at location (u, v) is given by

    Ĝ_x(u, v) = v∆ρ · cos(u∆φ)
    Ĝ_y(u, v) = v∆ρ · sin(u∆φ).                                 (4)
Then, any G can easily be computed from Ĝ by simply adding the coordinates of the polar origin,

    G_x(u, v) = c_x + Ĝ_x(u, v)
    G_y(u, v) = c_y + Ĝ_y(u, v).                                (5)
Algorithm 1 shows pseudocode for the precomputation of Ĝ, while Algorithm 2 shows how the polar mapping is performed from a precomputed flow field grid.
Algorithm 1: Precomputing the Base Flow Field Grid.

procedure BASE GRID(width, height, R)
    ∆φ ← 2π/width                  ▷ Define step sizes
    ∆ρ ← R/height
    φ ← [0 : ∆φ : 2π[              ▷ Define φ, ρ as row vectors
    ρ ← [0 : ∆ρ : R]
    Ĝ_x ← ρᵀ cos φ                 ▷ Compute both channels of Ĝ
    Ĝ_y ← ρᵀ sin φ
    return Ĝ
Algorithm 2: Polar Mapping with Precomputed Flow Field Grid.

procedure POLAR MAP(I, Ĝ, c_x, c_y)
    G_x ← c_x + Ĝ_x                ▷ Compute both channels of G
    G_y ← c_y + Ĝ_y
    I_p ← grid_sample(I, G)
    return I_p
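For reference, a possible batched PyTorch realization of Algorithms 1 and 2 could look as follows (a sketch under our own naming; the per-image polar origins c_x, c_y are passed as length-N tensors so that one grid_sample() call warps the whole batch):

```python
import torch
import torch.nn.functional as F

def base_grid(out_w, out_h, R):
    # Algorithm 1 / Eqn. 4: the origin-independent part of the flow field grid.
    phi = torch.arange(out_w) * (2 * torch.pi / out_w)   # [0, 2pi[, length W'
    rho = torch.arange(out_h) * (R / out_h)              # [0, R], length H'
    return rho[:, None] * torch.cos(phi), rho[:, None] * torch.sin(phi)

def polar_map(images, gx_hat, gy_hat, cx, cy):
    # Algorithm 2 / Eqn. 5: shift the base grid by each image's polar origin.
    # images: (N, C, H, W); cx, cy: (N,) polar origins in pixels.
    n, _, h, w = images.shape
    gx = cx[:, None, None] + gx_hat                      # (N, H', W')
    gy = cy[:, None, None] + gy_hat
    # Normalize to [-1, 1] as required by grid_sample(), then sample the batch.
    grid = torch.stack((2 * gx / (w - 1) - 1, 2 * gy / (h - 1) - 1), dim=-1)
    return F.grid_sample(images, grid, align_corners=True)
```

Since the base grid depends only on R and the output size, it can be computed once, kept in GPU memory, and reused for every batch; each warp then costs only two additions and one sampling call.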
5 POLAR WARPING FOR DIAMOND IDENTIFICATION

As discussed in Section 4, applying a polar warping operation to an image requires two parameters: a polar origin c and a polar radius R. In Sec. 5.1, we motivate choosing the diamond centroid as the polar origin and explain how we can reliably find this centroid in each image. In Sec. 5.2, we discuss how we determine a (fixed) polar radius.
5.1 Diamond Centroid as Polar Origin
The closer a region is to the polar origin in the input image I, the more space it will occupy in the warped image I′. The centroid of the diamond is the point that minimizes the sum of squared distances to all points on the diamond, so the centroid seems like a good choice for the polar origin, as it maximizes the number of pixels in I′ that contain information about the diamond. Moreover, the centroid of the diamond can easily and reliably be retrieved in the images of our dataset, which is important to ensure that transformed versions of the same diamond only differ by a horizontal shift.

Figure 5: The detected center and estimated radius for 25 random samples from our dataset.
As can be seen in Fig. 3, the diamonds are clearly
visible against the black background. This sug-
gests that classic computer vision techniques should
suffice to detect the centroid. Indeed, a sim-
ple binary threshold is enough to segment the dia-
mond from the background. By applying OpenCV’s
findContours() (Bradski, 2000) on the binarized
image and selecting the contour with the largest area,
we can draw a boundary around the diamond. Next,
via OpenCV’s moments() function, we can retrieve
the coordinates of the diamond’s centroid. Fig. 5
shows some examples of the diamonds and centroids
that were detected in this way. We store each dia-
mond’s centroid and radius information in a separate
file that accompanies the respective image. In each
image, the detection algorithm found exactly one cen-
troid. A visual check of hundreds of random samples
confirmed that the algorithm was able to correctly de-
tect the diamond boundaries and centroids.
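A sketch of this detection step with OpenCV is shown below (the threshold value and the radius estimate via cv2.minEnclosingCircle() are our assumptions; the text above only specifies the binary threshold, findContours() and moments() steps):

```python
import cv2

def diamond_centroid_and_radius(image_bgr, thresh=30):
    # Segment the bright diamond from the black background.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # The largest contour in the binarized image is the diamond boundary.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    diamond = max(contours, key=cv2.contourArea)
    # Centroid from the raw moments of the contour.
    m = cv2.moments(diamond)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    # One plausible radius estimate: the minimum enclosing circle.
    _, radius = cv2.minEnclosingCircle(diamond)
    return (cx, cy), radius
```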
5.2 Fixed Polar Radius
A desirable by-product of our centroid detection
method is that the radii of the diamonds are also mea-
sured. Figure 6 shows the distribution of all radii. For
the polar radius R, one option would be to adapt the
radius to the size of the diamond so that each dia-
mond would approximately take up the same space
in the warped image. This would make the model less
sensitive to scale changes in diamond images. How-
ever, a different scale could be an easy way for our
model to know that two diamonds are different.

Figure 6: The distribution of the estimated radii of the diamonds in our dataset. The radii are expressed relative to the image height.

Our system always photographs the diamonds from the
same distance, so a diamond should have a consistent
scale across all images. Indeed, when we compute
the difference between the largest and smallest radius
of each diamond (see Fig. 7), we find that for 97.4%
of the diamonds, there is no more than 1 pixel differ-
ence. Furthermore, there are no diamonds for which
the difference is larger than 3 pixels.
From Fig. 6, we know that there are no dia-
monds larger than about 40% of the image height.
As we resize the images to a height of 256 pixels, this corresponds to about 100 pixels. Adding some headroom, we fix the polar radius R at 112 pixels.
Note that, with this fixed value for R, we open the
door to the implementation improvement described in
Sec. 4.3.
6 EXPERIMENTS AND RESULTS
This section presents our experimental set-up and re-
sults. In Sec. 6.1, we provide the implementation de-
tails of the models and describe how we train them for
diamond identification. We apply multiple ablations
to our baseline model, among which polar warping of
the input, and report the results in Sec. 6.2. Finally, in
Sec. 6.3, we compare our polar warp implementation
with OpenCV’s warpPolar().
Figure 7: The distribution of the range of the estimated radii
across the images of each individual diamond in our dataset.
6.1 Model Implementation Details
For all our experiments, we employ a ResNet-18
model (He et al., 2016) that was pretrained on Ima-
geNet (Deng et al., 2009). We allow all weights to
train (no frozen layers). We use Stochastic Gradient
Descent (SGD) to optimize the weights with a learn-
ing rate of 0.1, a momentum of 0.95 and no weight
decay. We apply a linear learning rate warm-up for
the first 400 iterations (Goyal et al., 2017) and halve the
learning rate after 6 and 9 epochs. We limit the num-
ber of epochs to 10, as none of our models showed
any further improvement after that. All models are
trained on a single NVIDIA Tesla V100 GPU.
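A sketch of this optimization recipe in PyTorch (the LambdaLR construction combining warm-up and halving is our own, and iters_per_epoch ≈ 77 060 / 60 is an assumption derived from the dataset and batch size):

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")           # ImageNet-pretrained, no frozen layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.95, weight_decay=0.0)

def lr_factor(step, warmup=400, iters_per_epoch=1285):
    if step < warmup:
        return (step + 1) / warmup                  # linear warm-up (Goyal et al., 2017)
    epoch = step // iters_per_epoch
    return 0.5 ** sum(epoch >= e for e in (6, 9))   # halve after epochs 6 and 9

# Call scheduler.step() once per training iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```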
The model outputs an embedding of length 512.
During model training, however, we append an ex-
tra trainable fully-connected layer that transforms this
embedding to a vector with the same length as the
number of training classes. From this, we can com-
pute the softmax cross-entropy loss. As such, the
model is de facto trained as a classifier and the em-
bedding is trained implicitly. Note that, during vali-
dation, this fully-connected layer is not used.
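In code, this training-time head could look like the following sketch (names are ours; 816 is the number of training identities from Sec. 3.2):

```python
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()               # the backbone now returns the 512-d embedding
classifier = nn.Linear(512, 816)          # one logit per training identity

def training_loss(images, labels):
    embeddings = backbone(images)         # (N, 512) diamond fingerprints
    logits = classifier(embeddings)       # used only to compute the training loss
    return nn.functional.cross_entropy(logits, labels)

# At validation time, the classifier is dropped and the embeddings are compared directly.
```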
We split up the dataset into a training, a valida-
tion gallery and a validation query set as described
in Sec. 3.2. We use a batch size of 60 images for
training and validation. The images are resized to a
height of 256 pixels. After resizing, a random square
region 224 pixels wide is cropped out of each im-
age. The validation data is center cropped to the same
size. Then, the image is normalized with either Im-
ageNet (Deng et al., 2009) statistics or the mean and
standard deviation of the pixel values in the training set. In some experiments, we also apply a random rotation to the training images; this is done before the random crop. Polar warping is performed after the data transformation pipeline. Recall that each image is accompanied by a file that contains the coordinates of the diamond centroid (see Sec. 5.1). These coordinates are transformed along with the image, so that the position of the centroid does not change relative to the diamond and the polar transformation can be correctly applied to the input batch.
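The image side of this pipeline could be sketched with torchvision as follows (the normalization statistics shown are placeholders rather than our dataset's actual values, and the centroid bookkeeping is omitted):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),              # smallest edge (the image height) becomes 256 px
    transforms.RandomRotation(180),      # becomes a horizontal shift after polar warping
    transforms.RandomCrop(224),          # square random crop, no random resizing
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.1, 0.1, 0.1],   # placeholder dataset statistics
                         std=[0.2, 0.2, 0.2]),
])
# The polar warp of Sec. 4 is applied to the assembled batch after this pipeline,
# with each image's transformed centroid coordinates as the polar origin.
```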
6.2 Ablation Study
We explore the effect of adding the polar warping de-
scribed in Sec. 4, along with other hyperparameters.
Table 1 summarizes this ablation experiment. We re-
port the mAP after one epoch of training and after ten
epochs of training, averaged over 5 runs (mean ± std.
dev.). To compute the mAP, we first pass both the
query images and the gallery images (see Sec. 3.2)
through the model, obtaining an embedding for each
image. Then, we compute a similarity matrix between
the query embeddings and the gallery embeddings
from their cosine similarities. For every query, we
sort the similarities from most to least similar to the
query. From these sorted similarity sequences, along
with the ground truth query and gallery labels, we can
compute an AP for each query. The mAP is then ob-
tained by averaging all APs.
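For clarity, a minimal sketch of this evaluation (function name and tensor layout are ours):

```python
import torch
import torch.nn.functional as F

def mean_average_precision(query_emb, gallery_emb, query_labels, gallery_labels):
    # Cosine-similarity matrix between all query and gallery embeddings.
    sims = F.normalize(query_emb, dim=1) @ F.normalize(gallery_emb, dim=1).T
    aps = []
    for i in range(sims.shape[0]):
        order = sims[i].argsort(descending=True)      # most to least similar
        hits = (gallery_labels[order] == query_labels[i]).float()
        ranks = torch.arange(1, hits.numel() + 1)
        precision_at_hit = hits.cumsum(0) / ranks     # precision at each rank
        aps.append((precision_at_hit * hits).sum() / hits.sum())
    return torch.stack(aps).mean()
```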
From Table 1, we can see that when the baseline is trained with random resized cropping (Baseline+RRC), the model performs significantly worse than when we apply random cropping without random resizing (Baseline). Note that random resized cropping means that the input images are first resized to a random size, after which a random region of a fixed size is cropped out. This data augmentation technique is typically used to make a model invariant to scale changes. However, due to our camera set-up, the same diamond will always have the same scale in the image, and the model can safely use the scale of a diamond as a descriptive feature. This is confirmed by the notable drop in mAP when we add random resized cropping to the baseline.
A slight increase in mAP after 1 epoch of train-
ing is found when we replace ImageNet normaliza-
tion with the mean and standard deviation of the dia-
mond dataset itself (Baseline+Norm). The images in
our dataset contain a lot more dark areas than typi-
cal ImageNet (Deng et al., 2009) images. Therefore,
the mean red, green and blue pixel values are much
smaller than in ImageNet. After 10 epochs, however, once the BatchNorm (Ioffe and Szegedy, 2015) layers in the model have adapted to the data distribution, the mAP difference with the baseline becomes negligible.
The largest increase with respect to the baseline
is seen when the input is transformed with the polar
mapping presented in Sec. 4 (Baseline+Norm+Polar
and Baseline+Norm+Rot+Polar), with an addi-
tional increase when we add random rotations.
Note that these random rotations result in random
horizontal shifts of the diamond in the warped
image. During training, there were two individ-
ual runs—one Baseline+Norm+Polar model, one
Baseline+Norm+Rot+Polar—that achieved 100%
mAP. These models are able to find, for each of
17 480 query images of unseen diamonds, the 10 out
of 2050 gallery images of the same diamond. None of the other methods had a run that performed this well at any point during training.
This result greatly surpasses the result of
(De Feyter et al., 2019), who needed a kNN with k = 5
to finally achieve 100% mAP on their tiny dataset of
64 diamond classes.
As shown in Table 2, it takes about a third longer to train for 10 epochs when the polar transformation is performed before passing the input to the model. However, from Table 1 we know that, after only a single epoch, models with polar transformation already perform on par with non-polar methods trained for 10 epochs. So, in terms of mAP per unit of training time, the polar methods clearly outperform the non-polar methods.
6.3 Polar Warp Comparison
We measure the duration of our polar warp im-
plementation (see Sec. 4.2) under different settings
and compare it to the warpPolar() function of
OpenCV (Bradski, 2000). We select 16 random im-
ages from our dataset (size 2448 × 2048) and apply
a polar transformation using the detected centroid of
the diamond (see Sec. 5.2) as polar origin and a fixed
radius of 1024. As can be seen from Table 3, our Py-
Torch implementation runs about 1.2 times faster than
OpenCV on CPU (Intel Xeon E5-2630 v2) and more
than 200 times faster on GPU (NVIDIA GeForce
GTX 1180). When precomputing a base flow field grid, as presented in Sec. 4.3, our method runs more than 750 times faster than OpenCV. The polar warps created by OpenCV and by our own PolarTorch are visually identical, as demonstrated in Fig. 8. Subtracting the pixel values of the outputs of both implementations does reveal small differences, but we consider these negligible.
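As an aside, GPU timings of this kind must account for asynchronous kernel launches; below is a sketch of the measurement procedure (the image count and input size match the benchmark; the 1024×1024 output size and the random stand-in grid are our assumptions, since the cost of grid_sample() is essentially independent of the grid's contents):

```python
import time
import torch
import torch.nn.functional as F

images = torch.rand(16, 3, 2048, 2448, device="cuda")         # dummy batch of 16 images
grid = torch.rand(16, 1024, 1024, 2, device="cuda") * 2 - 1   # stand-in flow field grid
torch.cuda.synchronize()                  # flush pending GPU work before timing
t0 = time.perf_counter()
warped = F.grid_sample(images, grid, align_corners=True)
torch.cuda.synchronize()                  # wait until the warp has actually finished
print(f"16 images warped in {(time.perf_counter() - t0) * 1e3:.1f} ms")
```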
Table 1: Results of the ablation study. The results are reported as mAP on the validation set after 1 and 10 epochs. Each configuration is trained 5 times; we report the mean and std. dev. RRC: Random Resized Crop; Norm: Normalized with statistics of our own dataset; Rot: Random Rotation augmentation; Polar: With polar warping applied.

Method                       mAP, ep 1 (%)        mAP, ep 10 (%)
Baseline                     99.31597 ± 0.18825   99.96971 ± 0.01056
Baseline+RRC                 97.59266 ± 0.76504   99.63744 ± 0.12250
Baseline+Norm                99.69073 ± 0.10654   99.96785 ± 0.02039
Baseline+Norm+Polar          99.93370 ± 0.02317   99.98011 ± 0.02720
Baseline+Norm+Rot+Polar      99.93151 ± 0.05273   99.98864 ± 0.00994
Figure 8: Applying OpenCV's and our implementation of a polar warp to 5 samples from our dataset (columns: cv.warpPolar(); our polar warp; difference). The right-hand column shows the difference between both outputs, with pixel values shifted and scaled to fall in the range [0, 255], i.e., gray means zero difference.
Table 2: Training duration (10 epochs) of the ablation configurations.

Method                       Time (10 eps.)
Baseline                     21'10" ± 21"
Baseline+RRC                 21'40" ± 32"
Baseline+Norm                21'34" ± 29"
Baseline+Norm+Polar          28'36" ± 33"
Baseline+Norm+Rot+Polar      28'27" ± 34"
7 CONCLUSION
We have shown that a CNN is well suited for diamond identification. When applying a polar transformation to the input image, with the diamond's centroid as polar origin and a fixed predefined radius, the CNN can be trained to perform better in fewer epochs. Our custom polar warp implementation significantly reduces the computation time compared to OpenCV's implementation, by a factor of up to 750. Our best models achieved an mAP of 100%, i.e., they were able to find, for each of 17 480 query images of unseen diamonds, the 10 out of 2050 gallery images of the same diamond.
Table 3: Duration (mean ± std. dev. over 10 runs) for performing a polar warp of 16 images of size 2448×2048 with different methods. The CPU methods are executed on an Intel Xeon E5-2630 v2 (2.60 GHz), the GPU methods on an NVIDIA GeForce GTX 1180. Ĝ indicates that the method uses a precomputed flow field (see Sec. 4.3). The PolarTorch implementation on GPU with a precomputed flow field is more than 750 times faster than OpenCV.

Method                 Time for 16 images
OpenCV (CPU)           825.3 ms ± 7.9 ms
PolarTorch (CPU)       668.4 ms ± 8.5 ms
PolarTorch (CPU, Ĝ)    642.1 ms ± 30.8 ms
PolarTorch (GPU)       3.7 ms ± 0.2 ms
PolarTorch (GPU, Ĝ)    1.1 ms ± 1.3 ms
REFERENCES
Bradski, G. (2000). The OpenCV library. Dr. Dobb’s Jour-
nal of Software Tools.
Chemkaeva, D. (2020). Gemstones Images. https://www.kaggle.com/datasets/lsind18/gemstones-images.
Chow, B. H. Y. and Reyes-Aldasoro, C. C. (2022). Au-
tomatic Gemstone Classification Using Computer Vi-
sion. Minerals, 12(1):60.
De Feyter, F., Hulens, D., Claes, B., and Goedemé, T.
(2019). Deep Diamond Re-ID. In 2019 18th IEEE
International Conference On Machine Learning And
Applications (ICMLA), pages 2020–2025.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Esteves, C., Allen-Blanchette, C., Zhou, X., and Dani-
ilidis, K. (2018). Polar Transformer Networks.
arXiv:1709.01889 [cs].
Freire, W. M., Amaral, A. M. M. M., and Costa, Y. M. G.
(2022). Gemstone classification using ConvNet with
transfer learning and fine-tuning. In 2022 29th Inter-
national Conference on Systems, Signals and Image
Processing (IWSSIP), volume CFP2255E-ART, pages
1–4.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 [cs].
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In The IEEE
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 770–778, Las Vegas, NV, USA.
IEEE.
Henriques, J. F. and Vedaldi, A. (2017). Warped Convolu-
tions: Efficient Invariance to Spatial Transformations.
In Proceedings of the 34th International Conference
on Machine Learning, pages 1461–1469. PMLR.
Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Ac-
celerating Deep Network Training by Reducing Inter-
nal Covariate Shift. arXiv:1502.03167 [cs].
Kim, J., Jung, W., Kim, H., and Lee, J. (2020). CyCNN:
A Rotation Invariant CNN using Polar Mapping and
Cylindrical Convolution Layers. arXiv:2007.10588
[cs, eess].
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32.
Redmon, J. and Farhadi, A. (2016). YOLO9000: Better,
Faster, Stronger. arXiv:1612.08242 [cs].
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [cs].