Deep Distance Metric Learning for Similarity Preserving Embedding of
Point Clouds
Ahmed Abouelazm, Igor Vozniak, Nils Lipp, Pavel Astreika and Christian Mueller
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Saarbruecken, Germany
{first name.last name}@dfki.de
Keywords:
Point Clouds, 3D Deep Learning, Distance Metric Learning, Similarity Preserving Embedding.
Abstract:
Point cloud processing and 3D model retrieval methods have received a lot of interest as a result of recent
advances in deep learning, computing hardware, and the wide range of available 3D sensors. Many state-
of-the-art approaches utilize distance metric learning to solve the 3D model retrieval problem. However,
the majority of these approaches disregard the variation in shape and properties of instances belonging to the
same class, known as intra-class variance, and focus on semantic labels as a measure of relevance. In this work,
we present two novel loss functions for similarity-preserving point cloud embedding, in which the distance
between point clouds in the embedding space is directly proportional to their ground truth distance under
a similarity or distance measure. The building block of both loss functions is the forward pass of n-pair
input point clouds through a Siamese network. We utilize the ModelNet 10 dataset in the course of numerical
evaluations under classification and mean average precision evaluation metrics. The reported quantitative and
qualitative results demonstrate a significant enhancement in the quality of the retrieved models.
1 INTRODUCTION
Point Cloud Data (PCD) is a data representation ob-
tained by LiDAR sensors, and is one of the most
prominent 3D data structures. PCD depicts a scanned
object as a set of discrete points scattered in a Eu-
clidean space. Compared to 2D images, PCDs are
substantially more resistant to changes in lighting
conditions and accurately capture object depth. Due
to these inherent advantages, PCDs have been used
in a variety of application domains, such as robotics
and autonomous driving, and have been employed in
a range of tasks, including
classification, scene segmentation, model retrieval,
and reconstruction (Qi et al., 2017a; Qi et al., 2017b;
Wang et al., 2019; Uy and Lee, 2018; Mescheder
et al., 2019; Peng et al., 2020; Park et al., 2019).
In the scope of this work, we are particularly in-
terested in classification and reconstruction applica-
tions of PCDs. The majority of these pipelines re-
quire efficient and meaningful encoding of the PCDs
(Qi et al., 2017a; Mescheder et al., 2019). Encoding
a PCD into a feature space is thus a bottleneck of such
approaches, motivating us to explore the resulting
embeddings and strategies to improve their
quality. One of the most dominant methods for judg-
ing the quality of an embedding space is investigating
(a) Query Model. (b) Retrieved Models.
Figure 1: Problem description: the row-wise retrieval of the query model on the left (a) is presented on the right (b). The upper-row results in (b) are from the same class but do not exhibit any visual similarity. In contrast, the bottom row in (b) accounts for intra-class similarity and is a desirable retrieval result. Note: in this work, point clouds sampled from ModelNet 10 mesh models are used, whereas mesh models are depicted for explanatory purposes only.
the model retrieval problem over such a space (Had-
sell et al., 2006). Model retrieval is a search prob-
lem in which a query model is given, and its most
relevant counterparts are retrieved from a large-scale
dataset. In deep learning, features are extracted from
N training objects using an encoder architecture (Had-
sell et al., 2006; Melekhov et al., 2016; Wang et al.,
2014; Hoffer and Ailon, 2015). A data structure, e.g.,
kdtree, octree is used to fit these features into memory
Figure 2: Sample of intra-class variance of the class chair, based on the ModelNet 10 dataset.
When given a query
model, the same network is used to extract a feature
vector. A distance metric is then utilized to extract
the most relevant objects from the feature database.
However, relevance is commonly satisfied when the
query object and the matched object are of the same
semantic class (Qi et al., 2017a). Based on figure (1),
we can conclude that this relevance metric is flawed
because it ignores intra-class variation, resulting in an
embedding space that disregards similarity between
instances of the same class. Intra-class variance is
the variation in structure, appearance, and properties
between instances of the same class; an example of
such variance for chairs is illustrated in figure (2).
This shortcoming encouraged us to investi-
gate the well-established distance metric learning ap-
proaches toward improving embedding space quality
by accounting for intra-class variance. The ultimate
target is to achieve a fine-grained model retrieval over
a PCD dataset that takes structural similarity between
instances in the same class into account.
The contribution of this work is threefold. First,
the loosely defined retrieval relevance metric, which
merely takes the labels of the instances into account,
is replaced by a tighter metric that only accepts
a retrieval if it belongs to the nearest neighbors of
the query under a ground truth similarity measure.
Chamfer distance is used as a similarity measure be-
tween PCDs in this work. Second, deep distance
metric learning is utilized to perform the learning
directly on the feature space. Finally, we pro-
pose two novel loss functions, based on contrastive
and triplet losses from the state-of-the-art literature
(Hadsell et al., 2006; Wang et al., 2014; Hoffer and
Ailon, 2015), that take into consideration both inter-class
and, more essentially, intra-class variance, and we clarify
their training process within the distance metric learning
paradigm to accomplish our fine-grained model
retrieval goal. To the best of our knowledge, there is
no instance of a loss function used for distance metric
learning that explicitly accounts for intra-class vari-
ance prior to this work.
2 RELATED WORK
Model retrieval problem as a search problem is com-
monly solved by applying distance metric learning
on a feature embedding space according to (Hadsell
et al., 2006). Distance Metric Learning is a ma-
chine learning task that involves automatically inferring
a discriminative similarity or distance measure over
data instances, as defined in (Xing et al.,
2002). A discriminative measure assigns a small distance
to similar objects and pushes dissimilar objects
away from each other. This learning technique has
been utilized in a wide range of applications, includ-
ing nearest-neighbor models, clustering, dimension-
ality reduction, and model retrieval as will be clari-
fied in this section. The first work on inferring such
measures is presented in (Xing et al., 2002), where
Mahalanobis distance and linear transformation are
used as a distance metric as in equation (1). This ap-
proach was further extended in (Chatpatanasiri et al.,
2010) by employing kernel learning to learn non-
linear transformations over input data.
$$D(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)} = \sqrt{(L x_i - L x_j)^T (L x_i - L x_j)} \quad (1)$$
where $x_i, x_j$ are two input data instances, and $M$ is a positive definite matrix which can be decomposed as $M = L^T L$, with $L$ representing the linear transformation over the input data.
According to (Hadsell et al., 2006), classical dis-
tance metric learning has two major drawbacks: it re-
quires meaningful and computable distance measures,
and it does not provide an explicit function to map
new input data whose relationship to the training data
is unknown. Due to the aforementioned limitations,
several attempts have been made to address distance
metric learning within a deep learning framework.
To accomplish distance metric learning, a network is
trained to learn non-linear transformations over input
data guided by a custom loss function. The majority
of deep distance metric learning algorithms feed N
training instances to a neural network and compare
them using the aforementioned loss function.
(Hadsell et al., 2006) is a pioneer work in deep
distance metric learning. This paper introduced con-
trastive loss, where the loss function has two training
instances as an input. The loss function attempts to
cluster instances from the same class in the feature
space and separates them from instances belonging to
different classes, as shown in equation (2). The input
pairs are sampled at random and are used to learn an
invariant mapping, achieving dimensionality reduction.
$$\text{Contrastive Loss}(x_i, x_j) = \frac{1}{2}\, y \times d_f(x_i, x_j)^2 + \frac{1}{2}\,(1 - y) \times \max(0,\, m - d_f(x_i, x_j))^2 \quad (2)$$
where $x_i, x_j$ are the input training instances, $y$ indicates whether $x_i, x_j$ share the same label, and $m$ is the desired margin separating different classes.
In order to improve image-matching results,
(Melekhov et al., 2016) employed contrastive loss.
Finding matching images across large datasets is a
key difficulty in many applications, including multi-
view 3D reconstruction and image retrieval. Based on
the feature vectors in Euclidean space embedded by a
neural network trained via contrastive loss, this work
discovers matching and non-matching pairs of im-
ages. In addition, a heuristic approach for determin-
ing the margin between classes is presented. SigNet
(Dey et al., 2017) uses convolutional neural networks
and contrastive loss to perform offline signature ver-
ification. A minor change to contrastive loss is pro-
posed, which assigns different weights to the pull
force between similar instances and the push force
between dissimilar ones. SigNet outperformed
state-of-the-art approaches on most of the
benchmark signature datasets.
Triplet loss is introduced in (Wang et al., 2014;
Hoffer and Ailon, 2015). Here, the input is extended
from a pair to a triplet. Positive-negative triplets
are the focus of this loss function. A positive-negative
triplet is one in which the first (anchor) and second
(positive) members belong to the same class, but the
third (negative) member belongs to a different class.
Using this loss, the anchor and positive members are
simultaneously drawn near each other while being
pushed away from the third member, as formulated in
equation (3). The first application in which this loss
is utilized is image ranking in (Wang et al., 2014).
$$\text{Triplet Loss}(x_a, x_p, x_n) = \frac{1}{2}\max(0,\, d_f(x_a, x_p) + m - d_f(x_a, x_n))^2 \quad (3)$$
where $x_a, x_p, x_n$ are the anchor, positive, and negative training instances, respectively, and $m$ stands for the desired margin separating different classes.
FaceNet (Schroff et al., 2015) achieves state-of-
the-art facial recognition performance by combining
triplet loss with an inception architecture (Szegedy
et al., 2015). Face recognition is investigated by em-
bedding face images in a Euclidean feature space,
where face similarity may be measured directly. Using
the $L_2$ norm, the feature space is confined within
a d-dimensional hyper-sphere. In addition, rather than
mining triplets offline using network checkpoints, this
work provides a unique technique for mining triplets
online. A large mini-batch of thousands of instances
is generated, and the hardest positive-negative triplets
are sampled. The hardest triplets are those in which
the distance between the anchor and the positive in-
stance is similar to the distance between the anchor
and the negative instance. (Nazir et al., 2021) is concerned
with utilizing triplet loss for reconstruction
and completion challenges in the PCD domain. In this
work, deep distance metric learning is used to learn
global features with adequate separation between dif-
ferent classes. To accomplish discriminative recon-
struction of PCDs, the final loss function is a combi-
nation of reconstruction loss calculated by Chamfer
distance and triplet loss.
In the literature, many variants based on con-
trastive and triplet losses have been suggested. To ac-
complish gait recognition, (Xu, 2021) proposes Deep
Large Margin Nearest Neighbor (DLMNN) loss. Gait
recognition is the challenge of identifying a person at a
distance from their walking manner. The DLMNN loss
is a linear combination of the triplet and contrastive
losses, controlled by a weighting factor $\gamma$, as clarified in
equation (4). The goal of such a loss function is to guar-
antee that similar examples are drawn as close to each
other as possible, while also learning to distinguish
between examples from different classes.
$$\text{DLMNN Loss}(x_a, x_p, x_n) = \text{Triplet Loss}(x_a, x_p, x_n) + \gamma \times \text{Contrastive Loss}(x_a, x_p) \quad (4)$$
PointNetVLAD (Uy and Lee, 2018) provides
a novel pipeline for PCD-based retrieval for place
recognition that utilizes PointNet and normalizes its
outputs using NetVLAD (Arandjelovic et al., 2016).
This work proposes two variants of the triplet loss:
lazy triplet and quadruplet losses. Both losses sam-
ple N negative examples rather than sampling just
one. These two loss functions have more stable train-
ing and faster convergence. PointNetVLAD is able
to achieve state-of-the-art performance on its refer-
ence task. In (Wang et al., 2017), the angular loss
is introduced as an alternative formulation of a triplet.
This loss assumes a triangle linking the input triplet
$(x_a, x_p, x_n)$ and constrains the angle at the triplet's
negative member. Angular loss has two inherent ad-
vantages over triplet loss: scale invariance and robust-
ness against feature variance. The results indicate that
angular loss has faster convergence and outperforms
triplet loss on image clustering and retrieval tasks.
Using a modified contrastive loss function, graph
proximity loss as in (Bai et al., 2019) learns a graph-
level representation. The purpose of this work is to
encapsulate an entire graph into a vector space that
takes graph-graph proximity into account. The mod-
ified contrastive loss pulls the distance between two
training graphs to be identical to their ground truth
distances, independent of which class each graph be-
longs to. To the best of our knowledge, this is the only
loss function that explicitly considers intra-class vari-
ation. However, it does not provide any separation
between different classes. This work achieves com-
petitive results on similarity ranking, and visualiza-
tion tasks by learning graph representation in unsu-
pervised and inductive manners.
3 METHODOLOGY
3.1 Preliminaries
We establish a consistent mathematical notation that
will be utilized throughout the rest of this paper. The
introduced notation is as follows:
Given a labeled dataset with M tuples, where each tuple is a PCD and its label, the dataset is denoted as $D = \{(x_i, y_i)\}_{i=1}^{M}$.
A PCD containing N points, where each point is represented just by its coordinates with no additional attributes taken into account, is denoted as $x_i = \{p_j\}_{j=1}^{N} \in \mathbb{R}^{N \times 3}$. Each PCD is sampled with the same number of points. The same preprocessing as introduced in (Qi et al., 2017a) is applied consistently over all PCDs.
The discrete label assigned to each PCD $x_i$ is indicated by $y_i$ and belongs to the set of pre-defined labels $\{0, 1, \ldots, C\}$.
A neural network $\phi$ parameterized by weights $\theta$ maps the input PCD into a Euclidean feature space such that $\phi: x_i \mapsto f_i$. The PCD embedding in the feature space is formulated as $f_i = \phi(x_i; \theta) \in \mathbb{R}^d$.
$\hat{d}_{i,j}$ is the $L_2$ distance between two PCDs $(x_i, x_j)$ in the feature space, formulated in equation (5):

$$\hat{d}_{i,j} = \|f_i - f_j\|_2 = \|\phi(x_i; \theta) - \phi(x_j; \theta)\|_2 \quad (5)$$
Finally, $d_{i,j}$ is the ground truth distance between two PCDs $(x_i, x_j)$, based on a similarity or distance measure.
3.2 Modified Relevance Metric
In this work, we start with a modification of the rel-
evance metric used in evaluating correct retrievals in
model retrieval problems. We propose a tighter met-
ric that only accepts retrievals to be correct if they
belong to the nearest neighbors of the query object on
a ground truth distance metric. The Chamfer distance
is used as a ground truth distance metric on PCDs in
this work. Chamfer distance is an asymmetric dis-
tance measure that has been widely utilized in the lit-
erature for comparing PCDs as in (Mescheder et al.,
2019; Peng et al., 2020; Park et al., 2019). Equa-
tion (6) introduces the Chamfer distance between two
PCDs $(x_i, x_j)$. The first term in the equation iterates
over each point $p \in x_i$ and calculates the average
distance to its nearest point $q \in x_j$. The second term
achieves the same, but with the roles of $x_i$ and $x_j$
swapped.
$$CD(x_i, x_j) = \frac{1}{|x_i|}\sum_{p \in x_i}\min_{q \in x_j}\|p - q\|_2^2 + \frac{1}{|x_j|}\sum_{q \in x_j}\min_{p \in x_i}\|p - q\|_2^2 \quad (6)$$
The suggested relevance metric necessitates the
development of a loss function that accounts for intra-class
variation, i.e., the diversity in structure and attributes of
objects in the same class. To the best of our knowl-
edge, graph proximity loss (Bai et al., 2019) is the pi-
oneer loss function for including intra-class variation.
The graph proximity loss takes an input pair $(x_i, x_j)$
and minimizes the difference between their distance
in the feature space $\hat{d}_{i,j}$ and their ground truth
distance $d_{i,j}$, as shown in equation (7). This loss func-
tion is used for learning graph embeddings. Despite
having desirable properties related to the fine-grained
model retrieval that we are interested in, graph prox-
imity loss is not directly applicable to our work be-
cause the loss setting assumes a single global class
throughout the entire dataset and does not provide ad-
equate separation between different classes, which is
appropriate for graph similarity ranking on a macro
level but not for our setting.
$$\text{Graph Proximity Loss}(x_i, x_j) = \frac{1}{2} \times (\hat{d}_{i,j} - d_{i,j})^2 \quad (7)$$
3.3 Proposed Pair Loss
We begin our proposed pair loss design by review-
ing the basic contrastive loss formulation as provided
in equation (2). Contrastive loss pulls together pairs
sampled from the same class while repelling pairs from
different classes by at least a hyperparameter margin, as
simplified in equation (8). Equation (9) clearly shows
that the second component in the loss function does
not contribute unless the distance between two in-
stances from different classes is smaller than the mar-
gin. This term is designed to simplify the training
procedure and prevent the network from exerting ad-
ditional pushing force when different classes are ade-
quately separated. Contrastive loss overlooks intra-
class variance since it minimizes the distance be-
tween examples belonging to the same class regard-
less of their similarity. Thus, in order to include intra-
class variation and accomplish fine-grained model re-
trieval, the contrastive loss formulation must be modified.
$$\text{Loss}(x_i, x_j) = \frac{1}{2}\begin{cases} \hat{d}_{i,j}^{\,2} & \text{if } y_i = y_j \\ \max(0,\, m - \hat{d}_{i,j})^2 & \text{if } y_i \neq y_j \end{cases} \quad (8)$$
$$\max(0,\, m - \hat{d}_{i,j}) = \begin{cases} 0 & \text{if } \hat{d}_{i,j} \geq m \\ > 0 & \text{if } \hat{d}_{i,j} < m \end{cases} \quad (9)$$
The proposed intra-class pair loss modifies con-
trastive loss, as formulated in equation (10). The first
component is modified to minimize the squared difference
between the embedding-space distance of a pair
$(x_i, x_j)$ and their ground truth distance calculated
by Chamfer distance, as in equation
(6). The benefit of this improvement is the incorpo-
ration of intra-class variation by embedding PCDs in
a Euclidean feature space that respects their similarity
based on Chamfer distance. The second term is
maintained owing to its capacity to learn a clear
separation between different classes, with its contribution
regulated by the hyperparameter margin.
$$\text{ICPL}(x_i, x_j) = \frac{1}{2}\begin{cases} (\hat{d}_{i,j} - d_{i,j})^2 & \text{if } y_i = y_j \\ \max(0,\, m - \hat{d}_{i,j})^2 & \text{if } y_i \neq y_j \end{cases} \quad (10)$$
Figures (3, 4, 5) demonstrate the expected em-
bedding of positive and negative PCD pairs using
a siamese network and the proposed intra-class pair
loss. A positive pair consists of PCDs from the same
class, whereas a negative pair consists of PCDs from
different classes. A siamese network consists of multiple
copies of a network with identical shared weights. A
visually similar positive pair $(x_1, x_2)$ is shown in
figure (3). A visually similar pair has a low Chamfer
distance between its members, and the proposed loss
function encourages the network to learn an embedding
in which the distance between the members in the fea-
ture space is equal to their Chamfer distance.
The pair of PCDs $(x_1, x_3)$ is passed to the embedding
pipeline as shown in figure (4). Since $x_1$ and
$x_3$ are visually dissimilar, the Chamfer distance between
them is considerably greater than the distance
between $x_1$ and $x_2$. As a result, the network strives to
learn an embedding that is true to the ground truth
distance, guided by the proposed loss. As seen from
Figures (3, 4), $\hat{d}_{1,2} < \hat{d}_{1,3}$, indicating that the
constructed feature space takes intra-class variation into account.
Figure 3: Visually similar positive pair $(x_1, x_2)$ example to illustrate the proposed intra-class pair loss behavior.
Figure 4: Visually dissimilar positive pair $(x_1, x_3)$ example to illustrate the proposed pair loss behavior.
Figure (5) illustrates an example in which the sampled
pair is a negative one, because $x_1$ belongs to the
class chair and $x_4$ belongs to the class toilet. Despite
the fact that $x_1$ and $x_4$ are visually similar and share
some geometric properties, the second term in Equation
(8) forces the network to push their embeddings
apart from each other by at least the value of the mar-
gin. As a result, our loss function instructs the net-
work to learn a reasonable separation between differ-
ent classes, even if they share similarities.
The feature vector extracted from a PCD $x_i$ is normalized using the $L_2$ norm, such that $\|f_i\|_2 = 1$. This
normalization is initially introduced in (Schroff et al.,
2015) to constrain the learned feature vectors within
a d-dimensional hyper-sphere. The advantage of using
the $L_2$ norm is that the squared Euclidean distance
between two PCDs is confined between [0, 4], making
margin selection easier and more meaningful. The ap-
proach given in (Melekhov et al., 2016) is utilized to
automatically set the margin value to achieve appro-
priate separation between classes and more efficient
training. The margin value is specified to be twice
the average distance between PCD pairs computed with
a randomly initialized network.
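Under these assumptions, the margin heuristic can be sketched as follows (a simplification in PyTorch, not the authors' code):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def estimate_margin(encoder, loader):
        # Twice the average pairwise distance between L2-normalized embeddings
        # produced by a randomly initialized encoder, per (Melekhov et al., 2016).
        batch_means = []
        for pcd_batch, _ in loader:
            f = F.normalize(encoder(pcd_batch), p=2, dim=1)  # confine to the unit hyper-sphere
            batch_means.append(torch.pdist(f, p=2).mean())   # all pairwise distances in the batch
        return 2.0 * torch.stack(batch_means).mean()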
Dataset $D$ has $\binom{M}{2}$ pairs. Choosing pairs that con-
tribute to the loss is critical for network training to
achieve fast convergence. Positive pairs are not an
Figure 5: Negative pair $(x_1, x_4)$ example to illustrate the proposed intra-class pair loss behavior.
issue because they all contribute to the loss; how-
ever, sampling negative pairs that break the constraint,
called hard negative pairs, is a challenge. Hard nega-
tive pairs are pairs that have members from different
classes and have a distance in the feature space that is
smaller than the margin value. It is infeasible to sam-
ple these pairs over the whole dataset at each train-
ing stage. Thus, we adopt the online sampling tech-
nique introduced in (Schroff et al., 2015). First, a bal-
anced mini-batch is created with k samples from each
class. The intention behind a balanced mini-batch is
to have a minimum number of examples from each
class in each mini-batch. Using the siamese network
of choice, the sampled mini-batch is encoded into the
feature space. All positive pairs in the mini-batch are
generated, and an equal number of the hardest (near-
est) negative pairs are sampled. Even though online
sampling requires a high computational effort, it is
preferable to offline sampling, since offline sampling
uses a stationary network checkpoint, and the sampled
pairs may become outdated and cease to contribute to
the loss as the network weights change.
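The online sampling step described above can be sketched as follows, assuming the mini-batch has already been embedded (the helper is our own simplification):

    import torch

    def mine_pairs(features, labels):
        # Returns all positive pairs in the mini-batch plus an equal number of
        # the hardest (nearest) negative pairs under the current embeddings.
        d = torch.cdist(features, features, p=2)                   # (B, B) embedding distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        upper = torch.triu(torch.ones_like(d), diagonal=1).bool()  # count each pair once
        pos_idx = (same & upper).nonzero(as_tuple=False)
        neg_d = d.masked_fill(same | ~upper, float("inf"))         # keep only negative pairs
        k = min(len(pos_idx), int(torch.isfinite(neg_d).sum()))
        flat = neg_d.flatten().topk(k, largest=False).indices      # k nearest negative pairs
        rows = torch.div(flat, d.size(1), rounding_mode="floor")
        neg_idx = torch.stack((rows, flat % d.size(1)), dim=1)
        return pos_idx, neg_idx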
An online sampling strategy by itself is insufficient
for efficient network training, since the mini-batch size
for PCDs is limited compared to work done on images,
owing to their high memory and computation
requirements. In this work, for example, just a hundred
samples can be packed into a mini-batch compared to
a mini-batch containing 1800 images used in (Schroff
et al., 2015). Due to the small mini-batch size, only a
few negative pairs contribute to the loss by breaching
the margin constraint as training advances, resulting
in slower network convergence. Thus, we modified
our final loss to be a linear combination of cross en-
tropy loss and the proposed pair loss, with α and γ
regulating the trade-off between them as seen in equa-
tion (11). By including the cross entropy loss into the
proposed loss function, the network is able to learn a
decision hyper-plane that compensates for the limited
mini-batch size while also improving the discrimina-
tive ability of the network as proposed in (Wen et al.,
2016; Em et al., 2017).
$$\text{Total Loss} = \alpha \times \text{Cross Entropy Loss} + \gamma \times \text{Intra-Class Pair Loss} \quad (11)$$
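Equation (11) then amounts to a weighted sum of the two terms; a minimal sketch (the weighting values are illustrative, not the tuned ones):

    import torch.nn.functional as F

    def total_loss(logits, labels, pair_loss_value, alpha=1.0, gamma=1.0):
        # Equation (11): cross entropy over class logits plus the proposed
        # intra-class pair loss, traded off by alpha and gamma.
        return alpha * F.cross_entropy(logits, labels) + gamma * pair_loss_value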
3.4 Proposed Triplet Loss
As discussed in Section 2, the triplet loss (Wang et al.,
2014; Hoffer and Ailon, 2015) is used to compare the
embeddings of three input instances within deep distance
metric learning frameworks. The formulation of triplet loss, which only ac-
cepts a positive-negative triplet $(x_a, x_p, x_n)$ as an in-
put, is clarified by Equation (3). The objective of this
loss function is to learn an embedding in which the
difference between the anchor-positive distance $\hat{d}_{a,p}$ and the anchor-negative distance $\hat{d}_{a,n}$
the margin value. Only when the separation between
members is less than the margin value, as shown in
Equation (12), can a triplet contribute to the loss. This
formulation is preferable since well-separated triplets
do not contribute to network weight updates and are
instead guided only by ill-separated triplets. Triplet
loss guides the network to learn an embedding of a
triplet that treats instances from the same class exactly
the same, regardless of their similarity. This loss is
modified in order to incorporate intra-class variance
and learn a fine-grained embedding space based on
the similarity between PCDs.
The proposed triplet loss alters the triplet loss as
shown in Equation (13). The proposed formulation
broadens the kind of triplets accepted as input to the
loss function to include both (positive-negative) and
(positive-positive) triplets. The (positive-positive)
triplets are triplets with the same label for all three
members. Part one of equation (13) depicts the
loss component responsible for handling (positive-
positive) triplets. The goal of this component is to
minimize the difference between the ratio of distances
in the embedding space and the corresponding ratio of
ground truth distances, as estimated by Chamfer distance.
This component is further clarified in Equation (14),
where a triplet ceases contributing to the loss when
the ratio of its embedding distances equals the ratio of
its ground truth distances. The (positive-negative) triplets are
handled in the second term of equation (13), which
penalizes the difference between the embedding distance
ratio and the ratio of the anchor-positive Chamfer distance
to the margin value. This term effectively treats the
Chamfer distance between the anchor and negative members
as at least the margin, causing the network to learn an
embedding that pushes them further apart. Equation (15)
formulates the case-based behavior of the loss for
(positive-negative) triplets: once members are separated by
at least the margin, this component of the loss function vanishes.
$$\max(0,\, \hat{d}_{a,p} + m - \hat{d}_{a,n}) = \begin{cases} 0 & \text{if } \hat{d}_{a,p} + m \leq \hat{d}_{a,n} \\ > 0 & \text{if } \hat{d}_{a,p} + m > \hat{d}_{a,n} \end{cases} \quad (12)$$
$$\text{Intra-Class Triplet Loss}(x_a, x_i, x_j) = \begin{cases} (\hat{d}_{a,i} \times d_{a,j} - \hat{d}_{a,j} \times d_{a,i})^2 & \text{if } y_a = y_i \text{ and } y_a = y_j \\ \max(0,\, \hat{d}_{a,i} \times m - \hat{d}_{a,j} \times d_{a,i})^2 & \text{if } y_a = y_i \text{ and } y_a \neq y_j \end{cases} \quad (13)$$
Figure 6: Positive-positive triplet $(x_1, x_2, x_3)$ example to illustrate the proposed intra-class triplet loss behavior.
The expected embedding of various PCD triplets
using a siamese network guided by the proposed
triplet loss is illustrated in Figures (6, 7). In figure (6),
a positive-positive triplet $(x_1, x_2, x_3)$ is selected as the
siamese network input. When compared to $x_3$, $x_1$ and
$x_2$ are more visually similar. As a result, the proposed
loss guides the network to learn an embedding that
respects the ratio of the members' Chamfer distances.
Since the distance between the first and second members
of the triplet $\hat{d}_{1,2}$ is less than the distance between
the first and third members $\hat{d}_{1,3}$, the resulting vector
space preserves intra-class variation.
$$(\hat{d}_{a,i} \times d_{a,j} - \hat{d}_{a,j} \times d_{a,i})^2 = \begin{cases} 0 & \text{if } \hat{d}_{a,i}/\hat{d}_{a,j} = d_{a,i}/d_{a,j} \\ > 0 & \text{if } \hat{d}_{a,i}/\hat{d}_{a,j} \neq d_{a,i}/d_{a,j} \end{cases} \quad (14)$$
$$\max(0,\, \hat{d}_{a,i} \times m - \hat{d}_{a,j} \times d_{a,i}) = \begin{cases} 0 & \text{if } \hat{d}_{a,i}/\hat{d}_{a,j} \leq d_{a,i}/m \\ > 0 & \text{if } \hat{d}_{a,i}/\hat{d}_{a,j} > d_{a,i}/m \end{cases} \quad (15)$$
A positive-negative triplet is depicted in figure (7),
with $x_1, x_2$ belonging to the class chair and $x_4$
belonging to the class toilet. The sampled triplet is
the siamese network's input. The loss function is
designed to build an embedding space that respects
Figure 7: Positive-negative triplet $(x_1, x_2, x_4)$ example to illustrate the proposed intra-class triplet loss behavior.
the Chamfer distance between examples of the same
class $(x_1, x_2)$ while pushing the example from a different
class, $x_4$, away by the margin. By separating different
classes with a margin, the inter-class variance within
the dataset is preserved.
The feature space is confined inside a d-dimensional
hyper-sphere by normalizing extracted feature vectors
with the $L_2$ norm. As discussed in
section 3.3, the confined space has a favorable influ-
ence on the margin selection and distance computa-
tion between vectors. However, the margin selection
criterion used in the preceding section is not applicable
to the proposed triplet loss, since the resulting margin
was empirically found to be too narrow and does not
offer adequate separation between distinct classes.
As a result, the margin selection for
this loss is considered a part of the hyperparameter
selection process.
$\binom{M}{3}$ triplets are present in dataset $D$. It is crucial
for stable network training and efficient convergence
to choose triplets that contribute to the loss. The se-
lection of (positive-positive) triplets is not problem-
atic, since all of these triplets contribute to the loss.
However, sampling (positive-negative) triplets that vi-
olate the constraint, known as (positive-hard negative)
triplets, poses a challenge. (Positive-hard negative)
triplets are those in which the difference between the
anchor-positive and anchor-negative feature space
distances is less than the margin value. It is impractical to sample
these triplets over the whole dataset at each training
step.
Section 3.3 examined online and offline sampling
strategies. Given its benefits over the offline strategy,
online sampling is well suited for (positive-hard negative)
triplet selection. A balanced mini-batch is sampled
in the same way as mentioned in the preceding
section. A pre-defined number of (positive-positive)
triplets are randomly sampled from the mini-batch,
as it is computationally infeasible to sample all such
triplets. In addition, an equal number of the
(positive-hardest negative) triplets is sampled.
Due to the restricted mini-batch size of PCDs, an
online sampling technique alone is insufficient for ef-
ficient network training. As a result of this constraint,
only a few (positive-negative) triplets contribute to the
loss by violating the margin constraint at the advanced
phases of the training process. As shown in Equation
(16), the total loss is adjusted to be a linear combina-
tion of cross entropy loss and the proposed intra-class
triplet loss, with α and γ regulating the trade-off be-
tween both. By integrating the cross entropy loss into
the proposed intra-class loss function, the network has
an increased discriminative capacity that compensates
for the restricted mini-batch, as proposed in Section
3.3.
$$\text{Total Loss} = \alpha \times \text{Cross Entropy Loss} + \gamma \times \text{Intra-Class Triplet Loss} \quad (16)$$
3.5 Network Architecture
DGCNN (Wang et al., 2019) is a network architecture
that draws inspiration from PointNet (Qi et al., 2017a)
and convolution operators. DGCNN, like PointNet,
offers a versatile architecture that can be used for a
variety of high-level applications. DGCNN presents
the Edge-Conv layer, which is a differentiable layer
that captures local geometries by considering a point's
correlation with its neighbors. This layer receives N
points as input, with each point represented by $d_{in}$ fea-
tures. To begin, a knn graph with a fixed number of
neighbors is constructed over the points to emphasize
the underlying local geometric structure. The con-
structed graph is processed using a convolution-like
operator to compute a per-point output with $d_{out}$ di-
mensions. The knn graph is not fixed and is dy-
namically modified in each Edge-Conv layer of the
network. Dynamically updating implies that the list of
k-nearest neighbors in each edge-conv layer is unique
and is built from the layer’s input features rather than
the network’s initial input.
The convolution operator learns the edge weight
between the center point $p_i$ and its nearest k-neighbors
$\{q_1, \ldots, q_k\}$. The edge weights are formu-
lated in Equation (17) to capture local correlations
with neighbors and preserve global shape structure.
The edge weights capture local geometries based on
the difference between the center point's features and
each of its neighbors' features $(q_j - p_i)$, extracting
relevant information through multiplication with the
learnable weights $\theta_1$. The global shape structure
is preserved by the center point features $p_i$, from which
information is inferred with the learnable weights $\theta_2$.
Non-linearities are injected into the combined learned
features using Leaky RELU. In contrast to RELU, which does not allow
any negative values, Leaky RELU enables a small
slope for negative values. When compared to RELU,
Leaky RELU produced superior empirical results in
DGCNN. DGCNN is one of the best performing
point-based models in PCD analysis, owing to the
favorable properties of the edge-conv layer. Since
neither the graph structure nor the convolution-like
operator is order-dependent, edge-conv is permuta-
tion invariant. Furthermore, it exhibits partial transla-
tion invariance, since the difference between features
$(q_j - p_i)$ is a relative term that is translation invariant;
however, the center point feature $p_i$ term is absolute.
Finally, this layer displays non-locality characteristics:
since proximity in the feature space is not the same
as proximity in the input space, information is dispersed
in a non-local manner.
$$e_{i,j} = \text{LeakyReLU}(\theta_1 \times (q_j - p_i) + \theta_2 \times p_i) \quad (17)$$
4 EVALUATIONS
4.1 Dataset
The ModelNet-10 dataset is used in this work to
evaluate the proposed loss functions’ discrimina-
tive and model retrieval performance. ModelNet-
10 (Wu et al., 2015), published by Princeton Uni-
versity, is a benchmark for 3D object classification
and retrieval, where the dataset includes 4899 CAD-
generated meshes that were saved in Object File
Format (OFF). The meshes are separated into 3991
training meshes and 908 testing meshes. ModelNet
is well-known in the research field as it is a well-
structured dataset, including pre-aligned clean shapes
picked from several categories. The dataset includes
ten classes: bathtub, bed, chair, desk, dresser, moni-
tor, nightstand, sofa, table, and toilet. The output PCD
is generated by randomly sampling evenly distributed
points from the triangle faces of the CAD-generated
mesh. The sampled points are normalized and fit into a bounding box between [−1, 1].
4.2 Experimental Setting
The implementation of the proposed losses and net-
work architecture is established in Python 3.9 with
PyTorch 1.10.0 and Cuda 11.5 on an Nvidia GeForce
RTX 3090 graphics card with 24 GB of vRAM.
DGCNN network architecture is implemented as
in (Wang et al., 2019). The network is trained with a
balanced mini-batch having 10 samples per class for
15 epochs. It is optimized using an SGD op-
timizer with an initial learning rate of 0.1, momentum
of 0.9, and a weight decay of $10^{-4}$. The learning rate is
decreased using a cosine annealing schedule to reach
a final learning rate of 0.001 as recommended in the
original implementation. Upon completion of training,
the embeddings of the training dataset are fitted into a
kd-tree for faster queries during the inference of model
retrieval metrics.
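This retrieval step can be sketched with SciPy's cKDTree; the array names and sizes below are placeholders, not the actual experiment values:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    train_features = rng.normal(size=(3991, 512))            # placeholder training embeddings
    train_features /= np.linalg.norm(train_features, axis=1, keepdims=True)
    query_feature = train_features[0]                        # placeholder query embedding

    tree = cKDTree(train_features)                           # fit once after training
    distances, indices = tree.query(query_feature, k=5)      # retrieve the 5 nearest models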
4.3 Evaluation Metrics
Two aspects of the designed loss functions will be
evaluated in this work: discriminative and retrieval
abilities. The discriminative ability is measured us-
ing well-known classification metrics such as recall,
precision, F-score, and accuracy. The mean average
precision (mAP) metric is used to evaluate the model
retrieval quality and stands for the mean value of the
average precision. mAP is first presented in (Harman,
1993) as a metric of information retrieval quality over
text. Average Precision (AP) is parameterized by k,
which is the number of retrieved examples, as seen in
Equation (19). Average Precision@k is the product of
precision@k, as in Equation (18), and relevance@k,
normalized by the number of relevant retrieved in-
stances, also known as Ground Truth Positives (GTP).
Relevance@k is an indicator of whether the retrieved
instance is correct or not. Precision@k is the ratio
of relevant retrieved instances to the total number of
retrieved instances, as illustrated in equation (18).
$$\text{precision@}k = \frac{\text{correct results @ }k}{k} \quad (18)$$

$$\text{AP@}K = \frac{1}{GTP}\sum_{k=1}^{K} \text{precision@}k \times \text{relevance@}k \quad (19)$$
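Equations (18) and (19) translate directly into code; a minimal sketch with a worked example:

    import numpy as np

    def average_precision_at_k(relevance, gtp):
        # relevance: binary indicator over the K retrieved instances;
        # gtp: number of ground truth positives for the query.
        relevance = np.asarray(relevance, dtype=float)
        hits = np.cumsum(relevance)
        precision_at_k = hits / (np.arange(len(relevance)) + 1)  # equation (18)
        return float((precision_at_k * relevance).sum() / gtp)   # equation (19)

    # Five retrievals with correct results at ranks 1, 2, and 4, and 5 relevant models overall:
    print(average_precision_at_k([1, 1, 0, 1, 0], gtp=5))        # 0.55

mAP is then the mean of AP@K over all queries.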
4.4 Quantitative Results
Table 1 illustrates the discriminative ability of
DGCNN trained with various loss functions based on
the classification metrics stated earlier. The results in-
dicate that both of our proposed losses outperform the
cross entropy loss on the classification task by 2% on
average on all metrics. This improvement indicates
that the proposed intra-class pair and triplet losses de-
velop a more discriminative hyperspace for classifi-
cation with good separation between classes in the
Figure 8: Confusion matrix for DGCNN network trained by
the proposed pair loss.
dataset. The proposed pair loss performs very competitively, with the proposed triplet loss being slightly better.
Figures (8, 9) demonstrate the confusion matrices
extracted from classification statistics of DGCNN
classifiers trained by the proposed intra-class pair and
triplet losses, respectively. The highest values always
appear on the confusion matrix diagonals, indicating
a good classification performance across all classes.
The small dispersed values scattered over the confusion
matrix indicate the classification error for a certain
class distributed across the other classes. A minor
classification error occurs when instances from
the class nightstand are classified as belonging to the
class dresser, and when instances from the class table
are labeled as belonging to the class desk. The fact
that both class pairs (nightstand, dresser) and (table,
desk) are visually similar and might be difficult for
humans to categorize leads to these minor inter-class
errors.
Table 2 introduces model retrieval results on the mAP
metric using label sharing as the relevance measure. The
results are calculated across a range of retrieval sizes,
starting with 5 and up to 20 with a step size of 5,
to provide a better insight into our proposed losses’
performance. Two extra baselines are added in this
table, specifically the original contrastive and triplet
losses formulations, which were created explicitly for
the model retrieval task based on the labels. The re-
sults demonstrate that our proposed intra-class pair
and triplet losses are competitive on the task at hand.
On all retrieval sizes except 20, where triplet loss is
marginally better, the proposed pair loss performs bet-
ter than all baselines. We may deduce from this table
that our proposed losses do not degrade performance
as they keep the inter-class term of the original losses
and, in fact, outperform state-of-the-art losses like contrastive and triplet loss.
Table 1: Classification results, where the highest values indicate better performance.

| Metric            | Cross Entropy Loss | Intra-Class Pair Loss (Ours 1) | Intra-Class Triplet Loss (Ours 2) |
|-------------------|--------------------|--------------------------------|-----------------------------------|
| Accuracy          | 0.9152             | 0.9306                         | 0.9317                            |
| Average Recall    | 0.9082             | 0.9306                         | 0.9301                            |
| Average Precision | 0.9178             | 0.9334                         | 0.9335                            |
| Average F-score   | 0.9133             | 0.9298                         | 0.9300                            |
Table 2: DGCNN model retrieval results with labels as relevance measure, where the highest values indicate better performance.

| Top-K | Cross Entropy Loss | Contrastive Loss | Triplet Loss | Intra-Class Pair Loss (Ours 1) | Intra-Class Triplet Loss (Ours 2) |
|-------|--------------------|------------------|--------------|--------------------------------|-----------------------------------|
| 5     | 0.9120             | 0.9139           | 0.9191       | 0.9239                         | 0.9210                            |
| 10    | 0.9025             | 0.9063           | 0.9115       | 0.9150                         | 0.9102                            |
| 15    | 0.8953             | 0.8991           | 0.9055       | 0.9062                         | 0.9035                            |
| 20    | 0.8857             | 0.8945           | 0.9005       | 0.8997                         | 0.8969                            |
Figure 9: Confusion matrix for DGCNN network trained by
the proposed triplet loss.
Furthermore, the model retrieval task with Chamfer
distance as a relevance metric is examined in Table
3. When compared to all three baselines, the proposed
pair and triplet losses are substantially superior, with
the proposed intra-class pair loss being slightly bet-
ter. The nearest loss in performance to our losses is
the cross entropy loss. The lack of a term dedicated
to addressing intra-class variance in both the original
formulations of contrastive and triplet losses hinders
performance on this objective.
4.5 Qualitative Results
Figure (10) shows an example of model retrieval results
using the Chamfer distance as a measure of relevance.
Figure (10a) represents the query model for
which the most similar counterparts in the dataset
should be retrieved. Figure (10b) shows the ground truth
nearest neighbors of the query model PCD with
Chamfer distance as the similarity measure. Figure
(10c) demonstrates the objects retrieved using cross-entropy
loss. The retrieved models, as shown in the
figure, belong to the same class, but they are not
structurally or visually similar, and none of them is a
member of the ground truth query. The results obtained
with the proposed intra-class pair loss are visually quite
similar to the query model, as shown in figure
(10d). The first four retrieved models are members
of the ground truth query, whereas the fifth is not. Despite
the fact that the last retrieval is incorrect, it has a
tolerable visual resemblance to the query model. Thus,
figure (10) supports our hypothesis that a loss function
that accounts for intra-class variance improves
model retrieval results.
5 DISCUSSION AND FUTURE
WORK
According to empirical results, the network has issues
distinguishing PCDs from classes that are structurally
similar, such as the class pairs (nightstand,
dresser) and (table, desk), and cannot provide sufficient
separation between them. The difficulty arises
as these class pairs have visually similar instances
that are difficult to distinguish even by humans. To
solve this deficiency, a more extensive structural com-
parison between these related instances is necessary.
As a result, in future work, a network pre-trained for
the part segmentation task or jointly trained for part
segmentation will be investigated. This suggestion
is based on the notion that the increased level of in-
Table 3: DGCNN model retrieval results with Chamfer distance as relevance measure, where the highest values indicate better performance.

| Top-K | Cross Entropy Loss | Contrastive Loss | Triplet Loss | Proposed Pair Loss (Ours 1) | Proposed Triplet Loss (Ours 2) |
|-------|--------------------|------------------|--------------|-----------------------------|--------------------------------|
| 5     | 0.6531             | 0.5974           | 0.5178       | 0.6986                      | 0.6951                         |
| 10    | 0.6837             | 0.6363           | 0.5698       | 0.7200                      | 0.7152                         |
| 15    | 0.6924             | 0.6417           | 0.5764       | 0.7287                      | 0.7271                         |
| 20    | 0.6989             | 0.6479           | 0.5770       | 0.7329                      | 0.7311                         |
(a) Query. (b) Ground Truth Retrieved Objects. (c) Retrieved Objects using Cross Entropy Loss. (d) Retrieved Objects using the Proposed Intra-Class Pair Loss.
Figure 10: Model retrieval results with Chamfer distance as relevance measure under different loss functions. Note: the qualitative results for the proposed triplet loss are excluded, since the intra-class pair loss showed slightly better performance.
formation captured during part segmentation can em-
phasize the structural differences between similar in-
stances from different classes.
In addition, we suggest using clustering to sim-
plify the similarity-aware embedding task. This sim-
plification is designed to eliminate the combinatorial
nature of the proposed embedding task. The combina-
torial nature comes from the fact that altering weights
based on a mini-batch affects the network objective
for all remaining instances. We propose establishing
intra-class clusters based on the similarity between in-
stances from the same class to simplify this problem.
To begin, a graph is constructed, with each PCD in
the class representing a node and edge weights pro-
portional to the RBF kernel of the Chamfer distance
between the two PCDs connected by the edge. This
graph can be clustered using spectral graph clustering,
which searches for the optimal graph cut to cluster the
graph. Finally, the network objective is revised to
partition the feature space into intra-class clusters
separated by an intra-class margin, while also separating
different classes by at least a margin.
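This intra-class clustering step can be sketched as follows, assuming precomputed pairwise Chamfer distances and scikit-learn's spectral clustering (parameter values are illustrative):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def intra_class_clusters(chamfer_matrix, n_clusters=4, sigma=1.0):
        # chamfer_matrix: (n, n) pairwise Chamfer distances within one class.
        # The RBF kernel turns distances into graph edge weights (affinities).
        affinity = np.exp(-chamfer_matrix ** 2 / (2.0 * sigma ** 2))
        model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
        return model.fit_predict(affinity)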
6 CONCLUSION
In this work, two novel loss functions are proposed
by modifying contrastive and triplet loss formula-
tions to include a term that handles intra-class vari-
ance and generates PCD embeddings that respect the
Chamfer distance between PCDs from the same class.
DGCNN is the network of choice for this work since
it dynamically generates local knn graphs across
the PCD, encapsulating local neighborhoods and fea-
tures at each layer. Results demonstrate that DGCNN
trained with either of the proposed loss functions
outperforms all baselines in classification, label-based
model retrieval, and similarity-based model retrieval,
owing to the quantitative and qualitative enhancement
of the embedding.
REFERENCES
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic,
J. (2016). Netvlad: Cnn architecture for weakly super-
vised place recognition. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 5297–5307.
Bai, Y., Ding, H., Qiao, Y., Marinovic, A., Gu, K., Chen,
T., Sun, Y., and Wang, W. (2019). Unsupervised in-
ductive graph-level representation learning via graph-
graph proximity. arXiv preprint arXiv:1904.01098.
Chatpatanasiri, R., Korsrilabutr, T., Tangchanachaianan, P.,
and Kijsirikul, B. (2010). A new kernelization frame-
work for mahalanobis distance learning algorithms.
Neurocomputing, 73(10-12):1570–1579.
Dey, S., Dutta, A., Toledo, J. I., Ghosh, S. K., Lladós, J.,
and Pal, U. (2017). Signet: Convolutional siamese
network for writer independent offline signature veri-
fication. arXiv preprint arXiv:1707.02131.
Em, Y., Gag, F., Lou, Y., Wang, S., Huang, T., and Duan, L.-
Y. (2017). Incorporating intra-class variance to fine-
grained visual recognition. In 2017 IEEE Interna-
tional Conference on Multimedia and Expo (ICME),
pages 1452–1457. IEEE.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimen-
sionality reduction by learning an invariant mapping.
In 2006 IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition (CVPR’06), vol-
ume 2, pages 1735–1742. IEEE.
Harman, D. K. (1993). The first text retrieval conference
(TREC-1), volume 500. US Department of Com-
merce, National Institute of Standards and Technol-
ogy.
Hoffer, E. and Ailon, N. (2015). Deep metric learn-
ing using triplet network. In International workshop
on similarity-based pattern recognition, pages 84–92.
Springer.
Melekhov, I., Kannala, J., and Rahtu, E. (2016). Siamese
network features for image matching. In 2016 23rd in-
ternational conference on pattern recognition (ICPR),
pages 378–383. IEEE.
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S.,
and Geiger, A. (2019). Occupancy networks: Learn-
ing 3d reconstruction in function space. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 4460–4470.
Nazir, D., Afzal, M. Z., Pagani, A., Liwicki, M., and
Stricker, D. (2021). Contrastive learning for 3d point
clouds classification and shape completion. Sensors,
21(21):7392.
Park, J. J., Florence, P., Straub, J., Newcombe, R., and
Lovegrove, S. (2019). Deepsdf: Learning continuous
signed distance functions for shape representation. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 165–174.
Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M.,
and Geiger, A. (2020). Convolutional occupancy net-
works. In European Conference on Computer Vision,
pages 523–540. Springer.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
net: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 652–660.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
net++: Deep hierarchical feature learning on point sets
in a metric space. Advances in neural information pro-
cessing systems, 30.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
815–823.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Uy, M. A. and Lee, G. H. (2018). Pointnetvlad: Deep point
cloud based retrieval for large-scale place recognition.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4470–4479.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J.,
Philbin, J., Chen, B., and Wu, Y. (2014). Learning
fine-grained image similarity with deep ranking. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 1386–1393.
Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017).
Deep metric learning with angular loss. In Proceed-
ings of the IEEE international conference on com-
puter vision, pages 2593–2601.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M.,
and Solomon, J. M. (2019). Dynamic graph cnn
for learning on point clouds. Acm Transactions On
Graphics (tog), 38(5):1–12.
Wen, Y., Zhang, K., Li, Z., and Qiao, Y. (2016). A discrim-
inative feature learning approach for deep face recog-
nition. In European conference on computer vision,
pages 499–515. Springer.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,
and Xiao, J. (2015). 3d shapenets: A deep representa-
tion for volumetric shapes. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1912–1920.
Xing, E., Jordan, M., Russell, S. J., and Ng, A. (2002).
Distance metric learning with application to cluster-
ing with side-information. Advances in neural infor-
mation processing systems, 15.
Xu, W. (2021). Deep large margin nearest neighbor
for gait recognition. Journal of Intelligent Systems,
30(1):604–619.