Segmentation Improves 3D Object Classification in Graph Convolutional
Networks
Clara Holzhüter (https://orcid.org/0000-0001-8365-5544), Florian Teich (https://orcid.org/0000-0001-6708-7233) and Florentin Wörgötter (https://orcid.org/0000-0001-8206-9738)
III. Physikalisches Institut, Georg-August University, Friedrich-Hund-Platz 1, Göttingen, Germany
Keywords:
3D, Computer Vision, Classification, Point Clouds, Segmentation, Graph Convolution.
Abstract:
3D object classification is involved in many computer vision pipelines such as autonomous driving or robotics.
However, the irregular format of 3D data makes it challenging to develop suitable deep learning architectures.
This paper proposes CompointNet, a graph convolutional network architecture, which performs 3D object
classification by means of part decomposition. Our model consumes a 3D point cloud in the form of a part
graph which is constructed from segmented 3D shapes. The model learns a global descriptor by hierarchically
aggregating neighbourhood information using simple graph convolutions. To capture both local and global
information, a global classification method processing each point separately is combined with our part graph
based approach into a hybrid version of CompointNet. We compare our approach to several state-of-the-art
methods and demonstrate competitive performance. In particular, in terms of per class accuracy, our hybrid
approach outperforms the compared methods. The proposed hybrid variants achieve high classification
accuracy while being much more efficient than the benchmark models with comparable performance.
The conducted experiments show that part-based approaches leveraging structural information about a 3D object
can indeed improve the classification performance of 3D deep learning models.
1 INTRODUCTION
Computer Vision applications are more prevalent in
our everyday lives than ever before. From Vir-
tual Reality applications (Kharroubi et al., 2019) on
our smartphones to Just-Walk-Out-Shopping (Pfeiffer
et al., 2020) and autonomous driving (Arnold et al.,
2019), Computer Vision is aiming to improve our
quality of life in multiple respects. 3D object classification is an essential ingredient of many such pipelines: these systems need to categorize the perceived objects in order to interact with them. With increasing
3D scanner quality as well as decreasing hardware
prices, 3D data becomes more abundant and easier
to access (Martínez et al., 2015; Straub and Kerlin,
2014). However, as 3D data is more complex than
traditional 2D image data, specialized approaches
are necessary to realize classification pipelines on
this input modality. Most popular representations
for 3D data nowadays include voxels, point clouds,
meshes or implicit surfaces. Many of the current
3D classification methods can be categorized into two
archetypes: global 3D classification such as PointNet
(Qi et al., 2017a), where each point of the point cloud
is processed individually, not considering its neigh-
borhood. Subsequently, in these global methods, in-
formation from all entities is aggregated by primi-
tive operations such as sum or max. This aggrega-
tion behaviour may thus neglect local information and
therefore does not fully acknowledge that an object's
surface varies locally. The second archetype of 3D
classification methods are more advanced approaches
such as PointNet++ (Qi et al., 2017b) that work with
grouping or clustering of input entities in order to
hierarchically create the shape descriptor and subse-
quently classify the overall object. However, these
methods typically use very primitive clustering
mechanisms which do not fully leverage the under-
lying shape topology. A side effect of all these ap-
proaches is that when faced with out-of-distribution
samples, correct class prediction becomes challeng-
ing: if the methods are trained only on, e.g., mugs with one handle and a mug with four handles is queried at evaluation time, many methods may confuse the shape with instances of other object classes.
Figure 1: Four object point clouds and their part instance segmentation. The part graphs are displayed on the bottom left
corner of each image. The red node in the leftmost image corresponds to an occluded leg of the chair and the green node
in the middle right figure indicates the almost fully occluded leg of the table. The node features are learned from the part’s
point cloud representations.
On the other hand, when parts of an object are occluded, current
approaches may again misclassify the object in ques-
tion.
In this work, we are proposing a bottom-up clas-
sification approach by means of part decomposi-
tion. The approach is inspired by the Recognition-by-
components theory developed by (Biederman, 1987).
It states that humans recognize objects as an assembly
of their parts. Based on the object’s components and
their arrangement, humans are able to identify its cat-
egory. In this paper, we are exploring the possibility
of using segmentation information about the objects
in order to create part-graphs. Leveraging these part-
graphs has two theoretical advantages. First, shape
variance across a single part is usually lower than
across the entire object. Second, using graph sim-
ilarity methods, objects with redundant or occluded
parts may be easier to predict correctly based on their
part-graphs compared to global approaches. More im-
portantly, we are trying to identify the potential of
such a bottom-up approach by means of ground-truth
segmentation. For the part graph classification, we
are employing Graph Convolutional Neural Networks
(GCNs).
The rest of the paper is structured as follows:
Section 2 provides an overview of current 3D classification approaches; Section 3 explains the classification pipeline and its components in more detail. Section 4 describes multiple setups to test different part-based 3D point cloud classifiers. The results are presented in Section 5. Section 6 summarizes the overall results and discusses possible improvements for future extensions of the presented method.
2 RELATED WORK
Approaches to 3D object classification can be divided
into two categories: traditional, hand-engineered fea-
ture extraction and subsequent classification thereof
and end-to-end classification pipelines via deep learn-
ing.
The first of these two categories is often em-
ployed in robotics applications or embedded devices
that have limited resources. As a first step, the input
object, i.e. the point cloud, is passed into an extraction module that collects statistics on predefined features such as angles between point triplets or distances between point pairs. This
information is discretized into histograms, resulting
in a descriptor for each queried object. Subsequently,
these descriptors are passed to the final classifier (often an SVM or MLP), which is trained on this data and, during evaluation, predicts the target class of a queried point cloud based on its extracted feature descriptor.
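As a concrete illustration of such a pipeline, the sketch below (a minimal example of our own, not a specific published descriptor) uses a histogram of pairwise point distances as the hand-engineered feature and an SVM as the final classifier:

```python
# Minimal sketch of the traditional pipeline: a hand-engineered histogram
# descriptor fed to an SVM. The distance-histogram feature is illustrative,
# not taken from a specific published method.
import numpy as np
from sklearn.svm import SVC

def distance_histogram(points: np.ndarray, bins: int = 32) -> np.ndarray:
    """Histogram of pairwise distances, normalized to sum to one.
    Assumes the cloud is normalized into a unit sphere (diameter <= 2)."""
    diffs = points[:, None, :] - points[None, :, :]   # (n, n, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(points), k=1)            # each pair counted once
    hist, _ = np.histogram(dists[iu], bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()

def fit_classifier(clouds, labels):
    """clouds: list of (n_i, 3) arrays; labels: list of class ids."""
    features = np.stack([distance_histogram(c) for c in clouds])
    clf = SVC(kernel="rbf")
    clf.fit(features, labels)
    return clf
```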
For methods of the second category, i.e. deep
learning classifiers, the extracted features are usually
learned implicitly from the training data. In 2015,
VoxNet (Maturana and Scherer, 2015) pushed the ca-
pabilities of such 3D classifiers to new limits with its
CNN architecture that uses a voxelized representation
of the 3D object as input. This CNN design was in-
spired by architectures that proved to be successful
in 2D classification tasks (Krizhevsky et al., 2017)
and could directly be adapted to the 3D scenario thanks to the discrete input modality. However, discretizing the input space leads to a loss of detail, and storing sparse and very large input objects incurs a high computational cost inside the architecture itself, which, in turn, results in long training and evaluation times. The widely em-
ployed PointNet (Qi et al., 2017a) architecture en-
abled classification on point clouds sampled from the
object's surface. The key idea of their approach is to utilize
a symmetric function in order to tackle the permu-
tation problem: the network’s output should not de-
pend on the order of the points inside the input cloud.
By using an MLP with shared weights for all points,
PointNet extracts a high dimensional feature vector
from each point individually. The global aggregation
method (max-operator) then reduces these features to
a single global shape descriptor which is in turn fed
into the classification head (MLP).
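The idea can be summarized in a few lines; the sketch below is our simplification of this design, not the original PointNet architecture:

```python
# PointNet-style classifier sketch: a shared per-point MLP followed by a
# symmetric max aggregation, making the output invariant to point order.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # shared MLP, applied to every point independently
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024),
        )
        # classification head on the global descriptor
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, n_points, 3)
        feats = self.point_mlp(points)         # (batch, n_points, 1024)
        global_feat = feats.max(dim=1).values  # symmetric aggregation over points
        return self.head(global_feat)
```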
Similar to PointNet, Momen(e)t (Joseph-Rivlin et al., 2019) is based on
MLP layers and max pooling as well, but as a key ad-
vantage it augments the point coordinates using sim-
ple polynomial functions. The products are concate-
nated to the original point features before the features
are passed to a classification MLP. Successors to the
PointNet method started to incorporate the concept of
locality into the pipeline. By grouping points (e.g. us-
ing kNN) and evaluating PointNet on each of these
clusters, PointNet++ (Qi et al., 2017b) manages to
create a hierarchy of point clusters that ultimately su-
perseded PointNet on several classification datasets.
However, PointNet++’s notion of clusters has no se-
mantic basis but is just a spatial aggregation of points
at arbitrary regions inside the 3D object. Another
architecture exploring local regions is PointCNN (Li
et al., 2018), which applies discrete convolutions. The
convolutional layer identifies the neighbourhood of a
point using kNN and subtracts the coordinates of the
target point from each of its neighbours to store their
relative positions. Subsequently, an MLP learns high
level features for each point in the local region and
concatenates these features to the original ones. To
weight and permute the features into a more canonical
order, a so-called χ-transformation is applied, which
is implemented using an MLP. Afterwards, a standard
convolution can be applied. In a spirit similar to PointNet++, Sim2Real (Weibel et al., 2019) uses a Graph Neural Network (GNN) (Kipf and Welling, 2016) that segments query objects into eight fixed regions that are then evaluated by a PointNet architecture. Re-
sults of the eight parts are later aggregated and lever-
aged for the final class prediction. Again, this method
neglects high-level part-boundaries as the segmenta-
tion method that is employed often results in segments
that do not contain specific semantic meaning. Never-
theless, these segment based approaches demonstrate
that considering part-wise point clusters of the in-
put may boost classification performance compared
to point-wise methods. In contrast to these methods,
we focus on semantically meaningful segmentation.
Another approach is to represent a point cloud as
a graph, in which each node corresponds to one 3D
point and adjacency is determined by the distance between points. A graph convolutional method, which aims to
improve PointNet is the so-called kernel correlation
proposed by (Shen et al., 2018). Similar to a convo-
lutional filter in 2D, a set of learnable points serves as
a kernel that is applied to a local region of the kNN graph of
a 3D shape. The similarity of the kernel and the input
is measured by a Gaussian kernel such that regions
which are similar to the kernel point set produce high
activations. The activations within a certain neigh-
bourhood are aggregated using max pooling. Another
popular kNN graph based method is DGCNN (Wang
et al., 2019), which enables non-local information dif-
fusion via a changing graph topology during the for-
ward pass. It performs edge convolutions, which learn
features for a target node by applying an MLP to all
edges originating from that node and aggregating the
computed features. Initial edge features are computed
from the input points using an MLP. After each con-
volutional layer, the kNN graph is updated such that
each node can have a new set of neighbours in the next
layer. A GNN based on mathematically substantiated
rotation invariance is ClusterNet (Chen et al., 2019),
which defines a representation of a point cloud that
contains all relevant information except the rotation
of the 3D object such that the output for a point cloud
and its rotated version are the same. The mapping is
determined by the norm of a point and its neighbours
and several properties defined by the angle between a
point and its neighbours. This representation is used
to encode the nodes of a kNN graph of the point cloud,
which can be processed using an MLP. The neigh-
bourhood of a target point is aggregated using max
pooling. To reduce the dimension of the point cloud
and merge clusters in order to obtain a global feature
descriptor in the end, agglomerative hierarchical clus-
tering is applied. Additionally, similar to DGCNN,
MLPs are used to extract edge features. However, in-
stead of the difference between points, the rotation-invariant mapping described above is used.
Recently, several transformer methods such as
(Zhao et al., 2020) and (Guo et al., 2021) have been
proposed. These methods are applied on a sequence
of points and learn relationships between these points
using a self attention mechanism, which estimates the
importance of one point to another. As self attention
mechanisms operate on an input set, they can deal
with the unordered nature of point clouds. The layers
of the Point Transformer classification network (Zhao
et al., 2020) perform self-attention on a local neigh-
bourhood of a 3D point in a vectorized manner. To
encode the location of a 3D point, an MLP learns a
position embedding which is added to the transformed
input feature and the learned attention vectors. To re-
duce the dimension of the point cloud during the for-
ward pass, max pooling within a spatial region is ap-
plied (Zhao et al., 2020). In (Guo et al., 2021) the
input point clouds are transformed to a higher dimen-
sional feature space and passed to several attention
modules, which are based on MLPs. Features are
computed as the matrix product of the input features
and the computed attention values. Before applying
an MLP to obtain class scores, the authors perform
offset attention, which refers to subtracting the input
feature from the self-attended features.
3 METHODS
3.1 Overview
This paper proposes CompointNet, a bottom-up clas-
sification method for 3D objects based on part graphs,
which are constructed from 3D objects segmented
into their components. Each component of an object
serves as a node in the part graph. Examples for 3D
objects and their corresponding part graphs are shown
in Figure 1. In order to classify 3D shapes, CompointNet learns a global representation from their part graphs in a bottom-up manner. Two basic variants of CompointNet have been developed: one is based on the 1-WL layer proposed by (Morris et al., 2019) and the other applies the graph attention layer presented in (Veličković et al., 2018). To learn a more robust latent
representation, which fuses local and global informa-
tion, two extensions of CompointNet are proposed.
They combine a global PointNet approach with the GAT and WL based CompointNet respectively, such that each input shape is processed by both the global and the local model in parallel. The vector representations computed by the two methods are concatenated and further processed to produce a latent representa-
tion from which the final class scores can be inferred.
3.2 Feature Extraction and Part Graph
Construction
The part graph for a 3D object is constructed using k-nearest neighbour (kNN) search and requires a point cloud segmented into its components. Each node of the part graph represents a component of the 3D shape and an edge between two nodes indicates that the corresponding object parts are spatially connected. This connectivity is identified by searching for neighbouring points with different segmentation labels using kNN.
The resulting part graph is undirected and does not
contain edge labels. Each node comprises a feature
representation, which describes the corresponding ob-
ject part. For this purpose, a PointNet model is ap-
plied to the 3D coordinates of each component to pro-
duce per-node features.
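A possible implementation of this construction is sketched below; it reflects our reading that two parts are connected whenever a point has one of its k nearest neighbours carrying a different part label:

```python
# Part graph construction sketch: connect two parts if any point of one has a
# k-nearest neighbour belonging to the other. Parameter k is an assumption.
import numpy as np
from scipy.spatial import cKDTree

def build_part_graph(points: np.ndarray, labels: np.ndarray, k: int = 8):
    """points: (n, 3) coordinates; labels: (n,) part-instance ids.
    Returns an undirected, unlabeled edge set over part ids."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)   # first neighbour is the point itself
    edges = set()
    for i, neighbours in enumerate(idx):
        for j in neighbours[1:]:
            if labels[i] != labels[j]:     # boundary between two different parts
                edges.add(tuple(sorted((int(labels[i]), int(labels[j])))))
    return edges
```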
3.3 Part Graph Learning using a GCN
Architecture
To learn a global representation from which the ob-
ject classes can be inferred, two different Compoint-
Net variants have been developed. The simpler vari-
ant is based on the 1-WL layer proposed by (Morris
et al., 2019), which aggregates the neighbourhood of
a node in a learnable way. The hidden representation $h_i'$ of node $i$ is computed as:

$$h_i' = \sigma\Big(W_1 h_i + W_2 \sum_{j \in N(i)} h_j\Big), \qquad (1)$$

where $h_i$ is the feature vector of node $i$, $W_1$ and $W_2$ are learnable weight matrices and $N(i)$ is the neighbourhood of node $i$. $\sigma$ refers to a non-linear activation function. Several of these layers applied sequentially to an input graph define a convolutional neural network architecture, which (Morris et al., 2019) refer to as 1-GNN. It implements a basic message passing scheme, in which a node aggregates the information of its neighbours; this way, information is propagated across the part graph along its edges. The WL based CompointNet sequentially applies three convolutional blocks consisting of several WL layers as described above. The hidden representations computed by each block are concatenated and further processed by a set of fully connected layers with a softmax activation at the end.
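For illustration, a direct implementation of the update in Equation (1) could look as follows (a dense-adjacency sketch of our own; torch_geometric.nn.GraphConv provides an equivalent sparse implementation of the same linear update):

```python
# 1-WL convolution from Eq. (1): h_i' = sigma(W1 h_i + W2 * sum_{j in N(i)} h_j),
# written with a dense adjacency matrix for clarity.
import torch
import torch.nn as nn

class WLConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)  # transforms the node itself
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)  # transforms the neighbour sum

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (n_nodes, in_dim); adj: (n_nodes, n_nodes) binary adjacency matrix
        neighbour_sum = adj @ h               # sum over N(i) for every node i
        return torch.relu(self.w1(h) + self.w2(neighbour_sum))
```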
The other version of CompointNet is based on the graph attention layer proposed by (Veličković et al., 2018). The proposed layer computes the hidden representation of a target node as the weighted sum of the features of its neighbours, including the node itself. For each target node, an attention mechanism assigns attention coefficients to the adjacent nodes in order to aggregate their information based on the importance of each node to the target node. The coefficient $e_{ij}$ for a pair of nodes $i$ and $j$ is computed as follows:

$$e_{ij} = a(W h_i, W h_j) = a^T [W h_i \,\|\, W h_j], \qquad (2)$$

where $h_i$ and $h_j$ correspond to the features of nodes $i$ and $j$ respectively, $W$ is a weight matrix and $\|$ refers to the concatenation operation. $e_{ij}$ is only computed if nodes $i$ and $j$ are adjacent. The attention mechanism is implemented as a multiplication with the learnable weight vector $a$, and the coefficients $\alpha_{ij}$ are obtained by normalizing the $e_{ij}$ with a softmax function. The hidden representation $h_i'$ of a target node $i$ is computed as

$$h_i' = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W h_j\Big). \qquad (3)$$

$\sigma$ refers to a non-linear activation function applied to the linear combination of the features of the neighbours of the target node $i$.
Figure 2: Hybrid CompointNet Architecture using graph attention. The input point cloud is passed to a PointNet architecture
as raw 3D coordinates (lower branch). The corresponding part graph is constructed using kNN, and node features are extracted
from the point cloud of each part using a PointNet model. Subsequently, the graph based GCN (GAT) followed by an MLP is
applied to the part graphs (upper branch). In the end the learned representations are combined using another MLP. The GAT
based GCN can be replaced with the WL based GCN shown on the right.
Accordingly, the feature of a target node is computed by summing the transformed features of adjacent nodes, which are weighted according to their importance to the target node. To increase the model's capacity, (Veličković et al., 2018) further introduce multi-head attention, which separately applies several attention mechanisms in the same layer and concatenates the resulting feature representations. The cor-
responding variant of CompointNet sequentially ap-
plies a set of GAT layers with multi-head attention,
of which the last layer computes only one attention
head. The obtained feature vector is passed to a set of
fully connected layers for final classification.
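A sketch of this variant using PyTorch Geometric's GATConv (which implements Equations (2) and (3)) is shown below; the layer sizes follow the configuration reported in Section 4, while the remaining details are our assumptions:

```python
# GAT based CompointNet sketch: two GAT layers (8 heads of 32 features, then a
# single head with 128 features), mean pooling per graph, and an MLP head.
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class GATCompointNet(nn.Module):
    def __init__(self, num_classes: int, node_dim: int = 512):
        super().__init__()
        self.gat1 = GATConv(node_dim, 32, heads=8, dropout=0.5)  # 8 * 32 = 256 out
        self.gat2 = GATConv(8 * 32, 128, heads=1, dropout=0.5)
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ELU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x, edge_index, batch):
        x = F.elu(self.gat1(x, edge_index))
        x = self.gat2(x, edge_index)
        g = global_mean_pool(x, batch)        # one 128-d vector per part graph
        return F.log_softmax(self.head(g), dim=-1)
```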
3.4 Combining Local and Global
Features
Both variants of CompointNet are extended to a hy-
brid version, which combines the above described lo-
cal CompointNet with the global MLP based method
PointNet. The resulting architecture fuses local in-
formation extracted from the part graphs with global
information obtained by individually processing each
point of the entire shape by a PointNet model. The
network is composed of two branches, which are
combined in the end. The PointNet based branch
takes the entire point cloud as input and produces a
feature vector, whilst the part graph based method learns a representation from the corresponding part graph. Both representations are concatenated and further transformed by an MLP to predict the class scores.
The architecture is shown in Figure 2. The hybrid
CompointNet extends the strictly part graph based version with information about the rough overall input shape learned from the keypoints extracted by PointNet.
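The fusion step can be summarized by the following sketch, which reflects our reading of Figure 2; the two branch modules are placeholders for the PointNet and part-graph GCN branches described above:

```python
# Hybrid fusion sketch: a global PointNet branch and a part-graph GCN branch run
# in parallel; their outputs are concatenated and classified by a final MLP.
# `pointnet_backbone` and `part_graph_gcn` are assumed branch modules.
import torch
import torch.nn as nn

class HybridCompointNet(nn.Module):
    def __init__(self, pointnet_backbone: nn.Module, part_graph_gcn: nn.Module,
                 global_dim: int, graph_dim: int, num_classes: int):
        super().__init__()
        self.global_branch = pointnet_backbone  # consumes raw point coordinates
        self.local_branch = part_graph_gcn      # consumes the part graph
        self.head = nn.Sequential(
            nn.Linear(global_dim + graph_dim, 256), nn.ELU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points, x, edge_index, batch):
        g_global = self.global_branch(points)              # (batch, global_dim)
        g_local = self.local_branch(x, edge_index, batch)  # (batch, graph_dim)
        fused = torch.cat([g_global, g_local], dim=-1)     # fuse global and local cues
        return self.head(fused)
```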
4 EXPERIMENTS
The proposed classification methods are applied to the
PartNet dataset (Mo et al., 2019), which comprises
about 27 000 distinct three-dimensional CAD mod-
els of 24 different object categories. PartNet provides
different levels of segmentation granularity, of which
the most fine-grained one is used in the conducted ex-
periments. To process the 3D objects of PartNet, only little preprocessing is required and no data augmenta-
tion is applied. As input to all models 1024 points
are dynamically sampled from the point cloud during
training and the entire shape is normalized into a unit
sphere. The performance of the proposed methods is
evaluated using 5-fold cross validation using the en-
tire PartNet dataset.
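This preprocessing amounts to a few lines; the sketch below assumes uniform random sampling, which the text does not specify further:

```python
# Preprocessing sketch: sample 1024 points and normalize the shape into a unit
# sphere (centred at the origin with maximum radius one).
import numpy as np

def preprocess(points: np.ndarray, n_samples: int = 1024) -> np.ndarray:
    idx = np.random.choice(len(points), n_samples,
                           replace=len(points) < n_samples)
    sampled = points[idx].astype(np.float64)
    sampled -= sampled.mean(axis=0)                   # centre at the origin
    sampled /= np.linalg.norm(sampled, axis=1).max()  # scale into the unit sphere
    return sampled
```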
The configuration of the PointNet model, which
computes the node feature for each component of a
3D shape, follows the configuration proposed by the
authors of PointNet with the alteration that the last
two fully connected layers are removed. The resulting
feature vector has a size of 512.
The WL based CompointNet consists of three
convolutional blocks and the output of each block
serves as input to the next block to increase the re-
ceptive field with increasing network depth. In the first block, three convolutional layers transform the node features of the part graphs from size 512 to 64. The next two convo-
lutional blocks consist of two convolutional layers of
output size 64. Each of the layers applies ELU activation and Layernorm, and dropout with a probability of 0.5 is applied before each convolutional block. To aggregate the per-node features, global pooling is applied to the output of each of the three convolutional blocks to obtain one vector per graph and per block. The resulting feature representation accordingly consists of three concatenated vectors of size 64, one produced by each convolutional block.
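Put together, the WL body might be sketched as follows; the intermediate layer sizes in the first block and the pooling type are our assumptions, and Layernorm is omitted for brevity:

```python
# WL based CompointNet body sketch: three GraphConv blocks mapping 512 -> 64,
# dropout before each block, global pooling per block, pooled vectors concatenated.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GraphConv, global_mean_pool

class WLBody(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([GraphConv(512, 256), GraphConv(256, 128), GraphConv(128, 64)]),
            nn.ModuleList([GraphConv(64, 64), GraphConv(64, 64)]),
            nn.ModuleList([GraphConv(64, 64), GraphConv(64, 64)]),
        ])
        self.drop = nn.Dropout(0.5)

    def forward(self, x, edge_index, batch):
        pooled = []
        for block in self.blocks:
            x = self.drop(x)                        # dropout before each block
            for conv in block:
                x = F.elu(conv(x, edge_index))
            pooled.append(global_mean_pool(x, batch))
        return torch.cat(pooled, dim=-1)            # 3 * 64 = 192 features per graph
```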
In the GAT based CompointNet, two subsequent
GAT layers are applied to the part graphs, of which
the first layer computes eight attention heads of 32
features each and the second a single attention head
with 128 features. In both layers attention coefficients
are dropped out with a probability of 0.5. ELU is used
as non-linear activation and Layernorm is applied af-
ter each convolutional layer. The output of the GAT
layers is a vector of size 128 per node, which is mean
pooled into a single feature vector per graph. Deeper
versions of both GCNs with more convolutional lay-
ers have been tested without significant performance
improvements, therefore only the configurations de-
scribed above are considered here.
For both variants, the obtained feature representations are passed to a three-layer MLP with ELU activation and Layernorm, followed by a log softmax activation to make the final class predictions.
The GCNs applied in the hybrid CompointNet models are configured as described above, except that the fully connected layers applied at the end are removed. The PointNet model integrated into the hybrid CompointNet is configured as described in the corresponding paper, with the alteration that the last three fully connected layers are removed as well. The feature representations obtained by the two branches are concatenated and passed to a three-layer MLP, as in the non-hybrid variants.
The training procedure for all proposed variants of
CompointNet includes a validation set, which com-
prises 20% of the training data to monitor the valida-
tion error and accuracy. Based on that, early stopping terminates the training if the overall validation accuracy has not increased within the last 15 epochs. The models are optimized on the negative log likelihood loss using Adam with a learning rate of 0.001. Every second epoch, the learning rate is decayed by a factor of 0.9. The training is performed on
GPU with a batch size of eight.
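The described setup corresponds roughly to the following loop; this is a sketch, with `evaluate` standing in as an assumed helper that returns validation accuracy:

```python
# Training sketch: NLL loss, Adam (lr 0.001), learning rate decayed by 0.9 every
# second epoch, early stopping after 15 epochs without validation improvement.
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, max_epochs=200, patience=15):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.9)
    best_acc, stale = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:          # batch size 8 in the experiments
            opt.zero_grad()
            log_probs = model(batch)        # model ends in log softmax
            loss = F.nll_loss(log_probs, batch.y)
            loss.backward()
            opt.step()
        sched.step()
        acc = evaluate(model, val_loader)   # assumed helper: validation accuracy
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:           # early stopping
                break
```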
5 RESULTS
We compare our model to four benchmark methods:
Vanilla PointNet (Qi et al., 2017a), PointNet++ (Qi
et al., 2017b) and two very recent transformer meth-
ods, which are PCT: Point Cloud Transformer (Guo
et al., 2021) and Point Transformer (Zhao et al., 2020)
described in section 2. The implementation of Point-
Net is provided by (Xia, 2017), which is referred to
by the authors of PointNet. PointNet++ is provided
by Pytorch Geometric and the transformer methods
have been implemented by (You, 2021).
Figure 3 shows a boxplot of the mean overall
accuracy across the different folds for our different
CompointNet variants and the benchmark models.

Figure 3: Overall Accuracy on PartNet. The hybrid versions of CompointNet perform consistently well, whereas the non-hybrid variants reach lower accuracies. PointNet++ performs better than PointNet, however it does not achieve the accuracy of the hybrid CompointNet. The transformer methods (right) outperform the PointNet methods and the non-hybrid variants of CompointNet, but cannot keep up with the hybrid WL based CompointNet.

Figure 4: Average Class Accuracy on PartNet. The hybrid variants of CompointNet consistently outperform the compared methods.
Figure 5: Overall accuracy of our approach vs. number of nodes in the part graph of an object. Different colors indicate the
amount of test samples with n nodes. The darker the color the higher the number of objects with that amount of nodes. The
red line indicates the mean accuracy.
Table 1: Classification results on PartNet. Our approach achieves state-of-the-art performance. The hybrid WL based CompointNet outperforms all other compared methods in terms of average per class and overall accuracy.

Model                                  | overall | per class
GAT CompointNet                        | 0.93    | 0.88
WL CompointNet                         | 0.94    | 0.90
Hybrid GAT CompointNet                 | 0.95    | 0.92
Hybrid WL CompointNet                  | 0.97    | 0.94
PointNet (Qi et al., 2017a)            | 0.93    | 0.88
PointNet++ (Qi et al., 2017b)          | 0.94    | 0.89
PCT (Guo et al., 2021)                 | 0.95    | 0.91
Point Transformer (Zhao et al., 2020)  | 0.95    | 0.90
It can be observed that our approach achieves com-
petitive performance among the benchmark architec-
tures. Amongst the variants of CompointNet, the
hybrid methods perform significantly better than the
strictly part-graph based models; in particular, the WL
based hybrid CompointNet achieves a high accuracy
of 97% on the test dataset. It outperforms Point-
Net and PointNet++ by 4% and 3% respectively and
achieves a 2% improvement over the transformer ar-
chitectures. In general, the WL based variants achieve higher accuracy than the GAT based models; however, the GAT based hybrid model shows significantly less variation across the cross-validation runs, similar to the transformer based approaches. The average
per class accuracy shown in Figure 4 indicates that
our approach not only performs well on frequent classes, but also generalizes to new objects of classes with less training data. Both hybrid
versions of CompointNet outperform all compared
methods and also the non-hybrid WL based Com-
pointNet can keep up with the PointNet models and
PCT in terms of per class accuracy. The GAT based
CompointNet reaches the same per class accuracy as
PointNet. Since the PartNet dataset is very unbal-
anced, the per class accuracy is of particular impor-
tance. Thus, the improvement of 3% and 4% achieved
by the hybrid WL based CompointNet models over
the two compared transformer methods is a promis-
ing result. The exact performance in terms of overall
accuracy and average per class accuracy is shown in
Table 1. Extending the part-graph based model by
including global information using a PointNet archi-
tecture enhances the performance of both the GAT based and the WL based model.
Figure 5 illustrates the overall accuracy across
objects with a certain number of nodes. It can be
observed that the hybrid models improve the pure
part graph based models, particularly for objects with
more nodes, i.e. more complex objects. The reason
might be that many nodes usually imply a fine-grained
segmentation into tiny parts, which might be more
difficult to detect for the GCN, since finer segmen-
tation tends to result in geometrically more similar
parts. Additionally, the information diffusion across
the graph is limited locally by the number of convo-
lutional layers in the GCN. Thus, for large graphs the
neighbourhood of a node covered by the GCN is small compared to the size of the graph. This might affect the performance of CompointNet on larger graphs. As indicated by the colors, a large fraction of objects is composed of fewer than 30 nodes. Accordingly, the hybrid approach improves the accuracy of CompointNet mainly on graph sizes that are rather rare.

Figure 6: Inference times of our approach and the compared models. The inference times are averaged over the full PartNet dataset using a batch size of one.
Figure 6 shows the inference times of the vari-
ants of our approach and the compared approaches.
PointNet is by far the fastest method, while Point-
Net++ and the CompointNet models show similar in-
ference times. The transformer methods are drasti-
cally slower. Accordingly, their superior performance
over the PointNet models comes with a significant in-
crease in computational complexity. This does not apply to our models, which are much more efficient than the transformer methods. Even the hybrid approaches, which show equal or better performance than the transformer methods, are significantly faster.
Since PointNet is included as feature extractor in each
part graph based model, it is clear that our approach
cannot keep up with the inference time of PointNet.
However, the computational complexity of the strictly
part graph based CompointNet is lower than for Point-
Net++. The hybrid versions of CompointNet require
a similar amount of time for the forward pass despite
the per part feature extraction and the global shape
processing, both performed separately using a Point-
Net model. The GCN applied in our approach ac-
counts only for a small fraction of the inference time,
since the per part feature extraction requires one for-
ward pass through the PointNet model for each part.
6 CONCLUSION
This paper investigates the potential of a part graph
based bottom-up approach to improve the classifica-
tion of 3D objects by means of ground truth segmen-
tation. It could be applicable for 3D search engines
or for cataloging of CAD models to improve their
accuracy and robustness. The proposed architecture
variants of CompointNet successfully leverage part
decomposition of 3D objects to learn local 3D fea-
tures using two different graph convolutional network
architectures. In particular, the hybrid approach, integrating a global approach and a part graph based approach in parallel, achieves state-of-the-art results.
The experiments have shown that a very basic GCN,
which computes node features as the sum over adja-
cent nodes multiplied with a learnable weight matrix,
is sufficient to learn high-level features for object clas-
sification. It even outperforms the more sophisticated
GAT based GCN. The hybrid methods outperform
Vanilla PointNet and PointNet++ and can keep up
with the compared transformer architectures. In the
conducted experiments the hybrid WL based Com-
pointNet even outperforms the transformer methods,
while being much more efficient. The hybrid models
achieved the highest per class accuracy amongst all
compared architectures, which is of particular impor-
tance for unbalanced datasets such as PartNet. The
proposed models leverage the fact that the variation across
a single part is usually lower than across the entire
object, which is expected to make the application of
PointNet as feature extractor more effective. Further-
more, using a GCN to extract 3D features makes use of
the structure of an object, which is a theoretical ad-
vantage compared to global approaches. The applica-
tion of CompointNet on automatic instead of ground
truth segmentation to further investigate the poten-
tial of such bottom-up approaches is left for future
work. To enable CompointNet to capture non-local
relationships between object components, the integra-
tion of deeper GCNs should be further investigated
to enhance the performance of the strictly part graph
based models on larger graphs. Furthermore, different
approaches for the per part feature extraction could
be explored to accelerate this task. By replacing the
PointNet model with a hand-engineered feature extractor, the inference time could be reduced drastically.
This might open up new application opportunities. Fi-
nally, the WL based CompointNet, which applies a
so-called 1-GNN proposed by (Morris et al., 2019),
could instead apply their 2- or 3-GNN, which operates
on k-sets of nodes instead of single nodes. This might
lead to further improvements regarding the classifica-
tion performance.
REFERENCES
Arnold, E., Al-Jarrah, O. Y., Dianati, M., Fallah, S., Ox-
toby, D., and Mouzakitis, A. (2019). A survey on
3d object detection methods for autonomous driving
applications. IEEE Transactions on Intelligent Trans-
portation Systems, 20(10):3782–3795.
Biederman, I. (1987). Recognition-by-components: A the-
ory of human image understanding. Psychological Re-
view, 94(2):115–147.
Chen, C., Li, G., Xu, R., Chen, T., Wang, M., and Lin, L.
(2019). ClusterNet: Deep Hierarchical Cluster Net-
work With Rigorously Rotation-Invariant Represen-
tation for Point Cloud Analysis. In 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 4989–4997, Long Beach, CA,
USA. IEEE.
Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R. R.,
and Hu, S.-M. (2021). PCT: Point cloud transformer.
Computational Visual Media, 7(2):187–199.
Joseph-Rivlin, M., Zvirin, A., and Kimmel, R. (2019).
Momen(e)t: Flavor the Moments in Learning to Classify Shapes. In 2019 IEEE/CVF International Confer-
sify Shapes. In 2019 IEEE/CVF International Confer-
ence on Computer Vision Workshop (ICCVW), pages
4085–4094, Seoul, Korea (South). IEEE.
Kharroubi, A., Hajji, R., Billen, R., and Poux, F. (2019).
Classification and integration of massive 3d points
clouds in a virtual reality (vr) environment. Interna-
tional Archives of the Photogrammetry, Remote Sens-
ing and Spatial Information Sciences, 42(W17):165–
171.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. arXiv
preprint arXiv:1609.02907.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-
agenet classification with deep convolutional neural
networks. Commun. ACM, 60(6):84–90.
Li, Y., Bu, R., Sun, M., Wu, W., Di, X., and Chen,
B. (2018). Pointcnn: Convolution on x-transformed
points. In Proceedings of the 32nd International Con-
ference on Neural Information Processing Systems,
NIPS’18, page 828–838, Red Hook, NY, USA. Cur-
ran Associates Inc.
Martínez, J. L., Morales, J., Reina, A. J., Mandow, A., Pequeño-Boter, A., and García-Cerezo, A. (2015). Construction and calibration of a low-cost 3d laser scanner with 360° field of view for mobile robots.
In 2015 IEEE International Conference on Industrial
Technology (ICIT), pages 149–154. IEEE.
Maturana, D. and Scherer, S. (2015). Voxnet: A 3d convolu-
tional neural network for real-time object recognition.
In 2015 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS), pages 922–928.
Mo, K., Zhu, S., Chang, A. X., Yi, L., Tripathi, S., Guibas,
L. J., and Su, H. (2019). PartNet: A large-scale bench-
mark for fine-grained and hierarchical part-level 3D
object understanding. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen,
J. E., Rattan, G., and Grohe, M. (2019). Weisfeiler
and Leman Go Neural: Higher-Order Graph Neural
Networks. Proceedings of the AAAI Conference on
Artificial Intelligence, 33(01):4602–4609.
Pfeiffer, J., Pfeiffer, T., Meißner, M., and Weiß, E.
(2020). Eye-tracking-based classification of informa-
tion search behavior using machine learning: evidence
from experiments in physical shops and virtual real-
ity shopping environments. Information Systems Re-
search, 31(3):675–691.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
net: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 652–660.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
net++: Deep hierarchical feature learning on point sets
in a metric space. arXiv preprint arXiv:1706.02413.
Shen, Y., Feng, C., Yang, Y., and Tian, D. (2018). Mining
Point Cloud Local Structures by Kernel Correlation
and Graph Pooling. In 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
4548–4557, Salt Lake City, UT. IEEE.
Straub, J. and Kerlin, S. (2014). Development of a large,
low-cost, instant 3d scanner. Technologies, 2(2):76–
95.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Net-
works. International Conference on Learning Repre-
sentations.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M.,
and Solomon, J. M. (2019). Dynamic Graph CNN
for Learning on Point Clouds. ACM Transactions on
Graphics, 38(5):1–12.
Weibel, J.-B., Patten, T., and Vincze, M. (2019). Addressing
the sim2real gap in robotic 3-d object classification.
IEEE Robotics and Automation Letters, 5(2):407–413.
Xia, F. (2017). Pointnet.pytorch. https://github.com/fxia22/pointnet.pytorch.
You, Y. (2021). Point-transformers. https://github.com/qq456cvb/Point-Transformers.
Zhao, H., Jiang, L., Jia, J., Torr, P., and Koltun, V. (2020).
Point Transformer. arXiv preprint arXiv:2012.09164.