Scan2Part: Fine-grained and Hierarchical
Part-level Understanding of Real-World 3D Scans
Alexandr Notchenko, Vladislav Ishimtsev, Alexey Artemov, Vadim Selyutin, Emil Bogomolov
and Evgeny Burnaev
Skolkovo Institute of Science and Technology, Moscow, Russian Federation
Keywords:
Semantic Segmentation, Scene Understanding, Volumetric Scenes, Part-level Segmentation.
Abstract:
We propose Scan2Part, a method to segment individual parts of objects in real-world, noisy indoor RGB-D
scans. To this end, we vary the part hierarchies of objects in indoor scenes and explore their effect on scene
understanding models. Specifically, we use a sparse U-Net-based architecture that captures the fine-scale detail
of the underlying 3D scan geometry by leveraging a multi-scale feature hierarchy. In order to train our method,
we introduce the Scan2Part dataset, which is the first large-scale collection providing detailed semantic labels
at the part level in the real-world setting. In total, we provide 242,081 correspondences between 53,618
PartNet parts of 2,477 ShapeNet objects and 1,506 ScanNet scenes, at two spatial resolutions of 2 cm³ and
5 cm³. As output, we are able to predict fine-grained per-object part labels, even when the geometry is coarse
or partially missing. Overall, we believe that both our method and the newly introduced dataset are a stepping
stone towards structural understanding of real-world 3D environments.
1 INTRODUCTION
In recent years, a wide variety of consumer-grade
RGB-D cameras such as Intel Real Sense (Kesel-
man et al., 2017), Microsoft Kinect (Zhang, 2012), or
smartphones equipped with depth sensors, enabled in-
expensive and rapid RGB-D data acquisition. The increasing availability of large, labeled datasets (e.g., (Chang et al., 2017; Dai et al., 2017)) has made possible the development of deep learning methods for 3D object classifi-
cation and semantic segmentation. At the same time,
acquired 3D data is often incomplete and noisy; while
one can identify and segment the objects in the scene,
reconstructing high-quality geometry of objects re-
mains a challenging problem.
An important class of approaches, e.g., recent
work (Avetisyan et al., 2019), uses a large dataset of
clean, labeled geometric shapes (Chang et al., 2015) for classification/segmentation, associating the input point or voxel data with object labels from the dataset and adapting the geometry to the 3D data. This ap-
proach ensures that the output geometry has high
quality, and is robust with respect to noise and miss-
ing data in the input. At the same time, a “flat” classi-
fication/segmentation approach, with each object in
the database corresponding to a separate label and
matched to a subpart of the input data corresponding
to the whole object, does not scale well as the num-
ber of classes grows and often runs into difficulties in
the cases of extreme occlusion (only a relatively small
part of an object is visible). Significant improvements
can be achieved by considering object parts, or, more
generally, part hierarchies. Part-based segmentation
of 3D datasets promises to offer a significant improve-
ment in a variety of tasks such as finding the best
matching shape in the dataset or recognizing objects
from highly incomplete data (e.g., from one or two
parts).
Man-made environments (e.g., indoor scenes)
commonly consist of objects that naturally form part
hierarchies where objects and their parts can be di-
vided into finer parts. In our work, we use scene
and object representation based on such part hierar-
chies and focus on the key problem of semantic part
segmentation of separate objects in the scenes, en-
abling further improvements in dataset-based recon-
struction. To this end, we construct a volumetric part-
labeled dataset of scanned 3D data suitable for ma-
chine learning applications. We use this new dataset
to explore the limits of part-based semantic segmen-
tation by training a variety of sparse 3D convolutional
neural networks (CNNs) in multiple setups.
Our contributions include:
1. A new method Scan2Part, aiming to segment vol-
umetric objects into semantic parts and their in-
stances, leveraging a sparse volumetric 3D CNN
model trained on a large-scale part-annotated col-
lection of objects.
2. A new dataset, Scan2Part, composed of 1,506 3D reconstructions of real-world scenes with 2,477 aligned 3D CAD models represented by real-world 3D geometry and labeled with a hierarchical annotation that links the 3D scene reconstructions with part annotations of indoor objects.
2 RELATED WORK
Deep Learning for 3D Scene Understanding. Deep
learning has been applied for semantic 3D scene un-
derstanding in a variety of ways, and we only review
the core work related to the semantic and instance
segmentation of 3D scenes. Most relevant to our
work, deep learning approaches have been used on
3D reconstructions of scenes represented by 3D volu-
metric grids (Dai et al., 2017; Dai and Nießner, 2018;
Hou et al., 2019; Liu and Furukawa, 2019). For a volumetric grid, 3D convolutions may be defined anal-
ogously to 2D convolutions in image domains, giv-
ing rise to 3D convolutional neural nets (3D CNNs).
Memory requirements have been addressed by adap-
tive data structures (Wang et al., 2017). Similar to
this body of work, we operate on volumetric repre-
sentations of 3D scenes, but perform segmentation of
individual parts.
Methods operating on raw point clouds provide
an alternative to volumetric 3D CNNs by constructing appropriate operations directly on unstruc-
tured point clouds for a variety of applications in-
cluding semantic labeling (e.g., (Qi et al., 2017a; Qi
et al., 2017b; Klokov and Lempitsky, 2017; Wang
et al., 2018; Wang et al., 2019b)). Most recently, in-
stance (Elich et al., 2019; Liang et al., 2019; Elich
et al., 2019; Yi et al., 2019; Yang et al., 2019; Zhang
and Wonka, 2019; Engelmann et al., 2020) and joint
semantic-instance (Wang et al., 2019a; Pham et al.,
2019b) segmentation tasks on point clouds have been
considered closely. While point-based methods re-
quire less computation, learning with irregular struc-
tures such as point clouds is challenging. To segment
instances, recently proposed volumetric and point-
based approaches use metric learning to extract per-
point embeddings that are subsequently grouped into
object instances (Elich et al., 2019; Yi et al., 2019;
Lahoud et al., 2019; Liu and Furukawa, 2019). We
take advantage of this mechanism in our instance seg-
mentation methods.
Part-aware segmentation methods commonly fo-
cus on meshes or complete, clean point clouds con-
structed from 3D CAD models. The most closely re-
lated work is semantic parts labeling (e.g., (Yi et al.,
2016; Wang et al., 2017; Qi et al., 2017a; Mo et al.,
2019b; Yi et al., 2019; Zhang and Wonka, 2019))
and part instance segmentation (Zhang and Wonka,
2019) for voxelized or point-sampled 3D shapes.
Other works focus on leveraging the part structure of
clean shapes for co-segmentation (Chen et al., 2019;
Zhu et al., 2019), hierarchical mesh segmentation (Yi
et al., 2017), shape assembly/generation (Mo et al.,
2019a; Wu et al., 2019b; Wu et al., 2019a; Mo et al.,
2020), geometry abstraction (Russell et al., 2009; Li
et al., 2017; Sun et al., 2019), and other applications.
In comparison, our focus is on learning part-based se-
mantic and instance segmentation of noisy and frag-
mented real-world 3D scans. Very recently, initial ap-
proaches to semantic 3D segmentation have been pro-
posed (Bokhovkin et al., 2021; Uy et al., 2019) but
for a significantly less extensive part hierarchy. More
specifically, (Bokhovkin et al., 2021) targets predict-
ing part hierarchy at object and coarse parts levels,
discarding smaller parts altogether; in contrast, we are
able to predict parts at finer levels in the hierarchy.
Other methods have been studied in the context
of 3D scene segmentation, such as complementing
CNNs with conditional random fields (Pham et al.,
2019b; Pham et al., 2019a; Wang et al., 2017), how-
ever, these are beyond the scope of this paper.
3D Scene Understanding Datasets. A body of work
focuses on rendering-based methods, aiming to generate realistic 3D scenes procedurally (Fisher et al., 2012; Handa
et al., 2016; Song et al., 2017; McCormac et al.,
2017; Li et al., 2018; Garcia-Garcia et al., 2018).
Such datasets can in principle provide arbitrarily fine
semantic labels but commonly suffer from the real-
ity gap caused by synthetic images; in contrast, our
proposed dataset is built by transferring part annota-
tions to real-world noisy scans. Recent advances in
RGB-D sensor technology have resulted in the de-
velopment of a variety of 3D datasets capturing real
3D scenes (Armeni et al., 2016; Hua et al., 2016;
Dai et al., 2017; Chang et al., 2017; Armeni et al.,
2017; Straub et al., 2019), however, none of these
provide part-level object annotations. In contrast, our
dataset provides semantic and instance part labels for
a large-scale collection of indoor 3D reconstructions.
ScanObjectNN (Uy et al., 2019) provides part annotations for objects in real-world scenes; however, it does not include annotations in occluded regions and only specifies part labels at the coarsest levels of the part hierarchy. This collection thus does not allow reasoning about the fine-grained structure of the objects.
Figure 1: The pipeline for obtaining the Scan2Part dataset. We project the PartNet (Mo et al., 2019b) labels to the ShapeNet (Chang
et al., 2015) coordinate system (left), then use alignments in Scan2CAD (Avetisyan et al., 2019) (bottom) to map labels to
real indoor scenes from ScanNet (Dai et al., 2017) (right).
Our data construction approach, in comparison, allows us to flexibly select part representation levels, and our experiments include results for 13, 36, and 79 part categories at the first through third levels of the hierarchy.
Early collections of part-annotated meshes (Chen
et al., 2009) are limited by their relatively small
scale. With the introduction of a comprehensive
ShapeNet benchmark (Chang et al., 2015), a coarse
semantic part annotation has been created using active
learning (Yi et al., 2016). More recently, a large-scale
effort to systematically annotate 3D shapes within a
coherent hierarchy was presented (Mo et al., 2019b).
Still, none of these CAD-based collections include
real-world 3D data, limiting their potential use. Our
benchmark is designed to address this reality gap.
Large-scale 3D understanding datasets commonly
require costly manual annotations by tens to hundreds
of expert crowd workers (annotators), preceded by
the development of custom labeling software (Armeni
et al., 2016; Hua et al., 2016; Dai et al., 2017; Chang
et al., 2017; Armeni et al., 2017; Straub et al., 2019;
Yi et al., 2016; Mo et al., 2019b). Moreover, annotating parts in 3D objects from scratch is complicated by the inherent ambiguity in part definitions, as revealed
by (Yi et al., 2016; Mo et al., 2019b). This chal-
lenge is even more pronounced for noisy, incomplete
3D scans produced by RGB-D fusion. We have cho-
sen to instead build our Scan2Part dataset fully au-
tomatically by leveraging correspondences between
four publicly available 3D collections: ScanNet (Dai
et al., 2017), Scan2CAD (Avetisyan et al., 2019),
ShapeNet (Chang et al., 2015), and PartNet (Mo et al.,
2019b) datasets. As a result, we (1) avoid ambiguity in part definitions by re-using the consistent,
well-defined labels from (Mo et al., 2019b), and (2)
are able to compute appropriate levels of semantic de-
tail for our benchmark without manual re-labeling.
Assembly from Parts and Hierarchy. Over the last 20 years, many researchers have converged on the idea that scenes and images are better represented as dis-
crete structures with relational and hierarchical prop-
erties. New datasets that make relations between visual objects explicit have recently been created (Krishna et al.,
2017), spurring new research in scene graphs (John-
son et al., 2018; Xu et al., 2017) and reconstruction.
In cognitive science, it has long been conjectured that the ability to compose complex objects and scenes from parts is a fundamental part of human
perception (Hoffman and Richards, 1984; Biederman,
1987). The concept of “recognition-by-components” is closely related to the “analysis-by-synthesis” (Yuille and Kersten, 2006; Yildirim et al., 2015) approach
in machine learning. Zhu and Mumford (Zhu and Mumford, 2006) assume that spatial scenes can be defined by a “grammar” of spatial objects and geometric primitives, similar to sentences in natural language, where a sequence of words can have multiple parsing trees, each providing a different explanation of the same sentence. Multiple solutions are expected because an under-constrained inverse problem (in this case, an inverse graphics problem) generally admits them. The solution can be made unique with stronger priors, or the ambiguity can be resolved in downstream processing if the uncertainty is propagated.
3 METHOD
3.1 Dataset Construction
Data Preparation. Our first objective is to develop a
large-scale 3D scene understanding dataset with part-
level annotations. We use the 3D geometry in 1,506
3D scenes in Scannet dataset (Dai et al., 2017) recon-
structed from RGB-D scans in the form of truncated
Signed Distance Function (SDF) at voxel resolution
= 2 cm and 5 cm. For labeling the scan geometry, we
use the parts taxonomy in PartNet dataset (Mo et al.,
2019b) represented as a tree structure where nodes en-
code parts at various detail levels and edges encode
the “part of relationship. We associate to each 3D
scene a per-voxel mask storing leaf part IDs from the
taxonomy (i.e., the most fine-grained categories). To
label the 3D scene at a given level d of semantic de-
tail (d = 1 meaning whole objects and d = 8 meaning
finest parts), we start with the leaf labels in each voxel
and traverse the taxonomy tree until hitting depth d.
We further describe the main steps taken to create
our Scan2Part benchmark below, leaving the detailed
discussion of the technicalities for the supplementary.
The annotation schema for our dataset is shown in Figure 1. The final result of our procedure is a collection of scenes comprised of volumetric instances of separate objects, where background and non-object information is removed, allowing us to focus on part-based segmentation.
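For concreteness, a minimal sketch of the relabeling step described above, assuming the taxonomy is available as child-to-parent and node-to-level dictionaries (all names and the data layout are illustrative, not the actual implementation):

```python
def label_at_level(leaf_id, parent, level, d):
    """Map a leaf part ID to the label at detail level d by walking up the taxonomy.

    parent: dict {node_id: parent_id} (object-level nodes map to None)
    level:  dict {node_id: level in the tree} (1 = whole object, 8 = finest part)
    d:      requested level of semantic detail
    """
    node = leaf_id
    # Climb towards the root until reaching a node at (or above) the requested level.
    while parent.get(node) is not None and level[node] > d:
        node = parent[node]
    return node


# Relabel a per-voxel mask {voxel_index: leaf_id} at the object level (d = 1).
def relabel_mask(voxel_to_leaf, parent, level, d=1):
    return {v: label_at_level(leaf, parent, level, d) for v, leaf in voxel_to_leaf.items()}
```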
Transferring Labels to Volumetric 3D Grids. To
obtain ground-truth semantic parts annotations for
real-world 3D scenes, we establish correspondences
between each voxel of a 3D scene in ScanNet and a set of part-annotated mesh vertices in a registered
3D CAD model from PartNet. To this end, we first
find accurate 9 degrees of freedom (9 DoF) transfor-
mations between PartNet 3D models and their orig-
inal versions from ShapeNet, and next use the man-
ually annotated 9 DoF transformations and their re-
spective object categories provided by Scan2CAD to
obtain the final scan-to-part alignments. We further
perform a simple majority voting, selecting only the
most frequent (among the vertices) part label as the
ground truth voxel label.
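As an illustration, the majority-voting step could be sketched as follows, assuming each voxel has already been associated with the part labels of the aligned PartNet mesh vertices that fall inside it (the data layout and names are ours, for illustration only):

```python
from collections import Counter

def vote_voxel_labels(voxel_to_vertex_labels):
    """voxel_to_vertex_labels: dict {voxel_index: [part label of each aligned
    mesh vertex that lands inside this voxel]}.
    Returns {voxel_index: most frequent part label}; voxels without any
    aligned vertices remain unlabeled."""
    voxel_labels = {}
    for voxel, labels in voxel_to_vertex_labels.items():
        if labels:
            voxel_labels[voxel] = Counter(labels).most_common(1)[0][0]
    return voxel_labels
```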
This procedure results in 242,081 correspon-
dences represented as 9 DoF transformations between
1,506 reconstructions of real-world ScanNet scenes
and 53,618 unique parts of 2,477 ShapeNet objects.
Parts of each object have a tree structure, similar
to (Mo et al., 2019b). Note that the majority-based
voting implementation of annotation transfer results in some semantic part labels not being represented in the 3D scan, ultimately affecting the part taxonomy,
which we discuss below.
Parts Taxonomy Processing. The taxonomy of
3D shape parts is represented as a single tree struc-
ture based on (Mo et al., 2019b). However, as the
voxel resolution of the real-world 3D scan is relatively
coarse, particular part categories (e.g., small shape de-
tails such as keyboard buttons or door handles) cannot be represented by a sufficient number of 3D data points, thus implying a reduction of the original taxonomy. We perform this reduction by first choosing an appropriate occurrence threshold (we pick 1800 voxels, but demonstrate the effect of different threshold values in the supplementary) and removing classes that have a smaller number of representatives in the dataset.
We finalize the taxonomy by pruning trivial paths in
the tree (i.e., if a vertex has only one child, then we
delete this vertex by connecting the child and the par-
ent of this vertex), but keeping the leaf labels intact.
We display the number of vertices at different gran-
ularities in the original and resulting part taxonomy
levels in Table 1. Some parts (leaves in the part taxonomy) are not represented in ScanNet data, so we
remove these from the tree. Figure 3 demonstrates
example annotations produced by our automatic pro-
cedure.
Table 1: Number of parts on each tree level.

Level              1 (object)   2     3     4     5     6     7     8 (part)
Full Taxonomy          18       50   133   223   269   302   306   307
Pruned Taxonomy        13       36    79
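The taxonomy reduction could be sketched as follows: drop part categories below the occurrence threshold and collapse single-child chains while keeping leaf labels. The tree layout (children and voxel-count dictionaries) is an assumption made for illustration:

```python
def prune_taxonomy(root, children, voxel_counts, threshold=1800):
    """children: {node: [child, ...]}; voxel_counts: {node: number of labeled voxels}.
    Returns a pruned {node: [child, ...]} mapping rooted at `root`."""
    # 1. Remove under-represented part categories (their subtrees go with them).
    alive = lambda n: voxel_counts.get(n, 0) >= threshold
    pruned = {n: [c for c in kids if alive(c)]
              for n, kids in children.items() if alive(n)}

    # 2. Collapse trivial paths: a vertex with a single child is deleted and its
    #    child re-attached to the vertex's parent; leaves are kept intact.
    def skip_chain(n):
        while len(pruned.get(n, [])) == 1:
            n = pruned[n][0]
        return n

    result, stack = {}, [root]
    while stack:
        n = stack.pop()
        kids = [skip_chain(c) for c in pruned.get(n, [])]
        result[n] = kids
        stack.extend(kids)
    return result
```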
To aid reproducibility, we will release all the nec-
essary code to combine the datasets with minimal effort while respecting their license terms.
3.2 Evaluation Protocol
Based on our dataset, we propose a novel benchmark
for part-level 3D scan understanding, offering three
core tasks, namely semantic labeling, hierarchical se-
mantic segmentation, and semantic instance segmen-
tation; we overview these tasks in the context of our
datasets below. As inputs in all tasks we supply the
voxelized SDF (with RGB information), represent-
ing 3D geometry and appearance of individual ob-
jects, already separated from the background voxels
in the scene. Note that our tasks are similar to part-
level object understanding but operate on real-world
3D shapes.
Evaluation Tasks. In semantic labeling, the goal is to associate a set of $n$ semantic part labels $y_j = (y_j^{d_1}, \ldots, y_j^{d_n})$ with each voxel $v_j$, at detail levels $d_1, \ldots, d_n$. Compared to object-level segmentation, the part-level task becomes even more challeng-
Figure 2: Top: our dataset is obtained by combining PartNet synthetic data with ScanNet sensor data. Bottom: the PartNet
object hierarchy is compressed to include only parts sufficiently well represented in ScanNet data.
Figure 3: Example annotations produced by our automatic procedure at the spatial resolution of 5 cm³. From left to right, part-annotated meshes and voxelized shapes in PartNet; fragments of their respective reconstructions in ScanNet; voxelized part-annotated shapes with and without background (non-object) voxels in Scan2Part.
ing, particularly at deeper levels of the taxonomy that require predicting increasingly fine-grained
categories. We provide 1,506 scenes with 53,618
unique parts of 2,477 objects.
For hierarchical semantic segmentation, one must
perform segmentation at all levels in the hierarchy, in-
ferring labels in coarse- and fine-grained detail levels
simultaneously.
For instance segmentation, the goal is to simulta-
neously perform part-level semantic labeling and as-
sign each voxel $v_j$ a unique part instance ID (e.g., to
differentiate separate legs of a table).
For both semantic and instance segmentation, we follow the original train/val split of the ScanNet dataset (Dai et al., 2017).
The Choice of Scene Understanding Levels. We
evaluate the algorithms at three granularity levels
for each object category: coarse-, middle- and fine-
grained, roughly corresponding to evaluation in (Mo
et al., 2019b).
Quality Measures. We evaluate semantic labeling and hierarchical segmentation models by inferring the semantic labels for entire input scenes and computing quality measures at each scene understanding level $d_k$ separately. More specifically, for each class $c$ present in the set of classes $C_k$ at granularity $d_k$, we compute the standard Intersection over Union score $\mathrm{IoU}_c$ and the balanced accuracy score $\mathrm{Acc}_c$. We report these per-class numbers along with the mean IoU and mean balanced accuracy averaged over $C_k$:
$$\mathrm{mIoU}_k = \frac{1}{n_k} \sum_{c \in C_k} \mathrm{IoU}_c, \qquad \mathrm{mAcc}_k = \frac{1}{n_k} \sum_{c \in C_k} \mathrm{Acc}_c.$$
We additionally evaluate hierarchical semantic segmentation by averaging mIoU over all hierarchy levels $k \in \{1, \ldots, K\}$.
Instance segmentation is assessed as object detection; we thus evaluate this task using average precision (AP) at an IoU threshold of 0.5. To generate object hypotheses, each instance is checked against a confidence threshold of 0.25 to filter out noisy voxels.
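A minimal sketch of the per-level quality measures described above, computing per-class IoU and per-class accuracy from predicted and ground-truth voxel labels at a single granularity (plain NumPy; treating Acc_c as the per-class recall is our reading of the balanced accuracy score):

```python
import numpy as np

def level_metrics(pred, gt, classes):
    """pred, gt: 1-D integer arrays of per-voxel labels at one detail level d_k.
    classes: the set C_k of class IDs evaluated at this level.
    Returns (mIoU_k, mAcc_k) averaged over C_k."""
    ious, accs = [], []
    for c in classes:
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(inter / union)      # IoU_c
        if g.sum() > 0:
            accs.append(inter / g.sum())    # Acc_c: recall of class c
    return float(np.mean(ious)), float(np.mean(accs))
```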
3.3 Proposed Approach
The input to all our models is the voxelized object
SDF, representing 3D geometry either with or without
associated RGB values.
Part-level 3D Understanding. To predict parts in
our multi-label formulation, we produce a set of softmax scores $p_j = (p_j^{d_1}, \ldots, p_j^{d_n})$. Our models for se-
mantic and instance segmentation tasks are 3D CNNs
identical in architecture up to the last layer. We
note that training a 3D CNN model for segmentation
is a computationally challenging problem; thus, we
opted to make it more tractable by using frameworks
for sparse differentiable computations. Specifically,
we use Minkowski Networks (Choy et al., 2019a;
Choy et al., 2019b), a popular sparse CNN backbone,
to implement large sparse fully-convolutional neu-
ral networks for geometric feature learning. We use
the Res16UNet34C architecture that showed state-of-
the-art performance on multiple scene understanding
tasks. For each input voxel, the network produces a
32-dimensional feature vector that is further appro-
priately transformed by the last layer, accommodating the required number of predicted classes. All our net-
works are fully convolutional, enabling inference for
scenes with arbitrary spatial extents.
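For concreteness, a sketch of how a voxelized object could be fed to a Minkowski-style sparse network; the toy model below is a stand-in for the Res16UNet34C backbone we actually use, and the exact SparseTensor/constructor conventions may differ across MinkowskiEngine versions:

```python
import torch
import MinkowskiEngine as ME

# Toy stand-in for the Res16UNet34C backbone: sparse conv -> ReLU -> linear head
# producing per-voxel class logits (13 = number of level-1 object classes).
net = torch.nn.Sequential(
    ME.MinkowskiConvolution(in_channels=4, out_channels=32, kernel_size=3, dimension=3),
    ME.MinkowskiReLU(),
    ME.MinkowskiLinear(32, 13),
)

# Occupied voxels of one scan: integer coordinates (batch_index, x, y, z) and
# per-voxel features (here, an SDF value plus RGB color).
xyz = torch.stack(torch.meshgrid(*(torch.arange(8),) * 3, indexing="ij"), dim=-1).reshape(-1, 3)
coords = torch.cat([torch.zeros(len(xyz), 1, dtype=torch.long), xyz], dim=1).int()
feats = torch.rand(len(xyz), 4)

x = ME.SparseTensor(feats, coordinates=coords)
logits = net(x).F  # (N, 13) logits for the occupied voxels only
```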
We train the network in multiple setups, differing by the structure of supervision available to the network, each time evaluating labeling performance at each detail level $d_i$. Specifically, we define our loss function $L$ to be a weighted sum of cross-entropy losses for each level of detail,
$$L(p, y) = \sum_{k=1}^{K} \alpha_k \, L_{\mathrm{CE}}(p^k, y^k), \qquad (1)$$
and specify a set of weighting schemes for $\alpha_1, \ldots, \alpha_K$. This loss structure allows expressing both “flat” segmentation formulations (e.g., choosing $\alpha_k = \delta_{ki}$ to segment at level $d_i$ only), that we view as baselines, and multi-task formulations that integrate training signal across multiple levels of semantic detail.
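A minimal PyTorch sketch of the objective in (1), assuming the network outputs one logit tensor per detail level (the variable names are ours):

```python
import torch.nn.functional as F

def multi_level_loss(logits_per_level, labels_per_level, alphas):
    """logits_per_level: list of (N, |C_k|) tensors, one per detail level k.
    labels_per_level:  list of (N,) integer label tensors for the same voxels.
    alphas:            weights alpha_k, e.g. [0.7, 0.2, 0.1] for MTT-123-coarse."""
    loss = 0.0
    for alpha, logits, labels in zip(alphas, logits_per_level, labels_per_level):
        if alpha > 0:  # alpha_k = 0 disables supervision at level k ("flat" baselines)
            loss = loss + alpha * F.cross_entropy(logits, labels)
    return loss
```

Setting the weights to a one-hot vector recovers the single-level baseline configurations of Table 5.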
Similarly to (Mo et al., 2019b), we approach this task using bottom-up and top-down methods. The bottom-up method performs segmentation at the most fine-grained level and propagates the labels to the object level, leveraging the taxonomy structure. Conversely,
the top-down approach infers labels first at coarse
level (starting with objects) and subsequently at finer
levels (parts), recursively descending along predicted
taxonomy branches. We note that multi-task training
using (1) also results in a hierarchical segmentation
method, and include it in the comparison.
Part Instance Segmentation. We employ a dis-
criminative loss function which has demonstrated
its effectiveness in previous works (De Brabandere
et al., 2017; Pham et al., 2019b) and integrates
intra-instance clustering and inter-instance separation
terms along with a small regularization component.
Compared to architectures that use region proposal
modules (Yi et al., 2019; Pham et al., 2019b; Engel-
mann et al., 2020), this results in a more computation-
ally lightweight architecture, while reducing the instance segmentation problem to a metric learning task. At inference time, we cluster voxel feature vectors to produce part instances in a scene using the mean-shift algorithm (Comaniciu and Meer, 2002).
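At inference time, the grouping step could look roughly like the following, using scikit-learn's MeanShift as a stand-in implementation (the bandwidth value is an assumption, not a reported hyperparameter):

```python
import numpy as np
from sklearn.cluster import MeanShift

def extract_part_instances(voxel_embeddings, bandwidth=0.5):
    """voxel_embeddings: (N, D) array of per-voxel feature vectors learned with
    the discriminative loss. Returns an (N,) array of part-instance IDs."""
    return MeanShift(bandwidth=bandwidth).fit(np.asarray(voxel_embeddings)).labels_
```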
Figure 4: Histograms of object statistics for the coarsest (d = 1) level of detail: LEFT: number of voxels per object, CENTER: the number of corresponding objects in Scan2Part, RIGHT: the total number of object voxels in Scan2Part.
4 EXPERIMENTS
4.1 Experimental Setup
Data and Training. We select 80% of the scenes in
ScanNet for training, keeping 5% as a mini-validation
to tune hyperparameters, and put aside 20% of the scenes for testing. Due to the high class imbalance, some classes can be under-represented in the test set. The test set was therefore selected through iterative discrete optimization, enforcing the desired 80/20 proportion across all classes (especially the smaller ones).
We optimize our models using Adam (Kingma and Ba, 2014), initializing the learning rate proportionally to the batch size with a base learning rate of 0.003 and $\beta_1 = 0.9$, $\beta_2 = 0.999$. During training, the learning rate follows a multi-step exponential decay schedule with $\gamma = 0.2$, and we apply a weight decay of $10^{-4}$. We train until early stopping with a patience of 10, or until we reach a maximum epoch limit of 200. Depending on the specific model, training can take several hours on our dedicated server equipped with 4 Nvidia GTX 1080 GPUs.
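For reference, a sketch of this optimizer configuration in PyTorch; the backbone stand-in and the milestone epochs of the multi-step schedule are placeholders, not values reported here:

```python
import torch

backbone = torch.nn.Linear(32, 13)  # stand-in for the sparse U-Net backbone

optimizer = torch.optim.Adam(
    backbone.parameters(), lr=0.003, betas=(0.9, 0.999), weight_decay=1e-4)
# Multi-step exponential decay with gamma = 0.2; milestone epochs are placeholders.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2)
# Training loop skeleton: optimizer.step() per batch, scheduler.step() per epoch,
# with early stopping (patience 10) or a hard limit of 200 epochs.
```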
Baselines. SparseConvNet (Graham et al., 2018) is
an implementation of sparse 3D CNNs; though it
has a smaller number of sparse layer types, it has
enough features to implement a U-Net style fully-
convolutional architecture we need to compute geo-
metric features of voxels. As a baseline, we also train
a Dense 3D CNN (Lee et al., 2017) model based on
U-Net architecture with ResNet blocks.
To train the baselines, we extract 16 volumetric crops of size 64³ from each voxelized scene and ran-
domly shuffle them across all scenes in the training
split. The batch size was selected from 4 to 48 in dif-
ferent experiments, subject to fitting the model into
the available GPU memory. Because this model’s im-
plementation requires slicing of the scene, it cannot
be evaluated for per-instance metrics.
Semantic Part Segmentation. Table 5 summarizes
our model specifications, differing by the choice of
weights in (1). The first three rows correspond to
training a single level-of-detail semantic segmenta-
tion model. The last three rows define a Multi-Task-
Training (MTT) objective, where the learned features of the occupied voxels are projected with linear layers to the label spaces of the different levels, and the per-level losses are combined for training.
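A sketch of how the shared per-voxel features could be projected to the per-level label spaces with separate linear heads (an illustration under our own naming, to be combined with the loss in (1)):

```python
import torch.nn as nn

class MultiLevelHeads(nn.Module):
    """Project shared 32-dimensional per-voxel features to logits at each detail level."""
    def __init__(self, feat_dim=32, classes_per_level=(13, 36, 79)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, c) for c in classes_per_level)

    def forward(self, voxel_feats):                 # voxel_feats: (N, feat_dim)
        return [head(voxel_feats) for head in self.heads]   # one (N, |C_k|) tensor per level
```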
Part Instance Segmentation. This task tests the ability of our models to separate parts solely based on their shape and mutual position within an object, disregarding semantic class information.
Hierarchical Segmentation. Performing hierarchi-
cal segmentation of voxels in a scene according to a
taxonomy of objects and their parts, described in Section 3.1, is a challenging task that can be approached in different ways:
Top-Down. This approach assigns labels to a voxel by solving a sequence of smaller classification tasks: it predicts the distribution over semantic object labels first, followed by a prediction of first-level parts, and continues until a voxel can be assigned a leaf label from the taxonomy. Every prediction produces a SoftMax distribution over possible sub-parts; during evaluation, voxels are masked based on the ground truth of the “parent” part.
Bottom-Up. The model predicts a leaf label for each voxel at the finest level of detail, and semantic part segmentation at any level of detail can be computed by “projecting” up: probabilities of part labels sharing the same “parent” part are added together, as sketched below.
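A sketch of the bottom-up projection: per-voxel probabilities at the finest level are summed over sibling classes that share the same parent to obtain the coarser-level scores (the leaf-to-parent mapping is an illustrative assumption):

```python
import torch

def project_up(fine_probs, fine_to_parent, num_parent_classes):
    """fine_probs: (N, n_fine) SoftMax scores at the finer level.
    fine_to_parent: sequence of length n_fine mapping each fine class index to
    its parent class index at the coarser level.
    Returns (N, num_parent_classes) scores at the coarser level."""
    parent_probs = torch.zeros(fine_probs.shape[0], num_parent_classes)
    for fine, parent in enumerate(fine_to_parent):
        parent_probs[:, parent] += fine_probs[:, fine]
    return parent_probs
```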
4.2 Comparative Studies
Can We Recognize Parts in Real-world Scans?
Table 4 demonstrates part-level semantic 3D under-
standing performance of our approach vs. the base-
line methods. We show that adding semantic informa-
tion from multiple levels of detail via part objectives
Figure 5: Semantic segmentation prediction from Minkowski Engine (a) and Submanifold (b) models.
Table 2: Instance segmentation results on levels d1, d2, and d3.

Model            Lvl  Micr.  Disp.  Lamp   Lapt.  Bag    Stor.  Bed    Table  Chair  Dishw. Trash. Vase   Keyb.  Avg
Ours (w/ color)  d1   0.790  0.676  0.794  0.852  0.808  0.798  0.795  0.852  0.900  0.884  0.840  1.000  0.874  0.839
                 d2   0.856  0.356  0.667  0.710  0.275  0.414  0.505  0.467  0.562  0.423  0.640  0.488  0.440  0.757
                 d3   0.228  0.146  0.200  0.234  0.484  0.279  0.743  0.667  0.490  0.176  0.333  0.244  0.807  0.711
num. instances        142    26     11     33     496    69     543    754    6      199    9      18     21     179
Figure 6: Qualitative hierarchical semantic segmentation results using our models trained at each respective part-category level, and using our proposed method. Note the segmentation performance, particularly at finer levels of the part taxonomy.
in (1) significantly improves the models' segmentation performance across hierarchy levels.
Does Adding Part-level Annotation Improve
Object-level 3D Understanding? Table 3 demon-
strates results across different object classes and in
different inference setups, obtained at the object level and lower levels of detail.
Does Pre-training for Part Segmentation Help
Achieve Hierarchical / Instance Tasks? We present
performance of hierarchical semantic segmentation in
Table 3. Note that although the baselines focus solely on a single level of semantic detail, our method is able to leverage the multi-task objective in (1) to perform more effective segmentation.
Part Instance Segmentation Results. We present instance segmentation results in Table 2. The results indicate that segmentation accuracy decreases at finer levels of detail, presumably because object part masks become more generic, closer to shape primitives. The metric is on par with the object level for particular objects with less shape variability (e.g., Bag, Bed, Table, Keyboard).
4.3 Ablative Studies
What is required for efficient part-level understanding? The following ablations indicate what matters most in our problem domain.
Effect of Backbone Network. In Table 4 one can
see that having a sparse backbone is crucial to the se-
mantic segmentation ability of the model. There is a moderate relationship between model size/complexity and performance. We are confident that we got close
to the limits of segmentation performance given the
quality of data in the benchmark.
Effect of Loss. Most of our models in Table 4 are
trained with a weighted cross-entropy loss to combat
Table 3: Hierarchical semantic segmentation results in terms of mean IoU and mean balanced accuracy for different semantic
granularities. We include mIoU averaged over all hierarchy levels as an integral measure.
Model Lvl Micr. Disp. Lamp Lapt. Bag Stor. Bed Table Chair Dishw. Trash. Vase Keyb. Avg
Flat d1 0.408 0.704 0.558 0.202 0.581 0.811 0.777 0.783 0.896 0.003 0.639 0.403 0.112 0.529
d2 0.199 0.471 0.526 0.298 0.346 0.035 0.423 0.351 0.301 0.183 0.199 0.302 0.033 0.282
d3 0.070 0.262 0.380 0.170 0.214 0.213 0.231 0.344 0.193 0.000 0.102 0.208 0.029 0.186
Top-Down d1 0.099 0.266 0.079 0.091 0.338 0.431 0.281 0.527 0.610 0.009 0.276 0.054 0.019 0.237
d2 - 0.769 0.442 0.856 0.661 0.499 0.339 0.391 0.171 0.697 0.313 0.199 0.585 0.494
d3 - 0.522 0.631 0.439 0.529 0.288 - 0.741 0.355 0.682 0.402 - - 0.510
Bottom-Up d1 0.377 0.626 0.489 0.170 0.515 0.792 0.673 0.748 0.872 0.001 0.582 0.281 0.062 0.476
d2 0.188 0.469 0.000 0.041 0.155 0.053 0.006 0.126 0.147 0.000 0.064 0.063 0.031 0.102
d3 0.279 0.542 0.198 0.385 0.260 0.232 0.110 0.404 0.335 0.145 0.267 0.128 0.034 0.210
Table 4: Fine-grained semantic part segmentation performance in terms of mean IoU and mean per-instance IoU for different semantic granularities, using our model in various setups and baselines.

                                     Categ. mIoU                Inst. mIoU
Method                               d1      d2      d3         d1      d2      d3
Dense 3D CNN (Lee et al., 2017)      0.225   0.131   0.058      -       -       -
SparseConvNet (Graham et al., 2018)  0.4212  0.2561  0.1789     0.7751  0.7151  0.4923
Ours (w/o color)                     0.5179  0.2913  0.2185     0.8387  0.7219  0.5140
Ours (w/ color)                      0.5290  0.2848  0.2231     0.8422  0.7217  0.5137
Ours (MTT balanced)                  0.5209  0.3104  -          0.8364  0.7363  -
Ours (MTT fine-grained)              0.4953  0.2929  0.2123     0.8065  0.7217  0.5093
Ours (MTT coarse)                    0.5091  0.2963  0.1886     0.841   0.7253  0.4414
Num. classes                         13      36      79         13      36      79
Figure 7: Semantic segmentation results for the same scene voxelized at 5 cm³ (a) and 2 cm³ (b) voxel resolution. Although the coarser geometry yields somewhat more robust segmentation, it does not allow representing finer parts.
the high imbalance of part labels in scenes. Training on part segmentation in a multi-task setup with different choices of objective weights has demonstrated a trade-off between the effectiveness of the geometric features and their ability to predict labels at different levels of detail.
Effect of Color Information. Surprisingly, introduc-
ing color information only contributes slightly to the
performance of semantic part segmentation (see Ta-
ble 4). This result contradicts the expectation that voxel color would be a highly predictive factor for part correspondence. We hypothesize that this could be due to the high variability in lighting and acquisition conditions in the original ScanNet dataset.
Effect of Voxel Resolution. We conducted experiments at a 2.5× finer voxel resolution of 2 cm³ and present their results in Table 6. Despite the higher geometric complexity, we observed an improvement in performance metrics across all levels of detail. However, due to a significant increase in computational
Table 5: Our objective configurations in (1).

Configuration      α_1   α_2   α_3
Base coarse         1     0     0
Base middle         0     1     0
Base fine           0     0     1
MTT-12             .5    .5     0
MTT-123-coarse     .7    .2    .1
MTT-123-fine       .1    .2    .7
Table 6: Our method is able to work effectively at various voxel resolutions. At the finer 2 cm³ resolution, typical training time for an equal number of epochs increased from several hours to a day.

                     Categ. mIoU                Inst. mIoU
Method               d1      d2      d3         d1      d2      d3
Voxel size 5 cm³     0.5179  0.2913  0.2185     0.8387  0.7219  0.5140
Voxel size 2 cm³     0.6465  0.4138  0.3415     0.8958  0.8263  0.6295
Num. classes         13      36      79         13      36      79
requirements, we performed the majority of experiments on the 5 cm³ version of our dataset.
5 CONCLUSION
We introduced Scan2Part, a novel method and a chal-
lenging benchmark for part-level understanding of
real-world 3D objects. The core of our method is to leverage structural knowledge of object composition to perform a variety of segmentation tasks in settings with complex geometry and high levels of uncertainty due to noise. To achieve that, we explore the part taxonomies of common objects in indoor scenes at multiple scales, as well as methods of compressing them for more effective use in machine learning applications. We demonstrated that specific ways of training deep segmentation models such as our sparse residual U-Net architecture on the novel tasks we introduced are better at capturing the inductive biases in structured labels on some parts of the taxonomy but not on others. Further research on the relationship between the structure of real-world scenes and perception models is required, and we hope our benchmark and dataset will accelerate it. In the near future, we plan to release a second version of the dataset in which the scene background is not removed, so that we can study the performance of our models in scenarios closer to raw sensor data.
ACKNOWLEDGEMENTS
This work has been supported by the Russian Science
Foundation under Grant 19-41-04109. The authors
acknowledge the use of Zhores (Zacharov et al., 2019)
for obtaining the results presented in this paper.
REFERENCES
Armeni, I., Sax, A., Zamir, A. R., and Savarese, S. (2017).
Joint 2D-3D-Semantic Data for Indoor Scene Under-
standing. ArXiv e-prints.
Armeni, I., Sener, O., Zamir, A. R., Jiang, H., Brilakis, I.,
Fischer, M., and Savarese, S. (2016). 3d semantic
parsing of large-scale indoor spaces. In Proceedings
of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1534–1543.
Avetisyan, A., Dahnert, M., Dai, A., Savva, M., Chang,
A. X., and Nießner, M. (2019). Scan2cad: Learning
cad model alignment in rgb-d scans. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2614–2623.
Biederman, I. (1987). Recognition-by-components: a the-
ory of human image understanding. Psychological re-
view, 94(2):115.
Bokhovkin, A., Ishimtsev, V., Bogomolov, E., Zorin, D.,
Artemov, A., Burnaev, E., and Dai, A. (2021). To-
wards part-based understanding of rgb-d scans. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 7484–
7494.
Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner,
M., Savva, M., Song, S., Zeng, A., and Zhang, Y.
(2017). Matterport3d: Learning from rgb-d data in in-
door environments. In 2017 International Conference
on 3D Vision (3DV), pages 667–676. IEEE.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan,
P., Huang, Q., Li, Z., Savarese, S., Savva, M.,
Song, S., Su, H., et al. (2015). Shapenet: An
information-rich 3d model repository. arXiv preprint
arXiv:1512.03012.
Chen, X., Golovinskiy, A., and Funkhouser, T. (2009). A
benchmark for 3D mesh segmentation. ACM Trans-
actions on Graphics (Proc. SIGGRAPH), 28(3).
Chen, Z., Yin, K., Fisher, M., Chaudhuri, S., and Zhang,
H. (2019). Bae-net: branched autoencoder for shape
co-segmentation. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 8490–
8499.
Choy, C., Gwak, J., and Savarese, S. (2019a). 4d spatio-
temporal convnets: Minkowski convolutional neural
networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
3075–3084.
Choy, C., Park, J., and Koltun, V. (2019b). Fully convolu-
tional geometric features. In Proceedings of the IEEE
International Conference on Computer Vision, pages
8958–8966.
Comaniciu, D. and Meer, P. (2002). Mean shift: A robust
approach toward feature space analysis. IEEE Trans-
actions on pattern analysis and machine intelligence,
24(5):603–619.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser,
T., and Nießner, M. (2017). Scannet: Richly-
annotated 3d reconstructions of indoor scenes. arXiv
preprint arXiv:1702.04405.
Dai, A. and Nießner, M. (2018). 3dmv: Joint 3d-multi-
view prediction for 3d semantic scene segmentation.
In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 452–468.
De Brabandere, B., Neven, D., and Van Gool, L. (2017).
Semantic instance segmentation with a discriminative
loss function. arXiv preprint arXiv:1708.02551.
Elich, C., Engelmann, F., Schult, J., Kontogianni, T., and
Leibe, B. (2019). 3d-bevis: birds-eye-view instance
segmentation. arXiv preprint arXiv:1904.02199.
Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., and
Nießner, M. (2020). 3d-mpa: Multi proposal aggre-
gation for 3d semantic instance segmentation. arXiv
preprint arXiv:2003.13867.
Fisher, M., Ritchie, D., Savva, M., Funkhouser, T., and
Hanrahan, P. (2012). Example-based synthesis of 3d
object arrangements. In ACM SIGGRAPH Asia 2012
papers, SIGGRAPH Asia ’12.
Garcia-Garcia, A., Martinez-Gonzalez, P., Oprea, S.,
Castro-Vargas, J. A., Orts-Escolano, S., Garcia-
Rodriguez, J., and Jover-Alvarez, A. (2018). The
robotrix: An extremely photorealistic and very-large-
scale indoor dataset of sequences with robot trajecto-
ries and interactions. In 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS),
pages 6790–6797. IEEE.
Graham, B., Engelcke, M., and van der Maaten, L. (2018).
3d semantic segmentation with submanifold sparse
convolutional networks. CVPR.
Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S.,
and Cipolla, R. (2016). Understanding real world in-
door scenes with synthetic data. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4077–4085.
Hoffman, D. D. and Richards, W. A. (1984). Parts of recog-
nition. Cognition, 18(1-3):65–96.
Hou, J., Dai, A., and Nießner, M. (2019). 3d-sis: 3d seman-
tic instance segmentation of rgb-d scans. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 4421–4430.
Hua, B.-S., Pham, Q.-H., Nguyen, D. T., Tran, M.-K., Yu,
L.-F., and Yeung, S.-K. (2016). Scenenn: A scene
meshes dataset with annotations. In 2016 Fourth In-
ternational Conference on 3D Vision (3DV), pages
92–101. IEEE.
Johnson, J., Gupta, A., and Fei-Fei, L. (2018). Image gen-
eration from scene graphs. CoRR, abs/1804.01622.
Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., and
Bhowmik, A. (2017). Intel realsense stereoscopic
depth cameras. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 1–10.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Klokov, R. and Lempitsky, V. (2017). Escape from cells:
Deep kd-networks for the recognition of 3d point
cloud models. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 863–
872.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-
J., Shamma, D. A., et al. (2017). Visual genome:
Connecting language and vision using crowdsourced
dense image annotations. International Journal of
Computer Vision, 123(1):32–73.
Lahoud, J., Ghanem, B., Pollefeys, M., and Oswald, M. R.
(2019). 3d instance segmentation via multi-task met-
ric learning. In Proceedings of the IEEE International
Conference on Computer Vision, pages 9256–9266.
Lee, K., Zung, J., Li, P., Jain, V., and Seung, H. S. (2017).
Superhuman accuracy on the snemi3d connectomics
challenge. arXiv preprint arXiv:1706.00120.
Li, J., Xu, K., Chaudhuri, S., Yumer, E., Zhang, H., and
Guibas, L. (2017). Grass: Generative recursive au-
toencoders for shape structures. ACM Transactions
on Graphics (TOG), 36(4):52.
Li, W., Saeedi, S., McCormac, J., Clark, R., Tzoumanikas,
D., Ye, Q., Huang, Y., Tang, R., and Leutenegger, S.
(2018). Interiornet: Mega-scale multi-sensor photo-
realistic indoor scenes dataset. In British Machine Vi-
sion Conference (BMVC).
Liang, Z., Yang, M., and Wang, C. (2019). 3d graph em-
bedding learning with a structure-aware loss function
for point cloud semantic instance segmentation. arXiv
preprint arXiv:1902.05247.
Liu, C. and Furukawa, Y. (2019). Masc: multi-scale affinity
with sparse convolution for 3d instance segmentation.
arXiv preprint arXiv:1902.04478.
McCormac, J., Handa, A., Leutenegger, S., and Davison, A. J. (2017). Scenenet rgb-d: Can 5m synthetic images
beat generic imagenet pre-training on indoor segmen-
tation?
Mo, K., Guerrero, P., Yi, L., Su, H., Wonka, P., Mitra, N.,
and Guibas, L. J. (2019a). Structurenet: Hierarchi-
cal graph networks for 3d shape generation. arXiv
preprint arXiv:1908.00575.
Mo, K., Wang, H., Yan, X., and Guibas, L. J. (2020). Pt2pc:
Learning to generate 3d point cloud shapes from part
tree conditions. arXiv preprint arXiv:2003.08624.
Mo, K., Zhu, S., Chang, A. X., Yi, L., Tripathi, S., Guibas,
L. J., and Su, H. (2019b). Partnet: A large-scale
benchmark for fine-grained and hierarchical part-level
3d object understanding. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 909–918.
Pham, Q.-H., Hua, B.-S., Nguyen, T., and Yeung, S.-K.
(2019a). Real-time progressive 3d semantic segmen-
tation for indoor scenes. In 2019 IEEE Winter Con-
ference on Applications of Computer Vision (WACV),
pages 1089–1098. IEEE.
Pham, Q.-H., Nguyen, T., Hua, B.-S., Roig, G., and Yeung,
S.-K. (2019b). Jsis3d: joint semantic-instance seg-
mentation of 3d point clouds with multi-task point-
wise networks and multi-value conditional random
fields. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
8827–8836.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
net: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 652–660.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
net++: Deep hierarchical feature learning on point sets
in a metric space. arXiv preprint arXiv:1706.02413.
Russell, C., Kohli, P., Torr, P. H., et al. (2009). Associa-
tive hierarchical crfs for object class image segmen-
tation. In Computer Vision, 2009 IEEE 12th Interna-
tional Conference on, pages 739–746. IEEE.
Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., and
Funkhouser, T. (2017). Semantic scene completion
from a single depth image. Proceedings of 30th IEEE
Conference on Computer Vision and Pattern Recogni-
tion.
Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E.,
Green, S., Engel, J. J., Mur-Artal, R., Ren, C., Verma,
S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan,
X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J.,
Gillingham, T., Mueggler, E., Pesqueira, L., Savva,
M., Batra, D., Strasdat, H. M., Nardi, R. D., Goesele,
M., Lovegrove, S., and Newcombe, R. (2019). The
Replica dataset: A digital replica of indoor spaces.
arXiv preprint arXiv:1906.05797.
Sun, C.-Y., Zou, Q.-F., Tong, X., and Liu, Y. (2019). Learn-
ing adaptive hierarchical cuboid abstractions of 3d
shape collections. ACM Transactions on Graphics
(TOG), 38(6):1–13.
Uy, M. A., Pham, Q.-H., Hua, B.-S., Nguyen, T., and
Yeung, S.-K. (2019). Revisiting point cloud clas-
sification: A new benchmark dataset and classifica-
tion model on real-world data. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 1588–1597.
Wang, P.-S., Liu, Y., Guo, Y.-X., Sun, C.-Y., and Tong,
X. (2017). O-cnn: Octree-based convolutional neu-
ral networks for 3d shape analysis. ACM Transactions
on Graphics (TOG), 36(4):1–11.
Wang, W., Yu, R., Huang, Q., and Neumann, U. (2018).
Sgpn: Similarity group proposal network for 3d point
cloud instance segmentation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, pages 2569–2578.
Wang, X., Liu, S., Shen, X., Shen, C., and Jia, J. (2019a).
Associatively segmenting instances and semantics in
point clouds. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
4096–4105.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M.,
and Solomon, J. M. (2019b). Dynamic graph cnn
for learning on point clouds. ACM Transactions on
Graphics (TOG), 38(5):1–12.
Wu, R., Zhuang, Y., Xu, K., Zhang, H., and Chen, B.
(2019a). Pq-net: A generative part seq2seq network
for 3d shapes. arXiv preprint arXiv:1911.10949.
Wu, Z., Wang, X., Lin, D., Lischinski, D., Cohen-Or, D.,
and Huang, H. (2019b). Sagnet: Structure-aware gen-
erative network for 3d-shape modeling. ACM Trans-
actions on Graphics (TOG), 38(4):1–14.
Xu, D., Zhu, Y., Choy, C. B., and Fei-Fei, L. (2017). Scene
graph generation by iterative message passing. In
The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham,
A., and Trigoni, N. (2019). Learning object bounding
boxes for 3d instance segmentation on point clouds. In
Advances in Neural Information Processing Systems,
pages 6737–6746.
Yi, L., Guibas, L., Hertzmann, A., Kim, V. G., Su, H., and
Yumer, E. (2017). Learning hierarchical shape seg-
mentation and labeling from online repositories. arXiv
preprint arXiv:1705.01661.
Yi, L., Kim, V. G., Ceylan, D., Shen, I.-C., Yan, M., Su, H.,
Lu, C., Huang, Q., Sheffer, A., and Guibas, L. (2016).
A scalable active framework for region annotation in
3d shape collections. ACM Transactions on Graphics
(TOG), 35(6):1–12.
Yi, L., Zhao, W., Wang, H., Sung, M., and Guibas, L. J.
(2019). Gspn: Generative shape proposal network for
3d instance segmentation in point cloud. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 3947–3956.
Yildirim, I., Kulkarni, T. D., Freiwald, W. A., and Tenen-
baum, J. B. (2015). Efficient and robust analysis-by-
synthesis in vision: A computational framework, be-
havioral tests, and modeling neuronal representations.
In Annual conference of the cognitive science society,
volume 1.
Yuille, A. and Kersten, D. (2006). Vision as bayesian in-
ference: analysis by synthesis? Trends in cognitive
sciences, 10(7):301–308.
Zacharov, I., Arslanov, R., Gunin, M., Stefonishin,
D., Bykov, A., Pavlov, S., Panarin, O., Mal-
iutin, A., Rykovanov, S., and Fedorov, M. (2019).
“zhores”—petaflops supercomputer for data-driven
modeling, machine learning and artificial intelligence
installed in skolkovo institute of science and technol-
ogy. Open Engineering, 9(1):512–520.
Zhang, B. and Wonka, P. (2019). Point cloud instance
segmentation using probabilistic embeddings. arXiv
preprint arXiv:1912.00145.
Zhang, Z. (2012). Microsoft kinect sensor and its effect.
IEEE multimedia, 19(2):4–10.
Zhu, C., Xu, K., Chaudhuri, S., Yi, L., Guibas, L., and
Zhang, H. (2019). Cosegnet: Deep co-segmentation
of 3d shapes with group consistency loss. arXiv
preprint arXiv:1903.10297.
Zhu, S.-C. and Mumford, D. (2006). A stochastic gram-
mar of images. Foundations and Trends® in Com-
puter Graphics and Vision, 2(4):259–362.