3D Reconstruction of Indoor Scenes using a Single RGB-D Image
Panagiotis-Alexandros Bokaris¹, Damien Muselet² and Alain Trémeau²
¹LIMSI-CNRS, University of Paris-Saclay, Univ. Paris-Sud, 91405 Orsay Cedex, France
²Laboratoire Hubert Curien, CNRS, UMR 5516, Université Jean Monnet, 42000 Saint-Étienne, France
Keywords: 3D Reconstruction, Cuboid Fitting, Kinect, RGB-D, RANSAC, Bounding Box, Point Cloud, Manhattan World.
Abstract:
The three-dimensional reconstruction of a scene is essential for the interpretation of an environment. In this
paper, a novel and robust method for the 3D reconstruction of an indoor scene using a single RGB-D image
is proposed. First, the layout of the scene is identified and then, a new approach for isolating the objects in
the scene is presented. Its fundamental idea is the segmentation of the whole image in planar surfaces and the
merging of the ones that belong to the same object. Finally, a cuboid is fitted to each segmented object by a
new RANSAC-based technique. The method is applied to various scenes and is able to provide a meaningful
interpretation of these scenes even in cases with strong clutter and occlusion. In addition, a new ground truth
dataset, on which the proposed method is further tested, was created. The results imply that the present work
outperforms recent state-of-the-art approaches not only in accuracy but also in robustness and time complexity.
1 INTRODUCTION
3D reconstruction is an important task in computer
vision since it provides a complete representation of
a scene and can be useful in numerous applications
(light estimation for white balance, augmenting synthetic objects in a real scene, interior design, etc.). Nowa-
days, with an easy and cheap access to RGB-D im-
ages, as a result of the commercial success of the
Kinect sensor, there is an increasing demand in new
methods that will benefit from such data.
A lot of attention has been drawn to 3D recon-
struction using dense RGB-D data (Izadi et al., 2011;
Neumann et al., 2011; Dou et al., 2013). Such data
are obtained by multiple acquisitions of the consid-
ered 3D scene under different viewpoints. The main
drawback of these approaches is that they require a
registration step between the different views. In or-
der to make the 3D reconstruction of a scene feasible
despite the absence of a huge amount of data, this pa-
per focuses on reconstructing a scene using a single
RGB-D image. This challenging problem has been
less addressed in the literature (Neverova et al., 2013).
The lack of information about the shape and position
of the different objects in the scene due to the single
viewpoint and occlusions makes the task significantly
more difficult. Therefore, various assumptions have
to be made in order to make the 3D reconstruction feasible (object nature, orientation).
Figure 1: (left) Color and Depth input images. (right) 3D reconstruction of the scene.
In this paper, starting from a single RGB-D image,
a fully automatic method for the 3D reconstruction of
an indoor scene without constraining the object orien-
tations is proposed. In the first step, the layout of the
room is identified by solving the parsing problem of
an indoor scene. For this purpose, the work of (Taylor
and Cowley, 2012) is exploited and improved by bet-
ter addressing the problem of the varying depth reso-
lution of the Kinect sensor while fitting planes. Then,
the objects of the scene are segmented by using a
novel plane-merging approach and a cuboid is fitted to
each of these objects. The reason behind the selection
of such representation is that most of the objects in a
common indoor scene, such as drawers, bookshelves,
tables or beds have a cuboid shape. For the cuboid fit-
ting step, a new “double RANSAC”-based (Fischler
and Bolles, 1981) approach is proposed. The out-
put of the algorithm is a 3D reconstruction of the ob-
served scene, as illustrated in Fig. 1. In order to as-
sess the quality of the reconstruction, a new dataset
of captured 3D scenes is created, in which the ex-
act positions of the objects are measured by using a
telemeter. In fact, by knowing the exact 3D positions
of the objects, one can objectively assess the accuracy
of all the 3D reconstruction algorithms. This ground
truth dataset will be publicly available for future com-
parisons. Finally, the proposed method is tested on
this new dataset as well as on the NYU Kinect dataset
(Silberman et al., 2012). The obtained results indicate
that the proposed algorithm outperforms the state-of-
the-art even in cases with strong occlusion and clutter.
2 RELATED WORK
Research related to the problem examined in this paper can be separated into two categories.
The first category is the extraction of the main layout
of the scene while the second one is the 3D represen-
tation of the objects in the scene.
Various approaches have been followed in com-
puter vision for recovering the spatial layout of a
scene. Many of them are based on the Manhat-
tan World assumption (Coughlan and Yuille, 1999).
Some solutions only consider color images without
exploiting depth information (Mirzaei and Roumeli-
otis, 2011; Bazin et al., 2012; Hedau et al., 2009;
Schwing and Urtasun, 2012; Zhang et al., 2014) and
hence provide only coarse 3D layouts. With Kinect,
depth information is available, which can be signifi-
cantly beneficial in such applications. (Zhang et al.,
2013) expanded the work of (Schwing and Urtasun,
2012) and used the depth information in order to
reduce the layout error and estimate the clutter in
the scene. (Taylor and Cowley, 2011) developed a
method that parses the scene in salient surfaces using
a single RGB-D image. Moreover, (Taylor and Cow-
ley, 2012) presented a method for parsing the Manhat-
tan structure of an indoor scene. Nonetheless, these
works are based on assumptions about the content of
the scene (minimum size of a wall, minimum ceiling
height, etc.). Moreover, in order to address the prob-
lem of the depth accuracy in Kinect, they used the
depth disparity differences, which is not the best solution, as discussed in section 3.1.
Apart from estimating the layout of an indoor
scene, a considerable amount of research has been
done in estimating surfaces and objects from RGB-
D images. (Richtsfeld et al., 2012) used RANSAC
and NURBS (Piegl, 1991) for detecting unknown 3D
objects in a single RGB-D image, requiring learn-
ing data from the user. (Cupec et al., 2011; Jiang,
2014) segment convex 3D shapes but their grouping
into complete objects remains an open issue. To the
best of our knowledge, (Neverova et al., 2013) was the
first method that proposed a 3D reconstruction start-
ing from a single RGB-D image under the Manhat-
tan World assumption. However, it has the significant
limitation that it only reconstructs 3D objects which
are parallel or perpendicular to the three main orien-
tations of the Manhattan World. (Lin et al., 2013) pre-
sented a holistic approach that takes into account 2D
segmentation, 3D geometry and contextual relations
between scenes and objects in order to detect and clas-
sify objects in a single RGB-D image. Despite the
promising nature of such an approach, it is constrained
by the assumption that the objects are parallel to the
floor. In addition, the cuboid fitting to the objects is
performed as the minimal bounding cube of the 3D
points, which is not the optimal solution when work-
ing with Kinect data, as discussed by (Jia et al., 2013).
Recently, an interesting method that introduced the
“Manhattan Voxel” was developed by (Ren and Sud-
derth, 2016). In their work the 3D layout of the room
is estimated and detected objects are represented by
3D cuboids. Being a holistic approach that prunes
candidates, there is no guarantee that a cuboid will be
fitted to each object in the scene. Based on a single
RGB image, (Dwibedi et al., 2016) developed a deep-
learning method to extract all the cuboid-shaped ob-
jects in the scene. This novel technique differs from
our perspective since the intention is not to fit a cuboid
to a 3D object but to detect cuboid shapes already present in an image.
The two methods (Jiang and Xiao, 2013; Jia et al.,
2013) are similar to our approach since their au-
thors try to fit cuboids using RANSAC to objects
of a 3D scene acquired by a single RGB-D image.
(Jia et al., 2013) followed a 3D reasoning approach
and investigated different constraints that have to be
applied to the cuboids, such as occlusion, stability
and supporting relations. However, this method is
applicable only to pre-labeled images. (Jiang and
Xiao, 2013) coarsely segment the RGB-D image into
roughly piecewise planar patches and for each pair of
such patches fit a cuboid to the two planes. As a re-
sult, a large set of cuboid candidates is created. Fi-
nally, the best subset of cuboids is selected by opti-
mizing an objective function, subject to various con-
straints. Hence, they require strong constraints (such
as intersections between pairs of cuboids, number of
cuboids, covered area on the image plane, occlusions
among cuboids, etc.) during the global optimization process. This pioneering approach provides promising results in some cases but very coarse ones in others, even for very simple scenes (see Figs. 9 and 10 and the images shown in (Jiang and Xiao, 2013)).
Figure 2: An overview of the proposed method.
In this paper, in order to improve the quality of the reconstruction, we follow a different approach and propose an accurate segmentation step using novel constraints. The objective is to isolate the objects from each other before fitting the cuboids, since the cuboid fitting step is significantly more efficient and accurate when working with each object independently.
3 METHOD OVERVIEW
The method proposed in this paper can be separated into three stages. The first stage is to define the layout of the scene. This involves extracting the floor,
all the walls and their intersections. For this purpose,
the input RGB-D image is segmented by fitting 3D
planes to the point cloud. The second stage is to seg-
ment all the objects in the scene and to fit a cuboid
to each one separately. Finally, in stage 3 the results
of the two previous stages are combined in order to
visualize the 3D model of the room. An overview of
this method can be seen in Fig. 2.
3.1 Parsing the Indoor Scene
In order to parse the indoor scene and extract its complete layout, an approach based on the research of (Taylor and Cowley, 2012) is used. According to this work, the image is separated into planar regions by fitting planes to the point cloud using RANSAC, as can be seen in Fig. 2b. Then the floor
and the walls are detected by analyzing their surfaces,
their angles with the vertical and the angles between them. This
method provides the layout of the room in less than
6 seconds. The final result of the layout of the scene,
visualized in the 3D Manhattan World, can be seen in
the bottom of Fig. 2c.
While working with depth values provided by the
Kinect sensor, it is well known that the depth accu-
racy is not the same for the whole range of depth
(Andersen et al., 2012), i.e. the depth information is
more accurate for points that are close to the sensor
than for points that are farther. This has to be taken
into account in order to define a threshold according
to which the points will be considered as inliers in a
RANSAC method. Points with a distance to a plane
inside the range of Kinect error should be treated as
inliers of that plane. In order to address this prob-
lem, (Taylor and Cowley, 2012) proposed to fit planes
in the disparity (inverse of depth) image instead of
working directly with depth. This solution improves
the accuracy but we claim that the best solution would
be to use a threshold for the computation of the resid-
ual errors in RANSAC that increases according to the
distance from the sensor. This varying threshold is
computed once by fitting a second degree polynomial
function to the depth values provided by (Andersen
et al., 2012). The difference between the varying
threshold proposed by (Taylor and Cowley, 2012) us-
ing disparity and the one proposed here can be seen
in Fig. 3. As observed in the graph, our threshold follows the experimental data of (Andersen et al., 2012) significantly better than the threshold of (Taylor and Cowley, 2012).
Figure 3: Comparison of the varying threshold set in (Taylor
and Cowley, 2012) and the one proposed in this paper.
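To make the idea concrete, the following sketch (Python/NumPy; a minimal illustration, not the authors' implementation) shows how such a depth-dependent threshold can be built: a second-degree polynomial is fitted once to depth/error pairs such as those reported by (Andersen et al., 2012), and evaluated at the depth of each point during the RANSAC inlier test. The calibration values below are placeholders.

```python
import numpy as np

# Placeholder (depth, error) pairs in mm, standing in for the experimental
# values reported by (Andersen et al., 2012).
calib_depth = np.array([1000.0, 2000.0, 3000.0, 4000.0])
calib_error = np.array([5.0, 12.0, 25.0, 50.0])

# Second-degree polynomial mapping depth to an inlier threshold (fitted once).
threshold_poly = np.polyfit(calib_depth, calib_error, deg=2)

def adaptive_threshold(depth_mm):
    """Inlier threshold (mm) as a function of the distance from the sensor."""
    return np.polyval(threshold_poly, depth_mm)

def plane_inliers(points, plane):
    """Inliers of a plane under the depth-dependent threshold.

    points: (N, 3) array in mm (z taken as the distance from the sensor),
    plane: (a, b, c, d) with (a, b, c) a unit normal.
    """
    a, b, c, d = plane
    dist = np.abs(points @ np.array([a, b, c]) + d)
    return dist < adaptive_threshold(points[:, 2])
```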
The impact of the proposed threshold on the room
layout reconstruction can be seen in the two character-
istic examples in Fig. 4. As can be easily noticed,
with the new threshold the corners of the walls are
better defined and complete walls are now detected.
This adaptive threshold is further used in the cuboid
fitting step and significant improvements are obtained
for various objects, as discussed in section 3.3.
Figure 4: Impact of the proposed threshold on the room layout reconstruction. (left column): Input image. (middle column): Threshold in (Taylor and Cowley, 2012). (right column): Threshold proposed here.
3.2 Segmenting the Objects in the Scene
As an output of the previous step, the input image is
segmented into planar regions (Fig. 2b). Moreover, it
is already known which of these planar regions corre-
spond to the walls and to the floor in the scene (bot-
tom of Fig. 2c). By excluding them from the image,
only planar regions that belong to different objects in
the image are left, as can be seen in the top of Fig.
2c. In order to segment the objects in the scene, the
planar regions that belong to the same object have to
be merged. For this purpose, the edges of the pla-
nar surfaces are extracted using a Canny edge detec-
tor and the common edge between neighboring sur-
faces is calculated. Then, we propose to merge two neighboring surfaces by analyzing i) the depth continuity across surface boundaries, ii) the angle between the surface normals and iii) the size of each surface.
For the first criterion, we consider that two neigh-
boring planar surfaces that belong to the same object
have similar depth values in their common edge and
different ones when they belong to different objects.
The threshold on the mean depth difference is set to 60
mm in all of our experiments. The second criterion is
necessary in order to prevent patches that do not belong to the same object from being merged. In fact, since
this study is focused on cuboids, the planar surfaces
that should be merged need to be either parallel or
perpendicular to each other. The final criterion forces
neighboring planar surfaces to be merged if both of
their sizes are relatively small (less than 500 points).
The aim is to regroup all small planar regions that
constitute an object that does not have a cuboid shape
(sphere, cylinder, etc.). This point is illustrated in Fig.
5, where one cylinder is extracted. The proposed al-
gorithm checks each planar region with respect to its
neighboring regions (within a 5-pixel area) in order to decide
whether they have to be merged or not. This step is
crucial for preparing the data before fitting cuboids in
the next step.
Figure 5: An example of merging objects that are not cuboids. (left): Original input image. (middle): Before merging. (right): After merging.
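One plausible way to combine the three criteria in code is sketched below (NumPy; the 60 mm and 500-point values come from the text, whereas the 10° angular tolerance and the data layout are illustrative assumptions):

```python
import numpy as np

DEPTH_DIFF_MAX = 60.0          # mm, threshold on the mean depth difference
SMALL_REGION = 500             # points; two small regions are always merged
ANGLE_TOL = np.deg2rad(10.0)   # assumed tolerance on "parallel or perpendicular"

def should_merge(edge_depth_a, edge_depth_b, normal_a, normal_b, size_a, size_b):
    """Decide whether two neighboring planar regions belong to the same object.

    edge_depth_*: depth values (mm) of each region along their common edge,
    normal_*: unit normals of the fitted planes, size_*: number of points.
    """
    # i) depth continuity across the common boundary
    continuous = abs(np.mean(edge_depth_a) - np.mean(edge_depth_b)) < DEPTH_DIFF_MAX

    # ii) the two surfaces must be roughly parallel or perpendicular
    cos_angle = abs(float(np.dot(normal_a, normal_b)))
    compatible = cos_angle > np.cos(ANGLE_TOL) or cos_angle < np.sin(ANGLE_TOL)

    # iii) two small patches (e.g. facets of a curved object) are merged anyway
    both_small = size_a < SMALL_REGION and size_b < SMALL_REGION

    return both_small or (continuous and compatible)
```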
3.3 Fitting a Cuboid to Each Object
The aim of this section is to fit an oriented cuboid to
each object. As discussed by (Jia et al., 2013), the
optimal cuboid is the one with the minimum volume
and the maximum points on its surface. Since the im-
age has been already segmented, i.e. each object is
isolated from the scene, the strong global constraints
used by (Jiang and Xiao, 2013) can be relaxed and
more attention can be paid to each individual cuboid. There-
fore, we propose the following double-RANSAC pro-
cess. Two perpendicular planar surfaces are sufficient
to define a cuboid. Hence, in order to improve the ro-
bustness of the method, we propose to consider only
the two biggest planar surfaces of each object. In fact,
from a single viewpoint of a 3D scene, often only two surfaces of an object are visible. Thus, first, for each seg-
mented object, the planar surface with the maximum
number of inliers is extracted by fitting a plane to the
corresponding point cloud using RANSAC (with our
adaptive threshold described in section 3.1). The ori-
entation of this plane provides the first axis of the
cuboid. We consider that the second plane is perpen-
dicular to the first one, but this information alone is not sufficient to define it. Furthermore, in case
of noise or when the object is thin (few points in the
other planes) or far from the acquisition sensor, the
3D orientation of the second plane might be poorly es-
timated. Hence, we propose a robust solution which
projects all the remaining points of the point cloud
on the first plane and then fits a line using another
RANSAC step to the projected points. The orienta-
tion of this line provides the orientation of the second
plane. This is visualized in Fig. 6. In the experiments
section, it is shown that this double RANSAC pro-
cess provides very good results while fitting cuboids
to small, thin or far objects.
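The geometric core of this double-RANSAC step can be sketched as follows (Python/NumPy). For brevity, the two robust fits are replaced by plain least-squares stand-ins; in the actual method both would be RANSAC estimations using the adaptive threshold of section 3.1, and only the points that are not inliers of the first plane would be projected.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane (stand-in for the RANSAC plane fit)."""
    centroid = points.mean(axis=0)
    _, _, vh = np.linalg.svd(points - centroid)
    return vh[-1], centroid            # normal = direction of smallest variance

def fit_line_direction_2d(points_2d):
    """Least-squares 2D line direction (stand-in for the RANSAC line fit)."""
    centered = points_2d - points_2d.mean(axis=0)
    _, _, vh = np.linalg.svd(centered)
    return vh[0]                       # direction of largest variance

def fit_cuboid(points):
    """Oriented cuboid for one segmented object (illustrative sketch)."""
    # First axis: normal of the dominant planar surface of the object.
    normal, centroid = fit_plane(points)

    # Orthonormal basis (u, v) of the first plane, used for the projection.
    u = np.cross(normal, [1.0, 0.0, 0.0])
    if np.linalg.norm(u) < 1e-6:       # normal happens to be along x
        u = np.cross(normal, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)

    # Project the points on the first plane and fit a line to the projections;
    # its direction gives the orientation of the second (perpendicular) plane.
    proj = (points - centroid) @ np.column_stack([u, v])
    d = fit_line_direction_2d(proj)
    axis2 = d[0] * u + d[1] * v
    axis3 = np.cross(normal, axis2)

    # Cuboid extents: axis-aligned bounds in the (normal, axis2, axis3) frame.
    frame = np.column_stack([normal, axis2, axis3])
    coords = (points - centroid) @ frame
    return centroid, frame, coords.min(axis=0), coords.max(axis=0)
```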
Furthermore, as a second improvement of the
RANSAC algorithm, we propose to analyze its qual-
ity criterion. In fact, RANSAC fits several cuboids to
each object (10 cuboids in our implementation) and
selects the one that optimizes a given quality crite-
rion. Thus, the chosen quality criterion has a big im-
pact on the results. As discussed before, in
RGB-D data a well estimated cuboid should have a
maximum of points on its surface. Given one cuboid
returned by one RANSAC iteration, we denote by area_f1 and area_f2 the areas of its two faces, and by area_c1 and area_c2 the areas defined by the convex hull of the inlier points projected on these two faces, respectively. In order to evaluate the quality of the fitted cuboid, Jiang and Xiao proposed the measure defined as min(area_c1/area_f1, area_c2/area_f2), which is equal to the maximum
value of 1 when the fitting is perfect. This measure
reduces the quality of a cuboid to the quality of the worse of its two planes, without taking into
account the quality of the best fitting plane. Never-
theless, the quality of the best fitting plane could help
in deciding between two cuboids characterized by the
same ratio. Furthermore, the relative sizes of the two
planes are completely ignored in this criterion. In-
deed, in the case of a cuboid composed of a very big plane and a very small one, this measure does not pro-
vide any information about which one is well fitted to
the data, although this information is crucial to assess
the quality of the cuboid fitting. Consequently, we
propose to use a similar criterion which does not suf-
fer from these drawbacks: ratio = (area_c1 + area_c2) / (area_f1 + area_f2). Likewise, for an ideal fitting, this measure is equal to 1.
In order to illustrate the improvement due to the pro-
posed adaptive threshold (of section 3.1) and the pro-
posed ratio in the cuboid fitting step, three typical examples are shown in Fig. 7. There, it can be seen that
the proposed method (right column) increases signif-
icantly the performance for far and thin objects.
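The difference between the two criteria is easy to see numerically. In the toy example below (the values are made up for illustration), a cuboid with a large well-covered face and a small poorly-covered one is heavily penalized by the min-based measure but not by the proposed ratio:

```python
def quality_min(area_c1, area_f1, area_c2, area_f2):
    # Measure of (Jiang and Xiao, 2013): quality of the worse of the two faces.
    return min(area_c1 / area_f1, area_c2 / area_f2)

def quality_ratio(area_c1, area_f1, area_c2, area_f2):
    # Proposed measure: total covered area over total face area.
    return (area_c1 + area_c2) / (area_f1 + area_f2)

# Big face of 2.0 m^2 covered over 1.9 m^2, small face of 0.1 m^2 covered over 0.01 m^2.
print(quality_min(1.9, 2.0, 0.01, 0.1))    # 0.10
print(quality_ratio(1.9, 2.0, 0.01, 0.1))  # ~0.91
```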
Figure 6: Illustration of our cuboid fitting step. (left): The
inliers of the first fitted 3D plane are marked in green. The
remaining points and their projection on the plane are marked
in red and blue, respectively. A 3D line is fitted to these
points. (right): The fitted cuboid.
In the final step of the method, the fitted cuboids
are projected in the Manhattan World of the scene, in
order to obtain the 3D model of the scene, as illus-
trated in Fig. 2f. Additionally, the cuboids are projected on the input RGB image in order to demonstrate how well the fitting procedure performs (see Fig. 2e).
Figure 7: Impact of the selected threshold and ratio on the cuboid fitting. (left): Fixed global threshold and ratio proposed here. (middle): Varying threshold proposed here and ratio proposed in (Jiang and Xiao, 2013). (right): Threshold and ratio proposed here.
4 NEW GROUND TRUTH
DATASET
For an objective evaluation, a new dataset with mea-
sured ground truth 3D positions was built. This
dataset is composed of 4 different scenes and each
scene is captured under 3 different viewpoints and 4
different illuminations. Thus, each scene consists of
12 images. For all these 4 scenes, the 3D positions
of the vertices of the objects were measured using a
telemeter. These coordinates constitute the ground
truth. The intersection point of the three planes of the Manhattan World was considered as the reference point. It should be noted that the measurement of vertex positions in 3D space with a telemeter is
not perfectly accurate and the experimental measure-
ments show that the precision of these ground truth
data is approximately ±3.85mm. Some of the dataset
images can be seen in the figures of the next section.
5 EXPERIMENTS
5.1 Qualitative Evaluation
As a first demonstration of the proposed method, some reconstruction results are shown in Fig. 8. It can
be seen that it performs well even in very demand-
ing scenes with strong clutter. Moreover, it is able to
handle small and thin objects with convex surfaces.
Subsequently, our method is compared with the re-
cent method proposed by (Jiang and Xiao, 2013)
since their method not only performs cuboid fitting
to RGB-D data but also outperforms various other ap-
proaches. A first visual comparison can be performed
on both our dataset and the well-known NYUv2
Kinect Dataset (Silberman et al., 2012) in Figs. 9
and 10, respectively. It should be noted that all the thresholds in this paper were set to the values given above for both our dataset and the NYUv2 dataset. This
point highlights the generality of our method that was
tested in a wide variety of scenes. (Jiang and Xiao,
2013) have further improved their code and its latest release (January 2014) was used for our comparisons.
A random subset of 40 images that contain informa-
tion about the layout of the room was selected from
the NYUv2 Kinect dataset. The results imply that
our method provides significantly better reconstruc-
tions than this state-of-the-art approach. Furthermore,
in various scenes in Fig. 9, it can be observed that
the global cuboid fitting method of (Jiang and Xiao,
2013) can result in cuboids that do not correspond to
any object in the scene. The reason for this is the large
set of candidate cuboids that they produce for each pair of planar surfaces in the image. The strong con-
straints that they apply afterwards, in order to elimi-
nate the cuboids which do not correspond to an object,
do not always guarantee an optimal solution. Another
drawback of this approach is that the aforementioned
constraints might eliminate a candidate cuboid that
does belong to a salient object. In the next section,
the improvement of our approach is quantified by an
exhaustive test on our ground truth dataset.
5.2 Quantitative Evaluation
In order to test how accurate the output of the proposed method is and how robust it is against different
viewpoints and illuminations, the following proce-
dure was used. The 3D positions of the reconstructed
vertices are compared to their ground truth positions
by measuring their Euclidean distance. The mean
value (µ) and the standard deviation (σ) of these Eu-
clidean distances as well as the mean running time of
the algorithm over the 12 images of each scene are
presented in Table 1. The results using the code of
(Jiang and Xiao, 2013) are included in the table for
comparison. It should be noted that since this method
does not provide the layout of the room, their esti-
mated cuboids are rotated to the Manhattan World ob-
tained by our method for each image.
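For reference, a minimal sketch (NumPy, with made-up coordinates) of how µ and σ are obtained for a set of matched vertices:

```python
import numpy as np

def vertex_error_stats(reconstructed, ground_truth):
    """Mean and standard deviation (mm) of the per-vertex Euclidean distances.

    Both arrays are (N, 3) vertex positions expressed in the Manhattan World
    frame of the scene, with matching rows.
    """
    distances = np.linalg.norm(reconstructed - ground_truth, axis=1)
    return distances.mean(), distances.std()

# Illustrative example with two vertices (coordinates in mm are made up).
rec = np.array([[105.0, 52.0, 0.0], [498.0, 61.0, 302.0]])
gt = np.array([[100.0, 50.0, 0.0], [500.0, 55.0, 300.0]])
mu, sigma = vertex_error_stats(rec, gt)
```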
Figure 8: Various results of the proposed method on different real indoor scenes.
Table 1: Mean value (µ) and standard deviation (σ) of the Euclidean distances in mm between the ground truth and the reconstructed vertices over the 12 images of each scene, and mean running time (t) in seconds of each algorithm (running on a Dell Inspiron 3537, i7 1.8 GHz, 8 GB RAM).
            Our method               (Jiang and Xiao, 2013)
            µ      σ      t          µ       σ      t
Scene 1     52.4   8.8    8.8        60.9    19.6   25.3
Scene 2     60.4   20.9   12.3       132.7   65.9   26.1
Scene 3     69.7   20.2   14.2       115.7   48.3   27.2
Scene 4     74.9   35.3   12.2       145.3   95.4   26.8
During the experiments, it was noticed that the results of (Jiang and Xiao, 2013) were very unstable and, in several cases, their method could not provide
cuboid for each object in the scene. Moreover, since
the RANSAC algorithm is non-deterministic, neither our approach nor that of (Jiang and Xiao, 2013) is deterministic. In order to quantify this instability, each al-
gorithm was run 10 times on the exact same image
(randomly chosen) of each scene. The mean (µ) and
standard deviation (σ) of the Euclidean distance be-
tween the ground truth and the reconstructed 3D po-
sitions were measured. The results are presented in
Table 2. It should be noted that the resulting 3D po-
sitions of both algorithms are estimated according to
the origin of the estimated layout of the room. Thus,
the poor resolution of the Kinect sensor is perturb-
ing the estimation of both the layout and the 3D positions of the objects, and the errors accumulate.
However, the values of the mean and standard devia-
tion for our method are relatively low with respect to
the depth resolution of the Kinect sensor at that distance,
which is approximately 50 mm at 4 meters (Andersen
et al., 2012). Furthermore, the standard deviations in Table 2 are considerably low and indicate a maximum deviation of the results of less than 4.5 mm.
Figure 9: Comparison of the results obtained by (Jiang and Xiao, 2013) (odd rows) and the method proposed in this paper (even rows) for the NYUv2 Kinect dataset.
Figure 10: Random results of (Jiang and Xiao, 2013) (top 2 rows) and the corresponding ones of our method (bottom 2 rows) for the ground truth dataset.
Finally, as can be seen in Table 1, the computa-
tional cost of our method is dramatically lower than
the one of (Jiang and Xiao, 2013). It should be noted
that in this running time our method computes the complete 3D reconstruction of the scene. It requires around 9 seconds for a simple scene and less than 20 seconds for a demanding scene with strong clutter and occlusion on a Dell Inspiron 3537, i7 1.8 GHz, 8 GB RAM. It is worth mentioning that no optimization was done in the implementation. Thus, the aforementioned running times could be considerably lower.
Table 2: Mean value (µ) and standard deviation (σ) of the Euclidean distances between the ground truth and the reconstructed vertices over 10 iterations of the algorithm on the same image.
            Our method               (Jiang and Xiao, 2013)
            µ (mm)   σ (mm)          µ (mm)   σ (mm)
Scene 1     50.2     2.7             54.3     10.5
Scene 2     57.0     2.9             104.9    37.8
Scene 3     81.9     2.7             93.2     20.5
Scene 4     72.0     4.5             195.4    35.7
6 CONCLUSIONS
In this paper, a new method that provides accurate 3D
reconstruction of an indoor scene using a single RGB-
D image is proposed. First, the layout of the scene is
extracted by exploiting and improving the method of
(Taylor and Cowley, 2012). The latter is achieved by
better addressing the problem of the non-linear rela-
tionship between depth resolution and distance from
the sensor. For the 3D reconstruction of the scene,
we propose to fit cuboids to the objects composing
the scene since this shape is well adapted to most of
the indoor objects. Unlike the state-of-the-art method
(Jiang and Xiao, 2013) that runs a global optimization
process over sets of cuboids with strong constraints,
we propose to automatically segment the image, as a
preliminary step, in order to focus on the local cuboid
fitting on each extracted object. It is shown that our
method is robust to viewpoint and object orientation
variations. It is able to provide meaningful interpreta-
tions even in scenes with strong clutter and occlusion.
More importantly, it outperforms the state-of-the-art
approach not only in accuracy but also in robustness
and time complexity. Finally, a ground truth dataset
for which the exact 3D positions of the objects have
been measured is provided. This dataset can be used
for future comparisons.
REFERENCES
Andersen, M. R., Jensen, T., Lisouski, P., Mortensen, A.,
Hansen, M., Gregersen, T., and Ahrendt, P. (2012).
Kinect depth sensor evaluation for computer vision
applications. Technical Report ECE-TR-06, Aarhus
University.
Bazin, J. C., Seo, Y., Demonceaux, C., Vasseur, P., Ikeuchi,
K., Kweon, I., and Pollefeys, M. (2012). Globally
optimal line clustering and vanishing point estimation
in manhattan world. In CVPR, pages 638–645.
Coughlan, J. M. and Yuille, A. L. (1999). Manhattan world:
Compass direction from a single image by bayesian
inference. In ICCV, pages 941–947.
Cupec, R., Nyarko, E. K., and Filko, D. (2011). Fast 2.5d
mesh segmentation to approximately convex surfaces.
In ECMR, pages 49–54.
Dou, M., Guan, L., Frahm, J.-M., and Fuchs, H. (2013).
Exploring high-level plane primitives for indoor 3d re-
construction with a hand-held rgb-d camera. In ACCV
2012 Workshops, volume 7729 of Lecture Notes in
Computer Science, pages 94–108. Springer Berlin
Heidelberg.
Dwibedi, D., Malisiewicz, T., Badrinarayanan, V., and Ra-
binovich, A. (2016). Deep cuboid detection: Beyond
2d bounding boxes.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: A paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM, 24(6):381–395.
Hedau, V., Hoiem, D., and Forsyth, D. (2009). Recovering
the spatial layout of cluttered rooms. In ICCV, pages
1849–1856.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe,
R., Kohli, P., Shotton, J., Hodges, S., Freeman, D.,
Davison, A., and Fitzgibbon, A. (2011). Kinectfu-
sion: real-time 3d reconstruction and interaction using
a moving depth camera. In UIST, pages 559–568.
Jia, Z., Gallagher, A., Saxena, A., and Chen, T. (2013). 3d-
based reasoning with blocks, support, and stability. In
CVPR.
Jiang, H. (2014). Finding Approximate Convex Shapes in
RGBD Images, pages 582–596. Springer International
Publishing, Cham.
Jiang, H. and Xiao, J. (2013). A linear approach to matching
cuboids in rgbd images. In CVPR.
Lin, D., Fidler, S., and Urtasun, R. (2013). Holistic scene
understanding for 3d object detection with rgbd cam-
eras. In ICCV, pages 1417–1424.
Mirzaei, F. and Roumeliotis, S. (2011). Optimal estimation
of vanishing points in a manhattan world. In ICCV,
pages 2454–2461.
Neumann, D., Lugauer, F., Bauer, S., Wasza, J., and
Hornegger, J. (2011). Real-time rgb-d mapping and
3-d modeling on the gpu using the random ball cover
data structure. In ICCV Workshops, pages 1161–1167.
Neverova, N., Muselet, D., and Trémeau, A. (2013). 2 1/2
d scene reconstruction of indoor scenes from single
rgb-d images. In CCIW, pages 281–295.
Piegl, L. (1991). On nurbs: a survey. IEEE Computer
Graphics and Applications, 11(1):55–71.
Ren, Z. and Sudderth, E. B. (2016). Three-dimensional ob-
ject detection and layout prediction using clouds of
oriented gradients. IEEE CVPR.
Richtsfeld, A., Mörwald, T., Prankl, J., Balzer, J., Zillich,
M., and Vincze, M. (2012). Towards scene under-
standing - object segmentation using rgbd-images. In
CVWW.
Schwing, A. and Urtasun, R. (2012). Efficient exact infer-
ence for 3d indoor scene understanding. In ECCV,
volume 7577 of Lecture Notes in Computer Science,
pages 299–313. Springer Berlin Heidelberg.
Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012).
Indoor segmentation and support inference from rgbd
images. In ECCV, pages 746–760. Springer.
Taylor, C. and Cowley, A. (2011). Fast scene analysis using
image and range data. In ICRA, pages 3562–3567.
Taylor, C. and Cowley, A. (2012). Parsing indoor scenes
using rgb-d imagery. In RSS.
Zhang, J., Kan, C., Schwing, A. G., and Urtasun, R. (2013).
Estimating the 3d layout of indoor scenes and its clut-
ter from depth sensors. In ICCV, pages 1273–1280.
Zhang, Y., Song, S., Tan, P., and Xiao, J. (2014). PanoCon-
text: A Whole-Room 3D Context Model for Panoramic
Scene Understanding, pages 668–686. Springer Inter-
national Publishing, Cham.