Near Real-time Object Detection in RGBD Data
Ronny Hänsch, Stefan Kaiser and Olaf Hellwich
Computer Vision & Remote Sensing, Technische Universität Berlin, Berlin, Germany
Keywords:
Object Detection, Random Forests, RGBD-data.
Abstract:
Most methods of object detection with RGBD cameras set hard constraints on their operational area. They
only work with specific objects, in specific environments, or rely on time-consuming computations. In the
context of home robotics, such hard constraints cannot be imposed. Specifically, an autonomous home robot
shall be equipped with an object detection pipeline that runs in near real-time and produces reliable results
without restricting object type and environment. For this purpose, a baseline framework that works on RGB
data only is extended by suitable depth features that are selected on the basis of a comparative evaluation. The
additional depth data is further exploited to reduce the computational cost of the detection algorithm. A final
evaluation of the enhanced framework shows significant improvements compared to its original version and
state-of-the-art methods in terms of both detection performance and real-time capability.
1 INTRODUCTION AND RELATED WORK
Recent advances in robotics allow research to focus
on the potential integration of mobile manipulators
into home environments. One task of such robots
is to fetch and carry objects by utilising informa-
tion that is managed and stored by smart home systems. The task is subdivided into five steps: search
object, estimate object pose, grip object, carry object, and release object. For a successful execution,
the robot needs to be equipped with a reliable object
detection algorithm that is able to cope with several
challenges including multiple object categories, only
weakly constrained object types, object poses with
six degrees of freedom, scenes cluttered by every-
day household objects, various backgrounds, chang-
ing object appearances due to different lighting con-
ditions, and a near real-time performance while main-
taining high detection accuracy.
Fulfilling all these requirements is a hard chal-
lenge, which is usually approached by equipping mo-
bile manipulators with multiple sensors, i.e. depth
and RGB cameras. While depth data is complemen-
tary to RGB images, their joint usage has its own chal-
lenges such as a higher computational load.
A sophisticated object detection toolkit for robot
manipulation tasks is proposed in (Mörwald et al.,
2010). An edge-based tracker aligns a given 3D CAD
model of the object, such that its projection fits the
training object in the image. Distinctive SIFT feature
points are extracted and stored in a codebook along
with their three-dimensional coordinates on the model
surface. During the detection phase, SIFT points are
extracted from the scene and matched against the
entries in the codebook. In (Tombari and Di Stefano,
2010) a 3D Hough voting scheme is used to local-
ize objects in point clouds. It is based on 3D fea-
ture points that have been extracted from the point
cloud of the object. For each feature point a local
reference frame and offset vector to the object’s cen-
ter are stored. For detection, feature points are ex-
tracted from the scene and matched against model
feature points which leads to a set of point-to-point
correspondences. The scene feature points vote into
a 3D Hough space by using the stored local reference
frame as well as the offset vector. Both methods re-
port state-of-the-art recognition and pose estimation
results. However, they lack several of the require-
ments for our work since not all object types can be
described well by feature points: While the perfor-
mance in (Mörwald et al., 2010) drastically decreases
for objects at different scales or weakly textured sur-
faces, the used 3D detector in (Tombari and Di Ste-
fano, 2010) fails for simple shapes (such as boxes).
Template-based approaches utilize modalities
such as object shape and thus perform well for objects
without distinctive surface features. Early approaches
are based on matching each trained object template
with the image or its Fourier-transformation in a slid-
ing window approach (Vergnaud, 2011). A huge
speedup is achieved in (Hinterstoisser et al., 2010)
by quantising image gradients from the templates.
LINEMOD (Hinterstoisser et al., 2011) extends this
work by adding depth features and is able to detect an
impressive number of objects simultaneously in near
real-time. However, it produces many false positives in
cluttered scenes, which requires time-consuming post-
processing steps. The more recent works of (Hin-
terstoisser et al., 2013; Rios-Cabrera and Tuytelaars,
2013) additionally include color information. The
work in (Rusu et al., 2010) and its extension (Wang
et al., 2013) operate on the point cloud level
(instead of intensity and depth images) and compute
viewpoint dependent feature histograms. During de-
tection, the learnt histograms are matched against his-
tograms of point cloud regions. The use case is re-
stricted to tabletop scenarios in which all objects re-
side on the dominant plane in the scene which allows
an easy foreground-background segmentation.
Recently, convolutional networks have been suc-
cessfully applied to RGBD tasks, such as learning
representations from depth images. The multiscale se-
mantic segmentation of (Farabet et al., 2013) is ex-
tended in (Couprie et al., 2013) to work directly on
RGBD images. The work in (Gupta et al., 2014) uses
a large convolutional network that was pre-trained on
RGB images to generate features for depth images
and obtains a substantial improvement of detection
accuracy. In (Bo et al., 2014) hierarchical match-
ing pursuit is used instead of deep nets to learn fea-
tures from images captured by RGBD cameras. These
works focus on detection accuracy and seldom make
statements about run time and computational costs.
The Implicit Shape Model (ISM) (Leibe et al.,
2006) applies a Hough voting scheme based on a
codebook of “visual words”, i.e. clustered image
patches, along with 2D offset vectors that cast prob-
abilistic votes for object centres. A 3D version of
ISM is proposed in (Knopp et al., 2010) where 3D
features and a 3D Hough space replace its 2D coun-
terparts. Both versions rely on feature point detec-
tors to reduce the search space leading to similar re-
strictions as discussed above. Hough Forests (Gall
et al., 2012) also learn to distinguish patches and cor-
responding offset vectors but do not use any feature
detector to find salient object locations. While a ran-
domized subset of object patches is sampled during
training, a sliding window is used during the detec-
tion stage. A Random Forest replaces patch cluster-
ing and codebook. The learning procedure aims to
distinguish patches of different classes (classification)
and to merge patches with similar offset vectors (re-
gression) simultaneously. The Hough Forest comes
with several desirable properties: 1) The Hough vot-
ing produces probabilistic object hypotheses. 2) The
distinctive training of objects against other objects
(and background) leads to potentially low false posi-
tive rates. 3) No object-type, -texture, -shape, scene or
background assumptions are made. 4) The framework
is independent of specific image features. 5) Perspec-
tive invariance is achieved by feeding training images
of different view points. Scale invariance is achieved
by resizing the query image. The downside of this ap-
proach is that detection time scales rather poorly with
image resolution, the maximum tree depth, as well as
the number of classes, scales, and trees. The work of
(Badami et al., 2013) uses a Hough Forest which is
trained jointly on image and depth features. However,
the individual contribution of the different features
is not analyzed, although it is noted that the usage
of depth features increases performance significantly
over using color information only. Furthermore, the
run time of the approach is not considered.
Our work focuses on leveraging the advantages of
Hough Forests for the object detection step of the full
pipeline while decreasing the computation time. To
reach reasonable classification results, the potential
of several depth-based features (Section 3) is investi-
gated by evaluating their individual classification ac-
curacies (Section 4). Based on these results, a final
set of features is proposed. Several methodological as
well as implementational adjustments reduce the time
complexity (Section 5) and enable near real-time per-
formance, while still achieving state-of-the-art detec-
tion accuracy (Section 6).
2 HOUGH FOREST
Hough Forests (Gall et al., 2012) are a variant of Ran-
dom Forests (Breiman, 2001) which is an ensemble
learning framework capable of classification and re-
gression. Similar to the Generalised Hough Trans-
form (Ballard, 1981), a Hough Forest accumulates
probabilistic object hypotheses in a voting space that
is parameterized by the object’s center (x, y), class c,
and scale s. Object candidates (c, x, y, s, p, b)_i are ex-
tracted as maxima in this voting space, where p is
the candidate’s confidence. The candidate’s bound-
ing box b is estimated by backprojecting patches that
voted for this candidate. A post-processing step re-
moves detections whose bounding box overlaps an-
other detection with a significantly higher confidence
value.
Our work builds on the implementation of (Gall
et al., 2012) which is adapted as described in Sec-
tion 5 and extended by using a different set of features
(see Section 4).
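To make the candidate extraction concrete, the following sketch (Python/NumPy/OpenCV; function and variable names are placeholders and not part of the original C++ implementation of (Gall et al., 2012)) finds local maxima in a smoothed single-class, single-scale Hough map and suppresses detections whose bounding box strongly overlaps a higher-confidence one.

```python
import numpy as np
import cv2

def extract_candidates(hough_map, kernel=11, threshold=0.1):
    """Local maxima of a smoothed 2D Hough voting map for one class and scale."""
    smoothed = cv2.GaussianBlur(hough_map.astype(np.float32), (kernel, kernel), 0)
    # a pixel is a candidate if it equals the local maximum and exceeds the threshold
    local_max = cv2.dilate(smoothed, np.ones((kernel, kernel), np.uint8))
    ys, xs = np.where((smoothed >= local_max) & (smoothed > threshold))
    return sorted(zip(xs, ys, smoothed[ys, xs]), key=lambda c: -c[2])

def overlap(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def suppress(candidates, boxes, max_overlap=0.5):
    """Drop candidates whose backprojected box overlaps a stronger detection."""
    keep = []
    for i, _ in enumerate(candidates):     # candidates are sorted by confidence
        if all(overlap(boxes[i], boxes[j]) < max_overlap for j in keep):
            keep.append(i)
    return [candidates[i] for i in keep]
```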
3 FEATURES
The following features are analyzed with respect to
classification performance and computational load in
Section 4:
1. Intensity-based Features. The implementation
of (Gall et al., 2012) uses minimal and maximal
values in a 5 × 5 pixel neighbourhood of the first
and second derivatives of the grayscale image and
the Histogram of Oriented Gradients (HOG) re-
sulting in a 32-dimensional feature vector.
2. Depth Value. As suggested in (Lai et al., 2011a)
the raw depth value d relative to the object size is
used, where the object size is replaced by scale s.
The scale-dependent depth value f_ds = d · s mod-
els the direct proportionality between object dis-
tance and scale.
3. Depth Derivatives. The first and second order
derivatives of the (raw) depth value are computed
by the Sobel filter.
4. Depth HoG. While (Janoch et al., 2011) claims
that HoG on depth images falls behind its RGB
version, the Depth HoG outperforms the RGB
variant in (Lai et al., 2011a).
5. Histogram of Oriented Normal Vectors. The
Histogram of Oriented Normal Vectors (HONV,
(Tang et al., 2013)) extends the HoG by one
dimension and bins surface normals of a k-
neighbourhood into a two-dimensional histogram.
6. Principal Curvature. This feature describes the
local surface geometry in terms of minimal and
maximal curvature, corresponding to the eigen-
values of the covariance matrix of all points in a
k-neighbourhood projected onto the tangent plane
of the surface at a point (Arbeiter et al., 2012).
The vector indicating the direction of the maxi-
mum curvature (principal direction) contains fur-
ther surface information. To distinguish further
between curved surfaces, (Arbeiter et al., 2012)
suggests to use the ratio of minimum and maxi-
mum principal curvature.
7. (Fast) Point Feature Histograms. The Point
Feature Histogram (PFH) encodes shape proper-
ties by quantizing the geometrical relationship be-
tween pairs of points within the k-neighbourhood
of a query point (Rusu et al., 2009). While the
computational complexity of PFH is O(nk²) for a
point cloud with n points, its extension, the Fast Point
Feature Histogram (FPFH), reduces it to O(nk).
As for the RGB features, the minimum and maximum
in a 5 × 5 neighbourhood are used for the scaled depth,
the depth derivatives, and the principal curvature (see the sketch below).
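To illustrate Features 2 and 3 and the min/max filtration, the following sketch (Python/OpenCV; the channel layout and names are illustrative assumptions, not the layout of the original implementation) builds the scale-dependent depth value f_ds = d · s, the Sobel derivatives of the raw depth, and their 5 × 5 minimum/maximum channels.

```python
import numpy as np
import cv2

def depth_feature_channels(depth, scale, ksize=7):
    """Depth-based feature channels: scaled depth plus Sobel derivatives,
    each min- and max-filtered in a 5 x 5 neighbourhood.

    depth : depth image (float32, metres), scale : query scale s.
    """
    d = depth.astype(np.float32)
    f_ds = d * scale                                    # scale-dependent depth d * s
    dx  = cv2.Sobel(d, cv2.CV_32F, 1, 0, ksize=ksize)   # first order derivatives
    dy  = cv2.Sobel(d, cv2.CV_32F, 0, 1, ksize=ksize)
    dxx = cv2.Sobel(d, cv2.CV_32F, 2, 0, ksize=ksize)   # second order derivatives
    dyy = cv2.Sobel(d, cv2.CV_32F, 0, 2, ksize=ksize)

    kernel = np.ones((5, 5), np.uint8)
    channels = []
    for ch in (f_ds, dx, dy, dxx, dyy):
        channels.append(cv2.erode(ch, kernel))          # local minimum in 5 x 5 window
        channels.append(cv2.dilate(ch, kernel))         # local maximum in 5 x 5 window
    return channels
```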
4 FEATURE EVALUATION
The RGBD Object Dataset (Lai et al., 2011a) contains
about 300 objects in 51 categories recorded from vari-
ous perspectives. Each frame consists of an RGB and
a depth image, a bounding box, and a pixel-wise ob-
ject mask. The dataset provides indoor background
data and eight different annotated indoor scenes.
The scene table_small_1 (Figure 3(a)) is used as the
test set for the following comparisons due to its va-
riety of scene properties such as object size, surface
flatness and texture, object-camera distances, and per-
spectives. It contains four objects: A bowl, a cereal
box, a coffee mug, and a soda can. The training data
consists of 36 images of a full 360° rotation for three
different pitch angles. The background data contains
215 images of ordinary office scenes, partly covering
the test set background but without the objects.
The evaluation results are reported as the area un-
der the precision/recall curves (AUC) over all frames
of the test scene. A detection is counted as true pos-
itive if the overlap of predicted and reference bound-
ing box is at least 50% of the joint area. The average
AUC is computed as mean over five training and test
runs. Multiple detections of the same object are con-
sidered as false positives.
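A sketch of this evaluation protocol (Python/NumPy; helper names are ours, not from the original evaluation code): detections are matched greedily by descending confidence, a match requires an intersection over union of at least 0.5, duplicate matches count as false positives, and the precision/recall curve is integrated to obtain the AUC.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def pr_auc(detections, references, min_overlap=0.5):
    """detections: list of (box, confidence); references: list of reference boxes."""
    detections = sorted(detections, key=lambda d: -d[1])
    matched, tp, fp = set(), [], []
    for box, _ in detections:
        overlaps = [iou(box, r) for r in references]
        best = int(np.argmax(overlaps)) if overlaps else -1
        if best >= 0 and overlaps[best] >= min_overlap and best not in matched:
            matched.add(best)            # first detection of this object: true positive
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)   # miss or duplicate detection: false positive
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(references), 1)
    precision = tp / np.maximum(tp + fp, 1)
    return float(np.trapz(precision, recall))   # area under the precision/recall curve
```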
The following standard parameters are used: 15
trees with a maximum depth of 25 levels, five scales
(0.33, 0.66, 1.0, 1.66, 2.33) with a query image res-
olution of 640 × 320, 250 patches of size 16 × 16
pixel are sampled from each training image. To get a
rich precision/recall evaluation, up to 250 detections
per class are allowed if they are above the detection
threshold of 0.1. All these parameters stay unchanged
during the following experiments and only the feature
space is modified by using the different depth features
(Section 3.2–7) in addition to the RGB features of
Section 3.1. The reported changes in performance are
always relative to the baseline of using only RGB.
1. RGB Features. The baseline detection perfor-
mance is an average AUC of 0.576. Figure 1(a)
shows that the coffee mug and soda can already perform
decently, while bowl and cereal box are far below
0.5. None of the objects reaches a perfect recall
and only coffee mug and soda can achieve a pre-
cision of 1.0 (coffee mug only at a very low re-
call). The non-monotonic trends of the bowl
and cereal box curves are caused by false detec-
tions with high confidence values. There are sev-
eral reasons for the bad performance of bowl and
cereal box. The bowl does not have any texture
which can be described by the intensity gradient
based feature vector. Only color and shape information are useful here. The surface is reflective
and prone to changes in lighting and perspective.
The cereal box, on the other hand, is rich in texture
but very large: a 16 × 16 patch at the center of
the box might be useful for classification, but not
for regressing the object's center. The Hough voting
spaces of the individual objects shown in Figure 3(b) illustrate these issues.
Figure 1: Detection performance on the table_small_1 test set in terms of precision/recall curves. The number after each label is the AUC for the respective object.
(a) Baseline: bowl_4 0.354, cereal_box_1 0.418, coffee_mug_1 0.671, soda_can_3 0.845, mean 0.576
(b) Depth value: bowl_4 0.614, cereal_box_1 0.881, coffee_mug_1 0.641, soda_can_3 0.914, mean 0.768
(c) Depth derivative: bowl_4 0.571, cereal_box_1 0.632, coffee_mug_1 0.755, soda_can_3 0.819, mean 0.699
(d) Depth HoG: bowl_4 0.720, cereal_box_1 0.680, coffee_mug_1 0.756, soda_can_3 0.913, mean 0.772
(e) HONV: bowl_4 0.693, cereal_box_1 0.568, coffee_mug_1 0.668, soda_can_3 0.900, mean 0.712
(f) Principal curvature: bowl_4 0.605, cereal_box_1 0.766, coffee_mug_1 0.673, soda_can_3 0.919, mean 0.746
(g) PFH: bowl_4 0.586, cereal_box_1 0.829, coffee_mug_1 0.602, soda_can_3 0.933, mean 0.742
(h) Combined features: bowl_4 0.863, cereal_box_1 0.827, coffee_mug_1 0.731, soda_can_3 0.901, mean 0.836
2. Depth Value. This feature reaches
0.768 AUC, beating the baseline by 33%. The
changes in the detection performance are differ-
ent for different objects: Bowl and cereal box im-
proved most and roughly doubled their AUC (Fig-
ure 1(b)). The soda can improved slightly, while
the coffee mug performs 0.03 AUC worse.
3. Simple Derivatives. The main parameter of the
first and second order derivatives is the kernel
size, which is tested empirically. Most of the ob-
tained performance differences are not significant.
Nevertheless, the general trend is that after a cer-
tain size the performance does not increase any-
more and even drops. A 3 × 3 kernel is simply
too small to be very descriptive. Kernels of
increasing size, on the other hand, contain more
data that is less and less characteristic of the
query point. A kernel size of 7 × 7 performs
best. The depth derivatives improve the detection
performance by at least 20% (Figure 1(c)). The
bowl improved most and lost the majority of the
strong false positives. The same applies to the ce-
real box, with the only difference that more false
positives are left. The coffee mug detection per-
formance improved slightly, but the soda can got
worse by 3%. While the soda can’s recall im-
proved, precision is lost at recall rates from 0.7
to 0.9, which means that additional false posi-
tives are detected. Since soda can and coffee mug
have very similar shape and surface properties (es-
pecially the curvature differs only slightly), they
have similar depth derivatives.
4. Depth HoG. As depicted in Figure 1(d), the
Depth HoG improves the detection performance
by 34%. Bowl and cereal box improved most with
+103% and +63%, especially in terms of preci-
sion. The already good performing coffee mug
was raised by 13%, the soda can by 8%.
5. Histogram of Oriented Normal Vectors. The
HONV feature is specified by a number of pa-
rameters. The parameter set of 4 and 3 bins in
azimuth and zenith, respectively, and 120 nearest
neighbours for normal computation and the his-
togram binning performed best within initial em-
pirical tests. As illustrated in Figure 1(e), the de-
tection performance is improved by 23%, whereas
the individual objects exhibit similar increments
as with the other feature vectors.
6. Principal Curvature. The two crucial parame-
ters here are the support areas for the normal com-
putation s_n and for the principal curvature s_c it-
self. A compromise between noise reduction and
preservation of sharp features is required. Sev-
eral settings (i.e. s_n, s_c ∈ {50, 100, ..., 300}) were
tested empirically, where s_n = 100 and s_c = 225
performed best. The principal curvature outper-
forms the baseline detection by 30%. Figure 1(f)
shows that bowl and cereal box detection perfor-
mance has almost doubled, while the already well
performing soda can is slightly boosted by ap-
proximately 7%. Only the coffee mug does not
show any improvement and even lost some recall
performance. The general precision did improve
a lot, while recall improved only slightly. The high-confidence false positives vanished.
Figure 2: Performance of the individual features (mean p/r AUC and standard error of the mean) for RGB, D-Sobel, D-Value, D-HoG, PC, HoNV, and FPFH.
Figure 3: Hough voting spaces of an example frame. (a) table_small_1: bowl (I), cereal box (II), coffee mug (III), soda can (IV); (b) initial version; (c) extended version.
7. Point Feature Histogram. Similar to HONV and
principal curvature, the support sizes of the PFH
and the normal computation were determined em-
pirically and set to 200 and 100 nearest neigh-
bours, while the histogram size was set to 7.
The PFH’s detection performance is similar to the
principal curvature (+29%), but with different re-
sults for the individual objects as depicted in Fig-
ure 1(g). While all others improved, the coffee
mug lost 10%.
For the fixed parameter set (patch size, tree depth,
etc.), the Depth HoG performs best among the tested
features (see Figure 2). It improves the detection
performance by approximately 34% and boosted in-
dividual as well as overall class purity. In contrast
to most other features, no object lost any of its de-
tection performance compared to the baseline. The
scaled depth value achieved similar results and is only
slightly worse. However, its power does not come
from its ability to describe the surface in a distinctive
fashion. It is rather a verification feature, that con-
tains the physical relationship of depth and scale, and
thus mainly rules out physically impossible object lo-
cations in scale space.
Features that operate on surface normals are
mostly outperformed by depth-map features. While
the latter are based on neighbours defined by the spa-
tial distances in image coordinates, the former are com-
puted from points in 3D space, which should result in
more reliable information. The Hough Forest intrin-
sic split statistics indicate a higher noise ratio of the
surface normal features, which might explain this dis-
crepancy. Neither different normal calculation meth-
ods, scale space changes, smoothing mechanisms or
hole-filling methods, nor different feature scalings
and transformations improved the performance be-
yond the presented results.
Despite the overall increased detection perfor-
mance by combining RGB with depth features, the
time complexity of the corresponding calculations has
to be taken into account. There are GPU implemen-
tations available for all investigated algorithms, but
some of them are still in their beta phase and not yet
released as a stable version. As a consequence, the
detection performance of GPU versions for princi-
pal curvature and FPFH either falls behind their CPU
counterparts (which are presented here), or the input
data is restricted to a special kind of point cloud that
is different from the data of the application scenario.
For those reasons, the scaled depth value, the So-
bel derivatives, and the Depth HoG are combined with
the RGB features to form the final feature vector. This
mixture raises the detection performance to 0.834 AUC,
a jump of 46%. As depicted in Figure 1(h),
all strong false positives disappeared and a precision
of more than 0.9 up to a recall of 0.65 (coffee mug)
resp. 0.8 (all others) is achieved. The Hough space
in Figure 3(c) shows that the ambiguity between bowl
and coffee mug almost disappeared and there is less
clutter compared to the baseline (Figure 3(b)).
5 TIME COMPLEXITY AND ADJUSTMENTS
The original Hough Forest implementation, with pa-
rameters set as in the feature evaluation, takes about
53 seconds on the target system for one single frame.
A household robot with this kind of detector would be
a real test of user patience.
The changes in run time reported in the follow-
ing are not cumulative but are always given rela-
tive to the baseline implementation. A final evalua-
tion with respect to time and accuracy based on the
selected features and all changes to decrease the com-
putational complexity is given in Section 6.
5.1 Data and Dimensionality Reduction
Of the many parameters that have an impact on the
execution time, the resolution of the image and the
number of scales (e.g. of the Hough voting space)
are especially crucial. The same general setting as
above was chosen during the following experiments,
while the scaled depth value and the depth derivatives
are used as depth features. The performance evalua-
tion is based on the average of five runs. The results
are reported in terms of AUC, while the time mea-
surements refer to wall time on the target system (an
octo-core Intel Core i7-3770 CPU and an Nvidia Titan Black GPU).
5.1.1 Image Resolution
The original resolution of the images is 640 × 320,
which is downsized at the beginning of the process-
ing pipeline. Affected parameters (e.g. support size of
the features, the patch size, and the smoothing param-
eters of the voting space) are adjusted accordingly.
Training and detection are computed on those down-
sized images, while the detection results, i.e. position,
scale, and bounding box, are upsized to the original
size. This not only compresses the information of the
query image, but also the voting space.
The best compromise of time saving and decrease
of detection performance was found by grid search at
a scale of 0.5. The detection time decreased from 56s
to 14s (-75%), while the detection performance even
gained 3%. The performance increase is most prob-
ably caused by the implicit change of the RGB fea-
ture’s support size and noise suppression. Stronger
downscaling decreased the performance almost expo-
nentially.
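A sketch of the resizing step (Python/OpenCV; `detector` is a placeholder for the Hough Forest detection call, and the adjustment of patch size, support sizes, and vote smoothing to the factor is assumed to happen elsewhere):

```python
import cv2

def detect_downscaled(rgb, depth, detector, factor=0.5):
    """Run detection on a downsized frame and map the results back."""
    small_rgb = cv2.resize(rgb, None, fx=factor, fy=factor,
                           interpolation=cv2.INTER_AREA)
    small_depth = cv2.resize(depth, None, fx=factor, fy=factor,
                             interpolation=cv2.INTER_NEAREST)  # avoid mixing depth values
    detections = detector(small_rgb, small_depth)
    # upscale positions and bounding boxes to the original resolution
    return [(c, x / factor, y / factor, s, [v / factor for v in box])
            for (c, x, y, s, box) in detections]
```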
5.1.2 Depth Normalization
In the absence of depth data, the scale space is needed to
detect objects at distances that are different from the
ones learnt. From depth data, however, the correct
scale can be derived easily. We depth-normalize the
two points of the binary test functions of the Hough
Forest as well as the patch-object offset vectors. Since
the bounding boxes are generated by backprojection
(see Section 2), this depth normalized binary test
and voting also renders the need for multiple Hough
spaces over scale obsolete, which further reduces the
complexity. Note, however, that the support size of
the feature computation stays unaffected.
Figure 4: Depth normalization. Single-scale and depth-normalization results normalized to the baseline performance (mean p/r AUC and SEM in %) on desk_2, table_small_1, and table_small_2.
Since the table_small_1 dataset does not contain
much scale variance, the desk_2 and table_small_2
datasets are used additionally. Figure 4 shows the
performance of the depth normalization compared to
the baseline performance (using five scales) and to the
performance using only the original scale (1.0).
The general observation from the three test sets
is that the mean performance does not suffer much
if depth normalization is applied. The performance
is even increased by 12% in the desk_2 dataset. Its
effects are different for the different objects. While
most improved, some suffer from a recall loss that is
comparable to using only one single scale. However,
the risk of a decreased detection performance is com-
pensated by the time savings (-77%).
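A minimal sketch of the idea (Python; the reference depth d_ref and the exact normalization used in the adapted C++ code are assumptions of this sketch): the pixel offsets of the binary test functions and of the stored voting offsets are scaled inversely with the depth of the current patch, so a single Hough space at the original scale suffices.

```python
def normalized_offset(offset, patch_depth, ref_depth=1.0):
    """Scale a pixel offset learnt at ref_depth to the depth of the current patch.

    Far-away patches get smaller offsets, close ones larger, reflecting the
    inverse proportionality of apparent size and distance.
    """
    factor = ref_depth / max(patch_depth, 1e-6)
    return offset[0] * factor, offset[1] * factor

def binary_test(channel, cx, cy, p1, p2, tau, patch_depth, ref_depth=1.0):
    """Depth-normalized Hough Forest split test: compare the feature channel
    at two offsets from the patch centre against a threshold tau."""
    dx1, dy1 = normalized_offset(p1, patch_depth, ref_depth)
    dx2, dy2 = normalized_offset(p2, patch_depth, ref_depth)
    return (channel[int(cy + dy1), int(cx + dx1)]
            - channel[int(cy + dy2), int(cx + dx2)]) < tau
```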
5.1.3 Patch Offset
The detection mechanism of the Hough Forest can
be regarded as a classical sliding window approach,
in which every region of the query image is investi-
gated (in contrast to, for instance, interest point based
methods, where only a subsample of the whole im-
age is examined deeply). Since overlapping image
regions share information, it is redundant to exam-
ine two highly overlapping patches. The original im-
plementation of the Hough Forest does not take this
into consideration, but visits every single patch. Fig-
ure 5 shows the impact of different offsets on detec-
tion performance and time complexity by illustrating
the relationship between time saving, offset and per-
formance decrease, measured percental in compari-
son to the original offset of one pixel. A window off-
set of 2 already reduces the time complexity of the
classification and voting (not of the whole pipeline)
by about 70% while suffering a loss in performance of
only 1%. The influence on the whole pipeline heavily
depends on the scale space and image resolution. With
the original pipeline and an offset of two pixels, about
20 s are saved (-40%).
Figure 5: Performance and time saving for different sliding window offsets. A larger offset decreases performance exponentially, but time saving only logarithmically.
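The modification itself is small; a sketch of the patch loop with a configurable offset (Python; names are placeholders):

```python
def patch_positions(height, width, patch=16, offset=2):
    """Top-left corners of the patches to classify and vote with.

    offset=1 reproduces the original dense evaluation; offset=2 visits only
    every second row and column, i.e. roughly a quarter of the patches.
    """
    for y in range(0, height - patch + 1, offset):
        for x in range(0, width - patch + 1, offset):
            yield x, y
```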
5.2 Parallel Processing
5.2.1 GPU Algorithms
Although the computation of the selected features is
very fast compared to other parts of the (original)
pipeline, their influence grows with the measures of
Section 5.1. All feature operators are replaced by
available GPU implementations if they did not de-
grade the overall detection performance. The mini-
mum and maximum filtration is replaced by equiva-
lent GPU erosion and dilation implementations.
With these changes, the feature computation is
35% faster, saving approximately 2 seconds from the
whole pipeline (-4%). The extraction of object hy-
potheses from the Hough space is ported to the GPU,
which further reduces the execution time of the whole
pipeline by another 4 seconds (-7%).
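The equivalence exploited for the min/max filtration is that grayscale erosion and dilation with a flat 5 × 5 structuring element are exactly the local minimum and maximum filters; a CPU sketch (Python/OpenCV) of the operation that is moved to the GPU:

```python
import numpy as np
import cv2

KERNEL = np.ones((5, 5), np.uint8)  # flat 5 x 5 structuring element

def min_max_filter(channel):
    """5 x 5 minimum/maximum filtration of a feature channel.

    Grayscale erosion yields the local minimum and dilation the local maximum,
    which is why GPU erosion/dilation kernels can replace the original filter.
    """
    return cv2.erode(channel, KERNEL), cv2.dilate(channel, KERNEL)
```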
5.2.2 Multithreading
The target system’s multicore architecture is utilized
which allows parallel processing for most parts of the
detection pipeline. The run time decreases almost lin-
early with the number of cores. With five parallel
threads (and none of the other changes of this sec-
tion active) the detection finishes after approximately
11 seconds (-75%).
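A sketch of the same decomposition (Python; the per-tree splitting and the `tree.vote` interface are assumptions, and a pure-Python thread pool is limited by the interpreter lock, whereas the original C++ code threads natively):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def accumulate_votes(forest, patches, hough_shape, workers=5):
    """Cast the votes of each tree in a separate worker and sum the partial maps."""
    def votes_of(tree):
        hough = np.zeros(hough_shape, np.float32)
        for p in patches:
            for (dx, dy), weight in tree.vote(p):        # leaf votes (placeholder API)
                y, x = int(p.y + dy), int(p.x + dx)
                if 0 <= y < hough_shape[0] and 0 <= x < hough_shape[1]:
                    hough[y, x] += weight
        return hough
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(votes_of, forest))
```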
6 FINAL EVALUATION
The presented methods to increase performance (e.g.
different depth features) and decrease computation
time (e.g. depth normalization and subsampling) are
not independent of each other. The best combination
of the evaluated measures was tested empirically. The
final solution uses the feature vector of Section 4, re-
sizes the input images by a factor of 0.7, applies
depth normalization, and uses GPU calls as well as
multithreading. All parameters concerning spatial re-
lations were adapted according to the resizing opera-
tion, while the patch size was set to 10 × 10 pixels.
Figure 6: Final results (mean p/r AUC and SEM) of HF, LINEMOD, and DHF on the test sets desk_1, desk_2, desk_3, table_1, table_small_1, and table_small_2.
A direct comparison to many of the state-of-the-
art methods (such as (Lai et al., 2011a; Bo et al.,
2013; Lai et al., 2011b)) is difficult, since most of
them fail to explicitly state their evaluation setup and
performance measure, to clarify the exact train and
test data, or to publish their software. In this section,
the proposed enhanced Hough Forest is compared to
its original RGB version (Gall et al., 2012) and to
LINEMOD (Hinterstoisser et al., 2011).
The general evaluation setup of Section 4 is now
used with all six test sets. LINEMOD is trained with
the same input data as the other methods (apart from
background data). The individual training images are
filtered with the foreground masks provided by the
database. The parameters of LINEMOD have been
optimized empirically for a fair comparison.
Figure 6 shows that the enhanced Hough For-
est outperforms both reference methods by far.
LINEMOD produces a massive amount of false posi-
tives, which result in a low precision for most objects,
but also the recall statistics show major disadvantages
compared to the enhanced Hough Forest.
In terms of execution time, LINEMOD falls be-
hind as well. The multi-scale variant used needs ap-
proximately four seconds for each object, whereas the
optimized Hough Forest takes two seconds in total
and scales sub-linearly with the number of objects.
7 CONCLUSION AND FUTURE WORK
The original implementation of the Hough Forest
from (Gall et al., 2012) is enhanced with depth data
to improve both its object detection performance and
its run time. By exploiting fast and descriptive
depth features, data reduction, as well as parallel pro-
cessing, the final implementation runs in near real-
time and can compete with state-of-the-art methods.
The outputs of the current detector are two-
dimensional axis-aligned bounding boxes in the im-
age coordinate system. To retrieve the full six degree
of freedom pose, the next step is to extract the region
in the point cloud that corresponds to the bounding
box and run ICP (Rusinkiewicz and Levoy, 2001) be-
tween a (learnt) 3D model of the object and the point
cloud region. Time complexity and result of ICP are
often improved by providing a good initial transfor-
mation, which could be generated by the Hough For-
est. To this end, the 2D Hough voting scheme needs to
be extended to 3D, in which object hypotheses are ac-
cumulated in real-world rather than image coordinates.
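Extracting the point cloud region behind a detected bounding box amounts to back-projecting its depth pixels with the pinhole model; a sketch (Python/NumPy; fx, fy, cx, cy are the assumed camera intrinsics of the RGBD sensor):

```python
import numpy as np

def box_to_points(depth, box, fx, fy, cx, cy):
    """Back-project the depth pixels inside a 2D box (x1, y1, x2, y2) into 3D.

    depth is given in metres; returns an (N, 3) array of camera-frame points
    that can be passed to ICP together with the learnt object model.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    ys, xs = np.mgrid[y1:y2, x1:x2]
    z = depth[y1:y2, x1:x2]
    valid = z > 0                          # skip pixels without a depth measurement
    x = (xs[valid] - cx) * z[valid] / fx   # pinhole back-projection
    y = (ys[valid] - cy) * z[valid] / fy
    return np.stack([x, y, z[valid]], axis=1)
```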
REFERENCES
Arbeiter, G., Fuchs, S., Bormann, R., Fischer, J., and Verl,
A. (2012). Evaluation of 3d feature descriptors for
classification of surface geometries in point clouds. In
IROS 2012, pages 1644–1650.
Badami, I., Stückler, J., and Behnke, S. (2013). Depth-
enhanced hough forests for object-class detection and
continuous pose estimation. In SPME 2013, pages
1168–1174.
Ballard, D. (1981). Generalizing the hough transform to de-
tect arbitrary shapes. Pattern Recognition, 13(2):111–
122.
Bo, L., Ren, X., and Fox, D. (2013). Unsupervised feature
learning for rgb-d based object recognition. In Inter-
national Symposium on Experimental Robotics, pages
387–402.
Bo, L., Ren, X., and Fox, D. (2014). Learning hierarchical
sparse features for RGB-(D) object recognition. I. J.
Robotics Res., 33(4):581–599.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013).
Indoor semantic segmentation using depth informa-
tion. CoRR.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y.
(2013). Learning hierarchical features for scene la-
beling. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 35(8):1915–1929.
Gall, J., Razavi, N., and Van Gool, L. (2012). An introduc-
tion to random forests for multi-class object detection.
In Outdoor and Large-Scale Real-World Scene Anal-
ysis, pages 243–263.
Gupta, S., Girshick, R., Arbelaez, P., and Malik, J. (2014).
Learning rich features from RGB-D images for object
detection and segmentation. In ECCV 2014.
Hinterstoisser, S., Holzer, S., Cagniart, C., Ilic, S., Kono-
lige, K., Navab, N., and Lepetit, V. (2011). Multi-
modal templates for real-time detection of texture-less
objects in heavily cluttered scenes. In ICCV 2011,
pages 858–865.
Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., and Navab,
N. (2010). Dominant orientation templates for real-
time detection of texture-less objects. In CVPR 2010,
pages 2257–2264.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski,
G., Konolige, K., and Navab, N. (2013). Model based
training, detection and pose estimation of texture-less
3d objects in heavily cluttered scenes. In ACCV 2012,
pages 548–562.
Janoch, A., Karayev, S., Jia, Y., Barron, J., Fritz, M.,
Saenko, K., and Darrell, T. (2011). A category-level
3-d object dataset: Putting the kinect to work. In ICCV
2011, pages 1168–1174.
Knopp, J., Prasad, M., Willems, G., Timofte, R., and
Van Gool, L. (2010). Hough transform and 3d surf
for robust three dimensional classification. In ECCV
2010, pages 589–602.
Lai, K., Bo, L., Ren, X., and Fox, D. (2011a). A large-scale
hierarchical multi-view rgb-d object dataset. In ICRA,
pages 1817–1824. IEEE.
Lai, K., Bo, L., Ren, X., and Fox, D. (2011b). A scalable
tree-based approach for joint object and pose recogni-
tion. In AAAI 2011.
Leibe, B., Leonardis, A., and Schiele, B. (2006). An im-
plicit shape model for combined object categorization
and segmentation. In Toward Category-Level Object
Recognition, pages 508–524.
Mörwald, T., Prankl, J., Richtsfeld, A., Zillich, M., and
Vincze, M. (2010). BLORT - The Blocks World
Robotic Vision Toolbox. In Best Practice in 3D Per-
ception and Modeling for Mobile Manipulation (in
conjunction with ICRA 2010).
Rios-Cabrera, R. and Tuytelaars, T. (2013). Discrimina-
tively trained templates for 3d object detection: A real
time scalable approach. In ICCV 2013, pages 2048–
2055.
Rusinkiewicz, S. and Levoy, M. (2001). Efficient variants
of the icp algorithm. In International Conference on
3-D Digital Imaging and Modeling.
Rusu, R., Blodow, N., and Beetz, M. (2009). Fast point
feature histograms (fpfh) for 3d registration. In ICRA
2009, pages 3212–3217.
Rusu, R., Bradski, G., Thibaux, R., and Hsu, J. (2010). Fast
3d recognition and pose using the viewpoint feature
histogram. In IROS 2010, pages 2155–2162.
Tang, S., Wang, X., Lv, X., Han, T. X., Keller, J., He,
Z., Skubic, M., and Lao, S. (2013). Histogram of
oriented normal vectors for object recognition with a
depth sensor. In ACCV 2012, pages 525–538.
Tombari, F. and Di Stefano, L. (2010). Object recognition in
3d scenes with occlusions and clutter by hough voting.
In PSIVT 2010, pages 349–355.
Vergnaud, D. (2011). Efficient and secure generalized
pattern matching via fast fourier transform. In
AFRICACRYPT 2011, pages 41–58, Berlin, Heidel-
berg. Springer-Verlag.
Wang, W., Chen, L., Chen, D., Li, S., and Kuhnlenz, K.
(2013). Fast object recognition and 6d pose estimation
using viewpoint oriented color-shape histogram. In
ICME, pages 1–6.