Leveraging the Spatial Label Structure for Semantic Image Labeling using Random Forests
Manuel Wöllhaf, Ronny Hänsch and Olaf Hellwich
Computer Vision & Remote Sensing, Technische Universität Berlin, Berlin, Germany
Keywords: Random Forests, Semantic Segmentation, Structured Prediction, Context Information.
Abstract: Data used to train models for semantic segmentation have the same spatial structure as the image data, are mostly densely labeled, and thus contain contextual information such as class geometry and co-occurrence. We aim to exploit this information for structured prediction. Multiple structured label spaces, representing different aspects of context information, are introduced and integrated into the Random Forest framework. The main advantage is the resulting structural subclasses, which carry information about the context of a data point. The output of the applied classification forest is a decomposable posterior probability distribution, which allows substituting the prior by information carried by these subclasses. The experimental evaluation shows results superior to standard Random Forests as well as to a related method of structured prediction.
1 INTRODUCTION
Contextual information plays a major role within the human vision system (Hock et al., 1974; Biederman et al., 1982) and enhances results in a variety of computer vision tasks. However, the specific role of context in image understanding and how to embed contextual information in corresponding methods is still an open research question. We aim to improve semantic segmentation results using semantic context information and to gain insights into how this information contributes to the learning and inference process. To gather these semantic contextual relations, we leverage the spatial structure of pixelwise labeled training data.
The basis of most machine learning approaches to semantic segmentation is a sliding window classification. These classifiers usually use features that expose textural context information but do not attempt to induce a meaningful structure in the output space in an explicit manner. Recent work tackles this mostly by topping the output of the classifier with a second model that captures the structural information (Shotton et al., 2006; Mottaghi et al., 2014). These additional processing modules are probabilistic graphical models which represent the spatial relationship of classes either as a Markov Random Field (MRF) or as the pairwise potential of a Conditional Random Field (CRF) (He et al., 2004). Using such a sophisticated model allows learning the structure of the output space in every conceivable detail, including smoothness assumptions, class context, or geometrical relations and location priors. Inference on these models is NP-hard in general and requires algorithms that find approximate solutions, such as spatially limited inference (Nowozin and Lampert, 2011). Our work provides an alternative approach that integrates classification and structured prediction in a single learner.
This is accomplished by employing a Random Forest (RF) that allows combining both concepts in an intuitive and comprehensible way. In contrast to our work, most current work is based on deep learning in the form of convolutional networks (ConvNets) (Long et al., 2015; Lin et al., 2016). Their astonishing performance is rooted in successive transformations of the input data into feature spaces with increasing abstraction and meaningfulness. In terms of deep learning, the ability to learn the parameters of a split function adds one layer of feature transformation and makes RFs a rather shallow learner with a depth of two, while ConvNets usually consist of ten or more layers. However, this large number of layers makes huge amounts of training data necessary and requires extraordinary computing capabilities, whereas our work aims to improve segmentation results through a more efficient use of training data. Besides this, the natural handling of multi-class problems, the probabilistic output, and their almost ideal statistical properties (Hastie et al., 2009) make RFs an attractive method for semantic segmentation.
Figure 1: Application of context information on different spatial scales. From left to right: Image and ground truth, local,
small-scale, large-scale, global, and combined.
To use contextual information in a way that allows efficient learning and inference, our work uses domain-specific assumptions about relevant spatial scales and the corresponding types of context. Label information is categorized into four spatial classes: local, small-scale, large-scale, and global (Fig. 1). Our work evaluates if and how these types of information can be predicted from local appearance to allow their integration into a simple patch-based semantic segmentation method. All categories are evaluated separately and are subsequently integrated into one model. The details of the proposed method are explained in Section 3. The first kind of information, called local label information, is the atomic class label that corresponds to an image patch. The second, the small-scale information, is harnessed using a method introduced in (Kontschieder et al., 2011; Kontschieder et al., 2014). It uses label patches centered at the same point as the image patch and represents regional class geometry and regional class co-occurrences. This regional information does not capture relationships between distant object parts. To incorporate the relations of image regions on an object level, which are referred to as large-scale information, object shape is modeled using the implicit shape model (ISM) from (Leibe et al., 2004). The Generalized Hough transform, which is part of the ISM, has been integrated into RFs in a series of publications (Gall and Lempitsky, 2009; Gall et al., 2011; Kontschieder et al., 2012; Gall et al., 2012; Kontschieder et al., 2014). Since RFs allow combining classification and regression, these so-called Hough Forests are often utilized to approach combined classification and detection tasks. Our work extends this concept and uses the detector activations of Hough Forests to refine the segmentation. Object detection as an intermediate step to refine semantic segmentation has already proven to be successful. One example is the usage of detector outputs in (Ladický et al., 2010) as additional potentials for a CRF to refine a semantic segmentation and to allow differentiation between object instances. In (Yang et al., 2012) the generative model from (Felzenszwalb et al., 2010) is used to improve segmentations with the aid of detector activations. The works in (Gu et al., 2009; Arbeláez et al., 2012) use region-based object detectors and combine the region proposals into a semantic segmentation using the generated object hypotheses. Our work integrates a discriminative approach to object detection and semantic segmentation. Already (Leibe et al., 2004) does not only introduce the ISM, but also uses the predicted object hypotheses for a figure-ground segmentation. Both of these parts are adopted in our work and extended for the use in multi-class semantic segmentation. However, the closed probabilistic formulation for the segmentation in (Leibe et al., 2004) is limited to pixels that were involved in the voting process of the Hough transform. We propose an alternative probabilistic method that allows propagating the evidence given by object hypotheses into a convex hull formed by the voters of a hypothesis. On a fourth spatial scale, global context is introduced by learning the statistical relation between local appearance and the global, image-wide class distribution.
As it is possible to train a Random Forest model on multiple label spaces, the model allows multiple simultaneous predictions, each representing an aspect of contextual information corresponding to one of the above-mentioned categories. The different predictions are combined using the maximum a posteriori formulation of the classification problem:

\hat{\mathbf{y}} = f(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y}} \; p(\mathbf{x}|\mathbf{y})\, p(\mathbf{y})    (1)

The posterior contains semantic context as the joint probabilities of the label variables \mathbf{y} = (y_1, \ldots, y_n). Hence, likelihood and prior of the Bayesian decomposition of the posterior can be interpreted as appearance and context as in (Tu, 2008). Random Forests allow decomposing the posterior and adopting this view.
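To make this decomposition concrete, the following minimal NumPy sketch divides a forest posterior by the training prior to obtain a likelihood-proportional term and multiplies in a context prior. The function name and data layout are our own illustration, not code from the paper:

```python
import numpy as np

def substitute_prior(posterior, train_prior, context_prior):
    """Swap the training-set prior baked into a RF posterior for a
    context prior, using p(y|x) ~ p(x|y) p(y). Illustrative sketch.

    posterior:     (n_classes,) forest estimate of p(y|x)
    train_prior:   (n_classes,) class frequencies of the training set
    context_prior: (n_classes,) prior inferred from context information
    """
    likelihood = posterior / np.maximum(train_prior, 1e-12)  # ~ p(x|y)
    combined = likelihood * context_prior                    # ~ p(x|y) p(y)
    return combined / combined.sum()                         # renormalize

# toy example: a flat posterior sharpened by a context prior
p = substitute_prior(np.array([0.4, 0.3, 0.3]),
                     np.array([0.5, 0.25, 0.25]),
                     np.array([0.1, 0.6, 0.3]))
```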
Summarizing the contributions of this work:
- We incorporate the prediction of large-scale and global context information into the Random Forest framework.
- A comparison of small-scale, large-scale, and global context information shows similar performance improvements for all evaluated spatial scales.
- Combining predictions on different spatial scales leads to a significant improvement compared to the reference method.
2 RANDOM FORESTS
Random Forests (RFs) are ensembles of decision trees, i.e. hierarchical structures of leaf and split nodes, used for classification and regression (Breiman, 2001). While the leaves contain the actual predictions, the split nodes are binary test functions f_θ with parameter vector θ ∈ T, propagating a data point x ∈ X to one of two sub-trees:

f_\theta(x) = \begin{cases} 1, & \text{if } \varphi(x) \geq \tau \\ 0, & \text{otherwise} \end{cases}    (2)

Here τ is a threshold value. A common choice for φ(x) is the (absolute) difference of two pixels around the pixel coordinates u:

\varphi(x) = |\mathbf{x}_c(u + b) - \mathbf{x}_c(u + a)|    (3)

which is an approximation of the gradient on the connecting line between a and b and thus a simple edge detector. In this example, the additional parameter c describes the color channel of the image \mathbf{x}, resulting in θ = (a, b, c, τ). For this work, φ is drawn randomly from a set of four different functions (see the supplementary material at http://rhaensch.de/structuredRF.html).
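A minimal sketch of such a split test, assuming images stored as NumPy arrays with the channel as the last axis; the sampling ranges for the parameters are illustrative assumptions, not values from the paper:

```python
import numpy as np

def edge_split(x, u, theta):
    """Binary split test of Eq. 2 with the pixel-difference feature of
    Eq. 3 (sketch; parameter names and layout are ours).

    x:     image as (H, W, C) array
    u:     (row, col) patch center
    theta: (a, b, c, tau) with pixel offsets a, b, channel c, threshold tau
    """
    a, b, c, tau = theta
    phi = abs(float(x[u[0] + b[0], u[1] + b[1], c])
              - float(x[u[0] + a[0], u[1] + a[1], c]))
    return 1 if phi >= tau else 0

# random parameter sampling, as used to create split candidates
rng = np.random.default_rng(0)
theta = ((rng.integers(-10, 11), rng.integers(-10, 11)),   # offset a
         (rng.integers(-10, 11), rng.integers(-10, 11)),   # offset b
         rng.integers(0, 3),                               # channel c
         rng.uniform(0, 50))                               # threshold tau
```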
To generate a tree, the training data is split successively. Each split is chosen such that the resulting subsets S_0 and S_1 are as pure as possible regarding the class label. This purity is measured with an objective function such as the information gain I_H using the Shannon entropy H:

I_H = H(S) - \sum_{i \in \{0,1\}} \frac{|S_i|}{|S|} \, H(S_i)    (4)
For an optimal split at node k, the second term of the information gain, the sum of the weighted entropies of the resulting subsets, must be minimized. While an entropy-based objective function for regression is possible, regression forests usually choose a split that minimizes the variance within the resulting subsets, for simplicity and to lower the computational costs (Criminisi et al., 2012; Gall et al., 2012). The objective function for the regression label spaces is defined as

\theta_k = \operatorname*{argmin}_{\theta \in T} \sum_{i \in \{0,1\}} \sum_{y \in Y} \sum_{d \in S_i^y} (d - \bar{d})^2    (5)

where Y is the set of labels and \bar{d} the mean of the target variable for regression (e.g. offset vectors) in S_i^y. The normalization with the sample size usually found in the variance equation is missing since the subsets are weighted with their size to avoid unbalanced splits. As it is not necessary to reduce the variance between data points belonging to different classes, this formula is extended to consider only the intraclass variance for forests in which classification and regression are combined, as in (Gall et al., 2011). The use of variance minimization implies a unimodal data distribution. This assumption is often invalid, but as in most works this objective is chosen in the absence of a computationally tractable alternative for multimodal distributions.
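The two objectives could be sketched as follows in NumPy; the data layout (integer class ids, regression targets as rows) is our assumption:

```python
import numpy as np

def information_gain(labels, mask):
    """Eq. 4: entropy decrease of splitting integer `labels` by the
    boolean candidate split `mask` (sketch)."""
    def H(ls):
        if ls.size == 0:
            return 0.0
        p = np.bincount(ls) / ls.size
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    left, right = labels[mask], labels[~mask]
    n = labels.size
    return H(labels) - (left.size / n) * H(left) - (right.size / n) * H(right)

def regression_objective(offsets, labels, mask):
    """Eq. 5: size-weighted intraclass sum of squared deviations of the
    regression targets (e.g. offset vectors) in both subsets (sketch)."""
    cost = 0.0
    for side in (mask, ~mask):
        for y in np.unique(labels[side]):
            d = offsets[side][labels[side] == y]
            cost += np.sum((d - d.mean(axis=0)) ** 2)
    return cost  # the best split candidate minimizes this value
```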
A leaf node is created if less than a certain number of samples is left in the subset, the maximum tree depth is reached, or the subset is pure. Each leaf stores a representation of the remaining data samples. Forests used for semantic segmentation usually store the class frequency of the subset. As an approximation of the posterior, this distribution includes the prior distribution of the training data according to Bayes' theorem. To allow training the model with unbalanced data, the distribution is re-balanced with the reciprocal prior distribution |S|/|S_y|.
All trees are trained independently and their output is averaged for inference. The single trees are randomized by choosing the optimal split only from a relatively small set of split candidates created by randomly sampling split parameters θ. This procedure leads to a very efficient training on high-dimensional data.
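A small sketch of the leaf re-balancing and the ensemble averaging, assuming integer class labels; the helper names are ours:

```python
import numpy as np

def leaf_posterior(leaf_labels, train_class_counts):
    """Class histogram of a leaf, re-balanced by the reciprocal training
    prior |S|/|S_y| to undo unbalanced sampling (sketch)."""
    n_classes = len(train_class_counts)
    hist = np.bincount(leaf_labels, minlength=n_classes).astype(float)
    hist *= train_class_counts.sum() / np.maximum(train_class_counts, 1)
    return hist / hist.sum()

def forest_posterior(tree_posteriors):
    """Forest prediction: average of the per-tree outputs."""
    return np.mean(tree_posteriors, axis=0)
```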
3 METHOD
This work aims to exploit topological information from the label space with the aid of Random Forests (RFs). Therefore, information on the local, small-scale, large-scale, and global level is incorporated into different structured label spaces, used as split criteria for the tree nodes, stored in the leaf nodes, and combined into a consistent pixelwise class prediction. A detailed description of implementation decisions and parameter choices can be found in the supplementary material.
Structured Labels in RFs. The first step of the proposed approach is to define a representation of the context information contained in the structure of the pixelwise labeled data. Local information is the atomic label y = (u, \mathbf{y}) corresponding to the center of a data point (image patch) x = (u, r, \mathbf{x}). Here u is the patch center, r the patch shape, \mathbf{x} the image, and \mathbf{y} the label image. This label representation does not contain any topological information. Small-scale information denotes a small region r_y in \mathbf{y} around u corresponding to a data point x (Fig. 2, middle).
Figure 2: Training data and structured labels. (left) Image with emphasized image patch. (middle) Label image with emphasized small-scale and large-scale information. (right) Global context label (class occurrence histogram over void, sky, grass, tree, cow).
Note that the region r_y does not necessarily have the same size as the region r_x, the image patch in \mathbf{x}, but is centered at the same position u. This label type was introduced in (Kontschieder et al., 2011) and (Kontschieder et al., 2014). As labels on the large-scale level, the well-known Hough features are deployed (Leibe et al., 2004) (Fig. 2, middle). Here, a Hough feature is an offset vector d describing the displacement between u and the object center, and thus depicts the geometrical structure of the label information on an object level. Global, image-level information is incorporated by assigning the same label to every image patch, namely the image-wide class distribution (Fig. 2, right).
Training Objective. RFs can be utilized for classification and regression. A single RF can be trained using multiple different label spaces from both problem classes by randomly selecting a label space in each split node. The split nodes split the data set using an objective function appropriate to the selected label space. The concept from Eq. 4 can be directly applied to the local labels. The small-scale label is handled by randomly choosing two atomic labels for each split node and separating the data by maximizing the joint information gain. As suggested in (Kontschieder et al., 2011), choosing the center pixel as one of the compared pixels ensures the separation of the data by local information. Large-scale and global labels are handled as a regression problem. The histograms for the global labels are therefore interpreted as data points with a dimensionality equal to the number of classes.
Leaf Nodes. To leverage the topological information in the label space, a representation of the structured labels which are used to split the data sets must also be preserved. Of the small-scale label patches which end up in a leaf node, Kontschieder et al. keep only the one that represents the set of labels best. The best representative is determined with an approximation of the joint probability assuming independence between pixels. This approach keeps the memory consumption of the tree acceptable despite the high-dimensional label. This work adopts this method. The large-scale information is preserved in the leaf nodes as a non-parametric spatial density distribution p(v|x). A sparse representation of the occurring coordinates (non-zero densities) describes the distribution in a simple and detailed way, even if the distribution is multimodal. The global label is an additional histogram which is the average of the histograms of class densities. Output distributions of different trees are combined by averaging. The predictions of the small-scale labels are unified over multiple trees by maximization of the joint probability.
Prediction from Structured Labels. After a RF instance is trained on one or more label spaces, the predictions ŷ about class distributions and label structure must be fused into a meaningful result. This work evaluates several prediction processes based on different label types. The normalized local class distribution, inferred from the prediction based on the local label, is referred to as p_lh, as it is proportional to the likelihood term of Bayes' formula. Small-scale, large-scale, and global information are used to generate distributions which are supposed to enforce a regional resp. global compliance with the predicted structural properties of the image patch. The large-scale and global distributions are combined with p_lh in the hope of achieving a posterior distribution which expresses a per-pixel class prediction consistent with the large-scale and global structure of the image. As the small-scale label already contains local information, there is no need for a combination with a distribution generated from local information.
The small-scale information is transformed into a position-dependent distribution p(y|u) by fusing predictions from neighboring pixels to encourage a regional consensus. The neighborhood is defined as the set of pixels in the region r_y^u, which is the label patch centered at u. The distribution at u generated from the small-scale labels is a voting of all labels corresponding to pixels in r_y^u (a procedure referred to as the simple fusion process in (Kontschieder et al., 2011)):

p(y|u) = \frac{1}{|r_y^u|} \sum_{v \in r_y^u} [\,\hat{y}(v,u) = y\,]    (6)
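A plain-loop sketch of this voting (Eq. 6), assuming the per-pixel label-patch predictions are stored in a dense array; a real implementation would vectorize the scatter:

```python
import numpy as np

def simple_fusion(patch_preds, n_classes):
    """Eq. 6: per-pixel class voting over all predicted label patches
    covering a position (the 'simple fusion' process of Kontschieder
    et al.; this plain-loop variant and its data layout are ours).

    patch_preds: (H, W, k, k) integer array; patch_preds[v] is the
                 k x k label patch predicted at pixel v.
    """
    H, W, k, _ = patch_preds.shape
    r = k // 2
    votes = np.zeros((H, W, n_classes))
    for vi in range(H):
        for vj in range(W):
            for i in range(k):
                for j in range(k):
                    ui, uj = vi + i - r, vj + j - r  # pixel this entry refers to
                    if 0 <= ui < H and 0 <= uj < W:
                        votes[ui, uj, patch_preds[vi, vj, i, j]] += 1
    return votes / np.maximum(votes.sum(axis=2, keepdims=True), 1)
```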
The attempt to reach a consistent prediction on the large-scale level is based on the idea that patches which belong to the same object should make a mostly coinciding prediction about the object's center position. Therefore, an estimate about the position of the object center is collected for all patches of an image. The object hypotheses that the most patches agree on are used to encourage a classification of the single patches which conforms to these hypotheses. Note that the actual position of a hypothesis is not important, as it is not used for object detection or objectwise segmentation. For the generation of the large-scale prior, two different methods are evaluated and compared. First, the forest is trained on the atomic and the Hough labels and thus associates a class distribution p(y|x) and a spatial distribution p(v|x) with each image patch x. The joint distribution p(y,v|x) of both describes the hypothesis of the position v of an object center and the corresponding object class y. These distributions are used for a voting in a Hough space, for which the hypotheses of all image patches are summed up:

\mathcal{H}(y,v) = \sum_{x \in \mathbf{x}} p(y,v|x) = \sum_{x \in \mathbf{x}} p(y|x)\, p(v|x)    (7)

The n most prominent maxima h_{1,\ldots,n} in the voting space are selected using the maximum of the class distribution for each pixel position. Given this list of hypotheses, all voters V_{1,\ldots,n} that voted for one of the peaks are identified (see supplementary material).
The first of the two evaluated methods generates a convex hull from these voters for each hypothesis and assigns the average local class distribution of the voters to each pixel within the hull:

p_{pr}(y|h_l,u) = \frac{1}{|V_l|} \sum_{v \in V_l} p_{lh}(v)    (8)

Pixels outside the convex hull are treated as having a uniformly distributed class prior. A weighted and re-normalized sum of these distributions results in the final prior distribution:

p_{pr}(y|u) = \frac{1}{\sum_{l=1}^{n} \mathcal{H}(h_l)} \sum_{l=1}^{n} p_{pr}(y|h_l,u)\, \mathcal{H}(h_l)    (9)
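A sketch of the per-hypothesis convex-hull prior of Eq. 8, using a Delaunay triangulation for the point-in-hull test; it assumes at least three non-collinear voters, and the data layout is ours:

```python
import numpy as np
from scipy.spatial import Delaunay

def convex_hull_prior(voters, p_lh, shape, n_classes):
    """Eq. 8: spread the average local class distribution of the voters
    of hypothesis h_l over the convex hull they span (sketch).

    voters: (m, 2) pixel coordinates of the voters V_l (m >= 3,
            not all collinear)
    p_lh:   (H, W, n_classes) local class distributions
    """
    mean_dist = p_lh[voters[:, 0], voters[:, 1]].mean(axis=0)  # Eq. 8
    prior = np.full(shape + (n_classes,), 1.0 / n_classes)     # uniform outside
    hull = Delaunay(voters.astype(np.float64))                 # triangulated hull
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.c_[ys.ravel(), xs.ravel()].astype(np.float64)
    inside = hull.find_simplex(pts) >= 0                       # point-in-hull test
    prior[inside.reshape(shape)] = mean_dist
    return prior  # the per-hypothesis priors are then averaged as in Eq. 9
```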
The second method is supposed to integrate small-scale and large-scale information and therefore adopts the probabilistic formulation from (Leibe et al., 2004) to combine the Generalized Hough Transformation and semantic segmentation. The implicit shape model (ISM) described in (Leibe et al., 2004) uses a codebook inferred from the training data for the Hough voting. Image patches trigger a number of codebook entries which pass votes into the voting space. Additionally, a set of segmentation masks is stored with the codebook entries. The segmentation masks implement the small-scale influence, i.e. p(y|h_l,u,x) reflects whether u lies within the mask associated with x. We adopt this part through a small-scale prediction, using a model that is additionally trained on a binarized small-scale label. As a binary matrix, it marks all pixels of \mathbf{y} in the region r that have the same class label as position u, and describes the spatial distribution of the class in the region.
For this method the prior is formulated as a distribution conditioned on an object hypothesis and marginalized over image patches x:

p_{pr}(y|h_l,u) = \sum_{x \in \mathbf{x}} p(y|h_l,u,x)\, p(x|h_l,u)    (10)

The first term describes the small-scale influence of the image patch x on the class distribution at pixel u. It is weighted with the contribution of the patch to the object hypothesis h_l. As only patches containing u have small-scale influence p(y|u,x) > 0, and only patches that voted for h_l have non-zero weight p(x|h_l) > 0, the sum reduces to the intersection of these subsets.
p_{pr}(y|h_l,u) = \sum_{x \in r_x^u} p(y|h_l,x)\, p(x|h_l)    (11)

p_{pr}(y|h_l,u) = \sum_{x \in r_x^u} p(y|h_l,x)\, \frac{p(h_l|x)\, p(x)}{p(h_l)}    (12)

Assuming a uniform distribution for the priors p(x) and p(h), one can substitute the term p(x|h_l) with p(h_l|x) = p(y_l,v_l|x) from Eq. 7. Finally, the priors generated for each hypothesis are combined as in Eq. 9.
The global probability density describing the image-wide class occurrence is independent of the image coordinates u. It is defined as the mean of the global labels estimated for the patches of an image \mathbf{x}. This prior encourages an image-wide consensus about which classes are likely to appear in the scene:

p_{pr}(y) = \frac{1}{|\mathbf{x}|} \sum_{v \in \mathbf{x}} \hat{y}(v)    (13)
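Since the global label stored in a leaf is simply a class histogram, the prior of Eq. 13 reduces to an average; a minimal sketch with assumed array shapes:

```python
import numpy as np

def global_prior(patch_histograms):
    """Eq. 13: image-wide class prior as the mean of the per-patch
    predictions of the global class distribution (sketch).

    patch_histograms: (n_patches, n_classes) predicted distributions
    """
    prior = patch_histograms.mean(axis=0)
    return prior / prior.sum()

# applied uniformly at every pixel, e.g. for p_lh of shape (H, W, C):
# posterior = p_lh * global_prior(hists)[None, None, :]
```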
4 EXPERIMENTS
The forests are trained and evaluated on the MSRCv2 (Shotton et al., 2006) data set, which contains 276 training, 59 validation, and 256 test images with 21 object classes. Images are fed into the RF in LAB color space and with nine additional HOG-like feature channels (see supplementary material). Due to the grid sampling strategy (5×5), the resulting set of data points is unbalanced. Besides a qualitative analysis, three metrics are listed for each experiment to evaluate the results.
Figure 3: Results for the (convex-hull) large-scale prior. (top-left) Original image. (top-right) Likelihood. (bottom-left) Ground truth. (bottom-right) Posterior.
Figure 4: Results for the global prior. (top-left) Original image. (top-right) Likelihood. (bottom-left) Global consensus prior. (bottom-right) Posterior.
These are the global recall (GR), the average recall (AR), and the average intersection over union, or average Jaccard index (AJ) (Everingham et al., 2010). The baseline for the experiments is a standard Random Forest model without modifications. The forest parameters are: number of trees T = 10, maximum tree depth D = 99, and minimum number of samples S_min = 5. All data points from the training set are used for the training of all trees (i.e. no bagging). We fixed the feature patch size for all experiments to 21×21 and the small-scale label patch size to 11×11.
As a baseline, in addition to the classification results based on the local label, the results for three posterior distributions are given: consensus (small-scale), consensus (global), and consensus (both). These three posterior distributions are generated with three uninformed prior distributions. They enforce a consensus in the classification result but incorporate no knowledge inferred from the training data. They are defined as an average of the local class distributions p_lh:

p_{pr}(y|u) = \frac{1}{|r_u|} \sum_{v \in r_u} p_{lh}(v)    (14)

For the consensus (small-scale) posterior, r_u is a region around the pixel coordinates u with the same size as the small-scale label patch, to enforce a consensus of the class distributions on this spatial scale. A global consensus is encouraged with the mean of all patchwise predictions throughout the image, r_u = \mathbf{x}. The combination of the consensus (small-scale) and consensus (global) priors is denoted with consensus (both). These posterior distributions are intended to allow a more meaningful interpretation than a comparison to an arbitrary CRF model.
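Both uninformed consensus priors of Eq. 14 amount to box filtering the local class distributions; a sketch using SciPy (the boundary handling is our choice, not specified by the paper):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def consensus_prior(p_lh, k=None):
    """Eq. 14: uninformed consensus prior as the average of the local
    class distributions in a region r_u (sketch).

    p_lh: (H, W, n_classes) local class distributions
    k:    region size; None averages over the whole image (global).
    """
    if k is None:  # consensus (global): r_u covers the entire image
        return np.broadcast_to(p_lh.mean(axis=(0, 1)), p_lh.shape)
    # consensus (small-scale): box filter per class channel
    return uniform_filter(p_lh, size=(k, k, 1))
```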
Results. The results for the integration of small-scale information are very similar to those published in (Kontschieder et al., 2014). They are slightly better than the results achieved with the uninformed small-scale consensus prior (Table 1). Note that this comparison concerns only the simple fusion process suggested by Kontschieder et al.
Large-scale information is exploited using two different approaches: convex hull and ISM. The first method surpasses the results achieved with the small-scale label (for the AR and AJ scores), the small-scale, and the global baseline priors. It even outperforms the combination of both baseline priors wrt. AR. Figure 3 emphasizes how a weak signal in the local-appearance-based classification can lead to a robust and correct classification of a region. This confirms previous findings that object detection can improve semantic segmentations and shows that the proposed method is effective. The second method leads to an improvement too, but cannot compete with the convex-hull-based method. One reason for the comparably low performance is the property of the approach to propagate the knowledge about object hypotheses only in a local region around the voters.
Table 1: Results for baseline, different spatial scales, and a combination of those.

                                    GR      AR      AJ
baseline
  likelihood                      57.06   39.73   27.86
  consensus (small-scale)         61.67   44.20   31.87
  consensus (global)              63.23   44.27   33.16
  consensus (both)                65.09   46.25   34.88
small-scale                       63.56   47.17   33.94
large-scale
  convex hull                     61.50   49.61   34.67
  ISM                             58.27   43.84   30.22
global                            63.83   49.69   35.71
combinations
  local & convex hull & global    65.81   54.09   39.37
  small & convex hull & global    66.94   57.86   40.77
The prior formulated using the patch-based predictions about the global class distribution outperforms the uninformed prior. Furthermore, it outperforms the results achieved by incorporating small-scale and large-scale information. This shows that it is possible to infer global image properties from local appearance and to use this knowledge to improve semantic segmentations. The qualitative analysis in Figure 4 shows significantly better results for the informed global prior compared to the uninformed global consensus prior.
The convex-hull-based large-scale and the global prior are combined and evaluated two times: on the basis of local information, and using small-scale information as a basis for the large-scale prior. The combination of large-scale and global information turns out to be hardly redundant. It leads to remarkable results, with a relative improvement of 40% for AJ compared to the standard model. Even the slight decrease of the GR score, comparing the baseline priors and the large-scale prior, is compensated by the combination of both information levels. An additional combination with small-scale information improves the results further, but at the cost of memory footprint and computational load.
5 CONCLUSION
We leverage the spatial label structure of densely labeled image data to support the learning and inference process of RFs for semantic segmentation. Different structured labels are introduced that exploit contextual information encoded at different spatial scales: small-scale, large-scale, and global. While the small-scale level is based on (Kontschieder et al., 2011; Kontschieder et al., 2014), the large-scale information is introduced by a Hough-voting-based object detector. This leads to enhanced segmentations compared to the baseline and performs on par with the reference method (Kontschieder et al., 2014). Similar results are achieved through the incorporation of global context information. A combination of both introduced methods and of all three spatial scales improves the results considerably, with a relative improvement of 40% for the average Jaccard index compared to the standard Random Forest model and 20% compared to the reference method. Our work shows how to harness the structure of the label space to integrate context information on different scales, and demonstrates that the potential of RFs is not yet exhausted.
Future Work. This work leaves multiple ways of counteracting the shortcomings of the ISM-based method for integrating large-scale information to future evaluation. One way is to increase the size of the used small-scale label. Another is to use a less restrictive voter identification process. Further development to refine this approach would be worthwhile because it has the potential to overcome the weak point of the convex-hull-based method: using the convex hull as the basis for the top-down distribution of the detector activations leads to an overestimation of the object size, i.e. false positive classifications of background classes as "thing" classes. A further way to improve the results would be to use a non-convex hull, i.e. a polygon. Additionally, both methods could profit from a substitution of the standard Hough voting, which needs the fine-tuning of many parameters, with the closed probabilistic formulation from (Barinova et al., 2012).
REFERENCES
Arbeláez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L., and Malik, J. (2012). Semantic segmentation using regions and parts. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3378–3385.
Barinova, O., Lempitsky, V., and Kohli, P. (2012). On detection of multiple object instances using Hough transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1773–1784.
Biederman, I., Mezzanotte, R. J., and Rabinowitz, J. C. (1982). Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012). De-
cision forests: A unified framework for classification,
regression, density estimation, manifold learning and
semi-supervised learning. Found. Trends. Comput.
Graph. Vis., 7(2):81–227.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn,
J., and Zisserman, A. (2010). The pascal visual ob-
ject classes (voc) challenge. International Journal of
Computer Vision, 88(2):303–338.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and
Ramanan, D. (2010). Object Detection with Discrimi-
natively Trained Part-Based Models. IEEE Transacti-
ons on Pattern Analysis and Machine Intelligence,
32(9):1627–1645.
Gall, J. and Lempitsky, V. (2009). Class-specific Hough forests for object detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1022–1029.
Gall, J., Razavi, N., and Gool, L. V. (2012). An Introduction
to Random Forests for Multi-class Object Detection.
In Outdoor and Large-Scale Real-World Scene Ana-
lysis, number 7474 in Lecture Notes in Computer
Science, pages 243–263. Springer Berlin Heidelberg.
Gall, J., Yao, A., Razavi, N., Van Gool, L., and Lem-
pitsky, V. (2011). Hough forests for object detection,
tracking, and action recognition. Pattern Analy-
sis and Machine Intelligence, IEEE Transactions on,
33(11):2188–2202.
Gu, C., Lim, J. J., Arbelaez, P., and Malik, J. (2009). Recog-
nition using regions. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on,
pages 1030–1037.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The
Elements of Statistical Learning. Springer Series in
Statistics. Springer New York, New York, NY.
He, X., Zemel, R. S., and Carreira-Perpinan, M. A. (2004).
Multiscale conditional random fields for image labe-
ling. In Proceedings of the 2004 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition, 2004. CVPR 2004, volume 2, pages II–
695–II–702 Vol.2.
Hock, H. S., Gordon, G. P., and Whitehurst, R. (1974). Con-
textual relations: The influence of familiarity, physi-
cal plausibility, and belongingness. Perception & Psy-
chophysics, 16(1):4–8.
Kontschieder, P., Bulò, S. R., Pelillo, M., and Bischof, H. (2014). Structured labels in random forests for semantic labelling and object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):2104–2116.
Kontschieder, P., Bulò, S. R., Criminisi, A., Kohli, P., Pelillo, M., and Bischof, H. (2012). Context-sensitive decision forests for object detection. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 431–439. Curran Associates, Inc.
Kontschieder, P., Rota Bulò, S., Bischof, H., and Pelillo, M. (2011). Structured class-labels in random forests for semantic image labelling. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2190–2197. IEEE.
Ladický, L., Sturgess, P., Alahari, K., Russell, C., and Torr, P. H. S. (2010). What, Where and How Many? Combining Object Detectors and CRFs. In Computer Vision – ECCV 2010, number 6314 in Lecture Notes in Computer Science, pages 424–437. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15561-1_31.
Leibe, B., Leonardis, A., and Schiele, B. (2004). Combined
object categorization and segmentation with an impli-
cit shape model. In Workshop on statistical learning
in computer vision, ECCV, volume 2, page 7.
Lin, G., Shen, C., van den Hengel, A., and Reid, I. (2016).
Efficient piecewise training of deep structured models
for semantic segmentation. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W.,
Fidler, S., Urtasun, R., and Yuille, A. (2014). The
Role of Context for Object Detection and Semantic
Segmentation in the Wild. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, 2014. CVPR 2014, pages 891–898.
Nowozin, S. and Lampert, C. H. (2011). Structured Learning and Prediction in Computer Vision. Found. Trends. Comput. Graph. Vis., 6(3–4):185–365.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. In Leonardis, A., Bischof, H., and Pinz, A., editors, Computer Vision – ECCV 2006, number 3951 in Lecture Notes in Computer Science, pages 1–15. Springer Berlin Heidelberg. DOI: 10.1007/11744023_1.
Tu, Z. (2008). Auto-context and its application to high-level
vision tasks. In IEEE Conference on Computer Vision
and Pattern Recognition, 2008. CVPR 2008, pages 1–
8.
Yang, Y., Hallman, S., Ramanan, D., and Fowlkes, C. C.
(2012). Layered Object Models for Image Segmen-
tation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(9):1731–1743.