Dense Semantic Stereo Labelling Architecture for In-Campus Navigation
Jorge Beltrán, Carlos Jaraquemada, Basam Musleh, Arturo de la Escalera and José María Armingol
Intelligent Systems Lab (LSI) Research Group, Universidad Carlos III de Madrid (UC3M), Leganés, Madrid, Spain
jbeltran@ing.uc3m.es, carlosborja.jaraquemada@uc3m.es, {bmusleh, escalera, armingol}@ing.uc3m.es
Keywords:
Dense Labelling, Semantic Labelling, Stereo Vision, Off-road Navigation, ROS.
Abstract:
Interest in autonomous vehicles has rapidly increased in the last few years, due to recent advances in the
field and the appearance of semi-autonomous solutions in the market. In order to reach fully autonomous
navigation, a precise understanding of the vehicle surroundings is required. This paper presents a novel ROS-
based architecture for stereo-vision-based semantic scene labelling. The objective is to provide the necessary
information to a path planner in order to perform autonomous navigation around the university campus. The
output of the algorithm contains the classification of the obstacles in the scene into four different categories:
traversable areas, garden, static obstacles, and pedestrians. Validation of the labelling method is accomplished
by means of a hand-labelled ground truth, generated from a stereo sequence captured in the university campus.
The experimental results show the high performance of the proposed approach.
1 INTRODUCTION
The interest in autonomous vehicles has grown significantly in the last 10 years due to the rapid development of the field and the arrival of the first semi-autonomous commercial solutions. As a consequence, both companies and the academic community are putting much effort into research in this field, driven by its potential advantages in a wide number of areas such as traffic management, road safety or mobility for disabled passengers.
However, this upcoming horizon may require
dealing with complex tasks such as localization, nav-
igation or inter-vehicle cooperation, depending on the
vehicle's desired level of automation.
A standardized scale for automation degrees is the
SAE International’s table (SAE On-Road Automated
Vehicle Standards Committee, 2014). It describes the
different automation levels, which range from 0 (no automation) to 5 (full automation). A human driver is considered to be in charge of monitoring the environment in the first three levels (0 to 2). The big turning point comes at level 3 (Conditional Automation), from which the system progressively replaces the driver's intervention as the automation level increases. Therefore, one of
the main challenges for automation involves acquir-
ing a detailed knowledge of the vehicle surroundings.
There are several methods for environment infor-
mation retrieval depending on the kind of sensor used. On
the one hand, those based on laser (Urmson et al., 2008) (Broggi et al., 2008) obtain very high-precision data, although they only provide distance information. Therefore, these approaches are usually suitable for detection tasks and map generation. On the contrary, they do not provide enough information about the environment to classify the different scene elements.
Figure 1: Autonomous vehicle iCab.
On the other hand, computer vision based sys-
tems obtain rich information of the vehicle surround-
ings at the expense of less precise distance measure-
ments. Concretely, stereo vision systems allow depth
estimation for all pixels in the image by computing
the disparity map. After this process, the uv-disparity
maps (Labayrade and Aubert, 2003) (Hu et al., 2005)
can be obtained and used to detect both
obstacles and free space within the scene (Bernini
et al., 2014), thus obtaining the so-called obstacles
map and free map (Soquet et al., 2007) (Guo et al.,
2009) (Musleh et al., 2011). In addition, there are
also many related works providing a more advanced
scene labelling, both for urban (Sengupta et al., 2012)
(Sengupta et al., 2013) (Long et al., 2015) and indoor
environments (Golodetz et al., 2015).
Apart from stereo rigs, monocular cameras are also
commonly used for scene understanding. Despite the
handicap of not having precise depth information out-
of-the-box, they are very suitable for classification
tasks as they provide rich information at low cost.
In contrast with stereo approaches, monocular al-
gorithms for scene classification and labelling do
not rely on previously segmented Regions of Interest
(ROIs) (Yao et al., 2012) (Mottaghi et al., 2014) (Ren
et al., 2015) (Arnab et al., 2016).
However, two main downsides are present in most of the methods mentioned above: the high hardware requirements needed to guarantee real-time execution, as in deep learning approaches, and the need for massive datasets to train the classifiers.
Regarding the availability issue of datasets, many re-
search laboratories are publicly releasing their own,
so there already exist some large annotated datasets
for scene labelling (Cordts et al., 2016).
Nevertheless, both the need for task-specific datasets and the time required by the annotation process represent a bottleneck for widening the application scope of this technology. As a result, recent work (Richter et al., 2016) takes advantage of video games' calls to the GPU interface to fetch labels for the pixels of the different objects in the scene, so that the labelling stage can be partially automated in order to easily build large datasets.
The main contribution of this work is a ROS-based architecture for dense image labelling able to obtain a rich understanding of the vehicle surroundings for autonomous navigation tasks. The presented approach takes advantage of stereo information for scene segmentation. The organization of the algorithm into loosely coupled stages makes the proposed architecture easy to extend, so that other classifiers can be integrated with little effort.
The rest of the paper is organized as follows. Next
section focuses on the architecture description. Sec-
tion 3 gives a description of the proposed algorithm.
Afterwards, Section 4 presents a novel database for
algorithm validation and a detailed description of the
experimental results. Finally, in Section 5, conclu-
sions and future work are presented.
2 ARCHITECTURE DESIGN
The proposed architecture has been designed to run
on our research platform called iCab (Hussein et al.,
2016), a vehicle for autonomous in-campus naviga-
tion (see Fig. 1). Therefore, it has been fully inte-
grated with the iCab framework previously developed
by the authors (Marín-Plaza et al., 2016). As a result,
the architecture is built on top of ROS (Robot Operat-
ing System) (Garage, 2010).
Figure 2: Architecture scheme.
The architecture scheme presented in this work is
shown in Fig. 2. As can be observed, it is composed of five different nodes corresponding to the stages of the algorithm. First of all, in the acquisition stage (camera), images are captured from the sensors both in colour and gray-scale. Afterwards, the stereo processing node receives the synchronized stereo images and builds the disparity map, so that depth information can be retrieved. Then, the uv-disparity maps are computed and used to calculate the obstacle and free maps. Later, the segmentation stage separates the different obstacles present in its input map into ROIs based on their disparity information. In the classification phase, different classifiers make use of the previously generated images and ROIs in order to determine the class to which each pixel belongs, thus producing per-class masks. Finally, at the labelling stage, the produced masks are fused in order to obtain the final label of each pixel.
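As an illustration of how these loosely coupled stages map onto ROS nodes, the listing below sketches one stage (the stereo processing node) in Python with rospy. It is a minimal sketch: the topic names and the message flow are assumptions made for illustration, not the actual interfaces of the iCab framework.

```python
#!/usr/bin/env python
# Minimal sketch of one pipeline stage as a ROS node (hypothetical topic names).
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge


class StereoProcessingNode(object):
    def __init__(self):
        self.bridge = CvBridge()
        # Output consumed by the segmentation node downstream (assumed topic).
        self.pub = rospy.Publisher('/labelling/obstacle_map', Image, queue_size=1)
        # Input produced by the acquisition stage (assumed topic).
        rospy.Subscriber('/stereo/disparity', Image, self.callback, queue_size=1)

    def callback(self, msg):
        disparity = self.bridge.imgmsg_to_cv2(msg)
        obstacle_map = self.process(disparity)
        self.pub.publish(self.bridge.cv2_to_imgmsg(obstacle_map))

    def process(self, disparity):
        # Stage-specific work (uv-disparity, obstacle/free split) would go here.
        return disparity


if __name__ == '__main__':
    rospy.init_node('stereo_processing')
    StereoProcessingNode()
    rospy.spin()
```

Because each stage only exchanges images over topics, a classifier node can be added or replaced without modifying the rest of the pipeline.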
3 ALGORITHM
The algorithm for dense semantic stereo labelling in-
troduced in this paper is based on different methods of
computer vision. A large proportion of the algorithm
uses the stereo information in order to detect the ob-
stacles and the free space in front of the vehicle (Sec-
tion 3.1), whereas the visible information is used to
classify obstacles as pedestrians and non-traversable
areas, such as gardens (Section 3.3). The different
stages of the proposed algorithm will be explained in
this section.
3.1 Obstacle and Free Space Estimation
As mentioned above, the stereo images supplied by the vision system can be used to obtain 3D information about the vehicle's environment. This 3D information is usually represented by the disparity map (see Fig. 3a), where the value of each pixel is inversely proportional to its depth. A useful method to depict the stereo information of the vehicle's environment is the uv-disparity, which is obtained from the disparity map (Hu and Uchimura, 2005). The uv-disparity contains information about the location of both the obstacles and the ground ahead of the vehicle, and makes it possible to distinguish between them. A previous work (Musleh et al., 2011) is then used to separate the disparity map into two different disparity maps: the obstacle map (see Fig. 3c), which contains the pixels belonging to the obstacles, and the free map (see Fig. 3d), which contains the pixels of the ground.
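The following sketch illustrates the idea of building the u-/v-disparity maps and splitting the disparity map into obstacle and free maps. It is a simplified illustration under our own assumptions (integer disparities and a fixed column-height threshold), not a verbatim implementation of the method in (Musleh et al., 2011).

```python
import numpy as np


def uv_disparity(disp, max_disp=64):
    """Accumulate u- and v-disparity histograms from an integer disparity map."""
    h, w = disp.shape
    u_disp = np.zeros((max_disp, w), dtype=np.uint16)  # disparity vs. column
    v_disp = np.zeros((h, max_disp), dtype=np.uint16)  # row vs. disparity
    for d in range(1, max_disp):
        mask = (disp == d)
        u_disp[d, :] = mask.sum(axis=0)  # pixels of column u with disparity d
        v_disp[:, d] = mask.sum(axis=1)  # pixels of row v with disparity d
    return u_disp, v_disp


def split_obstacle_free(disp, u_disp, max_disp=64, min_height=15):
    """Mark pixels whose (disparity, column) bin forms a tall vertical run in
    the u-disparity as obstacles; the remaining valid pixels become ground."""
    h, w = disp.shape
    obstacle = np.zeros_like(disp)
    cols = np.arange(w)
    for v in range(h):
        row = disp[v, :]
        d = np.clip(row.astype(int), 0, max_disp - 1)
        is_obstacle = (row > 0) & (u_disp[d, cols] > min_height)
        obstacle[v, is_obstacle] = row[is_obstacle]
    free = np.where((disp > 0) & (obstacle == 0), disp, 0)
    return obstacle, free
```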
3.2 Pedestrian Classification
The university campus environment contains different kinds of areas and obstacles (buildings, trees, lampposts, pedestrians, gardens, etc.) that have to be avoided while navigating. A basic classification for in-campus navigation requires at least identifying garden areas and dynamic obstacles, mainly pedestrians.
In the approach presented in this paper, obstacle
classification is based on the determination of Re-
gions of Interest in the visible image (Llorca et al.,
2012). These ROIs isolate obstacles so that they
can be processed by classifiers in subsequent stages.
However, due to the characteristics of the campus environment, most of the obstacles arise in the proximity of the vehicle, so an obstacle may be fragmented into different ROIs if it gets different disparity values due to the high depth precision at short distances, making the classification process much harder. To address this issue, an algorithm for ROI grouping has been designed (see the sketch below). In the u-disparity map, obstacles are represented as continuous white blobs. These blobs are analysed to compute the maximum and minimum disparity levels as well as the maximum and minimum horizontal coordinates of each obstacle. Afterwards, both the depth and width data are used to group the previously computed ROIs, thus fixing the obstacle fragmentation issue.
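The blob analysis step can be sketched with OpenCV connected components on a binarized u-disparity map, as shown below; the binarization threshold and the ROI representation are illustrative assumptions.

```python
import cv2
import numpy as np


def obstacle_rois_from_udisparity(u_disp, min_pixels=4):
    """Extract one ROI per white blob of the u-disparity map. Each blob yields a
    disparity range (the blob's rows) and a horizontal extent (its columns)."""
    binary = (u_disp > min_pixels).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    rois = []
    for i in range(1, n):  # label 0 is the background
        u_min = stats[i, cv2.CC_STAT_LEFT]
        u_max = u_min + stats[i, cv2.CC_STAT_WIDTH] - 1
        d_min = stats[i, cv2.CC_STAT_TOP]
        d_max = d_min + stats[i, cv2.CC_STAT_HEIGHT] - 1
        rois.append({'u_range': (u_min, u_max), 'd_range': (d_min, d_max)})
    # Blobs with overlapping horizontal extents and adjacent disparity ranges
    # would then be merged into a single obstacle ROI.
    return rois
```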
Once the ROIs have been computed for each obstacle in the image, a HOG classifier is used to determine the probability of each region being a pedestrian. In case of a positive classification, the obstacle map is thresholded to obtain a binary mask of the pixels within the previously obtained region's disparity range. Finally, the pixels of the mask inside the ROI area are labelled as pedestrians, as sketched below.
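A minimal sketch of this step with OpenCV's stock HOG people detector is given below; the label id, the ROI structure and the masking details are assumptions made for illustration.

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())


def label_pedestrian(visible, obstacle_map, roi, labels, pedestrian_id=3):
    """Run the HOG detector inside an ROI and, on a positive detection, label
    the obstacle-map pixels whose disparity lies in the ROI's disparity range."""
    x0, x1 = roi['u_range']
    d_min, d_max = roi['d_range']
    crop = visible[:, x0:x1 + 1]
    detections, _ = hog.detectMultiScale(crop, winStride=(8, 8))
    if len(detections) > 0:
        band = (obstacle_map[:, x0:x1 + 1] >= d_min) & \
               (obstacle_map[:, x0:x1 + 1] <= d_max)
        labels[:, x0:x1 + 1][band] = pedestrian_id
    return labels
```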
This architecture design, where segmentation
and classification stages are loosely coupled, makes
it possible to use different classifiers for multiple
classes and thus provides great versatility.
3.3 Determination of Traversable Areas
Taking the free map as the starting point, the backprojection algorithm is used to obtain the probability of each pixel belonging to the garden class. For this stage, the visible image is converted into the HSV colour space, as this kind of colour representation is more robust against lighting changes than RGB. After the conversion is performed, a synthetic histogram is built encompassing the hue range corresponding to the green values usually taken by garden areas. This histogram is then used as input to the backprojection algorithm.
Once the probabilities of belonging to the garden class are obtained, an empirically tuned threshold is used to label pixels as garden or traversable area, as illustrated in the sketch below.
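The sketch below illustrates this stage with OpenCV's histogram backprojection; the hue range and the threshold are illustrative assumptions, not the empirically tuned values used in the experiments.

```python
import cv2
import numpy as np


def garden_mask(visible_bgr, free_map, hue_range=(35, 85), threshold=50):
    """Backproject a synthetic hue histogram covering typical green values and
    threshold the result, restricted to the ground pixels of the free map."""
    hsv = cv2.cvtColor(visible_bgr, cv2.COLOR_BGR2HSV)
    # Synthetic histogram: uniform probability inside the green hue range.
    hist = np.zeros((180, 1), dtype=np.float32)
    hist[hue_range[0]:hue_range[1]] = 255.0
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    mask = (backproj > threshold) & (free_map > 0)
    return mask.astype(np.uint8)
```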
4 RESULTS
In order to test the performance of the proposed
method for dense image labelling, an annotated
database is used. Since the research platform is aimed at navigating inside the university campus area in harmony with the university community, a specific ground truth is required for evaluation. Thus, a novel database (Beltrán et al., 2016) has been generated considering the particular needs of this task.
4.1 Dataset Description
The developed dataset is composed of a set of 30 manually-annotated images with a 640x480 resolution (see Fig. 4). Frames are captured by a Bumblebee 2 stereo rig with a focal length of 6 mm, a baseline of 0.12 m, and an HFOV of 43°. The camera is mounted on the forepart of the iCab research platform. The low resolution of the dataset is due to the current restrictions of the platform: limited computing resources as well as the demanding computation times required by real-time applications, like
navigation in unstructured environments such as the University campus.
Figure 3: Example of the obstacle and ground estimation. (a) Visible image. (b) Disparity map. (c) Obstacle map. (d) Free map.
All images in the collection are part of a single sequence recorded by the iCab while moving around the University campus. The original video rate is 20 fps, with a total length of 60 seconds. Although the labelling algorithm works in real time, only one out of every 40 frames of the video is used for performance evaluation.
Four different classes are used for labelling the
dataset. The chosen categories correspond to the
most popular instances found around the Univer-
sity campus and compose the minimum set re-
quired for in-campus navigation. The four categories
are: traversable area, garden, obstacles and pedes-
trian. Segmentation and classification of the selected classes provide the necessary knowledge about the platform surroundings to detect traversable areas while avoiding collisions with static obstacles, such as buildings and urban furniture, and with the most common dynamic obstacles, namely pedestrians, guaranteeing that the platform will not navigate over green areas and parks.
The selected dataset is publicly accessible and is
composed of two sets of images: the original pair
of colour images and their corresponding annotated
ground truth containing the labels for the aforemen-
tioned classes. Fig. 4 shows an example of an origi-
nal image and its annotations from the dataset. As can
be observed, the ground truth consists of one fully annotated RGB .png image for each original frame, where each pixel takes the colour associated with the category it belongs to: traversable area in blue, garden in green, obstacles in red and pedestrian in yellow. Additionally, unknown pixels are assigned the colour black during the classification stage.
4.2 Metrics
The computer used to perform the experiments to assess the performance of the proposed labelling method is an on-board embedded computer with an Intel Core i7 processor (8 cores at 4.0 GHz) and 16 GB of RAM. All developed algorithms run on ROS Kinetic under Ubuntu 16.04.
ACC = \frac{TP}{\text{num. of pixels}}    (1)

IoU_{class} = \frac{TP_{class}}{TP_{class} + FP_{class} + FN_{class}}    (2)
In order to assess the per-pixel semantic labelling
performance of our method, two different metrics are used. On the one hand, the global accuracy of the algorithm is measured as in (1), which gives the proportion of properly classified pixels. On the
other hand, for the purpose of identifying the per-class
classification accuracy, the Jaccard Index, commonly
known as PASCAL VOC intersection-over-union, is
used. The Jaccard Index is computed as in (2), where
TP, FP, and FN stand for true positive, false posi-
tive and false negative pixels, respectively. The re-
ported results correspond to the mean performance
computed over the whole set of images.
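As a concrete illustration, the sketch below computes both metrics from integer label images; the four class ids are an assumption about how the labels are encoded.

```python
import numpy as np


def evaluate(gt, pred, num_classes=4):
    """Global per-pixel accuracy, Eq. (1), and per-class Jaccard index, Eq. (2)."""
    acc = np.mean(gt == pred)
    iou = {}
    for c in range(num_classes):
        tp = np.sum((gt == c) & (pred == c))
        fp = np.sum((gt != c) & (pred == c))
        fn = np.sum((gt == c) & (pred != c))
        denom = tp + fp + fn
        iou[c] = tp / float(denom) if denom > 0 else float('nan')
    return acc, iou
```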
In order to measure the suitability of our method
for the task of real-time image labelling for au-
tonomous navigation, two well-known disparity al-
gorithms are tested. Thus, it is possible to compare
which one provides the best trade-off between per-
pixel accuracy and computation time, given that our method strongly relies on the quality of the disparity map in its first stages. The disparity meth-
ods considered in our experiments are Block Match-
ing (BM) and Semi Global Block Matching (SGBM)
(Hirschmuller, 2008).
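Both matchers are available in OpenCV; a minimal sketch of how they could be instantiated follows. The parameter values are illustrative assumptions, not the settings used on the iCab platform.

```python
import cv2

# Illustrative parameters only; both matchers return fixed-point disparities x16.
bm = cv2.StereoBM_create(numDisparities=64, blockSize=15)
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)


def disparity(left_gray, right_gray, matcher):
    # Convert the fixed-point output back to floating-point disparity values.
    return matcher.compute(left_gray, right_gray).astype('float32') / 16.0
```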
4.3 Experimental Results
The results of applying our labelling algorithm to the
dataset images are collected in Table 1 (see Fig. 5).
As can be observed, there are no significant differ-
ences between the results of the two disparity meth-
ods, with SGBM being slightly more accurate in both per-pixel and per-class classification. These findings meet the expected outcomes, given that the semi-global algorithm takes into account a greater amount of information in order to compute the depth estimation of each pixel.
Table 1: Four-class classification performance (%).

Disparity Method   Pixel-wise Accuracy   IoU_Free   IoU_Garden   IoU_Obstacle   IoU_Pedestrian
BM                 86.81                 87.42      69.01        66.41          47.05
SGBM               87.79                 88.34      70.31        67.75          43.43
Figure 4: Image - ground truth pair from the dataset. (a) and (c) Visible images. (b) and (d) Labelled ground truth. Best viewed in colour.
Despite the minor differences, the proposed method behaves similarly for each of the selected metrics regardless of the chosen disparity algorithm, thus indicating a consistent trend in terms of per-pixel accuracy and per-class classification. Tak-
ing into consideration the results provided for each of
the existing classes, it can be observed that there is an
important difference between traversable area classi-
fication and the other categories. This is explained by
the fact that the stereo sensor provides better depth
estimation in the nearest environment as the textures
are sharper. Thus, pixels corresponding to the areas closer to the camera are more likely to have their disparity properly computed. Consequently, as the area in front of the vehicle is usually ground, the best classification corresponds to the traversable category. Conversely, pixels belonging to obstacles usually lie in farther areas and are more likely to be misclassified.
The garden classifier behaves as expected. As it is based on ground detection, its results are bounded by the free map segmentation phase. Moreover, garden recognition through the backprojection algorithm (Swain and Ballard, 1992) relies only on the hue channel, and therefore suffers from drastic contrast or lighting changes in the scene.
Finally, the pedestrian classifier provides the worst per-class classification output. This situation is a consequence of using a standard HOG classifier from external libraries, which does not adapt well to multi-scale classification and does not work on semi-occluded persons that may appear at the edges of the image or be cropped during the segmentation phase. These issues can be observed in detail in Fig. 6.
Regarding the overall per-pixel accuracy, the algorithm properly labels more than 85% of the pixels in each frame, far exceeding the average IoU_class and coming very close to the score of the best classified category, which makes up most of the pixels in the sequence.
Table 2: Three-class classification performance (%).

Disparity Method   Pixel Accuracy   IoU_Free   IoU_Garden   IoU_Obstacle
BM                 88.35            87.42      69.01        72.31
SGBM               89.53            88.34      70.31        74.74
For the purpose of analysing the effect of the pedestrian classification results on the method's accuracy, an experiment has been carried out considering only the other three categories. Table 2 summarizes the performance of the proposed method in this case. As
can be observed, both the IoU_obstacle and the overall accuracy increase, with the obstacle class showing the larger increment. These variations show how much the pedestrian classifier deteriorates the overall accuracy. Furthermore, the large growth in obstacle classification performance indicates not only that the classifier produces many false positives and false negatives, but also that some pixels labelled as pedestrian actually belong to the obstacle category. However, as they are wrongly grouped as part of the person shape in the segmentation phase, they end up being considered part of the pedestrian class.
Considering the traversable and garden categories, both provide the same performance as in the 4-class case, since they are not affected by the pedestrian classifier.
4.4 Computational Time
As time is a key factor in real-time algorithms, mea-
sured times for the performed experiments are pre-
sented in Table 3 for the purpose of determining
which configuration is more suitable for the task of
scene labelling for autonomous navigation.
As can be appreciated, the Block Matching al-
gorithm is considerably less time consuming than
Figure 5: Labelling results. From left to right: visible image, ground truth and labelled image. Best viewed in colour.
Table 3: Computing times (ms).

Disparity Method   4 classes   3 classes
BM                 120.75      65.10
SGBM               187.98      134.32
SGBM at the expense of depth estimation accuracy.
In addition, there is a big difference between the 4-class and 3-class cases, indicating that the pedestrian classifier is not only the worst performer, but also one of the most time costly. However, taking into consideration both the real-time requirements and the perception needs of the in-campus navigation task, it can be asserted that the best configuration for the proposed method combines the Block Matching disparity method with the 4-class classifier, since it gives the best trade-off between labelling accuracy and the required operating frame rate: camera images arrive at 20 Hz and labelled images are available at 16 Hz thanks to parallel processing. Therefore, this setup is suitable for real-time operation on the described platform, since the iCab's top speed is 10 km/h, which means it covers about 0.18 m between two consecutive frames when driving at maximum speed.
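As a quick consistency check of this figure (the arithmetic below is ours, not taken from the original text), the distance travelled per labelled frame at top speed is:

v = 10\,\mathrm{km/h} \approx 2.78\,\mathrm{m/s}, \qquad \frac{v}{f_{\mathrm{label}}} \approx \frac{2.78\,\mathrm{m/s}}{16\,\mathrm{Hz}} \approx 0.17\,\mathrm{m},

which is in line with the approximately 0.18 m quoted above, up to rounding.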
5 CONCLUSIONS
Environment understanding plays a key role in any
autonomous navigation task. Therefore, being able to
classify the information obtained from sensors such
as cameras and lasers is essential for the process
of generating a reliable map that allows path plan-
ning. The main contribution of this paper is to pro-
vide an algorithm for dense scene labelling tested on
off-road environments based on a flexible and decoupled architecture built on ROS. Moreover, a novel manually-annotated dataset is presented for evaluation purposes.
Figure 6: Labelling issues. From left to right: visible image, ground truth and labelled image. Best viewed in colour.
The results of the proposed approach show a high
performance at labelling in-campus scenarios. This
accuracy, together with the low computational times,
makes the introduced algorithm a suitable and efficient
solution for dense image classification within the en-
vironment under research.
In addition, the modularity of the proposed archi-
tecture provides the ability to adapt the algorithm to
work well in other environments. The highly de-
coupled design gives the possibility of extending the
number of categories by adding other classifiers in a
plug-and-play manner. Similarly, the existing mod-
ules can be easily upgraded or replaced by new ones
which might grant better performance.
In the future, the pedestrian classifier will be replaced by a more advanced algorithm with better capabilities for multi-scale classification. Moreover, the garden classifier will be tuned to combine information from all three channels (H, S and V) to improve the results.
Finally, the released database will be extended with images from other sequences as well as with unused images from the current video. In addition, the database might be re-annotated to include more classes in order to achieve a better understanding of the vehicle surroundings.
ACKNOWLEDGEMENTS
This work was supported by the Spanish Govern-
ment through the CICYT projects (TRA2013-48314-
C3-1-R and TRA2015-63708-R) and Comunidad de
Madrid through SEGVAUTO-TRIES (S2013/MIT-
2713).
REFERENCES
Arnab, A., Jayasumana, S., Zheng, S., and Torr, P. H.
(2016). Higher order conditional random fields in
deep neural networks. In European Conference on
Computer Vision, pages 524–540. Springer.
Beltrán, J., Jaraquemada, C., Musleh, B., de la Escalera, A., and Armingol, J. M. (2016). SAUCE, Semantic Annotated University Campus Environment. Dataset. http://dx.doi.org/10.5281/ZENODO.167843.
Bernini, N., Bertozzi, M., Castangia, L., Patander, M.,
and Sabbatelli, M. (2014). Real-time obstacle detec-
tion using stereo vision for autonomous ground vehi-
cles: A survey. In Intelligent Transportation Systems
(ITSC), 2014 IEEE 17th International Conference on,
pages 873–878. IEEE.
Broggi, A., Cappalunga, A., Caraffi, C., Cattani, S., Ghi-
doni, S., Grisleri, P., Porta, P., Posterli, M., Zani, P.,
and Beck, J. (2008). The passive sensing suite of the
terramax autonomous vehicle. In Intelligent Vehicles
Symposium, 2008 IEEE, pages 769–774. IEEE.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Garage, W. (2010). ROS. ros.org.
Golodetz, S., Sapienza, M., Valentin, J. P., Vineet, V., Cheng, M.-M., Arnab, A., Prisacariu, V. A., Kähler, O., Ren, C. Y., Murray, D. W., et al. (2015). Semanticpaint: A framework for the interactive segmentation of 3d scenes. arXiv preprint arXiv:1510.03727.
Guo, C., Mita, S., and McAllester, D. (2009). Drivable road
region detection using homography estimation and ef-
ficient belief propagation with coordinate descent op-
timization. In Intelligent Vehicles Symposium, 2009
IEEE, pages 317–323. IEEE.
Hirschmuller, H. (2008). Stereo processing by semiglobal
matching and mutual information. IEEE Transac-
tions on pattern analysis and machine intelligence,
30(2):328–341.
Hu, Z., Lamosa, F., and Uchimura, K. (2005). A complete
uv-disparity study for stereovision based 3d driving
environment analysis. In 3-D Digital Imaging and
Modeling, 2005. 3DIM 2005. Fifth International Con-
ference on, pages 204–211. IEEE.
Hu, Z. and Uchimura, K. (2005). Uv-disparity: an effi-
cient algorithm for stereovision based scene analysis.
In IEEE Proceedings. Intelligent Vehicles Symposium,
2005., pages 48–54. IEEE.
Hussein, A., Marín-Plaza, P., Martín, D., de la Escalera, A.,
and Armingol, J. M. (2016). Autonomous off-road
navigation using stereo-vision and laser-rangefinder
fusion for outdoor obstacles detection. In Intelligent
Vehicles Symposium (IV), 2016 IEEE, pages 104–109.
IEEE.
Labayrade, R. and Aubert, D. (2003). In-vehicle obstacles
detection and characterization by stereovision. Proc.
IEEE In-Vehicle Cognitive Comput. Vis. Syst, pages
1–3.
Llorca, D., Sotelo, M., Hellín, A., Orellana, A., Gavilán,
M., Daza, I., and Lorente, A. (2012). Stereo regions-
of-interest selection for pedestrian protection: A sur-
vey. Transportation research part C: emerging tech-
nologies, 25:226–237.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 3431–3440.
Marín-Plaza, P., Beltrán, J., Hussein, A., Musleh, B., Martín, D., de la Escalera, A., and Armingol, J. M. (2016). Stereo vision-based local occupancy grid map for autonomous navigation in ROS.
Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W.,
Fidler, S., Urtasun, R., and Yuille, A. (2014). The
role of context for object detection and semantic seg-
mentation in the wild. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Musleh, B., de la Escalera, A., and Armingol, J. M. (2011).
Uv disparity analysis in urban environments. In In-
ternational Conference on Computer Aided Systems
Theory, pages 426–432. Springer Berlin Heidelberg.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Richter, S. R., Vineet, V., Roth, S., and Koltun, V. (2016).
Playing for data: Ground truth from computer games.
arXiv preprint arXiv:1608.02192.
SAE On-Road Automated Vehicle Standards Committee
(2014). Taxonomy and definitions for terms related
to on-road motor vehicle automated driving systems.
Sengupta, S., Greveson, E., Shahrokni, A., and Torr, P. H.
(2013). Urban 3d semantic modelling using stereo
vision. In 2013 IEEE International Conference on
Robotics and Automation. IEEE.
Sengupta, S., Sturgess, P., Torr, P. H., et al. (2012). Au-
tomatic dense visual semantic mapping from street-
level imagery. In 2012 IEEE/RSJ International Con-
ference on Intelligent Robots and Systems, pages 857–
862. IEEE.
Soquet, N., Perrollaz, M., Labayrade, R., Aubert, D., et al.
(2007). Free space estimation for autonomous navi-
gation. In 5th International Conference on Computer
Vision Systems.
Swain, M. J. and Ballard, D. H. (1992). Indexing via color
histograms. In Active Perception and Robot Vision,
pages 261–273. Springer.
Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R.,
Clark, M., Dolan, J., Duggins, D., Galatali, T., Geyer,
C., et al. (2008). Autonomous driving in urban envi-
ronments: Boss and the urban challenge. Journal of
Field Robotics, 25(8):425–466.
Yao, J., Fidler, S., and Urtasun, R. (2012). Describing the
scene as a whole: Joint object detection, scene clas-
sification and semantic segmentation. In Computer
Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 702–709. IEEE.