Foveal Vision for Instance Segmentation of Road Images
Benedikt Ortelt¹, Christian Herrmann²,³, Dieter Willersinn² and Jürgen Beyerer²,³
¹Robert Bosch GmbH, Leonberg, Germany
²Fraunhofer IOSB, Karlsruhe, Germany
³Karlsruhe Institute of Technology KIT, Vision and Fusion Lab, Karlsruhe, Germany
Keywords:
Instance Segmentation, Multi-scale Analysis, Foveated Imaging, Cityscapes.
Abstract:
Instance segmentation is an important task for the interpretation of images in the area of autonomous or assisted driving applications. By not only indicating the semantic class of each pixel but also separating different instances of the same class, even if they neighbor each other in the image, it can replace a multi-class object detector. In addition, it offers a better localization of objects in the image by replacing the object detector bounding box with a fine-grained object shape. The recently presented Cityscapes dataset promoted this topic by offering a large set of data labeled at pixel level. Building on the previous work of (Uhrig et al., 2016), this work proposes two improvements over this baseline strategy which lead to significant performance gains. First, a better distance measure for angular differences is proposed, which is unaffected by the −π/π discontinuity. This leads to improved object center localization. Second, the imagery from vehicle perspective includes a fixed vanishing point. A foveal concept counteracts the fact that objects become smaller in the image towards this point. This strategy especially improves the results for small objects at large distances from the vehicle.
1 INTRODUCTION
Understanding the scene in road images is important for assisted and autonomous driving applications. Information about driving related aspects, such as road boundaries, object types, free space or obstacles, is of particular interest. Given images from vehicle perspective, an understanding of the surrounding environment is required. Instance segmentation serves this purpose by denoting for each image pixel the underlying object instance and the object class (e.g., car, truck or pedestrian). Compared with regular semantic segmentation, neighboring instances of the same class, e.g., two vehicles, are separated, which is important in traffic situations where each object might behave differently. In this sense, instance segmentation replaces an object detector and additionally provides a pixel-level mask for each object instead of a simple bounding box. This allows a better localization of the object.
Following the widespread terminology (He et al., 2017), semantic segmentation denotes the per-pixel classification of the image content, object detection the acquisition of object bounding boxes, and instance segmentation the per-pixel indication of class and instance, which makes it a combination of segmentation and detection. While semantic segmentation is a very popular research field, instance segmentation receives less attention. This can also be observed in the number of submitted results for the popular Cityscapes benchmark (https://www.cityscapes-dataset.com/benchmarks), where regular semantic segmentation leads with 52:13 submissions at the time of writing. Nevertheless, the Cityscapes dataset (Cordts et al., 2016) boosted the instance segmentation topic considerably by providing appropriately annotated data.

Figure 1: Foveal image analysis concept.
This work builds upon the prior work of Uhrig et al. (Uhrig et al., 2016) and improves their solution by two specific contributions. First, an improvement which is specific to this approach: the distance measure for angular differences is replaced with an improved version that is invariant to the −π/π discontinuity. Second, a generic multi-scale foveal concept is proposed to compensate for the perspective properties of road image data, where objects tend to become smaller in the image with increasing distance. A multi-scale fine-grained analysis around the vanishing point of the road allows to better address these small objects, as illustrated in figure 1.
2 RELATED WORK
The success of object detection approaches (Girshick, 2015; Ren et al., 2015; Liu et al., 2016) gave rise to the fine-grained detection of object contours in the shape of instance segmentation. Two principal concepts to address instance segmentation can be distinguished. First, instance candidate approaches build upon the reliable bounding boxes of an object detector to refine the object shape. Second, candidate-free methods circumvent the object detection by directly estimating the object's shape. Regarding dataset choices, while Cityscapes (Cordts et al., 2016) is the current state-of-the-art benchmark for instance segmentation in road images, the MS COCO dataset (Lin et al., 2014) offers a wider spectrum of image contents, and the KITTI dataset (Geiger et al., 2012) with additional instance labels (Chen et al., 2014; Zhang et al., 2016a) is an older and smaller road image choice. Because this paper focuses on road scene understanding, the Cityscapes dataset will be used for training and testing the proposed methods.
2.1 Instance Candidate Methods
The refinement of bounding boxes allows distinguishing between instance pixels and background pixels. The most direct approach is to learn a mask regression in addition to the bounding box regression and the class label (He et al., 2017). Fusing several sub-networks by a cascade (Dai et al., 2016) or using an identification network with a Conditional Random Field (CRF) (Arnab and Torr, 2017) are further options. Fully Convolutional Networks (FCNs) (Long et al., 2015) can be used to improve erroneous bounding boxes by adjusting their size (Hayder et al., 2016).
2.2 Candidate Free Methods
Motivated by the success of semantic labeling, methods solving instance segmentation at pixel level are the second option. (Zhang et al., 2015; Zhang et al., 2016b) apply an FCN and exploit depth information for a stable training process to generate instance candidates, which are refined by a Markov Random Field (MRF) that reduces the error by enforcing global consistency. (Romera-Paredes and Torr, 2016) propose a recurrent solution where instances of one class are separated step by step. Another solution for instance segmentation is to start from a semantic segmentation result, where predicting the contour of an object allows separating single instances of one class (Ronneberger et al., 2015; van den Brand et al., 2016; Kirillov et al., 2016). Such methods are difficult to train and have issues with split instances, e.g., because of occlusion. This is addressed by (Uhrig et al., 2016) by additionally predicting the object distance and the direction to the respective object center for each pixel. A post-processing step then allows detecting and merging split instances. This paper builds upon this approach, which will be elaborated in more detail in the next section.
2.3 Multi-scale and Foveated Concepts
Multi-scale image processing, e.g., by image pyramids, is a widespread technique to address differently sized objects in images. It is a key part of all popular object detection frameworks (Viola and Jones, 2004; Ren et al., 2015; Liu et al., 2016). While necessary to achieve good results, it can introduce a significant computational overhead. Foveated imaging is a concept which reduces this additional burden by spatially restricting the higher resolution scales to a region around a fixation point (Bandera and Scott, 1989; Ude et al., 2003; Wang and Bovik, 2006). An increasing resolution is gradually compensated by limitation to a smaller spatial region, which imitates human vision.
3 BASIC CONCEPT
The proposed method builds upon (Uhrig et al., 2016). This approach performs instance segmentation in two stages. In the first stage, an extended FCN8s (Long et al., 2015), based on VGG16 (Simonyan and Zisserman, 2015), is used for CNN-based pixelwise predictions of the semantic class, the depth and the instance-based geometric direction to the instance center, as shown at the top of figure 2.
Figure 2: The post-processing concept of (Uhrig et al., 2016).

These three prediction maps are then post-processed to obtain the final instance segmentation. The semantic label is utilized to distinguish between instances of different semantic classes and to classify the resulting instances in the end, while the depth label is mainly used to estimate each instance's shape and size. The direction predictions provide the direction to the corresponding instance center for every instance pixel, which yields a high discriminativity at instance borders.
In the post-processing, the predicted maps are jointly employed to generate the instances by finding the instance centers and assigning all pixels to the appropriate centers. To reduce oversegmentation caused by wrong extra instance centers, instances are fused subsequently depending on their properties. Initial instance centers are determined with a template matching procedure on the direction map using the Normalized Cross Correlation (NCC)
R(x,y) = \frac{\sum_{x',y'} F(x+x',\,y+y') \cdot T(x',y')}{\sqrt{\sum_{x',y'} F(x+x',\,y+y')^2 \cdot \sum_{x',y'} T(x',y')^2}}   (1)
to compare the angular patterns between the direction map F and the template T. The rectangular templates are scaled according to prior knowledge and the information from the semantic and depth map. Maxima in the score maps correspond to instance centers and are found with non-maximum suppression in a region that is equivalent to the template area. Pixels are assigned to the closest center where semantic class, predicted direction and relative location agree, which results in instance proposals. Two proposals are fused afterwards if one proposal's accumulated direction is biased towards a neighboring proposal with similar depth and semantic class. We refer to (Uhrig et al., 2016) for further details.
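For illustration, the following minimal sketch evaluates equation 1 for a scalar angle map with a plain sliding-window loop. The function name and data layout are our own assumptions; the per-location template scaling from the semantic and depth maps used by (Uhrig et al., 2016) is omitted.

```python
import numpy as np

def ncc_score_map(F, T):
    """NCC (eq. 1) between a direction map F (H x W, angles in radians)
    and a template T (h x w, angles in radians). Hypothetical sketch."""
    H, W = F.shape
    h, w = T.shape
    R = np.full((H - h + 1, W - w + 1), -np.inf)
    t_norm = np.sqrt(np.sum(T ** 2))
    for y in range(R.shape[0]):
        for x in range(R.shape[1]):
            patch = F[y:y + h, x:x + w]
            denom = np.sqrt(np.sum(patch ** 2)) * t_norm
            if denom > 0:
                R[y, x] = np.sum(patch * T) / denom
    return R
```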
4 IMPROVED ANGULAR
DISTANCE
The comparison of the template and the direction map by NCC sometimes leads to inaccurate initial center detections. This includes spatial shifts, missed centers or even multiple centers per instance. These effects may lead to inaccurate instance localization, missed instances or over-segmented instances.

For this reason, we suggest a more robust and precise Improved Angular Distance (IAD) to compare the angular patterns for detecting the instance centers. In contrast to the basic NCC approach, it is invariant to the angular discontinuity between −π and π.
Instead of the scalar angles, the comparison of the center directions is performed directly on the direction vectors in the form of vector fields. For a single pixel, the network predictions p_i for each discrete angular range class are used as weights for the class's respective direction vector r_i to obtain the final direction vector (Uhrig et al., 2016)

\rho = \sum_i p_i \cdot r_i .   (2)

Consequently, the template T and the direction field F are extended to vector fields containing the two-dimensional normalized direction vectors \hat{\rho} at each location.
For a template and a direction field region of height h and width w, the score S that indicates a pixel's likelihood of being an instance center is then

S(x,y) = \frac{1}{h \cdot w} \sum_{x',y'} F(x+x',\,y+y')^{\top} \cdot T(x',y') .   (3)

This corresponds to the cross correlation for the three-dimensional case, normalized by the number of pixels. The normalization allows the comparison of scores of different templates in the non-maximum suppression step.
Leaving the pixel normalization aside, the computation is equal to the inner product of the direction vectors f and t at each pixel. Because the vectors have unit length, this equals the cosine of the angle between the vectors

\langle f, t \rangle = \cos\bigl(\angle(f, t)\bigr) .   (4)
Figure 3: Score maps S when using NCC (top) and IAD (bottom).

For this reason, small angular differences have a small impact while opposed vectors cause significant score reductions. Consequently, lateral shifts of centers are penalized harder, leading to more accurate results in comparison to NCC. Additionally, this leads to more distinct maxima in the score maps, which are depicted in figure 3.

Due to the normalization and the cosine function, the overall score ranges from -1 to 1, where a score of 1 indicates a perfect match.
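As a rough illustration of equations 2 and 3, the following sketch first turns the per-class predictions into normalized direction vectors and then evaluates the IAD score as a sliding mean inner product. Array layouts and names are our own assumptions, not the original implementation.

```python
import numpy as np

def directions_from_predictions(P, R):
    """Eq. 2: P holds (H, W, K) scores for K discrete angular ranges,
    R holds (K, 2) unit vectors for those ranges. Returns normalized
    (H, W, 2) direction vectors."""
    rho = np.einsum('hwk,kd->hwd', P, R)
    norm = np.linalg.norm(rho, axis=-1, keepdims=True)
    return rho / np.maximum(norm, 1e-8)

def iad_score_map(F, T):
    """Eq. 3: mean inner product between the direction vector field
    F (H, W, 2) and the template vector field T (h, w, 2)."""
    H, W, _ = F.shape
    h, w, _ = T.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for y in range(S.shape[0]):
        for x in range(S.shape[1]):
            S[y, x] = np.sum(F[y:y + h, x:x + w] * T) / (h * w)
    return S
```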
There are several options to further improve the score in equation 3. To better adjust the rectangular template to the instance shape, pixels classified as background can be omitted from the score computation. Thereby, the rectangular template is adapted to the real shape of the instance. Effects at the borders between instances, where opposing directions occur, can be handled in two ways. The easiest way is to reduce the instance template size to avoid that neighboring instance pixels fall inside. We found that IAD is more robust to smaller templates than NCC. Addressing border effects on score level is possible by ignoring negative cosine values, which indicate opposing directions occurring at the border.
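A possible realization of both refinements, assuming a boolean foreground mask derived from the semantic map, could look as follows; this is a sketch of the idea rather than the exact implementation.

```python
import numpy as np

def iad_score_map_refined(F, T, foreground):
    """Refined IAD score: pixels classified as background (foreground == False)
    are omitted and negative per-pixel cosines at instance borders are ignored.
    Hypothetical sketch under these assumptions."""
    H, W, _ = F.shape
    h, w, _ = T.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for y in range(S.shape[0]):
        for x in range(S.shape[1]):
            cos = np.sum(F[y:y + h, x:x + w] * T, axis=-1)  # per-pixel cosine
            cos = np.maximum(cos, 0.0)                       # drop opposing directions
            mask = foreground[y:y + h, x:x + w]
            n = max(int(mask.sum()), 1)
            S[y, x] = np.sum(cos[mask]) / n
    return S
```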
5 FOVEAL STRATEGY
Foveated imaging is an intriguing but often unpopular concept because of two common downsides for general image analysis. First, it is often unclear how to choose the fixation point where the resolution should be highest. Either there is no clear point, or knowing the fixation point already solves the targeted problem, which would be the case, e.g., in object detection. Second, applying foveal image transformations, such as a log-polar transform (Schwartz, 1977), to the input image alters object proportions depending on the object location in the image. This makes the training of a unified detection or segmentation method very difficult.

Figure 4: Comparison of instance center detections with NCC (left) and IAD (right). Color overlay for directions is based on (Uhrig et al., 2016).
Both issues are addressed in this work. First, the vanishing point of the road offers an obvious choice as fixation point in imagery captured in driving direction from a vehicle. Second, we explicitly extract image regions and scales instead of applying a highly non-linear foveal image transformation to the input image. This can also be understood as a multi-scale image pyramid strategy with high resolution scales being spatially focused and limited to the fixation point.

Two different solutions to determine the fixation point in the road images are suggested:
1. Horizontally centered and vertically aligned at the pre-calibrated horizon at fixed image coordinates. This assumes a fixed and well aligned camera position. Extraction is quick.
2. Dynamically extracted based on the semantic segmentation output of the Convolutional Neural Network (CNN) at the original scale. This adapts the fixation point according to the currently analyzed scene but requires a small computational overhead.
The applied strategy to extract the dynamic fixation point is based on the road segmentation of the whole image. The topmost image area classified as road is considered the vanishing point of the road and consequently set as fixation point. This strategy results in an intersection over union of 0.693 between the cropped regions around the detected point and around the ground truth vanishing point of the road. Note that the Cityscapes dataset also includes depth information from stereo vision, which offers an additional way to determine the vanishing point. However, we decide for a strategy that also works in the absence of depth information to broaden the scope of the approach.
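A minimal sketch of this dynamic strategy, assuming a label map and a known road class id, could look as follows; the horizontal placement within the topmost road row is our own simplification.

```python
import numpy as np

def dynamic_fixation_point(semantic_map, road_id):
    """Fixation point from the full-image semantic segmentation: the topmost
    row containing road pixels approximates the vanishing point of the road.
    Hypothetical sketch; semantic_map is an (H, W) label map."""
    ys, xs = np.nonzero(semantic_map == road_id)
    if ys.size == 0:                       # no road found: fall back to the image center
        return semantic_map.shape[0] // 2, semantic_map.shape[1] // 2
    top = ys.min()                         # topmost image row classified as road
    x_fix = int(np.median(xs[ys == top]))  # horizontal position within that row
    return int(top), x_fix
```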
5.1 Fusion Methods
Figure 5: Foveal regions and their alignment with the fixation point.

Foveal regions are cropped in alignment with the fixation point as visualized in figure 5. By avoiding regions centered around the fixation point, instances are better covered because the road begins at the bottom center of the image.
For each crop, the region is rescaled to fit the input size of the instance segmentation CNN. Thus, small objects at larger distances are enlarged. This better exploits the typical scale range of objects from 40 to 140 pixels which networks pre-trained on the ImageNet data respond to (Hu and Ramanan, 2016). The CNN is fine-tuned separately for each foveal scale to adjust for the differing alignment. Experiments with a jointly trained network for all crops indicated worse results. The post-processing template sizes are adjusted accordingly to ensure consistency with the resized objects.
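The following sketch illustrates one possible way to extract and rescale such foveal crops. The exact alignment of the regions relative to the fixation point (figure 5) is simplified here, and the function and parameter names are our own assumptions.

```python
import cv2
import numpy as np

def foveal_crops(image, fixation, num_crops=2, scale_step=2, net_size=(2048, 1024)):
    """Extract foveal regions aligned with the fixation point and rescale each
    to the CNN input size (width, height). Each crop has half the side lengths
    of the previous one (scale factor 2, see section 6.2). Hypothetical sketch:
    crops are assumed to start at the fixation row and to be centered
    horizontally on the fixation point."""
    H, W = image.shape[:2]
    fy, fx = fixation
    crops, h, w = [], H, W
    for _ in range(num_crops):
        h, w = h // scale_step, w // scale_step
        y0 = int(np.clip(fy, 0, H - h))            # crop extends downwards from the fixation row
        x0 = int(np.clip(fx - w // 2, 0, W - w))   # centered horizontally on the fixation point
        region = image[y0:y0 + h, x0:x0 + w]
        crops.append((cv2.resize(region, net_size), (x0, y0, w, h)))
    return crops
```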
The proposed foveal strategy then results in an instance segmentation map for each crop, which has the original resolution. The segmentation results of the crops are subsequently scaled and merged into the result of the whole image, beginning with the largest crop. Three different fusion methods to combine the crops are explored, with the first two being generally applicable to all instance segmentation approaches and the third one being specific to the chosen base approach.

Baseline Instance Fusion. As a baseline, a crop's instance segmentation map simply replaces the corresponding part in the larger crop's instance segmentation map. To avoid instances being cut at the crop border, instances that overlap within a small overlapping zone are merged.
Improved Instance Fusion. While smaller crops allow to better segment the smaller objects, it is likely that big instances are split because they significantly exceed the typical object size range that can be detected by the CNN. Additionally, there is less context that the CNN can rely on if only a part of the instance is visible in the crop. As a consequence, the baseline instance fusion might insert these split instances even if they are correctly segmented in larger crops. This motivates an improved method to fuse the instance segmentation maps, pursuing the objective of combining bigger instances segmented at smaller scales and smaller instances segmented at larger scales. Merging a smaller crop into a larger crop follows these steps (see also the sketch after this list):

1. Remove all instances in the larger crop that lie completely within the smaller crop region. Keep the remaining ones.
2. If more than 50% of an instance in the smaller crop overlaps with a remaining instance from the larger crop,
(a) then add the instance's pixels to this remaining instance,
(b) otherwise add the instance as a new one and overwrite any overlap with other instances.

This strategy keeps the larger instances from the larger crops while still exploiting the finer details in the smaller crops, both in terms of better object contours as well as small objects.
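The sketch below illustrates the two merging rules on integer instance label maps; the data layout and function name are assumptions, and per-class handling is omitted.

```python
import numpy as np

def improved_instance_fusion(large_map, small_map, region):
    """Merge the instance map of a smaller foveal crop into the map of the
    larger crop. Both maps are full-resolution integer label maps
    (0 = background); `region` is a boolean mask of the smaller crop's area.
    Hypothetical sketch."""
    fused = large_map.copy()
    # 1. Remove instances of the larger crop that lie completely inside the crop region.
    for lid in np.unique(large_map[large_map > 0]):
        if np.all(region[large_map == lid]):
            fused[large_map == lid] = 0
    next_id = fused.max() + 1
    # 2. Insert the smaller crop's instances.
    for sid in np.unique(small_map[small_map > 0]):
        mask = small_map == sid
        overlap_ids, counts = np.unique(fused[mask], return_counts=True)
        keep = overlap_ids > 0
        if keep.any() and counts[keep].max() > 0.5 * mask.sum():
            # (a) >50% overlap with a remaining instance: extend that instance.
            fused[mask] = overlap_ids[keep][np.argmax(counts[keep])]
        else:
            # (b) otherwise add it as a new instance, overwriting any overlap.
            fused[mask] = next_id
            next_id += 1
    return fused
```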
Feature Fusion. Both previous instance fusions are independent of the instance segmentation algorithm. Specific to the base method of this work, scale fusion can also be done earlier, i.e., on the semantic, depth and direction feature maps. Because the post-processing is tolerant to minor inaccuracies, the different feature map crops are scaled accordingly and simply copied into each other without any border effect handling. These fused maps are then regularly post-processed. Figure 6 illustrates this strategy.

Figure 6: Illustration of the feature fusion concept.
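A minimal sketch of this feature-level fusion, assuming the crop's predicted feature maps and its location in the full image are given, is shown below; border effects are ignored, as described above.

```python
import cv2

def feature_fusion(full_maps, crop_maps, region):
    """Copy the rescaled feature maps (semantic, depth, direction) predicted on
    a foveal crop over the corresponding area of the full-image maps before
    post-processing. `region` = (x0, y0, w, h) of the crop. Hypothetical sketch."""
    x0, y0, w, h = region
    fused = []
    for full, crop in zip(full_maps, crop_maps):
        out = full.copy()
        out[y0:y0 + h, x0:x0 + w] = cv2.resize(crop, (w, h), interpolation=cv2.INTER_NEAREST)
        fused.append(out)
    return fused
```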
6 EXPERIMENTS
The experiments are performed on the Cityscapes dataset (Cordts et al., 2016) following the official instance segmentation evaluation protocol. Instance segmentation results are thus given as average precision (AP) and AP50%, where AP50% denotes the average precision for a fixed instance overlap threshold of 0.5 in terms of the Jaccard index (intersection over union). The CNN is trained on the training set containing 2,975 images. Method optimization is performed on the 500 image validation set. The test set with 1,525 images serves for the final evaluation, where tests have to be performed via the official dataset website because no annotations are released for this set. Method validation will be performed against the reimplemented baseline of (Uhrig et al., 2016). The base network is trained in the same way, followed by the same post-processing strategy, which results in a comparable overall performance with an AP of 9.9 (theirs) versus 10.0 (ours) on the validation set.

Figure 7: Comparison of instance segmentation results with IAD (left) and NCC (right).

Table 1: Evaluation of the IAD compared to the NCC baseline on Cityscapes val. Numbers in percent.

             AP (NCC)  AP (IAD)  AP50% (NCC)  AP50% (IAD)
person           7.7      11.0         20.9         27.4
rider            7.4       8.8         25.3         26.9
car             22.0      24.5         38.6         40.6
truck            8.3      11.9         17.1         21.5
bus             15.8      17.3         35.5         34.8
train            7.3      10.1         14.5         22.0
motorcycle       5.3       7.0         16.3         19.0
bicycle          5.8       6.4         19.0         20.2
mean            10.0      12.1         23.4         26.6
6.1 Angular Distance
Replacing the NCC by the IAD in the post-processing for instance center detection promises to improve the results because of the more robust and tolerant strategy. The results in table 1 clearly confirm this across all classes in the regular setting. Overall, results are improved from 10.0 to 12.1 percent. An example comparison of instance segmentation results between both options is depicted in figure 7. It indicates fewer decomposed instances for the IAD.
6.2 Multi-scale Analysis
Even though the Cityscapes dataset has a considerable image resolution of 2048 × 1024 pixels, far objects are mostly below 30 × 30 pixels in size. This motivates the proposed foveal strategy, which enlarges the objects into the typical favorable range of the analyzing CNN. Due to the typically high response of ImageNet pre-trained networks to objects in the scale range of 40 to 140 pixels, an overly dense scale sampling of the cropped foveal regions is unnecessary. We decide for a scaling factor of 2 between crops, i.e., each region of interest has half the size of the previous one. The cropped regions are then upscaled to push object sizes into the favorable range of the CNN. Table 2 shows the results of the comparison between pre-calibrated and dynamically selected fixation points as well as an analysis of the number of useful crops. It shows that the dynamic selection becomes important if crops get smaller and more focused on small image regions. These regions must then be selected well to improve performance. When only using a single crop, the pre-calibrated strategy is sufficient because the vanishing point of the road is usually already inside the crop, which renders the dynamic strategy unnecessary. Regarding the fusion strategy, feature fusion and the improved instance fusion are consistently superior to the baseline strategy. Also, the improved instance fusion outperforms the method-specific feature fusion. Overall, the validation AP is increased from 12.1 to 14.4 percent.

Table 2: Comparison of fixation point search strategies, number of foveal crops and fusion strategy on Cityscapes val. All numbers are in percent and larger is better.

foveal setting     metric   feature fusion   baseline instance fusion   improved instance fusion
none               AP             12.1                  -                          -
fixed, 1 crop      AP             13.3                 13.1                       14.4
fixed, 2 crops     AP              8.9                  8.4                       13.8
dynamic, 1 crop    AP             12.6                 12.5                       14.1
dynamic, 2 crops   AP             11.1                 10.9                       14.0
none               AP50%          26.6                  -                          -
fixed, 1 crop      AP50%          29.3                 28.2                       29.8
fixed, 2 crops     AP50%          20.8                 19.0                       29.1
dynamic, 1 crop    AP50%          28.3                 27.7                       28.6
dynamic, 2 crops   AP50%          26.7                 25.6                       28.7
Figure 8: Result of the foveal strategy (left) on an image region where non-foveal processing detected no instances. The ground truth is given on the right for reference.

Table 3: Detailed results for selected foveal strategies on Cityscapes val. All numbers are in percent and larger is better.

foveal setting    fusion method              metric  person  rider   car  truck   bus  train  motorcycle  bicycle  mean
none              -                          AP        11.0    8.8  24.5   11.9  17.3   10.1         7.0      6.4  12.1
fixed, 1 crop     improved instance fusion   AP        13.2   11.1  25.8   14.5  22.9   11.2         8.2      8.1  14.4
fixed, 1 crop     feature fusion             AP        14.4   11.7  27.8    8.7  19.0    9.1         8.2      7.7  13.3
dynamic, 1 crop   improved instance fusion   AP        12.9   10.7  25.1   14.0  23.7   10.3         8.1      7.7  14.1
dynamic, 1 crop   feature fusion             AP        13.9   11.7  26.5    7.8  17.9    7.6         8.4      7.1  12.6

Table 4: Instance-level segmentation results on Cityscapes test. Comparison of the published class-based performance of the basic concept (Uhrig et al., 2016) and our best performing method including IAD, the fixed foveal strategy with 1 crop and improved instance fusion. All numbers are in percent and larger is better.

                                      metric  person  rider   car  truck   bus  train  motorcycle  bicycle  mean
basic concept (Uhrig et al., 2016)    AP        12.5   11.7  22.5    3.3   5.9    3.2         6.9      5.1   8.9
Ours                                  AP        13.4   11.4  24.5    9.4  14.5   12.2         8.0      6.7  12.5
basic concept (Uhrig et al., 2016)    AP50%     31.8   33.8  37.8    7.6  12.0    8.5        20.5     17.2  21.1
Ours                                  AP50%     31.5   29.7  40.0   16.0  23.8   21.7        19.2     19.9  25.2
basic concept (Uhrig et al., 2016)    AP100m    24.4   20.3  36.4    5.5  10.6    5.2        10.5      9.2  15.3
Ours                                  AP100m    24.5   19.6  39.3   14.5  24.2   18.5        11.1     11.1  20.4
basic concept (Uhrig et al., 2016)    AP50m     25.0   21.0  40.7    6.7  13.5    6.4        11.2      9.3  16.7
Ours                                  AP50m     24.7   20.2  42.5   17.2  27.6   21.8        11.7     11.3  22.1
Having a look at the instance segmentation results for different object categories in table 3 indicates that the improved instance fusion better addresses the large object classes as intended. By avoiding splitting objects, such as trucks or trains, at crop borders, segmentation improves significantly. This contrasts with the feature fusion strategy, where performance for these classes drops heavily due to split instances.

Finally, the best-performing combination is compared with the published results of the basic method on the Cityscapes test dataset in table 4. A significant overall improvement is achieved. Only the results for the person and rider classes are mixed, indicating that the CNN used by (Uhrig et al., 2016) is significantly better at detecting people at the cost of all other classes.
7 CONCLUSION
Using an angular distance which is unaffected by the −π/π discontinuity improved the instance segmentation results significantly. Further performance progress was made by a generally applicable foveal image analysis strategy with a multi-scale focus on the vanishing point of the road. This allowed better distinguishing of far objects, which are typically small in the image. Overall, the performance on the Cityscapes test dataset was improved by both measures from 8.9 to 12.5 percent average precision.
REFERENCES
Arnab, A. and Torr, P. (2017). Pixelwise Instance Segmentation with a Dynamically Instantiated Network. In Conference on Computer Vision and Pattern Recognition.
Bandera, C. and Scott, P. D. (1989). Foveal machine vision systems. In IEEE International Conference on Systems, Man and Cybernetics, pages 596–599. IEEE.
Chen, L.-C., Fidler, S., Yuille, A. L., and Urtasun, R. (2014). Beat the mturkers: Automatic image labeling from weak 3d supervision. In Conference on Computer Vision and Pattern Recognition, pages 3198–3205.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The Cityscapes Dataset for Semantic Urban Scene Understanding. In Conference on Computer Vision and Pattern Recognition.
Dai, J., He, K., and Sun, J. (2016). Instance-aware Semantic Segmentation via Multi-task Network Cascades. In Conference on Computer Vision and Pattern Recognition.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition.
Girshick, R. (2015). Fast R-CNN. In International Conference on Computer Vision.
Hayder, Z., He, X., and Salzmann, M. (2016). Shape-aware Instance Segmentation. arXiv preprint arXiv:1612.03129.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. arXiv preprint arXiv:1703.06870.
Hu, P. and Ramanan, D. (2016). Finding tiny faces. arXiv preprint arXiv:1612.04402.
Kirillov, A., Levinkov, E., Andres, B., Savchynskyy, B., and Rother, C. (2016). InstanceCut: from Edges to Instances with MultiCut. arXiv preprint arXiv:1611.08272.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In Conference on Computer Vision and Pattern Recognition.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
Romera-Paredes, B. and Torr, P. H. S. (2016). Recurrent instance segmentation. arXiv preprint arXiv:1511.08250.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In MICCAI.
Schwartz, E. L. (1977). Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181–194.
Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Ude, A., Atkeson, C. G., and Cheng, G. (2003). Combining peripheral and foveal humanoid vision to detect, pursue, recognize and act. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), volume 3, pages 2173–2178. IEEE.
Uhrig, J., Cordts, M., Franke, U., and Brox, T. (2016). Pixel-level encoding and depth layering for instance-level semantic segmentation. In German Conference on Pattern Recognition.
van den Brand, J., Ochs, M., and Mester, R. (2016). Instance-level Segmentation of Vehicles using Deep Contours. In Asian Conference on Computer Vision.
Viola, P. and Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154.
Wang, Z. and Bovik, A. C. (2006). Foveated image and video coding. Digital Video, Image Quality and Perceptual Coding, pages 431–457.
Zhang, Z., Schwing, A. G., Fidler, S., and Urtasun, R. (2015). Monocular object instance segmentation and depth ordering with CNNs. In International Conference on Computer Vision.
Zhang, Z., Fidler, S., and Urtasun, R. (2016a). Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Conference on Computer Vision and Pattern Recognition, pages 669–677.
Zhang, Z., Fidler, S., and Urtasun, R. (2016b). Instance-level segmentation with deep densely connected MRFs. In Conference on Computer Vision and Pattern Recognition.