Interdependent Multi-task Learning for Simultaneous Segmentation and Detection
Mahesh Reginthala¹, Yuji Iwahori², M. K. Bhuyan¹, Yoshitsugu Hayashi², Witsarut Achariyaviriya² and Boonserm Kijsirikul³
¹Indian Institute of Technology Guwahati, 781039, India
²Chubu University, 487-8501, Japan
³Chulalongkorn University, Bangkok, 20330, Thailand
boonserm.k@chula.ac.th
Keywords:
Multi-task Learning, Semantic Segmentation, Object Detection, Deep Learning.
Abstract:
Lightweight, fast, and accurate deep-learning algorithms are essential for practical deployment in real-world use-cases. Semantic segmentation and object detection are the principal tasks of visual perception. A multi-task network significantly reduces the number of parameters compared to two independent networks running simultaneously, one for each task. Generally, multi-task networks have shared encoders and multiple independent task-specific decoders. Instead, we model our network to exploit features from both the encoder and the decoder. We propose a multi-task network that performs both segmentation and detection with only 37.9 million parameters and an inference time of 74 milliseconds on a consumer-grade GPU. This network performs the two tasks with far fewer parameters and in much less inference time than each corresponding single-task network.
1 INTRODUCTION
Convolutional neural networks (CNNs) have been re-
markably successful in the field of computer vision
over recent years. Visual perception is a crucial part
of several upcoming breakthroughs in technologies
like self-driving, robotics, health care, automation,
and artificial intelligence. Semantic segmentation and
object detection constitute a significant part of visual perception. Semantic segmentation is required to understand areal classes such as road, vegetation, and sky, whereas object detection helps us understand countable classes such as vehicles and humans. Enormous computational complexity and high inference times have been significant setbacks for these intensive tasks. Most real-world visual perception applications require both tasks to be performed simultaneously on resource-constrained platforms.
It is evident that the initial layers of the encoder in any computer-vision network learn similar, Gabor-like filters independent of the task, and that the decoder of a semantic segmentation network holds pixel-level contextual information that is very helpful for an object detector extracting bounding boxes and class probabilities from those representations. This motivates us to build a single multi-task learning model capable of performing the complex tasks of semantic segmentation and object detection simultaneously. Along with accuracy, we also concentrate on keeping our network lightweight and its inference fast on a consumer-grade GPU.
In summary, the following are our important con-
tributions:
• We present a novel MTL network that performs both semantic segmentation and object detection simultaneously, with an inference time and a number of parameters much lower than those of a single-task semantic segmentation or object detection network.
• Usually, MTL networks have shared encoders and independent task-specific decoders; instead, we also exploit the feature maps of the segmentation decoder, which carry rich semantic information, thus proposing a framework for an MTL network with interconnected decoders.
• We propose a scale-aware training scheme for the trident block of a one-stage object detector with anchor boxes.
• We propose a training procedure for a highly interdependent MTL network.
2 RELATED WORKS
In general, multi-task learning networks are categorized into hard parameter sharing and soft parameter sharing methods. Usually, in hard parameter sharing, several task-specific decoders make predictions from a feature map generated by a single shared encoder, similar to (Caruana, 1993), (Teichmann et al., 2016). This method is unlikely to overfit and is found to generalize well. Hard parameter sharing is widely used in MTL because of its computational advantages. In soft parameter sharing, every task has its own task-specific model, with some degree of parameter sharing between the different models, for example (Misra et al., 2016). Parameter similarity is encouraged by regularizing the distance between the parameters of the models, as shown in (Duong et al., 2015).
2.1 Semantic Segmentation
Over the years, it has become evident that CNNs are very
good at semantic segmentation and classification
tasks. For pixel-level prediction tasks like semantic
segmentation, fully convolutional networks (FCNs)
by (Long et al., 2015) introduced the end-to-end ap-
proach that maps the feature maps of a classification
network to a dense prediction output. FCN (Long
et al., 2015) modified VGG-16 (Simonyan and Zisser-
man, 2014) into an encoder-decoder architecture with
skip connections. Conditional random fields (CRFs)
are used on network output for better performance
around object boundaries in DeepLab (Chen et al.,
Better performance across different scales of object instances was initially achieved by training the network on multiple rescaled versions of the input or by fusing features from multiple parallel branches that take different image resolutions, as shown in (Farabet et al., 2012) and (Long et al., 2015), respectively. These networks use pooling layers to increase the receptive field, which is inefficient for a segmentation network. DeepLab (Chen et al., 2016) and PSPNet (Zhao et al., 2017) use dilated convolutions of different rates over multiple parallel branches and concatenate them. This enables them to increase the field-of-view of CNNs for multi-scale contextual information. The major setback of these approaches is computational complexity and, hence, large inference time. AdapNet++ (Valada et al., 2019) addresses this issue by cascading atrous convolutions to enlarge the field-of-view efficiently. Several recent works such as DANet (Fu et al., 2019) and GFF (Li et al., 2019a) use gating and attention mechanisms, common in recurrent networks such as LSTMs and GRUs, in various ways for tasks like semantic segmentation.
2.2 Object Detection
Object detectors based on deep learning can be classified into two categories: one-stage detectors and two-stage detectors. It is commonly observed that two-stage detectors are better in accuracy, whereas one-stage detectors have faster inference times. Usually, in two-stage detectors, the first stage proposes regions of the image that potentially contain
an object. R-CNN (Girshick et al., 2014) uses Se-
lective Search (Uijlings et al., 2013), SPPNet (He
et al., 2015) uses spatial pyramid pooling, Fast R-
CNN (Girshick, 2015) uses RoIPooling layers, Faster
R-CNN (Ren et al., 2015) uses region proposal net-
work (RPN) as their first stage. These proposed re-
gions are forwarded to the second stage for refining
the detected boundaries and classifying the object.
For the second stage, R-CNN (Girshick et al., 2014)
uses class-specific linear SVMs over the fixed-length
feature vector of the warped region generated using
a large convolutional neural network (CNN), Fast R-
CNN (Girshick, 2015) uses a fixed-size feature map mapped to a feature vector by fully connected layers (FCs), and Faster R-CNN (Ren et al., 2015) uses the Fast R-CNN detector module. On the other hand, one-
stage methods YOLO (Redmon and Farhadi, 2017)
and SSD (Liu et al., 2016) frame object detection as an end-to-end regression problem over offsets of predefined anchor boxes and class probabilities. Focal loss (Lin et al., 2018) addresses the issue
of class imbalance common in one-stage detectors.
Large scale variation of object instances is a vital issue for object detectors, and different methods have been used to handle it. SSD (Liu et al., 2016) and MS-CNN (Cai et al., 2016) use feature maps at different levels to make predictions at multiple scales. TDM (Shrivastava et al., 2016) and FPN (Lin et al., 2017) use a top-down pathway and lateral connections for a more semantic representation and make predictions at multiple levels of the decoder. PANet (Liu et al., 2018) enhances the information flow between the lower layers and the topmost features by bottom-up path augmentation. SNIP (Lee et al., 2018) se-
lectively back-propagates the gradients of object in-
stances of different sizes as a function of the image
scale. M2Det (Zhao et al., 2019) uses alternating joint
Thinned U-shape Modules and Feature Fusion Mod-
ules to extract more representative, multi-level multi-
scale features. TridentNet (Li et al., 2019b) has a par-
allel multi-branch architecture in which each branch
has dilated convolutions at different rates but with the
same transformation parameters, which enables it to
perform better over different scales.
Figure 1: Observations when the following modifications are performed on a U-Net-like semantic segmentation network. (a) Original prediction of the semantic segmentation network. (b) Prediction without skip connections: object boundaries are smoothed, and small objects such as poles and distant persons are lost, which shows the importance of skip connections and feature maps from the top layers. (c) Prediction with bottleneck activations and skip connections inverted, which shows the spatial correlation of feature maps at different levels of deep CNNs.
3 PROPOSED NETWORK
In this section, we describe the overall architecture of
the proposed network, which can perform simultane-
ous object detection and semantic segmentation. We
detail our design criteria, the reasoning behind them, and the advantages of
the proposed multi-tasking network. We then detail
the Trident block for one-stage detectors with anchor
boxes.
3.1 Semantic Segmentation
Our network is a fully convolutional image encoder
- segmentation decoder - object detector design, as
shown in the Figure 2. We adopt AdapNet++ (Val-
ada et al., 2019) Architecture for the semantic seg-
mentation part. AdapNet++ (Valada et al., 2019) is
a computationally efficient semantic segmentation ar-
chitecture with the Image encoder based on full pre-
activation ResNet-50 (He et al., 2016) with multiscale
residual units at varying dilation rates in the last two
blocks of the encoder. ResNet-50 (He et al., 2016) ac-
commodates the sufficient deep contextual features in
limited computational complexity. AdapNet++ (Val-
ada et al., 2019) poses an efficient atrous spatial pyra-
mid pooling (eASPP) module as the bottleneck of the
network, which reduces the number of parameters re-
quired by over 87% compared to the originally pro-
posed ASPP in DeepLabv3+ (Chen et al., 2018b).
Its decoder consists of multiple deconvolutions and
convolution layers with skip connections fusing fea-
ture maps from the encoder for the segmentation of
small objects and object boundary refinement. Two
auxiliary losses are used immediately after each up-
sampling stage to accelerate training and improve the
gradient propagation in the network. These auxiliary
losses also help us for coming forth object detector.
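The exact eASPP configuration is given in (Valada et al., 2019); purely as a sketch of the cascading idea, one branch could be built as follows (channel widths, dilation rates, and the use of the Keras API are our assumptions, not the published layout):

```python
import tensorflow as tf
from tensorflow.keras import layers

def cascaded_atrous_branch(x, channels=64, rate=3):
    """Illustrative cascaded-atrous branch: two stacked 3x3 atrous convolutions
    inside a 1x1 bottleneck, so the field-of-view grows without a single
    large-rate kernel. All sizes are placeholder assumptions."""
    y = layers.Conv2D(channels, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same", dilation_rate=rate,
                      activation="relu")(y)
    y = layers.Conv2D(channels, 3, padding="same", dilation_rate=rate,
                      activation="relu")(y)
    return layers.Conv2D(4 * channels, 1, padding="same", activation="relu")(y)

# Usage sketch: several such branches at different (assumed) rates would be
# concatenated, in the spirit of ASPP, before the decoder.
inp = tf.keras.Input(shape=(24, 48, 512))
out = layers.Concatenate()([cascaded_atrous_branch(inp, rate=r) for r in (3, 6, 12)])
model = tf.keras.Model(inp, out)
```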
3.2 Object Detection
In object detection tasks, we need to perform both localization and classification. The primary issue is that the spatial information required for localization is abundant in the top layers of deep CNNs, whereas the bottom layers have rich contextual information, which is crucial for classifying objects. On the other hand, object instances vary largely in scale: smaller and simpler instances are only found in the top layers, while large and complex instances are found in the bottom layers (see Figure 1). Our proposed network addresses these issues by ensuring that feature maps from both the top and bottom layers are readily available to the object detection module.
We have two auxiliary softmax losses, one after each of the first two up-convolutions in the segmentation decoder, which enforce the feature maps at those intermediate levels to be more spatially similar to the ground-truth labels. These feature maps, concatenated with the feature maps from the skip refinement stage, are forwarded to the object detection network, as shown in Figure 2. We model our object detection network as a one-stage object detector similar to YOLO (Redmon and Farhadi, 2017) because of its simplicity and fast inference time compared to two-stage object detectors. In the object detection network, these feature maps are rapidly downsampled to the bottleneck resolution and concatenated with the feature maps from the bottleneck. We use only 3×3 and 1×1 convolutions in this rapid downsampling to preserve the information of smaller and simpler instances present in the top layers, and we use 3×3 convolutions with stride 2 instead of max pooling to preserve spatial relationships. This rapid downsampling short-
ens the information path between lower layers and
the topmost feature maps.

Figure 2: Overview of our proposed MTL network. The encoder, based on ResNet-50 (He et al., 2016), is depicted in blue; the bottleneck efficient atrous spatial pyramid pooling (eASPP) is depicted in green; the orange part denotes the segmentation decoder; the grey blocks are the skip refinement stages; and the plum part represents the object detector.

Several multi-task learning (MTL) networks (Teichmann et al., 2016), (Sistu et al., 2019) have shared encoders and multiple independent task-specific decoders. Instead, we try to exploit the rich semantic information in the segmentation decoder, which is highly useful for object de-
tection. As we discussed before, high-resolution top layers are essential for object localization, but the top layers of the encoder are low-level, fine-grained, shallow feature maps, whereas the top layers of the segmentation decoder are semantic, coarse-grained, deep feature maps, which are more helpful for object detection. We boost the information flow in a manner similar to the Path Aggregation Network (PANet) (Liu et al., 2018): as shown in Figure 2, the dashed green line indicates the shortcut for low-level feature maps from the top layers, which are effective for accurate localization of instances, whereas the dashed red line, which goes through the whole encoder, carries the deep-level feature maps that are valuable for classifying instances. Also, we forward multi-level features to the object detector by concatenating feature maps from both the segmentation decoder and the skip refinement stage.
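The following is a minimal sketch of this feature routing, assuming the Keras API and placeholder values for the channel width and the number of strided steps needed to reach the bottleneck resolution:

```python
import tensorflow as tf
from tensorflow.keras import layers

def route_features_to_detector(decoder_feats, bottleneck_feats,
                               channels=256, down_steps=2):
    """Sketch of the detector input path: the semantically rich decoder features
    are rapidly downsampled with strided 3x3 convolutions (no max-pooling) and
    1x1 convolutions, then concatenated with the bottleneck features, so the
    detector sees both shallow high-resolution and deep low-resolution cues."""
    x = decoder_feats
    for _ in range(down_steps):  # each strided conv halves the spatial resolution
        x = layers.Conv2D(channels, 3, strides=2, padding="same",
                          activation=tf.nn.leaky_relu)(x)
        x = layers.Conv2D(channels, 1, padding="same",
                          activation=tf.nn.leaky_relu)(x)
    return layers.Concatenate(axis=-1)([x, bottleneck_feats])
```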
In datasets like Cityscapes (Cordts et al., 2016), the scale of classes such as cars and persons varies largely with the distance between the object and the camera, and some vehicle instances occupy a great part of the image. We use the trident block, originally proposed in TridentNet (Li et al., 2019b) for two-stage object detectors, to handle this scale variation. It consists of multiple parallel branches, each a stack of residual blocks with three convolutions of kernel sizes 1×1, 3×3, and 1×1, but each branch of the trident block uses a different dilation rate for the 3×3 convolution, as shown in Figure 3. Weights are shared among these branches, which reduces the number of parameters and ensures that the same transformation is applied at different spatial scales. In our experiments, each branch of the trident block has three anchor boxes, so the output tensor is of shape 24×48×[3 × (4 + 1 + number of classes)], with 4 bounding-box offsets and 1 object-confidence prediction per anchor, similar to YOLO (Redmon and Farhadi, 2017); 24×48 is the resolution of the bottleneck feature map. We use the leaky rectified linear activation function in the object detection part.
Figure 3: Illustration of the trident block depicted in Figure 2. The trident block was initially proposed in TridentNet (Li et al., 2019b).
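A minimal sketch of the weight-sharing idea behind the trident block is given below; it illustrates one shared 3×3 kernel applied at three dilation rates and a YOLO-style prediction head, while the surrounding 1×1 convolutions, the residual connection of Figure 3, and the sharing of the head across branches are simplifications and assumptions on our part:

```python
import tensorflow as tf

class SharedDilatedConv(tf.keras.layers.Layer):
    """One 3x3 kernel applied at several dilation rates: every branch performs
    the same transformation, only the receptive field differs. This is the core
    weight-sharing idea of the trident block (simplified sketch)."""
    def __init__(self, filters, rates=(1, 2, 3)):
        super().__init__()
        self.filters, self.rates = filters, rates

    def build(self, input_shape):
        in_ch = int(input_shape[-1])
        self.kernel = self.add_weight(name="kernel", shape=(3, 3, in_ch, self.filters),
                                      initializer="glorot_uniform", trainable=True)
        self.bias = self.add_weight(name="bias", shape=(self.filters,),
                                    initializer="zeros", trainable=True)

    def call(self, x):
        branches = []
        for rate in self.rates:  # same kernel and bias, different dilation rate
            y = tf.nn.conv2d(x, self.kernel, strides=1, padding="SAME", dilations=rate)
            branches.append(tf.nn.leaky_relu(y + self.bias))
        return branches

# Per-branch prediction head: 3 anchors x (4 offsets + 1 objectness + classes),
# giving a 24 x 48 x [3 * (4 + 1 + classes)] tensor per branch as described above.
def detection_outputs(branch_feats, num_classes=3, anchors_per_branch=3):
    head = tf.keras.layers.Conv2D(anchors_per_branch * (5 + num_classes), 1)
    return [head(f) for f in branch_feats]  # sharing the head across branches is assumed
```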
4 EXPERIMENTS AND RESULTS
In this section, we first give details about the dataset
selection and dataset generation. Then, we describe
our proposed scale-aware training of trident blocks
for one-stage networks, and we also describe our pro-
posed training procedure for an interdependent multi-
tasking network such as ours. Finally, we detail the eval-
uation procedure and results.
4.1 Dataset
We need a dataset with both segmentation labels and bounding boxes for object detection, so we trained and evaluated on the publicly available Cityscapes dataset (Cordts et al., 2016). It is a very popular and highly challenging dataset containing images of complex urban scenes. We used only the provided 2875 finely annotated images for training and 500 for validation. Bounding boxes are not directly provided in the original Cityscapes (Cordts et al., 2016) dataset, but instance-level annotations are. We modified the scripts provided with the Cityscapes (Cordts et al., 2016) dataset to extract bounding boxes from the instance-level annotations for the classes Car/Truck/Bus, Person/Rider, and Bicycle/Motorcycle. In our work, we first resized the images in the dataset to a resolution of 768×384. Then we perform data augmentation by random scaling (1 to 1.5) followed by random crops of resolution 768×384 and random horizontal flipping of the cropped image. We use bilinear interpolation and nearest-neighbour interpolation for rescaling the images and the segmentation labels, respectively. We ignore a bounding box if its center falls outside the randomly cropped image.
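A minimal sketch of this augmentation pipeline, assuming NumPy arrays, boxes stored as float pixel (x1, y1, x2, y2) coordinates, a uint8 label map, and OpenCV for resizing (the image library is our assumption, not specified in the paper):

```python
import numpy as np
import cv2  # assumed image library

def augment(image, label, boxes, out_w=768, out_h=384):
    """Random rescale in [1, 1.5], random 768x384 crop, random horizontal flip;
    bilinear for the image, nearest-neighbour for the label map; boxes whose
    centre leaves the crop are dropped."""
    s = np.random.uniform(1.0, 1.5)
    new_w, new_h = int(image.shape[1] * s), int(image.shape[0] * s)
    image = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    label = cv2.resize(label, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
    boxes = boxes * s

    # Random crop of the target resolution.
    x0 = np.random.randint(0, new_w - out_w + 1)
    y0 = np.random.randint(0, new_h - out_h + 1)
    image = image[y0:y0 + out_h, x0:x0 + out_w]
    label = label[y0:y0 + out_h, x0:x0 + out_w]
    boxes = boxes - [x0, y0, x0, y0]
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    keep = ((centers >= 0) & (centers < [out_w, out_h])).all(axis=1)
    boxes = np.clip(boxes[keep], 0, [out_w, out_h, out_w, out_h])

    # Random horizontal flip of image, labels, and boxes.
    if np.random.rand() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
        boxes = np.stack([out_w - boxes[:, 2], boxes[:, 1],
                          out_w - boxes[:, 0], boxes[:, 3]], axis=1)
    return image, label, boxes
```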
4.2 Training
For calculating the segmentation loss (L_seg), we use the cross-entropy loss function for both the main loss (L_main) and the auxiliary losses (L_aux1, L_aux2). We use bilinear upsampling for the feature maps at each auxiliary loss branch to match the resolution of the segmentation ground-truth labels. We set the weights to λ_1 = 0.6 and λ_2 = 0.5:

L_seg = L_main + λ_1 L_aux1 + λ_2 L_aux2.    (1)
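Eq. (1) can be sketched as follows, assuming the three heads return logits and leaving out the handling of void labels:

```python
import tensorflow as tf

def segmentation_loss(labels, main_logits, aux1_logits, aux2_logits,
                      lam1=0.6, lam2=0.5):
    """Sketch of Eq. (1): cross-entropy on the main output plus weighted
    cross-entropy on the two auxiliary outputs, each bilinearly upsampled to
    the ground-truth resolution first."""
    gt_size = tf.shape(labels)[1:3]
    ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    def branch_loss(logits):
        logits = tf.image.resize(logits, gt_size, method="bilinear")
        return ce(labels, logits)

    return (branch_loss(main_logits)
            + lam1 * branch_loss(aux1_logits)
            + lam2 * branch_loss(aux2_logits))
```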
In our work, we use three branches in the trident block at dilation rates 1, 2, and 3 for small, medium, and large instances, respectively, with three anchor boxes for each branch. Here, we propose a scale-aware training scheme for one-stage detectors to enhance the scale awareness of every branch. We use k-means clustering on the training set to determine the required nine anchor boxes, similar to YOLO (Redmon and Farhadi, 2017), and sort those anchor boxes by area. We then allot the three largest anchor boxes to the branch specialized for large instances (dilation rate 3), the three smallest anchor boxes to the branch specialized for small instances (dilation rate 1), and the remaining three anchor boxes in the middle to the branch specialized for medium-scale instances (dilation rate 2). We use the focal loss (Lin et al., 2018) to calculate the object detection loss (L_det), and a weighted sum of the individual losses of the two tasks to train this multi-tasking network:

L = W_seg L_seg + W_det L_det    (2)

W_seg = 30–50, W_det = 1.    (3)
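The anchor assignment can be sketched as below; plain Euclidean k-means is used here for brevity, whereas the YOLO-style clustering we follow uses an IoU-based distance, and the (width, height) input layout is our assumption:

```python
import numpy as np

def assign_anchors_to_branches(box_wh, k=9, iters=50, seed=0):
    """Cluster training box sizes into k anchors, sort them by area, and split
    them into thirds for the dilation-1/2/3 branches (scale-aware assignment)."""
    box_wh = np.asarray(box_wh, dtype=np.float64)      # shape (N, 2): width, height
    rng = np.random.default_rng(seed)
    centers = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):                              # simple Euclidean k-means
        d = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = box_wh[assign == j].mean(axis=0)
    anchors = centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort by area
    return {1: anchors[:3],    # smallest anchors -> dilation-rate-1 branch
            2: anchors[3:6],   # medium anchors   -> dilation-rate-2 branch
            3: anchors[6:]}    # largest anchors  -> dilation-rate-3 branch
```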
Table 1: Semantic segmentation and object detection results of the proposed network trained on the Cityscapes dataset (Cordts
et al., 2016).
Metrics Train Validation
Road IoU 99.26 97.84
Traffic Sign IoU 84.07 69.84
Pedestrians IoU 87.87 74.25
Building IoU 96.36 90.91
Sky IoU 96.18 93.28
Vegetation IoU 95.73 91.66
Pole IoU 72.65 56.86
SideWalk IoU 95.34 83.91
Fence IoU 91.89 54.48
Rider/Cycle IoU 87.29 71.77
Car/Truck/Bus IoU 96.60 93.07
Mean IoU (mIoU) 91.20 79.81
AP Car/Truck/Bus 71.92 64.17
AP Person/Rider 60.27 50.00
AP Bicycle/Motorcycle 57.86 37.10
Mean AP (mAP) 63.35 50.42
Table 2: Comparison of semantic segmentation and object detection results with another MTL network evaluated on the Cityscapes dataset (Cordts et al., 2016).
Network  Segmentation (mIoU)  Detection (mAP)
Real-time Joint Object Detection and Semantic Segmentation Network for Automated Driving (Sistu et al., 2019)  55.55  23.55
Our proposed MTL network  79.81  50.42
Here, we propose a procedure for training a highly interdependent multi-tasking network that avoids plateau regions, long training periods, imbalanced training of the two tasks, and overfitting. We initialize the encoder of the network with weights pre-trained on the ImageNet dataset (Deng et al., 2009), and He initialization (He et al., 2015) is used for the rest of the network. We use the Adam optimizer with hyperparameters β_1 = 0.9, β_2 = 0.999, and ε = 10^-10. We use dropout layers with probability 0.5 in block 4 of the encoder and just before the trident block. We use polynomial decay of the learning rate with cyclic restarts every 10K iterations. The four training stages are listed below, followed by a code-level sketch.
1. First, we train the whole network for 30K iterations with an initial learning rate of 0.001 to stabilize the network and keep its outputs from diverging.
2. Second, we freeze the object detector and train only the encoder and segmentation decoder for 120K iterations with an initial learning rate of 0.001 and loss weights W_det = 0, W_seg = 30–50. In our network, the object detector extracts feature maps from the segmentation decoder, so if the segmentation decoder is trained well first, it provides good feature maps and the later training of the object detector is smooth; otherwise, the object detector takes many training steps, which leads to overfitting of the encoder.
3. Then, we freeze the encoder and the segmentation decoder and train the object detector for 120K iterations with an initial learning rate of 0.0001 and loss weights W_seg = 0, W_det = 1. Freezing the segmentation decoder ensures that the additional object detection task does not interfere with the segmentation task or cause any loss of segmentation accuracy.
4. Finally, we train the whole network for another 50K iterations with the learning rate starting from 0.0001.
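A code-level sketch of these four stages is given below; the sub-module names, the fixed W_seg = 30, the data layout, and the omission of the cyclic polynomial learning-rate schedule are all simplifying assumptions:

```python
import tensorflow as tf

def set_trainable(model, encoder, seg_decoder, detector):
    # Assumes the model exposes `encoder`, `seg_decoder`, and `detector` sub-modules.
    model.encoder.trainable = encoder
    model.seg_decoder.trainable = seg_decoder
    model.detector.trainable = detector

STAGES = [
    # (iterations, initial lr, W_seg factor, W_det, encoder, seg_decoder, detector)
    ( 30_000, 1e-3, 1.0, 1.0, True,  True,  True ),   # 1: warm up the whole network
    (120_000, 1e-3, 1.0, 0.0, True,  True,  False),   # 2: segmentation only
    (120_000, 1e-4, 0.0, 1.0, False, False, True ),   # 3: detector only
    ( 50_000, 1e-4, 1.0, 1.0, True,  True,  True ),   # 4: joint fine-tuning
]

def train(model, dataset, seg_loss_fn, det_loss_fn, w_seg_scale=30.0):
    """Stage-wise training via freezing; the learning-rate schedule with cyclic
    restarts is omitted and a constant W_seg = 30 is assumed for brevity."""
    for iters, lr, w_seg, w_det, enc, seg, det in STAGES:
        set_trainable(model, enc, seg, det)
        opt = tf.keras.optimizers.Adam(learning_rate=lr, beta_1=0.9,
                                       beta_2=0.999, epsilon=1e-10)
        for _, (image, seg_gt, det_gt) in zip(range(iters), dataset):
            with tf.GradientTape() as tape:
                seg_out, det_out = model(image, training=True)  # assumed output pair
                loss = (w_seg_scale * w_seg * seg_loss_fn(seg_gt, seg_out)
                        + w_det * det_loss_fn(det_gt, det_out))
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))
```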
4.3 Evaluation
Table 3: Comparison of single-task segmentation networks (DPC (Chen et al., 2018a), DeepLabv3+ (Chen et al., 2018b), PSPNet (Zhao et al., 2017), HRNetV2-W48 (Wang et al., 2019), Mapillary (Bulò et al., 2018)) with our proposed multi-task network.
Network  Params (M)  mIoU
DPC  41.8  80.9
DeepLabv3+  43.5  79.6
PSPNet  56.3  80.9
HRNetV2-W48 (SOTA)  65.9  81.1
Mapillary  135.9  78.3
Our proposed MTL network  37.9  79.8

We implemented our proposed multi-tasking network using the TensorFlow (Abadi et al., 2015) deep learning library. We carried out the experiments on a system with one consumer-grade NVIDIA GeForce GTX 1080 Ti GPU. Per-class IoU and mean class IoU (Intersection over Union) are used as accuracy metrics for semantic segmentation, and per-class average precision and mean average precision (mAP) are used for object detection. During inference, we use non-maximal suppression with an IoU threshold of 0.55 to handle multiple detections of the same object.
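As an illustration, this inference-time filtering can be sketched with the standard TensorFlow NMS primitive (the cap on the number of detections is our assumption):

```python
import tensorflow as tf

def filter_detections(boxes, scores, max_detections=100, iou_thresh=0.55):
    """Non-maximum suppression with an IoU threshold of 0.55 to remove duplicate
    detections of the same object. `boxes` are [N, 4] in (y1, x1, y2, x2) order,
    `scores` are [N]; the 100-box cap is an assumed value."""
    keep = tf.image.non_max_suppression(boxes, scores, max_detections,
                                        iou_threshold=iou_thresh)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```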
In Table 1, we summarize the results of our proposed MTL network. We trained our network on the Cityscapes dataset with 11 semantic segmentation classes and three object detection classes. We achieved a semantic segmentation mean class Intersection over Union (mIoU) of 79.81 and an object detection score of 50.42 mean average precision (mAP) with only 37.9M parameters and an inference time of 74 ms on a consumer-grade GPU. Table 2 shows the accuracy leap compared to a recently proposed MTL network for joint object detection and semantic segmentation evaluated on the Cityscapes dataset. Table 3 shows the computational effectiveness of our proposed MTL network: we perform two tasks with only 37.9M parameters, which is considerably fewer than the other single-task segmentation networks themselves. This shows that our proposed network provides an accurate MTL solution for simultaneous semantic segmentation and object detection with the right trade-off between performance and computational complexity, making it an efficient way to perform visual perception tasks in resource-constrained environments with fast inference times.
5 CONCLUSIONS
In this paper, we discussed the importance of computationally light and fast visual perception algorithms. We examined the relationships between
different layers and different features. Then, we pro-
posed a multi-task learning framework for simulta-
neous semantic segmentation and object detection.
We focused on exploiting the helpful feature maps
from the decoder. We proposed a training scheme
for interdependent MTL networks. We centered on
designing a computationally efficient network to de-
ploy on resource-constrained platforms. We evaluated and shared the results of the proposed network on the Cityscapes dataset (Cordts et al., 2016). We discussed the effectiveness of our proposed network compared to another MTL network and to single-task networks.
ACKNOWLEDGEMENTS
This research is supported by SATREPS Project of
JST and JICA: Smart Transport Strategy for Thailand
4.0 Realizing better quality of life and low-carbon
society, by the Japan Society for the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (C) (17K00252), and by a Chubu University Grant.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-
lur, M., Levenberg, J., Mané, D., Monga, R., Moore,
S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-
houcke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and
Zheng, X. (2015). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software avail-
able from tensorflow.org.
Bulò, S. R., Porzi, L., and Kontschieder, P. (2018). In-place
activated batchnorm for memory-optimized training
of dnns. In CVPR, pages 5639–5647. IEEE Computer
Society.
Cai, Z., Fan, Q., Feris, R. S., and Vasconcelos, N. (2016).
A unified multi-scale deep convolutional neural net-
work for fast object detection. In Leibe, B., Matas, J.,
Sebe, N., and Welling, M., editors, ECCV (4), volume
9908 of Lecture Notes in Computer Science, pages
354–370. Springer.
Caruana, R. (1993). Multitask learning: A knowledge-
based source of inductive bias. In Proceedings of the
Tenth International Conference on Machine Learning,
pages 41–48. Morgan Kaufmann.
Chen, L.-C., Collins, M. D., Zhu, Y., Papandreou, G., Zoph,
B., Schroff, F., Adam, H., and Shlens, J. (2018a).
Searching for efficient multi-scale architectures for
dense image prediction. In Bengio, S., Wallach,
H. M., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R., editors, NeurIPS, pages 8713–
8724.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. (2016). Deeplab: Semantic image segmen-
tation with deep convolutional nets, atrous convolu-
tion, and fully connected crfs. IEEE Transactions on
Pattern Analysis and Machine Intelligence, PP.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and
Adam, H. (2018b). Encoder-decoder with atrous sepa-
rable convolution for semantic image segmentation. In
Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss,
Y., editors, ECCV (7), volume 11211 of Lecture Notes
in Computer Science, pages 833–851. Springer.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Duong, L., Cohn, T., Bird, S., and Cook, P. (2015). Low
resource dependency parsing: Cross-lingual parame-
ter sharing in a neural network parser. pages 845–
850, Beijing, China. Association for Computational
Linguistics.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2012).
Scene parsing with multiscale feature learning, purity
trees, and optimal covers. In ICML. icml.cc / Omni-
press.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu,
H. (2019). Dual attention network for scene segmen-
tation. In CVPR, pages 3146–3154. Computer Vision
Foundation / IEEE.
Girshick, R. (2015). Fast r-cnn. In The IEEE International
Conference on Computer Vision (ICCV).
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In 2014 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
volume 00, pages 580–587.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving
deep into rectifiers: Surpassing human-level perfor-
mance on imagenet classification. IEEE International
Conference on Computer Vision (ICCV 2015), 1502.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyra-
mid pooling in deep convolutional networks for visual
recognition. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(9):1904–1916.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity
Mappings in Deep Residual Networks. arXiv e-prints,
page arXiv:1603.05027.
Lee, N., Ajanthan, T., and Torr, P. H. S. (2018). SNIP:
Single-shot Network Pruning based on Connection
Sensitivity. arXiv e-prints, page arXiv:1810.02340.
Li, X., Zhao, H., Han, L., Tong, Y., and Yang, K. (2019a).
GFF: Gated Fully Fusion for Semantic Segmentation.
arXiv e-prints, page arXiv:1904.01803.
Li, Y., Chen, Y., Wang, N., and Zhang, Z. (2019b). Scale-
Aware Trident Networks for Object Detection. arXiv
e-prints, page arXiv:1901.01892.
Lin, T.-Y., Dollár, P., Girshick, R. B., He, K., Hariharan, B.,
and Belongie, S. J. (2017). Feature pyramid networks
for object detection. In CVPR, pages 936–944. IEEE
Computer Society.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2018). Focal loss for dense object detection. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, PP:1–1.
Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). Path
aggregation network for instance segmentation. In
CVPR, pages 8759–8768. IEEE Computer Society.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
C.-Y., and Berg, A. (2016). Ssd: Single shot multibox
detector. volume 9905, pages 21–37.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Misra, I., Shrivastava, A., Gupta, A., and Hebert, M. (2016).
Cross-stitch networks for multi-task learning. CoRR,
abs/1604.03539.
Redmon, J. and Farhadi, A. (2017). Yolo9000: Better,
faster, stronger. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Cortes, C., Lawrence, N. D.,
Lee, D. D., Sugiyama, M., and Garnett, R., editors,
NIPS, pages 91–99.
Shrivastava, A., Sukthankar, R., Malik, J., and Gupta, A.
(2016). Beyond Skip Connections: Top-Down Mod-
ulation for Object Detection. arXiv e-prints, page
arXiv:1612.06851.
Simonyan, K. and Zisserman, A. (2014). Very Deep Con-
volutional Networks for Large-Scale Image Recogni-
tion. arXiv e-prints, page arXiv:1409.1556.
Sistu, G., Leang, I., and Yogamani, S. (2019). Real-time
Joint Object Detection and Semantic Segmentation
Network for Automated Driving. arXiv e-prints, page
arXiv:1901.03912.
Teichmann, M., Weber, M., Zoellner, M., Cipolla, R., and
Urtasun, R. (2016). MultiNet: Real-time Joint Se-
mantic Reasoning for Autonomous Driving. arXiv e-
prints, page arXiv:1612.07695.
Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., and
Smeulders, A. W. M. (2013). Selective search for ob-
ject recognition. International Journal of Computer
Vision, 104(2):154–171.
Valada, A., Mohan, R., and Burgard, W. (2019). Self-
supervised model adaptation for multimodal seman-
tic segmentation. International Journal of Computer
Vision.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao,
Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., and
Xiao, B. (2019). Deep High-Resolution Representa-
tion Learning for Visual Recognition. arXiv e-prints,
page arXiv:1908.07919.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyra-
mid scene parsing network. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L.,
and Ling, H. (2019). M2det: A single-shot object de-
tector based on multi-level feature pyramid network.
Proceedings of the AAAI Conference on Artificial In-
telligence, 33:9259–9266.