PanDepth: Joint Panoptic Segmentation and Depth Completion
Juan Pablo Lagos and Esa Rahtu
Tampere University, Tampere, Finland
https://orcid.org/0000-0001-8767-0864
Keywords:
Panoptic Segmentation, Instance Segmentation, Semantic Segmentation, Depth Completion, CNN,
Multi-Task Learning.
Abstract:
Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple
computer vision tasks are involved. Multi-task models provide different types of outputs for a given scene,
yielding a more holistic representation while keeping the computational cost low. We propose a multi-task
model for panoptic segmentation and depth completion using RGB images and sparse depth maps. Our model
successfully predicts fully dense depth maps and performs semantic segmentation, instance segmentation, and
panoptic segmentation for every input frame. Extensive experiments were conducted on the Virtual KITTI 2 dataset, and we demonstrate that our model solves multiple tasks without a significant increase in computational cost while maintaining high accuracy. Code is available at https://github.com/juanb09111/PanDepth.git.
1 INTRODUCTION
Producing a holistic representation of a given scene has become essential in computer vision. Traditional tasks such as semantic segmentation, instance segmentation, pose estimation, edge estimation, or depth completion each provide only a limited representation that, on its own, is not enough for more complex applications such as autonomous driving. There, in addition to estimating the distance to the objects and stuff on and around the road, it is essential to understand the semantic context of the scene, that is, to identify the types of objects around, e.g. cars, pedestrians, road lanes, and traffic signs, while the depth to such objects is estimated. This raises the need for multi-task models that are capable of solving several tasks in parallel while keeping the computational cost low.
This work is inspired by the idea of devising a model that combines panoptic segmentation and depth completion, which is highly relevant in applications such as autonomous driving, where understanding 3D environments semantically is pivotal for the performance of autonomous machines. We explore the hypothesis that panoptic segmentation and depth completion can use cues from one another; more explicitly, that there are depth features that contain relevant semantic cues, just as there are semantic segmentation features that contain relevant depth
cues.
Figure 1: The proposed model (PanDepth) takes RGB images and sparse depth and returns the corresponding panoptic segmentation and fully dense depth map, with which we create a 3D panoptic segmentation representation of the input frame.
Multi-task networks not only reduce the demand for computational resources compared to running multiple single-task networks, but there is also empirical evidence that they can perform better on each individual task by jointly learning features from all the tasks involved (Ruder, 2017; Sener and Koltun, 2018; Lagos and Rahtu, 2022). For instance,
depth features can be helpful for performing seman-
tic segmentation and vice-versa. Even in applications
where solving a single task is the primary goal, introducing features from other tasks may improve accuracy and performance. Such a paradigm is known as auxiliary tasks (Liebel and Körner, 2018a; Li and Dong, 2021), whereby solving auxiliary tasks yields relevant features that lead to better performance on the main task.
We focus on solving three tasks in a joint manner,
namely, semantic segmentation, instance segmenta-
tion, and depth completion using convolutional neu-
ral networks (CNNs). Combining semantic segmen-
tation and instance segmentation into one single rep-
resentation is known as panoptic segmentation (Kir-
illov et al., 2018). It provides a representation of an
image where not only every pixel is assigned a label
from a list of predefined labels, as in the case of se-
mantic segmentation, but also, objects are detected as
instances of a specific class, thus providing valuable
information, such as the number of cars, people, or
objects of a certain kind that are found in the image,
as well as the semantic context of the non-countable
stuff in the scene. Countable and non-countable ob-
jects are usually referred to as "things" and "stuff" in
the context of computer vision (Adelson, 2001).
While semantic segmentation produces a single
output, pixel-wise classification, instance segmenta-
tion produces three different outputs: bounding boxes
for the objects detected, a label for each bounding
box, and a segmentation mask for each object de-
tected. The outputs of the two tasks, semantic segmentation and instance segmentation, are usually fused
using heuristic methods with no learnable parameters
(Mohan and Valada, 2020; Xiong et al., 2019).
On the other hand, depth completion aims to pro-
duce a dense depth map from sparse depth points
which cover only a few pixels from a given image.
Sparse depth maps can be obtained with active depth
sensors, such as lidars. When 3D points obtained with
lidars are projected onto an image, only about 5% of
the image is covered (Uhrig et al., 2017). The goal is
then to produce a dense depth map, with depth values
for all the pixels in the image, given a sparse depth
map as input.
In this paper, we propose an end-to-end model
for panoptic segmentation and depth completion us-
ing joint training in order to provide a more holis-
tic representation of the input images. In contrast
with other works where predictions are made based
on RGB images only (Gao et al., 2022; Schon et al.,
2021; Yuan et al., 2021), our model processes hetero-
geneous data jointly, that is, RGB images and sparse
depth maps as shown in Figure 1. For most machine
perception applications, active depth sensors are part
of the setup, which is why we consider it more relevant to integrate RGB images together with sparse depth maps. We also quantify the effects of joint train-
ing as compared to training every task individually,
thus providing more data on the growing evidence of
the advantages of multi-task networks. We conduct
extensive experiments on Virtual KITTI 2 (Cabon
et al., 2020), which is a relevant dataset in the con-
text of autonomous driving that contains ground truth
annotations for instance segmentation, semantic seg-
mentation and ground truth depth maps available for
the entire dataset. Although panoptic segmentation
ground truth is not directly provided by the Virtual KITTI 2 dataset, we use semantic and instance segmentation
ground truth to generate panoptic segmentation anno-
tations.
2 RELATED WORKS
2.1 Panoptic Segmentation
Early works in computer vision developed CNN ar-
chitectures for performing semantic segmentation and
instance segmentation independently with reasonable
success (Long et al., 2014; Ronneberger et al., 2015;
He et al., 2017). Later on, Kirillov et al. (2018) pro-
posed a task that would combine both tasks into one,
which they named panoptic segmentation. Kirillov
et al. (2018) also defined a metric for assessing the
performance of panoptic segmentation predictions re-
ferred to as panoptic quality (PQ), thus, providing a
complete definition of the problem of panoptic seg-
mentation with a target metric for performance com-
parison. Such a robust definition of the task caught the attention of the community, leading to the first archi-
tectures for end-to-end panoptic segmentation using
CNNs (Li et al., 2018; Hou et al., 2019; Cheng et al.,
2019; Xiong et al., 2019; Liu et al., 2019; de Geus
et al., 2019; Kirillov et al., 2019; Petrovai and Nede-
vschi, 2019).
The most common challenges that appeared with
panoptic segmentation are how to optimize a shared
feature extractor as well as how to combine seman-
tic segmentation and instance segmentation predic-
tions while keeping the computational cost low. Mo-
han and Valada (2020) proposed a model for panoptic
segmentation which consists of two heads, namely,
semantic segmentation and instance segmentation, a
fusion module for combining the outputs of both
heads, and a feature extractor based on a family of
scalable CNNs known as EfficientNet (Tan and Le,
2019), where the resolution, depth, and width are
balanced depending on the computational resources
available. For multi-scale features, Mohan and Val-
ada (2020) wrap the feature extractor into a two-way
feature pyramid network (FPN). Similarly, Chen et al.
(2020a) applied the same concept of scalable networks to Residual Networks (ResNets) to perform panoptic segmentation.
Other approaches (Wang et al., 2020; Carion et al.,
2020; Zhu et al., 2020; Cheng et al., 2021b; Li et al.,
2021; Cheng et al., 2021a) have adopted the transformer architecture (Vaswani et al., 2017), initially designed for text processing and sequence transduction, and integrated attention mechanisms for panoptic segmentation. In contrast with more traditional methods, where instance segmentation and semantic segmentation are defined as sub-tasks, transformer-based models use queries to represent "things" and "stuff" classes and perform panoptic segmentation.
2.2 Depth Completion
The task of depth completion aims to transform a
sparse depth map, usually obtained with active depth
sensors e.g. Lidar, into a dense depth map. Lidar
devices can only provide a limited amount of depth
points when projected onto the corresponding image,
raising the need for methods that can lead to a fully-
dense representation of the depth of an entire image.
Several works have used RGB images as guidance for
depth completion (Qiu et al., 2018; Eldesokey et al.,
2018; Gansbeke et al., 2019; Tang et al., 2019; Yang
et al., 2019; Park et al., 2020; Hu et al., 2021). Jaritz
et al. (2018) proposed an encoder-decoder network ar-
chitecture for depth completion, based on a late fusion
of RGB images and sparse depth maps. However,
processing RGB and Lidar data jointly is not trivial, since, in contrast to RGB images, sparse depth data lacks a natural grid structure unless it is projected onto a 2D space, which in turn enables the use of standard 2D convolutional layers.
Nonetheless, when mapping 3D data to 2D, valu-
able information regarding the geometrical relation-
ship among the points in the 3D space is lost. Chen
et al. (2020b) introduced a fuse block that exploits 3D
cues by using parametric continuous convolution lay-
ers (Wang et al., 2018) while using 2D convolutions
for processing RGB and later fusing the correspond-
ing features in 2D space. This 2D-3D fusion method is an essential building block in the proposed model: with slight modifications to the model proposed by Chen et al. (2020b), we successfully map sparse depth maps to dense depth maps.
2.3 Multi Task Learning
CNNs can benefit from performing multiple tasks, as
opposed to single-task networks. Branched CNNs
consist of shared layers as well as task-specific lay-
ers, also known as branches. When such CNNs are
trained, the weights of the shared layers are adjusted
via back-propagation from each one of the branches,
each one of which has one or multiple loss func-
tions defined. In turn, the shared layers learn rele-
vant features for all tasks, and such features are then
fed to every branch, allowing information to flow between the different branches through the shared layers.
There is increasing evidence that individual tasks benefit when models are trained jointly, improving the performance of each task tackled by the network (Liebel and Körner, 2018b; Liu et al., 2018; Liebel and Körner, 2019; Zou et al., 2020a; Guo et al., 2020).
While some multi-task networks have addressed relatively similar tasks, e.g. instance segmentation
and semantic segmentation, other works have com-
bined semantic segmentation and depth completion as
end-to-end models (Hazirbas et al., 2016; Zou et al.,
2020b; He et al., 2021). Lagos and Rahtu (2022)
proposed a combined model for semantic segmen-
tation and depth completion using RGB images and
sparse depth maps, where it is demonstrated quantita-
tively and visually how each task outperforms equiv-
alent single-task models for semantic segmentation
and depth completion trained independently. Our
model performs depth completion, instance segmen-
tation, and semantic segmentation. We fuse instance
and semantic segmentation to obtain a panoptic seg-
mentation representation. In contrast with other methods, our model processes heterogeneous data, more specifically RGB images and sparse depth maps, using a stack of 2D-3D fuse blocks as proposed by Chen et al. (2020b).
3 ARCHITECTURE
3.1 Overview
The proposed model performs panoptic segmenta-
tion and depth completion in an end-to-end man-
ner. It consists of a two-way feature pyramid net-
work (FPN) as a shared feature extractor, three task-
specific branches, one for each task (semantic seg-
mentation, instance segmentation, and depth comple-
tion), one joint branch that refines the semantic log-
its using the resulting depth maps as guidance, and
one final block for combining semantic and instance
logits based on the fusion block proposed by Mohan
and Valada (2020).
Figure 2: Overview of the proposed PanDepth architecture. Given an RGB image and a sparse depth map as input, our model outputs the corresponding dense depth map and panoptic segmentation.
The inputs to our model are RGB images and sparse depth maps, and the outputs are the corresponding panoptic segmentation representation and fully dense depth map. The panoptic segmenta-
tion output and the resulting depth maps can be fur-
ther combined to produce a 3D panoptic segmentation
representation as shown in Figure 1.
3.2 Backbone
The backbone consists of a two-way FPN with an
EfficientNet-B5 (Tan and Le, 2019) at the core, as
shown in Figure 3. On one hand, the FPN upsam-
ples lower resolution features and adds them together.
On the other hand, the FPN downsamples higher-
resolution features and adds them together. This al-
lows for multi-scale feature extraction. The backbone
returns feature maps at four different scales, down-
scaled by a factor of ×4, ×8, ×16, and ×32 with re-
spect to the spatial resolution at the input.
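As a rough illustration of the backbone's contract, the following sketch builds an EfficientNet-B5 feature extractor (via the timm library, an assumption on our side) and mixes the four scales with a simplified sequential top-down and bottom-up pass; the actual two-way FPN of Mohan and Valada (2020) runs the two paths in parallel, so this is only a shape-compatible approximation.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; any EfficientNet-B5 feature extractor would do


class TwoWayFPNBackboneSketch(nn.Module):
    """Shape-compatible sketch: EfficientNet-B5 features at strides 4/8/16/32,
    projected to a common width and mixed top-down and bottom-up."""

    def __init__(self, out_channels=256):
        super().__init__()
        # out_indices (1, 2, 3, 4) select the stages at strides 4, 8, 16 and 32
        self.encoder = timm.create_model(
            "efficientnet_b5", features_only=True, out_indices=(1, 2, 3, 4))
        chs = self.encoder.feature_info.channels()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in chs)

    def forward(self, x):
        feats = [lat(f) for lat, f in zip(self.lateral, self.encoder(x))]
        # top-down: upsample coarser maps and add them to finer ones
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + nn.functional.interpolate(
                feats[i + 1], size=feats[i].shape[-2:], mode="nearest")
        # bottom-up: downsample finer maps and add them to coarser ones
        for i in range(1, len(feats)):
            feats[i] = feats[i] + nn.functional.adaptive_max_pool2d(
                feats[i - 1], feats[i].shape[-2:])
        return feats  # [P4, P8, P16, P32], each with `out_channels` channels
```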
3.3 Semantic Segmentation Branch
The semantic segmentation branch is a lightweight structure that consists of three main building blocks based on the model proposed by Mohan and Valada (2020). Firstly, a Large Scale Feature Extractor (LSFE) extracts localized fine features. Secondly, a small-scale feature extractor based on Dense Prediction Cells (DPC) processes the coarser scales, and finally, a Mismatch Correction Module (MC) is used in order to properly aggregate features at different scales. The input to this branch consists of the four feature maps returned by the backbone at the scales ×4, ×8, ×16, and ×32. The tensors returned by the LSFE and DPC modules are aggregated as shown in Figure 3. Finally, this branch returns preliminary semantic segmentation logits of size nc × H × W, where nc is the total number of classes and H × W is the spatial resolution (height × width) of the input. At a later stage, the preliminary semantic segmentation logits are refined in the joint branch, using the predicted depth maps as guidance.
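For intuition, a much-simplified stand-in for this branch is sketched below: each scale is processed by a plain 3 × 3 convolution, upsampled to the finest scale, summed, and mapped to nc logits at the input resolution. The LSFE, DPC, and MC modules of the actual branch are deliberately not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticHeadSketch(nn.Module):
    """Simplified stand-in for the semantic branch: per-scale convolutions,
    upsampling to the x4 scale, summation, and classification to nc channels."""

    def __init__(self, num_classes, in_channels=256):
        super().__init__()
        self.scale_convs = nn.ModuleList(
            nn.Conv2d(in_channels, 128, 3, padding=1) for _ in range(4))
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, fpn_feats, out_size):
        # fpn_feats: [P4, P8, P16, P32]; out_size: (H, W) of the input image
        target = fpn_feats[0].shape[-2:]
        fused = sum(
            F.interpolate(conv(f), size=target, mode="bilinear", align_corners=False)
            for conv, f in zip(self.scale_convs, fpn_feats))
        logits = self.classifier(fused)  # nc x H/4 x W/4
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)  # nc x H x W
```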
3.4 Instance Segmentation Branch
The instance segmentation branch is a lighter ver-
sion of Mask R-CNN (He et al., 2017). Following
the modifications suggested by Mohan and Valada
(2020), all the convolutions were replaced by depth-
wise separable convolutions (Chollet, 2016), batch
normalization layers were replaced by synchronized
Inplace Activated Batch Normalization layers (iABN)
(Bulò et al., 2017), and the ReLU activations were replaced by Leaky ReLU.
Similar to Mask R-CNN, the instance segmenta-
tion branch consists of two stages. In the first stage, a
region proposal network (RPN) returns a set of rectan-
gular regions with a corresponding objectness score.
Thereafter, a RoIAlign module extracts small feature
maps of size 7 × 7 from the regions returned by the
RPN. Subsequently, those features are used as input
to two sub-branches that run in parallel, one of which
regresses bounding boxes and classifies the objects of
each corresponding box, and another sub-branch that
regresses the corresponding masks returning an out-
put tensor of size NI ×28 ×28, where NI corresponds
to the number of instances detected.
3.5 Depth Completion Branch
This branch processes each frame independently and takes three different types of input. Firstly, a sparse depth map originating from a 3D-to-2D projection of a point cloud.
Secondly, the corresponding RGB frame, and thirdly
a preliminary semantic segmentation map as shown
in Figure 3. Our depth completion branch is based
on the architecture proposed by Chen et al. (2020c),
upon which we made modifications in order to use
preliminary semantic segmentation maps as proposed
by Lagos and Rahtu (2022). At the input level, the
sparse depth map is passed through two 2D convo-
lutional layers of kernel size 3 × 3, while the RGB
image and the semantic segmentation map are con-
catenated and passed through two 2D convolutional
layers of kernel size 3 × 3. Subsequently, the two corresponding outputs are concatenated and, along with the original sparse depth map, they serve as input to a stack of N 2D-3D fuse blocks. Finally, the resulting tensor from the fuse blocks passes through
two 2D convolutional layers of kernel size 3 × 3 for
refinement, yielding the final fully dense depth map.
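The wiring described above can be summarized with the sketch below; the stack of 2D-3D fuse blocks is kept as a placeholder (a single convolution here), since the parametric continuous convolutions of Chen et al. (2020b) are outside the scope of this snippet, and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DepthBranchSketch(nn.Module):
    """Input/output plumbing of the depth completion branch: two 3x3 convs on
    the sparse depth, two 3x3 convs on the concatenated RGB + semantic map,
    concatenation, a placeholder for the N 2D-3D fuse blocks, and a two-conv
    refinement that outputs the dense depth map."""

    def __init__(self, num_classes, feat_ch=32, fuse_blocks=None):
        super().__init__()
        self.depth_stem = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.rgb_sem_stem = nn.Sequential(
            nn.Conv2d(3 + num_classes, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        # placeholder standing in for the stack of N 2D-3D fuse blocks
        self.fuse = fuse_blocks or nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 3, padding=1))

    def forward(self, sparse_depth, rgb, sem_logits):
        d = self.depth_stem(sparse_depth)
        rs = self.rgb_sem_stem(torch.cat([rgb, sem_logits], dim=1))
        fused = self.fuse(torch.cat([d, rs], dim=1))  # the real fuse blocks would
        return self.refine(fused)                     # also consume sparse_depth
```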
Figure 3: PanDepth architecture. Our model consists of a feature extractor, three task-specific branches (i.e. instance segmen-
tation, semantic segmentation, and depth completion), a joint branch, and a panoptic fusion module. The convolutional layers
in this diagram follow the notation Conv(k,s,c) ×n representing a stack of n convolutional layers where k refers to a kernel of
size k × k, s is the stride, c is the number of output feature channels, and FC represents a fully connected layer.
3.6 Joint Branch
We use the depth completion output as guidance for
refining the semantic segmentation preliminary out-
put. This branch, albeit simple, successfully improves the performance of the semantic segmentation task.
It consists of four stacked 2D convolutions of kernel
size 3 × 3. The input to this branch is the concate-
nation of the output of the depth completion branch,
that is, a fully dense depth map, and the preliminary
semantic segmentation output. Finally, this branch re-
turns a tensor of size nc × H × W, where nc is the total
number of classes in the dataset, H and W correspond
to the original height and width of the model’s input
respectively.
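Because this branch is fully specified by the text (a concatenation followed by four stacked 3 × 3 convolutions), it can be written down almost directly; only the hidden channel width and the interleaved activations below are our assumptions.

```python
import torch
import torch.nn as nn


class JointBranch(nn.Module):
    """Refines the preliminary semantic logits using the predicted dense depth
    map as guidance: concatenation followed by four stacked 3x3 convolutions."""

    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(num_classes + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, 3, padding=1))

    def forward(self, dense_depth, sem_logits):
        # dense_depth: B x 1 x H x W, sem_logits: B x nc x H x W
        return self.refine(torch.cat([dense_depth, sem_logits], dim=1))  # B x nc x H x W
```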
3.7 Loss Functions
Semantic Segmentation. We used the weighted per-pixel log-loss for semantic segmentation. It is defined as follows:

L_{semantic} = -\sum_i w_i \, p_i \log \hat{p}_i, \qquad (1)

where i is the pixel index, w_i = \frac{4}{WH} if pixel i is within the 25% worst predictions, and w_i = 0 otherwise. W and H correspond to the width and height of the input image respectively, and p_i and \hat{p}_i are the ground truth and the predicted probability for pixel i of belonging to its class label, respectively. The predicted probability \hat{p}_i is computed using the Softmax function defined as:

\mathrm{Softmax}(x_n) = \frac{\exp(x_n)}{\sum_m \exp(x_m)}. \qquad (2)
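A possible PyTorch realization of Eqs. (1) and (2) is shown below; selecting the 25% hardest pixels per image and averaging over the batch is our reading of the definition, not code from the authors' repository.

```python
import torch
import torch.nn.functional as F


def semantic_loss(logits, target, worst_frac=0.25):
    """Weighted per-pixel log-loss (Eq. 1): only the 25% worst-predicted pixels
    receive the weight 4 / (W*H); all other pixels are ignored."""
    B, _, H, W = logits.shape
    log_p = F.log_softmax(logits, dim=1)               # Eq. (2) in log space
    nll = F.nll_loss(log_p, target, reduction="none")  # -log p_hat at the gt class
    nll = nll.view(B, -1)
    k = int(worst_frac * H * W)
    worst, _ = nll.topk(k, dim=1)                      # hardest 25% of the pixels
    return (4.0 / (W * H)) * worst.sum(dim=1).mean()
```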
Instance Segmentation. We adopted the loss functions for instance segmentation as defined in Mask R-CNN (He et al., 2017). Losses are defined for the two stages of this branch. In the first stage (the RPN), we calculate two losses, namely, the objectness score loss L_{os} and the object proposal loss L_{op}. For the second stage, we calculate three losses: the classification loss L_{cls}, the bounding-box regression loss L_{box}, and the mask loss L_{mask}. The total loss for the instance segmentation branch is given by:

L_{instance} = L_{os} + L_{op} + L_{cls} + L_{box} + L_{mask} \qquad (3)
Depth Completion. We used Mean Squared Error
(MSE) as loss for the depth completion branch. The
MSE was calculated and averaged over the pixels for
which the corresponding ground truth depth values
were available in the sparse depth map. The loss func-
tion is defined by
L_{depth} = \frac{1}{N} \sum_i (\hat{y}_i - y_i)^2, \qquad (4)

where N is the number of pixels, \hat{y}_i is the predicted value and y_i is the ground truth value for pixel i.
Joint Loss. In addition to the loss function of each task, we compute a loss involving all of the tasks performed by our model, namely semantic segmentation, instance segmentation, and depth completion. This loss is simply the sum of the task-specific losses, as in Eq. (5):

L_{joint} = L_{semantic} + L_{instance} + L_{depth}. \qquad (5)
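A minimal sketch of Eqs. (4) and (5), assuming that missing measurements in the sparse ground truth map are encoded as zeros:

```python
import torch


def depth_loss(pred, sparse_gt):
    """MSE (Eq. 4) averaged over the pixels that carry a ground truth value in
    the sparse depth map; zero entries are assumed to mark missing pixels."""
    valid = sparse_gt > 0
    return ((pred[valid] - sparse_gt[valid]) ** 2).mean()


def joint_loss(l_semantic, l_instance, l_depth):
    """Joint loss (Eq. 5): unweighted sum of the task-specific losses."""
    return l_semantic + l_instance + l_depth
```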
4 EXPERIMENTS
4.1 Implementation Details
We trained our model for 50 epochs on one machine
with four 32GB graphics processing units (GPUs)
running in parallel. The loss functions were optimized using the Adam optimizer with a learning rate of 0.0002.
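The reported optimization setup translates into a short training loop such as the one below; `model` and `train_loader` are hypothetical placeholders, and only the optimizer, learning rate, and epoch count come from the text.

```python
import torch

# Hypothetical model returning the three per-task losses for a batch.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)

for epoch in range(50):
    for rgb, sparse_depth, targets in train_loader:
        optimizer.zero_grad()
        l_sem, l_inst, l_depth = model(rgb, sparse_depth, targets)
        (l_sem + l_inst + l_depth).backward()  # joint loss, Eq. (5)
        optimizer.step()
```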
4.2 Dataset
We trained and tested our models on Virtual KITTI 2
(Cabon et al., 2020). It is a synthetic dataset that pro-
vides ground truth annotations for semantic segmen-
tation, instance segmentation, depth estimation, and
optical flow for the entire dataset. It consists of five
scenes, named "Scene01", "Scene02", "Scene06", "Scene18", and "Scene20", which account for a total of 2126 unique frames of stereo images that are augmented to recreate 10 different environment conditions: clone, fog, morning, overcast, rain, sunset, and four angle variations corresponding to ±15° and ±30° around the vertical axis. In total, Virtual KITTI 2 contains 21260 RGB stereo frames.
In our experiments, we discarded the angle variation splits, ±15° and ±30°, to reduce redundancy
in the dataset and kept the other six splits for train-
ing, evaluation, and testing. We trained on scenes
"Scene01", "Scene06", and "Scene20", evaluated on "Scene18", and tested on "Scene02". We resized the
input frames to 200px height and 1000px width.
(a) Fully dense depth map.
(b) Depth map, sparsity = 20%.
(c) Depth map, sparsity = 5%.
Figure 4: Depth maps visualization at different sparsity lev-
els.
Pre-Processing. Since Virtual KITTI 2 is a synthetic dataset, it provides fully dense depth maps for every single frame. However, in order to recreate real conditions as closely as possible, we sampled the ground truth maps and set the sparsity to 20%, meaning that only 20% of the pixels of any given image have a depth value available. The input (non-ground-truth) depth maps were sampled to a sparsity of 5%. Under real-world conditions, 3D scenes are mapped with laser scanner devices, and when the 3D points are projected onto a 2D plane, they cover approximately 5% of the entire image. The ground truth, however, is usually obtained by merging consecutive maps together, thus increasing the coverage to around 20% (Uhrig et al., 2017). Figure 4 depicts the visual contrast between the different sparsity levels.
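The sparsification step can be reproduced with a simple random sub-sampling of the dense maps, as sketched below; uniform sampling is an assumption, since the exact sampling scheme is not stated here.

```python
import numpy as np


def sparsify(dense_depth, coverage, rng=None):
    """Keep a random fraction `coverage` of the dense depth pixels (0.20 for
    the ground truth targets, 0.05 for the input maps) and set the rest to 0
    to mark missing measurements."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(dense_depth.shape) < coverage
    return np.where(mask, dense_depth, 0.0)


# Example: derive both maps from one dense frame (hypothetical array `dense`).
# sparse_input = sparsify(dense, 0.05)   # fed to the network
# sparse_gt = sparsify(dense, 0.20)      # used as depth supervision
```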
Panoptic Segmentation Annotations for Virtual
KITTI 2. Although Virtual KITTI 2 does not provide panoptic segmentation annotations directly, it is possible to use its semantic segmentation and instance segmentation annotations to generate ground truth panoptic segmentation annotations. All the scripts are provided in the code repository. We hope this increases the community's interest in this dataset, as well as in other datasets for which this approach may prove suitable and useful.
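One straightforward way to derive such annotations, sketched below, is to start from the semantic map and overlay every "thing" instance mask with a unique id; the exact encoding used by the scripts in the repository may differ.

```python
import numpy as np


def make_panoptic(semantic, instance_masks, instance_labels, things_ids):
    """Combine a semantic map (H x W class ids) with per-instance masks and
    labels into a single panoptic id map: stuff pixels keep class_id * 1000,
    thing pixels get class_id * 1000 + a running instance index."""
    panoptic = semantic.astype(np.int64) * 1000
    for inst_id, (mask, label) in enumerate(zip(instance_masks, instance_labels), 1):
        if label in things_ids:
            panoptic[mask.astype(bool)] = label * 1000 + inst_id
    return panoptic
```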
4.3 Evaluation Metrics
We calculated the standard COCO metrics (Lin et al.,
2014) for every task. More specifically, we computed
the Intersection over Union (IoU) for semantic seg-
mentation, Mean Average Precision (mAP) for object
detection, as well as PQ, recognition quality (RQ),
Figure 5: Panoptic segmentation and depth completion results on Virtual KITTI 2. Rows from top down show: (a) RGB input
images, (b) semantic segmentation, (c) instance segmentation, (d) panoptic segmentation, (e) depth completion output, and
(f) 3D panoptic segmentation.
Table 1: Results of our model compared to baselines.
Method mIoU mAP RMSE(mm) PQ RQ SQ
Semantic only 0.380 - - - - -
Instance only - 0.691 - - - -
Depth only - - 623 - - -
SemSegDepth 0.387 - 677 - - -
PanDepth(ours) 0.413 0.597 653 0.384 0.450 0.467
and segmentation quality (SQ) for panoptic segmenta-
tion. In addition to the COCO metrics, we computed
the root mean squared error (RMSE) to evaluate the
performance of the depth completion task.
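For reference, PQ as defined by Kirillov et al. (2018) factors into the segmentation and recognition quality terms reported in Table 1:

```latex
\mathrm{PQ}
= \underbrace{\frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ}}
\times
\underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ}}
```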
4.4 Results
We compared the proposed PanDepth model against
equivalent models where only one of the task-specific
branches of PanDepth is enabled. Such mod-
els are listed in Table 1 as "Semantic only", "Instance only", and "Depth only". Table 1 also shows
the performance of the proposed model PanDepth
compared to SemSegDepth (Lagos and Rahtu, 2022),
a joint-learning model for semantic segmentation and
depth completion. SemSegDepth is a multi-task
learning model that follows an architecture similar to
that of our model PanDepth: it consists of task-specific branches with a shared backbone, and its input comprises heterogeneous data, namely RGB frames and sparse depth maps.
However, our model solves more tasks, thus provid-
ing a more holistic representation of the input scenes,
while keeping high accuracy in all evaluation metrics
as shown in Table 1. The qualitative results of the
proposed model can be inspected visually in Figure 5,
where the output of every individual task is depicted
as well as a 3D panoptic segmentation reconstructed
using the corresponding depth completion output and
panoptic segmentation output.
Our model outperforms SemSegDepth in both the
accuracy of the semantic segmentation task, as mea-
sured by the mIoU metric, and the depth completion
task, as measured by the RMSE metric. The proposed
PanDepth model also outperforms the semantic-
segmentation-only model (”Semantic only”) provid-
ing more evidence of the advantages of joint-learning.
Although the single-task models "Instance only" and "Depth only", for instance segmentation and depth
completion respectively, show an increase in accuracy
compared to PanDepth, as reported by the mAP and
the RMSE, the proposed PanDepth model provides a
more complete scene understanding of 3D environ-
ments which is a favorable trade-off in autonomous
driving applications where holistic scene representa-
tions are highly valuable.
It is also important to note that the size of our
model does not increase significantly despite solving
multiple tasks. This is due to shared structures, such as the feature extractor, and relatively small task-specific branches, as shown in Table 2.
Table 2: Model size.
Structure Params
Backbone (EfficientNet-B5) 25.2M
2-way FPN 1.5M
Semantic Branch 1.2M
Instance Branch 53.1M
Depth Branch 1.9M
Joint Branch 1.2M
PanDepth Total Params 84M
5 CONCLUSIONS
This paper presents an end-to-end model for panoptic
segmentation and depth completion using heteroge-
neous data as input, namely RGB images, and sparse
depth maps. Our model yields a better scene under-
standing by providing a semantic representation of 3D
environments. We propose a joint-learning method to
perform multiple tasks, specifically semantic segmen-
tation, instance segmentation, depth completion, and
panoptic segmentation. Through a rigorous set of ex-
periments, we demonstrate, quantitatively and qual-
itatively, the advantages of joint learning and multi-
task models. Our model solves multiple computer vi-
sion tasks, keeping high-accuracy results compared to
other strong baselines, without a significant increase
in computational cost.
REFERENCES
Adelson, E. H. (2001). On seeing stuff: the perception of
materials by humans and machines. In IS&T/SPIE
Electronic Imaging.
Bulò, S. R., Porzi, L., and Kontschieder, P. (2017). In-place activated batchnorm for memory-optimized training of dnns. CoRR, abs/1712.02616.
Cabon, Y., Murray, N., and Humenberger, M. (2020). Vir-
tual kitti 2.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. CoRR, abs/2005.12872.
Chen, L., Wang, H., and Qiao, S. (2020a). Scaling wide
residual networks for panoptic segmentation. CoRR,
abs/2011.11675.
Chen, Y., Yang, B., Liang, M., and Urtasun, R. (2020b).
Learning joint 2d-3d representations for depth com-
pletion. CoRR, abs/2012.12402.
Chen, Y., Yang, B., Liang, M., and Urtasun, R. (2020c).
Learning joint 2d-3d representations for depth com-
pletion.
Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S.,
Adam, H., and Chen, L. (2019). Panoptic-deeplab: A
simple, strong, and fast baseline for bottom-up panop-
tic segmentation. CoRR, abs/1911.10194.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and
Girdhar, R. (2021a). Masked-attention mask trans-
former for universal image segmentation. CoRR,
abs/2112.01527.
Cheng, B., Schwing, A. G., and Kirillov, A. (2021b). Per-
pixel classification is not all you need for semantic
segmentation. CoRR, abs/2107.06278.
Chollet, F. (2016). Xception: Deep learning with depthwise
separable convolutions. CoRR, abs/1610.02357.
de Geus, D., Meletis, P., and Dubbelman, G. (2019).
Fast panoptic segmentation network. CoRR,
abs/1910.03892.
Eldesokey, A., Felsberg, M., and Khan, F. S. (2018). Con-
fidence propagation through cnns for guided sparse
depth regression. CoRR, abs/1811.01791.
Gansbeke, W. V., Neven, D., Brabandere, B. D., and
Gool, L. V. (2019). Sparse and noisy lidar com-
pletion with RGB guidance and uncertainty. CoRR,
abs/1902.05356.
Gao, N., He, F., Jia, J., Shan, Y., Zhang, H., Zhao, X., and
Huang, K. (2022). Panopticdepth: A unified frame-
work for depth-aware panoptic segmentation.
Guo, P., Lee, C., and Ulbricht, D. (2020). Learn-
ing to branch for multi-task learning. CoRR,
abs/2006.01895.
Hazirbas, C., Ma, L., Domokos, C., and Cremers, D.
(2016). Fusenet: Incorporating depth into semantic
segmentation via fusion-based cnn architecture. In
Asian Conference on Computer Vision (ACCV).
He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. (2017). Mask R-CNN. CoRR, abs/1703.06870.
He, L., Lu, J., Wang, G., Song, S., and Zhou, J. (2021).
Sosd-net: Joint semantic object segmentation and
depth estimation from monocular images. CoRR,
abs/2101.07422.
Hou, R., Li, J., Bhargava, A., Raventos, A., Guizilini, V.,
Fang, C., Lynch, J. P., and Gaidon, A. (2019). Real-
time panoptic segmentation from dense detections.
CoRR, abs/1912.01202.
Hu, M., Wang, S., Li, B., Ning, S., Fan, L., and Gong, X.
(2021). Penet: Towards precise and efficient image
guided depth completion. CoRR, abs/2103.00783.
Jaritz, M., de Charette, R., Wirbel, É., Perrotton, X., and Nashashibi, F. (2018). Sparse and dense data with cnns: Depth completion and semantic segmentation. CoRR, abs/1808.00769.
Kirillov, A., Girshick, R. B., He, K., and Dollár, P. (2019). Panoptic feature pyramid networks. CoRR, abs/1901.02446.
Kirillov, A., He, K., Girshick, R. B., Rother, C., and Dollár, P. (2018). Panoptic segmentation. CoRR, abs/1801.00868.
Lagos, J. P. and Rahtu, E. (2022). Semsegdepth: A com-
bined model for semantic segmentation and depth
completion. In Farinella, G. M., Radeva, P., and Boua-
touch, K., editors, Proceedings of the 17th Interna-
tional Joint Conference on Computer Vision, Imag-
ing and Computer Graphics Theory and Applica-
tions, VISIGRAPP 2022, Volume 5: VISAPP, On-
line Streaming, February 6-8, 2022, pages 155–165.
SCITEPRESS.
Li, B. and Dong, A. (2021). Multi-task learning with at-
tention : Constructing auxiliary tasks for learning to
learn. In 2021 IEEE 33rd International Conference on
Tools with Artificial Intelligence (ICTAI), pages 145–
152.
Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., and
Wang, X. (2018). Attention-guided unified network
for panoptic segmentation. CoRR, abs/1812.03904.
Li, Z., Wang, W., Xie, E., Yu, Z., Anandkumar, A., Alvarez,
J. M., Lu, T., and Luo, P. (2021). Panoptic segformer.
CoRR, abs/2109.03814.
Liebel, L. and Körner, M. (2019). Multidepth: Single-image depth estimation via multi-task regression and classification. CoRR, abs/1907.11111.
Liebel, L. and Körner, M. (2018a). Auxiliary tasks in multi-task learning.
Liebel, L. and Körner, M. (2018b). Auxiliary tasks in multi-task learning.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick,
R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: common objects in context. CoRR, abs/1405.0312.
Liu, H., Peng, C., Yu, C., Wang, J., Liu, X., Yu, G., and
Jiang, W. (2019). An end-to-end network for panoptic
segmentation. CoRR, abs/1903.05027.
Liu, S., Johns, E., and Davison, A. J. (2018). End-
to-end multi-task learning with attention. CoRR,
abs/1803.10704.
Long, J., Shelhamer, E., and Darrell, T. (2014). Fully
convolutional networks for semantic segmentation.
CoRR, abs/1411.4038.
Mohan, R. and Valada, A. (2020). Efficientps: Efficient
panoptic segmentation. CoRR, abs/2004.02307.
Park, J., Joo, K., Hu, Z., Liu, C., and Kweon, I. S. (2020).
Non-local spatial propagation network for depth com-
pletion. CoRR, abs/2007.10042.
Petrovai, A. and Nedevschi, S. (2019). Multi-task network
for panoptic segmentation in automated driving. In
2019 IEEE Intelligent Transportation Systems Confer-
ence (ITSC), pages 2394–2401.
Qiu, J., Cui, Z., Zhang, Y., Zhang, X., Liu, S., Zeng,
B., and Pollefeys, M. (2018). Deeplidar: Deep sur-
face normal guided depth prediction for outdoor scene
from sparse lidar data and single color image. CoRR,
abs/1812.00488.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. CoRR, abs/1505.04597.
Ruder, S. (2017). An overview of multi-task learning in
deep neural networks. CoRR, abs/1706.05098.
Schon, M., Buchholz, M., and Dietmayer, K. (2021).
MGNet: Monocular geometric scene understanding
for autonomous driving. In 2021 IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV). IEEE.
Sener, O. and Koltun, V. (2018). Multi-task learning as
multi-objective optimization. CoRR, abs/1810.04650.
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethink-
ing model scaling for convolutional neural networks.
CoRR, abs/1905.11946.
Tang, J., Tian, F., Feng, W., Li, J., and Tan, P. (2019). Learn-
ing guided convolutional network for depth comple-
tion. CoRR, abs/1908.01238.
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T.,
and Geiger, A. (2017). Sparsity invariant cnns. CoRR,
abs/1708.06500.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.,
Jones, L., Gomez, A. N., Kaiser, L., and Polo-
sukhin, I. (2017). Attention is all you need. CoRR,
abs/1706.03762.
Wang, H., Zhu, Y., Adam, H., Yuille, A. L., and Chen, L.
(2020). Max-deeplab: End-to-end panoptic segmenta-
tion with mask transformers. CoRR, abs/2012.00759.
Wang, S., Suo, S., Ma, W.-C., Pokrovsky, A., and Urta-
sun, R. (2018). Deep parametric continuous convolu-
tional neural networks. 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition.
Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E.,
and Urtasun, R. (2019). Upsnet: A unified panoptic
segmentation network. CoRR, abs/1901.03784.
Yang, Y., Wong, A., and Soatto, S. (2019). Dense depth
posterior (DDP) from single image and sparse range.
CoRR, abs/1901.10034.
Yuan, H., Li, X., Yang, Y., Cheng, G., Zhang, J., Tong,
Y., Zhang, L., and Tao, D. (2021). Polyphonicformer:
Unified query learning for depth-aware video panoptic
segmentation. CoRR, abs/2112.02582.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020).
Deformable DETR: deformable transformers for end-
to-end object detection. CoRR, abs/2010.04159.
Zou, N., Xiang, Z., Chen, Y., Chen, S., and Qiao, C.
(2020a). Simultaneous semantic segmentation and
depth completion with constraint of boundary. Sen-
sors, 20(3).
Zou, N., Xiang, Z., Chen, Y., Chen, S., and Qiao, C.
(2020b). Simultaneous semantic segmentation and
depth completion with constraint of boundary. Sen-
sors, 20(3).