Achieving RGB-D Level Segmentation Performance from a Single ToF Camera

Pranav Sharma³, Jigyasa Singh Katrolia¹, Jason Rambach¹, Bruno Mirbach¹ and Didier Stricker¹,²

¹German Research Center for Artificial Intelligence DFKI, Kaiserslautern, Germany
²RPTU Kaiserslautern, Germany
³FAU Erlangen, Germany
Keywords:
Multi-Modal Image Segmentation, Depth Image, Infrared Image, Machine Learning, Time-of-Flight,
Deep Learning.
Abstract:
Depth is a very important modality in computer vision, typically used as complementary information to RGB,
provided by RGB-D cameras. In this work, we show that it is possible to obtain the same level of accuracy as
RGB-D cameras on a semantic segmentation task using infrared (IR) and depth images from a single Time-
of-Flight (ToF) camera. In order to fuse the IR and depth modalities of the ToF camera, we introduce a
method utilizing depth-specific convolutions in a multi-task learning framework. In our evaluation on an in-
car segmentation dataset, we demonstrate the competitiveness of our method against the more costly RGB-D
approaches.
1 INTRODUCTION
The research field of semantic segmentation is dominated by RGB images; only recently has it shifted toward RGB-D semantic segmentation (Hazirbas et al., 2017; Wang and Neumann, 2018; Cao et al., 2021; Cheng et al., 2017). However, RGB images
may not always be available due to practical, logis-
tical and financial reasons. RGB-D cameras incur higher cost and require additional effort to calibrate the two cameras, and their larger package size often limits their use
in real-world applications. Indeed, Time-of-Flight
(ToF) depth cameras are often deployed without an
accompanying RGB camera for applications like ges-
ture control, in-car monitoring, industry automation
and building management (Schneider et al., 2022a;
Katrolia et al., 2021a). Infrared (IR) images, on the other hand, are a by-product of ToF depth cameras (no additional sensor needed), but have not been sufficiently explored, specifically in combination with depth data (Agresti et al., 2017; Su et al., 2016).
Infrared images from ToF cameras provide the
magnitude of the modulated light reflected from the
scene and contain shape and semantic features in a
different spectral range (Hahne, 2012). Due to the
similarities between RGB and IR images, it is nat-
ural to attempt to adapt existing RGB-D fusion ap-
proaches to combine IR and depth images (Schneider et al., 2022b; Katrolia et al., 2021b).

Figure 1: Patch similarities of IR and Depth modalities of a ToF Camera.

However, in
most RGB-D methods depth information is only an
accessory to the color information and is consumed
by the same type of neural network layers despite
their differences. Some recent works proposed depth-
specific operations like depth-aware (Wang and Neu-
mann, 2018) and shape-aware (Cao et al., 2021) con-
volutions. We observe that both IR and depth out-
puts from a ToF camera are related in many ways and
therefore these depth-specific operations can be ap-
plied to IR images as well. For example, the intensity
of light reflected from an object decreases as distance
to the object increases. Object surfaces closer to the camera will reflect more light, implying that pixel intensities in an infrared image vary with the shape of that object. We can see this by comparing the same
colored patches in Figure 1 and noting that both image patches exhibit the same relative changes in pixel values. We use this observation to leverage the shape-
aware convolution operation for both IR and depth
images to learn more meaningful features from both
modalities.
We aim to use the available modalities from a
single ToF camera to achieve semantic segmentation
performance comparable to RGB-D methods using
an architecture that is tailored to IR-Depth (IR-D)
input. We take inspiration from (Cao et al., 2021;
Wang and Neumann, 2018) and design a depth-aware
shape convolution operation that consumes IR-D in-
put in a multi-task learning (MTL) architecture with
depth completion as an auxiliary task. Our proposed
method surpasses the baseline RGB-D based methods
using only a ToF camera. To summarize, the main
contributions of our paper are:
• We show that the recently introduced depth-aware and shape convolution operations are suitable for IR images and can achieve the same performance on IR-D images as on RGB-D data. We propose a new convolution operation combining the two and show that it is better than using either of them alone for IR-D data.
• We show that multi-task learning with depth completion as an auxiliary task can be beneficial for the depth and IR segmentation task.
• In our experimental evaluation on the TICaM dataset, we show that with our proposed approach it is possible to surpass RGB-D segmentation performance using only depth and infrared (IR-D) images from a single ToF camera.
2 RELATED WORK
RGB-D Semantic Segmentation. A wide range of
approaches have been proposed for incorporating the
depth information in RGB-D semantic segmentation
and can be broadly classified into three categories
as outlined by (Barchid et al., 2021): (1) Using the
depth channel as an additional input and performing
fusion at different levels (early (Song et al., 2017),
feature-level (Hazirbas et al., 2017) or late (Cheng
et al., 2017) fusion); (2) Depth as supervision signal
for auxiliary tasks such as depth estimation or com-
pletion (Jiao et al., 2019) and (3) Depth-specific op-
erations. Depth-Aware CNN (Wang and Neumann,
2018) showed that object boundaries correlate with
depth gradients and created depth-aware convolution
and depth-aware pooling functions. ShapeConv (Cao
et al., 2021) used a shape convolution kernel to ensure
convolution kernels give consistent responses to ob-
ject classes at different locations in the scene. (Chen
et al., 2019) used depth information to adjust the
neighbourhood size of a 3D convolution filter.
Depth and IR for Semantic Segmentation. Depth-
only methods for semantic segmentation are typically
applied to solve very specific tasks, like object ma-
nipulation via mechanical arm (Zhou et al., 2018),
hand segmentation or hand and object segmentation
(Rezaei et al., 2021; Lim et al., 2019). Even fewer
methods have explored the combination of IR and
depth. (Su et al., 2016) used IR to classify materi-
als since the infrared response depends on the mate-
rial of the object. (Agresti et al., 2017) used IR im-
age as an additional input channel for improving the
depth predicted from a setup containing both stereo
and ToF cameras, and (Lorenti et al., 2018) used unsupervised learning for ToF image segmentation via region merging. (Katrolia et al., 2021b) compared seg-
mentation of depth maps using image-based methods
against the use of point clouds inputs derived from the
depth maps.
Multi-Task Learning (MTL). The survey by (Craw-
shaw, 2020) describes many examples where simulta-
neous learning of two or more related tasks can boost
the performance on either task. RGB-D segmenta-
tion is enhanced using auxiliary tasks like depth and
surface normal prediction (Wang et al., 2022). MTL
methods typically employ a single encoder to learn
features from the available input modalities and two
separate task-specific decoders to perform prediction
(Wang et al., 2022). The Cross-Stitch network (Misra et al., 2016) replaces total parameter sharing with controlled sharing between the two tasks using learned sharing weights.
3 BACKGROUND AND
NOTATION
In this section, we briefly introduce our notation and
describe the depth-specific convolution operations in-
troduced by ShapeConv (Cao et al., 2021) and Depth-
Aware CNN (Wang and Neumann, 2018).
3.1 ShapeConv: Shape-Aware
Convolutional Layer
In order to design convolutions that are invariant to different depth values (base) when the underlying relative difference in depth within a local patch (shape) remains the same, ShapeConv (Cao et al., 2021) suggested
decomposing an image patch $P \in \mathbb{R}^{K_h \times K_w \times C_{in}}$ into a
shape component $P_s$ and a base component $P_b$. These two components are operated on separately by a corresponding shape kernel $W_S$ and base kernel $W_B$ before being passed to the standard convolutional kernel $K$. Instead of decomposing the image patch, the convolutional kernel itself can be decomposed into the respective components, as shown in equation (1). Here $m(K)$ refers to the mean of the kernel.
$$K_B = m(K), \qquad K_S = K - m(K) \qquad (1)$$
Shape convolution is then written with ShapeConv as:

$$\begin{aligned}
F &= \mathrm{ShapeConv}(K, W_B, W_S, P) \\
  &= \mathrm{Conv}\bigl(W_B \diamond m(K) + W_S \star (K - m(K)),\; P\bigr) \\
  &= \mathrm{Conv}\bigl(W_B \diamond K_B + W_S \star K_S,\; P\bigr) \\
  &= \mathrm{Conv}(K_B + K_S,\; P) \\
  &= \mathrm{Conv}(K_{BS},\; P)
\end{aligned} \qquad (2)$$
Here, $\diamond$ and $\star$ denote the base and shape products, respectively. $W_B$ and $W_S$ are learnable weights corresponding to the base and shape components, respectively.
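To make the decomposition concrete, the following is a minimal PyTorch sketch of equations (1) and (2). It is an illustration rather than the ShapeConv reference implementation: the per-filter spatial mean and the scalar learnable weights w_b and w_s are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def shape_conv2d(x, K, w_b, w_s, padding=1):
    """Sketch of ShapeConv (eqs. 1-2) with assumed scalar base/shape weights.

    x: input of shape (N, C_in, H, W); K: kernel of shape (C_out, C_in, K_h, K_w).
    """
    # Base component K_B: mean of the kernel (here taken over the spatial window)
    K_b = K.mean(dim=(2, 3), keepdim=True)
    # Shape component K_S: kernel with its mean removed (eq. 1)
    K_s = K - K_b
    # Weighted recombination into K_BS, then a standard convolution (eq. 2)
    K_bs = w_b * K_b + w_s * K_s
    return F.conv2d(x, K_bs, padding=padding)

# Usage with random data: a 2-channel IR-D-like input and eight 3x3 filters
x = torch.randn(1, 2, 32, 32)
K = torch.randn(8, 2, 3, 3)
w_b = torch.nn.Parameter(torch.ones(1))  # learnable base weight (assumption)
w_s = torch.nn.Parameter(torch.ones(1))  # learnable shape weight (assumption)
print(shape_conv2d(x, K, w_b, w_s).shape)  # torch.Size([1, 8, 32, 32])
```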
3.2 Depth-Aware CNN and Depth
Similarity
In the depth-aware convolution of DCNN (Wang and Neumann, 2018), pixels with depth values similar to that of the centre pixel are weighted more strongly than other pixels. This property is named depth similarity. The depth similarity function $F_D(p_i, p_j)$ is computed from the difference between the depth values $D(p_i)$ and $D(p_j)$ of two pixels $p_i$ and $p_j$, respectively:
$$F_D(p_i, p_j) = \exp\bigl(-\alpha\,|D(p_i) - D(p_j)|\bigr) \qquad (3)$$
Depth-aware convolution is then written with the depth similarity function $F_D$ as:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, F_D(p_0, p_0 + p_n)\, x(p_0 + p_n) \qquad (4)$$
In Equation (4), $\mathcal{R}$ denotes the local grid of pixel offsets around the centre pixel $p_0$, and the depth similarity term $F_D$ is introduced into the convolution operation, so that the convolved features are weighted by $F_D$. The parameter $\alpha$ controls the influence of the depth similarity function $F_D$ on the convolution operation.
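For intuition, the depth similarity weighting of equations (3) and (4) can be sketched with an unfold-based convolution. This is an illustrative sketch assuming a single input channel and a 3×3 kernel, not the Depth-Aware CNN reference code.

```python
import torch
import torch.nn.functional as F

def depth_similarity(d_center, d_neighbors, alpha=1.0):
    """Depth similarity of eq. (3): exp(-alpha * |D(p_i) - D(p_j)|)."""
    return torch.exp(-alpha * (d_center - d_neighbors).abs())

def depth_aware_conv2d(x, depth, weight, alpha=1.0):
    """Sketch of depth-aware convolution (eq. 4) for a single-channel input.

    x, depth: tensors of shape (N, 1, H, W); weight: kernel of shape (1, 1, 3, 3).
    """
    n, _, h, w = x.shape
    # 3x3 neighbourhoods of the input and of the depth map: (N, 9, H*W)
    x_patches = F.unfold(x, kernel_size=3, padding=1)
    d_patches = F.unfold(depth, kernel_size=3, padding=1)
    d_center = depth.reshape(n, 1, h * w)               # depth of the centre pixel p_0
    f_d = depth_similarity(d_center, d_patches, alpha)  # F_D(p_0, p_0 + p_n)
    # Weight each neighbour by w(p_n) * F_D before summing over the grid R
    y = (weight.reshape(1, 9, 1) * f_d * x_patches).sum(dim=1)
    return y.reshape(n, 1, h, w)
```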
4 METHOD
We propose a depth-aware shape convolution opera-
tion applied within a multi-task learning network.
Our primary task is semantic segmentation using con-
catenated infrared and depth images from a ToF cam-
era and our auxiliary task is depth completion for
missing pixels in raw depth images.
4.1 Depth-Aware Shape Convolution
We design a depth-aware shape convolution, in which the shape kernel of ShapeConv is supplemented with the depth similarity measure $F_D$ as computed in equation (3). Formally, this integration can be written in two steps. First, the kernel is decomposed into its base and shape components and recombined into $K_{BS}$ as shown in equation (2). Then, the term $w(p_n)$ in equation (4) is replaced with $K_{BS}$ calculated from the shape kernel. In this way, kernel weights calculated using the shape kernel are integrated with the depth similarity of DCNN. Equation (4) can thus be rewritten for our Depth-aware ShapeConv as:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} K_{BS}(p_n)\, F_D(p_0, p_0 + p_n)\, x(p_0 + p_n) \qquad (5)$$
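Putting the two previous ideas together, equation (5) simply substitutes $K_{BS}$ for $w(p_n)$. A self-contained single-channel sketch is given below; it illustrates the idea rather than the authors' implementation, and the scalar weights w_b and w_s are again assumptions.

```python
import torch
import torch.nn.functional as F

def depth_aware_shape_conv2d(x, depth, K, w_b, w_s, alpha=1.0):
    """Sketch of the depth-aware shape convolution (eq. 5), single channel, 3x3.

    x, depth: (N, 1, H, W); K: (1, 1, 3, 3); w_b, w_s: scalar learnable weights.
    """
    n, _, h, w = x.shape
    # Decompose the kernel into base and shape parts and recombine into K_BS (eqs. 1-2)
    K_b = K.mean()
    K_bs = w_b * K_b + w_s * (K - K_b)
    # Depth similarity of every neighbour to the centre pixel (eq. 3)
    x_patches = F.unfold(x, kernel_size=3, padding=1)      # (N, 9, H*W)
    d_patches = F.unfold(depth, kernel_size=3, padding=1)  # (N, 9, H*W)
    d_center = depth.reshape(n, 1, h * w)
    f_d = torch.exp(-alpha * (d_center - d_patches).abs())
    # K_BS replaces w(p_n) of eq. (4), yielding eq. (5)
    y = (K_bs.reshape(1, 9, 1) * f_d * x_patches).sum(dim=1)
    return y.reshape(n, 1, h, w)
```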
4.2 Infrared and Depth-Aware
Multi-Task Network
We realize a hard parameter sharing-based multi-task
network with semantic segmentation as the main task
and depth completion as the auxiliary task (Figure
2). We use ResNet-101 as the backbone feature ex-
tractor to encode features from concatenated infrared
and depth (IR-D) images. The convolution layers in
the ResNet encoder are replaced with depth-aware
shape convolutions presented in section 4.1. The ex-
tracted features are passed to two task-specific de-
coders that generate final segmentation masks and
depth values for missing pixels. For training the depth
filling branch, the ground truth is prepared as de-
scribed in section 5.1. We follow the training strat-
egy from (Mao et al., 2020) and use predicted depth
values only for missing pixels to calculate the error
between ground truth and predicted depth. The dense
depth map predicted by the network is then multiplied
with the missing pixels mask (1 for missing pixels, 0
otherwise) to keep predicted depth values only for the
pixels that are missing in the raw image. The remaining values are then replaced by the corresponding depth values from the input image.
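The combination of predicted and input depth described above amounts to a simple masked blend. Below is a minimal sketch, assuming missing pixels are encoded as zeros in the raw depth map (an assumption about the data format, not stated in the text).

```python
import torch

def fill_missing_depth(raw_depth, predicted_depth):
    """Keep predicted depth only at missing pixels; copy raw depth elsewhere.

    raw_depth, predicted_depth: tensors of shape (N, 1, H, W).
    Assumes missing pixels are encoded as zeros in the raw depth map.
    """
    missing_mask = (raw_depth == 0).float()  # 1 for missing pixels, 0 otherwise
    return missing_mask * predicted_depth + (1.0 - missing_mask) * raw_depth
```

The same missing-pixel mask is reused when computing the depth-completion error during training (see Section 5.1).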
Figure 2: MTL architecture for segmentation task and dense depth prediction using depth-aware shape convolutions.
Table 1: Comparison of our proposed method to segmentation baselines for RGB-D and IR-D data. The best result in the IR-D category is marked bold and the second best is underlined. Our MTL-DA-ShapeConv method achieves state-of-the-art results in the IR-D category and even outperforms RGB-D methods by a large margin on the class accuracy and mean IoU metrics.

Input | Baselines | Pixel Acc. | Class Acc. | Mean IoU | f.w.IoU
RGB-D | ShapeConv (Cao et al., 2021) | 97.86 | 81.25 | 77.39 | 95.92
RGB-D | Depth-Aware CNN (Wang and Neumann, 2018) | 94.63 | 66.88 | 54.14 | 90.78
RGB-D | FuseNet (Hazirbas et al., 2017) | 95.35 | 56.89 | 42.46 | 92.43
IR-D | ShapeConv (Cao et al., 2021) | 97.75 | 81.31 | 74.61 | 95.76
IR-D | Depth-Aware CNN (Wang and Neumann, 2018) | 93.52 | 60.52 | 50.03 | 88.45
IR-D | FuseNet (Hazirbas et al., 2017) | 93.18 | 55.73 | 39.61 | 88.84
Ours | DA-ShapeConv | 97.57 | 85.08 | 78.39 | 95.42
Ours | MTL-DA-ShapeConv | 97.73 | 85.98 | 79.73 | 95.68
Table 2: Number of model parameters (in millions) and per-image inference time (in milliseconds) for baselines and the proposed MTL architecture. (Here, ch denotes channels.)

Method | Input | Params (million) | Time (ms)
ShapeConv (Cao et al., 2021) | 4-ch | 60.55 | 35.04
ShapeConv (Cao et al., 2021) | 2-ch | 60.54 | 32.20
Ours: DA-ShapeConv | 2-ch | 60.54 | 37.12
Ours: MTL-DA-ShapeConv | 2-ch | 78.12 | 37.16
5 EXPERIMENTS
5.1 Dataset
We evaluate our approach on the in-car cabin dataset
TICaM (Katrolia et al., 2021a) that provides RGB,
depth and infrared images recorded with a single ToF
camera and corresponding ground-truth segmentation
masks. TICaM is the only dataset that fulfills our experimental requirements; surprisingly, we could not find any other dataset that provides segmentation masks for all three image modalities: RGB, depth and infrared.
We used the real image-set of TICaM with the
suggested split of 4666 training images and 2012
test images for our experiments. Following (Katro-
lia et al., 2021a) we combine different object classes
into a single ’object’ class to have 6 object classes
in total. The RGB images have a different resolution and field of view (FoV) from the depth and infrared images. To align the images, the RGB images are first mapped to the pinhole model of the depth images. Subsequently, all the
images are centre-cropped to size 230 × 418. The infrared image is normalized by first removing outliers based on the 99th percentile of the image and then scaling the image to the range 0-255. To further enrich the information, histogram equalization and gamma filtering are applied to the normalized infrared image.
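As an illustration of this preprocessing pipeline, a possible NumPy/OpenCV sketch is given below; the clipping behaviour at the 99th percentile and the gamma value are assumptions, since the exact parameters are not specified in the text.

```python
import cv2
import numpy as np

def preprocess_infrared(ir, gamma=0.8):
    """Normalize a raw single-channel infrared image (sketch with assumed details)."""
    ir = ir.astype(np.float32)
    # Remove outliers using the 99th percentile of the image (clipping assumed)
    p99 = np.percentile(ir, 99)
    ir = np.clip(ir, 0, p99)
    # Scale to the 0-255 range
    ir = (255.0 * ir / max(p99, 1e-6)).astype(np.uint8)
    # Histogram equalization and gamma filtering to enrich the information
    ir = cv2.equalizeHist(ir)
    ir = np.clip(255.0 * (ir / 255.0) ** gamma, 0, 255).astype(np.uint8)
    return ir
```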
Ground-Truth Preparation. To train for the auxil-
iary task of dense depth prediction, completed depth
images are required as ground truth. As the TICaM
dataset only provides depth images with holes, filled
versions of these depth maps are artificially created
in this work using the ”Colorization using Optimiza-
tion” scheme (Levin et al., 2004). The depth im-
ages are filled by enforcing similar depth values to the
neighbouring pixels with similar intensities. The in-
formation on pixel intensities is provided by infrared
images. During training, the error is calculated for the
missing pixels following the training strategy of (Mao
et al., 2020).
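Following this training strategy, the auxiliary loss only penalizes errors at the originally missing pixels. A hedged sketch is shown below; the L1 error term is an assumption and not necessarily the exact loss used by (Mao et al., 2020).

```python
import torch

def masked_depth_loss(predicted_depth, filled_gt_depth, raw_depth):
    """Depth-completion loss over pixels that are missing in the raw depth map.

    Assumes missing pixels are encoded as zeros in raw_depth; the L1 error
    term is an assumption, not necessarily the loss used in the paper.
    """
    missing_mask = (raw_depth == 0).float()
    error = (predicted_depth - filled_gt_depth).abs() * missing_mask
    return error.sum() / missing_mask.sum().clamp(min=1.0)
```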
5.2 State-of-the-Art Comparison
Table 1 provides an evaluation of segmentation ac-
curacy of our proposed method. We include re-
sults of state-of-the-art methods on RGB-D as well
as on our target modality, IR-D ToF data. We choose
three existing methods for RGB-D segmentation and
train them on both RGB-D and IR-D images to es-
tablish our baselines and better evaluate the differ-
ence between RGB-D and IR-D inputs. We choose
FuseNet (Hazirbas et al., 2017) since it is an estab-
lished and well-tested network on many benchmark
datasets for RGB-D segmentation, however it has not
been tested yet on TICaM dataset. ShapeConv (Cao
et al., 2021) and Depth-aware CNN (Wang and Neu-
mann, 2018) on the other hand are more recent
methods that use novel convolution operations un-
like FuseNet. FuseNet and Depth-aware CNN use a
VGG-16 backbone, while ShapeConv uses ResNet-
101. All networks use the SGD optimizer with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴ to update the
weights. By default, Deeplab v3+ and ShapeConv use
pre-trained weights while FuseNet and Depth-aware
CNN are initialized using the Kaiming initialization
(He et al., 2015). For training with IR-D images
we replicate infrared images to form 3-channel im-
ages and concatenate them with single channel depth
images. We report pixel accuracy, class accuracy,
mean Intersection-over-Union (IoU) and frequency
weighted IoU (f.w.IoU) in Table 1 for all results.
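For reference, the baseline IR-D input construction and the optimizer settings described above can be sketched as follows; the learning rate and the placeholder model are assumptions introduced only for illustration.

```python
import torch

# Replicate the single-channel IR image to 3 channels and concatenate the depth
# channel, giving the 4-channel IR-D input used for the baseline networks.
ir = torch.randn(1, 1, 230, 418)      # normalized infrared image (example data)
depth = torch.randn(1, 1, 230, 418)   # depth image (example data)
ir_d = torch.cat([ir.repeat(1, 3, 1, 1), depth], dim=1)  # shape (1, 4, 230, 418)

# SGD with momentum 0.9 and weight decay 5e-4, as used for all networks.
model = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1)  # placeholder, not the real network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,             # assumed learning rate; not stated in the text
    momentum=0.9,
    weight_decay=5e-4,
)
```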
We can observe that the architectures that incorporate depth in an informed manner outperform FuseNet, which simply concatenates depth with the other modalities. Moreover, the combination of infrared and depth can be used instead of RGB-D input while achieving almost the same segmentation performance; the gap between the two inputs is smallest when using depth-aware architectures, with ShapeConv outperforming the other two methods. We can also see from Figure 3 that ShapeConv with 3-channel and with 1-channel infrared images produces similar mask predictions.
Our proposed methods applied on IR-D images
are presented in Table 1 as well. DA-ShapeConv is
the integration of depth-aware convolutions into the ShapeConv architecture, as described in Section 4.1.
We can note that combining both depth-aware and
shape convolutions gives significant improvement on
class accuracy as well as mean IoU, compared to
the ShapeConv and Depth-Aware CNN baselines and
even their RGB-D versions. MTL-DA-ShapeConv
denotes our Multi-Task network, also incorporating
the proposed auxiliary task of depth completion. We
can see that the MTL architecture improves over the
best performing RGB-D ShapeConv baseline with
significant improvement in class accuracy and mean
IoU. Also, the improvement over our method without
the auxiliary task (DA-ShapeConv) validates the use
of multi-task learning with depth completion for this
task. Overall, the results prove our main hypothesis,
that by applying our method we can outperform state-
of-the-art RGB-D methods using only IR-D images
provided by a single ToF camera.
The results in Table 1 are also directly comparable
with results on the TICaM dataset using only depth
information as reported in (Katrolia et al., 2021b).
This clearly indicates the advantage of the IR-D com-
bination of modalities over the use of the D channel
only for segmentation.
Additionally, we provide qualitative results of the
segmentation output from all evaluated network vari-
ants on images from the TICaM dataset in Figure 3.
We see that our methods provide smoother masks
around the edges, especially for the smaller objects on
the passenger seat as well as better class predictions
in some cases.
Finally, a runtime comparison of the evaluated methods is provided in Table 2. The DA-ShapeConv method does not increase the number of parameters and incurs only a limited increase in per-frame runtime compared to ShapeConv. The MTL-DA-ShapeConv method increases the number of parameters due to the additional branch; however, this does not impact the inference runtime.
Figure 3: Predictions from RGB-D and proposed IR-D methods (Depth, Infrared, Ground Truth, ShapeConv RGB-D, ShapeConv 3ch IR-D, ShapeConv 1ch IR-D, DA-ShapeConv, MTL-DA-ShapeConv).
6 CONCLUSION
We designed a network for IR-D segmentation that performs as well as RGB-D segmentation, so that inconvenient and expensive RGB-D cameras can be replaced with single Time-of-Flight (ToF) cameras.
We showed that existing fusion approaches for RGB-
D segmentation can be used with IR-D input if stan-
dard convolutions are replaced with depth-specific
convolutions. We then presented a combination of
depth-aware and shape-aware convolutions, and de-
signed a multi-task learning (MTL) architecture with
this new convolution operation. We employ hard
parameter sharing between our main and auxiliary
tasks of segmentation and depth filling respectively.
Through progressive modifications to the input, the
convolution operation, and the network architecture
we showed that we can outperform all baseline meth-
ods. We conclude that using images from a single ToF
camera, it is possible to surpass RGB-D segmentation
performance with our designed MTL architecture.
ACKNOWLEDGEMENTS
This work was partially funded within the Electronic
Components and Systems for European Leadership
(ECSEL) joint undertaking in collaboration with the
European Union’s H2020 Framework Program and
the Federal Ministry of Education and Research of the
Federal Republic of Germany (BMBF) under grant
agreement 16ESE0424/GA826600 (VIZTA), and par-
tially funded by the German Ministry of Educa-
tion and Research (BMBF) under Grant Agreement
01IW20002 (SocialWear).
REFERENCES
Agresti, G., Minto, L., Marin, G., and Zanuttigh, P. (2017).
Deep learning for confidence information in stereo
and tof data fusion. In IEEE International Conference on Computer Vision Workshops (ICCV-W), pages 697–705.
Barchid, S., Mennesson, J., and Djéraba, C. (2021). Review
on indoor rgb-d semantic segmentation with deep con-
volutional neural networks. In International Confer-
ence on Content-Based Multimedia Indexing (CBMI),
pages 1–4. IEEE.
Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C.,
and Li, Y. (2021). Shapeconv: Shape-aware convo-
lutional layer for indoor rgb-d semantic segmentation.
In IEEE/CVF International Conference on Computer
Vision (ICCV), pages 7088–7097.
Chen, Y., Mensink, T., and Gavves, E. (2019). 3d neigh-
borhood convolution: learning depth-aware features
for rgb-d and rgb semantic segmentation. In Inter-
national Conference on 3D Vision (3DV), pages 173–
182. IEEE.
Cheng, Y., Cai, R., Li, Z., Zhao, X., and Huang, K. (2017).
Locality-sensitive deconvolution networks with gated
fusion for rgb-d indoor semantic segmentation. In
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 3029–3037.
Crawshaw, M. (2020). Multi-task learning with deep
neural networks: A survey. arXiv preprint
arXiv:2009.09796.
Hahne, U. (2012). Real-time depth imaging. PhD thesis,
Berlin Institute of Technology.
Hazirbas, C., Ma, L., Domokos, C., and Cremers, D.
(2017). Fusenet: Incorporating depth into semantic
segmentation via fusion-based cnn architecture. In
Asian Conference on Computer Vision, 2016, pages
213–228. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delv-
ing deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In Proceedings
of the IEEE international conference on computer vi-
sion, pages 1026–1034.
Jiao, J., Wei, Y., Jie, Z., Shi, H., Lau, R. W., and Huang,
T. S. (2019). Geometry-aware distillation for indoor
semantic segmentation. In IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Katrolia, J. S., El-Sherif, A., Feld, H., Mirbach, B., Ram-
bach, J. R., and Stricker, D. (2021a). Ticam: A time-
of-flight in-car cabin monitoring dataset. In British
Machine Vision Conference (BMVC), page 277.
Katrolia, J. S., Krämer, L., Rambach, J., Mirbach, B., and
Stricker, D. (2021b). Semantic segmentation in depth
data: A comparative evaluation of image and point
cloud based methods. In 2021 IEEE International
Conference on Image Processing (ICIP), pages 649–
653. IEEE.
Levin, A., Lischinski, D., and Weiss, Y. (2004). Coloriza-
tion using optimization. ACM Transactions on Graph-
ics, 23.
Lim, G. M., Jatesiktat, P., Kuah, C. W. K., and Ang,
W. T. (2019). Hand and object segmentation from
depth image using fully convolutional network. In
International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBC), pages 2082–
2086. IEEE.
Lorenti, L., Giacomantone, J., and Bria, O. N. (2018). Un-
supervised tof image segmentation through spectral
clustering and region merging. Journal of Computer
Science & Technology, 18.
Mao, J., Li, J., Li, F., and Wan, C. (2020). Depth image
inpainting via single depth features learning. In In-
ternational Congress on Image and Signal Process-
ing, BioMedical Engineering and Informatics (CISP-
BMEI), pages 116–120. IEEE.
Misra, I., Shrivastava, A., Gupta, A., and Hebert, M. (2016).
Cross-stitch networks for multi-task learning. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 3994–4003.
Rezaei, M., Farahanipad, F., Dillhoff, A., Elmasri, R., and
Athitsos, V. (2021). Weakly-supervised hand part seg-
mentation from depth images. In PErvasive Technolo-
gies Related to Assistive Environments, pages 218–
225.
Schneider, P., Anisimov, Y., Islam, R., Mirbach, B., Ram-
bach, J., Stricker, D., and Grandidier, F. (2022a).
Timo—a dataset for indoor building monitoring with
a time-of-flight camera. Sensors, 22(11):3992.
Schneider, P., Rambach, J., Mirbach, B., and Stricker,
D. (2022b). Unsupervised anomaly detection from
time-of-flight depth images. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 231–240.
Song, H., Liu, Z., Du, H., Sun, G., Le Meur, O., and Ren, T.
(2017). Depth-aware salient object detection and seg-
mentation via multiscale discriminative saliency fu-
sion and bootstrap learning. IEEE Transactions on
Image Processing, 26(9):4204–4216.
Su, S., Heide, F., Swanson, R., Klein, J., Callenberg, C.,
Hullin, M., and Heidrich, W. (2016). Material clas-
sification using raw time-of-flight measurements. In
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 3503–3511.
Wang, W. and Neumann, U. (2018). Depth-aware cnn for
rgb-d segmentation. In European Conference on Com-
puter Vision (ECCV), pages 135–150.
Wang, Y., Tsai, Y.-H., Hung, W.-C., Ding, W., Liu, S.,
and Yang, M.-H. (2022). Semi-supervised multi-task
learning for semantics and depth. In IEEE/CVF Win-
ter Conference on Applications of Computer Vision
(WACV), pages 2505–2514.
Zhou, M., Song, W., Shen, L., and Zhang, Y. (2018).
Stacked objects segmentation based on depth image.
In International Conference on Optical and Photonic
Engineering (icOPEN), volume 10827, pages 363–
368. SPIE.