SALT: A Semi-automatic Labeling Tool for RGB-D Video Sequences
Dennis Stumpf¹, Stephan Krauß¹, Gerd Reis¹, Oliver Wasenmüller² and Didier Stricker¹,³
¹German Research Center for Artificial Intelligence GmbH (DFKI), Germany
²Hochschule Mannheim, Germany
³University of Kaiserslautern, Germany
Keywords:
RGB-D, Dataset, Tool, Annotation, Label, Detection, Segmentation.
Abstract:
Large labeled data sets are one of the essential basics of modern deep learning techniques. Therefore, there is
an increasing need for tools that allow labeling large amounts of data as intuitively as possible. In this paper, we
introduce SALT, a tool to semi-automatically annotate RGB-D video sequences to generate 3D bounding boxes
for full six Degrees of Freedom (DoF) object poses, as well as pixel-level instance segmentation masks for
both RGB and depth. Besides bounding box propagation through various interpolation techniques, as well as
algorithmically guided instance segmentation, our pipeline also provides built-in pre-processing functionalities
to facilitate the data set creation process. By making full use of SALT, annotation time can be reduced by a
factor of up to 33.95 for bounding box creation and 8.55 for RGB segmentation without compromising the
quality of the automatically generated ground truth.
1 INTRODUCTION
Generating ground truth data for RGB-D data sets
is extremely time-consuming and expensive, as it in-
volves annotations in both 2D and 3D. The 3D bound-
ing boxes of the famous data sets KITTI (Geiger et al.,
2013) and SUN RGB-D (Song et al., 2015), for exam-
ple, are created manually for every individual object
in every frame or scene. Furthermore, as such tasks
are often outsourced to workers with limited technical
background, the annotation quality can vary widely.
KITTI, for example, contains frames with missing or
incorrectly labeled boxes. This raises the question of
whether the manual effort in the annotation pipeline for
RGB-D data can be reduced without compromising quality.
As Section 2 shows, this question opens
up an active field of research where different ap-
proaches have already emerged. These approaches,
however, all come with their individual limitations.
They either require extensive pre-processing steps,
are only applicable to static scenes containing rigid
objects or limit the ground truth representation (e.g. 4
DoF object poses or segmentation for only one modal-
ity). Furthermore, many are designed for application
scenarios involving ego-motion of the camera.
To alleviate these and other limitations, we pro-
pose SALT (cf. Figure 1), a simple, yet effective tool
to generate bounding box and segmentation ground
truth data for RGB-D video sequences in a semi-
automatic fashion.

Figure 1: We introduce SALT, a tool to semi-automatically
annotate 3D bounding boxes and pixel-level instance seg-
mentation in RGB-D video sequences. This figure depicts a
screenshot from the 3D bounding box module of the tool.

The tool does not require any prior
knowledge of the scene and is independent of any ob-
jects to be labeled. In summary, our contributions are:
1. A pipeline to rapidly generate 3D bounding boxes
for full 6 DoF object poses up to 33.95 times faster
than a naive approach.
2. Algorithmically guided generation of pixel-level
instance segmentation masks in both 2D and 3D,
with 2D bounding boxes as a by-product, speed-
ing up annotation time by a factor of up to 8.55.
Table 1: Overview of state-of-the-art tools (upper half) and methods (lower half) for RGB-D data set creation. Listed are the
detection and segmentation ground truth data representation, whether the tool provides built-in calibration functionalities (e.g.
undistorting images or modifying extrinsic parameters) and if the functionalities of the tool are accessible through a Graphical
User Interface (GUI). Green rows indicate approaches built for annotating indoor scenarios (e.g. by using a Microsoft Kinect)
and yellow ones for autonomous vehicles (i.e. LiDAR sensor, ego motion of cameras).
Method Approach Detection Segmentation Calibration GUI
Ours interpolation + GrabCut 6DoF, 2D RGB, D X X
(Marion et al., 2018) mesh + reconstruction 6DoF RGB, D X (X)
(Suchi et al., 2019) incremental scene building 2D D - -
(Wong et al., 2015) learning 4DoF RGB - X
(Monica et al., 2017) graph segmentation - D - X
(Zimmer et al., 2019) interpolation 4DoF - - X
(Wang et al., 2019) tracking + clustering 4DoF - - X
(Arief et al., 2020) tracking + clustering 4DoF - - X
(Plachetka et al., 2018) tracking + optimization 4DoF, 2D RGB, D X X
(Lee et al., 2018) learning 4DoF D - X
(Huang et al., 2020) 3D-2D projection - RGB, D - X
(Yan et al., 2020) clustering 3DoF - - X
(Hodaň et al., 2017) mesh + reconstruction 6DoF, 2D -
(Xiang et al., 2018) mesh + reference frame 6DoF D
(Grenzdörffer et al., 2020) mesh + reference frame 6DoF, 2D D
(Xie et al., 2016) label transfer - RGB
(Patil et al., 2019) interpolation 4DoF -
(Dewan et al., 2016) Bayesian approach 3DoF D
(Chang et al., 2019) accumulation over time 6DoF -
2 RELATED WORK
Recent advances in the field of data set creation have
shown that automating parts of the annotation process
in order to minimize necessary human interaction can
result in large-scale, high-quality RGB-D data sets in
a short amount of time.
(Suchi et al., 2019) build a static scene by adding
one object at a time, leveraging the change in depth
for automatic pointwise segmentation. (Monica et al.,
2017) use a nearest neighbor graph representation to
segment a single point cloud based on individually
labeled points. The quality of the generated ground
truth of such approaches, however, highly depends on
the depth quality and the texture of objects.
(Hodaň et al., 2017) and (Marion et al., 2018) re-
construct the scene, align pre-built meshes with the
objects in the point cloud and then project back into
the frames used for reconstruction. Similarly, (Xi-
ang et al., 2018) and (Grenzdörffer et al., 2020) align
meshes with the point cloud of a reference-frame and
then propagate the resulting object configuration to
other frames. While mesh based approaches provide
full 6 DoF object poses and segmentation masks, they
require a pre-defined set of rigid objects and corre-
sponding meshes beforehand. Thus, they are only ca-
pable of annotating known, static scenes.
(Wong et al., 2015) use actively trained models to
predict the 3D structure of indoor scenes from seg-
mented RGB-D images. (Lee et al., 2018) leverage
pre-trained Convolutional Neural Networks (CNNs)
to predict a 3D bounding box from a single click on an
object’s point in the point cloud. (Xie et al., 2016) use
a Conditional Random Field (CRF) model to transfer
a coarsely labeled 3D point cloud and geometric cues
into 2D image space, resulting in densely segmented
2D images. Such learning based approaches, how-
ever, require annotated data beforehand, do not guar-
antee to scale to different data domains and are highly
dependent on the used architectures.
(Wang et al., 2019), (Arief et al., 2020) and (Yan
et al., 2020) infer 3D bounding boxes from the spa-
tial extents of clustered points. The former two addi-
tionally provide CNN assisted pre-segmentation and
use tracking to propagate annotations into subsequent
frames. (Dewan et al., 2016) use a Bayesian approach
and motion cues to automatically detect and track ob-
jects. (Plachetka et al., 2018) track manually anno-
tated 3D bounding boxes and corresponding polygons
(for RGB segmentation). (Chang et al., 2019) accu-
mulate manually selected points over time to automat-
ically infer fixed size 3D bounding boxes. The draw-
back of these approaches is that the spatial extents of
an object in a point cloud may not always represent
[Figure 2 diagram. Inputs: RGB and depth frames plus camera calibration (intrinsics and extrinsics for RGB and depth). Optional pre-processing: undistort (yielding new intrinsics) and apply a colormap to the depth. 3D bounding box module: create and interpolate 3D bounding boxes, modify the origin of the world coordinate system; outputs center (x, y, z), extents (x, y, z), rotation (x, y, z), truncation and occlusion (RGB, depth) and additional attributes. Instance segmentation module: GrabCut-guided 2D segmentation, 3D segmentation and projection to 2D; outputs segmentation masks.]
Figure 2: Overview of the pipeline of our proposed tool SALT. Input are the individual frames of an RGB-D sequence from
a fixed viewpoint, as well as the corresponding camera calibration (intrinsics and extrinsic transformation between RGB
and depth). The footage is either already undistorted, or can be undistorted using our tool. The 3D bounding box module
can be used to not only generate 3D bounding boxes in a semi-automatic fashion, but also to modify the extrinsics of the
RGB-D camera system w.r.t. the scene through simple drag and drop of the world coordinate system’s origin. The instance
segmentation module can segment depth maps and RGB images with the support of GrabCut (Rother et al., 2004), or in
the case of depth maps, by selecting the points directly from inside the point cloud. Already segmented depth maps can
additionally aid the segmentation of the corresponding RGB image.
the full spatial extents due to occlusion, sparse mea-
surements or reflective textures. In such cases, track-
ing and clustering may yield inaccurate results.
(Huang et al., 2020) manually segment back-
ground and moving objects in 3D and 2D, respectively;
by combining both, they retrieve a fully
segmented image. (Zimmer et al., 2019) and (Patil
et al., 2019), similarly to our approach, annotate 3D
bounding boxes for only a subset of frames, while in-
terpolating the remaining ones. Both, however, only
interpolate linearly, which may not accurately capture
real dynamic motion.
3 METHOD
In this section, we describe the core features of SALT
in greater detail. A graphical overview can be viewed
in Figure 2, which can be used as a guiding reference
throughout this section.
3.1 Pre-processing
The tool internally uses undistorted RGB and depth
frames (16 bit depth and 8 bit color mapped depth).
The user has the option to either provide these images
directly, or use the tool to generate them from the raw
footage. In both cases, the appropriate intrinsic pa-
rameters of both cameras and the extrinsic stereo cal-
ibration (rotation R and translation t between the in-
dividual cameras) have to be provided.
Figure 3: Simplified, schematic visualization of our undis-
tortion pipeline for RGB images with radial distortion.
The naive approach undistorts while retaining the original
spatial dimension (green), resulting in loss of information
through scaling. Our approach ensures that the valid area
(red) keeps roughly the original spatial dimension, retaining as
much information (i.e. original pixels) as possible.
Our pipeline undistorts in such a way that as many
pixels of the raw input image as possible are retained.
As can be seen in Figure 3, undistorting an image
while retaining the resolution and all pixels results in
areas (indicated in black) that do not contain any pixel
values. The naive approach would be to crop out the
area containing only valid pixels (indicated in red)
and scale it back to the original resolution. However, this
results in a loss of information, as pixels outside of the
valid area are dropped and the remaining pixels are
scaled down. Our pipeline, on the other hand, deter-
mines the ideal resolution at which one side of the
valid area matches the original image resolution. This
ensures that, during undistortion, as few pixels as pos-
sible are lost. Depth maps are processed in a similar
fashion. The tool undistorts each pixel individually
and projects them back to the original image space.
Depending on the distortion model, the ideal coordi-
nates of projected pixels could fall outside of the original
resolution's boundaries. To capture those pixels
as well, the necessary resolution and focal lengths are
computed and a new camera matrix is generated.
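For illustration, the following Python sketch shows how a comparable undistortion step can be realized with OpenCV's getOptimalNewCameraMatrix and undistort. It does not reproduce the resolution search described above or the per-pixel handling of depth maps; the parameter choices (e.g. alpha=1 and the default output size) are assumptions for illustration rather than SALT's exact implementation.

```python
import cv2
import numpy as np

def undistort_keep_pixels(img, K, dist, new_size=None):
    """Sketch: undistort an RGB frame while keeping as many valid pixels as
    possible. alpha=1 keeps all source pixels (black border areas appear);
    the paper instead searches for the output resolution at which one side
    of the valid area matches the original image size."""
    h, w = img.shape[:2]
    if new_size is None:
        new_size = (w, h)  # hypothetical choice; the tool computes an ideal size
    # Compute a new camera matrix for the undistorted view and the valid ROI.
    K_new, valid_roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), 1, new_size)
    undistorted = cv2.undistort(img, K, dist, None, K_new)
    return undistorted, K_new, valid_roi
```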
While we do not claim novelty for this undistor-
tion process, we argue that it is necessary to generate
a data set that is as flexible as possible. This means
that, at a later point in time, if the images in the data
set need to be undistorted differently, already gener-
ated segmentation masks lose as few pixels as possi-
ble when being re-distorted.
3.2 3D Bounding Boxes
As depicted in Figure 1, the user accesses the scene
with an interactive 3D point cloud viewer. To iden-
tify objects in the scene and adjust parameters more
easily, the boxes are projected and displayed in the
corresponding RGB and depth frames in real time.
Let {F_1, F_2, ..., F_N} be a sequence of N RGB-D
frames. When creating a new 3D bounding box B_k
for frame F_k, a user first assigns a unique interpola-
tion ID and a class label. The process of drawing
the box and adjusting its parameters involves sim-
ple drag-and-drop operations (translation and scaling)
and the use of the mouse wheel (rotation) to reduce
the cognitive workload.
Let the object contained in B_k be visible for M
frames. In order to retrieve the boxes for all frames
B_all = {B_k, B_{k+1}, B_{k+2}, ..., B_{k+M}} with minimal ef-
fort, two core features are provided: copy and inter-
polate. Given the sequential nature of videos, copying
the boxes into subsequent frames facilitates the anno-
tation process, since only minor adjustments, if any,
need to be made to copied boxes. This can then be
used in conjunction with interpolation: The user only
needs to fully annotate box B_k, copy it into every m-th
frame (keyframe) and apply small adjustments, result-
ing in B_key = {B_k, B_{k+m}, B_{k+2m}, ..., B_{k+M}}. A simple
interpolation f_int(·) then generates the box parameters
for all intermediate frames. More formally:

B_{all} = f_{int}(B_{key})  (1)
The advantage of using a simple interpolation in-
stead of, for example, a more complex tracking al-
gorithm is its robustness against sparse or occluded
measurements. Even objects without any points (i.e.
visible only in the RGB frame) can be annotated.
Commonly, translation and scaling parameters are
interpolated individually using simple linear interpo-
lation (e.g. (Zimmer et al., 2019) and (Patil et al.,
2019)). We follow this approach for the scaling pa-
rameters (width, height and length), but implement a
hybrid interpolation method for the translation param-
eters. More precisely, if the Euclidean distance c
of
the center coordinates is above a certain threshold ε
for at least 4 consecutive keyframes, we apply cubic
spline interpolation between those keyframes, other-
wise linear interpolation is used:
f
int
(·) =
(
cubic if c
> ε for 4 keyframes
linear otherwise
(2)
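As an illustration of Equation 2, the following Python sketch interpolates box centers between keyframes with NumPy and SciPy. The run detection, the threshold value and the array layout are assumptions made for this sketch; SALT's actual implementation may differ.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_centers(key_frames, key_centers, eps=0.05):
    """Hybrid interpolation sketch for box centers (Eq. 2).
    key_frames: sorted frame indices of the keyframes.
    key_centers: (n, 3) array of box centers at those keyframes.
    Uses a cubic spline when the center moves more than eps over a run of
    at least 4 consecutive keyframes, linear interpolation otherwise."""
    key_frames = np.asarray(key_frames)
    key_centers = np.asarray(key_centers, dtype=float)
    # Displacement of each keyframe w.r.t. the previous one.
    step = np.linalg.norm(np.diff(key_centers, axis=0), axis=1)
    moving = step > eps
    frames = np.arange(key_frames[0], key_frames[-1] + 1)
    # Default: per-axis linear interpolation between keyframes.
    out = np.stack([np.interp(frames, key_frames, key_centers[:, d])
                    for d in range(3)], axis=1)
    # Replace runs covering >= 4 consecutive "moving" keyframes with a spline.
    run_start = None
    for i, m in enumerate(np.append(moving, False)):
        if m and run_start is None:
            run_start = i
        elif not m and run_start is not None:
            if i - run_start >= 3:  # run spans at least 4 keyframes
                cs = CubicSpline(key_frames[run_start:i + 1],
                                 key_centers[run_start:i + 1], axis=0)
                seg = np.arange(key_frames[run_start], key_frames[i] + 1)
                out[seg - key_frames[0]] = cs(seg)
            run_start = None
    return frames, out
```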
Rotation angles (yaw, pitch and roll), on the other
hand, cannot be interpolated individually. Given the
Euler angles ϕ, a single 3D orientation is constructed
from individual rotations around the z, y and x axes
in a fixed order. For this reason, we apply Spher-
ical Linear Interpolation (SLERP), as introduced in
(Shoemake, 1985). Instead of using three Euler an-
gles, keypoint orientations are represented as quater-
nions. Thus an interpolation step involves moving be-
tween quaternions on a sphere around a fixed rotation
axis with constant velocity, resulting in unambiguous
rotational movement.
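A minimal sketch of this rotation interpolation, assuming keyframe orientations given as Euler angles in z-y-x order (radians), can be written with SciPy's Rotation and Slerp classes:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_rotations(key_frames, key_euler_zyx):
    """SLERP sketch for keyframe box orientations.
    key_euler_zyx: (n, 3) Euler angles (z, y, x order) at the keyframes.
    Returns interpolated Euler angles for every frame between the first
    and last keyframe."""
    key_rots = Rotation.from_euler('zyx', key_euler_zyx)
    slerp = Slerp(np.asarray(key_frames, dtype=float), key_rots)
    frames = np.arange(key_frames[0], key_frames[-1] + 1)
    interp = slerp(frames)          # one Rotation per intermediate frame
    return interp.as_euler('zyx')   # back to Euler angles if needed
```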
Furthermore, the tool automatically provides a
truncation value t by determining the cut-off area of
bounding boxes at image boundaries when projected
into the RGB and depth frames. Occlusion o, on the
other hand, can be manually specified and will be
propagated to interpolated boxes. By combining these
values, a visibility score v is computed as:
v = (1 - t) \cdot (1 - o), \quad t, o \in [0, 1]  (3)
This value is used to colorize boxes depending on
their visibility, in order to guide the user in visually
understanding the scene and annotations more intu-
itively. Furthermore, the occlusion and truncation val-
ues can be used to categorize the difficulty of individ-
ual samples similarly to KITTI (Geiger et al., 2013).
Finally, the world coordinate system’s origin (i.e.
extrinsic calibration of the RGB and depth camera)
can be displayed and adjusted from inside the point
cloud viewer using the bounding box control scheme.
3.3 Instance Segmentation
The segmentation module is, at its core, aided by
the GrabCut algorithm (Rother et al., 2004). Ini-
tially, a user supplies a region of interest R_k (repre-
sented as a 2D rectangle) in the image space, contain-
ing the object that shall be segmented. This generates
an initial mask M_k, with pixels in- and outside of R_k
being designated fore- and background, respectively.
Given M_k and the corresponding image, the algorithm
learns Gaussian Mixture Models (GMMs) for fore-
and background and improves upon the initial pixel
assignment, resulting in a new mask M'_k, defined as:

M'_k = f_{GrabCut}(M_k)  (4)
Depending on the input image, the process of
Equation 4 may not yield an optimal segmentation re-
sult. In such cases, M'_k can be modified by the user
in the form of coarse scribbles (cf. Figure 4), manu-
ally assigning fore- or background to the pixels. By
running f_GrabCut(·) on the modified mask, the GMMs
are updated further, improving the estimated segmen-
tation mask. This process can be repeated iteratively
until convergence. Asking the user to indicate fore-
and background with coarse scribbles instead of poly-
gons or exact pixel-level segmentation reduces the
cognitive and physical workload by a wide margin.
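The GrabCut interaction described above can be sketched with OpenCV as follows. The function name and the representation of scribbles as boolean masks are illustrative assumptions, not SALT's actual interface:

```python
import cv2
import numpy as np

def grabcut_with_scribbles(img_bgr, rect, fg_scribble=None, bg_scribble=None,
                           iters=5):
    """Sketch of the GrabCut-guided segmentation step (Eq. 4).
    rect: (x, y, w, h) region of interest around the object.
    fg_scribble / bg_scribble: optional boolean masks marking pixels the
    user scribbled as definite fore-/background in a refinement pass."""
    mask = np.zeros(img_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    # Initial estimate from the rectangle alone.
    cv2.grabCut(img_bgr, mask, rect, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    if fg_scribble is not None or bg_scribble is not None:
        # User corrections: scribbles override the current assignment.
        if fg_scribble is not None:
            mask[fg_scribble] = cv2.GC_FGD
        if bg_scribble is not None:
            mask[bg_scribble] = cv2.GC_BGD
        cv2.grabCut(img_bgr, mask, None, bgd_model, fgd_model,
                    iters, cv2.GC_INIT_WITH_MASK)
    # Definite and probable foreground form the final binary mask.
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```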
Figure 4: Screenshot of the segmentation module of SALT.
Masks (yellow) are created by using GrabCut (Rother et al.,
2004) and can be optionally fine-tuned. For this, a user pro-
vides coarse scribbles for fore- and background (red and
green respectively) for the algorithm to iteratively improve
upon its initial mask. Overlapping masks (gray) are used to
initialize pixels as background.
To further improve both run time and accuracy,
multiple extensions are applied to the default Grab-
Cut implementation. First, the 2D rectangles can be
created and interpolated similarly to the 3D bound-
ing boxes, or inferred by projecting already annotated
3D bounding boxes into 2D. Furthermore, instead of
generating M_k by using the whole frame F_k, we only
use the rectangle and a small, padded area outside
of its boundaries. We argue that the local surround-
ings of objects are sufficient for the algorithm to de-
termine background pixels. If two masks M_{k,1} and
M_{k,2} overlap, the pixels belonging to M_{k,2} can be set
as background when running f_GrabCut(M_{k,1}) (cf. Fig-
ure 4). Moreover, images fed into the algorithm can
be downsampled beforehand to improve run time and
the generated masks can be filtered using simple mor-
phological operations. Finally, individual masks can
be copied into subsequent frames, speeding up anno-
tation time of non-moving objects.
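Two of these extensions, cropping to a padded rectangle and morphological filtering of the resulting mask, can be sketched as follows; the padding and kernel sizes are assumed values, not the ones used in the tool:

```python
import cv2
import numpy as np

def padded_roi(rect, img_shape, pad=20):
    """Sketch: crop only the rectangle plus a small padded border instead
    of feeding the whole frame to GrabCut."""
    x, y, w, h = rect
    H, W = img_shape[:2]
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    x1, y1 = min(x + w + pad, W), min(y + h + pad, H)
    return x0, y0, x1, y1

def clean_mask(mask, kernel_size=3):
    """Sketch: simple morphological filtering of a binary GrabCut result."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # drop speckles
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)  # fill small holes
```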
Even though GrabCut relies on color cues in the
RGB color space to estimate masks, we empirically
found that it is also applicable to the depth frames
when provided as color mapped versions. We addi-
tionally convert them into grayscale images to make
it easier for the user to differentiate between scrib-
bles, masks and actual image contents. Given the
three dimensional nature of depth maps, however, it is
easier to differentiate between foreground and back-
ground when looking at the scene from multiple an-
gles. Thus, we additionally provide the option to cre-
ate depth segmentation masks by using an interactive
3D point cloud viewer similar to the one used for the
3D bounding box module. The process of creating a
mask in 3D is visualized in Figure 5.
Figure 5: Segmenting a depth map using the 3D point cloud
viewer. A user adds (left) or deletes (center) multiple points
at once from different views using a 2D rectangle, resulting
in a fully segmented object (right).
When a new mask is created in the RGB image,
the tool will optionally search for a corresponding
mask in the depth map. If found, its points will be
projected into the RGB image space and used to ini-
tialize pixels as foreground in M_k, making the initial
guess as precise as possible. Depending on the quality
of the camera calibration and the density of the depth
map, this feature can potentially remove the need to
manually fine-tune the annotation.
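A sketch of this depth-to-RGB seeding step is given below, assuming a pinhole camera model, depth in meters and extrinsics (R, t) mapping from the depth to the RGB camera frame; all variable names and conventions are illustrative, not SALT's internal ones:

```python
import numpy as np

def seed_foreground_from_depth(depth_mask, depth_m, K_d, K_rgb, R, t, rgb_shape):
    """Sketch: project the points of an existing depth segmentation mask
    into the RGB image to initialize GrabCut foreground pixels."""
    v, u = np.nonzero(depth_mask & (depth_m > 0))
    z = depth_m[v, u]
    # Back-project depth pixels to 3D points in the depth camera frame.
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Transform into the RGB camera frame and project with the RGB intrinsics.
    pts_rgb = pts @ R.T + t.reshape(1, 3)
    u_rgb = np.round(pts_rgb[:, 0] / pts_rgb[:, 2] * K_rgb[0, 0] + K_rgb[0, 2]).astype(int)
    v_rgb = np.round(pts_rgb[:, 1] / pts_rgb[:, 2] * K_rgb[1, 1] + K_rgb[1, 2]).astype(int)
    seed = np.zeros(rgb_shape[:2], dtype=bool)
    ok = (u_rgb >= 0) & (u_rgb < rgb_shape[1]) & (v_rgb >= 0) & (v_rgb < rgb_shape[0])
    seed[v_rgb[ok], u_rgb[ok]] = True  # mark as definite foreground in the GrabCut mask
    return seed
```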
4 EXPERIMENTS
In order to evaluate the effectiveness of our tool, we
create two toy data sets: 1000 frames of two people
walking around, and 100 frames of various objects
being placed into a shelf (referred to as ds people
and ds objects respectively). The first data set can
be considered easy to annotate given the high fram-
erate and depth map quality as well as simple object
poses, while reflective surfaces, low framerate, low
depth map resolution and complex object poses of
ds objects make it a challenging data set. Annotated
samples of these data sets can be viewed in Figure 6.
We asked three different users to annotate these
sequences using SALT. For 3D bounding boxes, they
Table 2: Quantitative comparison of user annotations with the high-quality ground truth for both toy data sets in terms of
Average Scaling Error (ASE) (↓), Average Translation Error (ATE) (↓), Average Orientation Error (AOE) (↓), Intersection
over Union (IoU) (↑), Mean Absolute Error (MAE) (↓) and t_avg (↓). Coverage (↑) represents the percentage of annotated
objects w.r.t. the ground truth (1.0 means all objects were annotated by the user). BB_3D are 3D bounding box annotations,
S_RGB and S_depth are the RGB and depth segmentation masks. Naive represents user annotations with a naive tool (averaged
across all users).
                               ds people                          ds objects
Data    Metric         User 1  User 2  User 3  Naive    User 1  User 2  User 3  Naive
BB_3D   ASE            0.320   0.338   0.294   0.296    0.226   0.332   0.239   0.393
        ATE (in cm)    9.781   9.264   9.182   8.103    2.543   1.867   2.253   2.753
        AOE (in rad)   0.257   0.150   0.149   0.141    0.156   0.185   0.115   0.241
        t_avg (in s)   1.53    1.85    3.86    51.95    8.62    7.63    11.93   88.71
        Coverage       0.960   0.999   0.966   1.0      1.0     1.0     1.0     1.0
S_RGB   IoU            0.943   0.948   0.953   0.942    0.955   0.958   0.956   0.948
        MAE            0.015   0.014   0.012   0.018    0.011   0.010   0.011   0.013
        t_avg (in s)   47.64   69.01   75.28   146.6    18.19   24.41   44.39   155.56
        Coverage       1.0     1.0     1.0     1.0      1.0     1.0     1.0     1.0
S_depth IoU            0.999   1.000   0.998   -        0.874   0.904   0.886   -
        MAE            0.000   0.000   0.000   -        0.019   0.014   0.016   -
        t_avg (in s)   24.40   31.82   32.87   -        42.53   65.55   55.54   -
        Coverage       0.964   0.976   0.976   -        1.0     0.984   1.0     -
were instructed to annotate every 40th and 5th frame
of ds people and ds objects, respectively, as well as
frames in which objects enter or leave the scene. Seg-
mentation masks were annotated for every 20th and
5th frame, respectively. We compare the user anno-
tations with a high-quality ground truth created with
great care by a different expert user. Results are dis-
played in Table 2.
When evaluating the image segmentation masks,
we report the common metrics Mean Absolute Error
(MAE) and Intersection over Union (IoU):
IoU = \frac{\sum_{j=1}^{w} \sum_{i=1}^{h} M_{i,j} \cdot \bar{M}_{i,j}}{\sum_{j=1}^{w} \sum_{i=1}^{h} \left( M_{i,j} + \bar{M}_{i,j} - M_{i,j} \cdot \bar{M}_{i,j} \right)}  (5)

MAE = \frac{1}{h \cdot w} \sum_{j=1}^{w} \sum_{i=1}^{h} \left| M_{i,j} - \bar{M}_{i,j} \right|,  (6)

with h, w being the height and width of the 2D rect-
angle area containing the mask (cf. Section 3.3) and
M_{i,j}, \bar{M}_{i,j} ∈ {0, 1} being the corresponding pixel val-
ues at position (i, j) of user annotations and ground
truth, respectively.
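For binary NumPy masks, these two metrics reduce to the following sketch:

```python
import numpy as np

def mask_metrics(user_mask, gt_mask):
    """Sketch of Eqs. 5 and 6 for binary masks cropped to the 2D rectangle
    containing the object (values in {0, 1})."""
    user = user_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(user, gt).sum()
    union = np.logical_or(user, gt).sum()
    iou = inter / union if union > 0 else 1.0  # both masks empty -> perfect match
    mae = np.abs(user.astype(float) - gt.astype(float)).mean()
    return iou, mae
```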
For evaluating the 6 DoF 3D bounding boxes,
we report the Average Scaling Error (ASE), Average
Translation Error (ATE) and Average Orientation Er-
ror (AOE) as introduced in the nuScenes benchmark
(Caesar et al., 2020). However, we modify the AOE
to take all three rotation angles into account, since
nuScenes only reports yaw. More precisely, given the
3x3 rotation matrices R_gt and R_u for ground truth and
user annotation respectively, the difference in orienta-
tion ∆ϕ can be computed as:

\Delta\varphi = \arccos\left( \frac{\mathrm{tr}\left( R_u R_{gt}^T \right) - 1}{2} \right)  (7)
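A direct NumPy translation of Equation 7 looks like this; the clipping of the trace term is an implementation detail added here to guard against numerical noise and is not stated in the text:

```python
import numpy as np

def orientation_error(R_user, R_gt):
    """Sketch of Eq. 7: geodesic angle between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_user @ R_gt.T) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```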
All of the aforementioned evaluation metrics are
averaged across all annotations of the individual data
sets for each user. Furthermore, we measure the total
annotation time per data set for each user, and com-
pute the resulting average annotation time per individ-
ual object t_avg. For the 3D bounding boxes, both the
manually created as well as the interpolated boxes are
considered when computing t_avg.
To further evaluate the achievable annotation time
speedup using SALT, we implement a naive version
of the tool. This means that 3D bounding boxes are
created without copying or interpolating, while seg-
mentation masks are manually drawn without being
assisted by the GrabCut algorithm. We let all three
users annotate the same, randomly sampled subset
of frames using the naive approach and report the
average annotation time t_avg and accuracy per ob-
ject across all three users. As objects are not al-
ways distinguishable in the depth maps, applying this
approach to the depth segmentation has empirically
been shown to fail. Therefore, we only report the naive re-
sults of RGB segmentation and 3D bounding boxes.
As can be seen in Table 2, using SALT allows
a reduction in annotation time without compromis-
ing quality compared to a high-quality ground truth.
RGB segmentation can be sped up by a factor of up
to 3.08 for ds people and 8.55 for ds objects, while
3D bounding box creation is up to 33.95 and 11.63
Figure 6: Qualitative results (3D bounding boxes, RGB and depth segmentation masks) of user annotated data using SALT.
The two leftmost columns are from the ds objects data set, the others from ds people. Segmented depth maps are depicted as
colorized point clouds. Due to the low depth resolution of ds objects, points are projected onto a 3x3 pixel region.
times faster, respectively, when compared to a naive
tool. The gap between those factors is a result of the
difference in difficulty for our data sets. Objects in
ds objects only move during short periods of time, al-
lowing masks to be copied into subsequent frames, re-
ducing the annotation time even further. Movements
in ds people involve less complex object poses, which
makes annotating the bounding boxes easier. Even
though User 3 had never used the tool before, and
therefore achieved the slowest annotation time, the
overall accuracy across all three users is on par with,
and in some cases even higher than, that achieved with the naive tool.
Lastly, we evaluate different interpolation ap-
proaches for 3D bounding box propagation in terms
of ASE and ATE. More precisely, given the user an-
notated keyframes, we apply linear, cubic and our hy-
brid interpolation (cf. Section 3.2). The results are
listed in Table 3 and suggest that relying solely on lin-
ear interpolation is not optimal. For our data sets, we
achieved best results when applying linear interpola-
tion on the scaling parameters and our hybrid interpo-
lation approach on the translation parameters. In case
of translation, linear interpolation performs worst as
it assumes constant velocity between keyframes, thus
failing to capture dynamic object movements. Cu-
bic interpolation, on the other hand, induces jitter as
a byproduct of the polynomial fitting when an object
stands still for some time shortly before it starts mov-
ing. This drawback is solved by our hybrid ap-
proach, as we apply cubic interpolation only during
object movement, and linear otherwise.
Table 3: Quantitative comparison of different interpolation
approaches for scale and translation parameters of the 3D
bounding boxes.
Metric User 1 User 2 User 3
ds people
ASE (linear) 0.320 0.338 0.290
ASE (cubic) 0.331 0.339 0.299
ATE (linear) 11.740 11.363 10.921
ATE (cubic) 9.781 9.264 9.345
ATE (hybrid) 9.781 9.264 9.182
ds objects
ASE (linear) 0.226 0.332 0.239
ASE (cubic) 0.225 0.332 0.239
ATE (linear) 2.685 1.987 2.330
ATE (cubic) 2.617 1.898 2.286
ATE (hybrid) 2.543 1.867 2.253
5 CONCLUSION
In this paper, we introduced SALT, a tool to semi-
automatically annotate RGB-D video sequences. The
tool provides a pipeline for creating 6 DoF 3D bound-
ing boxes, as well as instance segmentation masks for
both RGB images and depth maps. We have shown
that by making full use of the provided features, an-
notation time can be reduced by a factor of up to 33.95
for 3D bounding box creation and 8.55 for RGB seg-
mentation without compromising annotation quality.
In some cases, the quality of automatically generated
data even improved, demonstrating that our provided
functionalities reduce the cognitive workload of the
user and make the annotation process more intuitive.
For future work, several additions can be con-
sidered to further enhance the efficiency of our pro-
posed pipeline. For example, distinguishing objects
of interest from static background in 3D point clouds
is a challenging task. By adding ground-plane and
background removal algorithms as part of the pre-
processing pipeline for depth maps, distracting ele-
ments can be removed from the scene, allowing the
user to identify and annotate objects with greater ease.
Furthermore, depth points inside 3D bounding boxes
can be projected to an already segmented RGB im-
age to automatically infer depth segmentation masks,
leaving the user to only remove or add individual
points, if necessary.
ACKNOWLEDGMENTS
We would like to thank every participant in the user
study for their time and efforts. This work was par-
tially funded by the German Federal Ministry of Edu-
cation and Research in the context of the project EN-
NOS (13N14975).
REFERENCES
Arief, H. A., Arief, M., Zhang, G., Liu, Z., Bhat, M., Indahl,
U. G., Tveite, H., and Zhao, D. (2020). Sane: Smart
annotation and evaluation tools for point cloud data.
IEEE Access, 8:131848–131858.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E.,
Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Bei-
jbom, O. (2020). nuscenes: A multimodal dataset for
autonomous driving. In 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 11618–11628.
Chang, M., Lambert, J., Sangkloy, P., Singh, J., Bak, S.,
Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan,
D., and Hays, J. (2019). Argoverse: 3d tracking and
forecasting with rich maps. In 2019 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 8740–8749.
Dewan, A., Caselitz, T., Tipaldi, G. D., and Burgard, W.
(2016). Motion-based detection and tracking in 3d li-
dar scans. In 2016 IEEE International Conference on
Robotics and Automation (ICRA), pages 4508–4513.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. International
Journal of Robotics Research, 32(11):1231–1237.
Grenzdörffer, T., Günther, M., and Hertzberg, J. (2020).
Ycb-m: A multi-camera rgb-d dataset for object
recognition and 6dof pose estimation. In 2020 IEEE
International Conference on Robotics and Automation
(ICRA), pages 3650–3656.
Hodaň, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis,
M., and Zabulis, X. (2017). T-less: An rgb-d dataset
for 6d pose estimation of texture-less objects. In 2017
IEEE Winter Conference on Applications of Computer
Vision (WACV), pages 880–888.
Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., and
Yang, R. (2020). The apolloscape open dataset for
autonomous driving and its application. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
42(10):2702–2719.
Lee, J., Walsh, S., Harakeh, A., and Waslander, S. L. (2018).
Leveraging pre-trained 3d object detection models for
fast ground truth generation. In 2018 21st Interna-
tional Conference on Intelligent Transportation Sys-
tems (ITSC), pages 2504–2510.
Marion, P., Florence, P. R., Manuelli, L., and Tedrake,
R. (2018). Label fusion: A pipeline for generat-
ing ground truth labels for real rgbd data of cluttered
scenes. In 2018 IEEE International Conference on
Robotics and Automation (ICRA), pages 3235–3242.
Monica, R., Aleotti, J., Zillich, M., and Vincze, M. (2017).
Multi-label point cloud annotation by selection of
sparse control points. In 2017 International Confer-
ence on 3D Vision (3DV), pages 301–308.
Patil, A., Malla, S., Gang, H., and Chen, Y. (2019). The
h3d dataset for full-surround 3d multi-object detec-
tion and tracking in crowded urban scenes. In 2019
International Conference on Robotics and Automation
(ICRA), pages 9552–9557.
Plachetka, C., Rieken, J., and Maurer, M. (2018). The tubs
road user dataset: A new lidar dataset and its appli-
cation to cnn-based road user classification for auto-
mated vehicles. In 2018 21st International Conference
on Intelligent Transportation Systems (ITSC), pages
2623–2630.
Rother, C., Kolmogorov, V., and Blake, A. (2004). ”grab-
cut”: Interactive foreground extraction using iterated
graph cuts. In ACM SIGGRAPH 2004 Papers, SIG-
GRAPH ’04, pages 309–314.
Shoemake, K. (1985). Animating rotation with quaternion
curves. In Proceedings of the 12th Annual Conference
on Computer Graphics and Interactive Techniques,
SIGGRAPH ’85, pages 245–254.
Song, S., Lichtenberg, S. P., and Xiao, J. (2015). Sun rgb-
d: A rgb-d scene understanding benchmark suite. In
2015 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 567–576.
Suchi, M., Patten, T., Fischinger, D., and Vincze, M. (2019).
Easylabel: A semi-automatic pixel-wise object anno-
tation tool for creating robotic rgb-d datasets. In 2019
International Conference on Robotics and Automation
(ICRA), pages 6678–6684.
Wang, B., Wu, V., Wu, B., and Keutzer, K. (2019). Latte:
Accelerating lidar point cloud annotation via sensor
fusion, one-click annotation, and tracking. In 2019
IEEE Intelligent Transportation Systems Conference
(ITSC), pages 265–272.
Wong, Y.-S., Chu, H.-K., and Mitra, N. J. (2015). Smar-
tannotator: An interactive tool for annotating indoor
rgbd images. Computer Graphics Forum (Proc. Euro-
graphics), 34(2):447–457.
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2018).
Posecnn: A convolutional neural network for 6d ob-
ject pose estimation in cluttered scenes. In Proceed-
ings of Robotics: Science and Systems (RSS).
Xie, J., Kiefel, M., Sun, M., and Geiger, A. (2016). Se-
mantic instance annotation of street scenes by 3d to
2d label transfer. In 2016 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
3688–3697.
Yan, Z., Duckett, T., and Bellotto, N. (2020). Online learn-
ing for 3d lidar-based human detection: Experimen-
tal analysis of point cloud clustering and classification
methods. Autonomous Robots, 44:147–164.
Zimmer, W., Rangesh, A., and Trivedi, M. (2019). 3d
bat: A semi-automatic, web-based 3d annotation tool-
box for full-surround, multi-modal data streams. In
2019 IEEE Intelligent Vehicles Symposium (IV), pages
1816–1821.