Synthetic Ground Truth for Presegmentation of Known Objects for
Effortless Pose Estimation
Frederik Haarslev (a), William Kristian Juel (b), Norbert Krüger (c) and Leon Bodenhagen (d)
SDU Robotics, Maersk Mc-Kinney Møller Institute, University of Southern Denmark, Campusvej 55, 5230 Odense, Denmark
Keywords:
Synthetic Ground Truth Generation, Object Segmentation, Pose Estimation.
Abstract:
We present a method for generating synthetic ground truth for training segmentation networks for presegmenting point clouds in pose estimation problems. Our method replaces global pose estimation algorithms such as RANSAC, which require manual fine-tuning, with a robust CNN, without having to hand-label segmentation masks for the given object. The data is generated by blending cropped images of the objects with arbitrary backgrounds. We test the method in two scenarios and show that networks trained on the generated data segment the objects with high accuracy, allowing them to be used in a pose estimation pipeline.
1 INTRODUCTION
Pose estimation is an important task in the field of
robotics. It is used in tasks where a robot needs to
manipulate objects in unknown locations such as in
bin picking. Pose estimation is usually solved in three
steps (Figure 1a). First, 3D features in the scene are
found and matched to the model. Then an initial pose
is found with a global method – e.g. feature matching
and RANSAC (Fischler and Bolles, 1981). Lastly, the
initial pose is optimized using a local pose optimizer
often ICP (Besl and McKay, 1992). This approach can be problematic since classic global pose estimators usually require extensive fine-tuning while often still producing many false positives.
Researchers have started to look towards deep learning as a way to mitigate the need for manual fine-tuning, as neural networks are known to be very noise-resistant. (Xiang et al., 2017; Do et al., 2018)
both have succeeded at replacing the global pose es-
timation pipeline with a single network, which given
a 2D image estimates the 3D bounding boxes and 6D
poses of all known objects in the image (Figure 1b).
These networks are trained end-to-end, meaning that they train on ground truth where each sample is an image annotated with semantic labels, bounding boxes, and poses for all objects. Such data
(a) https://orcid.org/0000-0003-2882-0142
(b) https://orcid.org/0000-0001-5046-8558
(c) https://orcid.org/0000-0002-3931-116X
(d) https://orcid.org/0000-0002-8083-0770
Figure 1: Overview of three different approaches to pose
estimation. Blue indicates that the method requires manual
fine tuning and red indicates that it relies on training using
a ground truth.
can be found in publicly available pose estimation datasets, but it is cumbersome to gather if the object of interest is not part of such a dataset. Since neural networks often require thousands of training examples in order to generalize, it is impractical to hand-label enough training data every time one faces a new pose estimation problem.
Another way deep learning can be used to aid pose
estimation, is by only replacing part of the global pose
estimation step. (Wong et al., 2017) uses a seman-
tic image segmentation convolutional neural network
(CNN) for presegmenting the point cloud. This al-
lows the pose to be estimated using an initial pose
guess followed by ICP (Figure 1c). This approach has
the benefit of using a generic segmentation network,
which can easily be changed to newer models once
they are published. The training data is also easier
to collect, as only a segmentation mask is needed for
each image. However, while the annotations are easier to collect than those required for the pose estimation networks, it is still not practical to create thousands of segmentation masks whenever one needs to estimate
Figure 2: (a) & (b) Examples of two pose estimation scenar-
ios and (c) & (d) their corresponding synthesized training
data.
Figure 3: The SMOOTH robot docked with the bin used for
use case 1 of the SMOOTH project.
the pose of a new object.
We propose to solve pose estimation by preseg-
menting the point cloud as in (Wong et al., 2017), but
using a network trained on synthetic training data, sig-
nificantly reducing the manual labor required when
estimating the pose of new objects. We exploit the
fact that pose estimation problems are concerned with
specific objects. This means that the network needs to learn exactly what certain objects look like, instead of learning broader, context-dependent object classes such as cars and chairs. In this work, the data is syn-
thesized by blending images of the objects with arbi-
trary backgrounds, creating images with no apparent
context (Figure 2). This way the network learns to
ignore the background, and instead focus on the ob-
jects. This process reduces the manual labor when
using deep learning to solve pose estimation of new
objects, from collecting and hand labeling thousands
of images, to taking a few images of the objects from
different angles.
This is motivated by solving one of the three use cases in the SMOOTH project (www.smooth-robot.dk). In (Juel et al., 2019)
the robot itself and the technical modules required to
solve the three use cases are theorized and described
(Figure 3). In this paper, we focus on the pose estimation method used to solve use case 1, which entails the robot platform detecting, docking with, and delivering laundry bins at an elderly care center to ease the heavy workload of the caregivers. This solution is
constrained by the rotational and positional error tol-
erances of the algorithm used to dock the laundry bin.
2 RELATED WORK
Deep learning is an increasingly pervasive component of modern robotic applications. Much ongoing research uses object detection and semantic segmentation to robustly crop point clouds of objects, which can then be registered with the ICP algorithm, replacing RANSAC and its tedious manual fine-tuning. Other research replaces the full classic pose estimation pipeline with deep learning methods, i.e., learning the poses of the detected objects directly. A common problem for all this research is that a lot of training data is required. Often data is annotated manually, which is a very time-consuming task and does not scale well to industry. Therefore, various methods for syn-
thesizing training data using deep learning methods
and visual simulators (2D and 3D) are common fo-
cus points for researchers working with deep learning
within a pose estimation pipeline. In the following, selected relevant works are addressed, in particular pose estimation approaches where deep learning is used within the pipeline (Section 2.1) and synthetic generation of training data (Section 2.2).
2.1 Deep Learning for Pose Estimation
The research using deep learning within the pose es-
timation pipeline is motivated by the fact that sophis-
ticated robots still struggle to achieve a fast and re-
liable perception of task-relevant objects in uncon-
strained and realistic scenarios. In (Wong et al., 2017) a method called SegICP is proposed for object detection and pose estimation. They train a CNN on 7500 images to do pixel-wise semantic segmentation. They manually label around 5625 objects, while the rest are generated using a motion capture system with active markers placed on their cameras and objects. The output of the segmentation is then used to crop the point cloud, thereby allowing pose estimation using multi-hypothesis ICP.
(Xiang et al., 2017) proposes the network
PoseCNN for pose estimation. PoseCNN estimates
the 3D translation of an object by localizing its cen-
ter in the image and predicting its distance from the
camera. The 3D rotation of the object is estimated by
regressing to a quaternion representation.
(Wong et al., 2017; Xiang et al., 2017) replace parts of or the whole classic pose estimation pipeline. The drawback they have in common is how dependent their methods are on training data, i.e., annotations of 2D images and collected poses for the objects. Creating a dataset like this does not scale well to industry, since a lot of manual work labeling and producing known poses is required, which prevents its application in real pose estimation tasks, where a known object is to be detected in an arbitrary scene.
2.2 Synthetic Generation of Data
Different approaches like semi-supervised labeling
(Russell et al., 2008) and gamification (von Ahn
and Dabbish, 2004) have been proposed to ease and
streamline the task of annotating data, but these approaches still leave the user with the effort of manually labeling or at least supervising the annotation process. Contemporary approaches that use annotated photo-realistic simulated data to train deep neural networks have shown promising results. In (Johnson-Roberson et al., 2016) a pipeline to gather data from a highly realistic visual simulator is established, and the data is then used to train a deep convolutional neural network for object
detection. The trained networks, using only the sim-
ulated data, were capable of achieving high levels of
performance on real-world data. Although this ap-
proach gives attractive results, it is limited by the ver-
satility of the visual simulator since it is not possi-
ble to generate data of objects and environments not
present in it.
Another approach to synthetic ground truth gener-
ation uses Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). One such approach is GeneSIS-RT, introduced in (Stein and Roy, 2018). It is a procedure for generating high-quality synthetic data
for domain-specific learning tasks, for which anno-
tated training data may not be available. They utilize
the CycleGAN algorithm (Zhu et al., 2017) to learn a
mapping function G_s between unlabeled real images and unlabeled simulated images. G_s is then used to
generate more realistic synthetic training images from
labeled simulated images.
While GANs are able to generate data, they still
need to be trained. For this, a manually annotated ground truth is often required. In the case of GeneSIS-RT, no labels are required for the real and simulated image sets. However, in order to generate a ground truth using the method, a simulated environment has to be created. A different method, described by (Dwibedi et al., 2017), requires even less manual labor than the GAN-based methods. They automatically cut object instances and paste them on random backgrounds within the context of the deployment environment. The generated synthetic data gives competitive performance against real data. This method is shown to work when the background images match the deployment context, which requires the user to take images of the relevant environment onto which objects can be pasted.
In this paper, we propose a method where neither
human labor nor an external visual simulator is re-
quired for synthesizing ground truth data. This data
can be used for training a CNN for presegmentation of
point clouds in a pose estimation pipeline. We show
that the method used in (Dwibedi et al., 2017) works
without the use of contextually relevant backgrounds
for the synthesized images and without the use of real
world images during training. We show that the re-
sulting trained CNN can be used as a link in a pose
estimation pipeline. We optimize the CNN so that
it can be deployed on a mobile robot by comparing
different network backbones and input image sizes.
In addition, we show that the developed pose estima-
tion pipeline satisfies the rotational and positional er-
ror tolerances required by the use case.
3 METHODS
In this work we try to solve image segmentation of specific objects using as little human labor as possible. This is accomplished by synthesizing the training data and labels, removing the need to do any annotation by hand. The following subsections will
explain how the training data is synthesized, and how
the trained segmentation network is used for estimat-
ing the pose of the object in the scene.
3.1 Generation of Training Data
The flow of the data generation pipeline is shown in
Figure 4. The first step is to take a few images of the desired objects from different angles and remove the background. The images have their background removed automatically by utilizing the depth channel of the RGB-D camera. By comparing images of the object in a scene with images of the scene without the object, the object mask can be determined as the pixels which have changed between the two images.
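As a minimal sketch, the depth-based masking step could look as follows, assuming an OpenCV/NumPy implementation; the function name, depth threshold, and morphological clean-up are illustrative assumptions rather than the exact implementation used here.

```python
# Minimal sketch of the depth-based object masking described above.
# Function and threshold names are illustrative, not from the paper.
import cv2
import numpy as np

def object_mask_from_depth(depth_scene, depth_with_object, threshold_m=0.02):
    """Label pixels whose depth changed between the empty scene and the
    scene containing the object, then clean the mask morphologically."""
    diff = np.abs(depth_with_object.astype(np.float32) -
                  depth_scene.astype(np.float32))
    mask = (diff > threshold_m).astype(np.uint8) * 255
    # Remove speckle noise caused by depth sensor artifacts.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

# Usage: stack the mask with the RGB image to obtain an RGBA object crop.
# rgba = np.dstack([rgb_with_object, object_mask_from_depth(d_empty, d_obj)])
```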
Figure 4: Flow of training data generation. Images of the object are automatically cropped and blended with an arbitrary
background. Changed pixels are given the object label.
Figure 5: Three methods for blending an object to the back-
ground: (a) no blending, (b) Gaussian blurring, and (c)
seamless cloning.
The synthetic images are generated by blending
the cropped object images with an arbitrary back-
ground (Figure 4a). In this work the MS COCO (Lin
et al., 2014) dataset has been used for the background
images. First, an object image is generated by picking
one of the cropped images (Figure 4b) and then aug-
menting it (Figure 4c). Such augmentations include flipping, scaling, and rotating. The augmented object
is then blended with the background at a random lo-
cation (Figure 4d). As in (Dwibedi et al., 2017), two
different approaches for blending are used: Gaussian
blurring of the object edges, or the seamless clone al-
gorithm described in (Pérez et al., 2003). The results
of both methods are compared to not using blending
in Figure 5.
The Gaussian blurring method blends the object
by using a Gaussian filter on the alpha mask of the ob-
ject image. This removes the hard edges between the
object and background by fading the images together.
The seamless clone algorithm minimizes the differ-
ence in the gradient field of the object image and the
desired gradient field of the affected area of the back-
ground image, by solving a set of Poisson equations.
This smooths the transition between the object and
background, by removing any abrupt changes in the
gradient field of the resulting image. Another result of
the Poisson editing is variations in the lighting of the
object, dependent on the background. The seamless
clone algorithm therefore inherently performs data
augmentation, leading to greater variation in the syn-
thetic data.
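A minimal sketch of the two blending variants is given below, assuming an OpenCV implementation with RGBA object crops; the function names, blur sigma, and coordinate conventions are illustrative assumptions.

```python
# Hedged sketch of the two blending methods: Gaussian blurring of the
# alpha mask, and Poisson-based seamless cloning.
import cv2
import numpy as np

def blend_gaussian(background, obj_rgba, top_left, sigma=2.0):
    """Paste obj_rgba onto the background, feathering the alpha mask."""
    h, w = obj_rgba.shape[:2]
    y, x = top_left
    alpha = obj_rgba[:, :, 3].astype(np.float32) / 255.0
    alpha = cv2.GaussianBlur(alpha, (0, 0), sigma)[:, :, None]
    roi = background[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * obj_rgba[:, :, :3] + (1.0 - alpha) * roi
    out = background.copy()
    out[y:y + h, x:x + w] = blended.astype(np.uint8)
    return out

def blend_seamless(background, obj_rgba, top_left):
    """Paste the object using OpenCV's Poisson-based seamless cloning."""
    h, w = obj_rgba.shape[:2]
    y, x = top_left
    src = np.ascontiguousarray(obj_rgba[:, :, :3])
    mask = obj_rgba[:, :, 3]
    center = (x + w // 2, y + h // 2)  # seamlessClone expects (x, y)
    return cv2.seamlessClone(src, background, mask, center, cv2.NORMAL_CLONE)
```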
Once an object has been blended with the back-
ground, the segmentation mask can be created as the pixels which have changed after the blending (Figure 4e). For each synthetic image, N random object images are blended with the background (Figure 4f). In order to bet-
ter handle cases where the object is not in the image,
a percentage of the synthesized training data contains
only the random background. In this work this per-
centage is set to 5%. As the position of each object
image is random some occlusion of the objects oc-
curs, leading to the trained network being able to han-
dle such occurrences. In (Dwibedi et al., 2017) every image they synthesize is generated twice, once with each blending method. I.e., the ground truth contains pairs of images with the same objects blended at the same locations on the same background, with the only difference between them being the blending method. This is stated to help the network avoid learning the blending artifacts and instead learn the actual shape of the object. We did not find this
to improve the segmentation performance, so instead
the blending method is chosen randomly per object:
I.e., a single synthetic image can contain both blend-
ing methods on different objects.
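The synthesis loop described above can be sketched as follows. The number of objects N, the random per-object choice of blending method, the 5% background-only fraction, and the changed-pixel labeling follow the text; all function names and default values are assumptions, and the blend functions (e.g. the two from the previous sketch) are passed in as parameters.

```python
# Illustrative sketch of the image synthesis loop.
import random
import numpy as np

def synthesize_image(background, object_images, blend_fns, n_objects=3,
                     background_only_prob=0.05, augment=lambda o: o):
    """Blend up to n_objects augmented object crops onto a background and
    derive the per-pixel label mask from the changed pixels."""
    canvas = background.copy()
    label = np.zeros(background.shape[:2], dtype=np.uint8)
    if random.random() < background_only_prob:
        return canvas, label  # pure background sample
    for _ in range(n_objects):
        obj = augment(random.choice(object_images))
        h, w = obj.shape[:2]
        # Assumes the augmented object fits inside the background image.
        y = random.randint(0, canvas.shape[0] - h)
        x = random.randint(0, canvas.shape[1] - w)
        blend = random.choice(blend_fns)  # e.g. [blend_gaussian, blend_seamless]
        new_canvas = blend(canvas, obj, (y, x))
        changed = np.any(new_canvas != canvas, axis=2)
        label[changed] = 1  # object class; later objects may occlude earlier ones
        canvas = new_canvas
    return canvas, label
```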
Another method used by (Dwibedi et al., 2017) is
the inclusion of distractor objects. Distractor objects
Figure 6: Output of the segmentation network trained on the
grey bin.
Figure 7: Output of the segmentation network trained on the
green bin (b) without and (c) with the decoy as distractor.
are objects other than the objects of interest, which are labeled as background. This is stated to also help
the network learn to ignore the blending artifacts. We
did not find this to improve the segmentation perfor-
mance, so instead distractor objects are used to miti-
gate false positives, as explained in Section 3.2.
3.2 Training Network
Before the ground truth could be synthesized, 300 images of the bin were taken, using the depth channel to generate a segmentation mask, as well as 100 images not containing the bin. These were split into a valida-
tion and test set containing 150 object images and 50
background images each. Of the 150 validation ob-
ject images, 100 were cropped with the segmentation
mask and used for synthesizing the training data.
Since only a single object is being learnt, the ca-
pacity of the segmentation network does not need to
be large. Instead the network is chosen based on fast
inference speed and low memory footprint, in order
to optimize the performance on embedded platforms
like mobile robots. Therefore, the BiSeNet (Yu et al.,
2018) real-time segmentation network has been im-
plemented. The network was trained on a ground
truth consisting of 5,000 synthesized images for 10
epochs, using the Adam optimizer with 0.001 as the
initial learning rate. The dataset was balanced by
weighting the loss based on the occurrence of each
class, thereby reducing the effect of the background
class being abundant in the training data.
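The class-balanced loss weighting can be sketched as below. The median-frequency weighting shown is a common variant and an assumption, not necessarily the exact scheme used here; the optimizer and learning rate follow the text.

```python
# Sketch of class-balancing the loss by pixel frequency (assumed variant).
import numpy as np
import torch

def class_weights_from_masks(label_masks, num_classes=2):
    """Weight each class inversely to its pixel frequency in the ground truth."""
    counts = np.zeros(num_classes, dtype=np.float64)
    for mask in label_masks:
        counts += np.bincount(mask.ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    weights = np.median(freq) / freq  # rare classes (the object) get larger weights
    return torch.tensor(weights, dtype=torch.float32)

# weights = class_weights_from_masks(train_masks)
# criterion = torch.nn.CrossEntropyLoss(weight=weights)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```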
In order to validate the method, a network was first
trained on the generic grey garbage bin. An example
of the network output can be seen in Figure 6. This shows that the network is able to learn to segment an
Figure 8: Flow of pose estimation. Point cloud is cropped
using the output of the segmentation network, allowing the
pose to be found using only ICP.
object correctly using our method of synthetic ground
truth generation, and therefore the method was used
with the use-case-specific green bin. In order to verify that the network learns to segment the object, and not just the color green, it was tested on images
containing other objects in the same shade of green
(Figure 7). The network correctly segments the green
circle as background, but mislabels the green rectan-
gle (decoy) which is made to look like the bin. This
is remedied by taking 3 images of the decoy, and then
synthesizing a new ground truth using the decoy as a distractor. This equates to testing the trained network
in the deployment environment, and then taking a few
images of any false positives.
3.3 Estimating Pose
The flow of the pose estimation pipeline is shown in
Figure 8. The first step is to capture the 2D scene image and the corresponding 3D point cloud. The image
is fed through the trained network to produce a seg-
mentation mask. For RGB-D sensors there is a one-to-one correspondence between the pixel locations in the segmentation mask and the points in the ordered point cloud. This allows for the extraction of
all points of the desired class using the segmentation
mask. However, the registration is not perfect, mean-
ing that some bin pixels are registered to 3D locations
in the background. This creates some systematic ar-
tifacts which the network cannot account for, since
it only works with the 2D data. Because of this, all
points further away than some margin (depending on
object size) from the centroid of the segmented point
cloud are also removed.
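A minimal sketch of this cropping step is shown below, assuming an organized (H, W, 3) point array aligned with the segmentation mask; the margin value and names are illustrative assumptions tied to the object size.

```python
# Pick the points whose pixel is labeled as the object, then drop
# mis-registered outliers far from the centroid.
import numpy as np

def crop_cloud(organized_points, seg_mask, object_label=1, margin_m=0.6):
    """organized_points: (H, W, 3) array aligned with the segmentation mask."""
    pts = organized_points[seg_mask == object_label]
    pts = pts[np.isfinite(pts).all(axis=1)]      # drop invalid depth readings
    centroid = pts.mean(axis=0)
    dist = np.linalg.norm(pts - centroid, axis=1)
    return pts[dist < margin_m]                  # remove background outliers
```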
The pose can now be estimated using an initial
pose guess followed by ICP. In the use case of this pa-
per, the bin is always positioned upright on the floor.
This reduces the pose estimation to 3 Degrees of Free-
Table 1: Intersection over Union (IoU) results and Frames
Per Second (FPS) of the segmentation network with varying
backbones and input resolution. The mean IoU is calculated
over 5 networks with the same hyper-parameters, and the
standard deviation is indicated with ±.
Backbone          Mean IoU        FPS
Xception          0.894 ± 0.005   41.4
MobileNetV2       0.870 ± 0.026   61.7
DenseNet121       0.872 ± 0.025   41.9
ResNet50          0.804 ± 0.029   40.5

Input resolution  Mean IoU        FPS
288 × 160         0.855 ± 0.018   141.5
384 × 224         0.847 ± 0.025   95.9
512 × 288         0.870 ± 0.026   61.7
672 × 384         0.827 ± 0.004   41.5
896 × 512         0.779 ± 0.070   24.6
dom (DoF). In addition to this, the object is four-way symmetrical. This means that the object model can
be placed in the centroid of the cropped cloud as the
initial pose guess. Afterwards the pose is found us-
ing a three-fold coarse-to-fine ICP scheme. First the
scene and model clouds are heavily downsampled us-
ing a voxelgrid with leaf size depending on the ob-
ject size and desired speed. Then ICP is used with a
correspondence distance double the leaf size to find
an initial transformation. This is repeated twice us-
ing the previously found transformation as the initial
guess, halving the leaf size and correspondence dis-
tance at each step. The final transformation is then an
estimation of the pose of the model in the scene.
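A sketch of the coarse-to-fine ICP scheme is given below, assuming the Open3D library; the specific library, the initial leaf size, and the function name are assumptions, while the halving of the leaf size and correspondence distance follows the description above.

```python
# Three-fold coarse-to-fine ICP refinement (assumed Open3D implementation).
import numpy as np
import open3d as o3d

def coarse_to_fine_icp(scene, model, init_pose, leaf_size=0.04, levels=3):
    """Refine init_pose by running ICP `levels` times, halving the voxel
    leaf size and the correspondence distance at every level."""
    transform = np.asarray(init_pose)
    for _ in range(levels):
        scene_ds = scene.voxel_down_sample(leaf_size)
        model_ds = model.voxel_down_sample(leaf_size)
        result = o3d.pipelines.registration.registration_icp(
            model_ds, scene_ds,
            max_correspondence_distance=2.0 * leaf_size,
            init=transform)
        transform = result.transformation
        leaf_size /= 2.0
    return transform

# init_pose: 4x4 matrix placing the model at the centroid of the cropped cloud.
```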
In case of 6-DoF pose estimation problems with
unsymmetrical objects, the multi-hypothesis pose es-
timation scheme presented in (Wong et al., 2017) can
be used. They replace the initial pose guess with mul-
tiple random guesses with the same centroid but ran-
dom rotation. ICP is then used with all guesses, and
the best pose is found using a pose evaluation metric.
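Such a multi-hypothesis scheme could be sketched as follows, again assuming Open3D; the number of hypotheses and the use of ICP fitness as a stand-in for the pose evaluation metric are illustrative assumptions.

```python
# Hedged sketch of the multi-hypothesis variant from (Wong et al., 2017):
# several random-rotation guesses share the segmented centroid, and the
# guess with the best ICP fitness is kept.
import numpy as np
import open3d as o3d

def multi_hypothesis_icp(scene, model, centroid, n_hypotheses=20, leaf_size=0.04):
    best_pose, best_fitness = None, -1.0
    for _ in range(n_hypotheses):
        guess = np.eye(4)
        guess[:3, :3] = o3d.geometry.get_rotation_matrix_from_xyz(
            np.random.uniform(0.0, 2.0 * np.pi, size=3))
        guess[:3, 3] = centroid
        result = o3d.pipelines.registration.registration_icp(
            model, scene, max_correspondence_distance=2.0 * leaf_size, init=guess)
        if result.fitness > best_fitness:
            best_pose, best_fitness = result.transformation, result.fitness
    return best_pose
```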
4 RESULTS
The first step in creating the pose estimation pipeline
is the training of the segmentation network. It is de-
sired that the network accurately segments the object,
while being lightweight and having a high through-
put. Therefore, various network backbones and input
resolutions are tested. For each backbone 5 networks
are trained on the green bin synthetic ground truth at
512 × 288 input resolution. The trained networks are
used on the test set consisting of 150 images of the bin
labeled using the depth channel, and 50 images with-
out the bin. Many of the test images contain the green
decoy object and other green objects. The Intersection
over Union (IoU) between the segmentation output
and ground truth is calculated. The mean IoU, standard deviation, and Frames Per Second (FPS) are calculated for the 5 networks using the same backbone, for
every backbone. All backbones are pretrained on Im-
ageNet (Russakovsky et al., 2015). These results can
be seen in Table 1. For further investigation, we de-
cided to continue with the MobileNet (Howard et al.,
2017) implementation due to the high mean IoU and a
significantly higher FPS compared to using the other
backbones. Since this must run on a mobile robot, the computational time is more important than a few percent higher IoU.
Besides the network backbone, the input resolu-
tion was also tested. Similar to before, 5 networks us-
ing the MobileNet backbone are trained for each input resolution, and the mean IoU, standard deviation, and FPS are calculated (Table 1). One result which stands out is the networks trained at 896 × 512, which lead to a low mean IoU and high variance. An explanation
for this could be that the pixel quantization hides the
blending artifacts in lower resolution images, leading
to a larger mismatch between the synthetic training
data and real test data at higher resolutions. Similar
to the choice of backbone, the reduced memory footprint and higher inference speed when using 288 × 160 images outweigh the slight gain in mean IoU when using 512 × 288 images. Therefore, the network chosen for
the pose estimation tests was the network scoring the
highest IoU of the 5 networks with 288 × 160 input
resolution and MobileNet backbone.
The accuracy of the pose estimation pipeline is
tested using a dataset with known ground truth poses.
The ground truth poses are found using detachable
AR markers (Figure 9). The markers need to be de-
tachable, since they should not be visible when seg-
menting the image. The detachable markers are cal-
ibrated to the center of the bin using two calibration
markers. Each marker aligns with a side of the square
bin, allowing the center to be calculated by finding
the intersection between the pairs of marker planes,
and then averaging them. This transformation is cal-
culated for 250 positions of the bin, after which the
calibration markers are removed.
The test data is made by placing the bin randomly
in a different setting than the one used for capturing
the object images which the network was trained on.
The ground truth pose is then found by averaging the
marker poses over 50 frames. The markers are then
removed and 25 image/point cloud sets are recorded.
This is repeated 150 times with different bin place-
ments resulting in 3750 image/point cloud pairs. A
Figure 9: The markers used for the pose estimation ground
truth. (a) shows the bin with the calibration and detachable
markers, (b) shows the bin with the detachable markers, and
(c) shows the bin with no markers.
Figure 10: Histograms of the position and rotation errors in
the estimated poses on the test set. Errors larger than the red
lines will result in a failed docking, while errors lower than the green lines always result in a successful docking.
pose is estimated for each pair using the described
pose estimation method.
In the use case the laundry bin is restricted to 3-
DoF. I.e., it is limited to an upright position with the
wheels on the ground so that it can be picked up by the
robot. Therefore, ground truth and estimated poses
are projected to an (x, y) position along the ground
plane, and a θ rotation around the normal vector to
the ground plane. A positional error is calculated as
the L2-norm between the ground truth and estimated
positions, and a rotational error as the difference be-
tween the ground truth and estimated rotations. Be-
cause of the four-way symmetrical property of the
bin, the estimated pose can be rotated 90° or 180° from the ground truth and still be correct. Therefore, 90° is subtracted from the rotation error until it lies between -45° and 45°, after which the absolute value is taken, resulting in a rotation error between 0° and 45°. Before testing the pose estimation accuracy, the maximum rotational and positional errors which the docking algorithm used in the use case can handle were
found. This was done by placing the bin with a ran-
dom pose behind the robot, and trying to dock with
it. The test was repeated with various poses in order
to find a threshold where the docking would always
succeed, and a threshold where it would always fail.
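The error metrics used here can be summarized in a short sketch; the folding of the rotation error into [0°, 45°] follows the description above, while the function name and signature are assumptions.

```python
# Position and rotation errors for the 3-DoF, four-way symmetrical bin.
import numpy as np

def pose_errors(gt_xy, est_xy, gt_theta_deg, est_theta_deg):
    """Return (position error, rotation error in degrees)."""
    pos_err = np.linalg.norm(np.asarray(gt_xy) - np.asarray(est_xy))
    rot_err = est_theta_deg - gt_theta_deg
    rot_err = (rot_err + 45.0) % 90.0 - 45.0   # wrap into [-45°, 45°)
    return pos_err, abs(rot_err)               # final error lies in [0°, 45°]
```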
Figure 10 shows histograms of the position and rota-
tion errors on the test set. The red lines represent the errors over which the docking will fail, and the green lines represent the errors under which the docking is guaranteed to succeed. Most estimates lie below the green lines, while no estimates lie above the red lines.
It can be seen that the pose estimation has a position
bias of about 2 cm. This could be explained by the un-
certainty in the marker calibration. This is not present
in the rotation errors, since the ground truth rotation
is not calibrated, but instead is the average orienta-
tion of the two detachable markers. Figure 11 shows
results from the segmentation and pose estimation of
both the grey and green bins.
5 CONCLUSION
We present a method to effortlessly solve pose es-
timation of known objects without prior knowledge
about the context of the object, by presegmenting the
point cloud using a segmentation network. The net-
work is trained on a synthetic ground truth, created
by taking a small number of images of the object of
interest. The object images are blended with an ar-
bitrary background from MS COCO. We show that
the method works on two different objects, a generic
garbage bin and a use-case-specific bin. When the two objects are placed in a real-world context, the output of the trained segmentation network is robust enough to
crop the point cloud such that it can be used as input
to the ICP algorithm for pose estimation. We train the
network using different backbones and input resolu-
tions, and show how changing the network affects the
intersection over union between the ground truth and
segmentation output.
We test the effect of varying the backbone and in-
put size of the segmentation network, and find that the
best compromise for our use case is using MobileNet and a 288 × 160 image size, since the network is to be
Figure 11: Outputs of the segmentation networks (left col-
umn) and corresponding estimated poses (right column) in
the scene clouds.
deployed on a mobile platform. We show that the pose
of the object can be estimated using the centroid of the
presegmented cloud as an initial pose guess followed
by ICP, thereby not having to rely on a more sophis-
ticated pose estimation algorithm. We show that the
accuracy of the resulting pose estimation is within the
constraints of the use case.
REFERENCES
Besl, P. J. and McKay, N. D. (1992). A method for regis-
tration of 3-d shapes. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Do, T.-T., Pham, T., Cai, M., and Reid, I. D. (2018). Lienet:
Real-time monocular object instance 6d pose estima-
tion. In BMVC.
Dwibedi, D., Misra, I., and Hebert, M. (2017). Cut, paste
and learn: Surprisingly easy synthesis for instance de-
tection. CoRR, abs/1708.01642.
Fischler, M. A. and Bolles, R. C. (1981). Random sample
consensus: A paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Commun. ACM.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial networks. Ad-
vances in Neural Information Processing Systems.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional neu-
ral networks for mobile vision applications. CoRR,
abs/1704.04861.
Johnson-Roberson, M., Barto, C., Mehta, R., Nittur Sridhar,
S., and Vasudevan, R. (2016). Driving in the matrix:
Can virtual worlds replace human-generated annota-
tions for real world tasks? CoRR, abs/1610.01983.
Juel, W. K., Haarslev, F., Ramírez, E. R., Marchetti, E., Fischer, K., Shaikh, D., Manoonpong, P., Hauch, C., Bodenhagen, L., and Krüger, N. (2019). Smooth robot:
Design for a novel modular welfare robot. Journal of
Intelligent & Robotic Systems.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick,
R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
and Zitnick, C. L. (2014). Microsoft COCO: common
objects in context. CoRR, abs/1405.0312.
Pérez, P., Gangnet, M., and Blake, A. (2003). Poisson im-
age editing. ACM Trans. Graph.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
genet large scale visual recognition challenge. Inter-
national Journal of Computer Vision.
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman,
W. T. (2008). Labelme: A database and web-based
tool for image annotation. Int. Journal of Computer
Vision.
Stein, G. J. and Roy, N. (2018). Genesis-rt: Generating syn-
thetic images for training secondary real-world tasks.
Int. Conf. on Robotics and Automation (ICRA).
von Ahn, L. and Dabbish, L. (2004). Labeling images with
a computer game. Proceedings of the SIGCHI Con-
ference on Human Factors in Computing Systems.
Wong, J. M., Kee, V., Le, T., Wagner, S., Mariottini, G.,
Schneider, A., Hamilton, L., Chipalkatty, R., Hebert,
M., Johnson, D. M. S., Wu, J., Zhou, B., and Torralba,
A. (2017). Segicp: Integrated deep semantic segmen-
tation and pose estimation. In Int. Conf. on Intelligent
Robots and Systems (IROS).
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2017).
Posecnn: A convolutional neural network for 6d
object pose estimation in cluttered scenes. CoRR,
abs/1711.00199.
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., and Sang,
N. (2018). Bisenet: Bilateral segmentation net-
work for real-time semantic segmentation. CoRR,
abs/1808.00897.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017).
Unpaired image-to-image translation using cycle-
consistent adversarial networks. Int. Conf. on Com-
puter Vision (ICCV).