AbSynth: Using Abstract Image Synthesis for Synthetic Training
Dominik Penk¹,², Maik Horn², Christoph Strohmeyer², Bernhard Egger¹, Marc Stamminger¹ and Frank Bauer¹
¹Chair of Visual Computing, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstraße 11, Erlangen, Germany
²Schaeffler Technologies AG & Co. KG, Industriestraße 1-3, Herzogenaurach, Germany
Keywords:
Synthetic Training Data, Domain Gap, Deep Learning, Computer Vision.
Abstract:
We present a novel pipeline for training neural networks to tackle geometry-induced vision tasks, relying solely
on synthetic training images generated from (geometric) CAD models of the objects under consideration.
Instead of aiming for photorealistic renderings, our approach maps both synthetic and real-world data onto a
common abstract image space reducing the domain gap. We demonstrate that this projection can be decoupled
from the downstream task, making our method an easy drop-in solution for a variety of applications. In this
paper, we use line images as our chosen abstract image representation due to their ability to capture geometric
properties effectively. We introduce an efficient training data synthesis method that generates images tailored
for transformation into a line representation. Additionally, we explore how the use of sparse line images opens
up new possibilities for augmenting the dataset, enhancing the overall robustness of the downstream models.
Finally, we provide an evaluation of our pipeline and augmentation techniques across a range of vision tasks
and state-of-the-art models, showcasing their effectiveness and potential for practical applications.
1 INTRODUCTION
In modern-day industrial computer vision applica-
tions, deep learning, specifically convolutional neu-
ral networks (CNNs) (Ciresan et al., 2011), plays an
important role. Usually, they are trained in a super-
vised fashion, leveraging annotated datasets of im-
age or video data. These methods routinely outper-
form humans or hand-crafted approaches on industry-
relevant tasks like object classification, detection, or
anomaly detection. State-of-the-art models are typi-
cally trained on large-scale public datasets, e.g., Ima-
geNet (Isola et al., 2017) or CoCo (Lin et al., 2014).
However, these models must be fine-tuned on a use-
case-specific dataset to be used in production. Unfor-
tunately, such datasets are often not publicly available
and must be created manually, a time-consuming and
costly process prone to errors.
In this paper, we propose a novel pipeline for
training or fine-tuning neural networks without reliance
on real-world data. Our approach requires only a
CAD model of the objects of interest and does not rely
on realistic rendering like other concurrent work. Fur-
thermore, our method does not require any additional
information about surface color or reflectance. This
makes our pipeline well-suited for industry-relevant
Figure 1: The proposed pipeline projects real and synthetic training images ($I_{\text{real}}$ and $I_{\text{synth}}$, the latter rendered from a CAD model) to an abstract representation $I_{\text{abs}}$ using a neural network F, which we then pass on to the downstream task G.
applications, where the shape of the object is the main
contributor to the solution and CAD models are
usually readily available.
The proposed, task-agnostic, Abstract Image
Synthesis (AbSynth) pipeline is depicted in Fig. 1.
The core idea of this pipeline is to transform both
real-world and synthetic images into a shared abstract
representation $I_{\text{abs}}$ using a common transformation F.
We will show that F can be fixed for various geomet-
rically based tasks. Consequently, we can then train a
network, called G in Fig. 1, in a supervised fashion.
At the same time, we will show that the reduced de-
gree of detail leads to straightforward data synthesis
so that no real data or manual labeling is required for
training.
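To make the decoupling concrete, the following minimal sketch (in PyTorch-style code with hypothetical module handles F and G, not the exact implementation) illustrates how a training step looks when the abstraction network F is kept frozen and only the downstream network G is optimized:

```python
import torch

def training_step(F, G, optimizer, loss_fn, images, labels):
    """One supervised step of the AbSynth setup: the abstraction network F is
    frozen and task-agnostic, only the downstream network G is trained."""
    with torch.no_grad():
        abstract = F(images)        # I_abs = F(I_synth) or F(I_real)
    predictions = G(abstract)       # G only ever sees the abstract domain
    loss = loss_fn(predictions, labels)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow only through G
    optimizer.step()
    return loss.item()
```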
2 RELATED WORK
Creating photo-realistic color images is difficult and
time-consuming since many user-defined parameters
are required. This includes scene parameters, e.g., re-
alistic lighting conditions and camera properties like
sensor noise and distortion. All these have a subtle
effect, but a network trained on real data might learn
some distinguishing features from them. Similarly, a
network trained on synthetic RGB images might ex-
tract some features from rendering artifacts, e.g., sam-
pling noise on a specific surface material description.
Due to these many minute details, the domain gap be-
tween real and simulated color images is commonly
large.
Narrowing or bridging this gap is an active re-
search topic. Tobin et al. were the first to introduce
the idea of domain randomization for synthetic RGB
data (Tobin et al., 2017). Their work primarily fo-
cuses on removing the reliance on accurate surface
parameters. The core idea is to force the downstream
network to be robust against domain shift by present-
ing many parameter variations to the network dur-
ing training. With this approach, the real-world im-
ages appear to be merely a different variation. Con-
cretely, they demonstrate that they can train a sim-
ple object localizer to identify objects with varying
color textures using exclusively synthetic data. Follow-up work extended this simple idea to other scene
parameters, e.g., by adding distractor objects (Trem-
blay et al., 2018) or random backgrounds (Dosovit-
skiy et al., 2015).
A different approach called sim-to-real tries to ex-
plicitly learn a transformation from the distribution
of synthetic to real-world images. This mapping is
a general Image-to-Image translation usually realized
using a Generative Adversarial Network (GAN) ar-
chitecture (Goodfellow et al., 2020; Karras et al.,
2020). Numerous GAN variations were introduced
in recent years, which primarily differ in the data re-
quired to train them. For instance, pix2pix, intro-
duced by (Isola et al., 2017), uses semantic labels,
which are passed to the discriminator and generator
networks. The authors show their architecture out-
performs color-based GAN architectures. Other ap-
proaches rely only on color data from real-world im-
ages without semantic labels for domain supervision.
One such approach is the Cycle-GAN architecture in-
troduced by (Zhu et al., 2017). They also learn the
inverse mapping, real-world to synthetic images, and
enforce a cycle, from one domain to the other and
back, to produce an image similar to the original in-
put.
In recent publications, diffusion models are
used to perform Image-to-Image translation (Croitoru
et al., 2023).
2.1 Line Drawings
Instead of creating more realistic color images,
(Harary et al., 2022) proposed that edge images can
be used as the basis for domain generalization. They
use edge images to guide a learned bridge domain that
encapsulates all necessary information for a specified
downstream task. They use the bridge domain images
to force the network, which solves the downstream
task, to generalize over multiple input domains, rang-
ing from color images to paintings.
Other studies (Goodman, 2022; Kennedy and
Ross, 1975; Hertzmann, 2021a; Hertzmann, 2021b)
have shown that edge images convey a strong sense
of geometry and can even be used to predict depth
images.
Based on these observations, we have chosen line
images as the abstract intermediate representation for
our AbSynth pipeline. This implies that the function
F needs to take an RGB image and output a line draw-
ing of the same image. Usually, an implementation of
F only produces a single-channel image. However,
as depicted in Fig. 2e, we can combine multiple in-
stances of F to form the final abstract representation
$I_{\text{abs}}$. We will show that this combined abstract representation often improves performance.
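As a hedged illustration of this combination, assuming each style network is a frozen PyTorch module that returns a single-channel line image, the three outputs can simply be stacked as the channels of one RGB-like abstract image:

```python
import torch

def combined_abstraction(image, style_nets):
    """Combine three single-channel line drawings (e.g. Anime, BrAD and Open
    Sketch) into one three-channel abstract image, as in Fig. 2e. Each entry
    of `style_nets` is assumed to map (B, 3, H, W) color images to
    (B, 1, H, W) line images."""
    assert len(style_nets) == 3
    with torch.no_grad():
        channels = [net(image) for net in style_nets]
    return torch.cat(channels, dim=1)   # (B, 3, H, W)
```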
Edge Detection
Edge detection is a natural fit for F, and since
it is a fundamental technique in image processing,
many algorithms were developed for this task. One
of the most widely used edge detection methods is
the Canny edge detection algorithm (Canny, 1986),
which involves applying a series of image filters to
smooth the image and highlight areas containing large
intensity gradients. Unfortunately, these methods expose several user-chosen parameters, e.g., thresholds on intensity differences, which must be chosen carefully to achieve a good edge image. Simple image filters such as the Sobel, Prewitt, and Roberts operators can also be used for edge detection. However, these filters often produce noisy or incomplete edges.
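For reference, a minimal OpenCV-based sketch of such a classical choice for F is shown below; the two thresholds are exactly the user-chosen parameters mentioned above, and the inversion to dark lines on a white background is an assumption made to match the line-drawing styles used later:

```python
import cv2
import numpy as np

def canny_abstraction(image_bgr, low=100, high=200):
    """Classical realization of F with Canny edges. `low` and `high` are the
    user-chosen thresholds and typically need tuning per dataset. The result
    is inverted so that lines are dark (0.0) on a white (1.0) background."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 1.4)      # smooth before gradient computation
    edges = cv2.Canny(gray, low, high)              # uint8 mask, 255 on edges
    return 1.0 - edges.astype(np.float32) / 255.0
```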
More recent, data-driven approaches like Holistically-Nested Edge Detection (HED), presented by (Xie and Tu, 2015), can be pre-trained and used without the need to set parameters. For completeness, we classify the intermediate bridge domain BrAD introduced by (Harary et al., 2022) as a version of edge detection since it uses the same model as HED.
Figure 2: Different line styles for the same input: (a) color input, (b) Anime, (c) BrAD, (d) Open Sketch, (e) combined. Subfigure (e) displays the three styles combined as a single RGB image.
The authors also ensured that the resulting bridge domain images remained visually similar to the output of the original HED network.
Style Transfer
In its most general form, F is a generic image-to-
image transformation. Many data-driven approaches
have been developed to facilitate such transformations
in recent years. For our approach, we are particu-
larly interested in the field of style transfer. Here,
the goal is to input an arbitrary color image and re-
turn a new image with the same content in another
style. Applications in this domain range from sim-
ple image colorization to transforming a photograph
into a cubist drawing. We model F under the style
transform paradigm by training a network to produce
a given style of line drawings. Some example styles
are shown in Fig. 2 and range from artistic anime to
technical drawings.
The paper "Informative Drawings: Learning to generate line drawings that convey geometry and semantics" by (Chan et al., 2022) presents a method
tailored to precisely this problem and is focused on
generating line drawings that accurately convey both
the geometry and the semantics of the original color
image. They developed an updated training pipeline
incorporating geometric and semantic consistency be-
tween the input color image and the resulting line
drawing.
They used an explicit semantic loss function be-
tween the input and output images to achieve this.
This loss uses the network called CLIP (Contrastive
Language-Image Pre-training) (Radford et al., 2021),
which computes an embedding vector of an image.
The CLIP model was trained on pairs of images and
their descriptions such that the embedding captures the
most important semantic information. The semantic
loss then compares the embeddings of the original
color image and the resulting line drawing, forcing
the network to produce images with similar semantic
information.
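A minimal sketch of such a semantic loss is given below; it assumes a frozen CLIP image encoder that maps batches of images to embedding vectors and is not the authors' exact implementation:

```python
import torch
import torch.nn.functional as nnf

def semantic_loss(clip_image_encoder, color_image, line_drawing):
    """CLIP-based semantic loss sketch: embed both images with a frozen CLIP
    image encoder and penalize dissimilar embeddings. The encoder is assumed
    to map (B, 3, H, W) images to (B, D) embeddings; the single-channel line
    drawing is repeated to three channels first."""
    with torch.no_grad():
        target = clip_image_encoder(color_image)           # no gradient needed
    pred = clip_image_encoder(line_drawing.expand(-1, 3, -1, -1))
    return (1.0 - nnf.cosine_similarity(pred, target, dim=-1)).mean()
```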
However, (Chan et al., 2022) point out that semantics alone do not ensure a geometrically consistent transformation. For instance, both images could
contain a plane, but in one image, it is flying in the
air, while it is parked on the ground in the other one.
They, therefore, introduce a geometric loss that uses a
pre-trained monocular depth estimator network. Dur-
ing training, they assume known depth maps for the
color images and use the depth estimator network to
compute depths for the generated line drawings. The
loss is simply the average per-pixel difference be-
tween those two depth maps. Interestingly, the sin-
gle image depth estimator uses VGG19 features and
is pre-trained on color images. This confirms the ob-
servation made by (Hertzmann, 2021b) that line draw-
ings can adequately convey geometry.
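The geometric loss can be sketched analogously; the depth estimator is assumed to be a frozen network that predicts a depth map directly from the line drawing:

```python
import torch

def geometry_loss(depth_estimator, line_drawing, depth_ground_truth):
    """Geometric loss sketch: a frozen monocular depth estimator predicts a
    depth map from the generated line drawing, and the loss is the mean
    per-pixel difference to the known depth map of the color image."""
    predicted_depth = depth_estimator(line_drawing)
    return torch.mean(torch.abs(predicted_depth - depth_ground_truth))
```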
3 METHOD
3.1 Synthetic Image Generation
We now present our synthetic data generation process
for the AbSynth pipeline. Our goal is to provide a
simple rendering setup that only requires the geomet-
ric data of target objects and produces training images
containing enough visual information to be converted
to a realistic edge image. The method is inspired by
the early work of (DeCarlo et al., 2003). They informally introduce suggestive contours as those regions
on a mesh that are real contours in nearby viewpoints.
They also define them more formally using the contour indicator function on a smooth surface $S$:
$$ \mathbf{n}(\mathbf{p})^{T}\, \mathbf{v}(\mathbf{p}) \qquad (1) $$
where $\mathbf{p} \in S$ is a surface point, $\mathbf{n}(\mathbf{p})$ is the unit surface normal, and $\mathbf{v}(\mathbf{p})$ is the view vector from the camera to the surface point.
Figure 3: (a)-(c): Shading of synthetic training images combines a contour indicator function and colors based on surface normals: (a) contour indicator, (b) surface color, (c) combined. During data synthesis, we can easily generate per-pixel labels, e.g., the instance masks in (d).
With this, suggestive contours
are the local minima of this indicator function, and
real contours are found at the roots. DeCarlo et al.
show that an approximation of these minima can be
found by rendering the object and shading the pix-
els using the dot product from Eq. (1). The resulting
images look like Fig. 3a, and intensity ridges are the
suggestive contours.
Since the suggestive contour map defined by Eq. (1) only yields the surface intensity, we can use the hue to provide more information. For this, we adopted an idea from the field of surface normal estimation, where differently colored lights are placed along the cardinal directions of a predefined coordinate system. If the object is diffuse and white, a surface directly facing one of these directions will reflect only that light, whereas partially rotated ones will take on a mixed color. We approximate this effect by using the surface normal, mapped from $[-1, 1]^3$ to $[0, 1]^3$, as the color of the surface; this produces the colors depicted in Fig. 3b. If multiple objects are in the scene, we also apply a random hue shift per object to ensure good visual separation.
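A small NumPy sketch of this false-color shading is given below; it assumes per-pixel unit normals and view directions produced by a rasterization pass and omits the per-object random hue shift:

```python
import numpy as np

def shade_false_color(normals, view_dirs):
    """False-color shading sketch: pixel intensity is the contour indicator
    n(p)^T v(p) from Eq. (1); the base color is the surface normal remapped
    from [-1, 1]^3 to [0, 1]^3. Both inputs are (H, W, 3) arrays of unit
    vectors; the per-object random hue shift would be applied on top."""
    intensity = np.abs(np.sum(normals * view_dirs, axis=-1))   # |n . v| per pixel
    base_color = 0.5 * (normals + 1.0)                         # normal -> color
    return intensity[..., None] * base_color                   # (H, W, 3) image
```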
3.2 Abstract Augmentation
A common method to improve generalization and
prevent shortcut learning is dataset augmentation. Of
course, typical image-level augmentations can be ap-
plied together with the AbSynth pipeline. Most pixel-
wise augmentations like randomized hue shifts or
Gaussian blur have a negligible effect since these
changes barely affect F. On the other hand, content-
altering methods, e.g., random flipping or image crop-
ping, do not lose any potency in our setup.
Since $I_{\text{abs}}$ is still an image, we can use augmenta-
tion during training directly on the abstract represen-
tation. As we will see in this subsection, the sparse
nature of the line images enables augmentations that
are impossible or hard to pull off in the original RGB
space.
Figure 4: Sample of an abstract training image with (a) naive fusion and (b) deferred fusion. The line fusion adds the background. In the deferred approach, lines on the foreground objects follow the surface curvature.
3.2.1 Line Fusion
Remixing multiple images from a dataset to gener-
ate new images and add variety is a potent method.
A famous example of such a technique is the mo-
saic data augmentation introduced by (Bochkovskiy
et al., 2020), which greatly contributed to improved
object detection performance in YOLOv4. However,
due to the high complexity of natural images, these
approaches are usually rather limited: We can either
mix two images using alpha blending or stitch them
to form a bigger image.
In contrast, using the sparse nature of our abstract
representation enables us to combine the features of
two images more easily. Say we want to enrich a
training image with distracting features from a differ-
ent (natural) image J. We first compute $J_{\text{abs}}$ using the image-to-image transformation F. Then we create a fused image $\hat{I}_{\text{abs}}$ by taking the pixel-wise minimum:
$$ \hat{I}_{\text{abs}} = \min(I_{\text{abs}},\, J_{\text{abs}}) \qquad (2) $$
Since this method essentially merges the lines of two
abstract images, we call this augmentation Line Fu-
sion (LF). In Section 4.1, we show that this simple
approach can improve generalization since the addi-
tional, randomized complexity forces the downstream
network G to distill more robust features.
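A minimal sketch of this naive Line Fusion, assuming abstract images are float arrays in [0, 1] with 1.0 as background so that the minimum keeps the dark lines of both inputs:

```python
import numpy as np

def naive_line_fusion(i_abs, distractor_rgb, F):
    """Naive Line Fusion (Eq. 2): abstract the natural distractor image J with
    F and merge its lines into the training image by a pixel-wise minimum."""
    j_abs = F(distractor_rgb)          # J_abs = F(J)
    return np.minimum(i_abs, j_abs)    # merged abstract image
```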
However, the resulting fused images are usually
not plausible since the lines of $J_{\text{abs}}$ are painted over
the original training image without any geometric rea-
soning. If we control the data synthesis process, or if
the dataset provides more information besides color,
we can turn this LF augmentation into a potent con-
tent augmentation technique. Usually, any dataset
contains images with objects of interest in the fore-
ground. If the dataset provides a per-pixel instance
mask for the target objects, we can ensure that they
stay in front of the distractor image J by only fusing
pixels where the instance mask M indicates the back-
ground:
$$ \hat{I}_{\text{abs}}(\mathbf{q}) = \begin{cases} \min(I_{\text{abs}}(\mathbf{q}),\, J_{\text{abs}}(\mathbf{q})) & \text{if } M(\mathbf{q}) = 0 \\ I_{\text{abs}}(\mathbf{q}) & \text{else} \end{cases} \qquad (3) $$
Here, we assume that the value of M is the instance id, 0
being the background. This method allows us to ran-
domize the background, and to place the target objects
in varying environments.
Our data generation outlined in Section 3.1 does
not use surface textures, leading to featureless sur-
faces. Real-world objects, on the other hand, often
contain non-geometric texture details which impede
network performance during inference. We can use
LF to counteract this by adding random surface tex-
tures to foreground objects using a second distractor
image T. We produce the randomized surface texture by fusing $T_{\text{abs}}$ with the foreground regions of $I_{\text{abs}}$. We use a random pixel offset $\delta_k$ per object instance to access $T_{\text{abs}}$, which makes sure that the distractor image is broken up, even if many target objects overlap:
$$ \hat{I}_{\text{abs}}(\mathbf{q}) = \begin{cases} \min(I_{\text{abs}}(\mathbf{q}),\, J_{\text{abs}}(\mathbf{q})) & \text{if } M(\mathbf{q}) = 0 \\ \min(I_{\text{abs}}(\mathbf{q}),\, T_{\text{abs}}(\mathbf{q} + \delta_k)) & \text{if } M(\mathbf{q}) = k \end{cases} \qquad (4) $$
A result of this naive fore- and background fusion ap-
proach is depicted in Fig. 4a.
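The following sketch illustrates Eqs. (3) and (4) together, assuming single-channel float abstract images, an integer instance mask, and a NumPy random generator; applying the offset $\delta_k$ with wrap-around is one possible choice, not necessarily the original implementation:

```python
import numpy as np

def masked_line_fusion(i_abs, j_abs, t_abs, instance_mask, rng):
    """Masked Line Fusion following Eqs. (3) and (4): background pixels
    (mask == 0) are fused with the background distractor J_abs, foreground
    pixels of instance k are fused with the texture distractor T_abs sampled
    at a random per-instance offset delta_k (applied with wrap-around here).
    All abstract images are (H, W) float arrays with 1.0 as background."""
    fused = i_abs.copy()
    background = instance_mask == 0
    fused[background] = np.minimum(i_abs[background], j_abs[background])
    h, w = i_abs.shape
    for k in np.unique(instance_mask):
        if k == 0:
            continue
        dy, dx = rng.integers(0, h), rng.integers(0, w)          # delta_k
        shifted = np.roll(np.roll(t_abs, dy, axis=0), dx, axis=1)
        region = instance_mask == k
        fused[region] = np.minimum(i_abs[region], shifted[region])
    return fused
```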
Upon closer inspection of this image, we see that
the overall curvature of the coffee cups is hard to un-
derstand due to the flat lines pasted over this region.
This is contrary to our assumption that the down-
stream task should be mainly focused on geometric
features. To ensure the fused real image follows the
actual surface geometry, we borrow an idea of de-
ferred shading and output a uv-mask Φ for the ren-
dered training image. We then use Φ to warp and map
$T_{\text{abs}}$ along the surface of the target objects:
$$ \min\bigl(I_{\text{abs}}(\mathbf{q}),\, T_{\text{abs}}(\Psi_k\, \Phi(\mathbf{q}))\bigr) \quad \text{for } M(\mathbf{q}) = k \qquad (5) $$
Here, we also apply a per-instance random coordinate transformation $\Psi_k$ to the uv coordinates. In Fig. 5a, we show an example where a logo is warped onto the cups using this approach. The mapped lines are thinner and less visible than with the naive warping method. This aliasing effect is caused by the mapping of the (potentially) large and very sparse line image $T_{\text{abs}}$ onto a comparatively small region.
Figure 5: Difference between warping the abstract lines (a, late warping) versus warping the distractor image (b, early warping). Warping lines produces magnification and minification artifacts. In contrast, early warping produces clean synthetic images.
Figure 6: Abstract representation with random erase augmentation. On the left, the augmentation was performed on the color image; on the right, in the abstract domain.
We can address this problem by applying the image warping to T before we apply F and then using Eq. (4) with a constant $\delta_k = 0$. Applying this early-warping method to the logo of the example, we obtain the new foreground distractor depicted on the lower left in Fig. 5a. The resulting foreground fusion on the right is much cleaner, with even line thickness and brightness.
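A sketch of this early-warping variant, assuming the renderer writes out the uv-mask Φ as an (H, W, 2) array of coordinates in [0, 1] and using OpenCV's remap for the resampling:

```python
import cv2
import numpy as np

def early_warp_distractor(distractor_rgb, uv_map, F):
    """Early warping (Fig. 5): resample the distractor image along the
    rendered uv-mask Phi *before* applying the abstraction F; the result is
    then fused with Eq. (4) using a constant delta_k = 0. `uv_map` is assumed
    to be an (H, W, 2) float32 array of uv coordinates in [0, 1]."""
    h, w = distractor_rgb.shape[:2]
    map_x = (uv_map[..., 0] * (w - 1)).astype(np.float32)
    map_y = (uv_map[..., 1] * (h - 1)).astype(np.float32)
    warped = cv2.remap(distractor_rgb, map_x, map_y, cv2.INTER_LINEAR)
    return F(warped)   # abstract the already-warped distractor
```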
3.2.2 Random Erase
Random erasing, introduced by (Zheng et al., 2021),
is a popular augmentation method for object detec-
tion and instance segmentation to simulate object oc-
clusion. This augmentation selects a random subre-
gion of the input image and fills it with a constant
value. We tried to apply this augmentation directly to the intermediate representation, i.e., onto the abstract image $I_{\text{abs}}$. As shown in Fig. 6, the results differ considerably depending on the time of content removal. Applying the erase augmentation to the color image yields a visible rectangle that occludes the object.
Table 1: Comparison of classification accuracy for a ResNet34 trained on synthetic datasets and evaluated on a subset of CoCo, without and with the AbSynth pipeline. Multiple values per style correspond to different combinations of the LF and RE augmentations.

Dataset      Style          Accuracy
VisDA 2017   Baseline       0.363
             Ours (BrAD)    0.429 / 0.566 / 0.547
AbsDA        Baseline       0.288
             Ours (BrAD)    0.430 / 0.592 / 0.488
Erasing regions in the abstract image mimics another observation: real-world images may contain areas where object boundaries are not visible due to low contrast. Specular reflections, shadows, or similar surface colors may cause this. Since all the presented versions of F rely, at least partially, on intensity gradients, no line will be drawn in these regions. In contrast, our data generation usually has high contrast, leading to images with clearly defined object boundaries. We assumed that random line erasing on the abstract image might improve the downstream network for some datasets. As we will see, this was not the case for our experiments.
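For completeness, a sketch of random erasing applied in the abstract domain, where the erased region is filled with the background value so that the lines simply disappear there:

```python
import numpy as np

def random_line_erase(i_abs, rng, max_fraction=0.25):
    """Random erase in the abstract domain: fill a random axis-aligned region
    with the background value (1.0), i.e. remove all lines there. This mimics
    object boundaries that vanish in real images due to low contrast."""
    h, w = i_abs.shape[:2]
    eh = rng.integers(1, max(2, int(h * max_fraction)))
    ew = rng.integers(1, max(2, int(w * max_fraction)))
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    erased = i_abs.copy()
    erased[y:y + eh, x:x + ew] = 1.0
    return erased
```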
4 EXPERIMENT RESULTS
4.1 Classification
In this section, we demonstrate synthetic training for
image classification. To this end, we use the VisDA 2017 (Peng et al., 2018) dataset, which consists of rendered training images, with models taken, among others, from ShapeNet (Chang et al., 2015), and test data selected from CoCo (Lin et al., 2014). Since instance and uv-masks are not provided with the VisDA dataset, we can only apply the naive version of LF.
To assess the impact of deferred LF, we created
a separate dataset called AbsDA. We only took ob-
jects from ShapeNet Core, which drops the number
of classes in the dataset to 8. Some samples from the
two training datasets are depicted in Fig. 7, and the last
row shows some examples from the test dataset.
We evaluate the classification accuracy of a
ResNet34 model trained on the VisDA 2017 and AbsDA datasets. The results are summarized in Table 1.
Figure 7: Sample images from different target domains, starting from the top row: VisDA, AbsDA, and CoCo. Each image displays the original color image and its abstract representation (BrAD).
The Baseline style is used as a reference point, where
no image abstraction or augmentation is applied, and
its accuracy acts as a lower bound for the expected
performance. For these experiments, we opted to
use the BrAD intermediate representation as it was
originally designed for cross-domain classification by
(Harary et al., 2022). As shown in Table 1, the ab-
straction greatly improves the classification accuracy,
and even the naive LF forces the network to general-
ize better to real-world data. In contrast, random line
erasing (RE) decreases the classification accuracy.
The results of this experiment highlight the im-
portance of using LF as an augmentation method to
enhance the generalization ability of models trained
using our pipeline. This augmentation introduces ad-
ditional complexity to the images by adding a back-
ground scene. Furthermore, the addition of lines in
the region of the foreground objects essentially ran-
domizes their surface texture, which means the model
must learn to recognize objects by consistent geomet-
ric features and outlines instead of specific textural
details.
4.2 Object Detection
Object detection and localization are essential in vari-
ous visual inspection applications, including quality
control and robotic manipulation. To evaluate the
performance of our pipeline on a state-of-the-art ob-
ject detection model, we utilized three publicly available datasets from the BOP challenge, namely LM-O (Brachmann, 2020), ITODD (Drost et al., 2017), and T-LESS (Hodan et al., 2017).
Figure 8: Training and evaluation samples from the BOP challenge datasets: (a) LM-O, (b) ITODD, (c) T-LESS. Images display the original color input and the corresponding abstract representation.
As with all datasets
in the BOP challenge, they contain synthetic training
data generated using BlenderProc (Denninger et al.,
2019). BlenderProc uses Blender (Foundation, 2022)
and its built-in ray tracing engine to generate physi-
cally based renderings of 3D models, which are also
included in the datasets. While the training data is
synthetic, the test data for each dataset is composed
of real-world images, making the evaluation more re-
alistic and relevant to real-world applications.
The three chosen datasets all contain objects with
little to no distinguishing textures, which implies that object detection and classification are primarily geometry-based. Besides this similarity, the three datasets
present different challenges:
The LM-O dataset features 10 household objects
with discriminative shapes, sizes, and colors. Despite
the low number of classes, the dataset is challenging
for object localization due to the high levels of occlu-
sion and cluttered backgrounds. The object models in
the dataset were reconstructed using a depth camera,
leading to relatively noisy and inaccurate meshes.
In contrast, the ITODD dataset comprises 28 in-
dustrial objects with handcrafted CAD models, result-
ing in cleaner synthetic training data. The test dataset
was captured in a realistic, productive setting using
different sensors, including a grayscale camera. The
test images were captured against a uniform back-
ground with varying levels of inter-object occlusion.
For example, some scenes contain a pile of washers,
as shown in Fig. 8b while others contain fewer, clearly
separated objects.
Finally, the T-LESS dataset combines the features
of the previous two datasets. It includes 30 texture-
less, industry-relevant objects with clean CAD mod-
els. The 20 test scenes vary in complexity, with some
scenes containing clutter objects. The objects in the
dataset exhibit symmetries and mutual similarities in
shape and size, and some are composed of other ob-
jects, making the dataset more challenging. A sample
image from the T-LESS dataset is shown in Fig. 8c.
Realistic Training Data
In this experiment, we aim to evaluate the effect
of the proposed abstraction on the training process.
To achieve this, we train a Faster-RCNN (Girshick,
2015) on the synthetic (semi-)realistic training data
provided by the BOP challenge datasets. We use var-
ious intermediate styles, ranging from classic Canny edge images to combinations of line drawing styles.
Besides different styles, we also test the impact of the
augmentation techniques, presented in Section 3.2, on
the final detection quality. Each network was trained
for 30 epochs using SGD with a learning rate of 0.02,
which we decreased after epochs 16 and 22 by a fac-
tor of 10. The Faster-RCNN network uses ResNet50, pre-trained on ImageNet using our pipeline, for feature extraction. In Table 2, we report the mean Average Precision (mAP) on the test results for different configurations of the training pipeline.
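For illustration, a condensed sketch of this training setup using torchvision's Faster R-CNN implementation (assuming a recent torchvision); the data loading and the backbone pre-training on abstracted images are assumed to be set up elsewhere:

```python
import torch
import torchvision

def train_detector(train_loader, num_classes, epochs=30):
    """Sketch of the detector training used here: Faster R-CNN with a ResNet50
    FPN backbone, SGD with lr 0.02, decayed by a factor of 10 after epochs 16
    and 22. `train_loader` is assumed to yield lists of abstract images I_abs
    and torchvision-style target dicts (boxes, labels)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[16, 22], gamma=0.1)
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            loss_dict = model(images, targets)   # dict of individual losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```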
Table 2: Detection accuracy on BOP challenge datasets using the provided synthetic training data. Baseline indicates training without the AbSynth pipeline. Multiple mAP values per style correspond to different combinations of the LF and RE augmentations.

Dataset   Style         mAP
LM-O      Baseline      0.516
          Canny         0.423
          BrAD          0.490
          Anime         0.480 / 0.495
ITODD     Baseline      0.618
          Canny         0.756 / 0.783
          BrAD          0.617
          Anime         0.813 / 0.795
          Anime + OS    0.849 / 0.812 / 0.819
T-LESS    Baseline      0.195
          Canny         0.677 / 0.660
          BrAD          0.290
          Anime         0.695
          Anime + OS    0.685 / 0.695 / 0.658
To provide a baseline for comparison, we also
train the network using only the original training
dataset without using our pipeline. The results in Ta-
ble 2 show that the proposed AbSynth pipeline signif-
icantly improves detection accuracy compared to the
baseline approach. Specifically, using the AbSynth
pipeline with the Anime and Open Sketch styles and
LF achieves the highest mAP on all three datasets.
Interestingly, the BrAD style underperforms com-
pared to any intermediate representation based on the
informative drawing architecture. This contrasts with the previous section, where the BrAD style was the
best choice for image classification.
To understand this difference, we considered the
training procedures for the two styles. The BrAD net-
work was trained by (Harary et al., 2022) to contain
cues for cross-domain image classification. While this task may partially use shape-based information, the authors did not explicitly force the network to include geometric information in its representation.
Table 3: mAP and per-class precision for a subset of objects in the ITODD dataset. Line fusion augmentation often significantly improves network predictions on a per-class level.

LF        mAP     Box     Cap     Fuse    Wash
without   0.742   0.789   0.900   0.845   0.850
with      0.827   0.950   0.904   0.911   0.764
On the other hand, as we discussed in Section 2.1,
the line drawing style transform explicitly forces the
resulting images to contain geometric details. Since
the datasets used in our evaluation contain mostly tex-
tureless objects, their geometry becomes the primary
feature, and the intermediate style that best conveys it
performs the best.
The impact of the RE augmentation on the final
detection accuracy was found to be minor and some-
times negative according to the results in Table 2.
This technique was designed to increase the network’s
robustness against partial occlusion, a common chal-
lenge in object detection. However, since the training
data already includes many examples of inter-object
occlusion, the network is trained to deal with par-
tial views of the target objects. On the other hand,
the rectangular erased regions used in the augmentation
are not commonly found in real-world images, which
may lead to a slight increase in the gap between the
training and inference domains.
False Color Training Data
We have already shown that our pipeline can improve
detection quality for (semi-)realistic synthetic train-
ing data. However, to generate such high-quality ren-
derings, the user needs to provide object textures and
other surface parameters, e.g., how specular the object
is. In Section 3.1, we presented a rendering technique
that takes just the CAD model and outputs false color
images that can be converted to plausible intermediate
line representations. To avoid the need to create com-
plex background scenes, we opted to create a custom
dataset for ITODD. We created the training data using
Blender, with objects randomly scattered on a plane
using the built-in rigid-body physics engine. Some
samples from this data are depicted in Fig. 9.
We again trained a Faster-RCNN network with
hyperparameters similar to the previous section. We
used Anime line drawings as the intermediate repre-
sentation since a single-style intermediate representa-
tion is a reasonable tradeoff between expected qual-
ity and inference speed. In Table 3, we present the
detection precision on the test dataset. The mAP is competitive with the network trained using realistic renderings.
Figure 9: Synthetic and real-world data for the box class: (a) multi-object training image, (b) box training image, (c) box test image. The images show the original color image and their anime line drawings.
While the line fusion augmentation was
not beneficial on the original ITODD training data, it
was for our custom dataset.
We analyze the per-class detection accuracy in
more detail to investigate how LF augmentation in-
fluences the final network. To this end, we present a
selection of per-class detection accuracies in Table 3.
In most cases, the augmentation had a minor effect
on the results, with only a few classes showing a sig-
nificant improvement. One example is the packaging
box which uses a simple cuboid as its CAD model.
As depicted in Fig. 9c, the real-world counterpart has
more details, such as a flap to open it. Without aug-
mentation, the network often misclassifies such ob-
jects because they lack these details in the training
data. However, with LF-augmentation, the network
is trained to be robust against varying surface details
inside the object. Therefore, the additional details are
treated as random surface details and ignored by the
trained network, resulting in a more accurate detec-
tion.
4.3 Bin Picking
In this section, we apply the proposed AbSynth
pipeline to train a model to detect the electric motor
shown in Fig. 10a. As depicted in Fig. 10b, we simu-
late a bin-picking scenario by placing the motor in a box
alongside other objects.
Similar to the previous section, we used Blender
(Foundation, 2022) to generate the synthetic train-
ing data. The dataset comprises 600 scenes rendered
using 20 random camera positions, yielding 12000
training images. We created a box with random di-
mensions for each scene and used the built-in physics
engine to let objects fall into it. The entire scene
generation and rendering procedure was automated,
so we only needed to convert the CAD models from the STEP file format to one that can be imported into Blender. Since we used the false-color rendering approach outlined in Section 3.1, we do not need to create any randomized lighting setup or provide any surface properties.
Table 4: Evaluation metrics for the custom bin-picking dataset.

Subset   mAP@0.5   mAP@0.75   mAP@0.5:0.95   mAR
Full     0.906     0.697      0.639          0.710
Boxes    0.910     0.677      0.634          0.701
Other    0.903     0.732      0.650          0.720
We took 64 images of similar scenes containing
motors in boxes for evaluation. Since we want to eval-
uate whether the network can generalize, we also cap-
tured 24 images of configurations not present in the
synthetic data. An example is depicted in Fig. 10c,
where the motors are placed in a tray instead of a box.
We trained a Faster-RCNN network for 30 epochs,
utilizing a dataset comprising 25,000 synthetic train-
ing images. The training was performed using SGD
with an initial learning rate of 0.02, which we reduced
by a factor of 0.1 after epochs 16 and 22. We chose the
anime style as the intermediate representation since a
one-channel representation has proven to be a good
trade-off between inference speed and model accu-
racy in previous experiments. The detection results on
the evaluation test set are compiled in Table 4. Inter-
estingly, the detection accuracy between bin-picking
and other cluttered scenes is very similar for a low
IoU threshold. This indicates that the synthetic training data is diverse enough to generalize
the ROI classification to similar but unseen scenar-
ios. The precision of box samples is lower compared
to the rest of the validation data as the threshold in-
creases. We attributed this to a higher probability of
occlusion in those scenes, which makes it more chal-
lenging to locate object boundaries accurately.
One of the main advantages of our approach is that it does not require manual labeling, which is time-consuming and costly. In traditional approaches using real-world images, image acquisition and labeling are separate steps that can add to the overall time and effort required.
Figure 10: Data used to train and evaluate the object detection use case: (a) target objects, (b) training sample, (c) validation sample.
Table 5: Comparison of labeling and data acquisition speeds.

           Box time   Acquisition time   Total
Synthetic  0.3 s      9.0 s              9.0 s
Manual     8.4 s      24.7 s             57.2 s
Figure 11: Two samples from our labeling study, (a) fully visible and (b) occluded, showing the annotations of all participants.
However, with our data-generation
pipeline, labels (e.g., object bounding boxes) are gen-
erated simultaneously with the synthetic images. To
quantify the reduction in manual labor provided by
our approach, we measured the average time required
to annotate a bounding box (box time) for both the
training and evaluation datasets. We estimate the image acquisition time for the synthetic dataset by dividing the duration of synthetic data generation by the number of images in the training set. Note that this includes
the time used to simulate the objects falling into the
box. For the real-world dataset, we measure the time
to capture the images and divide it by the dataset size.
A single user annotated the test dataset to ensure
consistent label quality. However, the labeling speed
is influenced by the user’s experience, which intro-
duces a bias in the timing statistics.
We conducted a small user study involving 10 par-
ticipants to address this issue. Each participant was
assigned a set of 10 images randomly selected from
the test dataset and asked to annotate them. By com-
paring the results, we obtain a more comprehensive
understanding of the labeling process and its associ-
ated time requirements. The findings from this study
are presented in Table 5, revealing notable differences
in image acquisition and labeling times between syn-
thetic data generation and real-world scenarios.
Furthermore, we took advantage of this user study
to investigate the impact of occlusion on the consis-
tency of bounding box annotations across different
users. For this purpose, two of the 10 presented im-
ages were identical for all participants. One depicts
strong occlusion, and the other features a clear sep-
aration between instances. As hypothesized, we ob-
served significant variations in annotated bounding
boxes for heavily occluded motors, while instances
with less occlusion generally exhibited more consis-
tent labeling. Figure 11 provides visual examples for
two instances from these images. For fully visible
instances, annotation differences can be primarily at-
tributed to sloppy labeling. However, achieving accu-
rate annotations becomes exceedingly difficult when
faced with occlusion and challenging lighting condi-
tions.
In contrast, the labels of synthetic training data
produced by our approach are inherently pixel-
perfect, irrespective of the scene’s complexity or level
of occlusion.
5 CONCLUSION
In this paper, we presented a novel approach to synthetic
image training. Instead of creating photo-realistic im-
ages or using neural networks that transform render-
ings into such images, we propose projecting syn-
thetic and real-world data into a shared abstract do-
main. We demonstrated that line drawings are such
a domain that is well suited for downstream tasks
based on object geometry. Applying our approach
to different tasks, we showed that the image-to-line
transformation can be decoupled from these down-
stream tasks, and we presented various methods to fa-
cilitate the transformation. Our experiments showed
that our method can be a drop-in to improve object de-
tection quality, even using datasets with semi-realistic
synthetic data. The intermediate line representation
also enables novel augmentation methods, further im-
proving network generalization to real-world data. Fi-
nally, we demonstrated how our approach could be
used in a real-world use case by training a network to
identify objects in a bin-picking scenario without any
real training images.
Despite the success of our approach, there are ar-
eas for further exploration and optimization. The pro-
jection of images to their abstract representation is an
additional step that requires computation time, and
optimizing the runtime should be a focus in follow-
up work. One promising idea is to use knowledge
distillation with a student-teacher approach producing
smaller image-to-image networks based on the pre-
sented ones. Additionally, we believe the downstream
networks can be trimmed down since the abstract in-
put data contains condensed, more meaningful data
than pure color images.
ACKNOWLEDGEMENTS
This work has been supported by the Schaeffler
Hub for Advanced Research at Friedrich-Alexander-
Universität Erlangen-Nürnberg (SHARE at FAU).
REFERENCES
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. arXiv preprint arXiv:2004.10934.
Brachmann, E. (2020). 6D Object Pose Estimation using
3D Object Coordinates [Data].
Canny, J. (1986). A computational approach to edge de-
tection. IEEE Transactions on pattern analysis and
machine intelligence, pages 679–698.
Chan, C., Durand, F., and Isola, P. (2022). Learning to
generate line drawings that convey geometry and se-
mantics. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
7915–7925.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan,
P., Huang, Q., Li, Z., Savarese, S., Savva, M.,
Song, S., Su, H., et al. (2015). Shapenet: An
information-rich 3d model repository. arXiv preprint
arXiv:1512.03012.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M.,
and Schmidhuber, J. (2011). Flexible, high perfor-
mance convolutional neural networks for image clas-
sification. In Twenty-second international joint con-
ference on artificial intelligence. Citeseer.
Croitoru, F.-A., Hondru, V., Ionescu, R. T., and Shah, M.
(2023). Diffusion models in vision: A survey. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence.
DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., and San-
tella, A. (2003). Suggestive contours for conveying
shape. In ACM SIGGRAPH 2003 Papers, pages 848–
855. ACM New York, NY, USA.
Denninger, M., Sundermeyer, M., Winkelbauer, D., Zi-
dan, Y., Olefir, D., Elbadrawy, M., Lodhi, A., and
Katam, H. (2019). Blenderproc. arXiv preprint
arXiv:1911.01911.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas,
C., Golkov, V., Van Der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
2758–2766.
Drost, B., Ulrich, M., Bergmann, P., Hartinger, P., and Ste-
ger, C. (2017). Introducing mvtec itodd-a dataset for
3d object recognition in industry. In Proceedings of
the IEEE international conference on computer vision
workshops, pages 2200–2208.
Foundation, B. (2022). Blender.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages
1440–1448.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2020). Generative adversarial networks. Com-
munications of the ACM, 63(11):139–144.
Goodman, N. (2022). Languages of art. In Lexikon Schriften über Musik, pages 293–376. Springer.
Harary, S., Schwartz, E., Arbelle, A., Staar, P., Abu-
Hussein, S., Amrani, E., Herzig, R., Alfassy, A.,
Giryes, R., Kuehne, H., et al. (2022). Unsupervised
domain generalization by learning a bridge across do-
mains. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
5280–5290.
Hertzmann, A. (2021a). The role of edges in line drawing
perception. Perception, 50(3):266–275.
Hertzmann, A. (2021b). Why do line drawings work? a re-
alism hypothesis. Journal of Vision, 21(9):2029–2029.
Hodan, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., and Zabulis, X. (2017). T-less: An rgb-d dataset for 6d pose estimation of texture-less objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 880–888. IEEE.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
1125–1134.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen,
J., and Aila, T. (2020). Analyzing and improving
the image quality of stylegan. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 8110–8119.
Kennedy, J. M. and Ross, A. S. (1975). Outline picture per-
ception by the songe of papua. Perception, 4(4):391–
406.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: common objects in context. CoRR, abs/1405.0312.
Peng, X., Usman, B., Kaushik, N., Wang, D., Hoffman, J.,
and Saenko, K. (2018). Visda: A synthetic-to-real
benchmark for visual domain adaptation. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, pages 2021–2026.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In International
conference on machine learning, pages 8748–8763.
PMLR.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and
Abbeel, P. (2017). Domain randomization for transfer-
ring deep neural networks from simulation to the real
world. In 2017 IEEE/RSJ international conference on
intelligent robots and systems (IROS), pages 23–30.
IEEE.
Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jam-
pani, V., Anil, C., To, T., Cameracci, E., Boochoon,
S., and Birchfield, S. (2018). Training deep networks
with synthetic data: Bridging the reality gap by do-
main randomization. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition
workshops, pages 969–977.
Xie, S. and Tu, Z. (2015). Holistically-nested edge detec-
tion. In Proceedings of the IEEE international confer-
ence on computer vision, pages 1395–1403.
Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q.,
and Zuo, W. (2021). Enhancing geometric factors in
model learning and inference for object detection and
instance segmentation. IEEE Transactions on Cyber-
netics, 52(8):8574–8586.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017).
Unpaired image-to-image translation using cycle-
consistent adversarial networks. In Proceedings of
the IEEE international conference on computer vi-
sion, pages 2223–2232.