CAD-based Learning for Egocentric Object Detection in Industrial

Context

Julia Cohen

1,2

, Carlos Crispim-Junior

, C

eline Grange-Faivre

and Laure Tougne

Univ. Lyon, Lyon 2, LIRIS, F-69676 Lyon, France

DEMS, Saint Bonnet de M

ure, France

Keywords:

Egocentric, Database Generation, Object Detection.

Abstract:

Industries nowadays have an increasing need of real-time and accurate vision-based algorithms. Although the

performance of object detection methods improved a lot thanks to massive public datasets, instance detection

in industrial context must be approached differently, since annotated images are usually unavailable or rare.

In addition, when the video stream comes from a head-mounted camera, we observe a lot of movements and

blurred frames altering the image content. For this purpose, we propose a framework to generate a dataset of

egocentric synthetic images using only CAD models of the objects of interest. To evaluate different strategies

exploiting synthetic and real images, we train a Convolutional Neural Network (CNN) for the task of object

detection in egocentric images. Results show that training a CNN on synthetic images that reproduce the

characteristics of egocentric vision may perform as well as training on a set of real images, reducing, if not

removing, the need to manually annotate a large quantity of images to achieve an accurate performance.

1 INTRODUCTION

Object detection in images is an interesting challenge

with a wide range of applications. Video surveillance,

autonomous driving or robotics are examples of the

domains that need to recognize objects, in order to au-

tomate tasks or perform quality control (Agin, 1980;

Malamas et al., 2003). We are interested in detecting

objects for augmented reality (AR) applications. Re-

cently, AR technologies have spread out to our every-

day lives (Van Krevelen and Poelman, ), overlaying

virtual content onto the reality with wearable and mo-

bile devices. This is particularly useful for industrial

tasks, since it enables to visualize additional infor-

mation during maintenance or assembly tasks (Gav-

ish et al., 2015; Evans et al., 2017). Smartphones

and tablets used as AR displays occupy the hands of

the user and require to move repeatedly ones attention

from the screen to the workstation. On the contrary,

head-mounted displays such as AR glasses or head-

sets enable to visualize virtual objects at eye-level,

hands-free. Head-mounted displays usually embed

one or several cameras following the worker’s head

movements, showing fast changes in the image con-

tent and illumination, as well as motion blurring (Fig-

ure 1). These challenges, speciﬁc to the domain of

egocentric vision, are already being investigated by

researchers, especially for the automatic analysis of

daily activities (Nguyen et al., 2016; Damen et al.,

2018).

While these works focus on the relationship be-

tween daily objects and actions in egocentric videos,

we are interested in speciﬁc objects in an industrial

context, in order to project the adequate virtual con-

tent onto the head-mounted display. Unlike existing

solutions, we need a system that adapts to different

indoor environments, objects materials and shapes.

Also, there is no possibility to adjust the working

area with artiﬁcial markers, leaving out of the scope

marker-based AR methods. Several works tackle this

problem, but they usually rely on speciﬁc conditions,

e.g., a planar surface and static scene (Klein and Mur-

ray, 2007). Given the recent success of deep learning

in many vision tasks, we use a Convolutional Neu-

ral Network (CNN) for the task of object detection in

egocentric images. Using large databases, CNNs are

able to recognize objects with varying shape, pose,

texture or material. However, in industrial context,

target objects are industrial pieces manufactured for

speciﬁc products, for which it is difﬁcult to acquire

and annotate a large number of data. In this work,

we exploit the 3D CAD models of the objects of in-

terest to automatically generate an egocentric syn-

thetic database. Although using CAD models to de-

644

Cohen, J., Crispim-Junior, C., Grange-Faivre, C. and Tougne, L.

CAD-based Learning for Egocentric Object Detection in Industrial Context.

DOI: 10.5220/0008975506440651

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages

644-651

ISBN: 978-989-758-402-2; ISSN: 2184-4321

Figure 1: Consecutive frames extracted from an egocentric video (30fps) where a bus seat is being assembled.

tect or classify objects in images is not a new ap-

proach (B

ohm et al., 2000; Ulrich et al., 2001; Toshev

et al., 2009; Langlois et al., 2018), only few methods

combine deep learning methods and 3D models for

the task of instance detection (detection of a speciﬁc

object instead of a class), and without real images.

To the best of our knowledge, no prior work has been

done to detect industrial object instances from an ego-

centric point of view and using only synthetic images.

Building on previous works presented in Sec-

tion 2, we propose a method to generate a synthetic

dataset using CAD models as the only input, and we

study different learning strategies (Section 3). We

tailor our approach towards images from an egocen-

tric viewpoint where hand-held objects are truncated

and head motion contributes to blur the images. Af-

ter generating a synthetic dataset, we train the CNN

YOLOv3 (Redmon and Farhadi, 2018) and evaluate

its ability to detect objects in real images. Especially,

we compare the performance of training on a large

dataset of synthetic images and a small dataset of real

images, and we study the contribution of motion and

Gaussian types of blur in the synthetic dataset, to re-

duce the gap between synthetic and real domains. Ex-

periments are described in Section 4, and results in

Section 5. Section 6 concludes this work.

2 RELATED WORK

CAD models were ﬁrst used for classiﬁcation and lo-

calization tasks by matching hand-crafted feature vec-

tors from the model or its 2D projection to a range or

color image (Qin et al., 2014; B

ohm et al., 2000; Ul-

rich et al., 2001; Toshev et al., 2009). Then, CNNs

were fed directly with 3D data as a grid of voxels

(Maturana and Scherer, 2015; Wu et al., 2015) or

as an octree (Wang et al., 2019). Encouraged by the

availability of large databases such as ModelNet (Wu

et al., 2015) and PASCAL3D+ (Xiang et al., 2014),

these CNNs serve as feature extractors for any vision-

based task, or they are tailored towards a speciﬁc task

as classiﬁcation or pose estimation. A different ap-

proach is to render the CAD models to obtain a collec-

tion of 2D views with known pose to feed the CNN,

for pose estimation or reﬁnement for example (Su

et al., 2015; Kehl et al., 2017; Sundermeyer et al.,

2018; Langlois et al., 2018). These views can be used

as a form of data augmentation to complement a small

or imbalanced dataset (Peng et al., 2015), or as the

only source of data. We focus on this approach to train

an object detector using only RGB images, removing

the need to reconstruct a 3D scene. On the same line

as (Sarkar et al., 2017), we perform instance detection

with real images unavailable or in reduced number,

with a training set built entirely from one CAD model

per object. Unlike them, our objects are being manip-

ulated from an egocentric viewpoint and not lying on

a table.

The domain gap between synthetic and real im-

ages is another challenge: a CAD model has sharp

edges and visible details, while a real image quality

depends a lot on the camera characteristics. While

we can learn from a small set of real images the best

parameters to generate synthetic images (Rozantsev

et al., 2015), we still need images from the target

domain. Several works (Massa et al., 2016; Inoue

et al., 2018; Planche et al., 2019) proposed to bridge

the gap between real and synthetic data by giving the

real images a synthetic aspect, whereas our approach

is to make synthetic images similar to real ones. To

tackle the challenge of cross-domain transfer learn-

ing, (Chu et al., 2016) proposed to take advantage of

networks pre-trained on massive datasets such as Im-

ageNet (Deng et al., 2009), showing experimentally

the performance improvement. However, their source

and target domains both contain real images.

In this work, we generate synthetic images from

CAD models with a special attention on egocentric

characteristics. Then, we compare different learning

strategies with or without real and synthetic data, pre-

trained weights, blur and shadows.

3 PROPOSED METHOD

First, we introduce our method to generate an egocen-

tric synthetic dataset, simulating visual cues represen-

tative of real egocentric images. Then, we present the

different training strategies evaluated.

3.1 Synthetic Images Generation

We create images containing variability in viewpoint,

illumination, truncation and blurring. Although we

CAD-based Learning for Egocentric Object Detection in Industrial Context

645

Figure 2: Synthetic dataset generation. Details in Section 3.1.

focus on instance detection, the generated dataset can

be used for other vision-based tasks.

3.1.1 Rendered Views

For each object of interest, a textureless CAD model

is rendered at the origin of a 3D scene. A virtual

camera is placed at several positions, with its opti-

cal axis pointing at the object, without varying the

camera parameters or simulating artifacts as in other

works (Klein and Murray, 2009). To obtain equally

distributed camera positions, we apply the method

introduced in (Hinterstoisser et al., 2008), in which

the viewpoints are sampled from an icosahedron: the

initial triangular faces are iteratively subdivided into

4 smaller triangles in the reﬁnement step, providing

a ﬁner sampling of the 3D space (Figure 2-A). The

model does not need resizing at this point. While

thousands of poses are required for pose estimation,

(Wohlhart and Lepetit, 2015; Kehl et al., 2017), much

less poses are necessary for object detection. With

this sampling method, we can obtain 12, 42 or 162

poses with 0, 1 or 2 levels of reﬁnement (level 0 cor-

responding to the initial icosahedron vertices). We

render the scene with the camera placed at each of

the viewpoints. The resulting ”views” cover two out-

of-plane rotation angles; the third one is obtained

by in-plane rotation in a subsequent step. We add

visual variations by changing the color of our non-

textured models: we empirically deﬁned 12 colors,

from which 5 are randomly selected for each camera

pose (Figure 2-B). The lighting is a simple point light,

emitting in all directions and creating reﬂections for

certain colors and camera positions.

3.1.2 Background Compositing

Previous works (Peng et al., 2015; Sarkar et al., 2017)

have shown better results when the synthetic images

have a real background. We apply this strategy by

adding the synthetic views into images of real scenes.

The annotation is automatically created during this

step. In previous works, the background images are

retrieved from large databases with constraints such

as ”indoor location” and removing images with ob-

jects similar to the target objects. Since usual datasets

do not contain images from our target domain, we

manually retrieved 50 images from online databases

corresponding to keywords like ”industrial”, ”manu-

facturing”, or ”workplace” (Figure 2-D).

3.1.3 Framework

To create synthetic images, we apply the framework

in Figure 2. First, we randomly select and augment a

background image with domain randomization tech-

niques: we crop and resize the image, modify the

pixel values, apply random contrast normalization,

horizontal and vertical ﬂipping, and motion blur (Fig-

ure 2-E). Then, a view is selected, resized, and its lo-

cation on the background image is randomly deﬁned

(Figure 2-C). We allow a third of the object width and

height to fall outside the boundaries of the image, in

order to simulate truncated objects as they appear in

egocentric images (Figure 2-F). The view is also aug-

mented with random rotation and pixel multiplication,

and it is projected onto the background image, before

adding Gaussian noise to smooth the object contours

(Figure 2-G). Finally, to bridge the gap between syn-

thetic and real images, we simulate movements with

either Gaussian or motion blur (Figure 2-H). Gaus-

sian blur is applied with a zero-mean and a standard

deviation between 5 and 12, motion blur with a ker-

nel of size between 10 and 80, random angle and di-

rection of motion. We simulate shadows with semi-

transparent black shapes blended on top of the images

(Figure 2-I). Although the shadows do not look real-

istic, they add variability in the objects colors.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

646

3.2 Learning Strategies

Typically, a CNN for object detection is composed

of a feature extractor and a detection head. When

trained on ImageNet, the feature extractor is powerful

for a wide range of tasks and enables transfer learning

to different domains (Razavian et al., 2014). On the

contrary, the detection head is speciﬁc to the applica-

tion and dataset at hand. Several training strategies

result from this decomposition: training the feature

extractor and the detection head jointly, training them

with different data or parameters, training only the de-

tection head or speciﬁc layers.

According to (Chu et al., 2016), an optimal learn-

ing strategy with few images from the target domain is

to use as many pre-trained layers as possible and ﬁne-

tune the whole network with the target data. With

only synthetic data available, (Hinterstoisser et al.,

2018) achieved the better performance keeping the

feature extractor pre-trained on ImageNet and train-

ing only the detector head. On the contrary, the au-

thors from (Tremblay et al., 2018) obtained better re-

sults ﬁne-tuning the whole CNN with synthetic im-

ages. To identify the best strategy in our case, we an-

alyze the following experiments: training the full net-

work or only the detection head, using only synthetic

data, only real data or a combination of both. Finally,

to replicate common perturbations in egocentric im-

ages, we evaluate the inﬂuence on the performance of

including blur and shadows in the training set.

4 EXPERIMENTS

In this section, we present ﬁrst the real dataset ac-

quired and the synthetic dataset built following the

framework in Section 3.1. Second, we detail the train-

ing strategies applied. Finally, we introduce the eval-

uation metrics. We use in our experiments a bus seat,

for which 5 CAD models and the real non-assembled

objects are available. With the real objects, we ac-

quired images to evaluate the performance of our ap-

proach. Notice that the framework can be easily ap-

plied on any dataset with CAD models available.

4.1 Real Dataset

We acquired images of the real objects to evaluate

the detection task. We also used them to train a

CNN and compare the performance. The acquired

images are photos and videos shot from three dif-

ferent smartphones (Figure 3), and even more, and

egocentric videos shot from one of the smartphones

mounted into an augmented reality headset. This

Figure 3: (Top) Pictures acquired with a smartphone. (Bot-

tom) A. Full bus seat model (1), backrest (2), handle (3),

seat (4), left proﬁle (5), right proﬁle (6). B. Bad model.

headset allows the user to manipulate the objects with

both hands. Therefore, the videos show characteris-

tics similar to an industrial application with an AR

head-mounted device: objects fall partially outside

the image when held at arm’s length, many frames are

blurred, and lots of shadows and reﬂections appear.

These elements make detection challenging, as they

modify the visual appearance of the objects, espe-

cially in video recordings (Figure 1). From the videos,

we manually extracted a portion of their frames to re-

duce annotation time, selecting a large variety of ob-

ject positions, illumination and blurring conditions.

However, it is worth noting that these images do

not show an exhaustive set of poses, and their back-

grounds are very similar.

A total of 392 annotated images is used to form

the real dataset, with one or several objects per im-

age. All classes appear within the same proportions

(about 20%), except for the backrest (about 30%) and

the seat (about 10%). Most images contain several

objects occluding each other or truncated. Some ob-

jects are very similar (backrest and seat, left and right

proﬁles) and hardly distinguishable in some images,

e.g., when only a part of the object is visible, making

our target domain very challenging.

4.2 Synthetic Dataset

The bus seat has 5 main parts: handle, backrest, left

proﬁle, right proﬁle, and seat (Figure 3-A). First, we

process the CAD models with poor quality (Figure 3-

B). Using the framework of Section 3.1, we generate

2D views with 162 camera viewpoints and 5 colors

per viewpoint, leading to 810 images per class. Ex-

perimentally, we selected 162 camera poses because

many objects were missed in our ﬁrst experiments

with only 42 viewpoints. On the contrary, more cam-

era poses lead to similar and redundant views. Each

view appears with 15 in-plane rotations as most CNNs

are not rotation-invariant. For the experiments pre-

sented here, we focused on images with only one ob-

ject, leaving the simulation of occlusions for future

work. To compose the synthetic dataset, we randomly

CAD-based Learning for Egocentric Object Detection in Industrial Context

647

selected 2000 images per class, leading to 10 000 syn-

thetic images with balanced classes. For some exper-

iments, this dataset is modiﬁed with blur or shadows

on half of the images.

4.3 Model Training and Evaluation

We use as object detector a Darknet implementation

of YOLOv3 (Redmon et al., 2019). YOLOv3 is ca-

pable of real-time inference, which makes it adequate

for mobile object detection as in an AR headset. The

feature extractor network is Darknet-53, and we use it

in all experiments its weights pre-trained on the Im-

ageNet dataset of the 1000-class classiﬁcation task.

The detection head is always randomly initialized.

Training is performed with the default parameters:

stochastic gradient descent with mini-batches of 64

images, momentum of 0.9, weight decay of 0.0005,

initial learning rate of 0.001 (divided by 10 when the

validation performance stops improving). Data aug-

mentation (saturation, hue and exposure variations) is

used for real and synthetic training images, as well as

in-plane rotation for real images only (since already

included in the synthetic dataset).

As a baseline, we train the full network end-to-

end with real images (Net Real), and only the de-

tector head with real images (Det Real). We re-

peat these experiments with synthetic images with-

out any blur or shadows (Net Synth, Det Synth). To

evaluate the importance of egocentric cues, we train

the detector head only with shadows (Det Shadows),

only with blur (Det Blur) and both (Det BS). Fi-

nally, we study the effect of using a small amount

of real data: added to the synthetic images during

training (Det SynthReal), or ﬁne-tuning the detector

head in a ﬁnal step (Net Synth+FT, Det Synth+FT,

Det Blur+FT).

All experiments are tested on real images. Net-

works trained on synthetic data are also tested on syn-

thetic images. All experiments follow a 3-fold cross-

validation scheme. For synthetic data, we subdivide

the 10000 images into 3 equal subsets: two are used

as training set (average of 6650 images), and the third

one is subdivided equally for validation and test (aver-

age of 1675 images each). We repeat each experiment

3 times with a different subset for validation and test,

and we report the average results. Real images from

videos were treated carefully: since frames within a

video have similar illumination, background and ob-

ject poses, they cannot be distributed among training,

validation and test set. As the videos do not show

all 5 objects, we carefully distribute the videos into

the training, validation and test set, to make sure that

each object appears in every set of the 3 splits. Then,

we balance the number of objects and images in each

set by distributing the remaining pictures. In average,

293 objects appear in the training set for 131 images,

322 objects in the validation set for 125 images, and

302 objects in the test set for 136 images.

4.4 Evaluation Metrics

As an object detector, YOLOv3 outputs for each im-

age a series of bounding box coordinates and their

corresponding class labels, along with a conﬁdence

level. At test time, detections with conﬁdence higher

than 0.25 are kept. However, the performance is eval-

uated for all detections with the mean Average Preci-

sion (mAP) and Intersection over Union (IoU), as they

are deﬁned for the Pascal VOC challenge (Evering-

ham et al., 2010). The mAP metric index is the mean

Average Precision (AP) over all classes. For each

class, AP is the area under the interpolated precision-

recall curve, computed as the mean precision corre-

sponding to 11 equally spaced recall values. The IoU

measures the overlap of the predicted and ground-

truth bounding box of an object. A detection is posi-

tive if its bounding box overlaps with the ground-truth

by 50% or more (denoted IoU@50). We also report

the standard deviation on the 3 cross-validation exper-

iments, to illustrate the variation in performance with

different training sets.

5 RESULTS AND DISCUSSION

In this section, we present the results obtained with

the different learning strategies in Section 4.3. We

also analyze the effect on learning of different types of

blur in the synthetic training set. Finally, we propose

a qualitative analysis of the results.

5.1 Baselines

Table 1 shows the results of the baseline experiments

on synthetic images. When YOLOv3 is trained end-

to-end on synthetic images without blur and shadows

(Net Synth), the mAP is 88.6% and the IoU is 76.9%.

When the feature extractor is ﬁxed after being pre-

trained on ImageNet (Det Synth), the mAP decreases

to 63.4% and the IoU drops to 49.5%. These results

suggest that the features learned on real images do not

correspond to features in synthetic images, which is

consistent with the domain gap. From all the classes,

the only exception is the handle, which is detected

almost as well in both cases.

Table 2 shows the results of the same experiments

on real images. As expected, the metrics drop for both

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

648

Table 1: Detailed results for training and test on synthetic images: AP per class, mAP, IoU (%).

Handle Backrest Left proﬁle Right proﬁle Seat mAP IoU@50

Net Synth 90.8 86.7 90 88.8 86.8 88.6 ± 0.3 76.9 ±1

Det Synth 86.5 48.6 69.5 58.3 53.4 63.4 ± 5 49.5± 2.4

Table 2: Detailed results of baseline experiments on real images: AP per class, mAP, IoU (%).

Handle Backrest Left proﬁle Right proﬁle Seat mAP IoU@50

Net Synth 12.1 30 12.8 1.1 39.1 19 ± 8.4 32.1 ± 9

Det Synth 37.3 54.7 18.4 18.3 45.4 34.8 ± 4.8 41.6 ± 3.8

Net Real 51.7 70.3 18.8 25.5 63.8 46 ± 12 44.4 ± 10.8

Det Real 48.3 65.4 15.7 21.2 55.1 41.1 ± 3.8 42.7 ± 13.7

Net Synth and Det Synth, although Det Synth per-

forms much better: the network takes advantage of

the real features learned on ImageNet, even if it has

not seen any real image of the objects. Regarding the

networks trained on real images only, Net Real and

Det Real, objects are not detected very accurately,

with a mAP of 46% and 41.1% respectively. We

observe here the limitations of training with a small

number of real images, and we explore in the next sec-

tions possible directions to improve the performance.

5.2 Using Blur and Shadows

As we observed better results training only the de-

tector head with synthetic images, we carried out the

next experiments with this strategy. Table 3 shows

the results on real images when the synthetic train-

ing set presents: no blur and shadows (Det Synth),

shadows only (Det Shadows), blur only (Det Blur),

both blur and shadows (Det BS) (resulting in a big-

ger training set). While artiﬁcial shadows do not im-

prove the accuracy (mAP is the same for Det Synth

and Det Shadows), the IoU increases from 41.6% to

46.1%. We can deduce that shadows make the local-

ization of objects more difﬁcult in real images.With

Det Blur, the mAP and IoU both increased compared

to Det Synth. This experiment shows a better detec-

tion in slightly blurry images where nothing was de-

tected previously, and a more precise localization. Fi-

nally, with Det

BS, the IoU is decreased compared to

Det Blur (although it is still higher than Det Synth),

while the mAP increases to 45.8%. Without any real

image during training, the best strategy for our dataset

seems to be the training of the detector head only with

blur and shadows for a better mAP, or only blur for a

better IoU.

Table 3: Inﬂuence of blur and shadows on real images (%).

mAP (%) IoU@50 (%)

Det Synth 34.8 ± 4.8 41.6± 3.8

Det Shadows 34.8± 3.3 46.1 ± 6.5

Det Blur 44 ± 1.4 51.3 ± 5

Det BS 45.8 ± 1.5 46.5 ± 3.7

5.3 Fine-tuning with Real Images

We ﬁne-tune the detection head of YOLOv3 with the

real images used in the baseline experiments Net Real

and Det Real. As Table 4 shows, this ﬁne-tuning step

increases the performance, except for the IoU when

training with blurred images. For all experiments but

Net Synth, using both synthetic and real images dur-

ing training gives better performance than using real

images only. When training with real and synthetic

images at the same time (Det SynthReal), we obtain

better mAP than without any real image, but low IoU.

In conclusion, the detection is improved combining

real and synthetic images, but the localization can be

better without any real image.

Table 4: Inﬂuence of using real images during training (%).

mAP IoU@50

Net Synth 19 ± 3.8 32.1 ± 13.7

Det Synth 34.8 ± 4.8 41.6 ± 3.8

Det Blur 44± 1.4 51.3 ± 5

Net Synth+FT 40.7 ± 13.6 44.71 ± 7.6

Det Synth+FT 49.3 ± 10.6 42.7 ± 2.9

Det Blur+FT 52.4 ± 10.3 48.5 ± 7.7

Det SynthReal 48.5± 6.2 41.8 ± 5.3

5.4 Blur Analysis

We studied the inﬂuence of the type of blur used in

the training set. In Figure 4, we show the mAP and

IoU@50 of the network trained without any blurred

image (Det Synth), only motion blur, only Gaussian

blur, and both motion and Gaussian blur equally dis-

tributed (Det Blur). We observe that any blur im-

proves the performance, although Gaussian blur is

slightly better. Finally, the best performance is ob-

tained with both types of blur, with 44% of mAP and

51.3% of IoU. This result shows again that more char-

acteristics of real images and variability are key com-

ponents towards generalization to the real domain.

CAD-based Learning for Egocentric Object Detection in Industrial Context

649

IoU@50

mAP

41.6

34.8

39.2

42.5

41.2

44.9

51.3

Performance (%)

Det Synth Motion blur

Gaussian blur Det Blur

Figure 4: Comparison of the type of blur during training.

Figure 5: Detections of Net Real, Det Blur and

Det Blur+FT.

5.5 Qualitative Analysis

In this section, we provide examples of correct and

missed detections for a deeper understanding of the

results. As we do not simulate occlusions and multi-

object detections, the CNN is not able to detect many

close or overlapping objects (Figure 5, bottom), and

outputs a small number of bounding boxes.

Training on real images allows to detect all parts

when the bus seat is assembled, as similar composi-

tion exists in the training set (Figure 5, bottom); the

number of detections per image is much higher, in-

cluding more false positive detections (Figure 5, top).

Using synthetic training and real ﬁne-tuning removes

the false positive while detecting all objects. Another

type of error comes from the similarity between the

objects: the left and right handle look very similar,

especially when they do not appear entirely in the im-

age ; the seat and backrest are almost identical ob-

jects except for their length. Indeed, when manually

labelling the real images, we considered the several

consecutive frame, because it is sometimes not pos-

sible to distinguish some objects considering a single

frame, without extra information. It is then under-

standable that the CNN is not able to recognize them

either. Finally, when the training is carried out only on

synthetic images, the CNN detects objects even when

they are slightly truncated by the image borders (Fig-

ure 6), because images of this scenario are generated

by our framework.

Figure 6: Truncated objects correctly detected (Det Blur).

6 CONCLUSION AND FUTURE

WORK

We proposed a framework for the generation of ego-

centric synthetic images given a set of CAD mod-

els. We showed that pre-training the feature extrac-

tor on ImageNet and training the detector head with

synthetic images, enriched with blur and shadows,

enables to reach a similar classiﬁcation accuracy as

using only real images, and even better localization.

Fine-tuning the resulting network with a small set of

real images further improves the detection, at the cost

of localization precision.

To pursue our effort towards a better detection

in real images, occlusions and multiple objects re-

main to be incorporated. Small objects as screws and

bolts could also be added to the objects to detect.

Finally, discriminating similar objects is a challenge

that needs further investigations.

ACKNOWLEDGEMENT

This work is supported by grant CIFRE n.2018/0872

from ANRT. We also thank DEMS for providing the

3D data and real objects.

REFERENCES

Agin, G. J. (1980). Computer vision systems for industrial

inspection and assembly. Computer, (5):11–20.

ohm, J., Brenner, C., G

uhring, J., and Fritsch, D. (2000).

Automated extraction of features from cad models for

3d object recognition. In ISPRS Congress.

Chu, B., Madhavan, V., Beijbom, O., Hoffman, J., and Dar-

rell, T. (2016). Best practices for ﬁne-tuning visual

classiﬁers to new domains. In ECCV, pages 435–442.

Damen, D., Doughty, H., Farinella, G. M., Fidler, S.,

Furnari, A., Kazakos, E., Moltisanti, D., Munro, J.,

Perrett, T., Price, W., and Wray, M. (2018). Scal-

ing egocentric vision: The epic-kitchens dataset. In

ECCV, pages 720–736.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-

Fei, L. (2009). Imagenet: A large-scale hierarchical

image database. In CVPR, pages 248–255.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

650

Evans, G., Miller, J., Pena, M. I., MacAllister, A., and

Winer, E. (2017). Evaluating the microsoft hololens

through an augmented reality assembly application.

In Degraded Environments: Sensing, Processing, and

Display, volume 10197.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn,

J., and Zisserman, A. (2010). The pascal visual object

classes (voc) challenge. IJCV, 88(2):303–338.

Gavish, N., Guti

errez, T., Webel, S., Rodr

ıguez, J., Peveri,

M., Bockholt, U., and Tecchia, F. (2015). Evaluating

virtual reality and augmented reality training for in-

dustrial maintenance and assembly tasks. Interactive

Learning Environments, 23(6):778–798.

Hinterstoisser, S., Benhimane, S., Lepetit, V., Fua, P., and

Navab, N. (2008). Simultaneous recognition and ho-

mography extraction of local patches with a simple

linear classiﬁer. In BMVC.

Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Konolige,

K. (2018). On pre-trained image features and syn-

thetic images for deep learning. In ECCV Workshops.

Inoue, T., Choudhury, S., De Magistris, G., and Dasgupta,

S. (2018). Transfer learning from synthetic to real im-

ages using variational autoencoders for precise posi-

tion detection. In IEEE ICIP, pages 2725–2729.

Kehl, W., Manhardt, F., Tombari, F., Ilic, S., and Navab, N.

(2017). Ssd-6d: Making rgb-based 3d detection and

6d pose estimation great again. In ICCV, pages 1521–

1529.

Klein, G. and Murray, D. (2007). Parallel tracking and map-

ping for small AR workspaces. In IEEE/ACM Inter-

national Symposium on Mixed and Augmented Reality

(ISMAR).

Klein, G. and Murray, D. (2009). Simulating low-cost

cameras for augmented reality compositing. IEEE

Transactions on Visualization and Computer Graph-

ics, 16(3):369–380.

Langlois, J., Mouch

ere, H., Normand, N., and Viard-

Gaudin, C. (2018). 3d orientation estimation of in-

dustrial parts from 2d images using neural networks.

In ICPRAM, pages 409–416.

Malamas, E. N., Petrakis, E. G. M., Zervakis, M., Petit,

L., and Legat, J.-D. (2003). A survey on industrial vi-

sion systems, applications and tools. Image and Vision

Computing, 21(2):171–188.

Massa, F., Russell, B. C., and Aubry, M. (2016). Deep ex-

emplar 2d-3d detection by adapting from real to ren-

dered views. In CVPR, pages 6024–6033.

Maturana, D. and Scherer, S. (2015). Voxnet: A 3d convolu-

tional neural network for real-time object recognition.

In IROS, pages 922–928.

Nguyen, T.-H.-C., Nebel, J.-C., and Florez-Revuelta, F.

(2016). Recognition of activities of daily living with

egocentric vision: A review. Sensors, 16(1):72.

Peng, X., Sun, B., Ali, K., and Saenko, K. (2015). Learning

deep object detectors from 3d models. In ICCV, pages

1278–1286.

Planche, B., Zakharov, S., Wu, Z., Kosch, H., and Ilic, S.

(2019). Seeing beyond appearance - mapping real im-

ages into geometrical domains for unsupervised cad-

based recognition. In IROS.

Qin, F.-w., Li, L.-y., Gao, S.-m., Yang, X.-l., and Chen, X.

(2014). A deep learning approach to the classiﬁca-

tion of 3d cad models. Journal of Zhejiang University

SCIENCE C, 15(2):91–106.

Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson,

S. (2014). Cnn features off-the-shelf: an astounding

baseline for recognition. In CVPR Workshops, pages

806–813.

Redmon, J., Bochkovskiy, A., and Sinigardi, S. (2019).

Darknet: Yolov3 - neural network for object detection.

https://github.com/AlexeyAB/darknet.

Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental

improvement. arXiv preprint arXiv:1804.02767.

Rozantsev, A., Lepetit, V., and Fua, P. (2015). On rendering

synthetic images for training an object detector. Com-

puter Vision and Image Understanding, 137:24–37.

Sarkar, K., Varanasi, K., Stricker, D., Sarkar, K., Varanasi,

K., and Stricker, D. (2017). Trained 3d models for cnn

based object recognition. In VISAPP, pages 130–137.

Su, H., Qi, C. R., Li, Y., and Guibas, L. J. (2015). Render

for cnn: Viewpoint estimation in images using cnns

trained with rendered 3d model views. In ICCV, pages

2686–2694.

Sundermeyer, M., Marton, Z.-C., Durner, M., Brucker, M.,

and Triebel, R. (2018). Implicit 3d orientation learn-

ing for 6d object detection from rgb images. In ECCV,

pages 699–715.

Toshev, A., Makadia, A., and Daniilidis, K. (2009). Shape-

based object recognition in videos using 3d synthetic

object models. In CVPR, pages 288–295.

Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani,

V., Anil, C., To, T., Cameracci, E., Boochoon, S., and

Birchﬁeld, S. (2018). Training deep networks with

synthetic data: Bridging the reality gap by domain

randomization. In CVPR Workshops, pages 969–977.

Ulrich, M., Steger, C., Baumgartner, A., and Ebner, H.

(2001). Real-time object recognition in digital images

for industrial applications. In 5th Conference on Op-

tical 3-D Measurement Techniques, pages 308–318.

Van Krevelen, D. W. F. and Poelman, R. A survey of aug-

mented reality technologies, applications and limita-

tions. International Journal of Virtual Reality, 9(2).

Wang, P.-S., Sun, C.-Y., Liu, Y., and Tong, X. (2019). Adap-

tive o-cnn: A patch-based deep representation of 3d

shapes. ACM Transactions on Graphics, 37(6):217.

Wohlhart, P. and Lepetit, V. (2015). Learning descriptors for

object recognition and 3d pose estimation. In CVPR,

pages 3109–3118.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,

and Xiao, J. (2015). 3d shapenets: A deep representa-

tion for volumetric shapes. In 2015 IEEE Conference

on Computer Vision and Pattern Recognition (CVPR),

pages 1912–1920.

Xiang, Y., Mottaghi, R., and Savarese, S. (2014). Beyond

pascal: A benchmark for 3d object detection in the

wild. In IEEE Winter Conference on Applications of

Computer Vision, pages 75–82.

CAD-based Learning for Egocentric Object Detection in Industrial Context

651