Building Synthetic Simulated Environments for Configuring and
Training Multi-camera Systems for Surveillance Applications
Nerea Aranjuelo (1,2), Jorge García (1), Luis Unzueta (1), Sara García (1), Unai Elordi (1,2) and Oihana Otaegui (1)
1 Vicomtech, Basque Research and Technology Alliance (BRTA), San Sebastian, Spain
2 Basque Country University (UPV/EHU), San Sebastian, Spain
Keywords:
Simulated Environments, Synthetic Data, Deep Neural Networks, Object Detection, Video Surveillance.
Abstract:
Synthetic simulated environments are gaining popularity in the Deep Learning Era, as they can alleviate the
effort and cost of two critical tasks to build multi-camera systems for surveillance applications: setting up
the camera system to cover the use cases and generating the labeled dataset to train the required Deep Neural
Networks (DNNs). However, there are no simulated environments ready to solve them for all kinds of scenarios
and use cases. Typically, ‘ad hoc’ environments are built, which cannot be easily applied to other contexts.
In this work we present a methodology to build synthetic simulated environments with sufficient generality to
be usable in different contexts, with little effort. Our methodology tackles the challenges of the appropriate
parameterization of scene configurations, the strategies to randomly generate a wide and balanced range of
situations of interest for training DNNs with synthetic data, and the quick capture of images from virtual cameras
considering the rendering bottlenecks. We show a practical implementation example for the detection of
incorrectly placed luggage in aircraft cabins, including the qualitative and quantitative analysis of the data
generation process and its influence on DNN training, and the required modifications to adapt it to other
surveillance contexts.
1 INTRODUCTION
In recent years, the emergence of Deep Neural Networks (DNNs) has generated significant gains in a variety
of Computer Vision tasks used to build more sophisticated vision-based systems, such as object detection
(Hou et al., 2019), object segmentation (Maninis et al., 2019), image captioning and visual relationship
detection (Xi et al., 2020), and action recognition (Zhang et al., 2019). DNNs achieve excellent results
with large-scale supervised learning approaches, which usually require large amounts of labeled data for training.
Typically, the more data we have available, the better their performance. In this context, collecting such an
amount of data can be challenging for two principal reasons.
First, there is the compliance with privacy-related regulations in some countries, such as the ’General
Data Protection Regulation’ (GDPR) of the European Union. The data collection process should put in
place appropriate technical and organizational measures to implement the data protection principles,
such as data anonymization.
Second, there is the time and cost that we need to devote
to the data collection process, considering that not
only the quantity is important but also the variety and
balance to appropriately cover all possibilities during
training. In the vast majority of cases, the data cannot
be directly extracted from the Internet since we try to
solve a particular problem in a specific scenario with
a specific camera setup. Thus, custom data sets need
to be created. This implies the execution of several
actions such as data set specification, camera setup in-
stallation, technical preparation, recording protocols,
and even actors following precise guide notes. All these
actions together with the well-known problem of la-
beling data (Mujika et al., 2019) due to the lack of
effective tools and/or the annotation complexity make
the data set generation process more challenging.
There are several ways to extend training data:
(i) augmentation techniques (Shorten and Khoshgof-
taar, 2019), (ii) domain adaptation techniques
(Singh et al., 2020), and (iii) synthetic data generation
(Nikolenko, 2019). Augmentation techniques or do-
main adaptation approaches can be used for those
cases in which we already have some annotated data
corresponding to the domain. In case we do not,
before we start collecting real data, we could tackle
the initial stages of the system’s design by generat-
ing some synthetic data. This would allow us to train
initial versions of DNNs that could help us annotate
real data, start applying the other mentioned tech-
niques, and progressively improve the system’s relia-
bility (Seib et al., 2020).
As stated in (Nikolenko, 2019), synthetic data can
be generated following different kinds of strategies,
such as compositing real data, relying on generative
models, or using simulated environments. The lat-
ter has an additional advantage when building multi-
camera systems for surveillance applications: they
typically provide virtual cameras that can help set
up the camera system to cover the use cases in the
targeted real scenario. This is particularly relevant in
scenarios not easily accessible for system designers
and/or when the camera positions and lens character-
istics are not predefined. However, there are no gen-
eral simulated environments ready to solve the cam-
era setup and data generation tasks for all kinds of
scenarios and use cases. Typically, ’ad hoc’ environ-
ments are built. In contrast, a generalist solution
should allow:
Including the required scenario-related graphical
assets in an easy manner.
Configuring context- and use-case-based scenes
with user-friendly parameters.
Capturing images from virtual camera viewpoints
quickly, taking into account that the rendering
time could be an important bottleneck in this pro-
cess. These images should include camera-related
effects, such as the geometric distortion intro-
duced by the lenses.
Randomly generating a wide and balanced range of
plausible situations of interest, applying suitable
noise to the labeled data for the appropriate
training of DNNs.
In order to respond to all these challenges, in this
work we present a methodology to build synthetic
simulated environments for configuring and training
multi-camera systems with sufficient generality to be
usable in different surveillance contexts, with little ef-
fort. We focus on static systems, i.e., those in which
the cameras visualize the scene from fixed
positions. This means that applications involving dy-
namic systems (e.g., robots that interact with the en-
vironment while patrolling areas of interest) are be-
yond the scope of this work. We show a practical implementation
example of this methodology in the context of digital-
ized on-demand aircraft cabin readiness verification
with a camera-based smart sensing system. We also
compare it to alternative state-of-the-art approaches,
including the qualitative and quantitative analysis of
the data generation process, and the required modi-
fications to adapt it to another surveillance context.
To ensure the suitability of the generated data by our
methodology, we train a classification DNN and eval-
uate its accuracy when trained with real and synthetic
images.
The rest of the paper is organized as follows: sec-
tion 2 describes prior works related to synthetic sim-
ulated environments for training intelligent systems;
section 3 explains our proposed methodology; section
4 presents the mentioned practical implementation
example and experiments; finally, section 5 presents
the conclusions and future lines of work.
2 RELATED WORK
The recent survey on synthetic data for deep learning
(Nikolenko, 2019) shows that in the last years there
has been a shift from static synthetic datasets to inter-
active simulation environments, grouped in the fol-
lowing categories: (1) outdoor urban environments
for learning to drive, (2) indoor environments and
(3) robotic and aerial navigation simulators. Current
state-of-the-art environments have been built upon
Grand Theft Auto V (GTA V) (Hurl et al., 2019),
Unity3D (Saleh et al., 2018; Scheck et al., 2020),
Unreal Engine (Lai et al., 2018; Shah et al., 2017),
CityEngine (Khan et al., 2019) or Blender (Rajpura
et al., 2017), among others.
GTA V is an action-adventure video game with re-
alistic graphics of a large detailed city and surround-
ing areas from which diverse data can be extracted,
involving virtual people, animals, cars, trucks, motor-
bikes, planes, etc. As stated in (Saleh et al., 2018), the
main drawback of video-game-based environments
is the limited freedom for customization and control
over the scenes to be captured, which makes obtaining
a large diversity and good balance of classes difficult,
due to the rather complicated procedure required to
obtain ground-truth instance-level annotations.
In contrast, (Saleh et al., 2018) claims
that their environment, built upon the game engine
Unity3D, can be set up by one person in one day.
This is very little effort, considering that it allows hav-
ing access to a virtually unlimited number of anno-
tated images with the object classes of state-of-the-art
real urban scene datasets. It captures synthetic im-
ages and instance-level semantic segmentation maps
at the same time and in real-time, with no human in-
tervention. While generating the data it renders the
original textures and shaders of included 3D objects
and other automatically created unique ones for their
Figure 1: Software architecture of the proposed methodology for creating synthetic training data for vision-based systems.
corresponding instances. Unity3D is also the basis of
the indoor environment proposed by (Scheck et al.,
2020) for the generation of synthetic data for object
detection from an omnidirectional camera placed on
the ceiling of a room. The 3D assets are generated us-
ing the skinned multi-person linear model proposed
in (Loper et al., 2015). This work points out that
Unity3D only provides a camera model for perspec-
tive and orthographic projection. They overcome this
limitation by combining four perspective cameras fol-
lowing the procedure proposed in (Bourke and Fe-
linto, 2010).
(Lai et al., 2018) proposed a universal dataset and
simulator of outdoor scenes for tasks such as pedestrian de-
tection, patrolling drones, forest fires, shooting, and
more. It is powered by the Unreal Engine and lever-
ages the AirSim (Shah et al., 2017) plugin for hard-
ware simulation. The TCP/IP protocol is used to com-
municate with external deep learning libraries. As
it is focused on training dynamic systems relying on
deep reinforcement learning (Hernandez-Leal et al.,
2019), it prioritizes the real-time interactivity of the
virtual agents over photorealistic rendering. This dif-
fers from static systems for surveillance applications,
like those that motivate this work, in which at least for
the data generation process, achieving a better render-
ing quality is more important than the real-time inter-
activity during training.
CityEngine is a program that allows generating
3D city-scale maps procedurally from a set of gram-
mar rules. It is used in (Khan et al., 2019) as part
of their method to generate an arbitrarily large, se-
mantic segmentation dataset reflecting real-world fea-
tures, including varying depth, occlusion, rain, cloud,
and puddle levels while minimizing required effort.
Blender is an open-source 3D graphics software
with Python APIs that facilitate loading 3D models
and automating the scene rendering. It was used in
(Rajpura et al., 2017) to generate a dataset to train a
DNN-based detector for recognizing objects inside a
refrigerator. It used the Cycles Render Engine available
with Blender. This engine is a physically-based path
tracer that allows getting photorealistic results, which
is beneficial for avoiding a big domain gap between
the feature distribution of synthetic and real data do-
mains.
3 PROPOSED METHODOLOGY
Figure 1 shows the architecture of the proposed
methodology for generating synthetic data for train-
ing deep learning models. The input data of our ap-
proach are the 3D assets of the scene and the scene’s
description file. The outputs are the synthetic images
and the annotations for training DNNs. The principal
modules in our approach are: (i) the scene manager,
which is responsible for loading the scene configuration;
(ii) the engine, which sets up the 3D scene according
to the provided configuration and generates the train-
ing images by rendering different camera viewpoints;
and (iii) the label generator, which generates an out-
put file containing the annotations corresponding to
the generated images.
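The flow between these modules can be summarized with a short Python driver. The following is an illustrative skeleton only: the function names (build_scene, render_views, write_labels) and the inline example description are our own assumptions for this sketch and do not correspond to a published API; the rendering and label-generation steps are stubbed out.

import json

def build_scene(description):
    # Scene manager: turn the description into a concrete scene configuration
    # (objects, cameras, lighting and rendering parameters).
    return {
        "cameras": description.get("cameras", []),
        "lights": description.get("lights", []),
        "objects": description.get("objects", []),
        "rendering": description.get("rendering", {}),
    }

def render_views(scene_config):
    # Engine: set up the 3D scene (e.g., in Blender) and render one image and
    # one instance-segmentation map per camera. Stubbed in this sketch.
    return [], []

def write_labels(segmentation_maps, out_path):
    # Label generator: derive bounding boxes or polygons from the segmentation
    # maps and store them together with the image paths.
    with open(out_path, "w") as f:
        json.dump({"frames": []}, f)

if __name__ == "__main__":
    # The description would normally come from the VCD file (json.load);
    # a tiny inline dictionary stands in for it here.
    description = {"cameras": [{"name": "cam_0"}], "lights": [],
                   "objects": [], "rendering": {}}
    config = build_scene(description)
    images, segmentation_maps = render_views(config)
    write_labels(segmentation_maps, "annotations.json")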
Similar to the data collecting process carried out
in a real environment, the first step is gathering all
the necessary 3D graphical assets to reproduce the
scene of interest. Specifically, two different groups
of graphical assets are required. The first type con-
tains those graphical assets representing the environ-
ment (i.e., the scenario in which we should accom-
plish the recordings). We assume that such assets be-
longing to the environment are static since they rep-
resent the background. The second group contains
those graphical assets representing the dynamic ob-
jects of the scene. The combination of the locations
of these assets, their poses, sizes, and appearances to-
gether with the variations of the lighting sources and
the camera properties will provide a wide range of variation for generating our custom training data.
For use cases in which some graphical assets are
unavailable, the user should create them using 3D
modelling software applications. We use 3DS Max
but it is possible to use other applications such as Au-
todesk Maya, Lightwave-3D, Vectary, Blender, etc.
Depending on the complexity, additional plugins such
as Populate (for 3DS Max) can alleviate the effort
of designing the objects. Working with a very de-
tailed 3D model can become a challenge due to the
vast amount of polygons and materials that the render
engine has to process. This is directly related to the
rendering time of the scene and it could be a trouble-
some bottleneck in the data flow when a user tries to
generate thousands of samples. Therefore, the choice between very
detailed 3D assets and more lightweight ones should be
made based on the scope of the use case.
Once all the assets of the scene are designed, our
method allows us to configure different camera se-
tups, placing objects at multiple locations and poses,
having multiple illumination sources, or configuring
the rendering parameters with minimal user interac-
tion. Furthermore, the user can easily define multiple
combinations of parameters for generating different
training data samples in a single iteration.
Specifically, the user has to define a configuration
file in which the parameterization of the scene is spec-
ified in a user-friendly way. Then, a loader interprets
the content related to the different entities according
to the nature of the parameters (objects, cameras, il-
lumination, rendering). Such information is received
by the scene manager, which interprets and handles
these data to build the scene configuration for the
user-defined sequence. This configuration is used to
replicate the target scenes in the 3D synthetic environ-
ment, which implies loading the environment and dy-
namic assets with the corresponding configurations,
as well as setting all the cameras, lighting, and ren-
dering parameters to get the desired results. Then,
the engine renders the images from the defined cam-
era perspectives. We use Blender for this purpose, al-
though the same methodology could be applied using
alternative similar programs.
The label generator is in charge of generating the
corresponding annotations based on generated seg-
mented images. The user can choose different anno-
tation types depending on the target computer vision
task (e.g., object detection or semantic segmentation).
3.1 Scene Management
The synthetic scene is replicated based on the infor-
mation provided by the user. The user is in charge
of the configuration through the description file. For
this purpose we adopt the Video Content Descrip-
tion (VCD) structured JSON-schema file format (Vi-
comtech, 2020). VCD is an open source metadata
structure able to describe complex scenes, including
annotations and all the needed scene information in a
very flexible way. In addition, it allows very easy
and fast user interaction for the file generation pro-
cess. The VCD format is compatible with the Open-
LABEL standard (ASAM, 2020). The configuration
file contains the information related to the cameras
and lighting setup, the 3D assets in the scene and the
relation between them, and the rendering parameters.
Modifying something in the scene, such as the posi-
tion of an object or the camera specifications, is sim-
ply done by changing the corresponding field in the
configuration file.
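To make the description file more concrete, the snippet below builds a minimal scene description in Python and serializes it to JSON. The field names used here (cameras, lights, objects, rendering, and so on) are a simplification invented for this example; the actual VCD/OpenLABEL schema (Vicomtech, 2020; ASAM, 2020) is richer and uses different naming.

import json

# Simplified, illustrative scene description; not the actual VCD schema.
scene_description = {
    "cameras": [{
        "name": "cam_over_seats_01",
        "model": "perspective",             # projection/distortion type
        "resolution": [640, 480],           # pixels
        "position": [1.2, 0.0, 2.1],        # meters (placeholder values)
        "orientation": [90.0, 0.0, 180.0],  # degrees (placeholder values)
        "focal_length_mm": 2.13,
        "sensor_size_mm": 4.0,
    }],
    "lights": [
        {"type": "point", "position": [0.0, 2.0, 2.3], "orientation": [0.0, 0.0, 0.0]},
        {"type": "sun", "position": [0.0, 0.0, 10.0], "orientation": [30.0, 0.0, 0.0]},
    ],
    "objects": [{
        "asset": "backpack_03",
        "relation": {"type": "below", "subject": "seat_12A"},  # spatial relation
        "frame_interval": [0, 45],                             # frames where present
        "randomize_color": True,
    }],
    "rendering": {"device": "GPU", "engine": "CYCLES", "tile_size": 64,
                  "max_bounces": 1, "caustics": False},
}

with open("scene_description.json", "w") as f:
    json.dump(scene_description, f, indent=2)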
3.2 Camera Setup and Lighting Sources
The camera setup controls the way the scenario and
the objects are represented in the 2D images. In order
to obtain realistic training data, the virtual cameras
should simulate the same sensor and lens properties as
the cameras that will be installed in the real environment.
One of the major benefits of using Blender as an engine for creating and
rendering the scene, in contrast with others such as
Unity3D, is the choice of multiple camera models.
In particular, we can generate the image projection of
the virtual camera by using an orthographic model,
a perspective model, or a panoramic model. These
three models allow emulating any combination of the
image sensor and lens type mounted in a real camera.
Table 1 shows the parameters that need to be included
in the configuration file to add as many cameras as desired.
Table 1: Camera Parameters.
  Camera model:  Camera model (different distortion types).
  Resolution:    Output image resolution (pixels).
  Position:      Sensor position (m) in X, Y and Z axis.
  Orientation:   Sensor orientation (degrees) in X, Y, and Z axis.
  Size:          Sensor size (mm).
  FOV:           FOV of the sensor (degrees).
  Focal length:  Focal length of the sensor (mm).
  Custom params: Extra params defined by the user.
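As a concrete illustration, the following sketch maps Table 1-style parameters onto Blender's Python API (bpy), which is only available inside Blender. The attribute names correspond to the Blender 2.8x/2.9x API, the helper function is our own, and the position and orientation values are placeholders; only the FOV, focal length, and sensor size match the values later used in Section 4.2.

import math
import bpy  # Available only inside Blender's bundled Python interpreter.

def add_camera(name, position, orientation_deg, focal_mm, sensor_mm, fov_deg=None):
    # Create a perspective camera from Table 1-style parameters (sketch).
    cam_data = bpy.data.cameras.new(name)
    cam_data.type = 'PERSP'            # 'ORTHO' and 'PANO' are also available
    cam_data.sensor_width = sensor_mm  # sensor size (mm)
    if fov_deg is not None:
        cam_data.lens_unit = 'FOV'
        cam_data.angle = math.radians(fov_deg)  # field of view
    else:
        cam_data.lens = focal_mm       # focal length (mm)
    cam_obj = bpy.data.objects.new(name, cam_data)
    cam_obj.location = position                                     # meters
    cam_obj.rotation_euler = [math.radians(a) for a in orientation_deg]
    bpy.context.scene.collection.objects.link(cam_obj)
    return cam_obj

# Placeholder pose; FOV, focal length and sensor size as in Section 4.2.
add_camera("cam_over_seats_01", (1.2, 0.0, 2.1), (90.0, 0.0, 180.0),
           focal_mm=2.13, sensor_mm=4.0, fov_deg=118.0)
bpy.context.scene.render.resolution_x = 640
bpy.context.scene.render.resolution_y = 480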
Another important factor affecting the projection
of the visual information from 3D to 2D is the light-
ing of the scene. Depending on the target scenario, the
user may want to add light coming from single points
which emit light in all directions or from spots with
a single direction (e.g., indoor lamps), or outdoor
lighting simulating the sun. Figure 2 shows some ex-
amples of the mentioned lighting types. Table 2 shows
the parameters that should be defined in the config-
uration file to add as many different light sources as
desired to the 3D environment.
Table 2: Lighting Parameters.
  Light model: Type (point, spot, sun).
  Position:    Light position (m) in X, Y and Z axis.
  Orientation: Light orientation (degrees) in X, Y and Z axis.
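Similarly, a light source defined with the Table 2 parameters could be created as follows. Again, this is a sketch against the Blender 2.8x/2.9x bpy API with placeholder positions, not the tool's actual implementation.

import math
import bpy  # Available only inside Blender's bundled Python interpreter.

def add_light(name, light_type, position, orientation_deg=(0.0, 0.0, 0.0)):
    # Create a light from Table 2-style parameters (sketch).
    # light_type is 'POINT', 'SPOT' or 'SUN' in Blender's naming.
    light_data = bpy.data.lights.new(name, type=light_type)
    light_obj = bpy.data.objects.new(name, light_data)
    light_obj.location = position                                     # meters
    light_obj.rotation_euler = [math.radians(a) for a in orientation_deg]
    bpy.context.scene.collection.objects.link(light_obj)
    return light_obj

add_light("cabin_point_light_01", 'POINT', (0.0, 2.0, 2.3))
add_light("sunlight", 'SUN', (0.0, 0.0, 10.0), orientation_deg=(30.0, 0.0, 0.0))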
The scene configuration is automatically repli-
cated in the 3D environment from the description file.
During this step, the scene configuration is exported
as a Blender project for those cases in which the user
wants to interactively explore the camera setup. This
way, it can be used as an interactive tool that helps to
design a proper setup for a target application. Parame-
ters such as the intrinsic and extrinsic camera param-
eters, their positions, or even the number of needed
cameras can be modified by the user while he/she vi-
sualizes the images that would be captured. Simulat-
ing a specific setup with no need to physically de-
ploy it can be of great help to avoid wrong deci-
sions that lead to a suboptimal or incorrect setup.
This way it may help to define the appropriate
number of cameras, their locations, poses, and view-
points. Thus, the proposed methodology includes the
following two bidirectional features:
Figure 2: Light sources for a 3D environment: lamp emit-
ting light in all directions (top left) and a single direction
(top right), and lighting simulating the sun (bottom).
VCD2Scene: The user defines the VCD descrip-
tion file and the scene is replicated in the 3D envi-
ronment including the 3D assets, camera configu-
rations, and lighting sources.
Scene2VCD: The user loads a 3D scene and after
making the desired modifications, he/she exports
the new setup to a VCD description file.
3.3 3D Assets in the Scene
The 3D assets’ configuration in the scene
is also defined in the VCD file. The user can add
as many objects as desired to the scene in specific
configurations. These objects should belong to the
available 3D asset types (e.g., humans, cars). Then
the 3D environment simulator interprets this informa-
tion through the local relations between the assets. The
user should have previously defined the relations that
will be present in the scene and interactively selected the
positions they belong to. For example, if the
target application is about detecting abandoned ob-
jects in an airport, some local relations that would be
necessary could be ”on” or ”below”. These relations
would let the user relate different assets for example
by describing that certain objects (e.g., a bag, a suit-
case) are on a desk or below the waiting seats. At the
same time, the environment simulator would relate
these positions with the user-selected positions. In
addition, the user defines in the VCD file the time in-
terval when each specific asset is present in the scene
in the described configuration. This method provides
high flexibility for the user to generate a wide variety
of configurations in a very fast and user-friendly way.
In order to add a higher degree of variety to the
generated data, when each asset is placed at a spe-
cific position, some random noise is added to slightly
perturb its position and orientation. In addition, the
user can choose to apply random colors to the 3D as-
sets to include more diversity in the data or, on the
contrary, to maintain the original textures.
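A minimal sketch of this randomization step is given below, assuming it runs inside Blender on a mesh object; the noise ranges and the color-randomization mechanism are arbitrary choices made for the example, since the exact values used by the tool are not specified here.

import math
import random
import bpy  # Available only inside Blender's bundled Python interpreter.

def perturb_pose(obj, max_shift_m=0.05, max_rot_deg=5.0):
    # Slightly perturb the object's position and orientation (example ranges).
    obj.location.x += random.uniform(-max_shift_m, max_shift_m)
    obj.location.y += random.uniform(-max_shift_m, max_shift_m)
    obj.rotation_euler.z += math.radians(random.uniform(-max_rot_deg, max_rot_deg))

def randomize_color(obj):
    # Assign a random base color through a fresh Principled BSDF material.
    mat = bpy.data.materials.new(name="rand_mat_" + obj.name)
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["Base Color"].default_value = (random.random(), random.random(),
                                               random.random(), 1.0)
    if obj.data.materials:
        obj.data.materials[0] = mat
    else:
        obj.data.materials.append(mat)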
3.4 Rendering Parameters
The rendering step turns the 3D scene into the out-
put 2D image. The configuration of the rendering af-
fects both the image quality and rendering time. The
optimum configuration is closely related to the target
task requirements and the available hardware. There-
fore, our approach lets the user change the most in-
fluential parameters in the configuration file. The pa-
rameters that can be modified are shown in Table 3.
The user can choose between the available rendering
engines (in the case of Blender there are three avail-
able engines), computing caustics or not, the render-
ing tile size, the device to be used (CPU or GPU), and
the maximum allowed light bounces. When the light
hits a surface, it bounces off the surface and hits an-
other one, and then the process is repeated. This is
very expensive in terms of rendering time. Decreas-
ing the maximum bounces implies limiting the num-
ber of times a ray can bounce before it is terminated and,
therefore, decreasing the time spent computing rays.
Depending on the scene and the task, the user can ad-
just this parameter for getting the desired output.
Table 3: Rendering Parameters.
  Device:   Device used for rendering (CPU/GPU).
  Engine:   Engine type used for rendering.
  Tile:     Area of the image considered during the rendering.
  Bounces:  Maximum light bounces to be applied.
  Caustics: Caustics computation.
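These parameters map directly onto Blender's Python API. The sketch below uses the attribute names of the Blender 2.8x/2.9x releases (newer versions expose the tile size under cycles.tile_size) together with the values later chosen for the cabin use case in Section 4.6; it is an illustration, not the tool's actual code.

import bpy  # Available only inside Blender's bundled Python interpreter.

scene = bpy.context.scene
scene.render.engine = 'CYCLES'             # 'BLENDER_EEVEE' is the real-time alternative
scene.cycles.device = 'GPU'                # requires a configured compute device
scene.render.tile_x = 64                   # tile size (Blender 2.8x/2.9x attributes)
scene.render.tile_y = 64
scene.cycles.max_bounces = 1               # limit light bounces
scene.cycles.caustics_reflective = False   # disable caustics computation
scene.cycles.caustics_refractive = False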
3.5 Labeled Data Generation
Training machine learning models in a supervised
way implies not only collecting the needed data but
also the corresponding annotations. Annotating the
data is an expensive process for which synthetic data
generation can be very helpful thanks to automatic
label generation. Our approach provides flexible an-
notations, with different levels of detail depending on
the target computer vision task (object detection, ob-
ject segmentation, or visual relationship detection).
Getting the 3D assets’ world coordinates and pro-
jecting them to the image plane would be a very ef-
ficient way to automatically obtain the annotations.
However, it can be observed in Figure 3 (correspond-
ing to a scene in the context explained in the follow-
ing section) that this could lead to wrong annotations
because of occlusions. Different elements of the envi-
ronment can occlude some objects, so if these occlu-
sions are not considered we would obtain annotations
for objects that are not visible in the rendered images.
An additional challenge is the camera distortion. Get-
ting accurate annotations implies taking into account
and replicating the distortion algorithm used by the
chosen 3D graphics software’s camera model to ob-
tain the final objects’ positions.
Figure 3: Object annotations based on 3D coordinates can
result in some annotated objects occluded by the seats.
We opt for rendering accurate instance-level
masks. Each asset in the simulated scene is given a
unique ID apart from the object class ID it belongs to.
These instance IDs are used to compute an alpha mask
per object. These masks are combined in a single seg-
mentation mask. When the data generation starts, the
synthetic images are rendered at the same time as the
instance-level semantic segmentation maps.
The segmentation masks are used to generate the
annotations, which are stored in the output VCD file.
This file contains both the input configuration data
and the annotations. By default, the masks are used
to get the minimum bounding boxes containing each
object’s pixels, which are represented in the VCD
file by each box’s corner coordinates. These anno-
tations can be used to train object detectors. How-
ever, our approach can be configured to save the ob-
jects’ segmentation mask too. These data are already
in the instance-level masks and are stored as object
contours, which are described as polygons in the out-
put file. In addition, the output VCD file contains the
description of the relations between the assets in the
scene. These data are complemented with the objects’
bounding boxes. These annotations can be used to
train visual relationship detectors. The annotations
need to be parsed to the required format depending
on the chosen deep learning framework (e.g., Tensor-
Flow, PyTorch).
The annotations in the output VCD file are stored
per object and framewise. Consequently, each annota-
tion is stored along with its corresponding image path.
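A minimal sketch of the mask-to-box step is shown below, assuming the rendered instance-segmentation map is available as a 2D array of integer instance IDs with 0 as background; writing the result into the VCD structure is omitted.

import numpy as np

def boxes_from_instance_map(instance_map):
    # Return {instance_id: (x_min, y_min, x_max, y_max)} from a 2D array of
    # integer instance IDs, where 0 denotes background.
    boxes = {}
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        ys, xs = np.nonzero(instance_map == inst_id)
        boxes[int(inst_id)] = (int(xs.min()), int(ys.min()),
                               int(xs.max()), int(ys.max()))
    return boxes

# Toy 4x6 map with two object instances.
toy = np.array([[0, 0, 1, 1, 0, 0],
                [0, 0, 1, 1, 0, 2],
                [0, 0, 0, 0, 0, 2],
                [0, 0, 0, 0, 0, 0]])
print(boxes_from_instance_map(toy))  # {1: (2, 0, 3, 1), 2: (5, 1, 5, 2)}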
4 PRACTICAL CASE AND
EXPERIMENTS
In order to validate the proposed methodology, we ap-
ply it to a real problem, in the context of digitalized
on-demand aircraft cabin readiness verification with
a camera-based smart sensing system. Currently, the
verification of Taxi, Take-off, and Landing (TTL) re-
quirements in aircraft cabins is a manual process. The
cabin crew members need to check that all the luggage
is correctly placed in each TTL phase. During these
phases, the luggage should not be situated in such
a way that an emergency evacuation of the aircraft
would be delayed or hindered. Figure 4 shows the
allowed and disallowed positions for the cabin lug-
gage during TTL. The verification done by the crew
members could be automated with the development
of a vision-based system. This system would be ben-
eficial in terms of operational efficiency and safety.
Developing a system capable of detecting the luggage
positions in the cabin entails two main challenges:
the design of the camera setup and the generation of
suitable data for training the corresponding machine
learning models. As stated previously, a 3D environ-
ment simulator can be very helpful for these tasks.
Figure 4: Authorised positions for luggage during TTL.
4.1 3D Assets for the Use Case
To address this problem, the first step of our method-
ology is the generation of all the 3D assets involved
in the scene of interest. In
this case, we model 22 different object types to sim-
ulate typical cabin luggage (e.g., backpacks, maga-
zines, laptops), a cabin model representing a Boeing
737 aircraft (with 19 seats), and a group of human
models with different poses and appearances for the
seated passengers. The generated 3D assets are shown
in Figure 5.
Figure 5: Generated 3D assets for the use case: cabin lug-
gage, passengers, and aircraft cabin.
4.2 Cabin Camera Setup Design
For the camera setup task, our VCD2Scene feature
allows loading the aircraft model together with some of the
modeled 3D assets using an initial configuration file
and directly visualizing the 3D scene from the virtual
camera viewpoints. The initial setup idea could be to
place some cameras on top of the seats to control the
luggage in these areas (e.g., backpacks partially below
the seats) and some other cameras above the corridor
to verify a clear exit. However, how many cameras
should be installed? Where should they be to guaran-
tee the system can see every seat without occlusions?
What lens parameters should these cameras have?
Figure 6: Captured scene from different camera positions.
The configuration shown in the left images is discarded be-
cause of the occlusions in the front middle seat.
To answer these questions we interactively change
the virtual cameras’ extrinsic and intrinsic parameters
to check the results we would obtain with each con-
figuration. We move the camera positions and orien-
tations to guarantee that all the regions of interest are
captured by the minimum number of cameras.
Figure 6 shows an example of accepted and dis-
carded camera positions for capturing the luggage in
the seat areas. At first, we could think the cameras
should be on top of the middle seats to capture the
status of 6 individual seats (left images). As can
be observed, the passenger seated in the front middle seat
generates an occlusion that does not allow verifying
whether he/she has some luggage incorrectly placed. To
avoid this blind spot, we test moving the camera to-
wards the windows to capture 4 individual seats (right
images). We can see that from this viewpoint there are
no occlusion problems, so we opt for this configura-
tion. Then, we test setting some cameras in the corri-
dors to verify that area and the seats next to the cor-
ridor. Figure 7 shows the tested configurations. Even
though the corridor space is well visualized from both shown
camera positions, the camera should be aligned with
the seats (right image) to guarantee the minimum pos-
sible occlusions in the feet area of the passengers for as
many seats as possible. These tests resulted in a setup
design of 20 perspective cameras on top of the seats
and 19 on top of the corridor. To guarantee a good vi-
sualization of the target areas, we set the parameters
of all the cameras to a FOV of 118 degrees, a focal
length of 2.13mm, and a sensor size of 4mm.
Figure 7: Captured scene from different camera positions.
The configuration shown in the left images is discarded be-
cause of more occlusions.
4.3 Configuration Files Generation
Once we have the final cameras’ configuration we
export it to an updated VCD configuration file with
the Scene2VCD functionality. The file should also
contain the 3D assets’ configuration so that our en-
vironment generator replicates the defined sequences.
The situations’ variety and the number of object in-
stances of each class can be easily controlled, but de-
pend on the configuration file we use. Generating
a balanced dataset is important to guarantee that the
trained model does not have a bias for the most com-
mon objects in the dataset. Consequently, we define
the following requirements for generating the VCD
files:
1. An object sample from each object category
should be placed at all the possible configurations
and places at least once.
2. All the object classes should be present in the
generated sequences’ frames the same number of
times.
3. Samples should show a wide variety of object ap-
pearances.
Following these criteria, we generate three VCD files.
The first file describes a sequence where each frame
contains an object type placed in a specific configu-
ration at all the cabin places (e.g., backpacks on all
the seats). The goal of this sequence is to generate
enough samples of each object type’s appearance in
simple configurations (with no interaction of other
object types). The second file randomly combines
all the objects in all the possible configurations the
same number of times. We define a sequence of 500
frames. Each of these frames contains different ob-
jects’ configurations. The last file follows the same
strategy of random combinations but it is focused on
the appearance variations. In addition to the default
random variations aggregated to the 3D assets posi-
tion, it enables the color randomization for objects
(Section 3.3). Consequently, this file extends the al-
ready defined object combinations with new ones that
contain objects with a high variety of random appear-
ances. Giving a 3D asset a random color can make
some objects look less realistic but increases the va-
riety of samples and can benefit the robustness of the
trained models. The strategy used for generating the
VCD files is not limited to the current use case; it can
be applied to other tasks and scenarios. The defined
sequences are summarized in Table 4.
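The balancing strategy behind the second file can be sketched as follows, under the assumption that each frame simply assigns object classes drawn from a shuffled round-robin pool to randomly chosen free places; the actual generator may differ in its details.

import random

def balanced_random_frames(object_classes, places, relations,
                           n_frames, objects_per_frame, seed=0):
    # Generate frame descriptions in which every object class is used
    # (approximately) the same number of times over the whole sequence.
    rng = random.Random(seed)
    pool = []
    frames = []
    for _ in range(n_frames):
        frame, taken = [], set()
        for _ in range(objects_per_frame):
            if not pool:                    # refill and reshuffle the class pool
                pool = list(object_classes)
                rng.shuffle(pool)
            cls = pool.pop()
            place = rng.choice([p for p in places if p not in taken])
            taken.add(place)
            frame.append({"class": cls, "place": place,
                          "relation": rng.choice(relations)})
        frames.append(frame)
    return frames

frames = balanced_random_frames(
    object_classes=["backpack", "magazine", "laptop"],
    places=["seat_%d" % i for i in range(1, 20)],
    relations=["on", "below"],
    n_frames=5, objects_per_frame=4)
print(frames[0])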
4.4 3D Scenes Generation
We use the VCD files to automatically replicate the 3D scenes described in them with no manual intervention.
Figure 8: Data generation pipeline example. The input VCD file describes the scene that is replicated by the synthetic
3D environment generator, including the present objects and their relations. The tool outputs rendered images and corresponding
annotations in VCD format, which correspond to data from different camera perspectives.
Table 4: Summary of the generated data.
  Sequence                      | VCD 1                     | VCD 2            | VCD 3              | Total
  Description                   | Single object class/frame | Random, balanced | Random appearances | -
  Number of frames              | 45                        | 500              | 500                | 1,045
  Number of rendered images     | 1,755                     | 19,500           | 19,500             | 40,755
  Number of 3D object instances | 5,966                     | 33,000           | 33,000             | 71,966
The 3D environment is configured with all the
data regarding the cameras, lighting, rendering, and
3D assets to replicate the defined scenes in the cabin.
The environment is dynamically configured when the
data generation starts.
4.5 Generated Data
Each frame in the sequences is captured from all the
defined cameras, so for each frame 39 images are ren-
dered from different camera viewpoints. An output
VCD file is generated for each sequence containing
the data already in the input file (e.g., camera con-
figurations, relations between objects) and the addi-
tion of the output data. The output data include the
paths to the generated images and the corresponding
bounding box annotations for each object or passen-
ger. Figure 8 shows the data generation process from
the input VCD file to the output synthetic data. The
left image shows an example of the objects and the
relations between them as it is defined in the VCD
file. The 3D environment simulator processes this in-
formation to configure the 3D scene. It can be seen
that the example data (’magazine-4 is on Passenger-
14’) is replicated in the synthetic 3D world. This
scene is captured from the defined cameras to pro-
duce the corresponding rendered images, along with
the annotations. In the output synthetic images it can
be seen that the same 3D assets are captured
from different cameras at the same time. The right im-
ages of Figure 8 show an example of the additional in-
formation that the output VCD file contains (objects’
bounding box coordinates in each rendered frame).
This information can be used not only for
training object detectors but also for training visual
relationship detectors.
4.6 Analysis of the Proposed Approach
In this section, we provide a qualitative and quantita-
tive analysis of the proposed methodology’s charac-
teristics. Table 5 shows a qualitative comparison of
some state-of-the-art approaches to ours.
The environment and 3D assets involved in all the
data generation methods are manually modeled with
the help of 3D modeling software or gathered from
public repositories. (Khan et al., 2019) also uses data
priors such as OpenStreetMap data to model differ-
ent city environments but still needs manual work to
complete and adjust the scene data.
Once the 3D environment is prepared, if the user
wants to change certain scene configurations with our
method, such as the light properties, the objects in the
scene or the relation between these objects, he/she can
define the modified scene in the configuration file using a user-friendly parameterization.
Table 5: Comparison between state-of-the-art synthetic dataset generation methodologies and our approach.
  Feature              | (Lai et al., 2018)                                         | (Scheck et al., 2020)             | (Saleh et al., 2018) | (Rajpura et al., 2017) | (Khan et al., 2019)                 | Our approach
  Environment modeling | Manual                                                     | Manual                            | Manual               | Manual                 | Manual (data priors)                | Manual
  Scene modifications  | Manual                                                     | Manual                            | Manual               | Manual                 | Manual                              | Automatic
  Scene capture        | Single-sensor                                              | Single-sensor                     | Single-sensor        | Multisensor            | Single-sensor                       | Multisensor
  3D assets relations  | Users’ interactions                                        | None                              | None                 | None                   | None                                | Spatial configuration
  Annotations          | Obj. detection, sem. segmentation, reinforcement learning | Obj. detection, sem. segmentation | Sem. segmentation    | Obj. detection         | Sem. segmentation, depth estimation | Obj. detection, sem. segmentation, visual relationship detection
  Generality           | Default scenarios                                          | Limited                           | Driving scenarios    | Limited                | Urban driving scenarios             | Surveillance scenarios
Then the new
3D scenario will be automatically replicated. Our ap-
proach is the only one that proposes to dynamically
configure the whole environment when the data genera-
tion process starts. Scene modifications can be done
in the different approaches but none of them provides
a high-level mechanism like ours to vary the environ-
ment. Related to the possible 3D assets relations, (Lai
et al., 2018) allows adding some limited user inter-
actions. Our approach allows adding spatial location
relations between the 3D assets.
Regarding image capture, all the works pro-
pose a single source to capture the scene, except for
(Rajpura et al., 2017) and our approach, which allow
capturing the same time interval from cameras with
different viewpoints.
The data generated by the presented works are ori-
ented to the training of DNNs. Typical output in-
cludes annotations for object detection or semantic
segmentation tasks, to which (Lai et al., 2018) adds
reinforcement learning. Our approach also generates
labels suitable for training visual relationship detec-
tors.
One of the main advantages of our approach over
the others is its generality and the possibility to adapt
it with little effort to new scenarios and tasks in the
surveillance field. In the state-of-the-art works, this
capability is very limited to the predefined scenarios.
Regarding the rendering time, it depends on differ-
ent factors such as the rendering engine, the complex-
ity of the scene, the number of 3D assets included,
the lighting configuration, or the polygon count of the
modeled objects. For the current aircraft use case,
we configure the rendering to be as fast as possible
using the parameters in the configuration file while main-
taining good output quality. We use the real-time
viewport shading rendering for the camera setup de-
sign (Section 4.2) so that we can see the modifica-
tions’ effect interactively. For the data generation, we
use the Cycles Rendering Engine, which is slower but
provides more photorealistic results. Table 6 shows
the rendering time (s) computed in different tests to
choose the optimum parameters to render a 640x480
pixel image of the cabin environment including 50
3D assets (e.g., suitcases, passengers) apart from the
background ones, 16 point lights, and outdoor sunlight.
We use an Nvidia Tesla T4 GPU.

Table 6: Rendering time in seconds for different configuration parameters using an Nvidia Tesla T4 GPU.
  Max Bounces:  12: 11.9 | 6: 11.7 | 1: 9.5
  Caustics:     Yes: 9.5 | No: 9.4
  Tile Size:    16: 16.3 | 32: 10.8 | 64: 9.3 | 128: 9.4 | 256: 9.5

It can be
seen that reducing the maximum light bounces from
12 to 1 saves 2.4 seconds, without noticeable
changes in the output image, so we adopt only 1 light
bounce. Disabling the caustics effects does not change the
scene appearance either, and we save an additional 0.1 sec-
ond. These times are based on using a tile size of
128, but the optimum tile size for our hardware set
up is 64. These parameter changes allow us to move
from a rendering time of 11.9 seconds to 9.3 seconds.
More than 2 seconds per sample in a process where
we need to generate thousands of images is a remark-
able time-saving. Regarding the rendering time of re-
lated state-of-the-art approaches, some works based
on game-engines claim to render in real-time (Scheck
et al., 2020; Khan et al., 2019). As stated in (Khan
et al., 2019), a simplified lighting model allows real-
time rendering of massive amounts of geometry with
limited realism. The Cycles Rendering Engine, as a
physically-based path tracer, allows generating good
quality results in a reasonable time. Domain adapta-
tion techniques are important to solve the domain gap
that DNNs trained with synthetic data can present.
Generating data with a certain degree of realism min-
imizes the problem to be solved by those techniques.
However, Blender also has a real-time rendering en-
gine (Eevee) that can be used in tasks where the ren-
dering time is considered a higher priority than
the data quality.
We apply our methodology to the aircraft use case,
but adapting it to another surveillance scenario re-
quires little effort from the user. Once the user col-
lects all the 3D assets of the new environment, the
flexibility of the method allows building a totally new
scene. The user interactively designs a suitable cam-
era setup, along with the selection of target places.
Then he/she can start gathering varied training data
for the considered task based on the configuration
files.
4.6.1 DNN Training
In order to see how a DNN would perform with the
generated synthetic data and real data, we train the
model EfficientNet-B0 (Tan and Le, 2019) to clas-
sify whether an image contains a correct or incorrect
situation (e.g., cabin luggage correctly or incorrectly
placed). We define a region of interest (ROI) for each
seat, which will then be classified as correct or in-
correct. We replicate the cabin mock-up to also capture
real images from the camera over the seats. Some
examples of both real and synthetic ROI images are
shown in Figure 9.
Figure 9: Real (top) and synthetic (bottom) ROI samples
for training a classification DNN with correct and incorrect
situations.
We do two tests to see the contribution of our syn-
thetic images to the DNN accuracy. First, we train
the DNN only with real data and test it on a separate
subset of real images. Then, we repeat the training
but incorporate the same number of samples from
our generated synthetic dataset and test it again on
the same real test images. We capture and annotate 1,600 real
images.
We initialize the DNN with pretrained weights on
the ImageNet dataset (Deng et al., 2009) and fine-tune
it with our datasets. We train each DNN for 50 epochs
with a batch size of 40 and the RMSprop optimizer
(Bengio and CA, 2015). The model trained only with
real images achieves 88.52% accuracy when classify-
ing the real test samples, while the model which also
includes synthetic samples achieves 94.26% accuracy
on the same images. Consequently, the positive con-
tribution of the generated synthetic data to the model
accuracy confirms the suitability of the generated im-
ages for DNN training.
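A minimal sketch of this fine-tuning setup is given below, assuming TensorFlow/Keras and images organized in one folder per class; the data paths, learning rate, and input resolution are placeholders, while the epochs, batch size, and optimizer follow the values reported above.

import tensorflow as tf

IMG_SIZE = (224, 224)   # EfficientNet-B0 default input resolution (assumption)
BATCH_SIZE = 40

# Hypothetical directory layout: one sub-folder per class (correct / incorrect).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", label_mode="binary", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", label_mode="binary", image_size=IMG_SIZE, batch_size=BATCH_SIZE)

# EfficientNet-B0 pretrained on ImageNet with a new binary classification head.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg", input_shape=IMG_SIZE + (3,))
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=50)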
5 CONCLUSIONS
This work presents a methodology to build simulated
3D environments for configuring and training multi-
camera systems with enough generality to be used in
different surveillance contexts, with little effort. Our
proposal helps in designing an appropriate camera
system to cover the target use cases and avoid ex-
pensive system setup errors. Once the camera setup
is done, our method allows generating a wide range
of situations of interest for training DNNs with suit-
able synthetic data. These situations should be bal-
anced for a successful DNN training. This is guar-
anteed by the input configuration files, which allow
controlling the content of the data with a fast,
user-friendly scene parameterization. Certain randomness is al-
ways added to the scenes in terms of objects’ spatial
locations or appearances for greater data variability.
We show a practical implementation example of
our methodology in the context of digitalized on-
demand aircraft cabin readiness verification with a
camera-based smart sensing system. We design a
39-camera system and generate a training dataset
of 40,755 images containing a total number of 71,966
object instances in the cabin environment. We fol-
low a data balancing strategy to guarantee the suit-
ability of the generated data based on the configura-
tion files. We also compare our methodology to al-
ternative state-of-the-art approaches. Features such
as generality, flexibility, and the multi-camera setting
stand out compared with the features provided by the other ap-
proaches. We use a subset of the generated dataset
to train a classification DNN jointly with real images.
We show that incorporating our synthetic samples into
the training dataset boosts the accuracy of the model
when tested on real images. Consequently, the im-
ages generated by our methodology are suitable for
training DNNs.
Future work includes the extension of the method-
ology to support data generation for action recogni-
tion tasks, which requires the assets to interact
beyond their configuration and relations.
We also plan to continue our experiments on
training with the generated synthetic data to ensure
their suitability for different computer vision tasks as
we keep advancing. In addition, future research will
explore the emerging DNN techniques to generate 3D
models from 2D images, which may help speed
up the manual process of gathering or designing the
3D assets required for any virtual environment.
ACKNOWLEDGEMENTS
This work has received funding from the Clean Sky
2 Joint Undertaking under the European Union’s
Horizon 2020 research and innovation program
under grant agreement No. 865162, SmaCS
(https://www.smacs.eu/).
REFERENCES
ASAM (2020). OpenLABEL. https://www.asam.net/
project-detail/scenario-storage-and-labelling/.
Bengio, Y. and CA, M. (2015). Rmsprop and equilibrated
adaptive learning rates for nonconvex optimization.
CoRR, abs/1502.04390.
Bourke, P. and Felinto, D. (2010). Blender and immer-
sive gaming in a hemispherical dome. In International
Conference on Computer Games, Multimedia and Al-
lied Technology, volume 1, pages 280–284.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Hernandez-Leal, P., Kartal, B., and Taylor, M. (2019). A
survey and critique of multiagent deep reinforcement
learning. Autonomous Agents and Multi-Agent Sys-
tems, 33:750–797.
Hou, Q., Cheng, M., Hu, X., Borji, A., Tu, Z., and Torr,
P. H. S. (2019). Deeply supervised salient object
detection with short connections. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
41(4):815–828.
Hurl, B., Czarnecki, K., and Waslander, S. L. (2019).
Precise synthetic image and LiDAR (PreSIL) dataset
for autonomous vehicle perception. arXiv preprint
arXiv:1905.00160.
Khan, S., Phan, B., Salay, R., and Czarnecki, K. (2019).
ProcSy: Procedural synthetic dataset generation to-
wards influence factor studies of semantic segmenta-
tion networks. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 88–96.
Lai, K.-T., Lin, C.-C., Kang, C.-Y., Liao, M.-E., and Chen,
M.-S. (2018). VIVID: Virtual environment for visual
deep learning. In Proceedings of the ACM Interna-
tional Conference on Multimedia (MM), pages 1356–
1359.
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and
Black, M. J. (2015). SMPL: A skinned multi-person
linear model. ACM Transactions on Graphics (Pro-
ceedings of SIGGRAPH Asia), 34(6):248:1–248:16.
Maninis, K. K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., and Van Gool, L. (2019).
Video object segmentation without temporal informa-
tion. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 41(6):1515–1530.
Mujika, A., Fanlo, A. D., Tamayo, I., Senderos, O., Baran-
diaran, J., Aranjuelo, N., Nieto, M., and Otaegui, O.
(2019). Web-based video-assisted point cloud anno-
tation for ADAS validation. In Proceedings of the In-
ternational Conference on 3D Web Technology, pages
1–9.
Nikolenko, S. I. (2019). Synthetic data for deep learning.
arXiv preprint arXiv:1909.11512.
Rajpura, P. S., Bojinov, H., and Hegde, R. S. (2017). Ob-
ject detection using deep CNNs trained on synthetic
images. arXiv preprint arXiv:1706.06782.
Saleh, F. S., Aliakbarian, M. S., Salzmann, M., Peters-
son, L., and Alvarez, J. M. (2018). Effective use of
synthetic data for urban scene semantic segmentation.
arXiv preprint arXiv:1807.06132.
Scheck, T., Seidel, R., and Hirtz, G. (2020). Learning from
theodore: A synthetic omnidirectional top-view in-
door dataset for deep transfer learning. In Proceed-
ings of the IEEE Winter Conference on Applications
of Computer Vision (WACV), pages 932–941.
Seib, V., Lange, B., and Wirtz, S. (2020). Mixing real
and synthetic data to enhance neural network train-
ing a review of current approaches. arXiv preprint
arXiv:2007.08781.
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2017). Air-
Sim: High-fidelity visual and physical simulation for
autonomous vehicles. Field and Service Robotics,
pages 621–635.
Shorten, C. and Khoshgoftaar, T. (2019). A survey on image
data augmentation for deep learning. Journal of Big
Data, 6:1–48.
Singh, R., Vatsa, M., Patel, V. M., and Ratha, N., editors
(2020). Domain Adaptation for Visual Understanding.
Springer, Cham.
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethink-
ing model scaling for convolutional neural networks.
arXiv preprint arXiv:1905.11946.
Vicomtech (2020). VCD - video content description. https://vcd.vicomtech.org/.
Xi, Y., Zhang, Y., Ding, S., and Wan, S. (2020). Visual ques-
tion answering model based on visual relationship de-
tection. Signal Processing: Image Communication,
80:115648.
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., and Zheng,
N. (2019). View adaptive neural networks for high
performance skeleton-based human action recogni-
tion. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 41(8):1963–1978.