Synthetic Driver Image Generation for Human Pose-Related Tasks
Romain Guesdon (https://orcid.org/0000-0003-4769-9843), Carlos Crispim-Junior (https://orcid.org/0000-0002-5577-5335) and Laure Tougne Rodet (https://orcid.org/0000-0001-9208-6275)
Univ. Lyon, Univ. Lyon 2, CNRS, INSA Lyon, UCBL, Centrale Lyon, LIRIS UMR5205, F-69676 Bron, France
Keywords:
Dataset, Synthetic Generation, Neural Networks, Human Pose Transfer, Consumer Vehicle.
Abstract:
The interest in driver monitoring has grown recently, especially in the context of autonomous vehicles. However, training deep neural networks for computer vision requires more and more images with significant diversity, which does not match the reality of the field. This lack of data prevents networks from being properly trained for certain complex tasks such as human pose transfer, which aims to produce an image of a person in a target pose from another image of the same person. To tackle this problem, we propose a new synthetic dataset for pose-related tasks. Using a straightforward pipeline to increase the variety between the images, we generate 200k images with a hundred human models in different cars, environments, lighting conditions, etc. We measure the quality of the images of our dataset and compare it with other datasets from the literature. We also train a network for human pose transfer in the synthetic domain using our dataset. Results show that our dataset matches the quality of existing datasets and that it can be used to properly train a network on a complex task. We make both the images with the pose annotations and the generation scripts publicly available.
1 INTRODUCTION
The increasing complexity of computer vision tasks
over the years has led to a growth in the size of deep
learning models. Therefore, more and more data has
been required to train the deep neural networks, with
more diversity among the images. Large-scale general datasets have been published to address this problem, such as ImageNet (Deng et al.,
2009), COCO (Lin et al., 2015), or DeepFashion (Liu
et al., 2016) datasets. However, specific contexts lack
sufficiently large datasets, especially because of the
high cost of acquisition in comparison with the size
of the research field.
Human Pose Transfer (HPT) is an example of a
data-demanding task. HPT aims to generate, from a
source image of a person, a new image of that same
person in a different target pose. Generative Ad-
versarial Networks (GAN) (Goodfellow et al., 2014)
achieve good performances on this task (Zhu et al.,
2019; Huang et al., 2020; Zhang et al., 2021), mostly
in two contexts: fashion and video surveillance im-
ages. These two domains correspond to the two main
datasets available for this task (Liu et al., 2016; Zheng
et al., 2015). However, a substantial number of im-
ages, with high diversity in persons, clothes, and en-
vironment, is required to properly train GAN models.
These requirements are difficult to achieve in specific
contexts, for example, images of drivers in consumer
vehicles. In this context, data acquisition requires set-
ting up experiments in a moving car (Guesdon
et al., 2021) or at least in a simulator (Martin et al.,
2019). These constraints lead to the availability of
few images with little variety of subjects.
A commonly used solution to tackle a lack of
training data is geometric data augmentation such as
random rotation, crop, scaling, etc. (Simard et al.,
2003; Krizhevsky et al., 2012). However, these meth-
ods may be sufficient for rigid objects but are not fully
suitable for articulated ones. An alternative is the use
of synthetic data. This process allows the generation
of a high number of images with a theoretically infi-
nite diversity and accurate annotations, within a lim-
ited time and financial cost. Even if a domain gap ex-
ists between synthetic and real images, the literature has
demonstrated that generated images can be used to as-
sist the training of networks on real-world images for
many tasks (Juraev et al., 2022; Wu et al., 2022; Kim
et al., 2022). In the driving context, few synthetic
public datasets exist (Cruz et al., 2020; Katrolia et al.,
2021). Furthermore, these datasets mainly focus on
monitoring tasks and emphasize actions rather than subject diversity.
Figure 1: Samples of images from the proposed synthetic dataset.
To address this lack of diversity in driver images, we propose a large dataset of synthetic images for pose-related tasks. We develop a pipeline where we diversify not only the subjects (with 100 driver models), but also the car cockpits, the environment, the lighting conditions, etc. The images are publicly available, as well as the scripts used for data generation (https://gitlab.liris.cnrs.fr/aura_autobehave/synthetic_drivers).
This paper is organized as follows. Section 2
presents related work on driver image datasets. In
Section 3, we present our proposed process and the
synthetic dataset along with the choices made for the
generation. We show and evaluate in Section 4 the
generated images and an application of our dataset
with an HPT architecture. Finally, Section 5 presents
our conclusions and future work.
2 RELATED WORK
Work in the computer-vision field about drivers in
consumer vehicles mainly focuses on passenger mon-
itoring, mostly for safety-related tasks. Therefore,
datasets in real-world conditions or in driving simu-
lators have been published for tasks such as driver activity recognition (Ohn-Bar et al., 2014; Jegham et al.,
2019; Martin et al., 2019; Borghi et al., 2020), driver
pose estimation (Guesdon et al., 2021), driver gaze
estimation (Ribeiro and Costa, 2019; Selim et al.,
2020), and driver awareness monitoring (Abtahi et al.,
2014).
Most of these datasets contain RGB images from
video clips annotated for the target tasks. However,
these datasets usually do not provide pose annotations
required for the study of human pose transfer tasks.
Drive&Act (Martin et al., 2019) proposes a multi-
modal (RGB, NIR, depth) and multi-view dataset in
a static driving simulator, with 3D human pose and
activity annotations. DriPE dataset (Guesdon et al.,
2021) depicts drivers in consumer vehicles in real-
world driving conditions, with manually annotated
poses. However, these two datasets contain only 15
and 19 subjects, respectively, which is not enough to
fully train deep neural networks on a complex task,
such as HPT, according to our observations.
Regarding synthetic data for driver monitoring,
two datasets have been published. SVIRO (Cruz et al., 2020) is a synthetic dataset for scenarios in the passenger cockpit. It depicts people and objects in the
car back seat with different placements and provides
RGB images along with infrared imitation, depth
maps, segmentation masks, and human pose ground-
truth keypoints. TICaM (Katrolia et al., 2021) is a
dataset with both real and synthetic images for vehicle
interior monitoring, with real images recorded in a car
cockpit simulator.
Figure 2: Global process for the generation of the synthetic driver images.
The dataset provides RGB, depth,
and infrared images with action annotations and seg-
mentation ground-truth masks. The two main issues
with these datasets are the front view angle, which
does not allow a clear view of the driver’s full body,
and the subject diversity, which remains too low to train large models such as GANs (Goodfellow et al., 2014) on these data without overfitting. We can also mention Cañas et al. (Canas et al., 2022), who describe a global approach to generate synthetic images for passenger monitoring. However, their work only partially considers the question of random pose generation, and neither scripts nor images have been made publicly available so far.
In summary, there currently exists no publicly
available dataset suited to study driver pose transfer
with a high variety of driver subjects and a full body
view camera angle.
3 DATASET GENERATION
Because the driver datasets in the literature for hu-
man pose-related tasks lack diversity, deep generative
methods cannot be trained and used to increase the
available data quantity. We propose a process based
on a standard pipeline for 3D scene generation to ren-
der new synthetic images. Using this method, we
build a large dataset depicting one hundred human in-
stances, several car models, variations of luminance,
etc. In this section, we describe the generation pro-
cess and present statistics about the generated images.
3.1 3D Models
To generate synthetic driver images, two objects need
to be modeled: cars and humans. Human models are
generated using MakeHuman Community (MakeHu-
man, 2022). This open-source software produces 3D
models with many parameters like age, height, mus-
cle mass, ethnicity, face proportions, etc. Models are
generated with a rigged skeleton, which allows ani-
mating them easily and realistically. We use the de-
fault clothes from MakeHuman along with some pro-
vided by the community. To generate many models,
we use the Mass Produce module which allows set-
ting an interval for each parameter. We also randomly
change the color of the clothes’ textures when gener-
ating the full scene to increase the diversity. The car
models are obtained from the Unity Asset Store (Unity,
2022). We select different types of consumer vehicles
to represent various car cockpits (e.g., family cars,
sports cars, pick-ups), with equipment going from
plain dashboards to touchscreens.
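As a concrete illustration of this color randomization, the snippet below recolors clothing materials through Blender's Python API; the name filter and the use of a Principled BSDF node are assumptions about how the MakeHuman assets are imported, not the authors' exact script.

```python
import colorsys
import random

import bpy  # Blender's Python API; run this inside Blender


def randomize_clothes_colors(name_hint="cloth", seed=None):
    """Assign a random base color to the materials of clothing meshes.

    Hypothetical sketch: assumes imported clothing meshes can be identified
    by a substring of their name and use a Principled BSDF material.
    """
    rng = random.Random(seed)
    for obj in bpy.data.objects:
        if obj.type != 'MESH' or name_hint not in obj.name.lower():
            continue
        for slot in obj.material_slots:
            mat = slot.material
            if mat is None or not mat.use_nodes:
                continue
            bsdf = mat.node_tree.nodes.get("Principled BSDF")
            if bsdf is None:
                continue
            # Random hue with moderate saturation/value to keep plausible fabrics.
            r, g, b = colorsys.hsv_to_rgb(rng.random(),
                                          rng.uniform(0.2, 0.9),
                                          rng.uniform(0.2, 0.9))
            bsdf.inputs["Base Color"].default_value = (r, g, b, 1.0)
```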
3.2 Pose Generation
Human models are animated using the included
rigged skeleton (Figure 3-a). Theoretically, each bone
can rotate freely around the body joint where its head
is attached, which gives it three degrees of freedom.
Figure 3: Illustrations of the generation process in Blender with (a) the skeleton rig, (b) the fixed wrist targets (only used for the additional driving images), (c) the default scene perspective, an example of the final scenes without (d) and with (e) the light rendering, and (f) a view from the camera.
However, several constraints must be considered in
our case. First, no real human bone can fully rotate
in any direction. If we take the forearm as an example and consider that it is fully extended by default, it can rotate approximately from 0 to 150° around the pitch and roll axes and cannot rotate around its yaw axis (Maik et al., 2010). Secondly, the car cabin is a constricted space, which imposes many constraints to avoid collisions between the human and car models. Therefore, to
address these constraints, we proceed as follows:
1. We define a default pose, which corresponds to
the person sitting straight on the car seat with the
arms close to the upper body.
2. We perform small random rotations on the head,
back, and legs considering the human body con-
straints and the car cabin.
3. We randomly define a target for each wrist, in
front of the subject and within the arm range.
We also add a constraint to force the targets to
be within a defined box that represents the cabin
space. The boxes are manually defined before-
hand for each car model to best match their shape.
4. We use an inverse kinematic solver integrated
into the 3D modeling software to place the wrists
on the targets. We only move the upper arms
and forearms during this process, which does not
modify the back inclination. This is to avoid un-
natural poses in the car seat. Kinematic angle con-
straints are set on each involved bone to match
real body constraints.
This process allows us to easily generate many ran-
dom plausible poses while taking into consideration
body and environment constraints.
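To make step 3 concrete, the following sketch samples one wrist target inside a manually defined cabin box and clamps it to the reachable range of the arm; the coordinates and the arm length are placeholder values, and the actual placement of the wrist on the target is then delegated to Blender's IK solver as described in step 4.

```python
import random

import numpy as np


def sample_wrist_target(shoulder, arm_length, cabin_box, rng=random):
    """Sample a random 3D wrist target in front of the driver.

    shoulder:   (x, y, z) position of the shoulder joint.
    arm_length: maximum shoulder-to-wrist distance (upper arm + forearm).
    cabin_box:  ((xmin, ymin, zmin), (xmax, ymax, zmax)) box bounding the cabin
                space in front of the subject, defined per car model.
    """
    shoulder = np.asarray(shoulder, dtype=float)
    lo, hi = (np.asarray(c, dtype=float) for c in cabin_box)

    # Draw a point uniformly inside the cabin box.
    target = np.array([rng.uniform(l, h) for l, h in zip(lo, hi)])

    # Clamp the target to the reachable sphere around the shoulder so that
    # the IK solver can always place the wrist exactly on it.
    offset = target - shoulder
    dist = np.linalg.norm(offset)
    if dist > arm_length:
        target = shoulder + offset * (0.95 * arm_length / dist)
    return target


# Example with hypothetical values (meters, in the car reference frame).
target = sample_wrist_target(shoulder=(0.2, -0.1, 0.9),
                             arm_length=0.60,
                             cabin_box=((-0.3, 0.1, 0.5), (0.5, 0.8, 1.2)))
```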
However, random positioning is very unlikely to
generate standard driving poses, such as hands on the
wheel or the gear lever. This is not problematic when
considering the car as an autonomous vehicle of level 2 or 3, for example, but can be less realistic for manual driving tasks (in a vehicle with autonomy level 0 or 1). Therefore, we additionally set in each car
model fixed wrist targets on the wheel, gear lever, and
dashboard (Figure 3-b). We use these targets instead
of random ones to separately generate more realistic
driving images.
3.3 Generation Process
To set up the full scene and render the images, we use
Blender 3.2 (Blender, 2022) modeling software. Its
advantages are that it is free and open-source, accessi-
ble, and can be fully automated using Python scripts.
The global rendering process is summarized in Fig-
ure 2.
We first create the default scene by setting up a
fixed camera, a sunlight source, and a panel for the
background image (Figure 3-c). We use high-quality
images of landscapes to simulate the background,
which allows us to easily leverage a high number of different backgrounds from free picture databases. The 3D models are then imported into the scene.

Table 1: Comparison table between different datasets.

Dataset        | SVIRO | TICaM | Drive&Act | DriPE | Market | Fashion | Ours
Year           | 2020 | 2021 | 2019 | 2021 | 2015 | 2016 | 2022
#Frames        | 25K | 126K | 9.6M | 10k | 33k | 54k | 200k
#Subjects      | 22 adults | 13 | 15 | 19 | 3k | 10k | 100
#Views         | 1 | 1 | 6 | 1 | - | - | 1
Synthetic/Real | Synthetic | Both | Real | Real | Real | Real | Synthetic
Data           | Depth, RGB, IR | Depth, RGB, IR | Depth, RGB, IR | RGB | RGB | RGB | RGB
Annotation     | Classification labels, 2D box mask, 2D skeleton | 2D+3D boxes, 3D segmentation mask, activity | Activity, 2D+3D skeletons | 2D boxes, skeleton | 2D skeleton | 2D skeleton | 2D+3D skeletons and boxes
Then, we randomly define several configurations,
where a configuration is composed of a human model,
a car model, a background, small camera deviations,
and lighting parameters (Figure 3-d, e; note that the black triangle in the illustrations represents the up direction of the camera model). We use a Blender add-on that places the sun in a realistic position based on GPS coordinates and the date and time, which we set randomly.
We also generate night configurations by selecting
night backgrounds and dimming the lights. The night
setting is randomly used 20% of the time.
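As a sketch, the configuration sampling can be written as a short Python routine; the dictionary fields, the camera-offset magnitude, and the coordinate ranges are illustrative assumptions, while the 20% night probability and the asset counts follow the text (Section 4.1 gives the exact numbers of models and configurations).

```python
import random


def sample_configuration(human_models, car_models, day_backgrounds,
                         night_backgrounds, rng=random):
    """Draw one random scene configuration (field names are hypothetical)."""
    night = rng.random() < 0.2  # the night setting is used 20% of the time
    return {
        "human": rng.choice(human_models),
        "car": rng.choice(car_models),
        "background": rng.choice(night_backgrounds if night else day_backgrounds),
        "night": night,
        # Small random deviation around the default camera placement (meters).
        "camera_offset": [rng.uniform(-0.05, 0.05) for _ in range(3)],
        # Random GPS coordinates and date/time for the sun-positioning add-on.
        "latitude": rng.uniform(-60.0, 60.0),
        "longitude": rng.uniform(-180.0, 180.0),
        "month": rng.randint(1, 12),
        "hour": rng.randint(0, 23),
    }


# Placeholder asset lists; the paper uses 100 human models and 7 car models.
humans = [f"human_{i:03d}" for i in range(100)]
cars = [f"car_{i}" for i in range(7)]
day_bgs, night_bgs = ["meadow.jpg"], ["city_night.jpg"]
configurations = [sample_configuration(humans, cars, day_bgs, night_bgs)
                  for _ in range(1000)]  # 1,000 configurations, 200 poses each
```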
Finally, for each configuration, we generate a pose
using the process described in Section 3.2 (Figure 3-
f) and render the image. We also save the 2D and 3D
coordinates of each body joint, the bounding boxes,
and the camera’s intrinsic and extrinsic parameters.
4 RESULTS AND DISCUSSIONS
In this section, we present and discuss methods used
to evaluate the relevance of the proposed dataset. We
first compare it with other state-of-the-art datasets us-
ing metrics from the literature to measure the quality
of the images. Then, we use the task of human pose
transfer to evaluate whether our synthetic dataset is
large and diversified enough for a complex task.
4.1 Dataset Evaluation
We define a total of 1,000 configurations by randomly picking among 7 cars and 100 human models. For
each configuration, 200 poses are generated, which
results in a dataset of 200k images.
In Table 1, we compare our dataset with sev-
eral other datasets from the literature. We can see
that our dataset contains more images than both the synthetic driver datasets and the real-world HPT datasets. The
only exception is Drive&Act, which is composed of
video clips instead of single images, which multiplies
the total number of frames. However, the proposed
dataset presents far more driver models than previous
datasets.
Then, we compare the quality of the synthetic im-
ages with the ones in other datasets. For this purpose,
we use the Inception Score (IS) (Salimans et al., 2016)
which is a metric commonly used to evaluate the qual-
ity of images generated by GAN (Zhu et al., 2019;
Tang et al., 2020; Huang et al., 2020). This metric
is based on the predictions from a pre-trained Incep-
tionNet classifier (Szegedy et al., 2016). Since In-
ception Score is sensitive to image sizes, each dataset
is resized to approximately match the same number
of pixels. We choose a standard size of 49,152 pixels, which corresponds to a shape of 192×256 pixels.
The Inception Score is computed on the full datasets
using a Pytorch implementation of the original IS al-
gorithm (Pytorch metrics, 2022). Results of the eval-
uation can be found in Table 2.
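A minimal sketch of this evaluation, assuming the get_inception_score helper exposed by the cited pytorch-gan-metrics package and a plain folder of images, could look as follows (exact function names may differ between versions of the package):

```python
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
# Assumed API of https://github.com/w86763777/pytorch-gan-metrics
from pytorch_gan_metrics import get_inception_score

# Resize every dataset to roughly the same pixel count (192x256 = 49,152 px)
# before scoring, since the Inception Score is sensitive to image size.
to_tensor = transforms.Compose([
    transforms.Resize((256, 192)),  # (height, width)
    transforms.ToTensor(),          # float tensor in [0, 1]
])

images = torch.stack([to_tensor(Image.open(p).convert("RGB"))
                      for p in sorted(Path("dataset/images").glob("*.png"))])
inception_score, is_std = get_inception_score(images)
print(f"IS = {inception_score:.3f} +/- {is_std:.3f}")
```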
Table 2: Evaluation of the image quality of the full dataset using Inception Score.

Dataset           | Inception Score (IS)
DeepFashion       | 4.247
Market            | 4.223
DriPE             | 1.481
Drive&Act         | 1.343
SVIRO             | 1.902
TICaM - synthetic | 1.276
TICaM - real      | 1.662
Ours              | 2.391
First, we observe in Table 2 that the two datasets
used for HPT, i.e., DeepFashion and Market, present
a score strictly higher than the one measured on
driver datasets. This can be explained by the fact
that the Inception Score reflects two aspects: the intrinsic quality of each image and the variety among the dataset (Salimans et al., 2016).
Figure 4: Samples from the test inferences generated by the GAN trained on our synthetic dataset.
Since the driver
datasets present fewer subjects with large and fixed
foregrounds, we can expect a lower IS. However, we
can see that our synthetic dataset obtains a better score
than the other driver datasets. This suggests that its
images have an apparent quality similar to those from
the other driver datasets while presenting a larger va-
riety.
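For reference, the Inception Score exponentiates the average divergence between the class distribution predicted by InceptionNet for each image and the marginal class distribution of the dataset, so both confident per-image predictions and a high variety of predicted classes increase the score:

```latex
\mathrm{IS} = \exp\Big( \mathbb{E}_{x \sim p_{\text{data}}} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)
```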
4.2 Human Pose Transfer
As mentioned in Section 1, training a model for a
complex task such as human pose transfer, without
heavily overfitting the training set, requires many im-
ages with a high variety of subjects.
Therefore, we train an HPT generative network
on our synthetic dataset to evaluate the diversity of
its images. We choose from the state of the art the APS architecture (Huang et al., 2020), which achieves competitive performance without requiring additional input data such as segmentation maps. We train the network using the scripts provided by the authors in their repository. We adopt the same hyperparameters used for training on the DeepFashion dataset and resize our synthetic images to 192×256 pixels to get
closer to the size of the DeepFashion images. The
proposed dataset is split into a training set of 180k
pictures and a testing set of 20k pictures, and these
two sets do not share any subject model.
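A simple way to obtain such a subject-disjoint split is to assign whole human models to one of the two sets before collecting their images; the sketch below assumes that the subject identifier can be recovered from each file name, which is an assumption about the dataset layout rather than the released format.

```python
import random
from collections import defaultdict


def split_by_subject(image_paths, subject_of, test_ratio=0.1, seed=0):
    """Split images so that train and test share no human model.

    image_paths: list of image identifiers or paths.
    subject_of:  callable returning the human-model id of an image.
    """
    by_subject = defaultdict(list)
    for path in image_paths:
        by_subject[subject_of(path)].append(path)

    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_ratio))

    train = [p for s in subjects[n_test:] for p in by_subject[s]]
    test = [p for s in subjects[:n_test] for p in by_subject[s]]
    return train, test


# Hypothetical naming scheme: "<subject>_<car>_<pose>.png".
train_set, test_set = split_by_subject(
    image_paths=["h042_car3_p017.png", "h007_car1_p121.png"],
    subject_of=lambda p: p.split("_")[0],
)
```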
To measure the quality of our results, we evaluate
the images using several state-of-the-art metrics (Ta-
ble 3): Inception Score (IS), Frechet Inception Dis-
tance (FID) (Heusel et al., 2017), and Structural Sim-
ilarity (SSIM) (Wang et al., 2004). FID and SSIM are
computed using the same script as IS (Pytorch met-
rics, 2022). Unlike the evaluation of the datasets in
Section 4.1, the metrics here are only computed on
the images generated by the network on the test set.
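As a reminder, the FID reported below compares the Gaussian statistics (mean and covariance) of InceptionNet features extracted from the real and generated images; lower values indicate closer distributions:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```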
Table 3: Evaluation of images generated by an APS network trained on different datasets.

Dataset   | IS    | FID   | SSIM
Fashion   | 3.565 | 16.84 | 0.669
Market    | 3.144 | 41.49 | 0.312
Synthetic | 2.456 | 38.06 | 0.810
First, we observe that the Inception Score of the
generated images is close to the one measured on
the full synthetic dataset in Table 2. Then, the FID
distance between the driver images generated by the
GAN and the ground truth images is close to the one
observed with the Market dataset. Furthermore, the
SSIM score, which measures the structural similar-
ity between two images, is higher on our synthetic
dataset than on both Fashion and Market. This can be
explained by the fact that more than half the surface of
driver images is composed of a fixed background that
the GAN network can easily preserve since it almost
does not change during the pose transfer.
We can notice that the scores measured on the Fashion dataset are better than those on both the Market
and our synthetic dataset. This can be explained by
the simplicity of the context of the Fashion images, in particular the plain background, the fully visible body parts, etc., in comparison with the real-life images in the two other datasets.
Finally, Figure 4 presents qualitative results of the
trained GAN. The generated images show that the
network learned to reproduce the pose while preserv-
ing most of the visual characteristics of the subject
and the global environment. This result indicates that
the network can learn and generalize on our dataset.
In the end, the evaluation results combined with the
qualitative results suggest that our dataset contains
enough diversity to train a network for a complex task
without overfitting.
5 CONCLUSION
In this paper, we have presented a dataset of 200k
synthetic driver images for human pose-related tasks
with a large diversity of human models to address the lack of available datasets for driver monitoring tasks.
Using state-of-the-art metrics, we demonstrated that
the quality of our synthetic images is comparable to
the one measured in existing datasets, synthetic or
real-world. We finally trained a GAN for human
pose transfer, a data-demanding task, on our synthetic
dataset. The network achieved performances similar to those of networks trained for HPT on real-world datasets for
other applications, which demonstrates that the pro-
posed synthetic dataset is diverse enough to train large
networks. This dataset is publicly available as well as
the script used to generate it.
Future work will investigate the problem of do-
main adaptation from synthetic to real-world driver
images in models for human pose-related tasks.
Moreover, the proposed pipeline could be used to ex-
tend our dataset with multiple views to approach tasks
such as 3D human pose estimation, or with real activ-
ities for passenger monitoring.
ACKNOWLEDGEMENTS
This work was supported by the Pack Ambition
Recherche 2019 funding of the French AURA Region
in the context of the AutoBehave project.
REFERENCES
Abtahi, S., Omidyeganeh, M., Shirmohammadi, S., and
Hariri, B. (2014). Yawdd: A yawning detection
dataset. In Proceedings of the 5th ACM multimedia
systems conference, pages 24–28.
Blender (2022). Blender. https://www.blender.org/. Ac-
cessed: 2022-11-01.
Borghi, G., Pini, S., Vezzani, R., and Cucchiara, R. (2020).
Mercury: a vision-based framework for driver moni-
toring. In International Conference on Intelligent Hu-
man Systems Integration, pages 104–110. Springer.
Canas, P. N., Ortega, J. D., Nieto, M., and Otaegui, O.
(2022). Virtual passengers for real car solutions: syn-
thetic datasets. arXiv preprint arXiv:2205.06556.
Cruz, S. D. D., Wasenmuller, O., Beise, H.-P., Stifter, T.,
and Stricker, D. (2020). Sviro: Synthetic vehicle in-
terior rear seat occupancy dataset and benchmark. In
Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision, pages 973–982.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical im-
age database. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 248–255. IEEE.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In Ghahra-
mani, Z., Welling, M., Cortes, C., Lawrence, N., and
Weinberger, K., editors, Advances in Neural Infor-
mation Processing Systems, volume 27. Curran Asso-
ciates, Inc.
Guesdon, R., Crispim-Junior, C., and Tougne, L. (2021).
Dripe: A dataset for human pose estimation in
real-world driving settings. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion Workshops, pages 2865–2874.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
Advances in neural information processing systems,
30.
Huang, S., Xiong, H., Cheng, Z.-Q., Wang, Q., Zhou, X.,
Wen, B., Huan, J., and Dou, D. (2020). Generating
person images with appearance-aware pose stylizer. In
IJCAI.
Jegham, I., Khalifa, A. B., Alouani, I., and Mahjoub, M. A.
(2019). Mdad: A multimodal and multiview in-
vehicle driver action dataset. In International Confer-
ence on Computer Analysis of Images and Patterns,
pages 518–529. Springer.
Juraev, S., Ghimire, A., Alikhanov, J., Kakani, V., and Kim,
H. (2022). Exploring human pose estimation and
the usage of synthetic data for elderly fall detection
in real-world surveillance. IEEE Access, 10:94249–
94261.
Katrolia, J. S., El-Sherif, A., Feld, H., Mirbach, B., Ram-
bach, J. R., and Stricker, D. (2021). Ticam: A time-of-
flight in-car cabin monitoring dataset. In 32nd British
Machine Vision Conference 2021, BMVC 2021, On-
line, November 22-25, 2021, page 277. BMVA Press.
Kim, T. S., Shim, B., Peven, M., Qiu, W., Yuille, A., and
Hager, G. D. (2022). Learning from synthetic vehi-
cles. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, pages 500–
508.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick,
R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L.,
and Dollár, P. (2015). Microsoft coco: Common objects in context.
Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. (2016).
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 1096–1104.
Maik, V., Paik, D., Lim, J., Park, K., and Paik, J. (2010).
Hierarchical pose classification based on human phys-
iology for behaviour analysis. Computer Vision, IET,
4:12 – 24.
MakeHuman (2022). Makehuman community. http://www.
makehumancommunity.org/. Accessed: 2022-11-01.
Martin, M., Roitberg, A., Haurilet, M., Horne, M., Reiß,
S., Voit, M., and Stiefelhagen, R. (2019). Drive&act:
A multi-modal dataset for fine-grained driver behavior
recognition in autonomous vehicles. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 2801–2810.
Ohn-Bar, E., Martin, S., Tawari, A., and Trivedi, M. M.
(2014). Head, eye, and hand patterns for driver activ-
ity recognition. In 2014 22nd international conference
on pattern recognition, pages 660–665. IEEE.
Pytorch metrics (2022). Pytorch implementation of com-
mon gan metrics. https://github.com/w86763777/
pytorch-gan-metrics. Accessed: 2022-11-01.
Ribeiro, R. F. and Costa, P. D. P. (2019). Driver gaze
zone dataset with depth data. In 2019 14th IEEE In-
ternational Conference on Automatic Face & Gesture
Recognition (FG 2019), pages 1–5.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,
Radford, A., and Chen, X. (2016). Improved tech-
niques for training gans. Advances in neural informa-
tion processing systems, 29.
Selim, M., Firintepe, A., Pagani, A., and Stricker, D.
(2020). Autopose: Large-scale automotive driver head
pose and gaze dataset with deep head orientation base-
line. In VISIGRAPP (4: VISAPP), pages 599–606.
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best
practices for convolutional neural networks applied to
visual document analysis. In Proceedings of the Sev-
enth International Conference on Document Analysis
and Recognition-Volume 2, page 958.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wo-
jna, Z. (2016). Rethinking the inception architecture
for computer vision. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 2818–2826.
Tang, H., Bai, S., Zhang, L., Torr, P. H., and Sebe, N.
(2020). Xinggan for person image generation. In Pro-
ceedings of the European conference on computer vi-
sion, pages 717–734.
Unity (2022). Unity asset store. https://assetstore.unity.
com/. Accessed: 2022-11-01.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P.
(2004). Image quality assessment: from error visi-
bility to structural similarity. IEEE transactions on
image processing, 13(4):600–612.
Wu, Y., Yuan, Y., and Wang, Q. (2022). Learning from
synthetic data for crowd instance segmentation in the
wild. In 2022 IEEE International Conference on Im-
age Processing (ICIP), pages 2391–2395. IEEE.
Zhang, J., Li, K., Lai, Y.-K., and Yang, J. (2021). Pise:
Person image synthesis and editing with decoupled
gan. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
7982–7990.
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian,
Q. (2015). Scalable person re-identification: A bench-
mark. In Proceedings of the IEEE/CVF International
Conference on Computer Vision.
Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., and Bai, X.
(2019). Progressive pose attention transfer for person
image generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 2347–2356.