Synthetic Data Generation for Deep Learning in Counting Pedestrians

Hadi Keivan Ekbatani, Oriol Pujol and Santi Segui

Faculty of Mathematics and Computer Science, University of Barcelona,

Gran Via de les Corts Catalanes, 585 08007 Barcelona, Spain

hadi.keivan.ekbatani@stu.ﬁb.upc.edu, oriol pujol@ub.edu, santi.segui@ub.edu

Keywords:

Synthetic Data Generation, Deep Convolutional Neural Network, Deep Learning, Computer Vision.

Abstract:

One of the main limitations of the application of Deep Learning (DL) algorithms is when dealing with prob-

lems with small data. One workaround to this issue is the use of synthetic data generators. In this framework,

we explore the beneﬁts of synthetic data generation as a surrogate for the lack of large data when applying DL

algorithms. In this paper, we propose a problem of learning to count the number of pedestrians using synthetic

images as a substitute for real images. To this end, we introduce an algorithm to create synthetic images for

being fed to a designed Deep Convolutional Neural Network (DCNN) to learn from. The model is capable of

accurately counting the number of individuals in a real scene.

1 INTRODUCTION

Counting the number of objects in still images or

video frames is a new approach towards dealing with

detecting or learning objects which has been recently

proffered in the literature (Rabaud and Belongie,

2006), (Kong et al., 2005). Previously, in order to

count the objects of interest in an image or video, var-

ious object features needed to be designed, extracted

or detected during the learning phase which restrict

their usage in large-scale computer vision applica-

tions thus demanding more efﬁcient solutions to al-

leviate, expedite and improve this process.

One of the recent and commonly used methods to

facilitate feature detection process is the application

of Deep Convolutional Neural Networks (DCNN)

(Krizhevsky et al., 2012), (LeCun and Bengio, 2005),

(Szegedy et al., 2015). One of the promises of DCNN

is replacing handcrafted features with efﬁcient algo-

rithms for feature learning and hierarchical feature ex-

traction (Song and Lee, 2013). DCNNs have been

claimed and practically proven to achieve the most

assuring performance in different vision benchmark

problems concerning feature detection and classiﬁca-

tion (Ciregan et al., 2012), (Szegedy et al., 2015).

Although access to fast computers and vast

amounts of data has enabled the advances of deep

learning algorithms such as DCNN in solving many

problems that were not solvable using classic AI, they

have limitations. For instance, they do not perform

well when there is limited data (Grifﬁn et al., 2007).

This constrain restricts the application of DL meth-

ods in various areas including computer vision where

they have shown promising performances. As one so-

lution to tackle this issue, we introduce a synthetic

data generator algorithm to create images highly-

representative of the real images.

In this article, we tackle a crowd counting prob-

lem by the means of synthetic images and deep con-

volutional neural network. We generate a set of

highly realistic, synthetically generated images to be

fed to a proposed convolution-based deep architec-

ture. DCNN are well-suited for learning object fea-

tures from the scratch and in a hierarchical approach.

The proposed architecture consist of convolutional

network to capture discriminative information about

the object we are willing to count, following by fully

connect layers where we count the multiplicity of ob-

ject of interest. Figure 1 illustrates the proposal at a

glance. The input instances contain a random set of

pedestrians in a walkway. As it’s shown in below,

our goal is to learn to count the number of people in

synthetic images and thereby, accurately predict the

number of pedestrians in similar but real images.

Our contributions are as follows: We introduce a

synthetic image generation algorithm in order to sub-

stitute the lacking training data in a fully supervised

learning problem casted as learning to count the num-

ber of pedestrians in a walkway. Moreover, we pro-

pose a DCNN capable of learning pedestrians’ fea-

tures. Then, we validate our approach in a similar but

real scenario. We test our proposed model which has

been trained on synthetic images, on real images to

see if synthetic data generation can be incorporated

318

Ekbatani, H., Pujol, O. and Segui, S.

Synthetic Data Generation for Deep Learning in Counting Pedestrians.

DOI: 10.5220/0006119203180323

In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2017), pages 318-323

ISBN: 978-989-758-222-6

Figure 1: A schematic of our proposal. In this paper, we show that by creating realistic synthetic images, we are able to train

a DCNN that is able to count the number of pedestrians in similar but real images.

as a surrogate for replacing small training sets when

applying deep architectures.

2 BACKGROUND AND RELATED

WORKS

2.1 Synthetic Data Generation

The main purpose of generating synthetic datasets has

been to protect the privacy and conﬁdentiality of the

actual data (Phua et al., 2010), (Yao et al., 2013),

since it does not hold any personal information and

cannot be traced back by any individual. Problems

such as fraud detection (Phua et al., 2010), or health

care (Yao et al., 2013), are normally tackled by the

use of synthetic data. However, most of the previ-

ously mentioned approaches towards synthetic data

generation would not be applicable when it comes

to synthetic image generation. This is due to the

fact that standard methods such as Probability Den-

sity Function (PDF) or Interpolation operate element-

wise. The need for generating and synthesizing im-

ages using object-wise operations led researchers to

the use image processing tools for creating synthetic

images to tackle vision problems.

In computer vision, usage of synthetic images has

a longstanding history, as in 2000, Cappelli et al.

in (Cappelli et al., 2000) presented an approach to

synthetic ﬁngerprint generation on the basis of some

mathematical models that describe the main features

of real ﬁnger prints. More recently, after the success

of deep convolutional neural networks in various vi-

sion tasks concerning object detection or classiﬁca-

tion, generation and use of synthetic datasets has been

frequently considered. For example, in (Eggert et al.,

2015), synthetic images are generated to be fed to a

DCNN in order to learn how to detect company logo

in the absence of a large training set.

Moreover, as one of the most recent approaches,

Segui et al. in (Segu

ı et al., 2015) proposed synthetic

data generation to counter lack of data issue for learn-

ing to count the number of objects in images using

deep convolutional neural networks. In their work,

they took advantage of existent unlabeled and labeled

datasets to generate synthetic images representative of

the actual images. The authors introduce two count-

ing problems, counting number of even-digits in im-

ages, and counting the amount of pedestrians in a

walkway.

2.2 Crowd Counting

Learning to count the objects of interest in an im-

age can be approached from two different perspec-

tives: either training an object detector, or training an

object counter. In the ﬁeld of object detection, nu-

merous works have been previously proposed (Kong

et al., 2005), (Marana et al., 1998). Furthermore,

Wu and Nevatia in (Wu and Nevatia, 2005) proposed

edgelet features (an edgelet is a short segment of line

or curve) as a new type of silhouette-oriented features

to deal with the problem of detecting individuals in

crowded still images.

As a similar line of work in the course of object

counting and more speciﬁcally crowd counting, in

(Leibe et al., 2007) and (Rabaud and Belongie, 2006),

different object tracking approaches were taken to de-

tect and count moving objects in the scene. However,

most of object tracking approaches met with skepti-

cism by society, given the perception of infringing in-

dividuals’ privacy rights.

More recently, in (Chan et al., 2008), Chan et al.

presented a novel approach with no explicit object

segmentation or tracking to estimate the number of

people moving in each direction (towards and away

from camera) in a privacy-preserving manner.

On the other hand, in case of feature learning,

Segui et al. in (Segu

ı et al., 2015) proposed a novel

approach for counting objects representations using

deep object features. In their work, objects’ features

Synthetic Data Generation for Deep Learning in Counting Pedestrians

319

Figure 2: An illustration of image generation process at different steps.

are learned by a counting DCNN and are used to un-

derstand the underlying representation. Contrary to

the previous approaches, their proposal is the ﬁrst one

where counting problem is handled by learning deep

features. Additionally, no hints on the object of inter-

est was given besides its’ occurrence multiplicity.

3 SYNTHETIC IMAGE

GENERATION

The main hypothesis of this work is that synthetic data

generation algorithms can be used as a workaround

for problems with no or little training sets. On this

course, we propose an algorithm for creating highly

realistic synthetic images of pedestrians in a walkway.

We used UCSD unlabeled Anomaly detection dataset

of pedestrians collected by Chan et al. and used in

(Mahadevan et al., 2010) and (Chan et al., 2009).

UCSD Anomaly detection dataset contains clips of

groups of people walking towards and away from the

camera, and consists of 34 training video samples and

36 testing video samples. Each video has 200 frames

of each 238 × 158 pixels.

3.1 Image Generation

In our dataset, we employed all 70 training and test-

ing video samples to generate the synthetic pedestrian

dataset. We constrained each image by having up to

29 pedestrians in the walkway. The process of gener-

ating the data includes the following steps while ﬁg-

ure 2 illustrates this process.

1. Background Extraction. Firstly, we simply sub-

tract the background from each video frame and

from there, we extract the median backgrounds of

each video (in total, 70 different backgrounds).

2. Pedestrian Extraction. Subtracting each image

from the mean background, we are able to label

the connected regions (each individual in case of

our images) using morphological labeling meth-

ods.

3. Background Generation. In this step, we try to

make the backgrounds of images as realistic as

possible by:

• making a sparse combination of median back-

grounds.

• changing the global illumination of the images

randomly.

• adding some random Gaussian noise to the

backgrounds.

4. Region Of Interest (ROI). Then, for training and

comparison purposes, images are masked with a

ﬁlter of Region Of Interest (ROI).

5. Creating Synthetic Images. Afterwards, pedes-

trians are added to the masked background in a

way that the center of each person is placed in-

side white area of the mask. Finally images are

normalized (between 0 and 255) and resized to

158 × 158 in order to be fed to convolution lay-

ers.

3.2 Image Improvement

Although we managed to successfully create syn-

thetic images of people in the street, the generated

images were still quite distinguishable from the real

dataset. Thus, in order to make images as highly real-

istic as possible, we improved the dataset as explained

underneath. Figure 3 depicts this procedure.

1. Remove Non-pedestrians. Amongst the ex-

tracted pedestrians, there were some non-

pedestrians with objects instead of pedestrians,

and yet others with more than one person. There-

fore, we manually removed these outliers. After

this edition, we ended with 426 samples of peo-

ple.

2. Lack of Pedestrians. For the sake of general-

ization, we needed a decent variety of pedestrians

in the images to train with. For this purpose, we

created 2 versions of current pedestrians list, each

darkened by the factor of 20% from each other.

3. Halos Around the Pedestrians. Due to lack of

accuracy of the region measuring method, a ﬁne

layer of the background that pedestrians were ex-

tracted from, still remained around the pedestri-

ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods

320

Figure 3: An illustration of each step of image improvement process.

ans. In the created images, depending on where

the person was placed, these thin layers appeared

like a halo around the person. We used morpho-

logical erosion on pedestrians’ masks and also

Poisson image editing to remove the halos.

4. Image Perspective. Finally, Since pedestrians

of different sizes were put randomly in the im-

ages, we considered peoples tallness perspective

in the images. Humans height almost follows a

Gaussian distribution (Subramanian et al., 2011).

Therefore, with respect to (Subramanian et al.,

2011), we mapped individuals heights with the

length of the walkway in the image, considering

a Gaussian noise with mean µ = 0 and σ = 3.5.

4 EXPERIMENTS AND RESULTS

For learning to count the number of pedestrians in a

walkway, we synthetically generated a set of 1 million

images of size 158 × 158 with up to 29 pedestrians in

each image. Maximum overlapping was considered

in the creation of the images. We divided this dataset

into a training set of 800k images and 200k images

for validation set. To test our model, we used UCSD

crowd counting dataset with 3375 manually labeled

images of pedestrians. The selected UCSD images

contain from 11 to 29 pedestrians in each image.

We designed a seven layers architecture DCNN

with four convolutional layers and three fully con-

nected layers. The architecture is shown in Table 1.

Table 1: Proposed DCNN for counting pedestrians.

Convolutions Fully connects

10 × 15 × 15 & x2 pooling 128

10 × 11 × 11 & x2 pooling 64

20 × 9 × 9 1

20 × 5 × 5

The algorithm is trained using the Caffe pack-

age[11] on a GPU NVIDIA Tesla K40. The network

has been set to 400,000 iterations. The output layer is

conﬁgured as a classiﬁcation problem.

On the validation set, the performance of the

model is 0.70 mean absolute error and 0.94 mean

squared error. This results improve the achieved re-

sults in a similar experiment done by (Segu

ı et al.,

2015) (the comparison is shown in table 2). On the

other hand, on the real test set, we obtained 1.38 mean

absolute error and 3.61 mean squared error which

closely follow the results in (Chan et al., 2008) which

was obtained by hand-crafting highly specialized im-

age features that are dependent on the object class.

This comparison is depicted in table 3 The confusion

matrix regarding the model performance is illustrated

in ﬁgure 4. As you may notice, due to the inevitable

differences between real and synthetic samples, the

model mostly over-predicts. Moreover, as the number

of pedestrians increases in the images, the prediction

accuracy of the model decreases.

Table 2: Performance comparison on the synthetic data be-

tween our proposal and related work in (Segu

ı et al., 2015).

Experiments MSE MAE

Our approach (29 peds) 0.942 0.707

(Segu

ı et al., 2015) (25 peds) 1.12 0.74

Table 3: Performance comparison on the real data between

our proposal and related work in (Chan et al., 2008).

Experiments MSE MAE

Proposed method 3.61 1.38

(Chan et al., 2008) approach 2.73 1.24

As you may observe in table 2, in case of synthetic

images, although our images contain more pedestri-

ans, our results beat the previous approach in (Segu

et al., 2015). This proves the improvement we made

in synthetic data generation process and the designed

deep architecture.

Respectively, in case of real images, although we

could not improve the work done in (Chan et al.,

2008), our results follows their results closely. We

Synthetic Data Generation for Deep Learning in Counting Pedestrians

321

Figure 4: Confusion matrix regarding the model perfor-

mance on the real test set. The starting point of the graph is

11 since the minimum amount of pedestrians in the real test

set is 11.

should mention that Chan et.al experiment in (Chan

et al., 2008) was done by hand-crafting highly spe-

cialized features and exhaustive labeling. This results

approve the suitability of synthetic data as a surrogate

for the small real data when using DCNN.

5 CONCLUSIONS

In this paper we explore the beneﬁts of synthetic data

generation for the application of deep convolutional

neural networks for a crowd counting problem with

small training set. We propose an algorithm for cre-

ating a highly realistic synthetic dataset of pedestri-

ans in a walkway to train the proposed DCNN with.

Moreover, we provide a system trained with synthetic

images capable of predicting the number of pedestri-

ans in an image to a satisfactory extent. The obtained

results suggest the incorporation of synthetic data as

a well-suited surrogate for the missing real along with

alleviating required exhaustive labeling.

There are still many open questions to be ad-

dressed such as, when and to what extent synthetic

images are applicable as a substitute to solve real

world problems. which is the best network architec-

ture for counting the crowd?

ACKNOWLEDGEMENTS

This work has been partially funded by the Spanish

MINECO Grants TIN2013-43478-P and TIN2012-

38187- C03. We gratefully acknowledge the support

of NVIDIA Corporation with the donation of a Tesla

K40 GPU used for this research.

REFERENCES

Cappelli, R., Erol, A., Maio, D., and Maltoni, D. (2000).

Synthetic ﬁngerprint-image generation. In Pattern

Recognition, 2000. Proceedings. 15th International

Conference on. IEEE.

Chan, A. B., Liang, Z.-S. J., and Vasconcelos, N. (2008).

Privacy preserving crowd monitoring: Counting peo-

ple without people models or tracking. In Computer

Vision and Pattern Recognition, 2008. CVPR 2008.

IEEE Conference on. IEEE.

Chan, A. B., Morrow, M., and Vasconcelos, N. (2009).

Analysis of crowded scenes using holistic properties.

In Performance Evaluation of Tracking and Surveil-

lance workshop at CVPR.

Ciregan, D., Meier, U., and Schmidhuber, J. (2012). Multi-

column deep neural networks for image classiﬁcation.

In Computer Vision and Pattern Recognition (CVPR),

2012 IEEE Conference on. IEEE.

Eggert, C., Winschel, A., and Lienhart, R. (2015). On the

beneﬁt of synthetic data for company logo detection.

In Proceedings of the 23rd ACM international confer-

ence on Multimedia. ACM.

Grifﬁn, G., Holub, A., and Perona, P. (2007). Caltech-256

object category dataset. California Institute of Tech-

nology.

Kong, D., Gray, D., and Tao, H. (2005). Counting pedes-

trians in crowds using viewpoint invariant training. In

BMVC. Citeseer.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems.

LeCun, Y. and Bengio, Y. (2005). Convolutional networks

for images, speech, and time series. In BMVC. Cite-

seer.

Leibe, B., Schindler, K., and Van Gool, L. (2007). Coupled

detection and trajectory estimation for multi-object

tracking. In 2007 IEEE 11th International Conference

on Computer Vision. IEEE.

Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N.

(2010). Anomaly detection in crowded scenes. In

CVPR.

Marana, A., Costa, L. d. F., Lotufo, R., and Velastin, S.

(1998). On the efﬁcacy of texture analysis for crowd

monitoring. In Computer Graphics, Image Process-

ing, and Vision, 1998. Proceedings. SIBGRAPI’98. In-

ternational Symposium on. IEEE.

Phua, C., Lee, V., Smith, K., and Gayler, R. (2010). A

comprehensive survey of data mining-based fraud de-

tection research. In arXiv preprint arXiv:1009.6119.

Rabaud, V. and Belongie (2006). Counting crowded moving

objects. In 2006 IEEE Computer Society Conference

on Computer Vision and Pattern Recognition. IEEE.

Segu

ı, S., Pujol, O., and Vitria, J. (2015). Learning to count

with deep object features. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recogni-

tion Workshops.

ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods

322

Song, H. A. and Lee, S.-Y. (2013). Hierarchical representa-

tion using nmf. In International Conference on Neural

Information Processing. Springer.

Subramanian, S.,

Ozaltin, E., and Finlay, J. E. (2011).

Height of nations: a socioeconomic analysis of cohort

differences and patterns among women in 54 low-to

middle-income countries. In PLoS One. Public Li-

brary of Science.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,

Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-

novich, A. (2015). Going deeper with convolutions.

In Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition.

Wu, B. and Nevatia, R. (2005). Detection of multi-

ple, partially occluded humans in a single image by

bayesian combination of edgelet part detectors. In

Tenth IEEE International Conference on Computer

Vision (ICCV’05) Volume 1. IEEE.

Yao, W., Basu, S., Wei-Nchih, L., and Singhal, S. (2013).

Synthetic healthcare data generation. Google Patents.

Synthetic Data Generation for Deep Learning in Counting Pedestrians

323