Investigating Synthetic Data Sets for Crowd Counting in Cross-scene
Scenarios
Rita Delussu, Lorenzo Putzu and Giorgio Fumera
University of Cagliari, Piazza D’Armi, Cagliari, Italy
Department of Electrical and Electronic Engineering, Piazza D’Armi, 09123 Cagliari, Italy
Keywords:
Cross-scene, Crowd Analysis, Crowd Density Estimation, Synthetic Data Sets, Texture Features, Regression.
Abstract:
Crowd counting and density estimation are crucial functionalities in intelligent video surveillance systems but
are also very challenging computer vision tasks in scenarios characterised by dense crowds, due to scale and
perspective variations, overlapping and occlusions. Regression-based crowd counting models are used for
dense crowd scenes, where pedestrian detection is infeasible. We focus on real-world, cross-scene application
scenarios where no manually annotated images of the target scene are available for training regression models,
but only images with different backgrounds and camera views can be used (e.g., from publicly available data
sets), which can lead to low accuracy. To overcome this issue, we propose to build the training set using
synthetic images of the target scene, which can be automatically annotated with no manual effort. This work
provides a preliminary empirical evaluation of the effectiveness of the above solution. To this aim, we carry
out experiments using real data sets as the target scenes (testing set) and using different kinds of synthetically
generated crowd images of the target scenes as training data. Our results show that synthetic training images
can be effective, provided that not only their perspective but also their background closely reproduces that of the target scene.
1 INTRODUCTION
The use of computer vision tools to automatise crowd
monitoring and analysis tasks, or to support human
operators involved in such tasks, is becoming increas-
ingly relevant in many applications, such as crowd
behaviour analysis and intelligent video surveillance,
given the nowadays pervasive deployment of CCTV
systems. One prominent example is the use of CCTV
systems by Law Enforcement Agencies to monitor
and guarantee the security of mass gathering events,
which is one of the issues addressed by the EU H2020
LETSCROWD project (https://letscrowd.eu/) our research group is working on. In particular, crowd counting and density es-
timation are potentially very useful functionalities of
crowd monitoring systems. These are however very
challenging tasks in practical applications due to is-
sues such as illumination changes, severe occlusions caused by objects in the scene or by other people (see Fig. 1), and distortions due to the camera view,
which make the size and perspective of people in
the scene change considerably according to the distance from the camera.

Figure 1: Example of a static (left) and dynamic (right) occlusion, where a person is almost totally occluded by a palm and by another person, respectively. (Images taken from the Mall and PETS2009 data sets – see Sect. 3.3.)

In this work, we focus on
the most challenging scenario of dense crowd scenes
(see Fig. 2), where crowd counting and density esti-
mation cannot rely on pedestrian detection. For dense
crowd scenes, the so-called counting by regression
approach is widely used in the literature (Loy et al.,
2013). It consists of estimating the number of people
using a regression model trained on low-level image
features, using images of a crowd manually annotated with the number of people in the scene.

Figure 2: Examples of a non-crowded scene (top) and of a crowded scene (bottom). (Images taken from the UCSD and ShanghaiTech data sets – see Sect. 3.3.)

Recently,
several regression-based methods that use Convolu-
tional Neural Networks (CNNs) have been proposed
(Sindagi and Patel, 2017a); most of them directly
estimate the density map and then obtain the peo-
ple count from it. However, CNNs require massive
amounts of training data, which may not be avail-
able in real applications. Moreover, using the den-
sity map as the ground truth requires the position of
each pedestrian in the scene to be manually annotated
(Liu et al., 2019), which requires a much higher effort
than providing only the people count. Furthermore,
CNN-based approaches may require several seconds
to compute the density map for a single video frame,
which makes them not yet suitable for real-time ap-
plications. For all the reasons above, in this work we
focus on earlier regression-based approaches that do
not use CNNs.
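To make the annotation cost concrete, a density map ground truth is typically built by centring a small Gaussian kernel on each manually annotated head position, so that the map integrates to the people count. Below is a minimal sketch of this construction; the kernel width sigma is an assumption made for illustration, not a value taken from the cited works.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def density_map(head_points, shape, sigma=4.0):
        # head_points: manually annotated (row, col) head positions; this is
        # exactly the per-pedestrian labelling effort discussed above.
        impulses = np.zeros(shape, dtype=np.float64)
        for r, c in head_points:
            impulses[int(r), int(c)] += 1.0
        # Smoothing approximately preserves the total mass, so each person
        # contributes ~1 and the people count is roughly the sum of the map.
        return gaussian_filter(impulses, sigma=sigma)

    # people count recovered from the map:
    # count = density_map(points, (h, w)).sum()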
Two issues limit the performance of regression-based methods: the features typically used are significantly affected by the image background, and in some application scenarios it
may be unfeasible to collect and manually annotate
crowd images of the target scene to build a represen-
tative training set. In such a case one faces a chal-
lenging cross-scene scenario characterised by train-
ing images exhibiting a different perspective (camera view) and background from the target scene where the crowd counting system has to be deployed; as a consequence, its accuracy can be very low. This is the
application scenario we focus on in this work.
Inspired by some recent works that used synthetic
training images to improve the performance of crowd
counting approaches based on pedestrian or body part
detection (Courty et al., 2014; Schröder et al., 2018),
as well as in other computer vision tasks, in this
work we propose to address the issues of regression-
based crowd counting methods mentioned above by
building a training set of synthetic images of the tar-
get scene. This would make it possible, e.g., to create scene-
specific regression models on the fly for new camera
installations. This work aims to empirically investi-
gate whether and to what extent the above solution
can improve the cross-scene effectiveness of existing
regression-based crowd counting methods in dense
crowd scenes. To this aim, we carry out experiments
using three real, benchmark data sets of dense crowd
images as the target scenes (testing set) and using dif-
ferent kinds of synthetically generated crowd images
of the same target scenes as training data.
The remainder of the manuscript is structured
as follows. Related work is first summarised in
Sect. 2. Sect. 3 describes the features and the regres-
sion models considered, the real and the synthetic data
sets used in our experiments, and the accuracy mea-
sures used for performance evaluation. The experi-
mental set-up and the results are presented in Sect. 4.
Sect. 5 summarises the conclusions of this work.
2 RELATED WORK
Several crowd counting approaches have been pro-
posed so far. They can be divided into counting by
detection, counting by clustering and counting by re-
gression (Loy et al., 2013). Counting by detection is
based on pedestrian detection from still images, either
full-body detection (Leibe et al., 2005; Dollar et al., 2011) or body part detection (Lin et al., 2001; Tu et al., 2008).
The latter aims at overcoming occlusions by locating heads and shoulders (Ge
and Collins, 2009), as these are among the most im-
portant and most visible cues for typical camera loca-
tions (Tu et al., 2008). However, counting by detection
is effective only under very limited occlusion, i.e., for
non-crowded scenes (see Fig. 2, left), whereas it is
unable to deal with dense crowds (see Fig. 2, right).
Counting by clustering is based on people track-
ing, and assumes that coherent feature trajectories can
be grouped together to approximate the number of
people; this approach is ineffective as well on dense
crowd scenes (Loy et al., 2013).
The counting by regression approach, which we
focus on in this work, aims at mapping from low-
level image features to the number of people or to
the density map of a scene by supervised training of
a regression model. The earliest methods estimate
the number of people using holistic scene descriptors
such as foreground segments (Ma et al., 2004), edges (Kong et al., 2005), texture and gradients (Wu et al., 2006; Ojala et al., 2002), shape (Dong et al., 2007), intensity (Lempitsky and Zisserman, 2010) and motion (Benabbas et al., 2010). The main regression
models used in such methods are Linear Regression,
Random Forests and Support Vector Regression (Loy
et al., 2013). More recently, a variety of methods
based on CNNs, mostly regression-based, have been
proposed (Sindagi and Patel, 2017a). Such methods
estimate the density map directly and obtain the peo-
ple count from it. Most of them exploit specifically-
devised CNN architectures, although they often share
several layers with generic architectures and then fine-
tune them on a specific training set (Sindagi and Patel,
2017b).
As for all supervised methods, the effectiveness
of regression-based crowd counting methods relies on
a representative training set of crowd images manu-
ally annotated with the people count (or the density
map). However, in some application scenarios, it may
be unfeasible to collect and manually annotate crowd
images to build a training set representative of the
target scene where a crowd counting system has to
be deployed. If a training set made up of images of different scenes is used (e.g., taken from publicly available benchmark data sets), one faces a cross-scene sce-
nario where the mismatch in image background and
perspective (camera view) can severely affect the ac-
curacy of the resulting model. We shall provide clear
empirical evidence of this issue in Sect. 4. In fact, fea-
tures typically used by regression-based methods are
significantly affected by the image background. Ex-
isting background subtraction and segmentation ap-
proaches are not sufficient under frequent illumina-
tion changes or, even worse, when a background im-
age is not available. Moreover, regression models
are affected by perspective distortions (objects in the
background appear smaller than foreground ones).
Several data sets of crowd images have been col-
lected so far (see Sect. 3.3), which however exhibit
several limitations: most of them contain only non-
dense crowd scenes with limited occlusion, and are
also relatively small in size. This can further af-
fect the cross-scene performance of regression-based
models. Domain adaptation methods have been pro-
posed to mitigate the cross-scene issue (Change Loy
et al., 2013), but they nevertheless require images of
the target scene for fine-tuning, which is not feasible
in the application scenario considered in this work.
In some computer vision tasks, including crowd
counting by detection, the use of synthetic data sets
built using computer graphics tools has been proposed
to overcome the limits of data sets made up of real im-
ages. This solution can be potentially useful also for
regression-based crowd counting in the above appli-
cation scenario, since it would allow generating synthetic images of the target scene and automatically controlling every parameter of interest, such as the num-
ber and location of pedestrians and the scene perspec-
tive, background and illumination. A similar solution
has already been proposed in (Wang et al., 2019), but
in the context of a domain adaptation method which
requires crowd images of the target scene for fine-
tuning, which is not feasible in the application sce-
nario considered in this work.
3 EXPERIMENTAL SETTING
In this section, we describe the methods, the data sets
and experimental setting used in this work. Since we
focus on the counting by regression approach, we first
describe the features and regression models, that are
mainly based on the ones described in the surveys of
(Loy et al., 2013; Ryan et al., 2015).
3.1 Feature Extraction
Several kinds of features have been proposed so far
for regression-based crowd counting, and often dif-
ferent, complementary features are combined.
We consider here segment and edge features, which
are among the most common foreground ones, as well
as the Grey-Level Co-occurrence Matrix (GLCM)
and Local Binary Patterns (LBP) texture features.
Foreground features can be obtained through back-
ground subtraction. Segment features (Ma et al.,
2004) aim at capturing global properties of the im-
age regions, such as area and perimeter; edge features
(Kong et al., 2005) focus on complementary informa-
tion about local image properties, such as the number
of edge pixels and edge orientation. Texture features
analyse the spatial relationships among image pixels
(Loy et al., 2013) instead. In particular, the GLCM is a matrix computed by counting how often pairs of pixels with given grey values occur in a specified spatial relationship; several global statistics can be extracted from it (Haralick et al., 1973). The LBP descriptor characterises the local structure of the image, being rotation invariant and robust to grey-scale changes
(Ojala et al., 2002). All the above feature sets have been combined by simple concatenation into a single, stronger feature set.
3.2 Regression Models
Regression-based methods can be divided into global
and local (Loy et al., 2013). Global methods
learn a single regression function for the whole im-
age, whereas local methods subdivide an image into
patches and estimate the total people count by per-
forming regression on each patch. Local methods can handle scenes in which the density is not uniform over the image, but they can be too complex for real-world applications. For
this reason, we focus on global methods.
Several global regression models have been pro-
posed in the literature (Loy et al., 2013). The sim-
plest one is linear regression, which estimates the people count by finding a linear relationship with the D-dimensional feature vector that describes the image.
Since the feature space dimensionality D can be very high, partial least squares (PLS) regression has been used to address the issue of feature collinearity (Abdi, 2010).
Non-linear models have also been proposed for crowd
counting, such as Gaussian process regression (GPR).
However, its processing cost in the prediction phase is too high for real-time applications. For this reason, we do not consider it in this work, preferring other non-linear models: Random Forests (RF) (Criminisi et al., 2012), which have the advantage of being scalable and less sensitive to parameter selection, and Support Vector Regression (SVR) (Basak et al., 2007) with a radial basis function (RBF) kernel, one of the most popular regression models.
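For concreteness, the three global regression models can be instantiated, e.g., with scikit-learn as sketched below; the hyper-parameter values are assumptions for illustration, not the exact settings used in the experiments.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.svm import SVR
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    models = {
        # RF: scalable and relatively insensitive to parameter selection.
        'RF': RandomForestRegressor(n_estimators=100, random_state=0),
        # SVR with RBF kernel; features are standardised first, since SVR
        # is sensitive to feature scales.
        'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0)),
        # PLS copes with collinear, high-dimensional features by projecting
        # them onto a small number of latent components.
        'PLS': PLSRegression(n_components=10),
    }

    # X: (n_images, D) feature matrix from Sect. 3.1; y: per-image counts.
    # for name, model in models.items():
    #     model.fit(X_train, y_train)
    #     y_pred = model.predict(X_test)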
3.3 Real Data Sets
As stated above, in this work we focus on crowd
counting systems to be deployed on a specific target
scene (i.e., camera view), with a dense crowd. To our
knowledge, only three publicly available data sets can
be considered representative of dense crowd scenarios
(Zhang et al., 2016; Zhang et al., 2019; Sindagi and
Patel, 2017a), namely ShanghaiTech, UCF-QNRF
and World Expo Shanghai 2010. However, they do not contain a significant number of images from the same scene. To evaluate the effectiveness of a crowd counting system in our scenario, data sets containing whole videos (or at least a significant number of frames) of the same scene with a dense crowd are instead required. Unfortunately, very few publicly available data sets manually annotated with the crowd count exhibit all the above features.
For this reason, we used three real data sets of non-dense crowd scenes (fewer than 60 persons per image): Mall, UCSD and PETS. Mall has 2000
frames collected using a publicly accessible camera in
a shopping mall (Loy et al., 2013). It contains several
crowd densities from 13 to 53 people per frame (on
average 31) for a total of 62,325 pedestrians. This is a
challenging data set with severe perspective distortion
and several occlusions caused by static objects or by
other people. UCSD contains a total of 70 videos ac-
quired from a low-resolution camera (frames of size
238×158) installed in a pedestrian walkway at the
UCSD campus (Chan et al., 2008). It contains a to-
tal of 49,885 pedestrians, with an average number of
people per frame of around 25. For these data sets, we used the same setting as in (Loy et al., 2013), where 2000 frames were extracted from the original videos.
PETS2009 was built to test several visual surveillance tasks (Ferryman and Shahrokni, 2009), including people counting (the S1 part). It is a multi-view data set in which each sequence has been acquired with a different camera, but the frames belonging to the same camera view can be used to create a single-scene data set. To this aim, we used only the images from the first camera, with the ground truth provided in (Zhang and Chan, 2019).
Fig. 3 shows some examples of frames from all the
above data sets.
3.4 Synthetic Data Sets
Building and evaluating crowd counting models in
real-world application scenarios is a very complex
task, especially when no annotated images of the tar-
get scene are available to train regression models.
Synthetic images can be very useful to this aim, since the number of pedestrians shown in such images is exactly known and no manual counting is required.
Synthetic data sets are therefore a potentially useful
solution to build the training set of a regression model.
To this aim, in this work we evaluated the effective-
ness of using as a training set synthetic crowd images
which reproduce the target scene (the background, the
perspective or both) of the testing set, where the latter
is obtained from a real data set.
We collected three different data sets of synthetic images. The first two have been created using a commercial crowd modelling and simulation software developed by a member of the LETSCROWD consortium (Crowd Dynamics, https://www.crowddynamics.com/), which is based on the computer graphics engine Unity (https://unity.com/).
Figure 3: Examples of frames from the real data sets: (a) Mall, (b) UCSD, (c) PETS2009.

Figure 4: Examples of frames from our CG-sets: (a), (b) uniform background; (c) PETS2009-like background.

Figure 5: Examples of frames from our PNG-sets, using backgrounds from the real data sets: (a) Mall, (b) UCSD, (c) PETS2009. (Images are slightly blurred in this figure for privacy reasons.)
This software allows the user to generate synthetic crowd scenes by choosing
a background image, the number of pedestrians and
their paths. In the first data set, we reproduced the
camera views and perspectives of the three real data
sets, but not the corresponding people flow and crowd
size, nor the background. We collected in total 15000
frames from videos of pedestrians walking in differ-
ent directions on a uniform background; the number
of pedestrians in each frame ranges from 1 to 1000.
Since the features used to train global regression
models are significantly affected by the image back-
ground, in the second set of synthetic images we also tried to simulate the background of the target, real test-
ing scenes. However, this turned out to be much more
complex, and therefore we only reproduced the sin-
gle view extracted from the PETS2009 real data set.
Fig. 4 shows some frames of the first two synthetic data sets above, obtained with the computer graphics tool; from now on they will be denoted as CG-sets.
In the third data set of synthetic images, we sim-
ulated the real testing scene by superimposing pedes-
trian images onto a real background image of that scene,
to increase the realism of synthetic images. First, for
each real (testing) data set we picked an image with no pedestrians (if available), or otherwise obtained a background image by background subtraction; then we de-
fined the region of interest (ROI) and the perspective
map. The perspective map is necessary to estimate
the pedestrian scales (Zhang et al., 2015), and is com-
puted by linear interpolation from the measurements
of several pedestrians randomly selected in the im-
ages, assuming that all adult pedestrians have a standard height. This map allows one to easily com-
pute the height in pixels of a pedestrian in each lo-
cation of the ROI. Then we collected from the web a
set of images with a transparent background (in PNG
format) showing a single person. The synthetic im-
ages were finally obtained by superimposing a given
number of randomly selected pedestrian images over
the background image at randomly chosen locations
in the ROI, by resizing pedestrian images according
to the perspective map. Overlapping between pedes-
trians was controlled by drawing first the one farthest
from the camera. For each target (test) scene, we
built 1000 synthetic images containing a number of
pedestrians ranging from 1 to 100. Fig. 5 shows some
examples of these images, which from now on will be denoted as PNG-sets. We point out that this latter kind of synthetic image may present neither realistic perspective distortion nor realistic human poses. Never-
theless, such images respect the scale and the camera
view, which are the most relevant features for the con-
sidered task; this is also a straightforward procedure
to create synthetic images that reproduce a given tar-
get scene. In particular, it also allows each image to be automatically annotated with the number of pedestrians it contains.
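The following is a simplified sketch of this compositing procedure using Pillow; the linear perspective model and all names are illustrative assumptions, not the exact implementation used in this work.

    import random
    from PIL import Image

    def person_height_px(y, y_far, h_far, y_near, h_near):
        # Linear perspective map: pedestrian height in pixels interpolated
        # between two reference measurements at image rows y_far and y_near.
        t = (y - y_far) / (y_near - y_far)
        return h_far + t * (h_near - h_far)

    def make_synthetic_frame(background, cutouts, n_people, roi,
                             y_far, h_far, y_near, h_near):
        # background: RGB image of the empty target scene;
        # cutouts: RGBA images, each showing a single person;
        # roi: (x0, y0, x1, y1) region of interest in image coordinates.
        frame = background.copy()
        x0, y0, x1, y1 = roi
        # Draw far-to-near (smaller row y first) so that pedestrians closer
        # to the camera are pasted last and correctly occlude farther ones.
        points = sorted((random.uniform(y0, y1), random.uniform(x0, x1))
                        for _ in range(n_people))
        for y, x in points:
            person = random.choice(cutouts)
            h = max(1, int(person_height_px(y, y_far, h_far, y_near, h_near)))
            w = max(1, int(person.width * h / person.height))
            scaled = person.resize((w, h))
            # The RGBA cutout doubles as its own paste mask (alpha channel),
            # anchored at the feet (x, y).
            frame.paste(scaled, (int(x - w / 2), int(y - h)), scaled)
        return frame

    # e.g.: bg = Image.open('pets_background.png').convert('RGB')
    #       frame = make_synthetic_frame(bg, cutouts, 50, roi, 10, 20, 150, 90)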
3.5 Performance Measures
We evaluated crowd counting accuracy using two
common metrics that are defined over a single im-
age: the absolute error (AE) and the root squared er-
ror (RSE). We shall report their average values com-
puted across all testing images of a given target scene,
i.e., the mean absolute error (MAE) and the root mean
squared error (RMSE):
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \eta_i - \hat{\eta}_i \right| ,    (1)

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \eta_i - \hat{\eta}_i \right)^2 } ,    (2)

where N is the total number of testing images, \eta_i is the ground truth (pedestrian count) and \hat{\eta}_i is the estimated pedestrian count for the i-th frame. The
MAE computes the average absolute difference be-
tween the actual and the estimated count. The RMSE, as a result of squaring each difference, instead penalises large errors more heavily than small ones. We point out that in many recent works (Liu
et al., 2019; Wang et al., 2019; Ryan et al., 2015) the
metric RMSE was used, but it was called MSE; how-
ever, the MSE metric does not include the square root
(Loy et al., 2013). We prefer to use the RMSE metric
to be aligned with the most recent works in this area,
and also because it is expressed in the same unit of
measurement as MAE.
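Equivalently, in code, the two metrics are a direct NumPy transcription of Eqs. (1) and (2):

    import numpy as np

    def mae(y_true, y_pred):
        # Eq. (1): mean absolute difference between actual and estimated counts.
        return np.mean(np.abs(y_true - y_pred))

    def rmse(y_true, y_pred):
        # Eq. (2): the square root keeps the metric in the same unit as the MAE.
        return np.sqrt(np.mean((y_true - y_pred) ** 2))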
Table 1: Crowd counting accuracy (MAE and RMSE) in
a same-scene scenario (training and testing images come
from the same scene) using different regression models (RF,
SVR and PLS), on each real data set (Mall, UCSD and
PETS).
Data set       RF             SVR            PLS
             MAE    RMSE    MAE    RMSE    MAE    RMSE
Mall         3.66   4.53    5.53   6.98    3.03   3.83
UCSD         2.85   3.45    6.79   8.35    2.32   2.82
PETS         7.51   9.49    8.66  10.34    5.46   7.10
4 EXPERIMENTS
We performed three different experiments whose goal
is to verify if and to what extent using synthetic im-
ages of the target scene in the training set can pro-
vide an effective, scene-specific regression model for
people counting tasks in dense crowd scenes. To
create a baseline for comparison, in the first exper-
iment we evaluated the performance on each of the
three real data sets of Sect. 3.3 by using in the train-
ing and test set only real images from the same data
set. To average out the influence of training/testing data splits, we performed a 10-fold cross-validation, where the training and test sets are made up of disjoint subsets containing respectively 80% and 20% of the images. The results are reported in Table 1.
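A sketch of this same-scene protocol is given below; scikit-learn's ShuffleSplit is one assumed way to obtain ten disjoint 80%/20% train/test splits, and the variable names are illustrative.

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    def same_scene_baseline(model, X, y, n_splits=10, seed=0):
        # X, y: NumPy feature matrix and counts of a single real data set.
        splits = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
        maes, rmses = [], []
        for tr, te in splits.split(X):
            model.fit(X[tr], y[tr])
            pred = np.ravel(model.predict(X[te]))
            maes.append(np.mean(np.abs(y[te] - pred)))
            rmses.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
        # Average MAE and RMSE over the splits, as reported in Table 1.
        return np.mean(maes), np.mean(rmses)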
The second experiment simulates a real-world cross-scene application scenario,
e.g., when a camera is installed in a new place and
no annotated crowd images of the same target scene
are available for training a scene-specific regression
model; in this case the regression model has to be
trained using a previously available data set, which may not be representative of the target scene. To this
aim we trained a regression model on training data
belonging to one or two of the three real data sets,
and tested it on the other data set, which is consid-
ered as the target scene. Our aim is to evaluate the
amount of performance decrease with respect to the
use of training data from the same target scene, as in
the first experiment above. The results are reported
in Table 2. As can be expected, a comparison between Tables 1 and 2 shows that the performance on the target scene always decreases when training data come from different scenes, which can be attributed to the differences in background and perspective.
We also point out that using the perspective correc-
tion described in (Loy et al., 2013) we observed a fur-
ther performance decrease: this may be because the
perspective maps used are not invariant to scale.
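For clarity, a minimal sketch of this cross-scene protocol follows (names are illustrative; the feature matrices are built as in Sect. 3.1):

    import numpy as np

    def cross_scene_eval(model, X_src, y_src, X_tgt, y_tgt):
        # Train on one or two source scenes, test on the whole target scene.
        model.fit(X_src, y_src)
        pred = np.ravel(model.predict(X_tgt))
        mae = np.mean(np.abs(y_tgt - pred))
        rmse = np.sqrt(np.mean((y_tgt - pred) ** 2))
        return mae, rmse

    # e.g., the 'UCSD-PETS' row of Table 2 (training on the union of UCSD
    # and PETS, testing on Mall):
    # cross_scene_eval(model, np.vstack([X_ucsd, X_pets]),
    #                  np.hstack([y_ucsd, y_pets]), X_mall, y_mall)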
The third experiment aims at evaluating whether
using synthetic training images of the target scene im-
proves the crowd counting accuracy with respect to
Table 2: Crowd counting accuracy (MAE and RMSE) in
a cross-scene scenario (training and testing images come
from different scenes) using different regression models
(RF, SVR and PLS). Each group of three rows corresponds
to a single real data set used as the target scene (Mall, UCSD
and PETS); the rows in each group correspond to the other
two data sets used for training, either alone or together.
Train Set        RF             SVR            PLS
               MAE    RMSE    MAE    RMSE    MAE    RMSE
Mall (target)
  UCSD        17.16  18.51    6.39   8.09   15.57  16.75
  PETS        16.26  17.21    8.87  10.71   19.38  20.08
  UCSD-PETS   18.32  19.47    6.89   8.65   14.49  15.83
UCSD (target)
  Mall        10.92  12.49    9.44  10.62   10.53  13.05
  PETS        12.39  14.02   20.97  21.42   50.76  51.37
  Mall-PETS   13.08  14.86   19.97  20.23   19.78  19.99
PETS (target)
  Mall        18.26  20.97   13.27  15.51   23.84  26.29
  UCSD        10.11  12.79   10.75  12.46   10.36  13.11
  Mall-UCSD   13.77  15.82   12.18  15.08   16.51  18.52
Table 3: Crowd counting accuracy attained by training the
regression models RF, SVR and PLS on synthetic images
reproducing the target (testing) scene. Synthetic images
come from the CG-set: ‘Blank-CG’ and ‘PETS-CG’ denote
respectively images with no background, and images which
reproduce the PETS data set background.
Test    Train          RF             SVR            PLS
                     MAE    RMSE    MAE    RMSE    MAE    RMSE
Mall    Blank-CG     5.74   7.16    5.52   6.97    5.62   7.00
UCSD    Blank-CG     5.90   7.47    6.99   8.37    7.24  12.17
PETS    Blank-CG     9.25  11.66    9.31  10.90    7.56   9.39
PETS    PETS-CG      7.39   9.28    8.55  10.31    7.62   9.24
the ones attained by using real training images from
different scenes. The set-up is similar to the second
experiment, with the only difference that, for each real
data set used as the target scene (testing set) each re-
gression model is now trained using one of the data
sets of synthetic images (see Sect. 3.4) that reproduce
the same target scene. The results are reported in
Table 3 for CG-set images and in Table 4 for PNG-
set images. As expected, the crowd counting performance on the target scene is better when training data come from synthetic images of the target scene than from real images of a different scene (cf. Table 2). This mainly depends on the background and perspective, which can differ considerably between two real scenes, whereas synthetic images can reproduce both with a certain accuracy.
Table 4: Crowd counting accuracy attained by training the
regression models RF, SVR and PLS on synthetic (PNG-
set) images reproducing the target (testing) scene.
Test    Train          RF             SVR            PLS
                     MAE    RMSE    MAE    RMSE    MAE    RMSE
Mall    Mall-PNG     4.73   5.84    5.99   7.63    5.65   7.14
UCSD    UCSD-PNG     4.08   4.95   11.04  13.35    2.76   3.50
PETS    PETS-PNG     7.42   9.07   14.77  17.94    8.11  10.31
Nevertheless, the performance of the regression models exploiting synthetic images is still lower than that of models exploiting real images of the same scene (reported in Table 1), even if the gap becomes smaller as the synthetic images become more realistic (accurate). This is also confirmed by comparing the re-
sults in Tables 3 and 4: the regression models trained with the PNG-set images achieved better performance than those trained with the CG-set images. In general, the presence of
the real scene background in the synthetic images is
crucial to create strong regression models. Indeed,
even using the CG-set that reproduces the PETS data set background, performance improves compared to the images with a blank background. This
shows that fully synthetic images, too, could be effective to some extent, but they would require a background and a perspective view closer to those of the target scene.
5 CONCLUSIONS
In this work, we considered a challenging, cross-
scene application scenario for crowd counting in
dense crowd scenes, where no manually annotated
images of the target scene are available for training a
regression-based model. To improve the accuracy of
regression models under this scenario, we proposed to
build the training set using synthetic images of the tar-
get crowd scene, characterised by the same perspec-
tive (view) and possibly by the same background. We
empirically evaluated the effectiveness of this solu-
tion using several real data sets of crowd images as the
target scenes (testing set) and different kinds of syn-
thetically generated images of the target scene as the
training set. Preliminary results provide evidence that
using synthetic images can be an effective solution,
provided that they closely reproduce the background
and the camera perspective of the target scene.
The use of synthetic images may also be useful
in the same application scenario considered here for
regression-based methods that use CNNs, which have
not been considered in this work due to their currently too high processing time at inference. Accordingly,
our ongoing efforts are devoted to investigating the
performance of CNN-based crowd counting methods
using synthetic images for training or fine-tuning.
ACKNOWLEDGEMENTS
This work has been partially supported by the
LETSCROWD project, funded by the European
Union Horizon 2020 research and innovation pro-
gramme under grant agreement No 740466, and
by the project ODIS - Optimisation of DIstributed
systems in the Smart-city and smart-grid settings,
CUP:F72F16003170002, funded by Fondazione di
Sardegna. The authors also thank the LETSCROWD
partner Crowd Dynamics for providing some of the
synthetic data sets.
REFERENCES
Abdi, H. (2010). Partial least squares regression and pro-
jection on latent structure regression (pls regression).
Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):97–106.
Basak, D., Pal, S., and Patranabis, D. C. (2007). Support
vector regression. Neural Information Processing-
Letters and Reviews, 11(10):203–224.
Benabbas, Y., Ihaddadene, N., Yahiaoui, T., Urruty, T., and
Djeraba, C. (2010). Spatio-temporal optical flow anal-
ysis for people counting. In AVSS, pages 212–217.
Chan, A. B., Liang, Z.-S. J., and Vasconcelos, N. (2008).
Privacy preserving crowd monitoring: Counting peo-
ple without people models or tracking. In CVPR,
pages 1–7.
Change Loy, C., Gong, S., and Xiang, T. (2013). From
semi-supervised to transfer counting of crowds. In
ICCV, pages 2256–2263.
Courty, N., Allain, P., Creusot, C., and Corpetti, T. (2014).
Using the agoraset dataset: Assessing for the quality
of crowd video analysis methods. Pattern Recognition
Letters, 44:161–170.
Criminisi, A., Shotton, J., Konukoglu, E., et al. (2012). De-
cision forests: A unified framework for classification,
regression, density estimation, manifold learning and
semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2–3):81–227.
Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2011).
Pedestrian detection: An evaluation of the state of the
art. IEEE Trans. on PAMI, 34(4):743–761.
Dong, L., Parameswaran, V., Ramesh, V., and Zoghlami, I.
(2007). Fast crowd segmentation using shape index-
ing. In ICCV, pages 1–8.
Ferryman, J. and Shahrokni, A. (2009). Pets2009: Dataset
and challenge. In PETS, pages 1–6.
Ge, W. and Collins, R. T. (2009). Marked point processes
for crowd counting. In CVPR, pages 2913–2920.
Haralick, R. M., Shanmugam, K., et al. (1973). Textural
features for image classification. IEEE Trans. on Systems, Man, and Cybernetics, SMC-3(6):610–621.
Kong, D., Gray, D., and Tao, H. (2005). Counting pedes-
trians in crowds using viewpoint invariant training. In
BMVC, pages 1–10.
Leibe, B., Seemann, E., and Schiele, B. (2005). Pedestrian
detection in crowded scenes. In CVPR, pages 878–
885.
Lempitsky, V. and Zisserman, A. (2010). Learning to count
objects in images. In Advances in neural information
processing systems, pages 1324–1332.
Lin, S.-F., Chen, J.-Y., and Chao, H.-X. (2001). Estimation
of number of people in crowded scenes using perspec-
tive transformation. IEEE Trans. on Systems, Man,
and Cybernetics, 31(6):645–654.
Liu, X., Van De Weijer, J., and Bagdanov, A. D. (2019).
Exploiting unlabeled data in cnns by self-supervised
learning to rank. Trans. on PAMI, 41(8):1862–1878.
Loy, C. C., Chen, K., Gong, S., and Xiang, T. (2013).
Crowd counting and profiling: Methodology and eval-
uation. In Modeling, simulation and visual analysis of
crowds, pages 347–382. Springer.
Ma, R., Li, L., Huang, W., and Tian, Q. (2004). On
pixel count based crowd density estimation for visual
surveillance. In CIS, pages 170–173.
Ojala, T., Pietikäinen, M., and Mäenpää, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on PAMI, 24(7):971–987.
Ryan, D., Denman, S., Sridharan, S., and Fookes, C. (2015).
An evaluation of crowd counting methods, features
and regression models. Computer Vision and Image
Understanding, 130:1–17.
Schröder, G., Senst, T., Bochinski, E., and Sikora, T. (2018).
Optical flow dataset and benchmark for visual crowd
analysis. In AVSS, pages 1–6.
Sindagi, V. and Patel, V. M. (2017a). A survey of recent
advances in cnn-based single image crowd counting
and density estimation. Pattern Recognition Letters,
107:3–16.
Sindagi, V. A. and Patel, V. M. (2017b). Cnn-based cas-
caded multi-task learning of high-level prior and den-
sity estimation for crowd counting. In AVSS, pages
1–6.
Tu, P., Sebastian, T., Doretto, G., Krahnstoever, N.,
Rittscher, J., and Yu, T. (2008). Unified crowd seg-
mentation. In ECCV, pages 691–704. Springer.
Wang, Q., Gao, J., Lin, W., and Yuan, Y. (2019). Learning
from synthetic data for crowd counting in the wild. In
CVPR, pages 8198–8207.
Wu, X., Liang, G., Lee, K. K., and Xu, Y. (2006). Crowd
density estimation using texture analysis and learning.
In ROBIO, pages 214–219.
Zhang, C., Li, H., Wang, X., and Yang, X. (2015). Cross-
scene crowd counting via deep convolutional neural
networks. In CVPR, pages 833–841.
Zhang, Q. and Chan, A. B. (2019). Wide-area crowd count-
ing via ground-plane density maps and multi-view fu-
sion cnns. In CVPR, pages 8297–8306.
Zhang, Y., Zhou, C., Chang, F., and Kot, A. C. (2019). A
scale adaptive network for crowd counting. Neuro-
computing, 362:139–146.
Zhang, Y., Zhou, D., Chen, S., Gao, S., and Ma, Y. (2016).
Single-image crowd counting via multi-column con-
volutional neural network. In CVPR, pages 589–597.