Domain Adaptation for Traffic Density Estimation

Luca Ciampi¹ (https://orcid.org/0000-0002-6985-0439), Carlos Santiago² (https://orcid.org/0000-0002-4737-0020), João Paulo Costeira² (https://orcid.org/0000-0001-6769-2935), Claudio Gennaro¹ (https://orcid.org/0000-0002-3715-149X) and Giuseppe Amato¹ (https://orcid.org/0000-0003-0171-4315)

¹ Institute of Information Science and Technologies, National Research Council, Pisa, Italy
² Instituto Superior Técnico (LARSyS/IST), Lisbon, Portugal
Keywords: Unsupervised Domain Adaptation, Domain Adaptation, Synthetic Datasets, Deep Learning, Deep Learning for Visual Understanding, Counting Vehicles, Traffic Density Estimation, Convolutional Neural Networks.
Abstract: Convolutional Neural Networks have produced state-of-the-art results for a multitude of computer vision tasks under supervised learning. However, the crux of these methods is the need for a massive amount of labeled data to guarantee that they generalize well to diverse testing scenarios. In many real-world applications, there is indeed a large domain shift between the distributions of the train (source) and test (target) domains, leading to a significant drop in performance at inference time. Unsupervised Domain Adaptation (UDA) is a class of techniques that aims to mitigate this drawback without the need for labeled data in the target domain. This makes it particularly useful for tasks in which acquiring new labeled data is very expensive, such as semantic and instance segmentation. In this work, we propose an end-to-end CNN-based UDA algorithm for traffic density estimation and counting, based on adversarial learning in the output space. Density estimation is one of those tasks requiring per-pixel annotated labels and, therefore, a great deal of human effort. We conduct experiments considering different types of domain shifts, and we make publicly available two new datasets for the vehicle counting task that were also used in our tests. One of them, the Grand Traffic Auto dataset, is a synthetic collection of images, obtained using the graphical engine of the Grand Theft Auto video game and automatically annotated with precise per-pixel labels. Experiments show a significant improvement using our UDA algorithm compared to the model's performance without domain adaptation. The code, the models, and the datasets are freely available at https://ciampluca.github.io/unsupervised_counting.
1 INTRODUCTION
With the advent of Convolutional Neural Networks (CNNs) (Lecun et al., 1998), supervised learning has reached excellent results across many Computer Vision application areas, such as object detection (Redmon and Farhadi, 2018) and instance segmentation (He et al., 2017). However, most CNN-based methods require a large amount of labeled data and make a common assumption: the training and testing data are drawn from the same distribution. The direct transfer of learned features between different domains does not work well because the distributions differ. Thus, a model trained on one domain, named source, usually experiences a drastic drop in performance when applied to another domain, named target. This problem is commonly referred to as Domain Shift (Torralba and Efros, 2011).
Domain Adaptation is a common technique to ad-
dress this problem. It adapts a trained neural network
by fine-tuning it with a new set of labeled data belong-
ing to the new distribution. However, in many real
cases, gathering a further collection of labeled data
is expensive, especially for tasks that imply per-pixel
annotations, like semantic or instance segmentation.
Unsupervised Domain Adaptation (UDA) ad-
dresses the domain shift problem differently. It does
not use labeled data from the target domain and relies
only on supervision in the source domain. Specifi-
cally, UDA takes a source labeled dataset and a target
unlabeled one. The challenge here is to automatically
infer some knowledge from the target data to reduce
the gap between the two domains.
Figure 1: Example of an image with the bounding box annotations (left) and the corresponding density map that sums up to the counting value (right; 25 vehicles in this example).
In this work, we consider the counting task, de-
fined as estimating the number of object instances
in still images or video frames (Lempitsky and Zis-
serman, 2010), which has recently attracted signif-
icant attention in the Computer Vision community.
Specifically, we consider the vehicle counting sce-
nario, where the task is to estimate the number of
vehicles occurring in streets, roads, or parking lots.
Most current systems address the counting task as
a supervised learning process, relying on regression
techniques to estimate a pixel-based density map from
the image. The final count is obtained by summing all
pixel values (Lempitsky and Zisserman, 2010). Fig-
ure 1 illustrates this approach.
We propose an end-to-end CNN-based UDA al-
gorithm for traffic density estimation and counting,
based on adversarial learning. Adversarial learning
is performed directly on the generated density maps,
i.e., in the output space, given that in this specific
case, the output space contains valuable information
such as scene layout and context. We focus on vehicle counting, but the approach is suitable for counting any other type of object. To the best of our knowledge, we are the first to introduce a UDA scheme for counting to reduce the gap between the source and the target domains without using additional labels.
We conducted experiments considering different
types of domain shifts and validating our approach
on various vehicle counting datasets. First, we employed two existing datasets for traffic density estimation, WebCamT (Zhang et al., 2017a) and TRANCOS (Guerrero-Gómez-Olmedo et al., 2015). To emphasize the domain shift problem, we used images acquired by a specific subset of cameras as the source domain, while the target domain consists of images captured by a different subset of cameras, with different perspectives and visual contexts. We call this type of domain shift Camera2Camera. Comparisons with other techniques on these datasets show the superiority of our approach.
In order to test our technique with further types of
domain shifts, we created and made publicly available
the two additional datasets described in the following.
The NDISPark - Night and Day Instance Segmented Park dataset consists of images taken from surveillance cameras in a parking lot. Here, the source data include annotated images collected by various cameras during the day, while the unlabeled target domain contains images collected, in the same scenarios, at night. We call this domain shift Day2Night.
The GTA - Grand Traffic Auto dataset is a vast collection of synthetic images generated with the highly photo-realistic graphical engine of the Grand Theft Auto V video game, developed by Rockstar North. It consists of urban traffic scenes, automatically and precisely annotated with per-pixel labels. To the best of our knowledge, it is the first synthetic instance segmentation dataset of traffic scenarios. We use this dataset to train the counting algorithm and then perform domain adaptation to be able to count in real images. In this case, the domain shift is represented by the Synthetic2Real difference.
Figure 2 summarizes the described domain shifts
that we have addressed.
In all the experiments, we show that our UDA technique consistently outperforms the corresponding non-domain-adapted models.
Contributions of this work can be summarized as follows:

• We introduce a UDA algorithm for traffic density estimation and counting, which can reduce the domain gap between a labeled source dataset and an unlabeled target one. To the best of our knowledge, this is the first time that UDA is applied to counting.

• We create and make publicly available two new datasets, both with instance segmentation annotations. One is manually annotated and consists of images of parked cars collected during the day and at night. The second is a synthetic collection of images taken from a photo-realistic graphical engine, where the per-pixel annotations are automatically created.

• We conduct extensive experiments taking into account three different types of domain shifts and validating our technique on various vehicle counting datasets, demonstrating a significant improvement using our UDA algorithm compared to the model's performance without domain adaptation.
Figure 2: The Domain Shift scenarios addressed in this work: (a) Day2Night; (b) and (c) Camera2Camera; (d) Synthetic2Real. The first row represents the labeled source domain, while the second represents the unlabeled target one used for our unsupervised domain adaptation.
2 RELATED WORK
This section reviews previous work related to Unsupervised Domain Adaptation and the counting task.
2.1 Unsupervised Domain Adaptation
Traditional UDA approaches have been developed to
address the problem of image classification, and they
try to align features across the two domains ((Ganin
and Lempitsky, 2015), (Tzeng et al., 2017)). How-
ever, as pointed out in (Zhang et al., 2017b), they do
not perform well in other tasks.
More recent advances also involve the semantic
segmentation task. In this case, adversarial training
for UDA is the most employed approach. It includes
two networks. The first predicts the segmentation
maps for the input source image. The second acts
as a discriminator, taking the feature maps from the
segmentation network and trying to predict the input
domain. The adversarial loss, computed from the discriminator output, pushes the distributions of the two domains to become more similar. The first to apply such a technique were Hoffman et al. (2016). More recently, the work proposed in (Hong et al., 2018) employs a residual network and adversarial training to make the source feature maps closer to the target ones. The authors of (Chen et al., 2019) combine semantic segmentation and depth estimation to boost the adaptation performance, jointly feeding the segmentation and the depth prediction maps to the discriminator.
Another interesting work that inspired this paper is
(Tsai et al., 2018), where the authors applied adver-
sarial training to the output space taking advantage of
the structural consistency across domains.
A very appealing application of domain adapta-
tion concerns synthetic data, which has led to the
development of several synthetic datasets, such as
ViPeD ((Amato et al., 2019), (Ciampi et al., 2020a))
for pedestrian detection and SYNTHIA (Ros et al.,
2016) for semantic segmentation and autonomous
driving applications. In this case, the algorithm is
trained using these synthetic images and applied over
real images. The domain adaptation algorithm is in
charge of filling the gap between the two worlds.
2.2 The Counting Task
Following the taxonomy adopted in (Sindagi and Pa-
tel, 2018), we can broadly classify existing counting
approaches into two categories: counting by detection
and counting by regression. Counting by detection is
a supervised technique where we localize instances
of the objects, and then we count them. Some rele-
vant works present in the literature are (Ciampi et al.,
2018), (Amato et al., 2019), (Amato et al., 2018),
(Aich and Stavness, 2018), (Laradji et al., 2018). Instead, counting by regression (Lempitsky and Zisserman, 2010) is a supervised learning approach that tries to establish a direct mapping (linear or not) from the image features to the number of objects present in the scene or to a corresponding density map (i.e., a continuous-valued function), skipping the challenging task of detecting individual object instances.
Regression techniques have shown superior per-
formance in crowded scenarios where the objects’
instances are sometimes not clearly visible due to
occlusions, and they have been applied to a multi-
tude of situations. The first work that employed a
pure CNN to estimate the density and count people
in crowded contexts is presented by (Boominathan
et al., 2016). A more efficient structure is proposed by (Zhang et al., 2016), introducing a Multi-Column CNN-based architecture (MCNN) for crowd counting. A similar idea is developed by (Oñoro-Rubio and López-Sastre, 2016) with a scale-aware, multi-column counting model named Hydra-CNN, able to estimate traffic densities in congested scenes. More recently, the authors of (Li et al., 2018) introduced CSRNet, a CNN-based algorithm that uses dilated kernels to deliver larger receptive fields and to replace pooling operations. We employ this network as the baseline in our work, and we briefly review its architecture in the next sections.
The main limitations of these approaches are due
to the scarcity of data. As a result, existing methods
often suffer from overfitting, which leads to performance degradation when transferring them to other scenes. Besides, there is another inherent problem: the labels of these datasets are not very accurate. Most of the existing datasets are dot-annotated. Consequently, the ground truth density maps are just an approximation in which the objects' sizes are estimated using some heuristics. This work addresses both problems by proposing an unsupervised domain adaptation technique that exploits unlabeled data and by introducing two new datasets with per-pixel annotations that allow the creation of precise ground truth density maps. To the best of our knowledge, this work is the first to apply UDA to the counting task, extending the very preliminary results obtained in (Ciampi et al., 2020b), where a similar approach was exploited in just one limited scenario.
3 DATASETS
As mentioned before, to prove our approach’s va-
lidity, we performed experiments on various vehi-
cle counting datasets, offering different domain shift
characteristics. Specifically, we exploited two existing datasets for traffic density estimation: WebCamT (Zhang et al., 2017a) and TRANCOS (Guerrero-Gómez-Olmedo et al., 2015). Then, we used two additional datasets that we created on purpose and made publicly available: the NDISPark - Night and Day Instance Segmented Park dataset and the GTA - Grand Traffic Auto dataset. Figure 3 shows some images belonging to these datasets, together with the associated labels and the corresponding density maps generated for the counting task. In the next sections, we describe each of them in more detail.
3.1 WebCamT Dataset
The WebCamT dataset is a collection of traffic scenes recorded using city cameras, introduced by (Zhang et al., 2017a). It is particularly challenging to analyze due to the low resolution (352 × 240), high occlusion, and large perspective variations. We consider a total of about 40,000 images belonging to 10 different cameras and, consequently, having different views. We employ the existing bounding box annotations of the dataset to generate ground truth density maps. In particular, we place one Gaussian kernel for each vehicle present in the scene, with µ equal to the center of the bounding box surrounding the vehicle and σ proportional to its size. We used this dataset to test performance under the Camera2Camera domain shift introduced in Section 1.
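To make this construction concrete, the sketch below rasterizes bounding-box annotations into a ground truth density map. It is an illustrative recipe, not our exact pipeline: the function name and the proportionality factor for σ are our own assumptions.

```python
import numpy as np

def density_from_boxes(boxes, h, w):
    # Build a ground truth density map from bounding boxes.
    # Each box (x1, y1, x2, y2) contributes a 2D Gaussian centered on the
    # box center, with sigma proportional to the box size (the 0.3 factor
    # is an illustrative assumption), normalized so that every vehicle
    # adds exactly 1 to the total sum.
    density = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sigma = 0.3 * max(x2 - x1, y2 - y1)
        kernel = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        density += kernel / kernel.sum()  # each vehicle sums to 1
    return density

# Sanity check: the map integrates to the number of vehicles,
# so density_from_boxes(boxes, h, w).sum() is approximately len(boxes).
```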
3.2 TRANCOS Dataset
The TRANCOS dataset is a public dataset containing 1244 dot-annotated images of different congested traffic scenes captured by surveillance cameras, introduced by (Guerrero-Gómez-Olmedo et al., 2015). The approximated ground truth density maps are generated by placing one Gaussian kernel for each dot present in the scene, with a value of σ empirically chosen by the authors. They also provide the regions of interest (ROIs) for each image. We used this dataset to test performance under the Camera2Camera domain shift mentioned in Section 1.
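Dot annotations admit an analogous construction: place a unit impulse at every dot and blur with the fixed, empirically chosen σ. A minimal sketch (the σ value here is a placeholder, not the one used by the dataset authors):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_from_dots(dots, h, w, sigma=10.0):
    # Approximate density map from dot annotations (one dot per vehicle).
    # A single fixed sigma is used for all objects, so the map only
    # approximates the true object extents.
    impulses = np.zeros((h, w), dtype=np.float32)
    for x, y in dots:
        impulses[int(y), int(x)] += 1.0
    # Gaussian blurring preserves the total mass:
    # the result still sums to len(dots).
    return gaussian_filter(impulses, sigma=sigma)
```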
3.3 NDISPark Dataset
The NDISPark - Night and Day Instance Segmented
Park dataset was created by us on purpose and made
publicly available. It is a small, manually annotated
dataset for counting cars in parking lots, consisting
of about 250 images. This dataset is challenging and
describes most of the problematic situations that we
can find in a real scenario: seven different cameras
capture the images under various weather conditions
and viewing angles. Another challenging aspect is
the presence of partial occlusion patterns in many
scenes such as obstacles (trees, lampposts, other cars)
and shadowed cars. Furthermore, it is worth noting that images are taken during the day and at night, showing utterly different lighting conditions, and that, unlike most counting datasets, the NDISPark dataset is precisely annotated with instance segmentation labels, allowing us to generate accurate ground truth density maps for the counting task, since the size of the vehicles is well known. We employed this dataset to test performance under the Day2Night domain shift explained in Section 1.

Figure 3: Some examples taken from the four datasets used in this work: (a) Images; (b) Labels; (c) Density Maps generated from the labels. Each row corresponds to a specific dataset: from top to bottom, the NDISPark - Night and Day Instance Segmented Park and the GTA - Grand Traffic Auto datasets introduced in this work, the WebCamT dataset (Zhang et al., 2017a), and the TRANCOS dataset (Guerrero-Gómez-Olmedo et al., 2015). Note that the density maps generated for our datasets are accurate, since we start from instance segmentation annotations. Also notice that, in the case of the GTA - Grand Traffic Auto dataset, annotations are automatically generated without human effort.
3.4 GTA Dataset
The GTA - Grand Traffic Auto dataset was also created
by us on purpose and made publicly available. It is a
vast collection of about 15,000 synthetic images of
urban traffic scenes collected using the highly photo-
realistic graphical engine of the GTA V - Grand Theft
Auto V video game. About half of them concern ur-
ban city areas, while the remaining involve sub-urban
areas and highways. To generate this dataset, we de-
signed a framework that automatically and precisely
annotates the vehicles present in the scene with per-
pixel annotations. To the best of our knowledge, this
is the first instance segmentation synthetic dataset of
city traffic scenarios. As in the NDISPark dataset, the instance segmentation labels allow us to produce accurate ground truth density maps for the counting task, since the size of the vehicles is well known. We exploited this dataset to test performance under the Synthetic2Real domain shift introduced in Section 1.
4 PROPOSED METHOD
Our method relies on a CNN model trained end-to-
end with adversarial learning in the output space (i.e.,
the density maps), which contains rich information
such as scene layout and context. The peculiarity of
our adversarial learning scheme is that it forces the
predicted density maps in the target domain to have
local similarities with the ones in the source domain.
Figure 4 depicts the proposed framework consist-
ing of two modules: 1) a CNN that predicts traffic
density maps, from which we estimate the number of
vehicles in the scene, and 2) a discriminator that iden-
tifies whether a density map (received by the density
map estimator) was generated from an image of the
source domain or the target domain.
In the training phase, the density map predictor
learns to map images to densities based on annotated
data from the source domain. At the same time, it
learns to predict realistic density maps for the target
domain by trying to fool the discriminator with an ad-
versarial loss. The discriminator’s output is a pixel-
wise classification of a low-resolution map, as illus-
trated in Figure 4, where each pixel corresponds to
a small region in the density map. Consequently, the
output space is forced to be locally similar for both the
source and target domains. In the inference phase, the
discriminator is discarded, and only the density map
predictor is used for the target images. We describe
each module and how it is trained in the following
subsections.
4.1 Density Estimation Network
We formulate the counting task as a density map es-
timation problem (Lempitsky and Zisserman, 2010).
The density (intensity) of each pixel in the map de-
pends on its proximity to a vehicle centroid and the
size of the vehicle in the image so that each vehicle
contributes with a total value of 1 to the map. There-
fore, it provides statistical information about the vehi-
cles’ location and allows the counting to be estimated
by summing of all density values.
This task is performed by a CNN-based model whose goal is to automatically determine the vehicle density map associated with a given input image. Formally, the density map estimator $\Psi : \mathbb{R}^{C \times H \times W} \mapsto \mathbb{R}^{H \times W}$ transforms a $W \times H$ input image $I$ with $C$ channels into a density map $D = \Psi(I) \in \mathbb{R}^{H \times W}$.
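In code, counting therefore reduces to integrating the estimator's output. A minimal sketch of this interface (the helper function is ours; Ψ can be any network respecting the above input/output contract):

```python
import torch

@torch.no_grad()
def count_vehicles(psi: torch.nn.Module, image: torch.Tensor) -> float:
    # psi maps a batched C x H x W image to a density map in which each
    # vehicle contributes a total mass of 1, so the estimated count is
    # simply the sum of the map.
    density = psi(image.unsqueeze(0))  # add the batch dimension
    return density.sum().item()
```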
4.2 Discriminator Network
The discriminator network, denoted by Θ, also con-
sists of a CNN model. It takes as input the density
map, D, estimated by the network Ψ. Its output is
a lower resolution probability map where each pixel
represents the probability that the corresponding re-
gion (from the input density map) comes either from
the source or the target domain. The goal of the dis-
criminator is to learn to distinguish between density
maps belonging to source or target domains. Through
an adversarial loss, this discriminator will, in turn,
force the density estimator to provide density maps
with similar distributions in both domains. In other
words, the target domain density maps have to look
realistic, even though the network Ψ was not trained
with an annotated training set from that domain.
4.3 Domain Adaptation Learning
The proposed framework is trained based on an alter-
nate optimization of the density estimation network,
Ψ, and the discriminator network, Θ. Regarding the
former, the training process relies on two compo-
nents: 1) density estimation using pairs of images
and ground truth density maps, which we assume are
only available in the source domain; and 2) adversar-
ial training, which aims to make the discriminator fail
to distinguish between the source and target domains.
As for the latter, images from both domains are used
to train the discriminator on correctly classifying each
pixel of the probability map as either source or target.
To implement the above training procedure, we
use two loss functions: one is employed in the first
step of the algorithm to train network Ψ, and the other
is used in the second step to train the discriminator Θ.
These loss functions are detailed next.
Network Ψ Training. We formulate the loss function for Ψ as the sum of two main components:

$\mathcal{L}(I_S, I_T) = \mathcal{L}_{density}(I_S) + \lambda_{adv} \, \mathcal{L}_{adv}(I_T)$,   (1)

where $\mathcal{L}_{density}$ is the loss computed using the ground truth annotations available in the source domain, while $\mathcal{L}_{adv}$ is the adversarial loss responsible for bringing the distributions of the target and the source domains closer to each other. In particular, we define the density loss $\mathcal{L}_{density}$ as the mean squared error between the predicted and ground truth density maps, i.e., $\mathcal{L}_{density} = \mathrm{MSE}(D_S, D_S^{GT})$.
To compute the adversarial loss $\mathcal{L}_{adv}$, we first forward the images belonging to the target domain through the network Ψ to generate the predicted density maps $D_T$. Then, we forward $D_T$ through the network Θ to generate the probability map $P = \Theta(\Psi(I_T)) \in [0, 1]^{H' \times W'}$, where $H' < H$ and $W' < W$. The adversarial loss is given by

$\mathcal{L}_{adv}(I_T) = -\sum_{h,w} \log(P_{h,w})$,   (2)

where the subscript $h, w$ denotes a pixel in $P$. This loss makes the distribution of $D_T$ closer to $D_S$ by forcing Ψ to fool the discriminator, through the maximization of the probability of $D_T$ being locally classified as belonging to the source domain.

Figure 4: Algorithm overview. Given $C \times H \times W$ images from the source and target domains, we pass them through the density map estimation network to obtain output predictions. A density loss is computed for source predictions based on the ground truth. In order to improve target predictions, a discriminator is used to locally classify whether a density map belongs to the source or the target domain. Then, an adversarial loss is computed on the target prediction and is back-propagated to the density map estimation and counting network.
Network Θ Training. Given an image $I$ and the corresponding predicted density map $D$, we feed $D$ as input to the fully-convolutional discriminator Θ to obtain the probability map $P$. The discriminator is trained by comparing $P$ with the ground truth label map $Y \in \{0, 1\}^{H' \times W'}$ using a pixel-wise binary cross-entropy loss

$\mathcal{L}_{disc}(I) = -\sum_{h,w} \left[ (1 - Y_{h,w}) \log(1 - P_{h,w}) + Y_{h,w} \log(P_{h,w}) \right]$,   (3)

where $Y_{h,w} = 0 \;\forall h, w$ if $I$ is taken from the target domain and $Y_{h,w} = 1$ otherwise.
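A condensed sketch of one such alternating step is shown below. This is a non-authoritative rendition of the procedure: the variable names, the sigmoid placement, and the use of a mean rather than a sum over pixels are our choices.

```python
import torch
import torch.nn.functional as F

def train_step(psi, theta, opt_psi, opt_theta, img_s, gt_s, img_t, lambda_adv):
    # Step 1: update the density estimator psi (Eq. 1).
    opt_psi.zero_grad()
    loss_density = F.mse_loss(psi(img_s), gt_s)    # L_density on source pairs
    p_t = torch.sigmoid(theta(psi(img_t)))         # probability map on target
    loss_adv = -torch.log(p_t + 1e-8).mean()       # Eq. 2: try to fool theta
    (loss_density + lambda_adv * loss_adv).backward()
    opt_psi.step()

    # Step 2: update the discriminator theta (Eq. 3).
    opt_theta.zero_grad()
    p_s = torch.sigmoid(theta(psi(img_s).detach()))  # source maps, label 1
    p_t = torch.sigmoid(theta(psi(img_t).detach()))  # target maps, label 0
    loss_disc = -(torch.log(p_s + 1e-8).mean()
                  + torch.log(1.0 - p_t + 1e-8).mean())
    loss_disc.backward()
    opt_theta.step()
```

Detaching the estimator's outputs in the second step ensures that the discriminator update does not propagate gradients back into Ψ; both optimizers can be plain Adam instances, as detailed in Section 5.1.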
5 EXPERIMENTAL RESULTS
5.1 Implementation Details
Density Map Estimation and Counting Network.
We build our density map estimation network based on the Congested Scene Recognition Network (CSRNet) (Li et al., 2018). Here we briefly review some of the features characterizing this algorithm. CSRNet provides a CNN-based method that can understand highly congested scenes and perform accurate density estimation and counting. It is composed of two major components. The authors use the well-known VGG-16 network (Simonyan and Zisserman, 2014) as the front-end for 2D feature extraction because of its strong transfer learning ability. The back-end, instead, consists of dilated convolutional layers: the basic idea of dilated convolutions is to deliver larger receptive fields while replacing pooling operations, since max pooling is responsible for a loss of quality in the generated density maps. As the VGG output is reduced to 1/8 of the original input size, we up-sample the final output to compare it with the ground truth density map.
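A rough sketch of this front-end/back-end structure follows; the layer counts and dilation rates mirror CSRNet's published configuration, but this is our simplified rendition, not the reference implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16

class CSRNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        # Front-end: the first 10 conv layers of VGG-16 (3 max-pools,
        # so features are 1/8 of the input resolution).
        self.frontend = nn.Sequential(*list(vgg16(pretrained=True).features)[:23])
        # Back-end: dilated convolutions enlarge the receptive field
        # without any further pooling.
        layers, in_ch = [], 512
        for out_ch in (512, 512, 512, 256, 128, 64):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.backend = nn.Sequential(*layers)
        self.output = nn.Conv2d(64, 1, kernel_size=1)  # 1-channel density map

    def forward(self, x):
        d = self.output(self.backend(self.frontend(x)))
        # Up-sample back to the input resolution to compare with the
        # full-size ground truth density map.
        return nn.functional.interpolate(d, scale_factor=8, mode='bilinear',
                                         align_corners=False)
```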
Discriminator. We use a Fully Convolutional Network similar to (Tsai et al., 2018) and to (Radford et al., 2015), composed of 5 convolution layers with 4 × 4 kernels and a stride of 2. The numbers of channels are {64, 128, 256, 512, 1}, respectively. Each convolution layer is followed by a leaky ReLU with a negative slope of 0.2.
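In PyTorch terms, that description corresponds to something like the sketch below; the padding value and the choice of applying the sigmoid in the loss rather than inside the network are our assumptions.

```python
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    # Fully-convolutional discriminator: a 1-channel density map goes in,
    # a low-resolution map of source/target logits comes out.
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (64, 128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        # Fifth layer: one logit per spatial location.
        layers.append(nn.Conv2d(512, 1, kernel_size=4, stride=2, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, density_map):
        return self.net(density_map)
```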
We implement the whole system using the PyTorch framework on a single Nvidia RTX 2080 GPU with 12 GB of memory. To train the density estimator network and the discriminator, we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of $10^{-5}$. During training, it is crucial to balance the weight between the density and adversarial losses: a small value of λ_adv may not help the training process significantly, while a larger value may propagate incorrect gradients to the density estimator. We empirically choose the value of λ_adv depending on the employed dataset.
5.2 Results and Discussion
We validate the proposed UDA method for density es-
timation and counting of traffic scenes under differ-
ent settings. First, we employ the NDISPark dataset,
and we test the Day2Night domain shift; then, we
utilize the WebCamT and the TRANCOS datasets to
take into account the Camera2Camera performance
gap. Finally, we use the GTA dataset to consider the
Synthetic2Real domain difference. For all the experi-
ments, we base the evaluation of the models on three
metrics widely used for the counting task: (i) Mean
Absolute Error (MAE) that measures the absolute
count error of each image; (ii) Mean Squared Error
(MSE) that instead quantifies the squared count error
for each image; (iii) Average Relative Error (ARE),
which measures the absolute count error divided by
the true count. Note that, as a result of the squaring of
each error, the MSE effectively penalizes large errors
more heavily than the small ones. Instead, the ARE
is the only metric that considers the relation of the er-
ror and the total number of vehicles present for each
image. Results are summarized in Table 1, while in
the next sections, we describe the results obtained for
every considered scenario. Finally, we also plot some
examples of the outputs obtained using our models,
showing their visual quality. In particular, Figure 5
shows the ground truth and the predicted density maps
for some random samples of the considered scenarios.
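For reference, the three metrics reduce to a few lines over per-image counts (a sketch with our own helper name; inputs are arrays of predicted and true counts):

```python
import numpy as np

def counting_metrics(pred, true):
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    err = np.abs(pred - true)
    mae = err.mean()           # mean absolute count error
    mse = (err ** 2).mean()    # squared error: penalizes outliers more
    are = (err / true).mean()  # error relative to the true count
    return mae, mse, are
```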
5.2.1 Day2Night Domain Shift
In this scenario, we split the NDISPark dataset into train, validation, and test subsets containing about 100, 50, and 100 images, respectively. The first contains only pictures taken during the day (source domain), while the validation and test subsets contain night images (target domain). To fairly evaluate our method, we first consider the baseline model without the domain adaptation module (i.e., setting λ_adv to zero). Then, we add the adversarial module and compare the results. In both cases, we train the network for 300 epochs, validating at each epoch. We choose the best validation model in terms of MAE and test it against the test set. As shown in Table 1, our solution yields performance improvements on all three metrics.
5.2.2 Camera2Camera Domain Shift
In this case, we perform two sets of experiments to
test the domain shift that takes place when we con-
sider a camera different from the ones used in the
training phase.
First, we consider the WebCamT dataset, and we split it into train, validation, and test subsets. The first comprises about 25,000 images belonging to 7 cameras (source domain). The last two consist of the remaining 15,000 pictures from 3 different cameras, having diverse contexts and slightly different viewing angles (target domain). We compare the baseline and our solution, training for 20 epochs, validating at each epoch, and choosing the best model in terms of MAE.
Second, we take into account the TRANCOS dataset. We split it into train, validation, and test sets, following (Guerrero-Gómez-Olmedo et al., 2015). The train set represents the source domain, while the other two belong to the target domain and are collected in different contexts. We train our domain adaptation for 200 epochs, picking the best validation model in terms of MAE, and we evaluate it against the test set. We compare the obtained results with the ones reported by (Li et al., 2018) using the state-of-the-art CSRNet algorithm alone (i.e., our baseline) and with other state-of-the-art techniques present in the literature.

As shown in Table 1, we obtained performance improvements in both cases, on all three metrics. On the publicly available TRANCOS dataset, we achieved superior results not only with respect to the baseline but also compared to the other considered approaches.
5.2.3 Synthetic2Real Domain Shift
In this scenario, we train the algorithm using synthetic images and then test it on real data. In particular, we consider a subset of the GTA dataset containing about 5,000 images of city traffic scenarios, and we use it as the training set (source domain). On the other hand, we use the test and validation subsets of the WebCamT dataset as the target domain. We compare the results obtained using the baseline model and our solution with the domain adaptation module. In both cases, we train the algorithm for 20 epochs, validating at each epoch, and we choose the best model in terms of MAE.

Again, as shown in Table 1, we achieved better results compared to the base model. We believe that this scenario is particularly interesting because we obtained results comparable to the previous one, but this time without using manual annotations in either the source or the target domain.
Table 1: Experimental results obtained for the four considered domain-shift scenarios. We employ three evaluation metrics: the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Average Relative Error (ARE). We achieved performance improvements in all scenarios, on all three metrics.

                                                    MAE     MSE    ARE
Day2Night Domain Shift - NDISPark Dataset
  Baseline - CSRNet (Li et al., 2018)               3.95    27.45  0.43
  Our Approach                                      3.49    20.90  0.39

Camera2Camera Domain Shift - WebCamT Dataset (Zhang et al., 2017a)
  Baseline - CSRNet (Li et al., 2018)               3.24    16.83  0.21
  Our Approach                                      2.86    13.03  0.19

Camera2Camera Domain Shift - TRANCOS Dataset (Guerrero-Gómez-Olmedo et al., 2015)
  Hydra-CNN (Oñoro-Rubio and López-Sastre, 2016)    10.99   68.70  0.71
  FCN-MT (Zhang et al., 2017a)                      5.31    -      0.85
  LC-ResFCN (Laradji et al., 2018)                  3.32    -      -
  Baseline - CSRNet (Li et al., 2018)               3.56    30.64  0.10
  Our Approach                                      3.30    23.60  0.08

Synthetic2Real Domain Shift - GTA Dataset
  Baseline - CSRNet (Li et al., 2018)               4.10    25.83  0.28
  Our Approach                                      3.88    23.80  0.27
Figure 5: Examples of the predicted density maps in the considered scenarios: (a) Day2Night Domain Shift using the NDISPark dataset; (b) and (c) Camera2Camera Domain Shift employing the WebCamT and TRANCOS datasets, respectively; (d) Synthetic2Real Domain Shift using the GTA dataset for the training phase and the WebCamT dataset for testing on real images. In the first row, we report the input images; in the second row, the ground truth; and in the third, the predicted density maps obtained with our models. Ground truth vs. predicted counts for (a)-(d): 56/53, 13/14, 35/38, 12/11.
6 CONCLUSIONS
In this article, we tackle the problem of determin-
ing the density and the number of objects present in
large sets of images. Building on a CNN-based den-
sity estimator, the proposed methodology can gener-
alize to new data sources for which there are no an-
notations available. We achieve this generalization by exploiting an Unsupervised Domain Adaptation strategy, whereby a discriminator attached to the output forces similar density distributions in the target and source domains. Experiments show a significant im-
provement relative to the performance of the model
without domain adaptation. To the best of our knowl-
edge, we are the first to introduce a UDA scheme for
counting to reduce the gap between the source and the
target domain without using additional labels. Given
the conventional structure of the estimator, the im-
provement obtained by just monitoring the output en-
tails a great capacity to generalize learned knowledge,
thus suggesting the application of similar principles to
the inner layers of the network.
Another contribution is represented by the cre-
ation of two new per-pixel annotated datasets made
available to the scientific community. One of the two
novel datasets is a synthetic dataset created from a photo-realistic video game, where the labels are automatically assigned by interacting with the API of the graphical engine. Using this synthetic dataset, we
demonstrated that it is possible to train a model with a
precisely annotated and automatically generated syn-
thetic dataset and perform UDA toward a real-world
scenario, obtaining very good performance without
using additional manual annotations.
In our view, this work’s outcome opens new per-
spectives to deal with the scalability of learning meth-
ods for large physical systems with scarce supervisory
resources.
ACKNOWLEDGEMENTS
This work was partially supported by H2020 project
AI4EU under GA 825619 and by H2020 project
AI4media under GA 951911.
REFERENCES
Aich, S. and Stavness, I. (2018). Improving object
counting with heatmap regulation. arXiv preprint
arXiv:1803.05494.
Amato, G., Bolettieri, P., Moroni, D., Carrara, F., Ciampi,
L., Pieri, G., Gennaro, C., Leone, G. R., and Vairo, C.
(2018). A wireless smart camera network for parking
monitoring. In 2018 IEEE Globecom Workshops (GC
Wkshps), pages 1–6.
Amato, G., Ciampi, L., Falchi, F., and Gennaro, C. (2019).
Counting vehicles with deep learning in onboard uav
imagery. In 2019 IEEE Symposium on Computers and
Communications (ISCC), pages 1–6.
Amato, G., Ciampi, L., Falchi, F., Gennaro, C., and
Messina, N. (2019). Learning pedestrian detection
from virtual worlds. In Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., and Sebe, N., editors, Image Analysis and Processing – ICIAP 2019, pages 302–312, Cham. Springer International Publishing.
Boominathan, L., Kruthiventi, S. S. S., and Babu, R. V.
(2016). Crowdnet: A deep convolutional network for
dense crowd counting. In Proceedings of the 24th
ACM International Conference on Multimedia, MM
’16, page 640–644, New York, NY, USA. Association
for Computing Machinery.
Chen, Y., Li, W., Chen, X., and Gool, L. V. (2019). Learn-
ing semantic segmentation from synthetic data: A ge-
ometrically guided input-output adaptation approach.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1841–1850.
Ciampi, L., Amato, G., Falchi, F., Gennaro, C., and Rabitti,
F. (2018). Counting vehicles with cameras. In SEBD.
Ciampi, L., Messina, N., Falchi, F., Gennaro, C., and Am-
ato, G. (2020a). Virtual to real adaptation of pedes-
trian detectors. Sensors, 20(18):5250.
Ciampi, L., Santiago, C., Costeira, J. P., Gennaro, C., and
Amato, G. (2020b). Unsupervised vehicle counting
via multiple camera domain adaptation. In Saffiotti,
A., Serafini, L., and Lukowicz, P., editors, Proceed-
ings of the First International Workshop on New Foun-
dations for Human-Centered AI (NeHuAI) co-located
with 24th European Conference on Artificial Intelli-
gence (ECAI 2020), Santiago de Compostella, Spain,
September 4, 2020, volume 2659 of CEUR Workshop
Proceedings, pages 82–85. CEUR-WS.org.
Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain
adaptation by backpropagation. volume 37 of Pro-
ceedings of Machine Learning Research, pages 1180–
1189, Lille, France. PMLR.
Guerrero-Gómez-Olmedo, R., Torre-Jiménez, B., López-Sastre, R., Maldonado-Bascón, S., and Oñoro-Rubio, D. (2015). Extremely overlapping vehicle counting. In Paredes, R., Cardoso, J. S., and Pardo, X. M., editors, Pattern Recognition and Image Analysis, pages 423–431, Cham. Springer International Publishing.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988.
Hoffman, J., Wang, D., Yu, F., and Darrell, T. (2016). Fcns
in the wild: Pixel-level adversarial and constraint-
based adaptation. arXiv preprint arXiv:1612.02649.
Hong, W., Wang, Z., Yang, M., and Yuan, J. (2018). Con-
ditional generative adversarial network for structured
domain adaptation. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 1335–1344.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Laradji, I. H., Rostamzadeh, N., Pinheiro, P. O., Vazquez,
D., and Schmidt, M. (2018). Where are the blobs:
Counting by localization with point supervision. In
Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 547–562.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324.
Lempitsky, V. and Zisserman, A. (2010). Learning to count
objects in images. In Lafferty, J. D., Williams, C. K. I.,
Shawe-Taylor, J., Zemel, R. S., and Culotta, A., edi-
tors, Advances in Neural Information Processing Sys-
tems 23, pages 1324–1332. Curran Associates, Inc.
Li, Y., Zhang, X., and Chen, D. (2018). Csrnet: Dilated
convolutional neural networks for understanding the
highly congested scenes. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1091–1100.
Oñoro-Rubio, D. and López-Sastre, R. J. (2016). Towards perspective-free object counting with deep learning. In Leibe, B., Matas, J., Sebe, N., and Welling, M., editors, Computer Vision – ECCV 2016, pages 615–629, Cham. Springer International Publishing.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and
Lopez, A. M. (2016). The synthia dataset: A large
collection of synthetic images for semantic segmenta-
tion of urban scenes. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 3234–3243.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Sindagi, V. A. and Patel, V. M. (2018). A survey of recent
advances in cnn-based single image crowd counting
and density estimation. Pattern Recognition Letters,
107:3–16. Video Surveillance-oriented Biometrics.
Torralba, A. and Efros, A. A. (2011). Unbiased look at
dataset bias. In CVPR 2011, pages 1521–1528.
Tsai, Y.-H., Hung, W.-C., Schulter, S., Sohn, K., Yang, M.-
H., and Chandraker, M. (2018). Learning to adapt
structured output space for semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 7472–7481.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017).
Adversarial discriminative domain adaptation. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 7167–7176.
Zhang, S., Wu, G., Costeira, J. P., and Moura, J. M. (2017a).
Understanding traffic density from large-scale web
camera data. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
5898–5907.
Zhang, Y., David, P., and Gong, B. (2017b). Curriculum
domain adaptation for semantic segmentation of ur-
ban scenes. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2020–2030.
Zhang, Y., Zhou, D., Chen, S., Gao, S., and Ma, Y. (2016).
Single-image crowd counting via multi-column con-
volutional neural network. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 589–597.