N-MuPeTS: Event Camera Dataset for Multi-Person Tracking and
Instance Segmentation
Tobias Bolten¹ (https://orcid.org/0000-0001-5504-8472), Christian Neumann¹ (https://orcid.org/0000-0002-0871-8629),
Regina Pohle-Fröhlich¹ and Klaus D. Tönnies²
¹Institute for Pattern Recognition, Hochschule Niederrhein, Krefeld, Germany
²Department of Simulation and Graphics, University of Magdeburg, Germany
Keywords:
Dynamic Vision Sensor, Event Data, Instance Segmentation, Multi-Person Tracking, Dataset.
Abstract:
Compared to well-studied frame-based imagers, event-based cameras form a new paradigm. They are biolog-
ically inspired optical sensors and differ in operation and output. While a conventional frame is dense and
ordered, the output of an event camera is a sparse and unordered stream of output events. Therefore, to take
full advantage of these sensors, new datasets are needed for research and development. Despite their increasing
use, the selection and availability of event-based datasets is currently still limited.
To address this limitation, we present a technical recording setup as well as a software processing pipeline
for generating event-based recordings in the context of multi-person tracking. Our approach enables the auto-
matic generation of highly accurate instance labels for each individual output event using color features in the
scene. Additionally, we employed our method to record and release a dataset including one to four persons and
addressing the common challenges arising in multi-person tracking scenarios. This dataset contains nine different scenarios,
with a total duration of over 85 minutes.
1 INTRODUCTION
Dynamic Vision Sensors (DVS) are optical sensors
designed to replicate the basic neural architecture and
operating principle of the human eye. Each pixel of
a DVS detects and reacts completely asynchronously
and independently to changes in brightness. In this
process, each pixel generates and sends an output as
soon as a brightness change above a set threshold
value is detected. Therefore, unlike classic image sen-
sors that operate with a fixed sampling rate, the output
of a DVS is a completely data-driven stream of trig-
gered output ‘events’. Each event contains informa-
tion about
(a) the (x,y) position of the triggered pixel in the sen-
sor array,
(b) a very high precision timestamp t of the time of
occurrence, and
(c) an indicator for the direction of the detected
brightness change.
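For illustration only, such an event stream can be represented as a structured array; the field names and types below are our own choice and are not prescribed by a particular DVS driver or file format:

```python
import numpy as np

# Hypothetical per-event record: pixel position (x, y), a high-resolution
# timestamp t, and the polarity p of the detected brightness change.
event_dtype = np.dtype([
    ("x", np.uint16),   # column of the triggering pixel
    ("y", np.uint16),   # row of the triggering pixel
    ("t", np.int64),    # timestamp, e.g. in microseconds
    ("p", np.int8),     # polarity: +1 (brighter) or -1 (darker)
])

# A tiny example stream of three events.
events = np.array(
    [(120, 45, 1_000_012, 1),
     (121, 45, 1_000_057, -1),
     (300, 210, 1_000_103, 1)],
    dtype=event_dtype,
)

# The stream is ordered by time, not by pixel position.
assert np.all(np.diff(events["t"]) >= 0)
```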
Figure 1: DVS event stream concept visualization. The high time resolution supports continuous tracking approaches by using high-quality segmentations.
The operation and output paradigm of the DVS results in technical advantages for tracking and segmentation tasks. Compared to classic systems, the
DVS signal contains significantly less redundant in-
formation, as only changes are recorded. For segmen-
tation tasks it is advantageous that static components
do not need to be processed. Due to the high time
resolution of the DVS, the resulting 3D (x,y,t) space-
time event cloud provides an almost continuous sig-
nal for moving objects (see Figure 1). A high-quality
segmentation thus supports the future development of
DVS-based tracking methods. However, this requires
publicly available datasets that include these segmen-
tation annotations as well as object tracking scenarios
and challenges.
In the context of multi-object tracking (MOT), a
dataset should include the following common chal-
lenges (Islam et al., 2015; Xu et al., 2019; Luo et al.,
2021):
1. object occlusions (through infrastructure as well
as by other persons in the scene)
2. similar appearance and body shape of recorded
persons
3. changes in pose and movement patterns (e.g. kneeling, standing, walking, and running)
4. interactions among multiple persons (including abrupt changes in movement direction and speed)
5. objects of different sizes
In addition, to take advantage of the high temporal
resolution of the sensor, this dataset must provide the
instance annotations for each event of the DVS out-
put stream. Due to the novelty of the sensor itself,
the availability of pure event-based datasets is limited
compared to the frame-based domain. To the best of
the authors’ knowledge, currently no DVS dataset ful-
fills these requirements.
In this paper, we present a technical setup for
recording and also publish a dataset which aims to
provide instance annotations on DVS data. In sum-
mary, we contribute:
a detailed description of a hardware setup and corresponding software processing pipeline that allows the acquisition of multi-person DVS recordings with high-quality per-event instance labels,
a ready-to-use and publicly available multi-person DVS dataset that meets the aforementioned requirements within an outdoor recording setup.
Section 1.1 summarizes related work in terms of
existing datasets and event-based approaches. The re-
quirements for our recording setup are specified in
Section 2. Subsequently, the hardware setup used
is described in Section 3. The proposed software
pipeline for generating object annotations is presented
in detail in Section 4. Section 5 outlines the details
and statistics of the provided dataset.
1.1 Related Work
Early event-based tracking approaches were devel-
oped and evaluated using basic datasets containing,
for example, moving objects of simple geometric shape (cf. the shape scenes from (Mueggler et al.,
2017b)). These types of scenes were used to detect
and track features such as corners (Mueggler et al.,
2017a; Alzugaray and Chli, 2018). In the context
of this paper, however, more complex event-based
datasets concerning person detection and tracking are
of greater interest.
In (Jiang et al., 2019) an approach for person de-
tection in the application area of traffic surveillance
was presented. For this purpose, a custom dataset was
used which only has a total length of 14 seconds. In
a very similar application context, person detection
was also implemented in (Chen et al., 2019). This
was followed by the publication of a dataset in (Miao
et al., 2019). The detection part of this dataset con-
sists of only twelve short-length sequences. In (Ojeda
et al., 2020) and (Bisulco et al., 2020) two approaches
for filtering the event stream are presented with the
goal of implementing person detection close to the hardware. For this, a DVS dataset consisting of several
hundred sequences is used. Yet, this dataset is not
publicly available.
The task of multi-person tracking is directly ad-
dressed in (Piątkowska et al., 2012). Due to the lack
of available ground-truth annotations, their approach
was only tested on a very limited set of data which
were manually labeled. With the work of (Hu et al.,
2016), DVS benchmark datasets have been published.
This includes a part that explicitly relates to object
tracking. In this case, though, only the tracking of a
single object is considered. In (Camuñas-Mesa et al., 2018), challenges arising from object occlusion are addressed for event-based tracking. The real-world capability of their approach was tested on a multi-person tracking scenario, but the qualitative results were only evaluated on a single short scene.
In summary, it can be stated that there are cur-
rently only a few event-based datasets available
within the context of person detection and track-
ing. Furthermore, label annotations included in those
datasets are not sufficiently discriminative. In large-
scale object detection datasets such as the GEN1 au-
tomotive dataset (de Tournemire et al., 2020), object
labels are often specified only in the form of bound-
ing boxes. With (Alonso and Murillo, 2019; Bolten
et al., 2021), there are event-based datasets that pro-
vide annotations for semantic segmentation. Never-
theless, to fulfill the initial requirements in the context
of multi-person tracking, an instance-level annotation
is required. Currently, no suitable event datasets exist
that satisfy this.
In conventional frame-based computer vision,
datasets exist that provide annotations beyond bound-
ing boxes for tracking and segmentation (Voigtlaen-
der et al., 2019). With work like (Hu et al., 2021),
approaches do exist to convert frame-based datasets
into the event-based domain. However, this conver-
sion does not fully reflect the non-ideal sensor characteristics. Therefore, we propose a procedure that
allows the creation of a native DVS dataset with an
automated generation of labels requiring only mini-
mal manual work.
Like (Marcireau et al., 2018), our approach is also
based on exploiting color features within the scene.
Compared to their hardware setup, which consists of
three DVS and an optical beam splitter, our setup
is simpler and does not impose constraints on the
recorded spectrum. A detailed description of our used
setup is given in Section 3.
2 LABEL EXTRACTION
A meaningful dataset requires accurate labels. In the
case of instance labeling of event data, manual an-
notation is not efficient. We solved this problem by
introducing additional information in a way so that
the original data is not influenced. More specifically,
a DVS is usually based on CMOS technology. A
CMOS image sensor typically is most sensitive to
near-infrared light. DVS with color filter matrices ex-
ist only as prototypes and do not feature the higher
image resolution of newer DVS models (Moeys et al.,
2018; Taverni et al., 2018). Thus, for practical pur-
poses DVS are not capable of recording color infor-
mation. We exploit this circumstance by encoding which person an event belongs to in the color of that person's clothing.
In order to record the color information, a sec-
ond frame-based color camera, referred to as an ac-
tive pixel sensor (APS) camera, is required. Assum-
ing proper recordings, label data with high accuracy
can be extracted from the color frames.
2.1 Color Features
The color features are generated by single-colored
full-body skin-tight garments. The following speci-
fications are important for this:
1. A single color per suit is required because person
instances will be separated by color hue.
2. The suit must cover the whole body so that color
information is available for all event data trig-
gered by a person.
3. The suit should be skin-tight so that the silhouette
of a person is not larger than normal.
The color of the suits changes the recorded inten-
sities of the DVS only slightly. Synthetic fabrics tend
to be strongly NIR-reflective. Independent of the vis-
ible color, the reflected NIR-light dominates the spec-
tral response of the garment, which can thus trigger DVS events regardless of the chosen color.
Initial experiments showed that fabrics can be dis-
tinguished by color hue even when the separation in
hue is small (see Figure 5 and the description in Sec-
tion 4.2). This enables the use of many colors and
therefore many person instances can be distinguished
simultaneously.
However, there are multiple sources of color arti-
facts that must be considered:
1. The lens introduces chromatic aberrations, i.e.
phantom colors towards the edges of the image
area, that can only be partially corrected.
2. The camera records color information through a
color filter matrix. Only three colors, red, green,
and blue, are recorded. The real source color is a
mixture of these and first needs to be reconstructed during demosaicing. During this process
errors are introduced near borders because neigh-
boring color signals could be combined while the
sources were separated in reality. This also intro-
duces new colors that are not part of the real scene.
As separation of persons by color hue is the basis
of our work, erroneous color information hinders
our method.
3. A third source of errors is the on-chip binning pro-
cess of the camera sensor. The effect that binning
has on resulting colors is similar to the previous
point. By aggregating 2 × 2 pixels it is possible to
get phantom colors not included in the real scene.
Also over-exposure in each color channel must be
avoided. Single-colored suits can quickly lead to
over-exposure in a single channel. It is recommended
to deliberately under-expose the scene so that no re-
quired information is lost.
2.2 Environmental Influences
Data acquisition in an outdoor environment is natu-
rally influenced by the environment itself. One impor-
tant aspect is the lighting of the scene. Direct sunlight
is not preferred because the color of the direct sunlight
itself is different from the color of diffusely reflected
light. In consequence, the parts in direct sunlight will
need a different white balance than parts in the shad-
ows. The information about which parts are in direct
sunlight is not known precisely and thus the necessary
distinction cannot be made. Overcast skies are preferred for their diffuse lighting of the scene.
Another practical problem is airborne particles. Acquisition during precipitation is not useful because the sensitivity to brightness changes and the temporal resolution of a DVS are high enough to image the rain drops themselves. Wet scenes should
be avoided due to the possibility of unwanted reflec-
tions and apparent color shift of wet surfaces of some
materials. Thus, dry weather without larger airborne
particles is preferred. This also includes insects and pollen, as these environmental influences would also appear in the DVS signal.
3 HARDWARE SETUP
In order to enable accurate processing of event and
color information, several aspects and requirements
regarding the hardware setup need to be considered.
This includes the selection of the sensors used as well
as their mounting in a stereoscopic setup to optimize
the perspective mapping. Furthermore, the colored
garments used to incorporate the color features must
also be taken into account.
3.1 Sensors
The DVS is a CeleX-IV built by CelePixel (Guo et al.,
2017). The sensor is combined with an 8 mm wide-angle lens from Computar (MEGAPIXEL V0814-MP, f = 8 mm, f/1.4–f/16, 1 inch, C-Mount). The DVS is connected
via USB 3 to a notebook that stores the event data
on an external solid state drive. Previous experiments
showed that event noise is strongly dependent on sen-
sor temperature (Nozaki and Delbruck, 2017; Berth-
elon et al., 2018). We applied a cooling system based
on Peltier elements in order to stabilize the operating
temperature at a low level.
Our APS camera is based on a Sony IMX477
CMOS image sensor. For the color frames, it is nec-
essary to gather the data in a lossless manner. Our
requirements are minimal frame drops, maximal reso-
lution, full color information, and maximal frames per
second (fps). State-of-the-art video compression like
H.264 allocates few bits to color information. The
resulting hue values are not usable for our purposes.
It is possible to record a raw video stream from the
IMX477 sensor. The video stream features a reso-
lution of 4056 × 3040 px with 12 bit per pixel. The resulting data rates are a challenge for real-time applications; multiple limitations arise here (a rough data-rate estimate follows this list):
The camera interface on the APS sensor board is not capable of transmitting the full resolution in uncompressed form at the desired frame rate. Thus, binning was activated, which in turn halves the image dimensions to 2028 × 1520 px.
The accumulated data rates of the APS and DVS as well as the storage on an external USB medium exceed the capacity of a single computer's USB bus. Therefore, a dedicated Raspberry Pi 4 was used to acquire the APS frames and store them to another external solid state drive.
Figure 2: Sensor and color garment setup. (a) CeleX-IV with cooler (bottom) and IMX477 (top). (b) Full-body garment and cotton shoe cover.
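As referenced above, a back-of-the-envelope estimate of the sustained APS data rate after binning illustrates why a dedicated acquisition device is required. This is our own calculation; it ignores line padding and container overhead and therefore only approximates the figure quoted in Section 4.3:

```python
# Rough APS data rate after 2x2 binning with the 12-bit packed raw format.
width, height = 2028, 1520          # binned sensor resolution
bits_per_pixel = 12                 # packed raw format
fps = 40

bytes_per_frame = width * height * bits_per_pixel // 8
rate_mb_per_s = bytes_per_frame * fps / 1e6
print(f"{bytes_per_frame / 1e6:.1f} MB per frame, {rate_mb_per_s:.0f} MB/s sustained")
# -> about 4.6 MB per frame and roughly 185 MB/s before padding, in the same
#    range as the approximately 178 MB/s noted for the recording setup.
```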
The field of view was matched as closely as possible by the choice of the lenses. The IMX477 was combined with a 4 mm wide-angle lens from Edmund Optics (TECHSPEC® UC Series #33-300, f = 4 mm, f/1.8–f/11, 1/2 inch, C-Mount).
Both sensors together form a stereoscopic camera.
The cameras are stacked vertically as parallax is less
apparent along the vertical axis with cameras look-
ing at the scene from an elevated position. Also, the
lenses were mounted as close to each other as possi-
ble. The physical mount was milled from aluminum
alloy and placed on a solid surface so that the like-
lihood of tripod shake triggering unwanted events is
minimized. The optical axes were adjusted towards the same point in the center of the scene so that the lateral overlap of the resulting views is maximal. The
resulting hardware setup is shown in Figure 2a.
3.2 Color Garments
The full-body garment is made up of two parts. The
complete body, including head, hands, and feet, is covered by a Morphsuit, a commercially available Spandex-based suit. In order to guarantee a natural walking pattern, all persons were allowed to wear everyday shoes. The shoes are covered using separate covers sewn from colored cotton fabric. An example
of a complete garment is depicted in Figure 2b.
4 SOFTWARE PIPELINE
The software processing pipeline for generating the
label masks as well as their propagation to the DVS
data consists of several stages, which are explained in the following. An overview of the process is given in Figure 3.
Figure 3: Overview of the software pipeline used for dataset generation.
4.1 Raw Image Development
To minimize color artifacts and inaccuracies, espe-
cially in hue, the frame-based APS recordings from
the IMX477 sensor are performed as losslessly as possible. This means that a full digital photo development must be applied to obtain appropriate color images
from the raw data.
In order to minimize the required data bandwidth
of the sensor, the acquisition is performed in a packed
data format. For every two 12-bit pixels, three bytes
are combined by the sensor. After unpacking the indi-
vidual pixels from this structure, a black level correc-
tion is performed and a color image is generated by
using an edge-aware demosaicing algorithm on the
Bayer-filtered data. Following clipping and normal-
ization of the data, a color correction matrix is applied
to transfer the sensor-specific color values into a stan-
dardized, device-independent color space. Finally, a gamma correction is applied and the resulting frame is cropped to a resolution of 2000 × 1500 px to remove information-free stride pixels and color artifacts at the image borders.
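As an illustration of this development chain, a minimal sketch in Python with OpenCV is given below. It is not the calibrated implementation used for the dataset: the 12-bit packing layout, the Bayer order, and all default parameter values (black/white level, color correction matrix, gamma) are assumptions.

```python
import numpy as np
import cv2


def develop_raw_frame(packed: np.ndarray, width: int, height: int,
                      black_level: int = 256, white_level: int = 4095,
                      ccm: np.ndarray = np.eye(3, dtype=np.float32),
                      gamma: float = 2.2) -> np.ndarray:
    """Sketch of the raw development chain with placeholder parameters."""
    # Unpack the 12-bit packed format: every 3 bytes hold two pixels
    # (assumed layout: two high bytes followed by a shared low-nibble byte).
    b = packed.reshape(-1, 3).astype(np.uint16)
    p0 = (b[:, 0] << 4) | (b[:, 2] & 0x0F)
    p1 = (b[:, 1] << 4) | (b[:, 2] >> 4)
    bayer = np.stack([p0, p1], axis=1).reshape(height, width)

    # Black level correction and normalization to [0, 1].
    norm = np.clip((bayer.astype(np.float32) - black_level)
                   / float(white_level - black_level), 0.0, 1.0)

    # Edge-aware demosaicing of the Bayer-filtered data (pattern order assumed).
    rgb16 = cv2.demosaicing((norm * 65535).astype(np.uint16),
                            cv2.COLOR_BayerRG2RGB_EA)
    rgb = rgb16.astype(np.float32) / 65535.0

    # Color correction matrix into a device-independent space, then gamma.
    rgb = np.clip(rgb @ ccm.T, 0.0, 1.0)
    return (rgb ** (1.0 / gamma) * 255.0 + 0.5).astype(np.uint8)
```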
In a second step, chromatic aberrations are cor-
rected. For this, the image is further adapted with an
open-source photo editor (darktable, https://www.darktable.org/).
4.2 Label Mask Generation
Label masks are automatically generated based on the
developed APS frames by means of color hue seg-
mentation.
First, regions of interest are marked by hand (see
Figure 4). Areas of different illumination were in-
cluded. This is a countermeasure against the influ-
ences described in Section 2.2. After conversion into
HSV color space, we get multiple color clusters per
garment color. The clusters can be visualized in a
polar histogram with 360 bins (see Figure 5). These clusters are further processed to give a single centroid in hue for each garment color.
Figure 4: Example of one-time manually annotated color areas for the performed clustering.
Figure 5: Polar histogram of garment colors as contained in annotations on developed APS frames used for dataset.
For this purpose, an agglomerative clustering is
computed on the hue values. Clusters with differ-
ences in hue below a given threshold are merged. This
threshold is increased as long as the number of result-
ing clusters is lower than the number of garment col-
ors used during recording. It is assumed that a normal
distribution is suited to model potential color varia-
tions, e.g. due to lighting. A normal distribution is fit
to each cluster. The calculated mean hue of each clus-
ter is used for the hue segmentation in the next step.
The hue range in HSV color space is circular. Care
must be taken to use the circular mean for all computations involving hue.
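A compact sketch of this clustering step is given below. It uses SciPy's hierarchical clustering as a stand-in for the agglomerative procedure; the start threshold and step size are arbitrary choices, and the normal-distribution fit reduces here to taking the circular mean of each cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def circular_mean_deg(hues_deg):
    """Circular mean of hue angles given in degrees."""
    rad = np.deg2rad(np.asarray(hues_deg, dtype=float))
    return float(np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360.0)


def hue_centroids(hue_samples_deg, n_colors, start_thresh=2.0, step=1.0):
    """Merge hue clusters until no more clusters remain than garment colors,
    then return the circular mean hue of each cluster."""
    hues = np.asarray(hue_samples_deg, dtype=float).reshape(-1, 1)

    # Pairwise circular distances on the hue circle (in degrees).
    dists = pdist(hues, lambda a, b: min(abs(a[0] - b[0]), 360.0 - abs(a[0] - b[0])))
    link = linkage(dists, method="average")

    # Increase the merge threshold until the cluster count is small enough.
    thresh = start_thresh
    labels = fcluster(link, t=thresh, criterion="distance")
    while labels.max() > n_colors:
        thresh += step
        labels = fcluster(link, t=thresh, criterion="distance")

    return [circular_mean_deg(hues[labels == k, 0]) for k in range(1, labels.max() + 1)]
```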
The hue segmentation is computed by calculating
the differences to the centroids in hue of each clus-
ter and then applying a threshold α. At this point,
we have selected one circular sector for each color in
HSV color space. We can further narrow the decision
regions by incorporating saturation and value. The
idea is to exclude all colors that cannot originate from the color garments in the given setup. We remove all pixels with a saturation below σ and all pixels with a value below ϕ. The thresholds must be selected so that as little as possible of the regions of interest is cut away. Finally, small artifacts are removed with a morphological opening. Now we have a separate binary mask for each garment color.
Additional care was taken to prevent multiple la-
bels per pixel and to remove false positives. When
the resulting label mask is set for multiple colors at
one position only one color is kept. Centroids’ hue
values are used in ascending order to define priori-
ties. Most false positives are removed by applying
a connected component analysis and suppressing all
connected components with a size of less than τ.
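Putting the hue, saturation, and value criteria together, the per-color mask generation can be sketched as follows. This is a simplified stand-in, not the original implementation; the structuring element and the exact order of the clean-up steps are assumptions, and alpha, sat_min, val_min, and min_area mirror the thresholds α, σ, ϕ, and τ.

```python
import numpy as np
import cv2


def hue_instance_masks(bgr, hue_centroids_deg, alpha=10.0,
                       sat_min=0.5, val_min=0.1, min_area=10):
    """Sketch of the hue-based label mask generation."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0].astype(np.float32) * 2.0          # OpenCV stores hue as 0..179
    sat = hsv[..., 1].astype(np.float32) / 255.0
    val = hsv[..., 2].astype(np.float32) / 255.0

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    masks = {}
    for centroid in sorted(hue_centroids_deg):           # ascending hue = priority
        # Circular hue difference plus the saturation/value gating.
        diff = np.abs(hue - centroid)
        diff = np.minimum(diff, 360.0 - diff)
        mask = (diff <= alpha) & (sat >= sat_min) & (val >= val_min)

        # Keep at most one label per pixel (lower hue centroids win).
        for prev in masks.values():
            mask &= ~prev

        # Morphological opening suppresses small isolated artifacts.
        mask = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_OPEN, kernel)

        # Remove connected components smaller than min_area pixels.
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
        for i in range(1, n):
            if stats[i, cv2.CC_STAT_AREA] < min_area:
                mask[labels == i] = 0

        masks[centroid] = mask.astype(bool)
    return masks
```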
4.3 Synchronization, Mapping &
Propagation
Several steps are necessary to project the obtained la-
bel masks onto the DVS event stream. A priori, in-
trinsic camera parameters must be estimated for both
sensors of the setup. The required points can be ex-
tracted by using a slowly moving chessboard and applying corner detection to it.
It is important to temporally synchronize the
views so that label masks derived from the APS cam-
era correspond to the view of the DVS. The record-
ing of raw data from the APS color sensor is technically limited to 40 fps by the hardware setup used (with the 12-bit packed data format, this corresponds to an approximate data rate of 178 MB/s that must be continuously transferred and stored). For further processing, the continuous data stream of the DVS is therefore also divided into sections of 25 ms length (i.e., 1/40 s). The resulting APS frames and the
time windows of the DVS are then synchronized to
each other on the basis of system clock timestamps
corresponding to their acquisition time. The clocks
of the systems used are adjusted via NTP. This limits
the error over time to a small value. Additionally, an
initial synchronization at the start of every recording
is manually performed using a visual cue. This visual
cue is used to compensate most of the absolute error
between the system timestamps.
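A minimal sketch of this frame-to-window association is shown below. It assumes that frame and event timestamps are already on the common, NTP-adjusted clock after the visual-cue alignment; the placement of each window relative to the frame timestamp is a simplification.

```python
import numpy as np


def match_frames_to_windows(frame_times_s, event_times_s, window_s=0.025):
    """Pair every APS frame with the DVS events of a 25 ms time window."""
    frame_times_s = np.asarray(frame_times_s)
    event_times_s = np.asarray(event_times_s)        # assumed to be sorted
    pairs = []
    for i, t0 in enumerate(frame_times_s):
        # All events whose timestamps fall into [t0, t0 + window_s).
        lo = np.searchsorted(event_times_s, t0, side="left")
        hi = np.searchsorted(event_times_s, t0 + window_s, side="left")
        pairs.append((i, slice(lo, hi)))              # frame index -> event slice
    return pairs
```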
For mapping, the generated label mask is first
undistorted with the determined camera parameters of
the APS. This process step is illustrated exemplarily
in Figure 6a and 6b. Then the mask is projected onto
the DVS, changing the field of view and image resolu-
tion (see Figure 6c). For this purpose, a homography
is determined on the basis of manually selected points
on the ground plane of the scene. This is then used to
warp the mask in perspective.
The redistorted instance labels (see Figure 6d) are
finally propagated to the plain DVS event stream. All
events within a synchronized time window are as-
signed the label that the mask contains at the corresponding spatial position. The final result is thus a
per-event instance label annotation on the DVS output
stream (see Figure 6e).
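The complete mapping and propagation step could be sketched as follows. The camera intrinsics, distortion coefficients, and the ground-plane homography are assumed to be given, the intermediate image sizes are simplified compared to Figure 6, and the event layout follows the structured array assumed in the introduction.

```python
import numpy as np
import cv2


def propagate_labels(label_mask, K_aps, dist_aps, H_aps_to_dvs,
                     K_dvs, dist_dvs, dvs_shape, events):
    """Sketch of mask mapping and per-event label propagation."""
    h_aps, w_aps = label_mask.shape[:2]
    h_dvs, w_dvs = dvs_shape

    # 1) Undistort the label mask with the intrinsic APS parameters
    #    (nearest-neighbor interpolation keeps the labels discrete).
    m1, m2 = cv2.initUndistortRectifyMap(K_aps, dist_aps, None, K_aps,
                                         (w_aps, h_aps), cv2.CV_32FC1)
    undist = cv2.remap(label_mask, m1, m2, interpolation=cv2.INTER_NEAREST)

    # 2) Warp the undistorted mask into the DVS view using the homography
    #    determined from manually selected ground-plane points.
    warped = cv2.warpPerspective(undist, H_aps_to_dvs, (w_dvs, h_dvs),
                                 flags=cv2.INTER_NEAREST)

    # 3) Re-apply the DVS lens distortion: for every pixel of the plain DVS
    #    grid, look up its undistorted position and sample the warped mask.
    yy, xx = np.mgrid[0:h_dvs, 0:w_dvs].astype(np.float32)
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1).reshape(-1, 1, 2)
    und = cv2.undistortPoints(pts, K_dvs, dist_dvs, P=K_dvs).reshape(h_dvs, w_dvs, 2)
    map_x = np.ascontiguousarray(und[..., 0])
    map_y = np.ascontiguousarray(und[..., 1])
    redist = cv2.remap(warped, map_x, map_y, interpolation=cv2.INTER_NEAREST)

    # 4) Every event inherits the label at its (x, y) position in the mask.
    return redist[events["y"], events["x"]]
```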
4.4 Limitations
There is one notable limitation. The label mask gen-
eration based on the APS frames works for both moving and standing persons. In contrast, a DVS ideally only generates event data for regions with movement. As a consequence, noise events of the DVS are labeled as belonging to a person when that person is not moving. The label is wrong when looking at the event data itself, because these events were not triggered by the person. But when viewed from the standpoint of a tracking algorithm, the label is correct, because a person cannot simply disappear from the scene. There-
fore, we decided not to further post-process the label
masks to address this issue.
5 DATASET: N-MuPeTS
First, we give an overview of what is included in the dataset, N-MuPeTS (Neuromorphic Multi-Person Tracking and Segmentation dataset), and briefly explain every annotation used. After that, we discuss multiple statistics derived from the dataset.
During recording in an unconstrained outdoor environment, interference cannot be completely
avoided. Every deviation from a perfect sequence of
actions must be edited during post-processing. We
decided to mark those sequences which are of lower
quality and split the recording at the point in time
when a change in quality occurs. This process was
done manually to ensure high accuracy. We observed
a number of issues, including both recording artifacts
and post-processing errors. We distinguished three
classes of quality by the impact of the issues on us-
ability for tracking applications. From worst to best,
the quality classes are:
Quality 3: corresponds to these major issues:
uninvolved persons in the scene
cars moving in the scene
wildlife in the scene, e.g. birds
Quality 2: corresponds to these minor issues:
unwanted occlusion, e.g. by tree trunks
one or more persons outside of the static mask
incomplete masks, e.g. intersection with the static mask (see Section 5.3)
Quality 1: includes all remaining sequences, i.e. sequences without any of the aforementioned issues.
Figure 6: Processing steps to map and propagate label masks to the DVS event stream. (a) Generated label mask with lens distortion (2000 × 1500 px). (b) Label mask with lens correction applied (2300 × 1725 px). (c) Undistorted label mask mapped to the DVS view (883 × 736 px). (d) Redistorted label to match the plain DVS data (768 × 640 px). (e) Propagated DVS event labels.
Figure 7: Setup during dataset recording (top row) and corresponding labels (bottom row). (a) All four actors. (b) Masked scene (blended). (c) Generated label mask. (d) Propagated DVS labels.
Qualities 2 and 3 can be used to obtain longer sequences (cf. the supplement to this paper). In the following, we present quality 1 exclusively. The complete dataset, including sequences of all three qualities, is publicly available free of charge at http://dnt.kr.hsnr.de/DVS-NMuPeTS/; the supplementary material to this paper is also available for download there.
Table 1: Scenarios (pattern and primary parameter).
background (empty scene): –
single person: speed
crossing paths: angles of paths, speed
parallel paths (same directions): speed, distance
parallel paths (opposing directions): speed, distance
occlusions: speed, pose
meeting & parting: direction
random path: speed
helical path: –
5.1 Scenarios
We defined a set of scenarios as a protocol for record-
ing and guidance to our actors. Where possible, we
repeated the scenario for each person and all possi-
ble numbers of persons. The scenarios are listed in
Table 1. The order is mostly chronological and corresponds to the sequence indices. For the majority of the scenarios, we introduced annotations with a corresponding name. The annotations are not exclusive to certain sequences, e.g. CROSSING will occur frequently during ‘crossing paths’ but can also occur at any other time.
Thus, the scenarios increase the frequency of their
corresponding annotations for their duration.
For the generation of label masks we used α =
10°, σ = 50%, ϕ = 10%, and τ = 10 px.
5.2 Persons & Garment Colors
The recorded persons are pseudonymized in the pro-
vided data by using the color of the suit they were
wearing (see Table 2 as well as Figure 7a and 8).
Some colors are less practical than others in a given
environment. In our case, vegetation tends to be
green-yellowish. Red and blue are well separated.
Table 2: Physical characteristics of the participating persons.
garment color   height   sex   weight
red      1.84 m   m   70 kg
orange   1.82 m   m   82 kg
cyan     1.74 m   f   80 kg
blue     1.80 m   m   65 kg
Figure 8: Overview of the fabrics used. First row: cotton
shoe covers, second row: Morphsuits. Sorted by hue, from
left to right, i.e. red, orange, cyan, and blue.
The choice of orange and cyan, which are close in hue
(see Figure 5), is due to the unavailability of differently colored suits at the time of dataset creation. Fig-
ure 7c shows an automatically generated label mask
in the used setup, while Figure 7d shows the result of
corresponding mapping and propagation to the DVS
view. The supplement provides further examples of
generated label masks for a qualitative overview.
5.3 Static Mask
We used a binary mask to prevent vegetation as well
as uninvolved persons and objects, as described under
‘Quality 3’, from triggering false positives during la-
bel mask generation. Some sequences needed to be
split and marked as quality 2 instead of quality 1 be-
cause one or more persons are masked out by the static mask.
This static mask is only applied to the color frames from the APS camera. In the dataset, all event data is unmasked. The applied mask can be seen in Figure 7b.
5.4 Annotations
Annotations were made manually and are available
per-color.
The scenario description, used to brief the actors,
and the annotations, describing all included activities,
are directly related to the main challenges of MOT de-
fined in the introduction. We will now briefly discuss
the mapping and explain our annotations.
5.4.1 Background
One annotation not representing a challenge in itself is
BACKGROUND, i.e. an empty scene. This provides
separate recordings only including sensor noise. In
quality 2, some foreign activity can occur.
BACKGROUND: no person with colored garment is
in the scene.
5.4.2 Object Occlusion
Object occlusion can be further divided into occlusion
with infrastructure and occlusion between persons.
Occlusion between persons can happen whenever two
or more persons are in the scene. It is very fre-
quent during CROSSING and SIDEBYSIDE (see Sec-
tion 5.4.5).
OCCLUSION: one or more persons are behind an ob-
stacle.
5.4.3 Similar Shape
All actors are of comparable size and weight. The
person representing CYAN naturally has a different sil-
houette while the remaining persons have a very sim-
ilar figure. This makes distinction by size impractica-
ble and provides a greater challenge for tracking al-
gorithms.
RED, ORANGE, CYAN, BLUE: person with specified
garment color is in scene.
5.4.4 Pose & Movement
Most of the time the actors were upright and walk-
ing. For short durations they were in differing poses.
EXERCISING is a collection of movement patterns found in sports. This includes push-ups and burpees (including jumping). These represent movements with high rates of change in speed that are rarely observed in pedestrian monitoring scenes and thus may pose a challenge for tracking algorithms.
STANDING, WALKING, and RUNNING correspond
to three intervals of speed during bipedalism.
RANDOM is a special case where one actor tries
to move as unpredictably as possible while including
changes of speed and course. This aims at stressing
physical motion models included in many tracking al-
gorithms.
EXERCISING: one or more persons are doing exer-
cise-like activity.
KNEELING, STOOPED, WAVING: one or more per-
sons are in the specified pose.
STANDING, WALKING, RUNNING: one or more per-
sons are moving in the specified manner.
RANDOM: one or more persons are abruptly chang-
ing direction and speed, including backwards
movement.
5.4.5 Interaction
Crossing paths require two persons, but the number
of possible variations is limited. We consider the fol-
lowing patterns to be relevant:
1. person A and person B walk past each other
(CROSSING)
2. person A and person B walk in the same direction
(SIDEBYSIDE)
3. person A or person B change their direction at the
point of intersection (MEET)
Parameters are the angle towards the camera and be-
tween paths, and the speed. With three persons the
possible number of variations increases drastically.
For this dataset, we managed to gather four persons.
CROSSING: the paths of two or more persons are
crossing in temporal proximity.
MEET: two or more persons are meeting and parting
again, paths are converging then diverging.
SIDEBYSIDE: two or more persons are moving side
by side at the same speed.
HELIX: two persons try to circle around each other
so that in a (x,y,t)-diagram, i.e. a 3D plot of the
resulting event point cloud, their paths resemble a helix.
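For a quick qualitative look at such interaction patterns, the labeled event cloud can be rendered as a 3D scatter plot. The following matplotlib sketch uses the event layout assumed in the introduction and is not part of the dataset tooling:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)


def plot_event_cloud(events, labels, subsample=50):
    """Scatter the (x, y, t) event cloud, colored by instance label."""
    idx = np.arange(0, len(events), subsample)      # thin out for plotting speed
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(events["x"][idx], events["t"][idx], events["y"][idx],
               c=labels[idx], s=1, cmap="tab10")
    ax.set_xlabel("x [px]")
    ax.set_ylabel("t")
    ax.set_zlabel("y [px]")
    plt.show()
```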
5.4.6 Different Sizes
As mentioned before, the actors are of similar size.
Still the projected size varies greatly with distance to
the camera due to perspective. While a person in the
center of the scene has a projected size of approxi-
mately 52 px in the DVS event stream, at the far end
of the scene a person is only 29 px in size. Sequences
with persons at the far end of the scene are marked
with FAR.
In conclusion, persons are included in the dataset
in multiple sizes. Of course, the variation in size cor-
responds directly to the position along the vertical
axis in the stream.
FAR: one or more persons are near the distant foot-
way in the scene.
5.5 Statistics
The dataset subset constituting quality class 1 is sum-
marized in Tables 3, 4, and 5. In these tables, all dura-
tions are given in seconds and rounded to the nearest
integer. The supplement contains detailed informa-
tion for all quality classes and annotations. Overall,
the cumulative duration of sequences in quality 1 is
Table 3: Cumulative durations per color combination in quality class 1 (in seconds; sums are given per number of simultaneously active persons).
0 persons (background): 300 (sum 300)
1 person: RED 313, ORANGE 158, CYAN 215, BLUE 195 (sum 882)
2 persons: RED+ORANGE 213, RED+CYAN 41, RED+BLUE 125, ORANGE+CYAN 236, ORANGE+BLUE 76, CYAN+BLUE 9 (sum 701)
3 persons: RED+ORANGE+CYAN 496, RED+ORANGE+BLUE 123, RED+CYAN+BLUE 51, ORANGE+CYAN+BLUE 17 (sum 687)
4 persons: RED+ORANGE+CYAN+BLUE 350 (sum 350)
Table 4: Duration statistics per annotation in quality class 1 (durations in seconds).
annotation   number of sequences   mean duration   cumulative duration
RED 137 12.5 1713
ORANGE 123 13.6 1671
CYAN 101 14.0 1416
BLUE 77 12.3 946
BACKGROUND 46 6.5 300
STANDING 96 3.3 312
WALKING 441 5.1 2259
RUNNING 107 4.7 504
RANDOM 18 8.9 160
HELIX 9 7.2 64
FAR 35 4.5 157
approximately 56% of the total dataset which equates
to a length of 2920 seconds.
Table 3 summarizes the durations recorded per
color combination. It can be noted that all possible
color combinations are included. The sums per group give the cumulated durations for 0, 1, 2, 3, and 4 actors active at the same time. Recordings with two or three persons each contribute about 24% of the total dataset duration. Scenes with one person (≈ 30%) and with all four persons (≈ 12%), along with empty background scenes, form the remaining components.
Table 5: Occurrence statistics per annotation in quality class 1.
annotation   number of occurrences
OCCLUSION 136
EXERCISING 9
KNEELING 13
STOOPED 18
WAVING 9
CROSSING 212
MEET 48
SIDEBYSIDE 94
Table 4 gives an overview of the number of sequences and their duration in relation to the assigned
annotations. The cumulative duration is included in
a separate column for convenience. Due to the afore-
mentioned rounding, small discrepancies can occur.
Annotations describing short events in time are counted instead, see Table 5.
Considering the cumulative duration each actor
is present, RED, ORANGE and CYAN are included in
equal shares. Actor BLUE is slightly less frequent.
In line with the natural movement pattern of hu-
mans, the annotation WALKING is included signifi-
cantly more often than others.
6 CONCLUSION
Currently, there is still a lack of publicly available
event-based datasets. This is an obstacle in the de-
velopment and evaluation of event-based image pro-
cessing applications. We have described a software processing pipeline and its associated hardware setup
for an automated derivation of multi-person instance
annotations within DVS recordings. By exploiting
color features, highly accurate annotations are gen-
erated. This is even the case in error-prone scenarios
including intersecting movements and occlusions.
Additionally, we published a ready-to-use dataset that includes the scenarios and challenges
that arise in the context of multi-person tracking ap-
plications. This dataset can be used for further de-
velopment of event-based algorithms. As this dataset
provides annotations on the level of the individual
output events of the DVS, there are no event encoding
constraints which must be considered. This includes
approaches that operate directly on the 3D (x,y,t)
space-time event cloud of the DVS.
Potential challenges supported by this dataset therefore include, for example, the transfer and evaluation of 3D point-cloud-based algorithms for instance segmentation (Yang et al., 2019; Zhao and Tao, 2020) to event-based vision. The transition of graph-based approaches, such as those for object detection (Mondal et al., 2021) and recognition (Li et al., 2021) within DVS data, to tracking is also supported and an interesting topic for further work.
ACKNOWLEDGEMENTS
We thank Hans-Günter Hirsch for his support and par-
ticipation in the recording of the dataset.
Funding
This work was supported by the Federal Ministry
for Digital and Transport under the mFUND pro-
gram grant number 19F1115A as part of the project
‘TUNUKI’.
REFERENCES
Alonso, I. and Murillo, A. C. (2019). EV-SegNet: Seman-
tic Segmentation for Event-Based Cameras. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1624–
1633.
Alzugaray, I. and Chli, M. (2018). Asynchronous Cor-
ner Detection and Tracking for Event Cameras in
Real Time. IEEE Robotics and Automation Letters,
3(4):3177–3184.
Berthelon, X., Chenegros, G., Finateu, T., Ieng, S.-H., and
Benosman, R. (2018). Effects of Cooling on the SNR
and Contrast Detection of a Low-Light Event-Based
Camera. IEEE Transactions on Biomedical Circuits
and Systems, 12(6):1467–1474.
Bisulco, A., Cladera Ojeda, F., Isler, V., and Lee, D. D.
(2020). Near-Chip Dynamic Vision Filtering for Low-
Bandwidth Pedestrian Detection. In 2020 IEEE Com-
puter Society Annual Symposium on VLSI (ISVLSI),
pages 234–239.
Bolten, T., Pohle-Fröhlich, R., and Tönnies, K. D. (2021).
DVS-OUTLAB: A Neuromorphic Event-Based Long
Time Monitoring Dataset for Real-World Outdoor
Scenarios. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 1348–1357.
Camuñas-Mesa, L. A., Serrano-Gotarredona, T., Ieng, S.-
H., Benosman, R., and Linares-Barranco, B. (2018).
Event-Driven Stereo Visual Tracking Algorithm to
Solve Object Occlusion. IEEE Transactions on Neu-
ral Networks and Learning Systems, 29(9):4223–
4237.
Chen, G., Cao, H., Ye, C., Zhang, Z., Liu, X., Mo, X., Qu,
Z., Conradt, J., Röhrbein, F., and Knoll, A. (2019).
Multi-Cue Event Information Fusion for Pedestrian
Detection With Neuromorphic Vision Sensors. Fron-
tiers in Neurorobotics, 13:10.
de Tournemire, P., Nitti, D., Perot, E., Migliore, D., and
Sironi, A. (2020). A Large Scale Event-based Detec-
tion Dataset for Automotive. arXiv, abs/2001.08499.
Guo, M., Huang, J., and Chen, S. (2017). Live demon-
stration: A 768 × 640 pixels 200Meps dynamic vi-
sion sensor. In 2017 IEEE International Symposium
on Circuits and Systems (ISCAS), pages 1–1.
Hu, Y., Liu, H., Pfeiffer, M., and Delbruck, T. (2016).
DVS Benchmark Datasets for Object Tracking, Ac-
tion Recognition, and Object Recognition. Frontiers
in Neuroscience, 10.
Hu, Y., Liu, S.-C., and Delbruck, T. (2021). v2e: From
Video Frames to Realistic DVS Events. In 2021
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1312–
1321.
Islam, M. Z., Islam, M., and Rana, M. S. (2015). Prob-
lem Analysis of Multiple Object Tracking System: A
Critical Review. International Journal of Advanced
Research in Computer and Communication Engineer-
ing, 4:374–377.
Jiang, Z., Xia, P., Huang, K., Stechele, W., Chen, G., Bing,
Z., and Knoll, A. (2019). Mixed Frame-/Event-Driven
Fast Pedestrian Detection. In 2019 International Con-
ference on Robotics and Automation (ICRA), pages
8332–8338.
Li, Y., Zhou, H., Yang, B., Zhang, Y., Cui, Z., Bao, H., and
Zhang, G. (2021). Graph-based Asynchronous Event
Processing for Rapid Object Recognition. In 2021
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 914–923.
Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., and Kim,
T.-K. (2021). Multiple Object Tracking: A Literature
Review. Artificial Intelligence, 293:103448.
Marcireau, A., Ieng, S.-H., Simon-Chane, C., and Benos-
man, R. B. (2018). Event-Based Color Segmenta-
tion With a High Dynamic Range Sensor. Frontiers
in Neuroscience, 12.
Miao, S., Chen, G., Ning, X., Zi, Y., Ren, K., Bing, Z., and
Knoll, A. (2019). Neuromorphic Vision Datasets for
Pedestrian Detection, Action Recognition, and Fall
Detection. Frontiers in Neurorobotics, 13:38.
Moeys, D. P., Corradi, F., Li, C., Bamford, S. A.,
Longinotti, L., Voigt, F. F., Berry, S., Taverni, G.,
Helmchen, F., and Delbruck, T. (2018). A Sensitive
Dynamic and Active Pixel Vision Sensor for Color or
Neural Imaging Applications. IEEE Transactions on
Biomedical Circuits and Systems, 12(1):123–136.
Mondal, A., R, S., Giraldo, J. H., Bouwmans, T., and
Chowdhury, A. S. (2021). Moving Object Detection
for Event-based Vision using Graph Spectral Clus-
tering. In 2021 IEEE/CVF International Conference
on Computer Vision Workshops (ICCVW), pages 876–
884.
Mueggler, E., Bartolozzi, C., and Scaramuzza, D.
(2017a). Fast Event-based Corner Detection. In
Kim, T.-K., Zafeiriou, S., Brostow, G., and Mikolajczyk, K., editors, Proceedings of the British Ma-
chine Vision Conference (BMVC), pages 33.1–33.11.
BMVA Press.
Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., and
Scaramuzza, D. (2017b). The Event-Camera Dataset
and Simulator: Event-based Data for Pose Estimation,
Visual Odometry, and SLAM. The International Jour-
nal of Robotics Research, 36(2):142–149.
Nozaki, Y. and Delbruck, T. (2017). Temperature and Para-
sitic Photocurrent Effects in Dynamic Vision Sensors.
IEEE Transactions on Electron Devices, 64(8):3239–
3245.
Ojeda, F. C., Bisulco, A., Kepple, D., Isler, V., and Lee,
D. D. (2020). On-Device Event Filtering with Bi-
nary Neural Networks for Pedestrian Detection Using
Neuromorphic Vision Sensors. In 2020 IEEE Interna-
tional Conference on Image Processing (ICIP), pages
3084–3088.
Piątkowska, E., Belbachir, A. N., Schraml, S., and Gelautz,
M. (2012). Spatiotemporal multiple persons tracking
using dynamic vision sensor. In 2012 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition Workshops, pages 35–40.
Taverni, G., Paul Moeys, D., Li, C., Cavaco, C., Motsnyi,
V., San Segundo Bello, D., and Delbruck, T. (2018).
Front and Back Illuminated Dynamic and Active Pixel
Vision Sensors Comparison. IEEE Transactions on
Circuits and Systems II: Express Briefs, 65(5):677–
681.
Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar,
B. B. G., Geiger, A., and Leibe, B. (2019). MOTS:
Multi-Object Tracking and Segmentation. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 7934–7943.
Xu, Y., Zhou, X., Chen, S., and Li, F. (2019). Deep Learn-
ing for Multiple Object Tracking: A Survey. IET Com-
puter Vision, 13(4):355–368.
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham,
A., and Trigoni, N. (2019). Learning Object Bound-
ing Boxes for 3D Instance Segmentation on Point
Clouds. In Wallach, H., Larochelle, H., Beygelzimer,
A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc.
Zhao, L. and Tao, W. (2020). JSNet: Joint Instance and
Semantic Segmentation of 3D Point Clouds. Proceed-
ings of the AAAI Conference on Artificial Intelligence,
34(07):12951–12958.