Semantic Segmentation on Neuromorphic Vision Sensor Event-Streams Using PointNet++ and UNet Based Processing Approaches

Tobias Bolten¹ (https://orcid.org/0000-0001-5504-8472), Regina Pohle-Fröhlich¹ and Klaus D. Tönnies²
¹ Institute for Pattern Recognition, Hochschule Niederrhein, Krefeld, Germany
² Department of Simulation and Graphics, University of Magdeburg, Germany

Keywords: Dynamic Vision Sensor, Semantic Segmentation, PointNet++, UNet.
Abstract:
Neuromorphic Vision Sensors, which are also called Dynamic Vision Sensors, are bio-inspired optical sensors
which have a completely different output paradigm compared to classic frame-based sensors. Each pixel of
these sensors operates independently and asynchronously, detecting only local changes in brightness. The out-
put of such a sensor is a spatially sparse stream of events, which has a high temporal resolution. However, the
novel output paradigm raises challenges for processing in computer vision applications, as standard methods
are not directly applicable to the sensor output without conversion.
Therefore, we consider different event representations by converting the sensor output into classical 2D frames,
highly multichannel frames, 3D voxel grids as well as a native 3D space-time event cloud representation. Us-
ing PointNet++ and UNet, these representations and processing approaches are systematically evaluated to
generate a semantic segmentation of the sensor output stream. This involves experiments on two different
publicly available datasets within different application contexts (urban monitoring and autonomous driving).
In summary, PointNet++ based processing has been found advantageous over a UNet approach on lower reso-
lution recordings with a comparatively lower event count. On the other hand, for recordings with ego-motion
of the sensor and a resulting higher event count, UNet-based processing is advantageous.
1 INTRODUCTION
The Dynamic Vision Sensor (DVS), which stems
from the research field of neuromorphic engineering,
emulates key aspects of the human retina. This results
in a basically different output paradigm compared to
classic image sensors. A DVS does not operate at a
fixed frame rate. Only local brightness changes in the
scene are detected and directly transmitted. For this
purpose, the pixels of a DVS work independently and
asynchronously from each other. An output is gener-
ated as soon as a change in brightness above a defined
threshold has been detected.
In this context, the triggering of a single DVS-
pixel is called an “event”. Each of these events is
a tuple (x, y, t, p) which contains information about
the spatial (x, y) position of the active pixel in the
sensor array, a very precise timestamp t of trigger-
ing and a polarity indicator p, which encodes the di-
rection of the brightness change (from bright to dark
or vice versa). The output of a DVS is therefore an
information-rich, sparse stream of events with a vari-
able data rate that depends directly on the change in
the scene. An example of this stream is shown in Fig-
ure 1.
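For illustration only, such a stream can be held as a structured NumPy array; the field names, dtypes and the microsecond unit below are assumptions for the sketch, not a fixed sensor format.

```python
import numpy as np

# Hypothetical event layout; names and units are illustrative assumptions.
event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column in the sensor array
    ("y", np.uint16),   # pixel row in the sensor array
    ("t", np.int64),    # timestamp of triggering, e.g. in microseconds
    ("p", np.int8),     # polarity: +1 ("on", dark-to-bright) or -1 ("off")
])

# A tiny example stream: three events with a variable data rate
events = np.array([(10, 42, 1_000, 1),
                   (11, 42, 1_250, -1),
                   (10, 43, 1_900, 1)], dtype=event_dtype)
```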
This DVS operating paradigm results in advantageous characteristics in terms of high temporal resolution, low data redundancy, low power consumption and a very high dynamic range, which can be very useful
in outdoor measurement scenarios like monitoring or
autonomous driving.
The output of a Dynamic Vision Sensor is fun-
damentally different from a standard camera due
to the described operation paradigm (synchronous
and dense frame vs. asynchronous sparse event
stream). Therefore, well-established computer vision
approaches are not directly and natively applicable. In
this work, we evaluate different event representations
and deep learning networks to generate a multi-class
semantic segmentation of DVS event data.
For this challenge, we consider variants of con-
verting the DVS stream into single or multi-channel
images, a 3D voxelization approach as well as the di-
rect interpretation of the event stream as a 3D event
point cloud. We summarize our main contributions as
follows:
Figure 1: Visualization of the DVS output concept (event polarity is color-coded; "on" in green and "off" in red): (a) observed stimulus; (b) triggered DVS event stream.
- the consideration of 3D DVS space-time event cloud processing (Wang et al., 2019) and the extension of (Bolten et al., 2022) to another dataset in the application field of autonomous driving,
- a systematic comparison of single-channel and highly multi-channel 2D event representations as well as 3D event stream voxelization,
- and the evaluation of a UNet (Ronneberger et al., 2015) network structure to generate a semantic segmentation on these representations.
The pre-processed datasets as well as the generated network predictions are available for download (http://dnt.kr.hsnr.de/DVS-UNetSemSeg/) to support further comparisons and research.
The rest of this paper is structured as follows. Sec-
tion 2 outlines related work on the semantic segmentation of neuromorphic event data. The evaluated event
representations and deep learning approaches are ex-
plained in Section 3. Section 4 presents the used
datasets, the performed data pre-processing, the train-
ing configurations and summarizes the evaluation re-
sults we obtained. Finally, a brief summary is pro-
vided in Section 5.
2 RELATED WORK
Although Dynamic Vision Sensors are a relatively
new type of sensor technology, they are already being
used in a variety of applications. For example, this
includes real-time vibration measurements and con-
trol applications (Dorn et al., 2017) related to indus-
trial applications, applications in the context of au-
tonomous driving (Chen et al., 2020) or the use in
space surveillance applications (McMahon-Crabtree
and Monet, 2021). The goal of this work is to derive a
semantic segmentation of the DVS event stream. This
means that an object class label will be assigned to each
DVS event.
In (Sekikawa et al., 2019) the authors introduced the so-called EventNet. It is a neural network designed for processing asynchronous event streams in an event-wise manner and is capable of producing a semantic segmentation. Their approach is based on an adaptation of a single PointNet structure (Qi et al., 2017a), which is made real-time capable by recursive processing of events and precomputed look-up tables. By design, it is not capable of extracting hierarchical features from the data. For this reason, and because we are not targeting real-time capability, we did not consider this approach further. Instead, we examine PointNet's successor, PointNet++, in its vanilla form.
The EvDistill approach presented by (Wang et al., 2021) is based on a student-teacher network design to overcome the hurdle of missing large-scale, high-quality labeled datasets. A teacher network is trained on large-scale, labeled image data, while the student network learns on unlabeled and unpaired event data by knowledge distillation. Because not all datasets considered in our work provide classical images (the source modality) for the teacher, we have not considered this approach further.
An Xception based encoder-decoder architecture
is used in EV-SegNet by (Alonso and Murillo, 2019)
to obtain a semantic segmentation. For this purpose, a
dense 6-channel 2D frame representation of the event
stream is used. They also provide a labeled dataset
from the autonomous driving domain. We use this
dataset and compare our UNet-based results with their
results.
In (Bolten et al., 2022) an evaluation of a semantic
event-wise segmentation utilizing different data scal-
ing variations and network configurations based on
PointNet++ is presented. In our work, we extend
this comparison to another dataset. Furthermore, we
consider more event representations and replace their MaskRCNN based 2D reference processing with UNet based approaches.
In the literature, DVS event streams are often pro-
cessed by converting them to classic 2D frame rep-
resentations, which are then further processed using
well-known off-the-shelf computer vision techniques.
For example, frame conversion is performed in (Chen
et al., 2019; Jiang et al., 2019; Wan et al., 2021), and
then a Yolo-based approach is used. Furthermore, a variety of other possible event representations have been developed. A larger set of these representations is considered and examined in this study; further details as well as literature references are given in Subsection 3.1.
Figure 2: Graphical rendering of the considered basic event representation ideas: (a) (x × y × 1) single gray 2D frame representation; (b) (x × y × t_channel) highly multi-channel 2D frame built by time-splitting; (c) (x × y × t_bin × 1) 3D voxelization (half shown, for better clarity); (d) plain 3D (x, y, t) space-time event cloud.
3 PROPOSED METHOD
The DVS event representations used in this work as
well as the 2D and 3D deep learning network struc-
tures are presented and outlined in the following.
3.1 Event Representations
The event output stream from a Dynamic Vision Sen-
sor is often converted into alternative representations
for processing. In our work, we consider the following 2D and 3D representations and compare the results achieved in the subsequent processing; a minimal conversion sketch is given at the end of this subsection.
2D Frame Representation: The DVS event stream
is converted to classic 2D frame representations
by projection along the time axis. Typically, ei-
ther a fixed time window or a fixed number of
events is considered to construct the frame (Liu
and Delbrück, 2018). For this conversion, there
are a variety of encoding rules that also aim to
consider the time resolution included in the DVS
stream (Lagorce et al., 2017; Mitrokhin et al.,
2018).
As a baseline for comparing the following encod-
ings, we consider only the binary projection of the
event stream as a pure 2D frame encoding in this
work (compare with Figure 2a), resulting in a rep-
resentation of shape (x × y × 1).
Multi-Channel 2D Frame Representation: Within
this projection of the DVS stream, classical 3-
channel RGB images could also be generated, e.g.
by color coding of the event polarities. In addi-
tion, there are also various other approaches that
encode different aspects of the DVS stream in
non-intuitive multichannel images. The “Merged-
Three-Channel” representation defined in (Wan
et al., 2021) is an example of such an encoding.
In this representation, information about the event
frequency, the timestamps and continuity are rep-
resented in individual image channels.
Within our work, we aim to better represent and
exploit the temporal context of the DVS stream.
Therefore, we consider as an input encoding
a highly multidimensional representation of the
DVS data. Here, the time component of the signal
is separated and stored in many channels during
projection, resulting in a representation of shape
(x × y × t_channel). A visualization is given in Figure 2b.
3D Voxel-Grid Representation: Another approach
to maintain and better preserve the high temporal
resolution of the DVS event stream is the inter-
pretation as a 3D spatio-temporal volume. Vox-
elization of this volume encodes the distribution
of events within the spatio-temporal domain (Zhu
et al., 2019; Chaney et al., 2019). This type of
representation is also often used as an intermedi-
ate encoding to convert the event stream into other
forms, such as graphs (Deng et al., 2022).
We form time voxels per pixel in our work, as we discretize the time dimension per pixel into t_bin bins to include and consider fine spatial structures. This discretization leads to a data representation of shape (x × y × t_bin × 1) which encodes the occurrence of an event per voxel. A visualization is given in Figure 2c.
3D (x, y, t) Space Time Event Cloud Representa-
tion: In the previously described representation
as a voxel grid, the sparsity of the event stream is
lost in the encoding. This property and the high
resolution of the time information is preserved
when interpreting the DVS data as a 3D (x, y, t)
space-time event cloud. In this way, the spatio-
temporal information is directly encoded as geo-
metric neighborhood information (compare with
Figure 2d).
Figure 3: Summary of the PointNet++ processing concept: (a) farthest point sampling (input from Figure 2d); (b) subset of ball-query neighborhood selections; (c) abstracted event set of layer 1; (d) abstracted event set of layer 2.
By applying point set processing methods, such
as PointNet++ (Qi et al., 2017b), DVS data can be
processed directly (Wang et al., 2019; Sekikawa
et al., 2019; Mitrokhin et al., 2020; Bolten et al.,
2022).
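To make the two extreme representations concrete, the following minimal NumPy sketch converts a slice of an event stream into the binary 2D frame of Figure 2a and the plain (x, y, t) event cloud of Figure 2d. The structured event layout and the 255/0 encoding follow the conventions used later in Section 4.2; all function and field names are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

# Assumed event layout (hypothetical field names, timestamps in microseconds)
event_dtype = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.int64), ("p", np.int8)])
events = np.array([(5, 7, 1_000, 1), (5, 7, 30_500, -1), (90, 40, 59_000, 1)],
                  dtype=event_dtype)

def to_binary_frame(ev, width, height):
    """Project the slice along the time axis: an (x, y, 1) frame with 255
    where at least one event was triggered and 0 elsewhere (Figure 2a)."""
    frame = np.zeros((height, width, 1), dtype=np.uint8)
    frame[ev["y"], ev["x"], 0] = 255
    return frame

def to_event_cloud(ev):
    """Interpret the slice as a plain 3D (x, y, t) space-time event cloud,
    i.e. an (N, 3) point array for point set processing (Figure 2d)."""
    return np.stack([ev["x"], ev["y"], ev["t"]], axis=1).astype(np.float64)

frame = to_binary_frame(events, width=192, height=128)  # DVS-OUTLAB ROI size
cloud = to_event_cloud(events)                          # shape (3, 3) for 3 events
```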
3.2 Network Architectures
In this work we compare 3D and 2D event representa-
tions and processing approaches using the following
deep learning networks:
PointNet++. Hierarchical feature learning
PointNet++ (Qi et al., 2017b) learns a spatial en-
coding of point cloud data. For this purpose, the
input data is hierarchically divided and summa-
rized. By repeatedly applying a simple PointNet (Qi et al., 2017a) as a feature extractor, local and global features are built and combined. This finally results in a representation of the entire point cloud.
This hierarchical process is realized by the so-called Set Abstraction (SA) layers of the network. First, representative centroid points of local regions are selected by farthest point sampling (Figure 3a). Subsequently, local neighboring points are selected around these centroids. By default, this is performed via a ball query, which finds an upper-bounded number of points within a defined radius (Figure 3b). The extracted pattern feature vectors of these local regions are geometrically represented by the centroid coordinates (Figure 3c for the first and 3d for the subsequent second layer). This approach of creating common structure partitions allows sharing the weights of the feature extractors per network layer, which leads to relatively small networks. A sketch of the sampling and grouping steps is given at the end of this subsection.
In the case of semantic segmentation, the resulting
features are finally interpolated by a Feature Prop-
agation Layer (FP) to produce point-wise values.
UNet. Convolutional Networks for Biomedical Im-
age Segmentation
The UNet architecture (Ronneberger et al., 2015) has its origin in medical image segmentation, but has been successfully applied to a variety of application fields (Pohle-Fröhlich et al., 2019; McGlinchy et al., 2019; Liu and Qian, 2021). It is a convolutional neural network that produces a precise pixel-by-pixel segmentation.
The architecture is divided into an encoder and a decoder part. Within the encoder, the spatial resolution is reduced by convolution and max-pooling, while the number of feature channels is increased. This extracts high-resolution as well as deep contextual features. In the second part, the decoder, the original resolution is restored by upsampling. By increasing the resolution of the output in this way, the decoder learns to create an output with precise localization. The UNet architecture combines the feature channels of the downsampling and upsampling paths via skip connections, allowing the network to propagate and combine context and localization information.
The visualization of an example configuration is
given in Figure 5.
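Returning to the PointNet++ side, the sampling and grouping steps of a Set Abstraction layer (Figure 3a and 3b) can be sketched as follows. This is a minimal NumPy illustration of the idea only, not the reference PointNet++ implementation; the padding strategy and the toy scaling of the time axis are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, n_centroids):
    """Greedily select n_centroids indices from an (N, 3) (x, y, t) event cloud
    such that each new centroid is maximally far from the previous ones."""
    n = points.shape[0]
    selected = np.zeros(n_centroids, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = 0                           # arbitrary starting event
    for i in range(1, n_centroids):
        d = np.sum((points - points[selected[i - 1]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)    # squared distance to nearest centroid
        selected[i] = int(np.argmax(min_dist))
    return selected

def ball_query(points, centroid_idx, radius, max_samples):
    """For each centroid, gather up to max_samples event indices within the
    given radius; short groups are padded by repeating the first neighbour,
    as many common PointNet++ implementations do."""
    groups = []
    for c in points[centroid_idx]:
        idx = np.flatnonzero(np.sum((points - c) ** 2, axis=1) <= radius ** 2)
        idx = idx[:max_samples]
        pad = np.full(max_samples - idx.size, idx[0], dtype=np.int64)
        groups.append(np.concatenate([idx, pad]))
    return np.stack(groups)

# Toy usage: 4096 events -> 2048 centroids, 32 neighbours each; the time axis
# is assumed to be pre-scaled to a spatial-like range before sampling.
cloud = np.random.rand(4096, 3) * [192, 128, 128]
centroids = farthest_point_sampling(cloud, 2048)
groups = ball_query(cloud, centroids, radius=9.6, max_samples=32)
```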
4 EXPERIMENTS
Initially, the used datasets are introduced and the per-
formed data preprocessing is summarized. In the fol-
lowing, the hyper-parameters and the specific network
layer and training configurations are presented. Sub-
sequently, the used metric is described and the achieved results are presented, including a brief discussion and summary.
4.1 Datasets
In comparison to classical frame-based computer vi-
sion, there is currently a significantly lower number of
event-based datasets available. This is particularly evident with the requirement for annotations at the level of semantic segmentation.
Figure 4: False-color DVS-OUTLAB class examples (reproduced from (Bolten et al., 2022) with permission from the authors): (a) PERSON, (b) DOG, (c) BICYCLE, (d) SPORTS BALL, (e) BIRD, (f) INSECT, (g) TREE CROWN, (h) TREE SHADOW, (i) RAIN, (j) BACKGROUND.
Most large-scale datasets like GEN1 (de Tourne-
mire et al., 2020) or even smaller datasets (Miao et al.,
2019) contain object labels only in the form of bounding boxes for object detection, but do not provide labels for a semantic segmentation.
We therefore limited our comparison to the two
datasets below, which provide those annotations in a
multi-class scenario.
DVS-OUTLAB: This dataset (Bolten et al., 2021)
contains recordings of a DVS-based long-time
monitoring of an urban outdoor place. For this
purpose, three CeleX-IV sensors (Guo et al.,
2017) were used. These recordings offer a total
spatial resolution of 768 × 512 pixels.
The dataset contains semantic label annotations
for about 47k regions of interest, separated into
70/15/15% sets for test, train and evaluation. Each
region of interest with a spatial size of 192 × 128
pixels contains events and labels for a sequence of
60ms length of the underlying DVS event stream.
The labeling takes 10 different classes into account, including several object classes as well as environmental noise originating from the outdoor measurement setup (compare with Figure 4). The labels are provided on a per-event basis.
Subset of DDD17 Sequences: The authors of the Ev-SegNet approach (Alonso and Murillo, 2019) published, together with their work, a subset of the DDD17 dataset (Binas et al., 2017) extended by semantic labels.
The DDD17 dataset contains sequences of recordings obtained from a moving car in traffic (compare to Figure 6). These recordings were taken with a DAVIS346B Dynamic Vision Sensor, offering a spatial resolution of 346 × 260 pixels. The data was cropped to 346 × 200 pixels, as the lower 60 pixel rows contained the dashboard of the car.
The dataset contains 15950 sequences for train-
ing and 3890 for testing, each corresponding to
a 50ms section of the event stream. For these
sequences, the authors automatically generated
pixel-wise semantic labels based on gray-scale
images from the DAVIS sensor by applying a
CNN. Six different classes were consid-
ered: (1) construction/sky, (2) objects (like street
signs or light poles), (3) nature (like trees), (4) hu-
mans, (5) vehicles and (6) street. These labels are
provided as dense 2D frames.
4.2 Data Preprocessing
The following pre-processing was performed to pre-
pare the datasets and generate the presented event representations:
Subset of DDD17 Sequences: The DVS event data was published by (Alonso and Murillo, 2019) in a form in which only a 2D frame representation of the event stream is directly available. Furthermore, the generated labels are also only available in the form of 2D frames. Thus, they are not directly usable for the generation of our proposed multi-channel, voxel or 3D space-time event cloud representations.
Therefore, utilizing the native DDD17 event
stream recordings, we first propagated the labels
of the EvSegNet subset back to the original event
stream. This results in annotations per event in
the form of (x, y, t, p, label). For each 50ms of
the event stream the corresponding 2D label was
transferred to all underlying events at the same
spatial position within this time window.
DVS-OUTLAB: The labeling of this dataset is al-
ready available in the form of a semantic annota-
tion per event. Therefore, no adaptation of the label
representation was necessary.
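A hedged sketch of the label back-propagation described above for the DDD17 subset, assuming the structured event layout used in the earlier sketches and one dense (H, W) label frame per 50 ms window; all names and values are illustrative.

```python
import numpy as np

def propagate_labels(events, label_frame):
    """Transfer the dense 2D label frame of one 50 ms window to all events
    of that window at the same spatial position, yielding per-event
    (x, y, t, p, label) annotations."""
    out_dtype = np.dtype(events.dtype.descr + [("label", np.int16)])
    out = np.empty(events.shape[0], dtype=out_dtype)
    for name in events.dtype.names:
        out[name] = events[name]
    out["label"] = label_frame[events["y"], events["x"]]
    return out

# Toy usage: two events inside a 4x4 label frame
event_dtype = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.int64), ("p", np.int8)])
events = np.array([(1, 2, 10_000, 1), (3, 0, 42_000, -1)], dtype=event_dtype)
label_frame = np.zeros((4, 4), dtype=np.int16)
label_frame[2, 1] = 5                              # e.g. class id 5 at (x=1, y=2)
labeled = propagate_labels(events, label_frame)    # labels: [5, 0]
```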
The 3D (x, y, t) space-time event cloud representation is built natively and directly from the event stream. For the remaining 2D and voxel representations, an intermediate numpy array was calculated. For this purpose, a 3D voxel histogram was generated per pixel position, splitting the time axis into 64 components (t_channel and t_bin in Figure 2). The labels were transformed into an equivalent voxel form. By applying numpy operations, a convenient transformation into the proposed 2D representations is possible (the fast transformation from the 3D voxels to the proposed 2D frame representations can be done by simple reshape and/or amax operations). This pre-processed data is available for download.
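A hedged sketch of this intermediate step, under the same assumed event layout as before: a per-pixel 64-bin time histogram, from which the voxel grid and the 2D frame representations are derived by simple thresholding and amax operations. The bin count, window length and 255/0 encoding follow the text; function and variable names are illustrative.

```python
import numpy as np

event_dtype = np.dtype([("x", np.uint16), ("y", np.uint16),
                        ("t", np.int64), ("p", np.int8)])
events = np.array([(5, 7, 1_000, 1), (5, 7, 30_500, -1), (90, 40, 59_000, 1)],
                  dtype=event_dtype)

def voxel_histogram(ev, width, height, t_start, duration_us, t_bins=64):
    """Per-pixel time histogram of shape (H, W, t_bins): each event is counted
    in the time bin its timestamp falls into."""
    b = ((ev["t"] - t_start) * t_bins // duration_us).astype(np.int64)
    b = np.clip(b, 0, t_bins - 1)
    hist = np.zeros((height, width, t_bins), dtype=np.uint16)
    np.add.at(hist, (ev["y"], ev["x"], b), 1)   # unbuffered accumulation
    return hist

hist = voxel_histogram(events, width=192, height=128, t_start=0, duration_us=60_000)

# (x, y, t_bin, 1) occupancy voxel grid for the 3D UNet
voxels = ((hist > 0).astype(np.uint8) * 255)[..., np.newaxis]

# (x, y, t_channel) highly multi-channel 2D frame: one channel per time bin
frame_64ch = (hist > 0).astype(np.uint8) * 255

# (x, y, 1) single-channel binary frame via an amax projection over time
frame_1ch = np.amax(frame_64ch, axis=-1, keepdims=True)
```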
Figure 5: UNet 2D configuration example for DVS-OUTLAB dataset (10 object classes plus the “void” for regions without
any event).
Table 1: PointNet++ configuration summary (compare to syntax used in (Qi et al., 2017b)).

DVS-OUTLAB, PNet++(4096, 3L):
  SA(2048, 9.6, [32, 32, 64]), SA(256, 28.8, [64, 64, 128]), SA(16, 76.8, [128, 128, 256]),
  FP([256, 256]), FP([256, 128]), FP([128, 128, 128, 128, 10])

DDD17, PNet++(4096, 5L): SA(2048, 17.3, [32, 32, 64])
DDD17, PNet++(8192, 5L): SA(4096, 17.3, [32, 32, 64])
  each followed by:
  SA(1024, 34.6, [64, 64, 128]), SA(256, 69.2, [128, 128, 256]), SA(64, 103.8, [256, 256, 512]),
  SA(16, 138.4, [512, 512, 1024]), FP([256, 256]), FP([256, 128]), FP([256, 256]),
  FP([256, 128]), FP([128, 128, 128, 128, 6])
The division into 64 time components was selected so that, in the subsequent UNet processing, the choice of the input dimension as a power of two leads to integer dimensions in the downsampling and upsampling logic. Furthermore, the spatial resolution was extended by zero-padding into quadratic inputs for the UNet processing. This results in a resolution of 192 × 192 pixels for the DVS-OUTLAB data and of 346 × 346 pixels for the DDD17 subset. In the representations, the presence or absence of events per spatial position is encoded by the numerical value 255 or 0, respectively.
Moreover, we created and tested two variations for each of the datasets: one plain version that includes all events, and one that was spatio-temporally pre-filtered to reduce the included sensor background noise and to estimate the effect of noise reduction on the semantic processing. For this purpose, a time filter was applied that removes all events that were not supported by another event at the same (x, y) coordinate within the previous 10 ms. This type of filter has shown a reasonable trade-off between noise reduction and preservation of object events (compare to the evaluation performed in (Bolten et al., 2021)).
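The described rule can be sketched as follows; this is a straightforward reimplementation for illustration, not the filter code used in (Bolten et al., 2021), and it assumes the structured events array from the earlier sketches, sorted by timestamp.

```python
import numpy as np

def time_filter(events, support_us=10_000):
    """Keep an event only if another event occurred at the same (x, y)
    coordinate within the previous support_us microseconds.
    Assumes the structured events array is sorted by t."""
    last_seen = {}                                  # (x, y) -> previous timestamp
    keep = np.zeros(events.shape[0], dtype=bool)
    for i, ev in enumerate(events):
        key = (int(ev["x"]), int(ev["y"]))
        prev = last_seen.get(key)
        keep[i] = prev is not None and int(ev["t"]) - prev <= support_us
        last_seen[key] = int(ev["t"])
    return events[keep]

# e.g. filtered = time_filter(events)   # 'events' as in the earlier sketches
```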
4.3 Training
The following network layer configurations and train-
ing hyper-parameters were used in our experiments:
PointNet++. For the PointNet++ training configura-
tion we followed the selected hyper-parameters
from (Bolten et al., 2022). This results in using the Adam optimizer with a learning rate of 0.001 and an exponential decay rate of 0.99 every 200,000 training steps. The batch size was set to 16 space-time event clouds.
For the DVS-OUTLAB dataset we also follow their data patching and scaling scheme (S^T_native), layer depth and set abstraction configuration. In the case of the DDD17 data, we adapted the network configuration due to the larger spatial input dimension (346 × 200 pixels vs. 192 × 128 pixels per region) to address the resulting higher event count. Additionally, we trained and tested two PointNet++ configurations with a prior subsampling to 8192 and 4096 events, respectively.
The specific PointNet++ configuration used for
training is summarized in Table 1.
UNet. The UNet trainings were carried out utilizing
an Adam optimizer with a learning rate of 0.001
and an exponential decay with a rate of 0.99 af-
ter each epoch. The batch size was set to 6 sam-
ples. A sparse categorical cross entropy weighted
by the class occurrence frequency was chosen as
the loss function to address the class imbalances
included in the datasets.
Figure 6: Ev-SegNet DDD17 subset: dense labels compared to the sparse event stream: (a) 2D labels; (b) corresponding events; (c) 2D labels; (d) corresponding events.
In all performed UNet experiments, the model is built with a depth of 4 layers and 16 filters in the first block, which are doubled in each subsequent block. The kernel size of the 2D and 3D convolutions, respectively, was set to three. An example 2D UNet configuration is shown in Figure 5 as reference.
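For illustration, a minimal PyTorch sketch of a 2D UNet matching the stated configuration (depth 4, 16 filters in the first block doubled per level, kernel size 3, skip connections). The exact block arrangement corresponds to Figure 5, which is not reproduced here, so the layout below is an assumption; the loss naming in the text suggests a Keras setup, and PyTorch is used here only as an assumed equivalent with dummy class weights.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # two 3x3 convolutions with ReLU, the classic UNet building block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class UNet2D(nn.Module):
    def __init__(self, in_channels=64, n_classes=11, base=16, depth=4):
        super().__init__()
        chs = [base * 2 ** i for i in range(depth)]              # [16, 32, 64, 128]
        self.down = nn.ModuleList()
        c_prev = in_channels
        for c in chs:
            self.down.append(conv_block(c_prev, c))
            c_prev = c
        self.pool = nn.MaxPool2d(2)
        self.bottom = conv_block(chs[-1], chs[-1] * 2)
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        c_prev = chs[-1] * 2
        for c in reversed(chs):
            self.up.append(nn.ConvTranspose2d(c_prev, c, kernel_size=2, stride=2))
            self.dec.append(conv_block(c * 2, c))
            c_prev = c
        self.head = nn.Conv2d(chs[0], n_classes, kernel_size=1)  # per-pixel logits

    def forward(self, x):
        skips = []
        for down in self.down:
            x = down(x)
            skips.append(x)                  # kept for the skip connection
            x = self.pool(x)
        x = self.bottom(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)

# Training setup roughly as described: Adam, lr 0.001, exponential decay 0.99
# per epoch, class-frequency weighted cross entropy (uniform dummy weights here).
model = UNet2D(in_channels=64, n_classes=11)   # 64-channel DVS-OUTLAB input, 10 classes + void
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
criterion = nn.CrossEntropyLoss(weight=torch.ones(11))
```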
4.4 Metric
In the literature, and this is also the case for event
cameras as in (Alonso and Murillo, 2019), the eval-
uation is often performed on the basis of dense 2D
frames. For each 2D pixel of the annotation the cor-
responding pixel value of the network prediction is
considered and compared. However, this type of evaluation ignores the basic property of a Dynamic Vision Sensor that the produced event stream is spatially sparse. This is clearly illustrated by Figure 6, which shows two scenes from the DDD17 data subset. Subfigures (a, b) show a scene with little or no movement, whereas Subfigures (c, d) show a scene with faster movement. Within a slow scene only a few events are triggered, and even with faster movements there are many areas where no or only a few events are triggered.
Networks operating on a sparse representation,
such as the used PointNet++, cannot predict results
at positions where no events were triggered. There-
fore, a proper comparison on this dense 2D basis is
not possible. Furthermore, this type of 2D comparison ignores the fact that at a single spatial (x, y) position multiple events could have been triggered within the selected time window. In our evaluation, we therefore consider only spatial positions where events were triggered. Furthermore, the number of triggered events for each predicted label is taken into account.
Table 2: Weighted-Avg F1 results on the DVS-OUTLAB dataset.

Network          Background   Objects   Env-Influences   Over-All
(a) PointNet++ reference results
PNet(4096, 3L)      0.968      0.816        0.853          0.936
(b) Unfiltered UNet results
UNet 2D             0.951      0.842        0.764          0.902
UNet 2D 64ch        0.958      0.847        0.780          0.912
UNet 3D Voxel       0.941      0.843        0.775          0.895
(c) Spatio-temporal filtered (time 10ms) UNet results
UNet 2D             0.925      0.838        0.757          0.868
UNet 2D 64ch        0.938      0.850        0.826          0.897
UNet 3D Voxel       0.928      0.843        0.809          0.883

In contrast to PointNet++, for the UNet based processing approaches it is possible that a class prediction occurs at a spatial position where no DVS event has been triggered (the "void" background). To account for this, we perform the following simple post-processing before evaluation:
If an object class prediction is made but no event is present (pred ≠ void, event = ∅), this prediction is interpreted as void and ignored. In case that no object class prediction is made but an event is present (pred = void, event ≠ ∅), this prediction is re-interpreted and considered as the dominating background class for evaluation (class background for DVS-OUTLAB, construction/sky for DDD17).
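A minimal sketch of this remapping; the class indices are hypothetical placeholders, as the actual label ids in the datasets may differ.

```python
import numpy as np

VOID = 0          # hypothetical id for "void" / no-event regions
BACKGROUND = 1    # hypothetical id of the dominating background class

def remap_void(pred, event_count):
    """pred: (H, W) predicted class ids; event_count: (H, W) number of DVS
    events per spatial position within the evaluated time window."""
    pred = pred.copy()
    pred[(pred != VOID) & (event_count == 0)] = VOID        # ignored in evaluation
    pred[(pred == VOID) & (event_count > 0)] = BACKGROUND   # re-interpreted
    return pred
```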
Considering the number of triggered DVS events, we then calculate the F1 score, which is defined as the harmonic mean of precision and recall:

F1-Score = TP / (TP + 0.5 · (FP + FN))

To summarize the results, we also calculated weighted averages taking the number of each class's support into consideration (Weighted-Avg F1).
For a fair comparison of the 3D and 2D methods (different counts of predictions and therefore a higher number of possible errors in the case of a higher output dimension), we equalize all generated predictions for evaluation. The 3D predictions are projected along
the t-axis by considering the most frequent prediction
at each spatial position for comparison with the 2D
results.
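Combining the described steps, a hedged sketch of the evaluation on toy data: the 3D predictions are collapsed by a per-position majority vote, and the F1 score is computed only at positions with triggered events, weighted by the event count. Array shapes and values are illustrative; scikit-learn's f1_score is used here for the weighted average, which is not necessarily the authors' tooling.

```python
import numpy as np
from sklearn.metrics import f1_score

n_classes = 11
rng = np.random.default_rng(0)
# toy stand-ins for one evaluated region (shapes/values are illustrative)
gt          = rng.integers(0, n_classes, size=(128, 192))      # per-position GT class
pred_3d     = rng.integers(0, n_classes, size=(128, 192, 64))  # 3D (voxel) prediction
event_count = rng.integers(0, 4, size=(128, 192))              # events per position

def project_majority(pred, n_classes):
    """Most frequent predicted class along the time axis per spatial position."""
    counts = np.stack([(pred == c).sum(axis=-1) for c in range(n_classes)], axis=-1)
    return counts.argmax(axis=-1)

pred_2d = project_majority(pred_3d, n_classes)

# evaluate only positions with triggered events, weighted by the event count
mask = event_count > 0
score = f1_score(gt[mask], pred_2d[mask],
                 sample_weight=event_count[mask], average="weighted")
print(f"Weighted-Avg F1: {score:.3f}")
```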
4.5 Evaluation
The evaluation results of the PointNet++ processing,
as well as the results for the different event repre-
sentations in combination with the 2D and 3D UNet
processing for the DVS-OUTLAB dataset are sum-
marized in Table 2. The PointNet++ based processing achieves better segmentation results on this dataset than the 2D or 3D voxel UNet processing. This is consistent with the PointNet++ and 2D MaskRCNN comparison presented in (Bolten et al., 2022).
The summary of results for the subset of labeled
DDD17 dataset sequences is given in Table 3. On this
dataset, PointNet++ processing achieves weaker results compared to the UNet variations and the dataset authors' Ev-SegNet reference. A noticeable differ-
ence exists here in the results of the class ”Objects”
in the PointNet++ based processing. This class of the
dataset contains, for example, lampposts, street signs
or traffic lights. Although the PointNet++ configurations used here were adjusted in the number of points considered in the input cloud and in the first Set Abstraction layer, as well as in the layer count itself, such fine details were apparently not fully captured.
Due to the high number of triggered events in the au-
tonomous driving context of this dataset (compared
to a static sensor in DVS-OUTLAB monitoring) and
the larger spatial input (346 × 200 pixels vs. 192 × 128 pixels), the encoder/decoder approach of the UNet processing seems to have an advantage.
The PointNet++ processing, on the other hand, relies on selecting sufficient representative events by farthest point sampling and forming corresponding neighborhoods. Compare with Figure 3, especially (a) and (b), which summarize the basic idea of PointNet++ processing.
However, when interpreting the overall results on the DDD17 subset, the quality of the given ground truth labels must be taken into account. These labels were
generated by (Alonso and Murillo, 2019) through an
automatic processing. Out of a total of nearly 12
hours of material from the DDD17 dataset, about 15
minutes were labeled in this way, and the GT labels
obtained are not completely accurate and consistent
over time. Figure 7 gives an example of included ar-
tifacts in the GT labels using two examples that are
separated by a short period of time. The annotations
of the included traffic sign and the tree (marked by
red arrows) varies, although, for example, the UNet
predictions are correct.
Overall, across both datasets, it can be observed that UNet-based processing achieves better results on unfiltered raw data than on spatio-temporally pre-filtered data. In general, an improvement of the UNet-based results can be observed with increasing dimensionality of the event representation used, whereas the use of the 3D voxel grid brings only minor additional differences in comparison.
Table 3: Results on the subset of the DDD17 dataset.

Network         Construction  Objects  Nature  Human  Vehicle  Street  Macro-Avg  Weighted-Avg
(a) Ev-SegNet baseline (Alonso and Murillo, 2019), metric recalculated to match the proposed evaluation
Ev-SegNet           0.916      0.229    0.712   0.670   0.850   0.727     0.696       0.876
(b) PointNet++ results
PNet(8192, 5L)      0.842      0.088    0.516   0.398   0.743   0.619     0.534       0.771
PNet(4092, 5L)      0.840      0.103    0.521   0.464   0.748   0.600     0.546       0.766
(c) Unfiltered UNet results
UNet 2D             0.886      0.266    0.686   0.577   0.835   0.703     0.659       0.829
UNet 2D 64ch        0.895      0.285    0.723   0.533   0.849   0.723     0.668       0.843
UNet 3D Voxel       0.898      0.301    0.729   0.572   0.847   0.719     0.678       0.843
(d) Spatio-temporal filtered (time 10ms) UNet results
UNet 2D             0.882      0.265    0.673   0.557   0.846   0.660     0.647       0.826
UNet 2D 64ch        0.896      0.289    0.713   0.590   0.862   0.681     0.672       0.846
UNet 3D Voxel       0.895      0.296    0.713   0.568   0.858   0.680     0.668       0.843
Figure 7: Visualization of the GT labeling quality of the DDD17 subset from (Alonso and Murillo, 2019) and predictions of the trained networks: (a) ground truth vs. prediction at t_i; (b) ground truth vs. prediction at t_i + 1.25 sec. Note the inconsistent GT labeling of the marked traffic sign and trees between the timestamps shown (best viewed in color and digitally zoomed).
5 CONCLUSION
The improvement in the UNet prediction quality using the highly multi-channel event representation over the single 2D frame variant indicates a benefit of the more complex representation. The use of 3D voxel grids also achieves good results (compare, for example, with the results shown in Figure 7). Unfortunately, the associated UNet network structure is larger due to the 3D convolutions and therefore slower at inference.
The sensor property of a DVS to produce an out-
put stream that is spatially sparse becomes particu-
larly clear when statistically examining the used voxel
representation. Over the two complete datasets, only about 0.45% of the voxels are populated by an event (correspondingly, 99.55% of the considered voxels are empty). A classical UNet based on simple 3D convolutions already achieves good results on this voxel representation, despite the fact that the application of these convolutions quickly grows and "blurs" the set of active (non-zero) features. Therefore, the use of sparse convolutions (Graham and van der Maaten, 2017) and the adaptation of the UNet into a sparse voxel network for semantic segmentation (Graham et al., 2018; Najibi et al., 2020) is an interesting task for future work.
In summary, differences between the evaluated
network structures have also emerged. A PointNet++
based processing is better suited for scenes without
ego-motion of the sensor, whereas for moving sen-
sors and the inclusion of larger spatial input patches a
UNet based processing has shown advantages.
ACKNOWLEDGEMENTS
We thank Christian Neumann for helpful discussions
and his support related to the UNet development and
experiments.
Funding
This work was supported by the European Re-
gional Development Fund under grant number
EFRE0801082 as part of the project "plsm" (https://plsm-project.com/).
REFERENCES
Alonso, I. and Murillo, A. C. (2019). EV-SegNet: Seman-
tic Segmentation for Event-Based Cameras. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1624–
1633.
Binas, J., Neil, D., Liu, S.-C., and Delbruck, T. (2017).
DDD17: End-To-End DAVIS Driving Dataset. In
ICML’17 Workshop on Machine Learning for Au-
tonomous Vehicles (MLAV 2017).
Bolten, T., Lentzen, F., Pohle-Fröhlich, R., and Tönnies,
K. D. (2022). Evaluation of Deep Learning based
3D-Point-Cloud Processing Techniques for Seman-
tic Segmentation of Neuromorphic Vision Sensor
Event-streams. In Proceedings of the 17th Interna-
tional Joint Conference on Computer Vision, Imag-
ing and Computer Graphics Theory and Applica-
tions - Volume 4: VISAPP, pages 168–179. INSTICC,
SciTePress.
Bolten, T., Pohle-Fröhlich, R., and Tönnies, K. D. (2021).
DVS-OUTLAB: A Neuromorphic Event-Based Long
Time Monitoring Dataset for Real-World Outdoor
Scenarios. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 1348–1357.
Chaney, K., Zhu, A. Z., and Daniilidis, K. (2019). Learning
Event-Based Height From Plane and Parallax. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1634–
1637.
Chen, G., Cao, H., Conradt, J., Tang, H., Rohrbein, F., and
Knoll, A. (2020). Event-Based Neuromorphic Vision
for Autonomous Driving: A Paradigm Shift for Bio-
Inspired Visual Sensing and Perception. IEEE Signal
Processing Magazine, 37(4):34–49.
Chen, G., Cao, H., Ye, C., Zhang, Z., Liu, X., Mo, X., Qu,
Z., Conradt, J., Röhrbein, F., and Knoll, A. (2019).
Multi-Cue Event Information Fusion for Pedestrian
Detection With Neuromorphic Vision Sensors. Fron-
tiers in Neurorobotics, 13:10.
de Tournemire, P., Nitti, D., Perot, E., Migliore, D., and
Sironi, A. (2020). A Large Scale Event-based Detec-
tion Dataset for Automotive. arXiv, abs/2001.08499.
Deng, Y., Chen, H., Liu, H., and Li, Y. (2022). A Voxel
Graph CNN for Object Classification With Event
Cameras. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition
(CVPR), pages 1172–1181.
Dorn, C., Dasari, S., Yang, Y., Kenyon, G., Welch, P.,
and Mascareñas, D. (2017). Efficient full-field opera-
tional modal analysis using neuromorphic event-based
imaging. In Shock & Vibration, Aircraft/Aerospace,
Energy Harvesting, Acoustics & Optics, Volume 9,
pages 97–103. Springer.
Graham, B., Engelcke, M., and van der Maaten, L. (2018).
3D Semantic Segmentation with Submanifold Sparse
Convolutional Networks. CVPR.
Graham, B. and van der Maaten, L. (2017). Submani-
fold Sparse Convolutional Networks. arXiv preprint
arXiv:1706.01307.
Guo, M., Huang, J., and Chen, S. (2017). Live demon-
stration: A 768 × 640 pixels 200Meps dynamic vi-
sion sensor. In 2017 IEEE International Symposium
on Circuits and Systems (ISCAS), pages 1–1.
Jiang, Z., Xia, P., Huang, K., Stechele, W., Chen, G., Bing,
Z., and Knoll, A. (2019). Mixed Frame-/Event-Driven
Fast Pedestrian Detection. In 2019 International Con-
ference on Robotics and Automation (ICRA), pages
8332–8338.
Lagorce, X., Orchard, G., Galluppi, F., Shi, B. E., and
Benosman, R. B. (2017). HOTS: A Hierarchy of
Event-Based Time-Surfaces for Pattern Recognition.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(7):1346–1359.
Liu, M. and Delbrück, T. (2018). Adaptive Time-Slice
Block-Matching Optical Flow Algorithm for Dynamic
Vision Sensors. In 29th British Machine Vision Con-
ference (BMVC).
Liu, M. and Qian, P. (2021). Automatic Segmentation
and Enhancement of Latent Fingerprints Using Deep
Nested UNets. IEEE Transactions on Information
Forensics and Security, 16:1709–1719.
McGlinchy, J., Johnson, B., Muller, B., Joseph, M., and
Diaz, J. (2019). Application of UNet Fully Convo-
lutional Neural Network to Impervious Surface Seg-
mentation in Urban Environment from High Resolu-
tion Satellite Imagery. In IGARSS 2019 - 2019 IEEE
International Geoscience and Remote Sensing Sympo-
sium, pages 3915–3918.
McMahon-Crabtree, P. N. and Monet, D. G. (2021).
Commercial-off-the-shelf event-based cameras
for space surveillance applications. Appl. Opt.,
60(25):G144–G153.
Miao, S., Chen, G., Ning, X., Zi, Y., Ren, K., Bing, Z., and
Knoll, A. (2019). Neuromorphic Vision Datasets for
Pedestrian Detection, Action Recognition, and Fall
Detection. Frontiers in Neurorobotics, 13:38.
Mitrokhin, A., Fermüller, C., Parameshwara, C., and Aloi-
monos, Y. (2018). Event-Based Moving Object De-
tection and Tracking. In 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS),
pages 6895–6902.
Mitrokhin, A., Hua, Z., Fermüller, C., and Aloimonos, Y.
(2020). Learning visual motion segmentation using
event surfaces. In 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR),
pages 14402–14411.
Najibi, M., Lai, G., Kundu, A., Lu, Z., Rathod, V.,
Funkhouser, T., Pantofaru, C., Ross, D., Davis, L. S.,
and Fathi, A. (2020). DOPS: Learning to Detect 3D
Objects and Predict Their 3D Shapes. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR).
Pohle-Fröhlich, R., Bohm, A., Ueberholz, P., Korb, M., and Goebbels, S. (2019). Roof Segmentation based on Deep Neural Networks. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, pages 326–333. INSTICC,
SciTePress.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
Net: Deep Learning on Point Sets for 3D Classifi-
cation and Segmentation. In 2017 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 77–85.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
Net++: Deep Hierarchical Feature Learning on Point
Sets in a Metric Space. In Proceedings of the 31st
International Conference on Neural Information Pro-
cessing Systems, NIPS’17, pages 5105–5114, Red
Hook, NY, USA. Curran Associates Inc.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net:
Convolutional Networks for Biomedical Image Seg-
mentation. In Navab, N., Hornegger, J., Wells, W. M.,
and Frangi, A. F., editors, Medical Image Comput-
ing and Computer-Assisted Intervention - MICCAI
2015, pages 234–241, Cham. Springer International
Publishing.
Sekikawa, Y., Hara, K., and Saito, H. (2019). EventNet:
Asynchronous Recursive Event Processing. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3882–3891.
Wan, J., Xia, M., Huang, Z., Tian, L., Zheng, X., Chang, V.,
Zhu, Y., and Wang, H. (2021). Event-Based Pedes-
trian Detection Using Dynamic Vision Sensors. Elec-
tronics, 10(8:888).
Wang, L., Chae, Y., Yoon, S.-H., Kim, T.-K., and Yoon,
K.-J. (2021). EvDistill: Asynchronous Events To
End-Task Learning via Bidirectional Reconstruction-
Guided Cross-Modal Knowledge Distillation. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 608–
619.
Wang, Q., Zhang, Y., Yuan, J., and Lu, Y. (2019). Space-
Time Event Clouds for Gesture Recognition: From
RGB Cameras to Event Cameras. In 2019 IEEE Win-
ter Conference on Applications of Computer Vision
(WACV), pages 1826–1835.
Zhu, A. Z., Yuan, L., Chaney, K., and Daniilidis, K. (2019).
Unsupervised Event-Based Learning of Optical Flow,
Depth, and Egomotion. In 2019 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 989–997.