VK-SITS: Variable Kernel Speed Invariant Time Surface for Event-Based Recognition

Laure Acin¹, Pierre Jacob², Camille Simon-Chane¹ and Aymeric Histace¹

¹ ETIS UMR 8051, CY Cergy Paris University, ENSEA, CNRS, F-95000, Cergy, France
² Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France
Keywords: Event-based Camera, Event-based Vision, Asynchronous Camera, Machine Learning, Time-surface, Recognition.
Abstract: Event-based cameras are recent non-conventional sensors which offer a new perception of movement with low latency, high power efficiency, high dynamic range and high temporal resolution. However, event data is asynchronous and sparse, so standard machine learning and deep learning tools are not optimal for this data format. A first step of event-based processing often consists in generating image-like representations from events, such as time-surfaces. Such event representations are usually proposed for specific applications, and these representations and learning algorithms are most often evaluated together. Furthermore, these methods are often evaluated in a non-rigorous way (i.e. by performing the validation on the testing set). We propose a generic event representation for multiple applications: a trainable extension of the Speed Invariant Time Surface, coined VK-SITS. This speed- and translation-invariant framework is computationally fast and GPU-friendly. A second contribution is a new benchmark based on 10-Fold cross-validation to better evaluate event-based representations on the DVS128 Gesture and N-Caltech101 recognition datasets. Our VK-SITS event-based representation improves the recognition performance of state-of-the-art methods.
1 INTRODUCTION
Event-based cameras are a recent technology com-
posed of autonomous pixels which acquire informa-
tion only when they detect a brightness change in
their individual field of view (Lichtsteiner et al., 2008;
Posch et al., 2011). These cameras only record scene
dynamics and there is no information redundancy.
Other advantages of such cameras are their high dy-
namic range (over 120 dB), high temporal resolution
(µs), low latency and low power consumption. For
all these reasons, event-based cameras are an attrac-
tive sensor for movement recognition with efficient
data processing.
Compared to standard cameras where data is
dense (all pixel information is sent for each frame)
and synchronous (all frames are acquired at a fixed
frequency), data from an event camera is sparse and
asynchronous: only pixels which sense a change in
brightness provide data and this data is sent as soon
as a change is detected (see Figure 1). Standard im-
age processing tools and methods from machine and
deep learning are not adapted for this sparse and asyn-
chronous data. The event-based processing commu-
nity has been developing new tailored algorithms for
this data (Gallego et al., 2020). A common strategy
to use existing computer vision tools is to recreate
image-like representations from events.
Event representation methods are usually eval-
uated for a given object recognition task. How-
ever, standard benchmarks are highly biased and over-
tuned. In most cases, the validation is performed on
the testing set which makes model selection unreli-
able and reported results biased. Also, both training
and testing sets are randomized a single time. The re-
ported results highly depend on the random split, and
might not correctly represent the benefits of the pro-
posed method. This type of evaluation greatly limits
repeatability and reproducibility, which limits the use
of these methods in real-case scenarios. Finally, time-
surfaces are often developed for a given deep-learning
method and application, and they are then evaluated
only with that deep-learning method.
Figure 1: Samples from event-based datasets accumulated over 10 ms. Events are shown on a gray background and colored
according to their polarity: ON events are represented in white and OFF events in black. First row: DVS128 Gesture, from
left to right: air guitar, arm roll, right hand wave, air drum and hand clap classes. Second row: N-Caltech101, from left to
right: bonsai, seahorse, anchor, bird and truck classes.
Our proposition is twofold: First, we introduce
Variable-Kernel SITS, coined VK-SITS, a represen-
tation inspired by SITS (Manderscheid et al., 2019)
and EST (Gehrig et al., 2019). We use the principle of
SITS to build a translation and speed-invariant time-
surface while making the kernel learnable to improve
its representation power. We extend this method to
multiple kernel learning, unlike the models it is based
on. Moreover, SITS has only been used for corner de-
tection. We evaluate it on a recognition task, a new
application for this method.
Second, we propose a new evaluation benchmark.
Compared to previous benchmarks, we fix the train-val-test
splits, which ensures repeatability. Then, we evalu-
ate the models based on a 10-Fold cross-validation.
This ensures that model selection is correctly per-
formed, that reported model performances are not biased, and
that the results are reproducible. This also keeps the
training time and complexity of the model evalua-
tion tractable. The evaluation we propose is inde-
tion traceable. The evaluation we propose is inde-
pendent from the preprocessing and learning method
used. These event representations are evaluated with
the same classification network. This guarantees that
we evaluate only the event representation and not the
entire framework. We compare our representation
to time-surface methods that are most often used in
competitive applications: SITS (Manderscheid et al.,
2019), TORE (Baldwin et al., 2022) and VoxelGrid
(Zhu et al., 2019).
2 RELATED WORK
An increasingly deployed event representation is the
”time-surface”. A time-surface describes the spatial
neighborhood of an event over an interval of time in
two dimensions (Lagorce et al., 2017). Time-surfaces
can thus be easily fed to most machine learning
tools such as deep convolutional networks. However,
the large majority of state-of-the-art time-surfaces,
such as HOTS (Lagorce et al., 2017), HATS (Sironi
et al., 2018), and SITS (Manderscheid et al., 2019),
are not end-to-end trainable which limits the learn-
ing process. On the other hand, end-to-end trainable
representations such as EST (Gehrig et al., 2019) are
not speed-invariant, contrary to SITS. We know of no
speed-invariant, fully trainable event representation,
though such a method would be a powerful input for a
classification network.
The rest of this section is divided into three parts:
first, we review the basics of event-based cameras and
the time-surface methods proposed in the literature, fol-
lowing the framework of (Gehrig et al., 2019); then, we
focus on SITS (Manderscheid et al., 2019) and finally on
EST (Gehrig et al., 2019), from which our method is in-
spired.
2.1 Time-Surface
Many event-based processing algorithms are based on
a time-surface representation. This image-like repre-
sentation is often the first step in recognition methods
(Lagorce et al., 2017). It provides a 2D description of
the past activity of an event neighborhood.
Let us start by introducing the event concept.
When a brightness change, characterized by

$$ \Delta L(\mathbf{x}_i, t_i) = p_i\, C \quad \text{with} \quad \Delta L(\mathbf{x}_i, t_i) = L(\mathbf{x}_i, t_i) - L(\mathbf{x}_i, t_i - \Delta t_i), \quad L = \log(I), \tag{1} $$

happens at the pixel location $\mathbf{x}_i = [x_i, y_i]^T$ and timestamp $t_i$, the camera records an event

$$ e_i = [x_i, y_i, t_i, p_i]^T. \tag{2} $$

The polarity of the event $p_i \in \{-1, 1\}$ corresponds to the sign of the brightness change. Event representations aim at encoding a sequence of $N$ events $E = \{e_i\}_{i=1}^{N}$, with increasing timestamps, into a meaningful form suitable for the subsequent task.
Practically, events are points in a four-dimensional manifold spanned by their spatial coordinates $x$ and $y$, their timestamp $t$ and their polarity $p$. Mathematically, this is represented by an event field with a measure $f$:

$$ F_{\pm}(x, y, t) = \sum_{e_i} f_{\pm}(x, y, t)\, \delta(x - x_i, y - y_i, t - t_i) \tag{3} $$

where $F_{+}$ (resp. $F_{-}$) is computed with events of polarity $+1$ (resp. $-1$). Thanks to this formulation, the event field preserves the high temporal resolution of event-based cameras, but also enforces spatio-temporal locality. Thus, designing an event representation amounts to designing an efficient measure function $f$. In particular, event polarity corresponds to $f_{\pm}(x, y, t) = \pm 1$, and counting events (Maqueda et al., 2018) is retrieved by using $f_{\pm}(x, y, t) = 1$. Voxel grids (Zhu et al., 2019; Mueggler et al., 2018) and leaky surfaces (Cannici et al., 2019) are two other event representations that fall into this framework.
Time-surfaces generalize Equation 3 by considering a spatio-temporal convolution kernel, such that:

$$ S_{\pm}(x, y, t) = (k * F_{\pm})(x, y, t) = \sum_{e_i} f_{\pm}(x, y, t)\, k(x - x_i, y - y_i, t - t_i) \tag{4} $$

where $*$ is the convolution operation. In order to retrieve most known event representations, this continuous function is, for most of them, discretized at the spatio-temporal coordinates $(x_l, y_m, t_n)$, where $x_l \in \{0, 1, \dots, W-1\}$, $y_m \in \{0, 1, \dots, H-1\}$ and $t_n \in \{t_0, t_0 + \Delta t, \dots, t_0 + B\,\Delta t\}$, with $(H, W)$ the image size, $\Delta t$ the bin size, and $B$ the number of bins. In particular, HOTS (Lagorce et al., 2017) can be retrieved from Equation 4 by using $k(x, y, t) = \delta(x, y)\exp(-t/\tau)$, HATS by adding a spatio-temporal average (Sironi et al., 2018), and TORE (Baldwin et al., 2022) by applying the kernel $k(x, y, t) = \delta(x, y)\log(1 + t)$. DART (Ramesh et al., 2020) considers a log-polar reparameterization of the time-surface and uses a soft-assignment kernel to rings and wedges.
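As a concrete illustration of the discretization of Equation 4, the sketch below (Python/NumPy; the function name, array layout and default τ are our own choices, not taken from the cited works) builds a HOTS-like time-surface for the events of one polarity by keeping the most recent timestamp per pixel and applying the decaying kernel $k(x, y, t) = \delta(x, y)\exp(-t/\tau)$.

```python
import numpy as np

def exponential_time_surface(events, height, width, tau=50e-3, t_ref=None):
    """HOTS-like time-surface sketch: k(x, y, t) = delta(x, y) * exp(-t / tau).

    `events` is an (N, 4) array with columns (x, y, t, p), already restricted
    to a single polarity.  The surface is evaluated at time `t_ref`
    (defaults to the timestamp of the last event).
    """
    surface = np.zeros((height, width), dtype=np.float32)
    last_t = np.full((height, width), -np.inf, dtype=np.float32)

    # delta(x, y) part of the kernel: keep the most recent timestamp per pixel.
    x, y, t = events[:, 0].astype(int), events[:, 1].astype(int), events[:, 2]
    for xi, yi, ti in zip(x, y, t):
        last_t[yi, xi] = max(last_t[yi, xi], ti)

    if t_ref is None:
        t_ref = t.max()
    valid = np.isfinite(last_t)
    surface[valid] = np.exp(-(t_ref - last_t[valid]) / tau)  # exp(-dt / tau)
    return surface
```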
2.2 Speed Invariant Time Surface
Time-surfaces are dependent on the speed and direction of the movement. For this reason, the Speed Invariant Time Surface (SITS) has been formalized (Manderscheid et al., 2019). The aim is to obtain the same silhouette for the events produced by the movement of an object, whatever the speed of that object. When an event arrives, it sets a large value at its position in the time-surface and reduces the neighborhood values, independently of the time value. In this manner, the values of previous events are sequentially scaled down by a constant amount, and the resulting time-surface is independent of the movement's speed. Mathematically, SITS is defined as follows:

$$ S_{\pm}(x, y, t) = \sum_{e_i} f_{\pm}(x, y, t)\, k(x - x_i, y - y_i) \tag{5} $$

where $k(x, y)$ is defined in (Manderscheid et al., 2019) as the fixed $3 \times 3$ convolution kernel

$$ k = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \end{pmatrix}. \tag{6} $$

Then, the time-surface is normalized between 0 and 9 to become invariant to the speed and direction of the movement.
SITS is an improvement over other recent time-surface methods, as it is the only one to consider speed invariance. However, we observe that the kernel used by SITS is arbitrarily chosen. Moreover, in frame-based algorithms it is common to learn convolution kernels, as is done in CNNs.
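As a minimal sketch of this update (our reading of Eqs. 5 and 6, not the original asynchronous implementation of (Manderscheid et al., 2019), whose per-event bookkeeping differs in its details), each incoming event stamps the kernel around its position and the surface is then kept in the [0, 9] range:

```python
import numpy as np

# The 3x3 kernel of Eq. 6: a large value at the event location, -1 on its neighbours.
SITS_KERNEL = np.array([[-1, -1, -1],
                        [-1,  9, -1],
                        [-1, -1, -1]], dtype=np.float32)

def sits_update(surface, x, y, kernel=SITS_KERNEL, v_max=9.0):
    """Per-event SITS-style update (sketch): stamp the kernel around the new
    event and clip the surface to [0, v_max], so that older activity is pushed
    down by a constant, time-independent amount."""
    r = kernel.shape[0] // 2
    h, w = surface.shape
    x0, x1 = max(x - r, 0), min(x + r + 1, w)
    y0, y1 = max(y - r, 0), min(y + r + 1, h)
    surface[y0:y1, x0:x1] += kernel[(y0 - y + r):(y1 - y + r), (x0 - x + r):(x1 - x + r)]
    np.clip(surface, 0.0, v_max, out=surface)
    return surface
```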
2.3 Event Spike Tensor
(Gehrig et al., 2019) proposed Event Spike Tensor
(EST) as the first end-to-end learnable event represen-
tation. Following Equation 4, the convolution kernel
is learned in order to find a data-driven representation
best suited for the task. EST replaces the handcrafted
kernel with a multi-layer perceptron (MLP) com-
posed of three layers. This MLP takes the relative
spatio-temporal coordinates as input and generates
an activation map around them.
Thanks to this formulation, EST is both
translation-invariant and end-to-end trainable. How-
ever, the proposed formulation is not speed-invariant
by design, and the training process has to discover this
property. In the following, we present the Variable-
Kernel SITS (VK-SITS), which takes advantage of
both representations.
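To illustrate the idea of a trainable kernel, the following PyTorch sketch shows a simplified, hypothetical variant (not the authors' EST implementation): a small MLP maps each event's normalized timestamp to per-bin weights, which are scatter-added into a dense (polarity × bins × H × W) tensor.

```python
import torch
import torch.nn as nn

class LearnableTemporalKernel(nn.Module):
    """Simplified EST-style representation (sketch): an MLP turns each event's
    normalized timestamp into per-bin contributions, accumulated into a dense
    (2, bins, H, W) tensor by polarity and pixel."""

    def __init__(self, bins=9, height=128, width=128):
        super().__init__()
        self.bins, self.height, self.width = bins, height, width
        self.mlp = nn.Sequential(nn.Linear(1, 30), nn.ReLU(),
                                 nn.Linear(30, 30), nn.ReLU(),
                                 nn.Linear(30, bins))

    def forward(self, x, y, t, p):
        # x, y, p: long tensors (pixel coordinates and polarity in {0, 1}); t: float timestamps.
        t_norm = ((t - t.min()) / (t.max() - t.min() + 1e-9)).unsqueeze(1)
        weights = self.mlp(t_norm)                                     # (N, bins)
        planes = []
        for pol in (0, 1):
            mask = p == pol
            idx = (y[mask] * self.width + x[mask]).unsqueeze(0).expand(self.bins, -1)
            plane = torch.zeros(self.bins, self.height * self.width)
            plane = plane.scatter_add(1, idx, weights[mask].t().contiguous())
            planes.append(plane.view(self.bins, self.height, self.width))
        return torch.stack(planes)                                     # (2, bins, H, W)
```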
Figure 2: Overview of the paper. In green is our method; in pink are the methods to which ours is compared. For a stream
of events, we preprocess the data: first we sort the events chronologically, then we randomly choose a window of 40 000
events, sub-sample the window by a factor of 10 and, if necessary, use padding when there is not enough data. The second
part is our method. We compare our method to TORE (Baldwin et al., 2022), SITS (Manderscheid et al., 2019) and VoxelGrid
(Zhu et al., 2019) using the same preprocessing and neural network.
3 MATERIALS AND METHODS
In this section, we describe our method, the datasets used to evaluate it, and finally our experimental setup. The source code is publicly available for download (https://github.com/LaureAcin/Event-based-processing.git).
3.1 Variable Kernel SITS
While our method is based on the SITS kernel, we
use several kernels to simultaneously focus on differ-
ent features and we learn these kernels to better fit
the data. Figure 2 shows an overview of our method,
which we present in three steps: preprocessing, event
representation and classification.
3.1.1 Preprocessing
First, events are sorted in chronological order, in case the sequence does not respect the AER protocol (Delbrück et al., 2010). Then, we randomly choose a window of 40 000 consecutive events, in case sequences are not cut right at the start and end of the movement. We then sub-sample the sequence by a factor of 10 to obtain a sequence of 4 000 events. Deep learning algorithms are robust to sub-sampling and we noticed that these short sequences are sufficient to perform the recognition task while decreasing the computation cost.
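A minimal sketch of this preprocessing is given below (NumPy; the padding strategy, repeating the last event, is our assumption, since the paper only states that padding is used when needed):

```python
import numpy as np

def preprocess(events, window=40_000, factor=10, rng=None):
    """Preprocessing sketch following Section 3.1.1: sort events chronologically,
    take a random window of `window` consecutive events (padding short
    sequences), then keep every `factor`-th event."""
    rng = np.random.default_rng() if rng is None else rng
    events = events[np.argsort(events[:, 2])]                 # column 2 holds the timestamp
    if len(events) < window:                                   # pad by repeating the last event (assumption)
        pad = window - len(events)
        events = np.concatenate([events, np.repeat(events[-1:], pad, axis=0)])
    start = rng.integers(0, len(events) - window + 1)
    return events[start:start + window:factor]                 # 40 000 events -> 4 000 events
```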
3.1.2 Representation
To compute the representation, events are transformed into a tensor as done in (Maqueda et al., 2018); then four kernels are applied to the events: two filters per polarity. Mathematically, we define a tensor $E_{\pm}$ which encodes the number of events per polarity at each pixel of the camera:

$$ E_{\pm}(x, y) \in \mathbb{N}^{H \times W} \tag{7} $$

with $H$ and $W$ respectively the height and the width of the event-based sensor. Then, we apply kernels $K_k$ with $k \in \{0, 1\}$ to create a time-surface $T_{\pm}(x, y, k)$ for each kernel $K_k$ as:

$$ T_{\pm}(x, y, k) = (K_k * E_{\pm})(x, y). \tag{8} $$

We can consider an event $e_i$ at the position $(x_i, y_i)$ as a Dirac $\delta_{\pm}(x_i, y_i)$ and so the tensor $E_{\pm}$ as a sum of Diracs:

$$ E_{\pm} = \sum_{e_i} \delta_{\pm}(x_i, y_i). \tag{9} $$

So, we can define $T_{\pm}(x, y, k)$ as:

$$ T_{\pm}(x, y, k) = K_k * \sum_{e_i} \delta_{\pm}(x_i, y_i) = \sum_{e_i} K_k(x - x_i, y - y_i). \tag{10} $$

Batch normalization and ReLU are then applied to keep the process normalized, and we obtain the event time-surfaces.
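A minimal PyTorch sketch of this representation, under our reading of Eqs. 7-10, is given below: the per-polarity event-count images of Eq. 7 are convolved with learnable kernels (two per polarity; the grouped convolution, zero bias and exact layer arrangement are our assumptions), then batch normalization and ReLU are applied.

```python
import torch
import torch.nn as nn

class VKSITS(nn.Module):
    """Sketch of the VK-SITS representation (our reading of Eqs. 7-10):
    per-polarity event-count images convolved with learnable kernels
    (two per polarity), followed by batch normalization and ReLU."""

    def __init__(self, kernels_per_polarity=2, radius=4, dilation=1):
        super().__init__()
        k = 2 * radius + 1                       # r = 4 gives 9x9 kernels
        self.conv = nn.Conv2d(2, 2 * kernels_per_polarity, kernel_size=k,
                              padding=radius * dilation, dilation=dilation,
                              groups=2, bias=False)
        self.bn = nn.BatchNorm2d(2 * kernels_per_polarity)

    def forward(self, x, y, p, height, width):
        # Accumulate the event-count tensor E± of Eq. 7: one channel per polarity.
        counts = torch.zeros(1, 2, height, width)
        batch = torch.zeros_like(p)
        counts.index_put_((batch, p, y, x), torch.ones(p.shape[0]), accumulate=True)
        return torch.relu(self.bn(self.conv(counts)))  # Eq. 8 followed by BN + ReLU
```

With r = 4 and d = 1 as in Section 3.2, each of the four learnable kernels is 9 × 9, and the four output channels feed the classification network.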
3.1.3 Classification
We use a ResNet18 network (He et al., 2016) pre-
trained on ImageNet (Russakovsky et al., 2015). We
replace the first layer of ResNet18 by a layer adapted
to the size of the input data, as needed. The last layer
of ResNet18 is also deleted and replaced by a classi-
fication layer.
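A sketch of this adaptation with torchvision (assuming a recent torchvision release; the exact parameters of the replacement layers are our choice) could look as follows:

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_classifier(in_channels=4, num_classes=11):
    """Sketch of the classifier: ResNet18 pretrained on ImageNet, with its
    first convolution adapted to the representation's channel count and its
    last layer replaced by a task-specific classification layer."""
    net = resnet18(weights="IMAGENET1K_V1")      # recent torchvision API assumed
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# num_classes would be 11 for DVS128 Gesture and 101 for N-Caltech101
# (100 object classes plus the background class).
```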
3.2 Evaluation
We evaluate our method using two datasets: a small
real-world dataset, DVS128 Gesture, and a larger
dataset, N-Caltech101.
DVS128 Gesture is a real-world dataset composed of 11 hand and arm gestures performed by 29 subjects in 3 different illumination conditions, resulting in a total of 1342 sequences (Amir et al., 2017). This dataset was created in a laboratory with a fixed background and controlled illumination (natural, fluorescent and LED lights are used). The scenes are acquired with a DVS128, so the event-streams cover a range of 128 × 128 pixels.
We respect the original split between training and testing: 23 subjects are used for training and 6 subjects for testing. To perform the 10-Fold cross-validation, we successively extract 4 subjects from the training set and use them for validation. This extraction is performed using a sliding window with a step of 2 subjects (see the sketch below). Finally, the training set of DVS128 Gesture represents about 65% of the data, the validation set about 14% and the testing set 21%. We avoid subject bias by keeping all sequences of a given subject grouped in the same set (training, validation or testing).
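The sliding-window fold construction mentioned above can be sketched as follows (the function name and subject identifiers are illustrative):

```python
def gesture_folds(train_subjects, fold_size=4, step=2):
    """Fold construction sketch for DVS128 Gesture: from the 23 training
    subjects, a sliding window of 4 subjects (step 2) defines each validation
    fold; the remaining subjects form the corresponding training fold."""
    folds = []
    for start in range(0, len(train_subjects) - fold_size + 1, step):
        val = train_subjects[start:start + fold_size]
        train = [s for s in train_subjects if s not in val]
        folds.append((train, val))
    return folds

# With 23 subjects, this yields 10 folds: subjects [0-3], [2-5], ..., [18-21].
```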
N-Caltech101 is obtained by recording static images of the Caltech101 dataset on a computer display (Orchard et al., 2015) with a moving ATIS event-based camera mimicking saccades (Posch et al., 2011). The dataset contains 100 object classes and one background class. Each category contains between 45 and 400 sequences, resulting in a total of 8709 sequences. Since the recordings are performed with an ATIS sensor, the resulting event-streams cover a range of 240 × 304 pixels, even though the input Caltech101 images are of varying sizes.
20% of each class is reserved for testing. A 10-Fold cross-validation is performed by splitting the remaining data into 10 parts per class; each part is successively used as a validation set while the remaining 9 form the training set (a split sketch follows below). Each set thus respects the unbalanced statistics of the global database. The training set represents 72% of the data, the validation set 8% and the testing set 20%.
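A sketch of this per-class split, using scikit-learn as one possible tool (not necessarily the authors' implementation), follows:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def ncaltech_splits(files, labels, seed=0):
    """Per-class split sketch for N-Caltech101: reserve 20% of every class for
    testing, then build 10 stratified folds from the remaining data so that
    each fold keeps the unbalanced class statistics of the dataset."""
    trainval, test, y_trainval, y_test = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=seed)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(skf.split(trainval, y_trainval))   # (train_idx, val_idx) pairs
    return trainval, y_trainval, test, y_test, folds
```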
Evaluation Pipeline. We compare VK-SITS to
three other competitive event representation methods:
SITS (Manderscheid et al., 2019), TORE (Baldwin
et al., 2022) and VoxelGrid (Zhu et al., 2019). The
evaluation pipeline, described in Figure 2, consists
in successively training a ResNet18 network on both
datasets, using both the Adam and SGD optimizers, for all
four representation methods. The input data under-
goes the same pre-processing. This way, the influence
of the event representation on the recognition task can
be isolated.
Global parameters are fine-tuned empirically. For VK-SITS the radius of the time-surface kernel is set to r = 4, the dilation is set to d = 1 and we use 4 filters. Common parameters between SITS and VK-SITS are set to the same value. We use a memory of 100 for TORE and 100 bins for VoxelGrid. We use a learning rate of $10^{-4}$ for the Adam optimizer and $10^{-2}$ for SGD. In all experiments we train for 150 epochs. We save the model with the highest average validation top-1 accuracy, along with the number of epochs needed to reach it, and we evaluate this model on the testing set.
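The optimizer settings above can be summarized by the following sketch (hyperparameters not reported in the paper, such as momentum or weight decay, are left at library defaults as an assumption):

```python
import torch

def make_optimizer(model, name="adam"):
    """Optimizer settings from Section 3.2: Adam with lr = 1e-4 or SGD with
    lr = 1e-2; every model is trained for 150 epochs and the checkpoint with
    the best validation top-1 accuracy is kept for the test evaluation
    (selection loop not shown)."""
    if name == "adam":
        return torch.optim.Adam(model.parameters(), lr=1e-4)
    return torch.optim.SGD(model.parameters(), lr=1e-2)
```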
4 RESULTS AND DISCUSSION
During training, we calculate the average time per
step for training models, the average number of
epochs needed to train the models and the average of
classification top-1 accuracy from the 10-Fold cross-
validation. Average, we calculate during testing the
mean of classification top-1 accuracy, standard de-
viation, minimum and maximum top-1 accuracy ob-
tained.
Recognition performance on the DVS128 Gesture dataset is given in Table 1. Our work achieves the lowest average training time per step for both SGD (1.75 s) and Adam (0.60 s), as well as the highest mean training top-1 accuracy with both optimizers (89.29% with SGD and 89.30% with Adam). Slightly better results are obtained using the Adam optimizer. The lowest average number of epochs needed to train the models is obtained by SITS using SGD (90 epochs) and TORE using Adam (89 epochs).

During the 10-Fold testing, TORE obtains the best results in terms of average, minimum and maximum top-1 accuracy and standard deviation with both optimizers. For this dataset, our work does not obtain the best results but is in the same range as the other methods.
Table 1: Comparison between our method and SITS (Manderscheid et al., 2019), TORE (Baldwin et al., 2022) and VoxelGrid (Zhu et al., 2019) on DVS128 Gesture and N-Caltech101, according to mean top-1 classification accuracy (%), time per step (in s) and number of epochs needed to train the models. Standard deviation (σ), minimum and maximum accuracy are reported for the testing phase using a 10-Fold cross-validation. Best values per column and optimizer are highlighted in bold.
Dataset          Method     Optim.   Train: avg. time  Train: avg. #  Train: avg.   Test: avg.   Test    Test    Test
                                     per step (s)      of epochs      top-1 acc.    top-1 acc.   σ       min     max
DVS128 Gesture   VK-SITS    SGD      1.75              137            89.29         87.43        2.22    84.03   89.93
DVS128 Gesture   SITS       SGD      3.59               90            88.81         87.67        2.44    83.33   89.93
DVS128 Gesture   TORE       SGD      2.93              102            88.64         88.16        1.59    85.76   91.32
DVS128 Gesture   VoxelGrid  SGD      2.09              126            81.56         80.38        4.39    71.88   85.42
DVS128 Gesture   VK-SITS    Adam     0.60              104            89.30         87.92        1.40    85.76   89.58
DVS128 Gesture   SITS       Adam     2.57              116            89.15         88.30        1.59    85.42   90.63
DVS128 Gesture   TORE       Adam     3.03               89            88.81         88.37        1.31    86.81   91.32
DVS128 Gesture   VoxelGrid  Adam     1.10              128            84.63         83.16        2.37    78.47   86.46
N-Caltech101     VK-SITS    SGD      0.31               75            73.75         73.52        0.52    72.82   74.21
N-Caltech101     SITS       SGD      2.31               83            73.12         72.66        0.98    70.92   73.96
N-Caltech101     TORE       SGD      2.08               69            73.21         72.27        0.86    70.80   73.33
N-Caltech101     VoxelGrid  SGD      0.57              100            72.74         72.29        1.09    70.20   73.58
N-Caltech101     VK-SITS    Adam     0.25               80            71.50         71.08        0.73    69.86   71.88
N-Caltech101     SITS       Adam     1.79               49            71.61         71.33        0.94    70.17   72.60
N-Caltech101     TORE       Adam     1.93               58            72.16         72.21        0.90    70.89   73.73
N-Caltech101     VoxelGrid  Adam     3.11               39            72.84         72.69        1.14    70.52   73.98
Figure 3: 10-Fold classification accuracy for DVS128 Ges-
ture database per event representation: VK-SITS (this
work), SITS (Manderscheid et al., 2019), TORE (Baldwin
et al., 2022) and VoxelGrid (Zhu et al., 2019).
Figure 3 shows that the differences in accuracy between VK-SITS, SITS and TORE are not significant. Only the VoxelGrid event representation provides meaningfully lower recognition results.
Table 1 also shows the recognition performance on the N-Caltech101 dataset. VK-SITS achieves the lowest average time per step with both SGD (0.31 s) and Adam (0.25 s). The smallest number of epochs needed to train the models is achieved by TORE using SGD (69 epochs) and VoxelGrid using Adam (39 epochs). We observe fast overfitting with the Adam optimizer for all tested methods, which can be explained by the heterogeneity of the dataset. The best overall top-1 accuracy is reached with VK-SITS using SGD (highest training and testing average top-1 accuracy, highest minimum and maximum testing accuracy, lowest standard deviation).
The VoxelGrid representation with the Adam optimizer also performs well (highest training and testing average top-1 accuracy and highest maximum accuracy), though less consistently over the 10 folds. This is evident in its standard deviation of 1.14, compared to 0.73 for VK-SITS. Figure 4 visually confirms that VK-SITS with the SGD optimizer achieves the best classification results for the N-Caltech101 database. VK-SITS is more stable and repeatable than the other methods tested.

Figure 4: 10-Fold classification accuracy for the N-Caltech101 database per event representation: VK-SITS (this work), SITS (Manderscheid et al., 2019), TORE (Baldwin et al., 2022) and VoxelGrid (Zhu et al., 2019).
Results are mixed for DVS128 Gesture, a simple dataset on which most methods obtain good results, with no significantly better method. However, the more challenging N-Caltech101 dataset shows that VK-SITS achieves the best recognition compared to SITS, TORE and VoxelGrid. Using several learnable kernels creates a representation which better fits the data and can focus on different features, and we retain the speed-invariance advantage of SITS. VK-SITS also consistently enables faster learning.

Finally, DVS128 Gesture sequences range from 35 267 to 1 594 557 events and N-Caltech101 sequences range from 6 718 to 399 321 events. Given these large ranges, our preprocessing phase makes a difference with respect to the other methods: using fewer events than other works, we obtain results close to those they report.
In the original two datasets, only the split between training and testing sets is provided. Most other works use the same set for validation and testing. The reported results are thus biased and higher than what can be achieved on unseen data. In this configuration, TORE combined with a GoogLeNet network obtains 96.2% on DVS128 Gesture and 79.8% on N-Caltech101 (Baldwin et al., 2022). Similarly, VoxelGrid reaches 75.4% on N-Caltech101 (Gehrig et al., 2019). The proper training, validation and testing split performed in this work allows, for the first time, a realistic and fair comparison of four event representation methods on a deep-learning recognition task.
SITS was developed for corner detection; this work is the first evaluation of SITS for a deep-learning recognition task. The original SITS implementation uses a random forest for corner detection, whereas we use a ResNet18 for classification. Finally, SITS is a fully asynchronous event-per-event method which can process up to 1.6 Mev/s on a single CPU to provide the event representation. Our method processes sequences of 40 000 events downsampled to 4 000 events. Although the recognition itself is quick (150 µs to compute the representation and perform the classification), there is an inherent latency given the time needed to accumulate the necessary events. This latency is of the order of 50 ms for DVS128 Gesture and 10 ms for N-Caltech101.
5 CONCLUSION
The paper introduces VK-SITS, an event-based rep-
resentation based on the SITS time-surface and on
the Event Spike Tensor. VK-SITS is speed-invariant,
translation-invariant, and characterized by four end-
to-end trainable kernels. This allows VK-SITS to
learn faster than the other representations evaluated.
VK-SITS is compared to three state-of-the-art event
representations using a unified preprocessing on a
recognition task using ResNet18 on two commonly
used event-based datasets, DVS128 Gesture and N-
Caltech101. The comparison is performed with a 10-
Fold cross-validation with a proper training, validation and testing split. When trained with the SGD optimizer, VK-SITS provides more robust learning results.
This paper provides a methodological contribution
by proposing a pipeline to compare event represen-
tation methods independently from the deep-learning
networks used. We also implement and evaluate for
the first time SITS for a deep-learning based recogni-
tion task.
To further evaluate the potential of VK-SITS as
a generic event representation, it should be tested on
other tasks and databases. SL-Animals-DVS, for example, contains 19 different animal signs in American Sign Language and would provide a new application (Vasudevan et al., 2021). The evaluation methodology presented in this work is a rigorous comparison in an off-line setting. One of the strengths of event-based cameras is their ability to quickly record high-speed movements. As such, it is important to also evaluate these recognition tasks in an on-line setting. The implementation of real-time event-based algorithms is non-trivial, especially with the advent of increasingly large sensors. Future work should concentrate on comparing the on-line performances of event-based representations and recognition algorithms.
REFERENCES
Amir, A., Taba, B., Berg, D., Melano, T., Mckinstry, J.,
Nolfo, C. D., Nayak, T., Andreopoulos, A., Garreau,
G., Mendoza, M., Kusnitz, J., Debole, M., Esser, S.,
Delbruck, T., Flickner, M., and Modha, D. (2017). A
Low Power, Fully Event-Based Gesture Recognition
System. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 7388–7397.
Baldwin, R. W., Liu, R., Almatrafi, M., Asari, V., and
Hirakawa, K. (2022). Time-Ordered Recent Event
(TORE) Volumes for Event Cameras. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
pages 1–1.
Cannici, M., Ciccone, M., Romanoni, A., and Matteucci,
M. (2019). Asynchronous Convolutional Networks
for Object Detection in Neuromorphic Cameras. In
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops, pages 1656–1665.
Delbrück, T., Linares-Barranco, B., Culurciello, E., and
Posch, C. (2010). Activity-driven, event-based vision
sensors. In IEEE International Symposium on Circuits
and Systems, pages 2426–2429.
Gallego, G., Delbruck, T., Orchard, G., Bartolozzi, C.,
Taba, B., Censi, A., Leutenegger, S., Davison, A.,
Conradt, J., Daniilidis, K., and Scaramuzza, D.
(2020). Event-based Vision: A Survey. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
44(1):154–180.
Gehrig, D., Loquercio, A., Derpanis, K. G., and Scara-
muzza, D. (2019). End-to-End Learning of Represen-
tations for Asynchronous Event-Based Data. In IEEE
International Conference on Computer Vision, pages
5632–5642.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Resid-
ual Learning for Image Recognition. In The IEEE
Conference on Computer Vision and Pattern Recog-
nition, pages 770–778.
Lagorce, X., Orchard, G., Galluppi, F., Shi, B. E., and
Benosman, R. B. (2017). HOTS: A Hierarchy of
Event-Based Time-Surfaces for Pattern Recognition.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(7):1346–1359.
Lichtsteiner, P., Posch, C., and Delbruck, T. (2008). A 128×
128 120 dB 15 µs Latency Asynchronous Temporal
Contrast Vision Sensor. IEEE Journal of Solid-State
Circuits, 43(2):566–576.
Manderscheid, J., Sironi, A., Bourdis, N., Migliore, D., and
Lepetit, V. (2019). Speed Invariant Time Surface for
Learning to Detect Corner Points with Event-Based
Cameras. In IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 10237–10246.
Maqueda, A. I., Loquercio, A., Gallego, G., Garcia, N., and
Scaramuzza, D. (2018). Event-based Vision meets
Deep Learning on Steering Prediction for Self-driving
Cars. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 5419–5427.
Mueggler, E., Gallego, G., Rebecq, H., and Scaramuzza, D.
(2018). Continuous-Time Visual-Inertial Odometry
for Event Cameras. IEEE Transactions on Robotics,
34.
Orchard, G., Jayawant, A., Cohen, G. K., and Thakor, N.
(2015). Converting Static Image Datasets to Spiking
Neuromorphic Datasets Using Saccades. Frontiers in
Neuroscience, 9.
Posch, C., Matolin, D., and Wohlgenannt, R. (2011). A
QVGA 143 dB Dynamic Range Frame-Free PWM
Image Sensor with Lossless Pixel-Level Video Com-
pression and Time-Domain CDS. IEEE Journal of
Solid-State Circuits, 46.
Ramesh, B., Yang, H., Orchard, G., Thi, N. A. L., Zhang,
S., and Xiang, C. (2020). DART: Distribution Aware
Retinal Transform for Event-based Cameras. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 42(11):2767–2780.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet Large Scale Visual Recognition Challenge. In-
ternational Journal of Computer Vision, 115.
Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., and
Benosman, R. (2018). HATS: Histograms of Aver-
aged Time Surfaces for Robust Event-based Object
Classification. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 1731–1740.
Vasudevan, A., Negri, P., Ielsi, C. D., Linares-Barranco,
B., and Serrano-Gotarredona, T. (2021). SL-Animals-
DVS: Event-Driven Sign Language Animals Dataset.
Pattern Analysis and Applications, 25(3):505–520.
Zhu, A. Z., Yuan, L., Chaney, K., and Daniilidis, K. (2019).
Unsupervised Event-based Learning of Optical Flow,
Depth, and Egomotion. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 989–
997.