ShapeAug: Occlusion Augmentation for Event Camera Data
Katharina Bendig¹, René Schuster¹,² and Didier Stricker¹,²
¹RPTU, University of Kaiserslautern-Landau, Germany
²DFKI, German Research Center for Artificial Intelligence, Germany
Keywords: Event Camera Data, Augmentation, Classification, Object Detection.
Abstract: Recently, Dynamic Vision Sensors (DVSs) sparked a lot of interest due to their inherent advantages over conventional RGB cameras. These advantages include a low latency, a high dynamic range and a low energy consumption. Nevertheless, the processing of DVS data using Deep Learning (DL) methods remains a challenge, particularly since the availability of event training data is still limited. This leads to a need for event data augmentation techniques in order to improve accuracy as well as to avoid over-fitting on the training data. Another challenge, especially in real-world automotive applications, is occlusion, meaning one object is hindering the view onto the object behind it. In this paper, we present a novel event data augmentation approach which addresses this problem by introducing synthetic events for randomly moving objects in a scene. We test our method on multiple DVS classification datasets, resulting in a relative improvement of up to 6.5 % in top-1 accuracy. Moreover, we apply our augmentation technique on the real-world Gen1 Automotive Event Dataset for object detection, where we especially improve the detection of pedestrians by up to 5 %.
1 INTRODUCTION
Dynamic Vision Sensors (DVSs), also known as event
cameras, are vision sensors that register changes in in-
tensity in an asynchronous manner. This allows them
to record information with a much lower latency (in
the range of milliseconds) compared to conventional
RGB cameras. In addition, they have a very wide dy-
namic range (in the range of 140 dB) and can there-
fore even detect motion at night and in poor lighting
conditions. Furthermore, DVS cameras impress with
their low energy consumption and they offer the pos-
sibility to filter out unimportant information (for ex-
ample a stagnant background) in many applications.
However, since this is a fairly recent technology, the amount of available data for deep learning approaches is very limited. Compared to RGB datasets
like ImageNet (Deng et al., 2009) with 14 million im-
ages, event datasets like N-CARS (Sironi et al., 2018)
have only a few thousand labels. This makes data
augmentation very important in order to avoid overfit-
ting and to increase the robustness of the neural net-
work. Even with larger datasets like the Gen1 Automotive Event Dataset (De Tournemire et al., 2020), it has been shown that data augmentation significantly increases the performance of Deep Learning (DL) methods (Gehrig and Scaramuzza, 2023).
Another challenge, particularly in the context of autonomous driving, is the occurrence of occlusion, meaning that some objects are partially covered by other objects. In order to ensure safe driving, DL approaches have to be able to detect these occluded objects nevertheless. Regarding occlusion, current event augmentation methods only consider missing events, either over time or in an area, as is done for RGB images. However, these approaches only model the behavior of an object moving in sync with the camera, which deviates from real-world scenarios. Additionally, automotive scenes are inherently dynamic and event streams possess a temporal component. Consequently, the majority of objects within the scene are typically in motion relative to the camera, which cannot be depicted by simply dropping events at a fixed location.
Our objective is to develop a more realistic augmentation technique that accounts for the additional events generated by moving foreground objects. For this reason, we introduce ShapeAug, an occlusion augmentation approach which simulates the events as well as the occlusions caused by objects moving in front of the camera. Our method utilizes a random number of objects with randomly generated linear movements in the foreground. We evaluate our augmentation technique on the most common event
datasets for classification and further demonstrate its
applicability in a real-world automotive task using
the Gen1 Automotive Event Dataset (De Tournemire
et al., 2020). Since Spiking Neural Networks (SNNs) (Gerstner and Kistler, 2002) share the asynchronous nature as well as the temporal component of event data, they are a natural choice for processing events, and thus we use SNNs for all our experiments.
Our contribution can be summarized through the following points: We introduce ShapeAug, a novel occlusion augmentation method for event data, and assess its effectiveness for classification and object detection tasks. Furthermore, we evaluate the robustness of ShapeAug in comparison to other event augmentation techniques on challenging variants of the DVS-Gesture (Amir et al., 2017) dataset.
2 RELATED WORK
2.1 Occlusion-Aware RGB Image
Augmentation
Regarding RGB image augmentation, there are two
main methods for statistical input-level occlusion:
Hide-and-Seek (Singh and Lee, 2017) and Cutout
(DeVries and Taylor, 2017). Hide-and-Seek divides
the image into G × G patches and removes (mean-
ing zeros out) each patch with a certain probability.
Cutout, on the other hand, selects N squares with a fixed side length at randomly chosen center points in order to drop the underlying pixel values. The work of (Fong and Vedaldi, 2019) builds upon these methods while including a gradient-based saliency method as well as Batch Augmentation (Hoffer et al., 2019). These augmentations, however, are not able to mimic occlusion in real-world event data, since event data has a temporal component, meaning occluding objects would move and produce events themselves.
2.2 Event Data Augmentation
Many data augmentation techniques for event data are
adaptations of augmentation methods for RGB im-
ages. The work of (Li et al., 2022b) applies known ge-
ometric augmentations, including horizontal flipping,
rolling, rotation, shear, Cutout (DeVries and Taylor,
2017) as well as CutMix (Yun et al., 2019). Geometric augmentations are a widely established technique, which is why we use them in combination with our own augmentation method.
CutMix is a method to combine two samples with
labels using linear interpolation. The EventMix (Shen
et al., 2023) augmentation builds upon the idea of
CutMix and applies it on event input data. However,
this method is not able to realistically model occlu-
sion, since it does not consider that the body of a fore-
ground object may completely cover the background
object.
Inspired by Dropout (Srivastava et al., 2014),
the authors of (Gu et al., 2021) propose EventDrop,
which drops events randomly, by time and by area.
However, this method is not able to simulate occlusion in real-world dynamic scenes, since only objects moving in sync with the camera would generate no additional events themselves. Therefore, our method simulates not only the occlusion caused by foreground objects but also their movement and the events resulting from it.
3 METHOD
3.1 Event Data Handling
An event camera outputs an event of the form e_i = (x_i, y_i, t_i, p_i) when the pixel at position (x_i, y_i) registers a logarithmic intensity change at time t_i with a positive or negative polarity p_i ∈ {0, 1}. Due to the asynchronous nature of the camera, its output is very sparse and thus difficult to handle by neural networks. Therefore, we create event histograms E in the shape (T, 2, H, W) with (H, W) as the height and width of the event sensor, where one event sample is split into T time steps. We keep the polarities separated and feed the time steps consecutively into the network. A set of events E is thus processed in the following way:
E(\tau, p, x, y) = \sum_{e_i \in E} \delta(\tau - \tau_i)\,\delta(p - p_i)\,\delta(x - x_i)\,\delta(y - y_i), (1)

\tau_i = \left\lfloor \frac{t_i - t_a}{t_b - t_a} \cdot T \right\rfloor, (2)

with δ(·) as the Kronecker delta function and t_a, t_b as the start and end time of the event sample.
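As an illustration of this binning, a minimal NumPy sketch is given below; the function name, the argument layout, and the handling of the last timestamp are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def events_to_histogram(x, y, t, p, T, H, W):
    """Bin an event stream into a (T, 2, H, W) histogram following Eq. (1) and (2).

    x, y : integer pixel coordinates of the events
    t    : timestamps of the events
    p    : polarities in {0, 1}
    """
    t_a, t_b = t.min(), t.max()  # start and end time of the event sample
    # Eq. (2): assign every event to one of T discrete time steps
    tau = ((t - t_a) / (t_b - t_a + 1e-9) * T).astype(np.int64)
    tau = np.clip(tau, 0, T - 1)  # events at t == t_b fall into the last step
    hist = np.zeros((T, 2, H, W), dtype=np.float32)
    # Eq. (1): accumulate the event counts per (time step, polarity, pixel)
    np.add.at(hist, (tau, p.astype(np.int64), y.astype(np.int64), x.astype(np.int64)), 1.0)
    return hist
```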
3.2 Shape Augmentation
Our occlusion-aware event augmentation is based on
random objects moving on linear paths in the fore-
ground. Since event streams have a temporal com-
ponent, it is necessary to treat them similar to videos
instead of image data. Thus, it is important to avoid unnecessarily corrupting the temporal relations between time steps or frames. That is why we keep the augmentation consistent between time steps in an event stream, similar to how the authors of (Isobe et al., 2020) keep their augmentation consistent over all frames in a video sequence.
Figure 1: Visualization of the shape parameters (position, size and direction) that are randomly chosen for each simulated object (a: shape movement). The objects move between timesteps and are used to simulate the events that their movement would cause (b: positive and negative event computation).
To do so, we generate N ∈ [1, 5] random objects (circle, rectangle, ellipse), where each object o gets assigned a random starting position (x_o, y_o), a random size h_o, w_o ∈ [3px, s_max], as well as a speed v_o and an angle γ_o. The shape parameters are illustrated in Figure 1a. We draw these objects at their respective positions and create frames for each timestep of the input event histogram. Between frames, the objects move linearly in the direction of a vector that is based on their assigned angle and speed:
\begin{bmatrix} x_o^{t+1} \\ y_o^{t+1} \end{bmatrix} = \begin{bmatrix} x_o^{t} \\ y_o^{t} \end{bmatrix} + v_o \begin{bmatrix} \cos\gamma_o \\ \sin\gamma_o \end{bmatrix}, (3)
with t as the index of the frame. Whenever an object
moves outside of the frame, a new object is created to
maintain a consistent object count within the frame.
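A possible implementation of the shape sampling and of the linear update in Eq. (3) is sketched below; the speed range and the dictionary layout are illustrative assumptions, only the parameter ranges N ∈ [1, 5] and h_o, w_o ∈ [3px, s_max] follow the text.

```python
import numpy as np

rng = np.random.default_rng()

def sample_shape(H, W, s_max):
    """Draw one random foreground object (kind, position, size, speed, direction)."""
    return {
        "kind": rng.choice(["circle", "rectangle", "ellipse"]),
        "x": rng.uniform(0, W), "y": rng.uniform(0, H),          # starting position
        "h": rng.uniform(3, s_max), "w": rng.uniform(3, s_max),  # size in pixels
        "v": rng.uniform(1.0, 10.0),                             # speed (assumed range)
        "gamma": rng.uniform(0.0, 2.0 * np.pi),                  # movement direction
    }

def step(shape, H, W, s_max):
    """Move a shape linearly according to Eq. (3); re-spawn it if it leaves the frame."""
    shape["x"] += shape["v"] * np.cos(shape["gamma"])
    shape["y"] += shape["v"] * np.sin(shape["gamma"])
    inside = 0 <= shape["x"] < W and 0 <= shape["y"] < H
    return shape if inside else sample_shape(H, W, s_max)
```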
In order to keep the computational overhead of the augmentation low, we choose a straightforward simulation technique for the events caused by
the moving objects. Since real DVSs register changes
in the intensity, we simply use the difference between
consecutive frames to find areas of events as illus-
trated in Figure 1b. In these frames the background
is colored black and the shapes are assigned the color
gray. We further clip the frame difference to the mean
of the non-zero event sample values in order to re-
semble the input more closely. If the difference at an
image position is positive, it corresponds to a posi-
tive polarity event. Conversely, a negative difference
indicates an event with negative polarity. However,
since DVSs exhibit a certain level of noise, we remove
events with a probability of p = 0.2.
We then include the generated events in the fore-
ground of the original sample. However, since our
goal is the modeling of occlusion, we remove the
events in the sample that would be occluded by our
simulated objects. This is because events are only
triggered at the edge of moving shapes, which are ho-
mogeneously colored, not inside of them, where no
intensity changes occur. The whole pipeline of our
ShapeAug method is visualized in Figure 2.
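Putting the previous steps together, the frame-difference event simulation and the occlusion handling could look roughly like the sketch below. It assumes the shapes have already been rendered into T+1 grayscale frames (black background, gray shapes); the drop probability of p = 0.2 and the clipping to the mean non-zero value follow the text, while everything else is an assumption.

```python
import numpy as np

def shape_augment(event_hist, shape_frames, drop_p=0.2, rng=None):
    """Augment a (T, 2, H, W) event histogram with synthetic events from moving shapes.

    shape_frames : (T + 1, H, W) grayscale renderings of the shapes per time step
                   (background = 0, shapes = gray value > 0)
    """
    rng = rng or np.random.default_rng()
    clip_val = event_hist[event_hist > 0].mean()  # clip differences to the mean value
    out = event_hist.copy()
    for t in range(event_hist.shape[0]):
        diff = shape_frames[t + 1].astype(np.float32) - shape_frames[t]
        diff = np.clip(diff, -clip_val, clip_val)
        pos = np.clip(diff, 0, None)            # positive difference -> positive polarity
        neg = np.clip(-diff, 0, None)           # negative difference -> negative polarity
        keep = rng.random(diff.shape) > drop_p  # randomly drop events to mimic noise
        occluded = shape_frames[t + 1] > 0      # pixels covered by a shape in this step
        out[t][:, occluded] = 0                 # remove occluded events of the sample
        out[t, 0] += pos * keep                 # add synthetic positive events
        out[t, 1] += neg * keep                 # add synthetic negative events
    return out
```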
4 EXPERIMENTS AND RESULTS
4.1 Datasets
We validate our approach for classification on four
DVS datasets, including two simulated image-based
datasets and two recordings of real movements with
a DVS. DVS-CIFAR10 (Li et al., 2017) is a con-
verted DVS version of the Cifar10 (Krizhevsky, 2012)
dataset. It includes 10,000 event streams, which were
recorded by smoothly moving the respective image in
front of a DVS with a resolution of 128px × 128px.
N-Caltech101 (Orchard et al., 2015) is likewise a con-
verted dataset based on Caltech101 (Fei-Fei et al.,
2004) containing 8709 images with varying sizes.
N-CARS (Sironi et al., 2018) is a real world DVS
dataset for vehicle classification. It contains 15422
training and 8607 test samples with a resolution of
120px × 100px. DVS-Gesture (Amir et al., 2017) is a
real world event dataset for gesture recognition. The
dataset includes 11 hand gestures from 29 subjects re-
sulting in 1342 samples with a size of 128px × 128px.
For the datasets that do not include a pre-defined train-validation split, we use the same split as (Shen et al., 2023). We further resize all event streams to a resolution of 80px × 80px using bilinear interpolation before applying any augmentation, and divide them into 10 timesteps.
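This preprocessing could be realized, for instance, with PyTorch's bilinear interpolation on the already binned histograms; the sketch below is an assumption about the pipeline, not the authors' code.

```python
import torch
import torch.nn.functional as F

def resize_histogram(hist, size=(80, 80)):
    """Bilinearly resize a (T, 2, H, W) event histogram to the target resolution."""
    # F.interpolate expects (N, C, H, W); the time dimension acts as the batch here.
    return F.interpolate(hist, size=size, mode="bilinear", align_corners=False)

# Example: a DVS-Gesture sample with 10 time steps at 128 x 128 -> 10 x 2 x 80 x 80
small = resize_histogram(torch.zeros(10, 2, 128, 128))
```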
For the object detection task we choose the Gen1
Automotive Event Dataset (De Tournemire et al.,
2020), which is recorded by a DVS with a resolution
of 304px × 240px during diverse driving scenarios. It
includes 255k labels with bounding boxes for pedes-
trians and cars. Like previous work (Li et al., 2022a),
(Gehrig and Scaramuzza, 2023), (Perot et al., 2020),
we remove bounding boxes with a diagonal less than
30px or a width or height less than 10px. We create
discretized samples with a window size of 125ms and
further divide them into 5 timesteps.
4.2 Implementation
We choose SNNs for conducting all our experiments,
since their asynchronous and temporal nature fits the processing of event data well.
Figure 2: ShapeAug pipeline example for DVS-Gesture (Amir et al., 2017): grayscale shape frames, frame difference with noise, original frame, and augmented frame.
Furthermore, SNNs are inherently more energy-efficient than comparable Artificial Neural Networks (ANNs), making them particularly useful for real-world automotive applications.
For our classification experiments, we adopt the
training implementation of (Shen et al., 2023) and
therefore use a preactivated ResNet34 (He et al.,
2016) in combination with PLIF neurons (Fang et al.,
2021). As an optimizer we use AdamW (Loshchilov
and Hutter, 2017) with a learning rate of 0.000156
and a weight decay of 1 × 10⁻⁴. We further apply a
cosine decay to the learning rate for 200 epochs. Be-
cause it is already an established practice, we apply geometric augmentation in all our experiments, including the runs for our baseline. Similar to (Shen et al., 2023), we therefore apply random cropping to a size of 80px × 80px after padding with 7px, as well as random horizontal flipping and random rotation by up to 15°.
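A minimal PyTorch sketch of this training setup is shown below; the model is only a placeholder for the spiking ResNet34 and the epoch body is omitted, so only the stated hyperparameters are taken from the text.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 10)  # placeholder for the PLIF-based preactivated ResNet34
optimizer = AdamW(model.parameters(), lr=0.000156, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=200)  # cosine decay over 200 epochs

for epoch in range(200):
    # ... one training epoch over the augmented event histograms ...
    scheduler.step()
```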
Regarding our object detection experiments, we
follow the example of (Cordone et al., 2022) and
choose a Spiking DenseNet (Huang et al., 2017) with
SSD (Liu et al., 2016) heads as our network. We
also use an AdamW optimizer (Loshchilov and Hut-
ter, 2017) with a learning rate of 5 × 10⁻⁴ and a cosine decay for 100 epochs. Moreover, we utilize a weight decay of 1 × 10⁻⁴ as well as a batch size of
64. As in (Cordone et al., 2022), we apply a smooth
L1 loss for the box regression and the Focal Loss (Lin
et al., 2020) for the classification task. Following the
example of (Gehrig and Scaramuzza, 2023), we uti-
lize the following geometrical augmentations for all
experiments: Random zoom-in and -out as well as
random horizontal flipping. Since rotations distort
bounding boxes, we choose to not apply this augmen-
tation method for the object detection task. As a met-
ric for our results, we use the mean Average Precision
(mAP) over 10 IoU thresholds [.5:.05:.95].
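The loss combination could be assembled as in the following sketch; torchvision's sigmoid_focal_loss serves as the focal loss here, and the alpha/gamma values are common defaults rather than values reported above.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(box_pred, box_target, cls_logits, cls_target):
    """Smooth L1 loss for box regression plus focal loss for classification.

    cls_target is expected as a float one-hot tensor of the same shape as cls_logits.
    """
    reg_loss = F.smooth_l1_loss(box_pred, box_target, reduction="mean")
    cls_loss = sigmoid_focal_loss(cls_logits, cls_target,
                                  alpha=0.25, gamma=2.0, reduction="mean")
    return reg_loss + cls_loss
```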
4.3 Event Data Classification
Table 1 shows the results of our augmentation tech-
nique on various event classification datasets using
different maximum shape sizes s_max. Our method is
able to outperform the baseline for all four datasets.
Especially the data that was recorded based on RGB images benefits greatly from the shape augmentation. This may be caused by the similarity of the movements between the recording and the simulated shapes. More dynamic scenes may require more complex movements of the shapes, which, however, can lead to an increased simulation overhead. Addition-
ally, it is challenging to further improve the results
on the N-CARS dataset, since the baseline is already
able to nearly perfectly classify the validation set. The
results also show that, in the majority of cases, even
using s_max = 50 improves the results. However, the
best choice of shape sizes depends on the actual ob-
jects depicted in the dataset.
4.4 Comparison with Existing
Literature on Robustness
In order to evaluate and compare the robustness of
our event augmentation, we create three challeng-
ing validation datasets based on DVS Gesture (Amir
et al., 2017) using the following augmentations on
each sample: Geometric (horizontal flipping, rotation,
cropping), EventDrop (Gu et al., 2021) and ShapeAug
with s_max = 30. Notably, we decided not to include
the EventMix (Shen et al., 2023) for validation aug-
Table 1: Comparison of classification results using different max shape sizes s_max. We report the top-1 accuracy as well as the top-5 accuracy in brackets, except for datasets with fewer than 5 classes. The best and second best results are shown in bold and underlined, respectively.
Method Max Shapesize [px] DVS-CIFAR10 N-Caltech101 N-CARS DVS-Gesture
Geo - 73.8 (95.5) 62.2 (81.5) 97.1 89.8 (99.6)
Geo + ShapeAug 10 74.3 (95.1) 68.0 (83.9) 97.3 90.9 (99.6)
Geo + ShapeAug 30 73.9 (94.7) 68.7 (86.9) 96.9 91.7 (100)
Geo + ShapeAug 50 75.7 (96.7) 68.2 (85.2) 96.9 90.5 (99.2)
Table 2: Comparison of the robustness of current event augmentation approaches on various augmented versions of DVS
Gesture (Amir et al., 2017). The best and second best results are shown in bold and underlined respectively.
Train \ Valid    -    Geo    Drop    Shape
Geo 89.8 87.5 58.0 63.6
Geo + Drop 89.8 87.9 86.0 73.9
Geo + Mix 93.1 92.4 82.6 76.9
Geo + Shape 91.7 90.5 84.1 87.9
Geo + Drop + Shape 91.7 90.2 88.6 87.5
Geo + Mix + Shape 94.7 91.7 87.5 91.3
Geo + Mix + Drop 92.8 89.8 89.8 75.4
Geo + Mix + Drop + Shape 95.8 94.7 92.8 89.8
mentation due to its generation of multi-label sam-
ples, which would create an unfair comparison, as the
other methods were not trained for that case. The re-
sults of our experiments can be found in Table 2.
Since augmentation techniques have the unique
property that they can be combined, we decided to
not only compare the existing approaches but also in-
vestigate their combination during training. Generally, we applied every augmentation with a probability of p = 0.5, resulting in some training examples undergoing multiple augmentations. Geometric augmentation was again universally applied to all training samples.
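Such a probabilistic combination can be expressed as a simple chain over the event histogram, as in the sketch below; the augmentation callables are placeholders, and a mixing augmentation like EventMix would additionally require a second sample and label handling.

```python
import random

def augment(sample, augmentations, p=0.5):
    """Apply each augmentation independently with probability p.

    `augmentations` is a list of callables, e.g. [event_drop, shape_aug];
    geometric augmentation is assumed to be applied to every sample beforehand.
    """
    for aug in augmentations:
        if random.random() < p:
            sample = aug(sample)
    return sample
```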
Comparison to State-of-the-Art. The results on the standard validation dataset show that ShapeAug is able to improve the network's performance much more than EventDrop, which achieves approximately the same accuracy as the baseline. Generally, EventMix has the greatest positive impact on the outcome. Nevertheless, it has to be noted that the mixing of samples allows the network to see every sample multiple times during an epoch, which is not the case for the other augmentation techniques.
Robustness. On most of the augmented validation data as well, ShapeAug is able to outperform EventDrop, demonstrating its increased robustness. Only on the drop-augmented data does EventDrop naturally perform better than ShapeAug and EventMix. Furthermore, no method that was trained without ShapeAug is able to perform well on the shape-augmented data (> 10% lower accuracy than ShapeAug), showing a lack of robustness against moving foreground objects in event data. Conversely, methods trained with ShapeAug or EventMix are still capable of predicting drop-augmented data (only 2-4% lower accuracy compared to EventDrop), which proves their natural robustness against this augmentation.
Combination of Methods. Our ShapeAug method
shows great potential for being combined with other
augmentation techniques, since it is always able to
improve the performance on the standard as well as
all the challenging validation data. The usage of EventDrop, on the other hand, even leads to a decreased accuracy on most of the validation data when combined with multiple approaches. The best performance, outperforming the baseline by about 6%, is achieved when all event augmentation techniques are combined during training. Overall,
however, we find that our ShapeAug augmentation
can nearly compensate for all the benefits of utilizing
drop-augmentation.
4.5 Automotive Object Detection
We further test our ShapeAug technique for object
detection on the Gen1 Automotive Event Dataset
(De Tournemire et al., 2020). The results can be seen
in Table 3 and show that ShapeAug is able to increase
the performance of the detection. It especially has
Table 3: Object detection test results on the Gen1 Automotive Event Dataset (De Tournemire et al., 2020).
Method    Max shape size [px]    mAP    AP_50    AP_75    AP_car    AP_ped
Geo - 29.26 55.20 26.99 41.13 17.37
Geo + ShapeAug 50 29.60 56.99 26.70 40.91 18.30
Geo + ShapeAug 100 27.43 52.16 25.59 40.09 14.78
Geo + ShapeAug 150 26.33 50.59 24.36 39.43 13.24
Table 4: Evaluation of robustness over multiple augmented test sets of the Gen1 Automotive Event Dataset (De Tournemire
et al., 2020).
Method Max Shape size [px] - Geo Drop Shape
Geo - 29.26 29.15 23.48 25.68
Geo + ShapeAug 50 29.60 28.76 24.21 27.94
a positive impact on the bounding box prediction of
pedestrians, where it increases the AP by over 5%.
Pedestrians are in general much more challenging to
detect, since they usually appear smaller than cars in
the images and their movements as well as their ap-
pearances are much more complex and have a high
variance. Furthermore, they are more prone to be occluded by other traffic participants, which may be the reason why ShapeAug especially benefits their detection. However, the results also indicate that the size of the shapes has a significant impact on the predictions. Compared to the classification datasets, the objects in the Gen1 Automotive Event Dataset can appear very small, and too much occlusion may hinder the training signal from passing through the network effectively. Furthermore, the movements of objects in real-world automotive scenes are very complex and do not just follow a linear pattern. Therefore, the occlusion augmentation can be further improved by increasing the complexity of movements and shapes, which, however, will also
increase the simulation overhead.
Robustness Analysis. For the object detection task as well, we examined the robustness of ShapeAug on different augmented test sets of the Gen1 Automotive Event Dataset (De Tournemire et al., 2020). Regard-
ing the geometrical augmentation, we applied zoom
(either zoom-in or zoom-out) on all images as well
as horizontal flipping with a probability of p = 0.5.
Since our experiments were done on pre-processed
event data in order to decrease the computations dur-
ing training, we opted for Random Erasing (Zhong
et al., 2020), which randomly erases rectangles of the
input image, as our drop augmentation. Table 4 con-
tains the results of our robustness evaluation. If the
appropriate size of shapes is chosen, ShapeAug in-
deed improves the results for the prediction on shape-
augmented as well as on drop-augmented data. This means that for tasks where a high degree of occlusion during inference is expected, ShapeAug can be a valuable technique to increase prediction performance. However, it is necessary to choose appropriate values for the hyperparameters, including the shape size, the number of objects, and the movement pattern of the shapes.
5 CONCLUSION
Augmentation for event data during the training of
neural networks is crucial in order to ensure robust-
ness as well as to avoid overfitting and to improve
accuracy. In this work, we introduced ShapeAug,
an augmentation technique simulating moving fore-
ground objects in event data. Our method includes the
simulation of a random amount of objects, moving on
randomly chosen linear paths, and using the resulting
events from their movement. Since the objects are
in the foreground, ShapeAug allows the modeling of
realistic occlusions, since in real world scenarios oc-
cluding objects would cause the generation of events.
We show the effectiveness of our approach on the
most common event classification datasets, where it
is able to improve the accuracy significantly. Fur-
thermore, ShapeAug proves to increase the robustness
of predictions on a set of challenging validation data
and is able to outperform other event drop augmen-
tations. Our technique can also be easily combined with other augmentation methods, leading to an even higher boost in prediction performance. Additionally, ShapeAug improves object detection on a real-world automotive dataset and further enhances
the robustness against various augmentations on the
test dataset.
Currently, our shape augmentation method only
simulates very simple, homogeneously colored shapes
and their movements. It remains for future work to ex-
plore the simulation of complex movements and more
sophisticated textures and object shapes.
ACKNOWLEDGEMENTS
This work was funded by the Carl Zeiss Stiftung,
Germany under the Sustainable Embedded AI project
(P2021-02-009) and partially funded by the Federal
Ministry of Education and Research Germany under
the project DECODE (01IW21001).
REFERENCES
Amir, A., Taba, B., Berg, D., Melano, T., McKinstry, J.,
Di Nolfo, C., Nayak, T., Andreopoulos, A., Garreau,
G., Mendoza, M., Kusnitz, J., Debole, M., Esser, S.,
Delbruck, T., Flickner, M., and Modha, D. (2017). A
low power, fully event-based gesture recognition sys-
tem. In Conference on Computer Vision and Pattern
Recognition (CVPR).
Cordone, L., Miramond, B., and Thierion, P. (2022). Object
detection with spiking neural networks on automotive
event data. In International Joint Conference on Neu-
ral Networks (IJCNN).
De Tournemire, P., Nitti, D., Perot, E., Migliore, D.,
and Sironi, A. (2020). A large scale event-based
detection dataset for automotive. arXiv preprint
arXiv:2001.08499.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In Conference on Computer Vision
and Pattern Recognition (CVPR).
DeVries, T. and Taylor, G. W. (2017). Improved regular-
ization of convolutional neural networks with cutout.
arXiv preprint arXiv:1708.04552.
Fang, W., Yu, Z., Chen, Y., Masquelier, T., Huang, T., and
Tian, Y. (2021). Incorporating learnable membrane
time constant to enhance learning of spiking neural
networks. In International Conference on Computer
Vision (ICCV).
Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning gen-
erative visual models from few training examples: An
incremental bayesian approach tested on 101 object
categories. In Conference on Computer Vision and
Pattern Recognition Workshop (CVPRW).
Fong, R. and Vedaldi, A. (2019). Occlusions for effective
data augmentation in image classification. In Interna-
tional Conference on Computer Vision Workshop (IC-
CVW).
Gehrig, M. and Scaramuzza, D. (2023). Recurrent vision
transformers for object detection with event cameras.
In Conference on Computer Vision and Pattern Recog-
nition (CVPR).
Gerstner, W. and Kistler, W. M. (2002). Spiking Neu-
ron Models: Single Neurons, Populations, Plasticity.
Cambridge University Press.
Gu, F., Sng, W., Hu, X., and Yu, F. (2021). Eventdrop:
data augmentation for event-based learning. In International Joint Conference on Artificial Intelligence (IJCAI).
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity
mappings in deep residual networks. In European
Conference on Computer Vision (ECCV).
Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T.,
and Soudry, D. (2019). Augment your batch: better
training with larger batches. arXiv preprint.
Huang, G., Liu, Z., Maaten, L. V. D., and Weinberger, K. Q.
(2017). Densely connected convolutional networks. In
Conference on Computer Vision and Pattern Recogni-
tion (CVPR).
Isobe, T., Han, J., Zhuz, F., Liy, Y., and Wang, S.
(2020). Intra-clip aggregation for video person re-
identification. In International Conference on Image
Processing (ICIP).
Krizhevsky, A. (2012). Learning multiple layers of features
from tiny images. University of Toronto.
Li, H., Liu, H., Ji, X., Li, G., and Shi, L. (2017). CIFAR10-
DVS: An event-stream dataset for object classifica-
tion. Frontiers in Neuroscience.
Li, J., Li, J., Zhu, L., Xiang, X., Huang, T., and Tian,
Y. (2022a). Asynchronous spatio-temporal memory
network for continuous event-based object detection.
Transactions on Image Processing.
Li, Y., Kim, Y., Park, H., Geller, T., and Panda, P. (2022b).
Neuromorphic data augmentation for training spiking
neural networks. In European Conference on Com-
puter Vision (ECCV).
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P.
(2020). Focal loss for dense object detection. Trans-
actions on Pattern Analysis and Machine Intelligence
(PAMI).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot
MultiBox detector. In European Conference on Com-
puter Vision (ECCV).
Loshchilov, I. and Hutter, F. (2017). Decoupled weight
decay regularization. In International Conference on
Learning Representations (ICLR).
Orchard, G., Jayawant, A., Cohen, G. K., and Thakor, N.
(2015). Converting static image datasets to spiking
neuromorphic datasets using saccades. Frontiers in
Neuroscience.
Perot, E., de Tournemire, P., Nitti, D., Masci, J., and
Sironi, A. (2020). Learning to detect objects with a 1
megapixel event camera. In Neural Information Pro-
cessing Systems (NeurIPS).
Shen, G., Zhao, D., and Zeng, Y. (2023). Eventmix: An
efficient data augmentation strategy for event-based
learning. Information Sciences.
Singh, K. K. and Lee, Y. J. (2017). Hide-and-seek: Forc-
ing a network to be meticulous for weakly-supervised
object and action localization. In International Con-
ference on Computer Vision (ICCV).
Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., and
Benosman, R. (2018). Hats: Histograms of averaged
time surfaces for robust event-based object classifica-
tion. In Conference on Computer Vision and Pattern
Recognition (CVPR).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: A simple way
to prevent neural networks from overfitting. Journal
of Machine Learning Research (JMLR).
Yun, S., Han, D., Chun, S., Oh, S. J., Yoo, Y., and Choe,
J. (2019). CutMix: Regularization strategy to train
strong classifiers with localizable features. In Interna-
tional Conference on Computer Vision (ICCV).
Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020).
Random erasing data augmentation. In AAAI confer-
ence on artificial intelligence.