EveCamNet: A Fusion Framework of Event and RGB Camera Towards
Detecting Drivable Area for Autonomous Vehicles
Jyoti M Satihal (https://orcid.org/0009-0009-5800-0496), Safa Imtihaz Sayyad (https://orcid.org/0009-0002-9888-5865) and Ujwala Patil (https://orcid.org/0000-0001-5776-785X)
KLE Technological University, Vidyanagar, Hubballi, Karnataka, 580031, India
Keywords: Event Camera, Drivable Area, Dynamic Conditions.
Abstract:
We propose EveCamNet, a novel framework that fuses features from RGB and event cameras to detect drivable
areas in dynamic scenarios. This framework enhances semantic segmentation for identifying drivable regions
in autonomous driving systems. While RGB cameras excel at capturing spatial and textural details, they often
struggle in dynamic conditions, such as motion blur and varying lighting. In contrast, event cameras offer high
temporal resolution and low latency, but they lack detailed spatial context. Our proposed framework effectively
combines the strengths of both modalities through attention-based feature fusion mechanisms and a robust
loss function incorporating Cross-Entropy and Dice losses. Using an open-source dataset, we evaluate the
performance of EveCamNet, achieving a mean Intersection over Union (mIoU) of 69.94% and Pixel Accuracy
(PA) of 90.54%. These results highlight the potential of RGB-Event fusion as a promising approach for
advancing autonomous driving systems.
1 INTRODUCTION
The rapid advancements in autonomous driving have made intelligent mobility an important aspect of modern transportation. Safe and efficient navigation requires a reliable understanding of the environment, which in turn demands addressing the challenges it poses, especially in diverse and complex scenarios. One of the fundamental tasks in this do-
main is the detection of drivable areas, which serves
as the basis for path planning, obstacle avoidance, and
overall navigation. This task becomes even more dif-
ficult in unstructured environments, such as settings
with dynamic objects, varying terrains, or rural roads
that lack clear lane markings.
Semantic segmentation has emerged as a key tech-
nique to address the challenges of drivable area detec-
tion (Qiao and Zulkernine, 2021) (Jain et al., 2023).
Unlike traditional object detection (Zou et al., 2023),
which provides bounding-box outputs, semantic seg-
mentation offers pixel-level classification of road sur-
faces and surrounding elements. This detailed scene
understanding is crucial for accurately distinguishing
between drivable and non-drivable areas, even in the
presence of environmental variations.
RGB cameras are widely used for semantic seg-
mentation due to their ability to capture detailed spa-
tial and color information. They are cost-effective,
scalable, and capable of handling a variety of road
scenarios. However, they face significant limitations
in dynamic environments, such as those involving
fast-moving objects, motion blur, or changes in il-
lumination. These challenges necessitate the use of
complementary sensors to fill in the gaps left by RGB
data.
In recent applications of autonomous driving,
event cameras have attracted attention due to their
unique functionality. Unlike conventional cameras,
event cameras operate asynchronously, capturing
pixel-level changes in brightness with microsecond-
level temporal resolution and low latency. Additionally, they offer a dynamic range of 140 dB, significantly higher than the 60 dB typically found in standard cameras, along with lower power consumption (Shariff et al., 2024) (Gallego et al., 2022). This
makes them highly effective in scenarios with rapid
motion or abrupt lighting changes, where traditional
cameras often struggle. Their sparse data output and
resilience to motion blur make them particularly suit-
able for detecting dynamic objects. Despite their
strengths, event cameras lack the spatial richness re-
quired for tasks such as semantic segmentation. Addi-
tionally, aggregating event data into frame-based rep-
resentations requires careful handling to align tempo-
ral and spatial information effectively.
To overcome the individual limitations of RGB
and event cameras, RGB-Event fusion has emerged
as a promising approach. By combining the comple-
mentary strengths of these modalities, fusion-based
frameworks can generate richer and more reliable
scene representations. For semantic segmentation,
this fusion enables models to leverage the spatial de-
tail of RGB data and the motion sensitivity of event
data, resulting in improved performance across di-
verse driving conditions. While RGB-Event fusion
has been explored for tasks like object detection and
tracking (Zhou et al., 2023) (Tomy et al., 2022), its
application to semantic segmentation for drivable area
detection remains underexplored.
In this paper, we propose a novel RGB-Event fu-
sion framework for semantic segmentation to improve
drivable area detection in autonomous driving. The
framework incorporates mid-level cues and utilizes an
attention-based fusion mechanism to effectively inte-
grate and refine RGB and event features across spatial
and channel dimensions. This approach ensures accu-
rate and consistent segmentation, making it suitable
for challenging conditions. Our key contributions in-
clude:
• Integration of RGB and event data to enhance segmentation accuracy by combining the spatial richness of RGB images with the temporal sensitivity of event data, addressing the limitations of standalone modalities.
• Implementation of a robust combined loss function to balance pixel-level accuracy and boundary precision, improving the model's ability to handle segmentation tasks.
• Evaluation of the model's effectiveness using metrics such as mean Intersection over Union (mIoU) and Pixel Accuracy, demonstrating its applicability and reliability across diverse and real-world driving scenarios using DDD17.
The remainder of this paper is structured as follows:
Section 2 reviews related work on various segmenta-
tion techniques that utilize different modalities. Sec-
tion 3 describes the proposed methodology, focus-
ing on input representation, feature extraction, fusion,
and the segmentation process. Section 4 presents the
experimental results and compares them with base-
line methods. Section 5 discusses the findings from
the ablation studies. Finally, Section 6 concludes the
paper and explores potential future directions.
2 LITERATURE SURVEY
Semantic segmentation is an essential technique in
autonomous driving that allows vehicles to catego-
rize each pixel in an image into specific classes. By
doing so, it significantly enhances the vehicle’s abil-
ity to understand and perceive its environment at a
detailed level. The introduction of deep learning,
particularly through Convolutional Neural Networks
(CNNs), has transformed semantic segmentation by
providing more accurate and scalable solutions.
Early deep learning models, particularly Fully
Convolutional Networks (FCNs) (Long et al., 2015),
laid the groundwork for semantic segmentation by al-
lowing for end-to-end learning, which demonstrated
strong performance on RGB datasets like Cityscapes
(Cordts et al., 2016) and KITTI (Geiger et al., 2013).
Semantic segmentation for drivable area has pro-
gressed through two primary approaches: monocular
vision-based methods and multimodal sensor-based
methods (Rasib et al., 2021).
Monocular vision-based methods primarily uti-
lize RGB cameras, which makes them more cost-
effective and easier to implement. Notable exam-
ples of these methods include road detection using
neural networks, as demonstrated by Li et al. (Li
et al., 2022), and instance segmentation techniques
employed by Chan et al. (Chan et al., 2019). Mod-
els like ENet (Almeida et al., 2020) and EdgeNet
(Han et al., 2021) have achieved significant improve-
ments in pixel-level segmentation accuracy for both
structured and unstructured roads. U-Net (Siddique
et al., 2021) has gained popularity in autonomous
driving due to its encoder-decoder architecture and
skip connections, which effectively capture both spa-
tial and semantic features. DeepLabv3+ (Liu et al.,
2021) enhances segmentation capabilities even fur-
ther by utilizing improved convolutions and spatial
pyramid pooling, allowing for better handling of com-
plex scenes.
Despite these advancements, methods relying
solely on RGB cameras struggle in dynamic environ-
ments and varying weather conditions. While RGB
cameras provide rich spatial and color information,
they face limitations such as motion blur, changing
lighting conditions, and difficulties in capturing un-
structured roads without clear lane markings. These
challenges underscore the need to explore comple-
mentary sensing modalities.
Multimodal methods that incorporate LiDAR and
point clouds utilize spatial and depth information to
enhance detection capabilities. For example, Demir
et al. (Demir et al., 2017) developed a point-cloud-
based adaptive method, while Yang et al. (Yang et al.,
2020) introduced a fusion network that combines Li-
DAR with image data. However, these approaches
often face challenges such as high costs and limited
datasets.
A recent advancement in the industry is the use
of event cameras, which provide significant advan-
tages over conventional cameras in various scenarios.
Event cameras have emerged as a promising alter-
native due to their unique ability to capture changes
in brightness at the pixel level with high temporal
resolution. Unlike traditional cameras, event cam-
eras operate asynchronously, making them resilient to
motion blur and variations in lighting (Shariff et al.,
2024) (Gallego et al., 2022).
Research has demonstrated that event cameras
are particularly effective in dynamic scenes involving
fast-moving objects or rapid transitions between light
and shadow. Recent advancements have explored
their utility in various fields, including robotics, aug-
mented reality, and object tracking. The unique abil-
ity of event cameras to operate efficiently under chal-
lenging lighting conditions and their low power con-
sumption further enhance their suitability for wear-
able devices and autonomous systems. However,
the unconventional output of these sensors neces-
sitates developing specialized algorithms to exploit
their potential fully. Emerging methods, ranging from
low-level vision tasks such as optical flow estima-
tion to high-level applications like recognition and
segmentation, underscore the transformative impact
of this technology on computer vision and robotics
(Chakravarthi et al., 2024).
The first significant work in this domain intro-
duced an event-based semantic segmentation dataset
derived from DDD17, utilizing an Xception-type net-
work to demonstrate robust performance, particu-
larly in edge-case scenarios like overexposed im-
ages (Alonso and Murillo, 2019) (Binas et al., 2017)
(Chollet, 2017). Semantic labels for the dataset were
generated using a pre-trained network on grayscale
frames from the DAVIS346B sensor, which aligns
events with frames. However, the DAVIS346B’s
low resolution and image quality introduced signif-
icant artifacts and limited label granularity. Simu-
lated datasets like EventScape, recorded in CARLA,
offer higher-quality labels but suffer from sim-to-real
gaps, reducing real-world applicability (Gehrig et al.,
2021) (Hidalgo-Carrió et al., 2020). Follow-up stud-
ies leveraged synthetic events from video datasets to
enhance performance, while others explored combin-
ing labeled image datasets like Cityscapes with un-
labeled events to reduce dependence on video data
(Cordts et al., 2016) (Gehrig et al., 2019) (Wang et al.,
2021). Despite these advancements, methods relying
solely on events face challenges due to their sparse
data output and lack of spatial richness.
Standalone event-based segmentation struggles
with providing detailed scene understanding due to
the inherently sparse and asynchronous nature of
event data. Techniques like Event-Based Temporal Multi-Scale Aggregation (E-TMA) (Zhou et al., 2023) have at-
tempted to address this by aggregating temporal in-
formation into event frames, improving feature util-
ity. However, even with such advancements, event-
only approaches fall short in high-resolution semantic
segmentation tasks, particularly in scenarios requiring
fine-grained pixel-level classification.
To overcome these limitations, fusion-based tech-
niques integrating RGB and event data have been pro-
posed. Fusion frameworks combine the spatial rich-
ness of RGB images with the temporal sensitivity
of event data, offering a more comprehensive under-
standing of the scene. While early works focused
on object detection and tracking (Zhou et al., 2023)
(Tomy et al., 2022), where fusion demonstrated im-
proved accuracy in dynamic environments, applica-
tions to semantic segmentation remain limited. Ex-
isting methods often rely on simple feature concate-
nation or static weighting, failing to fully exploit the
complementary strengths of the two modalities.
Our work addresses these gaps by proposing
a novel RGB-Event fusion framework for seman-
tic segmentation, specifically designed for drivable
area detection in autonomous driving. By leveraging
attention-based mechanisms for feature refinement
and alignment, our framework dynamically integrates
RGB and event features, addressing their individual
limitations and enhancing segmentation performance.
This approach aims to advance the field by optimiz-
ing RGB-Event fusion for pixel-level tasks, ensuring
robust and scalable solutions for autonomous driving
systems.
3 EVECAMNET’S PROPOSED
FRAMEWORK
This section outlines the methodology for integrating
RGB and event data to achieve semantic segmenta-
tion of drivable areas. The proposed framework, illustrated in Figure 1, is organized into five stages: input representation with temporal multi-scale aggregation of event data, modality-specific feature extraction, attention-based fusion of the RGB and event feature streams, decoding for segmentation, and loss computation. Each element is carefully de-
signed to address the unique challenges posed by dy-
namic driving environments. The following subsec-
tions will provide a detailed explanation of each com-
ponent and its role within the overall framework.
3.1 Input Representation
The proposed model of RGB and event fusion for se-
mantic segmentation utilizes input derived from the
DDD17 dataset (Binas et al., 2017), which contains
synchronized RGB frames and event data. The RGB
frames are captured at a standard rate, providing rich
spatial and textural information vital for understand-
ing scenes. Each RGB frame has a resolution of H ×
W, where H represents the height and W denotes the
width.
In contrast, event data consists of asynchronous
brightness changes recorded by event cameras. This
data is presented as a stream of events, each defined
by a timestamp, polarity (indicating the direction of
brightness change), and spatial coordinates. To adapt
the event data for deep learning architectures, it is ag-
gregated over discrete temporal intervals and trans-
formed into a multi-channel image format. For this
study, the event data is categorized into three tem-
poral scales: short-range, medium-range, and long-
range, each highlighting different motion dynamics
from fine details to broader, slower movements. The
RGB frames and event data are preprocessed and re-
sized to 288 × 288 to ensure consistency and compat-
ibility with the model.
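To make the aggregation concrete, the sketch below shows one way to bin an asynchronous event stream into per-window, two-channel (positive/negative polarity) count images at the three temporal scales. The function name, the (t, x, y, polarity) record layout, and the sensor resolution are illustrative assumptions rather than the exact DDD17 preprocessing.

```python
import numpy as np

def events_to_frames(events, t_ref, windows_ms=(10, 50, 250), sensor_hw=(260, 346)):
    """Bin an event stream into one 2-channel count image per temporal window.

    `events` is assumed to be an (N, 4) array of (t [s], x, y, polarity in {-1, +1});
    the real DDD17 record layout may differ. Frames are later resized to 288 x 288.
    """
    h, w = sensor_hw
    frames = []
    for win_ms in windows_ms:
        frame = np.zeros((2, h, w), dtype=np.float32)  # ch 0: positive, ch 1: negative
        win_s = win_ms / 1000.0
        sel = events[(events[:, 0] >= t_ref - win_s) & (events[:, 0] < t_ref)]
        for t, x, y, p in sel:
            ch = 0 if p > 0 else 1
            frame[ch, int(y), int(x)] += 1.0  # accumulate event counts per pixel
        frames.append(frame)
    return frames  # short-, medium- and long-range representations
```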
3.2 Feature Extraction
Feature extraction for RGB and event data is con-
ducted separately using distinct encoders tailored to
each modality.
RGB Data: The RGB data is processed through
a ResNet-101 encoder, a deep convolutional neural
network designed to capture high-resolution spatial
and textural characteristics. The deeper layers of the
ResNet-101 model capture abstract, global features,
while the earlier layers retain detailed spatial infor-
mation. This encoder analyzes the input RGB frame
and produces a set of hierarchical feature maps, which
are later utilized for fusion and skip connections.
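As a concrete illustration of the RGB branch, the following sketch wraps the torchvision ResNet-101 so that the feature map of each residual stage is exposed for fusion and skip connections; the class name and the decision to return all four stages are our assumptions, not details fixed by the paper.

```python
import torch.nn as nn
import torchvision

class RGBEncoder(nn.Module):
    """Expose hierarchical ResNet-101 feature maps for fusion and skip connections."""
    def __init__(self):
        super().__init__()
        # Pretrained ImageNet weights can be requested via the `weights` argument
        # in recent torchvision versions; omitted here to keep the sketch minimal.
        net = torchvision.models.resnet101()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)         # 1/4 of the input resolution
        c1 = self.layer1(x)      # 1/4,  256 channels
        c2 = self.layer2(c1)     # 1/8,  512 channels
        c3 = self.layer3(c2)     # 1/16, 1024 channels
        c4 = self.layer4(c3)     # 1/32, 2048 channels
        return [c1, c2, c3, c4]  # shallow-to-deep maps reused by fusion and the decoder
```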
Event Data (E-TMA): To handle event data, the
Event-Based Temporal Multiscale Aggregation (E-
TMA) module (Zhou et al., 2023) is used to har-
ness the temporal richness inherent in asynchronous
events. The event data from each temporal scale $\xi_k$ (where $k \in \{1, 2, 3\}$) is transformed into a latent feature space as follows:

$E_k = \phi(\xi_k)$    (1)
where φ represents a shared convolutional projection
layer that includes batch normalization and ReLU ac-
tivation.
To capture hierarchical motion dynamics, pooling
operations with varying kernel sizes are applied:
$e_k = \mathrm{pool}_k(E_k)$    (2)
where the kernel sizes increase with the temporal
range. To standardize the resolutions of these fea-
tures, smaller feature maps undergo upsampling and
are then concatenated:
$F_{\mathrm{event}} = \mathrm{concat}(e_1, \mathrm{up}(e_2), \mathrm{up}(e_3))$    (3)
This combined event representation encapsulates mo-
tion patterns from fine to coarse and is subsequently
fed into a lightweight ResNet-18 encoder for feature
extraction. The event encoder is designed to be com-
putationally efficient while preserving essential mo-
tion information.
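A minimal PyTorch sketch of this aggregation, following Eqs. (1)-(3), is given below; the latent channel count and the pooling kernel sizes per temporal range are assumptions, since the paper does not list the exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ETMA(nn.Module):
    """Sketch of temporal multi-scale aggregation following Eqs. (1)-(3)."""
    def __init__(self, in_ch=2, latent_ch=32, pool_sizes=(1, 2, 4)):
        super().__init__()
        # Shared projection phi: conv + BN + ReLU, applied to every temporal scale xi_k
        self.phi = nn.Sequential(
            nn.Conv2d(in_ch, latent_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(latent_ch),
            nn.ReLU(inplace=True),
        )
        self.pool_sizes = pool_sizes  # larger kernels for longer temporal ranges

    def forward(self, xi):  # xi: list of 3 event tensors of shape [B, in_ch, H, W]
        target_hw = xi[0].shape[-2:]
        feats = []
        for x, p in zip(xi, self.pool_sizes):
            e = self.phi(x)                                      # Eq. (1): E_k = phi(xi_k)
            if p > 1:
                e = F.avg_pool2d(e, kernel_size=p)               # Eq. (2): e_k = pool_k(E_k)
            e = F.interpolate(e, size=target_hw, mode='bilinear',
                              align_corners=False)               # up(e_k) to a common size
            feats.append(e)
        return torch.cat(feats, dim=1)                           # Eq. (3): F_event
```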
3.3 Fusion Network
The fusion network integrates features obtained from
both RGB and event encoders, effectively leveraging
the unique advantages of each modality. Prior to fu-
sion, the event features are transformed to match the
dimensionality of the RGB features:
$F_{\mathrm{aligned}} = \psi(F_{\mathrm{event}})$    (4)
where ψ denotes a 1×1 convolution.
Channel Attention: Channel attention mecha-
nisms are utilized to highlight the most pertinent fea-
ture channels within each modality. The features, en-
hanced by attention, are computed as follows:
$F^{\mathrm{cal}}_{\mathrm{rgb}} = \mathrm{CA}(F_{\mathrm{aligned}}) \odot F_{\mathrm{rgb}} + F_{\mathrm{rgb}}$    (5)

$F^{\mathrm{cal}}_{\mathrm{event}} = \mathrm{CA}(F_{\mathrm{rgb}}) \odot F_{\mathrm{aligned}} + F_{\mathrm{aligned}}$    (6)

Here, CA represents the channel attention module, and $\odot$ signifies element-wise multiplication.
Spatial Attention: Spatial attention is applied to
enhance the feature maps by concentrating on the
most significant spatial areas:
$F^{\mathrm{sp}}_{\mathrm{rgb}} = \mathrm{SA}(F^{\mathrm{cal}}_{\mathrm{event}}) \odot F^{\mathrm{cal}}_{\mathrm{rgb}} + F^{\mathrm{cal}}_{\mathrm{rgb}}$    (7)

$F^{\mathrm{sp}}_{\mathrm{event}} = \mathrm{SA}(F^{\mathrm{cal}}_{\mathrm{rgb}}) \odot F^{\mathrm{cal}}_{\mathrm{event}} + F^{\mathrm{cal}}_{\mathrm{event}}$    (8)
In this context, SA denotes the spatial attention mod-
ule.
Fusion: The refined features are then concatenated and processed through a convolutional layer to generate the fused representation:

$F_{\mathrm{fused}} = \mathrm{Conv3x3}(F^{\mathrm{sp}}_{\mathrm{rgb}} \oplus F^{\mathrm{sp}}_{\mathrm{event}})$    (9)

where $\oplus$ indicates concatenation.
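The sketch below assembles Eqs. (4)-(9) into a single fusion block. It assumes SE-style channel attention, CBAM-style spatial attention, and RGB/event feature maps that already share the same spatial resolution; these design details are our assumptions where the paper leaves them unspecified.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (an assumed design for the CA module)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # [B, C, 1, 1] channel weights

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over channel-pooled statistics (assumed SA design)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # [B, 1, H, W]

class FusionBlock(nn.Module):
    """Cross-modal fusion following Eqs. (4)-(9)."""
    def __init__(self, rgb_ch, event_ch):
        super().__init__()
        self.psi = nn.Conv2d(event_ch, rgb_ch, kernel_size=1)  # Eq. (4): align event features
        self.ca = ChannelAttention(rgb_ch)
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(2 * rgb_ch, rgb_ch, kernel_size=3, padding=1)  # Eq. (9)

    def forward(self, f_rgb, f_event):
        f_aligned = self.psi(f_event)
        f_rgb_cal = self.ca(f_aligned) * f_rgb + f_rgb           # Eq. (5)
        f_evt_cal = self.ca(f_rgb) * f_aligned + f_aligned       # Eq. (6)
        f_rgb_sp = self.sa(f_evt_cal) * f_rgb_cal + f_rgb_cal    # Eq. (7)
        f_evt_sp = self.sa(f_rgb_cal) * f_evt_cal + f_evt_cal    # Eq. (8)
        return self.fuse(torch.cat([f_rgb_sp, f_evt_sp], dim=1))
```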
Figure 1: Architecture for our proposed EveCamNet RGB+Event fusion model (1) The framework initiates with the Input
Representation, which effectively encompasses both grayscale RGB images and event data. (2) The inputs are independently
processed through separate Feature Extraction networks, where RGB features are extracted using ResNet-101 and event
features are extracted using ResNet-18. (3) The extracted feature maps, $F_R$ (RGB features) and $F_E$ (event features), are passed
to a Fusion Network, which combines spatial and temporal information from both modalities. (4) The fused representation
is fed into the Segmentation Decoder to generate pixel-wise segmentation masks. (5) Finally, the Output with Loss Function
compares the predicted segmentation masks with the ground truth (GT) to compute the loss, optimizing the model for precise
segmentation of drivable areas and objects. This pipeline enhances performance in dynamic conditions by utilizing the complementary strengths of RGB and event data.
3.4 Segmentation Decoder
The combined feature map F
fused
is transformed into
a pixel-level segmentation map using an architecture
inspired by UNet (Siddique et al., 2021). This de-
coder includes multiple upsampling layers along with
convolutional operations that work together to gradu-
ally restore the original spatial resolution of the input.
Skip connections are used between matching lay-
ers in the encoder and decoder to retain important spa-
tial information that might otherwise be lost during
downsampling. For example, the output from an ear-
lier encoder layer is concatenated with the output of the cor-
responding decoder layer, which helps maintain fine
details in the spatial data:
$F^{(l)}_{\mathrm{decoder}} = \mathrm{concat}(F^{(l)}_{\mathrm{encoder}}, \mathrm{up}(F^{(l+1)}_{\mathrm{decoder}}))$    (10)
where l represents the layer index.
The final segmentation map S is generated using a
softmax activation function, which converts the out-
put into class probabilities for each pixel:
$S = \sigma(\mathrm{Conv3x3}(F^{\mathrm{final}}_{\mathrm{decoder}}))$    (11)
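A minimal sketch of one decoder stage and the segmentation head, corresponding to Eqs. (10) and (11), is shown below; channel widths, normalization choices, and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One UNet-style decoder stage: upsample, concatenate the skip feature, refine (Eq. 10)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear', align_corners=False)
        return self.refine(torch.cat([skip, x], dim=1))  # Eq. (10)

class SegHead(nn.Module):
    """Final 3x3 conv plus softmax producing per-pixel class probabilities (Eq. 11)."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.softmax(self.conv(x), dim=1)
```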
3.5 Loss Function
The model produces a pixel-wise probability map, de-
noted as S, which is then compared to the actual seg-
mentation map, P, to calculate the loss. During train-
ing, we aim to minimize a combined loss function
represented as:
$L = L_{\mathrm{ce}} + \beta \cdot L_{\mathrm{dice}}, \quad \beta = 0.5$    (12)

In this equation, $L_{\mathrm{ce}}$ (cross-entropy loss) focuses on ensuring accurate classification at the pixel level, while $L_{\mathrm{dice}}$ (Dice loss) helps tackle class imbalance and improves the segmentation of boundaries, calculated by the formula:

$L_{\mathrm{dice}} = 1 - \frac{2 \cdot P \cdot \hat{P}}{P + \hat{P} + \varepsilon}$    (13)

Here, $\hat{P}$ is the predicted segmentation map, and $\varepsilon$ is a small constant to avoid division by zero.
The cross-entropy loss penalizes wrong class pre-
dictions, while the Dice loss emphasizes enhancing
the overlap between the predicted areas and the ac-
tual ground truth, especially at the edges. By combin-
ing these loss functions, the model effectively learns
not only to classify accurately but also to delineate
boundaries precisely, resulting in strong segmentation
performance even in difficult situations.
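The combined objective of Eqs. (12)-(13) can be implemented as in the sketch below; the per-class soft Dice formulation and the value of $\varepsilon$ are common choices we assume here, since the paper states only the overall form.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, beta=0.5, eps=1e-6):
    """Cross-entropy plus beta-weighted Dice loss, as in Eqs. (12)-(13).

    logits: [B, C, H, W] raw scores; target: [B, H, W] integer class labels.
    """
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)  # per-class soft Dice loss
    return ce + beta * dice.mean()
```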
4 RESULTS AND DISCUSSION
This section presents a concise evaluation of the
RGB-Event fusion framework, covering the dataset,
training process, and performance analysis of results.
We compare our model’s performance with estab-
lished baselines, offering a clear understanding of the
pipeline’s capabilities.
4.1 Dataset
The DDD17 dataset (Binas et al., 2017) plays a cru-
cial role in examining how RGB and event data can
work together for segmenting drivable areas. It cap-
tures urban driving situations using an event-based
camera paired with standard RGB frames, allowing
for asynchronous brightness changes (events) that are
aligned with detailed RGB images. This dataset is
excellent for providing high temporal resolution and
a wide dynamic range, which are essential for navi-
gating complex driving environments.
For our project, we focused on a subset of se-
quences that met our criteria for urban environments
with good visibility, using 15,950 frames for training
and 3,890 frames for testing across the dataset’s tem-
poral range.
In the training phase, we integrated events over
a 50ms period to capture the necessary temporal de-
tails. For testing, we evaluated the model at 10ms and
250ms intervals to check its performance across dif-
ferent speeds. This approach ensures that the model
can adapt to both fast-paced and slower situations en-
countered in autonomous driving.
4.2 Training
The training process focused on fine-tuning a fusion-
based segmentation model that integrates both RGB
and event data. The RGB frames were resized to
288 × 288 pixels, while the event data was consoli-
dated into multi-channel formats over various tempo-
ral intervals: short (10ms), medium (50ms), and long
(250ms). This setup was intended to capture different
motion characteristics, ranging from intricate details
to broader, slower movements. The optimization was
carried out using the Adam optimizer, starting with a
learning rate of $5 \times 10^{-4}$ and progressively decaying
it using a polynomial schedule. A batch size of 8 was
maintained over 30 epochs, and we included data aug-
mentations like random rotations, flips, and shifts to
enhance generalization. The loss function combined
cross-entropy loss for pixel-level accuracy with Dice
loss to tackle class imbalance:
$L = L_{\mathrm{ce}} + \beta \cdot L_{\mathrm{dice}}, \quad \beta = 0.5$    (14)

where the Dice loss $L_{\mathrm{dice}}$ is computed as:

$L_{\mathrm{dice}} = 1 - \frac{2 \cdot P \cdot \hat{P}}{P + \hat{P} + \varepsilon}$    (15)

Here, $P$ is the ground-truth mask and $\hat{P}$ is the predicted mask.
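A sketch of the optimizer and learning-rate schedule matching the reported settings (Adam, initial rate $5 \times 10^{-4}$, polynomial decay, batch size 8, 30 epochs) is given below; the decay power and per-step stepping are assumptions not stated in the paper.

```python
import torch

def build_training_setup(model, epochs=30, steps_per_epoch=1, base_lr=5e-4, power=0.9):
    """Adam optimizer with a polynomial learning-rate decay over the whole run."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    total_steps = epochs * steps_per_epoch

    def poly_decay(step):
        # Scales the base rate by (1 - step/total)^power, reaching zero at the final step.
        return (1.0 - min(step, total_steps) / total_steps) ** power

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly_decay)
    return optimizer, scheduler
```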
4.3 Experimental Results
The experimental results are divided into qualitative
and quantitative analyses, providing a comprehensive
evaluation of our RGB-event fusion model’s perfor-
mance across visual and metric-based perspectives.
4.3.1 Performance Analysis
Here, we analyzed our model through qualitative and
quantitative comparisons with existing baseline mod-
els, deriving key observations and results that high-
light its effectiveness.
Qualitative Results : In our qualitative analysis, we
looked at how well our model segments by compar-
ing its output to some baseline methods. These in-
cluded RGB-only segmentation, event-only segmen-
tation, and a combination of both RGB and event data.
The models we compared were U-Net for RGB
(Siddique et al., 2021), EV-SegNet for events (Alonso
and Murillo, 2019), and our own model that fuses
RGB and event information.
The results are presented visually in Figure 2,
where we arranged each input scenario in rows and
the different models in columns. This setup allows
for an easy comparison of the segmentation masks
across methods, highlighting the differences in object
boundaries and how well each method performs un-
der tricky conditions. For instance, in high-motion
scenarios, the event data significantly boosts segmen-
tation performance by keeping the boundaries sharp,
something the RGB-only models often struggle with.
Our model successfully integrates the advantages of
both RGB and event data, resulting in precise segmen-
tation of drivable areas even in dynamic situations.
Quantitative Results : The performance of each
model was quantitatively evaluated using mean Inter-
section over Union (mIoU) and pixel accuracy metrics.
Mean Intersection over Union (mIoU):

$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}$    (16)
Figure 2: Qualitative comparison of segmentation results across different input modalities and models. The columns
show grayscale RGB images, event streams, and outputs from three models: UNet (RGB-only), EV-SegNet (event-only), and
our proposed EveCamNet (RGB+Event Fusion) model. The rows depict urban road scenes with vehicles and drivable areas.
UNet struggles with object boundaries, while EV-SegNet captures motion details but lacks spatial context. Our RGB-Event
Fusion model effectively integrates spatial and temporal information, resulting in accurate segmentation masks with clear
object boundaries, even in dynamic scenarios. This analysis demonstrates our model’s effectiveness in identifying drivable
areas and objects. The red box highlights a region where EveCamNet segments more accurately than the baseline models.
Pixel Accuracy (PA):

$\mathrm{PA} = \frac{\text{Number of correctly classified pixels}}{\text{Total number of pixels}}$    (17)
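For reference, both metrics can be computed from a confusion matrix as in the sketch below; the helper name and the integer label-map inputs are illustrative.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute mIoU (Eq. 16) and pixel accuracy (Eq. 17) from integer label maps."""
    pred = pred.flatten().astype(np.int64)
    gt = gt.flatten().astype(np.int64)
    # Confusion matrix: rows index ground-truth classes, columns index predictions.
    conf = np.bincount(num_classes * gt + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp   # predicted as class c but belonging to another class
    fn = conf.sum(axis=1) - tp   # belonging to class c but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    miou = float(iou.mean())
    pixel_acc = float(tp.sum() / conf.sum())
    return miou, pixel_acc
```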
The results in Table 1 clearly demonstrate that
our RGB-event fusion model outperforms others in
all evaluated metrics. The mean Intersection over
Union (mIoU) for drivable areas shows a significant
increase, indicating the model’s strength in accurately
identifying navigable regions. Moreover, the im-
proved mIoU for vehicle detection highlights better
boundary delineation, which can be attributed to the
complementary capabilities of both RGB and event
data. There are also notable gains in pixel accuracy
(PA) for our proposed model.
By incorporating results from various temporal in-
tervals (including 10ms and 250ms), we showcase the
model’s ability to adapt to both high-speed and slower
scenarios. The performance at 50ms, which was used
during training, acts as a reference point. The en-
hancements observed at 10ms illustrate its sensitiv-
ity to fine temporal details, while the 250ms perfor-
mance indicates robustness to aggregated motion pat-
terns. Overall, these findings underscore the model’s
versatility across different temporal dynamics, mak-
ing it a strong candidate for a range of operational
conditions.
5 ABLATION STUDY
This section explores the contributions of individual components in our framework via ablation studies, allowing us to measure their impact on segmentation performance and model efficiency.
Loss Importance Analysis: To assess the importance of the proposed loss function and attention mechanisms in our RGB+Event fusion framework, we conducted ablation experiments by selectively omitting these components during training. This method enabled us to pinpoint the role of each component and evaluate how it affected segmentation performance.
The results presented in Table 2 illustrate how
various loss functions impact critical metrics such
as mean Intersection over Union (mIoU), accuracy,
and final training loss. Notably, the removal of the
Dice loss resulted in a significant drop in mIoU, high-
lighting its crucial role in addressing class imbalance
and enhancing segmentation boundaries. Addition-
ally, eliminating channel or spatial attention mecha-
nisms led to performance declines, emphasizing their
importance in effective feature fusion and alignment
between modalities. The complete model consistently
achieved the highest scores, highlighting the impor-
tance of the proposed loss functions for effective seg-
mentation.
Table 1: Performance comparison of different segmentation models (UNET, EV-SegNet, and our RGB+Event fusion
approach) across various temporal intervals. The proposed RGB+Event fusion model consistently outperforms others in
both mean Intersection over Union (mIoU) and accuracy, highlighting its effectiveness in leveraging multi-modal data for
improved segmentation performance.
Model mIoU (50ms) Accuracy (50ms) mIoU (10ms) Accuracy (10ms) mIoU (250ms) Accuracy (250ms)
UNET (Siddique et al., 2021) 65.77 87.92 65.34 84.5 65.47 83.89
EV-SegNet (Alonso and Murillo, 2019) 67.15 88.12 66.87 87.91 66.98 87.34
EveCamNet (Ours) 69.94 90.54 69.83 89.97 68.73 90.12
Table 2: Adapted table for loss-based ablation. The table compares the performance of the model under different configu-
rations. Metrics include Accuracy (%), mIoU (%), and Final Loss.
Method Accuracy [%] mIoU [%] Final Loss
w/o $L_{\mathrm{dice}}$ 88.76 60.12 0.25
w/o Channel Attention 89.65 62.40 0.23
w/o Spatial Attention 89.78 63.05 0.22
w/o Both Attention Mechanisms 89.40 61.50 0.24
EveCamNet (Ours) 93.50 71.50 0.18
6 CONCLUSION
In conclusion, the proposed RGB+Event fusion
framework effectively combines the unique advan-
tages of RGB and event data for robust segmentation.
By incorporating both channel and spatial attention
mechanisms, the model enhances feature fusion, al-
lowing it to capitalize on the dense spatial details of-
fered by RGB frames while also capturing dynamic
motion cues from event data. This method has shown
impressive segmentation performance across various
temporal scales, particularly in the realm of drivable
area segmentation for autonomous driving applica-
tions.
To tackle challenges like class imbalance and pre-
cise boundary segmentation, which are crucial for
distinguishing between drivable and non-drivable ar-
eas, the framework employs specialized loss func-
tions such as Dice loss. The results demonstrate the
model’s capability to adapt to a range of scenarios, in-
cluding varying lighting conditions and high-motion
environments, highlighting its practical applicability
for real-world situations.
Looking ahead, future research could aim to ex-
pand this framework to tackle low-light and extreme
lighting environments, where event cameras excel.
Additionally, enhancing the framework for greater
computational efficiency may facilitate real-time seg-
mentation, opening up possibilities for its deployment
in edge devices and autonomous systems with limited
resources.
REFERENCES
Almeida, T., Lourenço, B., and Santos, V. (2020).
Road detection based on simultaneous deep learn-
ing approaches. Robotics and Autonomous Systems,
133:103605.
Alonso, I. and Murillo, A. C. (2019). Ev-segnet: Seman-
tic segmentation for event-based cameras. In 2019
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1624–
1633.
Binas, J., Neil, D., Liu, S.-C., and Delbrück, T. (2017).
Ddd17: End-to-end davis driving dataset. ArXiv,
abs/1711.01458.
Chakravarthi, B., Verma, A. A., Daniilidis, K., Fermuller,
C., and Yang, Y. (2024). Recent event camera innova-
tions: A survey. arXiv preprint arXiv:2408.13627.
Chan, Y.-C., Lin, Y.-C., and Chen, P.-C. (2019). Lane mark
and drivable area detection using a novel instance seg-
mentation scheme. In 2019 IEEE/SICE International
Symposium on System Integration (SII), pages 502–
506.
Chollet, F. (2017). Xception: Deep Learning with Depth-
wise Separable Convolutions . In 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 1800–1807, Los Alamitos, CA, USA.
IEEE Computer Society.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Demir, S., Ertop, T. E., Koku, A. B., and Konukseven, E. İ.
(2017). An adaptive approach for road boundary de-
tection using 2d lidar sensor. In 2017 IEEE Interna-
tional Conference on Multisensor Fusion and Integra-
tion for Intelligent Systems (MFI), pages 206–211.
Gallego, G., Delbrück, T., Orchard, G., Bartolozzi, C.,
Taba, B., Censi, A., Leutenegger, S., Davison, A. J.,
Conradt, J., Daniilidis, K., and Scaramuzza, D.
(2022). Event-based vision: A survey. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
44(1):154–180.
Gehrig, D., Gehrig, M., Hidalgo-Carri’o, J., and Scara-
muzza, D. (2019). Video to events: Recycling video
datasets for event cameras. 2020 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 3583–3592.
Gehrig, M., Aarents, W., Gehrig, D., and Scaramuzza, D.
(2021). Dsec: A stereo event camera dataset for driv-
ing scenarios. IEEE Robotics and Automation Letters.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. International
Journal of Robotics Research (IJRR).
Han, H.-Y., Chen, Y.-C., Hsiao, P.-Y., and Fu, L.-C. (2021).
Using channel-wise attention for deep cnn based real-
time semantic segmentation with class-aware edge in-
formation. IEEE Transactions on Intelligent Trans-
portation Systems, 22(2):1041–1051.
Hidalgo-Carrió, J., Gehrig, D., and Scaramuzza, D. (2020).
Learning monocular dense depth from events. In 2020
International Conference on 3D Vision (3DV), pages
534–542.
Jain, M., Kamat, G., Bachari, R., Belludi, V. A., Hegde, D.,
and Patil, U. (2023). Affordrive: Detection of driv-
able area for autonomous vehicles. In Maji, P., Huang,
T., Pal, N. R., Chaudhury, S., and De, R. K., editors,
PReMI, volume 14301 of Lecture Notes in Computer
Science, pages 532–539. Springer.
Li, K., Xiong, H., Yu, D., Liu, J., Guo, Y., and Wang, J.
(2022). An end-to-end multi-task learning model for
drivable road detection via edge refinement and geo-
metric deformation. IEEE Transactions on Intelligent
Transportation Systems, 23(7):8641–8651.
Liu, M., Fu, B., Xie, S., He, H., Lan, F., Li, Y., Lou,
P., and Fan, D. (2021). Comparison of multi-source
satellite images for classifying marsh vegetation using
deeplabv3 plus deep learning algorithm. Ecological
Indicators, 125:107562.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully
convolutional networks for semantic segmentation .
In 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 3431–3440, Los
Alamitos, CA, USA. IEEE Computer Society.
Qiao, D. and Zulkernine, F. (2021). Drivable area detection
using deep learning models for autonomous driving.
In 2021 IEEE International Conference on Big Data
(Big Data), pages 5233–5238.
Rasib, M., Butt, M. A., Riaz, F., Sulaiman, A., and Akram,
M. (2021). Pixel level segmentation based drivable
road region detection and steering angle estimation
method for autonomous driving on unstructured roads.
IEEE Access, 9:167855–167867.
Shariff, W., Dilmaghani, M. S., Kielty, P., Moustafa, M.,
Lemley, J., and Corcoran, P. (2024). Event cam-
eras in automotive sensing: A review. IEEE Access,
12:51275–51306.
Siddique, N., Paheding, S., Elkin, C. P., and Devabhaktuni,
V. (2021). U-net and its variants for medical image
segmentation: A review of theory and applications.
IEEE Access, 9:82031–82057.
Tomy, A., Paigwar, A., Mann, K. S., Renzaglia, A., and
Laugier, C. (2022). Fusing event-based and rgb cam-
era for robust object detection in adverse conditions.
In 2022 International Conference on Robotics and Au-
tomation (ICRA), pages 933–939.
Wang, L., Chae, Y., Yoon, S.-H., Kim, T.-K., and Yoon, K.-
J. (2021). Evdistill: Asynchronous events to end-task
learning via bidirectional reconstruction-guided cross-
modal knowledge distillation. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 608–619.
Yang, F., Wang, H., and Jin, Z. (2020). A fusion network
for road detection via spatial propagation and spatial
transformation. Pattern Recognition, 100:107141.
Zhou, Z., Wu, Z., Boutteau, R., Yang, F., Demonceaux, C.,
and Ginhac, D. (2023). Rgb-event fusion for moving
object detection in autonomous driving. In 2023 IEEE
International Conference on Robotics and Automation
(ICRA), pages 7808–7815.
Zou, Z., Chen, K., Shi, Z., Guo, Y., and Ye, J. (2023). Ob-
ject detection in 20 years: A survey. Proceedings of
the IEEE, 111(3):257–276.