GIFF: Graph Iterative Attention Based Feature Fusion for Collaborative
Perception
Ahmed N. Ahmed (ORCID: 0000-0002-7192-699X), Siegfried Mercelis (ORCID: 0000-0001-9355-6566) and Ali Anwar (ORCID: 0000-0002-5523-0634)
Imec Research Group, IDLab, Faculty of Applied Engineering, University of Antwerp, 2000 Antwerp, Belgium
{ahmed.ahmed, siegfried.mercelis, ali.anwar}@uantwerpen.be
Keywords:
Collaborative Perception, Autonomous Driving, Attention, Graphs, Object Detection.
Abstract:
Multi-agent collaborative perception has gained significant attention for its ability to overcome the challenges stemming from the limited line-of-sight visibility of individual agents, which raises safety concerns for autonomous navigation. This paper introduces GIFF, a graph-based iterative attention collaborative perception framework designed to improve situational awareness among multi-agent systems, including vehicles and roadside units. GIFF enhances autonomous driving perception by fusing perceptual data shared among neighboring agents, allowing agents to “see” through occlusions, detect distant objects, and increase resilience to sensor noise and failures, at low computational cost. To achieve this, we propose a novel framework that integrates both channel and spatial attention mechanisms, learned iteratively and in parallel. We evaluate our approach on the object detection task through extensive experiments on the V2X-Sim and OPV2V datasets. GIFF outperforms state-of-the-art methods, achieving notable improvements in average precision while keeping the number of model parameters low.
1 INTRODUCTION
Situational awareness is an important topic in the field of autonomous driving. Autonomous vehicles (AVs) mainly rely on onboard sensors to perceive their surrounding environment. However, as shown in Fig. 1, the onboard sensors deployed on an AV are limited by their field of view and horizontal range. As a result, the perception system becomes susceptible to challenges such as occlusion and long-distance perception sparsity, which hinder the situational awareness of the AV. While deep learning has improved perception stacks with data-driven techniques (Qian et al., 2022), the perception module in AVs remains brittle to date, especially in the face of extreme situations and corner cases that can lead to catastrophic scenarios. In recent years, an increasing amount of research has focused on collaborative perception, which enables a vehicle to communicate with neighboring AVs and roadside units to achieve Vehicle-to-Everything (V2X) collaboration (Ahmed et al., 2024a; Ahmed et al., 2024b; Li et al., 2021; Wang et al., 2020), significantly improving situational awareness; a simplified illustration is shown in Fig. 1.
With recent developments in telecommunication technology,
collaborative perception (Han et al., 2023) is becom-
ing a promising paradigm that enables sensor infor-
mation to be shared between neighboring agents (for
simplicity, we refer to vehicles and roadside units as
agents) in real time. The collaborative perception
module operates by intelligently aggregating visual
data from multiple relevant agents within the commu-
nication range to enhance visual reasoning and detec-
tion precision as shown in Fig. 1. In practice, the effi-
cacy of collaborative perception hinges on what data
to transmit within the limited network bandwidth and
how to aggregate the information received from other
agents to build a coherent situational awareness of the
surroundings. Due to the topological nature of this
problem, in this work, we propose a graph iterative
attention-based network to aggregate the ego agent’s
local observations with those of neighboring agents.
By utilizing both the attention mechanism, which attends only to the relevant regions of the information provided by the neighboring agents, and the message-passing functionality within graph networks, our methodology yields enhanced situational awareness.
On the one hand, various types of graph neural
networks have been proposed (Zhou et al., 2020; Wu
et al., 2020), and have proved to be effective for fea-
Figure 1: Illustration of single-agent perception challenges.
From the perspective of the ego vehicle (A), vehicle (B) is
occluded by the yellow vehicle. Likewise, Vehicle (C) is
around the corner lying outside the perception range of Ve-
hicle (A). These typical road scenarios cause dangerous col-
lision risks. If vehicles are given the ability to inform each other of “what they can see”, thereby achieving collaborative perception, those collisions can be avoided.
ture aggregation (Ahmed et al., 2021). We chose GATs (Veličković et al., 2017) as our core aggregation method, as they exploit the underlying graph structure of the multi-agent collaborative perception data aggregation problem by utilizing message passing among nodes and attention in the graph. The
node features are updated by aggregating node fea-
tures from the neighbors. Addressing the collabo-
rative perception problem through a graph-based ap-
proach allows for the embedding of both ego and re-
ceived feature maps as graph nodes. This method
enables the model to learn edge weights and atten-
tion coefficients, which adaptively weigh nodes and
their associated features based on inter-node correla-
tions. On the other hand, the benefit of incorporat-
ing attention within our proposed feature aggregation
scheme is that attention enhances the representation
power by directing the model to focus on the signif-
icant regions within the fused semantic information
and suppressing unnecessary ones. In this work, we
utilize channel and spatial attention modules to attend
to both local and global contexts. We also introduce
an iterative attention fusion approach to further refine
the feature fusion process, further improving the qual-
ity of the final fused feature. The contributions of this
work can be summarized as follows:
• We propose a novel methodology for aggregating informative features along the channel and spatial dimensions and incorporate it within the GAT framework, which simultaneously aggregates complementary information from connected nodes.
• Our proposed attention learning network is designed so that channel and spatial attention are learned separately, allowing the model to analyze spatial and channel information without the bias introduced by correlations between channel and spatial features.
• We propose an iterative attention learning strategy that gradually builds up a richer, more nuanced understanding of the fused features, progressively down-weighting less relevant information and focusing on the most significant elements. This further improves model performance without a substantial increase in the model’s learnable parameters.
• We validate our work using the large open dataset V2X-Sim (Li et al., 2022), which includes LiDAR data retrieved from both vehicles and roadside units. We also perform an extensive ablation study to investigate the performance gains of our proposed design choices.
The rest of this paper is organized as follows. Section 2 reviews related work published in recent years. Section 3 describes our proposed method in detail. Sections 4 and 5 present the experimental setup and results, Section 6 reports an ablation study, and Section 7 concludes the paper.
2 RELATED WORKS
Graphs have been extensively applied in collaborative
perception due to their capability to propagate and
aggregate information across neighboring nodes, ef-
fectively updating each node’s feature representation.
The importance of attention mechanisms in enhanc-
ing computer vision tasks has also been well estab-
lished in prior literature (Guo et al., 2022). Conse-
quently, numerous studies have explored combining
graphs with attention mechanisms to improve infor-
mation aggregation among collaborating agents. The
authors in (Zhou et al., 2022) implemented GNNs
in multi-robot systems by modeling each robot as
a graph node and leveraging message-passing com-
bined with cross-attention encoding to enable infor-
mation sharing and fusion within the team. In the do-
main of AVs, V2VNet (Wang et al., 2020) employed
GNNs to aggregate shared neural features for joint de-
tection and prediction; however, this approach used a
convolutional gated recurrent unit for message aggre-
gation, which significantly increased model parame-
ters. DiscoNet (Li et al., 2021) introduced a teacher-
student framework that applied a matrix-valued edge
weight within the graph to learn node interactions.
V2X-ViT (Xu et al., 2022a) examined the use of at-
tention alone by utilizing a vision transformer with
window attention for V2X collaboration, though it re-
quires the transmission of full feature maps, increas-
ing bandwidth usage. In addition, Where2comm (Hu
et al., 2022) utilized attention on ego and received fea-
ture maps to assess correlations among agents. De-
spite its advantages, Where2comm lacks flexibility in
allowing ego agents to adjust their perceptual focus
based on immediate environmental conditions, poten-
tially reducing effectiveness in limited communica-
tion scenarios. CollabGAT (Ahmed et al., 2024a) in-
corporated spatial and channel attention in a sequen-
tial setup, following the CBAM (Woo et al., 2018)
model; however, this approach may not fully capture
complex interdependencies between channel and spa-
tial features. Alternatively, the authors in (Ahmed
et al., 2024b) integrated spatial and channel attention
in a parallel arrangement within their collaborative
graph, differing from the sequential arrangement in
CollabGAT (Ahmed et al., 2024a). In contrast, our
proposed approach is a graph iterative attention-based method that incorporates both channel and spatial attention in an iterative manner, learning interdependent patterns in both dimensions in parallel.
3 METHODOLOGY
3.1 Overview
The goal of our proposed method is to aggregate in-
formation received from other agents to help enhance
the ego agent’s situational awareness. In our pro-
posed collaborative perception scheme, we assume that the environment consists of N agents equipped with LiDARs, with their point cloud observations X = {X_i, X_j, ..., X_N}. In principle, agents can transmit all
their retrieved raw point cloud data to the ego agent
to aggregate them. However, in practice, we have
to consider the network bandwidth limit, as sharing
the raw point cloud data among neighboring agents
can overload the network, causing huge transmission
delays. Thus, we derive a distributed and efficient
information-fusing framework that is able to: (i) max-
imize the object detection accuracy for the ego agent, and (ii) minimize the size of the shared data to prevent
bandwidth overloading. The overview of our method-
ology is presented in Fig. 2.
In that regard, the raw point cloud of each agent X_N is processed by a unified feature extractor (Section 3.2) into a compact semantic representation, named the feature map F_N, to be transmitted through the V2X channels in real time. Subsequently, these features are fed into a compression block, which compresses the feature map to further reduce its size and prepare it for transmission (Section 3.3). Afterward, using the compressed feature map and the pose of the broadcasting agent, we create a collaborative perception message (CPM) to be broadcast to neighboring agents. The ego (receiving) agent decompresses the CPM and passes it to the collaborator selection module, which selects only relevant agents based on pre-defined metrics (Section 3.4). Eventually, the feature maps of the selected agents are transformed to the ego agent's perspective (Section 3.4). The ego agent's features and the transformed features are then fed into the feature fusion network, which iteratively aggregates all the received feature maps while taking into account the relevancy of the neighboring agents to the ego agent (Section 3.5). The fused features are then forwarded to the decoder network (Section 3.6) to generate the final object detection outputs.
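To make the data flow concrete, the following is a minimal Python/PyTorch-style sketch of the ego-side processing described above. The CPM container and the callables (extract, decompress, is_relevant, warp_to_ego, fuse, decode) are illustrative placeholders for Sections 3.2-3.6, not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class CPM:
    """Collaborative perception message: compressed features plus sender pose."""
    features: torch.Tensor  # compressed feature map, e.g. (1, C', H, W)
    pose: tuple             # sender pose (x, y, yaw)


def ego_pipeline(ego_bev: torch.Tensor, ego_pose: tuple, cpms: List[CPM],
                 extract: Callable, decompress: Callable, is_relevant: Callable,
                 warp_to_ego: Callable, fuse: Callable, decode: Callable):
    # 1) Encode the ego BEV observation into a feature map F_i (Sec. 3.2).
    f_ego = extract(ego_bev)
    # 2) Decompress, filter, and spatially align received features (Secs. 3.3-3.4).
    warped = []
    for cpm in cpms:
        if not is_relevant(ego_pose, cpm.pose):
            continue
        f_j = decompress(cpm.features)
        warped.append(warp_to_ego(f_j, cpm.pose, ego_pose))
    # 3) Graph-based iterative attention fusion into H_i (Sec. 3.5).
    h_i = fuse(f_ego, warped)
    # 4) Decode the fused representation into detections (Sec. 3.6).
    return decode(h_i)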
3.2 Feature Extractor
To alleviate communication overhead, each agent in-
dependently processes its own LiDAR data, encoding
raw point clouds into semantic information, as illus-
trated in Fig. 2. Specifically, each agent transforms
its collected point cloud data, X , into a bird’s-eye-
view (BEV) representation, flattened along the height
dimension. This BEV representation is then fed into a feature extractor, denoted by Θ(·), to produce a feature map F_i = Θ(BEV_i), where F_i ∈ R^{W×H×C}, with W, H, and C representing the width, height, and channel dimensions of the feature map, respectively. Our
approach assumes homogeneous intermediate collab-
orative perception; thus, all agents utilize the same
feature extractor architecture, sharing the same Θ(·).
The primary objective of this work is to improve the
effectiveness of the feature map fusion strategy and to
evaluate our proposed aggregation approach against
state-of-the-art models. To achieve this, we bench-
mark our intermediate feature aggregation methods
by employing the feature extractors from DiscoNet (Li
et al., 2021) and V2VNet (Wang et al., 2020); en-
abling an independent analysis of the proposed fusion
strategies across different feature extractor architec-
tures. Further details of this analysis are provided in
Section 6.
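As an illustration of this stage, the sketch below rasterizes a raw point cloud into a height-flattened BEV occupancy grid and passes it through a small convolutional stand-in for Θ(·). The grid ranges, voxel size, number of height bins, and channel widths are assumptions for the example, not the settings used in the paper.

import torch
import torch.nn as nn


def pointcloud_to_bev(points: torch.Tensor, x_range=(-32.0, 32.0), y_range=(-32.0, 32.0),
                      z_range=(-3.0, 2.0), voxel=0.25, height_bins=13) -> torch.Tensor:
    """Rasterize an (N, 3) point cloud into a BEV occupancy grid flattened along height."""
    W = int((x_range[1] - x_range[0]) / voxel)
    H = int((y_range[1] - y_range[0]) / voxel)
    bev = torch.zeros(height_bins, H, W)
    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    p = points[mask]
    xi = ((p[:, 0] - x_range[0]) / voxel).long()
    yi = ((p[:, 1] - y_range[0]) / voxel).long()
    zi = ((p[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * height_bins).long().clamp(0, height_bins - 1)
    bev[zi, yi, xi] = 1.0        # binary occupancy per height slice
    return bev                    # (height_bins, H, W); height slices act as channels


class FeatureExtractor(nn.Module):
    """A small convolutional stand-in for the shared extractor Θ."""
    def __init__(self, in_ch=13, out_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, out_ch, 3, padding=1),
        )

    def forward(self, bev):       # bev: (B, in_ch, H, W)
        return self.net(bev)      # F_i: (B, out_ch, H/4, W/4)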
3.3 Compression and Sharing
To minimize transmission bandwidth, each agent compresses its feature map before communication. We employ the variational compression algorithm described in (Ballé et al., 2018) for this purpose: a CNN is trained to compress the feature map in a way that supports end-to-end optimization. This approach allows the system to preserve essential feature map information while minimizing bandwidth usage. Each agent transmits a compressed form of its intermediate semantic information, denoted as F_N, along with its pose ζ_N, in what we refer to as the collaborative perception message (CPM). This CPM is shared among all neighboring agents. Upon receiving a CPM, the ego agent decompresses it for further processing, enabling it to select relevant collaborators and transform their feature maps to align with its own perspective.
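A minimal sketch of where compression sits in the pipeline is given below. Note that it uses a plain 1×1-convolution channel bottleneck as a stand-in; the paper employs the variational compression scheme of Ballé et al. (2018), which is not reproduced here, and the compression ratio is an assumption.

import torch
import torch.nn as nn


class FeatureCodec(nn.Module):
    """Channel-bottleneck codec standing in for the learned compression step.
    This is a simple autoencoder sketch, not the variational hyperprior of
    Ballé et al. (2018) used in the paper."""
    def __init__(self, channels=256, ratio=8):
        super().__init__()
        self.encoder = nn.Conv2d(channels, channels // ratio, kernel_size=1)
        self.decoder = nn.Conv2d(channels // ratio, channels, kernel_size=1)

    def compress(self, feature):   # (B, C, H, W) -> (B, C/ratio, H, W)
        return self.encoder(feature)

    def decompress(self, code):    # (B, C/ratio, H, W) -> (B, C, H, W)
        return self.decoder(code)


# Example: a 256-channel map is shrunk 8x along channels before broadcast.
codec = FeatureCodec(channels=256, ratio=8)
f = torch.randn(1, 256, 64, 64)
cpm_payload = codec.compress(f)
f_recovered = codec.decompress(cpm_payload)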
3.4 Collaborator Selection and Spatial
Transformation
It is important to note that not all neighboring agents
contribute positively to enhancing the ego agent’s sit-
uational awareness. In some cases, the semantic in-
formation provided by neighboring agents may de-
grade perception performance due to irrelevant view-
points (Liu et al., 2020b). Therefore, agent i employs
the collaborator selector function which selects only
the agents positioned within a 70-meter radius and
exhibiting a heading intersection of 70 degrees rel-
ative to the ego agent. This relevancy metric range
is based on existing dedicated short-range communi-
cations (DSRC) standards (Kenney, 2011), and was
adopted by multiple collaborative perception meth-
ods (Ahmed et al., 2022; Ahmed et al., 2024a; Wang
et al., 2020). In this work, we assume ideal commu-
nication between agents, where agents consistently
transmit and receive the CPMs of their neighbors at
each timestep. Since each selected collaborator per-
ceives the environment from different viewpoints and
perspectives, its semantic information needs to be
transformed to the ego agent’s perspective. The ego
agent transforms each neighboring agent's semantic information to its own perspective using the ego and the selected agent's poses ζ_i and ζ_j, respectively. The transformed feature of the j-th agent into the ego agent i's frame is represented as F_ji = Γ_ji((F_i, ζ_i), (F_j, ζ_j)), where Γ_ji represents the affine transformation.
the affine transformation due to its ability to preserve
parallel lines and distance during rotations. The affine
transformation adopted in this work is closely aligned
with the method proposed in (Jaderberg et al., 2015),
with the key distinction being the absence of a lo-
calization network, as each agent broadcasts its pose.
The ego agent repeats this affine transformation pro-
cess for all selected collaborators.
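The sketch below illustrates both steps: relevancy filtering by the 70-meter / 70-degree criterion described above, and warping a collaborator's feature map into the ego frame with an affine grid. The pose format, the mapping from metric offsets to normalized grid coordinates, and the covered BEV range are simplifying assumptions; this is not the exact transformation Γ used by the authors.

import math
import torch
import torch.nn.functional as F


def is_relevant(ego_pose, other_pose, max_dist=70.0, max_heading_deg=70.0):
    """Pose = (x, y, yaw in radians). Keep collaborators within 70 m whose
    heading differs from the ego heading by at most 70 degrees."""
    dist = math.hypot(other_pose[0] - ego_pose[0], other_pose[1] - ego_pose[1])
    dyaw = abs((other_pose[2] - ego_pose[2] + math.pi) % (2 * math.pi) - math.pi)
    return dist <= max_dist and math.degrees(dyaw) <= max_heading_deg


def warp_to_ego(feature, other_pose, ego_pose, bev_range=64.0):
    """Warp a collaborator feature map (1, C, H, W) into the ego frame.
    The relative pose is converted into a normalized 2x3 affine matrix for
    affine_grid; bev_range is the metric width/height covered by the map."""
    dx = other_pose[0] - ego_pose[0]
    dy = other_pose[1] - ego_pose[1]
    dyaw = other_pose[2] - ego_pose[2]
    cos, sin = math.cos(dyaw), math.sin(dyaw)
    # Translation expressed in normalized grid coordinates ([-1, 1] spans bev_range).
    tx, ty = 2.0 * dx / bev_range, 2.0 * dy / bev_range
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=feature.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, size=feature.shape, align_corners=False)
    return F.grid_sample(feature, grid, align_corners=False)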
3.5 Graph Fusion Network
Since the selected collaborators possess different locations and viewpoints of the surroundings, their semantic information differs; therefore, to account for these distinct characteristics, the significance of each agent to the ego agent must be distinguished, and the interactions between multiple agents should vary. To capture this heterogeneity, we present a novel graph iterative attention scheme, employing spatial and channel attention in parallel and iteratively to appropriately enhance the feature aggregation. The proposed graph attention-based aggregation scheme i) indicates the collaborator's importance relative to the ego agent, and ii) emphasizes the significant regions within the collaborator's feature map, further strengthening the cross-agent feature aggregation.
Graph Network Structure. As shown in Fig. 2, we consider each agent's feature map as a node in the graph, and the edge weights represent the significance of those nodes to each other. Intuitively, we represent the graph as G = (V, E), where V is the set of nodes incorporating the semantic information of each agent, V = {F_i, F_ji, ..., F_Ni}, and E is the set of edges connecting the nodes, where E = {W_ii, W_ij, ..., W_iN} represents the importance between each selected collaborator and the ego agent, determining their significance to each other. In addition to the edge weights W_iN, we incorporate an efficient multi-scale attention learning scheme that learns two different attention maps: a) a spatial attention map (α_sp) and b) a channel attention map (α_ch), as shown in Fig. 3. Different from the edge weights, which reflect the significance of the nodes to each other, attention directs the model to attend only to significant regions within the feature maps of the selected collaborators relative to the ego agent. Incorporating channel-spatial attention encodes both local and global interactions between connected nodes to better capture the ambiguity in the semantic feature space. Local attention can help preserve object details, while global attention can provide a better understanding of the environmental context. To this end, we present a graph-structured attention-based fusion process where each agent establishes its own graph, the nodes in the graph maintain the semantic information of the selected collaborators, and the ego node state is updated based on the feature fusion process driven by the edge weights as well as the attention maps.
Attention Fusion Module. The attention module includes the channel and spatial at-
Figure 2: The overall architecture of GIFF. Each agent converts its perceived point cloud X_i into a BEV image. The shared feature extractor Θ processes the BEV image to obtain the feature map. Subsequently, each agent shares its CPM to initiate the collaborator selection process. The feature maps corresponding to the relevant agents are then transformed into the ego agent's coordinate system. Subsequently, the features are passed to the graph fusion network to aggregate the collaborators' feature maps with the ego feature map and produce an updated feature representation. The updated representation is then fed to the decoder network to perform object detection.
tention maps, learned in parallel, as illustrated in Fig. 3. In this manner, the features obtained after applying the attention maps are aggregated to combine both the low- and high-
level features and effectively direct the attention to the
most significant regions within the feature map. In-
cluding both channel and spatial attention enables our proposed fusion strategy to handle objects of varying sizes within the feature maps and to aggregate information from multiple receptive fields. Instead of relying solely on global channel attention, which favors large objects, our method incorporates local channel contexts to highlight small objects as well. This allows the network to adaptively adjust its focus based on the scale of the objects present in the image. Additionally, the parallel sub-network block helps effectively capture cross-dimension interactions and establish inter-dimensional dependencies independently. It also regulates the information flow within the network by learning which information to emphasize or suppress.
F_ij = AGG(F_i, F_ji)    (1)

where AGG is the aggregation operation over F_i and F_ji, computed either as a summation (F_ij ∈ R^{C×H×W}) or a concatenation (F_ij ∈ R^{2C×H×W}); this will be further discussed in Section 6.
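Both choices of AGG in Eq. (1) can be written in a few lines; the sketch below shows the two variants compared in Section 6 (the tensor shapes are illustrative).

import torch


def aggregate(f_i: torch.Tensor, f_ji: torch.Tensor, mode: str = "sum") -> torch.Tensor:
    """AGG in Eq. (1): combine the ego map F_i with a warped collaborator map F_ji.
    'sum' keeps the channel count (C x H x W); 'concat' doubles it (2C x H x W)."""
    if mode == "sum":
        return f_i + f_ji
    if mode == "concat":
        return torch.cat([f_i, f_ji], dim=1)  # concatenate along the channel axis
    raise ValueError(f"unknown aggregation mode: {mode}")


# Example with two (1, 256, 64, 64) maps.
f_i, f_ji = torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64)
print(aggregate(f_i, f_ji, "sum").shape)     # torch.Size([1, 256, 64, 64])
print(aggregate(f_i, f_ji, "concat").shape)  # torch.Size([1, 512, 64, 64])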
Channel Attention. As shown in Fig. 3(a), to compute the channel attention map α_ch, we squeeze the spatial dimension of the aggregated feature F_ij by applying global average pooling
Figure 3: Illustration of the attention map learning scheme
within GIFF. Part (a) depicts the channel attention map
branch, while part (b) represents the spatial attention map.
The parameter r denotes the channel reduction ratio within
the encoder-decoder framework.
(GAP) to model only the cross-channel information. GAP generates a compact feature representation F^ch_ij of shape R^{C×1×1} by averaging the spatial dimensions within each channel of F_ij. This reduces the 2D spatial dimension H × W to a single value per channel, i.e., C × 1 × 1, distilling the most important information from the entire spatial extent of F_ij into a more compact form, as illustrated in Fig. 3. F^ch_ij is then used to learn the per-channel attention map that reflects how important each channel is for the object detection task. To learn the attention map while maintaining a lightweight model, F^ch_ij is passed to an encoder-decoder point-wise convolution (PwConv, 1 × 1) network, which acts as a local channel context aggregator exploiting only cross-channel interactions. The channel attention map α_ch is learned as follows:

F^ch_ij = GAP(F_ij)    (2a)
α_ch = Ψ(ϒ(F^ch_ij))    (2b)

where Ψ and ϒ are the decoding and encoding PwConv-based networks used to learn the channel attention map.
Spatial Attention. In parallel, we generate a spatial attention map by utilizing the spatial relationships of the features, as shown in Fig. 3(b). Different from channel attention, spatial attention focuses on where, within the spatial dimensions, the informative parts of the aggregated feature F_ij lie, and increases their weight within the attention map. On F_ij, we apply encoder-decoder PwConv layers to generate the spatial attention map α_sp. The spatial attention process is expressed as follows:

α_sp = Ω(Λ(F_ij))    (3)

where Ω and Λ are the decoding and encoding PwConv-based networks tailored to learn the spatial attention map.
Following the computation of the channel and spatial attention maps, these maps are combined to form the final attention map, denoted as α, in order to exploit the learned representations. The attention map α is subsequently utilized to update the aggregated feature map F_ij, resulting in the refined feature map F̂_ij, as expressed by:

α = σ(α_ch ⊕ α_sp)    (4)
F̂_ij = α ⊙ F_ij    (5)

where σ denotes the sigmoid function, ⊕ the element-wise combination of the two attention maps, and ⊙ element-wise multiplication.
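The following PyTorch sketch mirrors Eqs. (2)-(5): a channel branch (GAP followed by a point-wise encoder-decoder) and a spatial branch (a point-wise encoder-decoder at full resolution) computed in parallel, combined, passed through a sigmoid, and used to reweight F_ij. The exact layer ordering, normalization placement, and reduction ratio r are assumptions of this sketch rather than the authors' configuration.

import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Parallel channel/spatial attention over an aggregated map F_ij (Eqs. 2-5).
    Layer ordering and the reduction ratio r follow Fig. 3 only loosely and are
    assumptions of this sketch."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = max(channels // r, 1)
        # Channel branch: GAP -> 1x1 conv encoder-decoder (encoder/decoder of Eq. 2b).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # GAP: (B, C, 1, 1)
            nn.Conv2d(channels, mid, kernel_size=1), nn.ReLU(),
            nn.Conv2d(mid, channels, kernel_size=1),
        )
        # Spatial branch: 1x1 conv encoder-decoder applied at full resolution (Eq. 3).
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )

    def forward(self, f_ij: torch.Tensor) -> torch.Tensor:
        alpha_ch = self.channel_att(f_ij)                 # (B, C, 1, 1)
        alpha_sp = self.spatial_att(f_ij)                 # (B, C, H, W)
        alpha = torch.sigmoid(alpha_ch + alpha_sp)        # Eq. (4), broadcast over H, W
        return alpha * f_ij                               # Eq. (5), element-wise refinement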
Iterative Attention. To enhance the attention maps learned over the complementary information from the aggregated features, we propose an iterative attention
learning strategy. This strategy progressively
refines the spatial and channel attention maps,
thereby enhancing the discriminative power of the
learned features and improving the feature fusion
process. At the end of each iteration, the features
are aggregated and forwarded to the next iterative
layer, refining the input to the attention module
Figure 4: Illustration of the proposed iterative attention fea-
ture fusion where the attention module, shown in Fig.3, is
repeated for l iterations.
and potentially generating more expressive atten-
tion maps. After l iterations, as illustrated in Fig. 4, the fused feature F^(l)_ij is incrementally updated through each attention iteration, ultimately producing the final representation F̂_ij. The attention fusion module is repeated for every connected node to compute the updated features F̂_iN.
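Reusing the AttentionFusion sketch above, the iterative scheme can be expressed as a simple loop over l attention stages; feeding the refined map of one stage directly into the next is our assumed wiring of Fig. 4.

import torch
import torch.nn as nn


class IterativeAttentionFusion(nn.Module):
    """Repeat the attention module for l iterations (Fig. 4). The refined map
    produced at iteration t becomes the input of iteration t+1; this particular
    feedback wiring is an assumption of the sketch."""
    def __init__(self, channels: int, iterations: int = 2, r: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            [AttentionFusion(channels, r) for _ in range(iterations)]
        )

    def forward(self, f_i: torch.Tensor, f_ji: torch.Tensor) -> torch.Tensor:
        f = f_i + f_ji                 # Eq. (1) with summation as AGG
        for stage in self.stages:      # two iterations performed best (Sec. 6)
            f = stage(f)               # Eqs. (2)-(5) applied to the current map
        return f                       # refined feature for this collaborator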
Node Feature Aggregation. After obtaining the updated features from the attention fusion module, the final updated feature H_i is computed as a weighted sum, where each feature F̂_iN is multiplied by its corresponding learnable edge weight matrix W_iN, as follows:

H_i = Σ_N (W_iN ⊙ F̂_iN)    (6)
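A minimal sketch of Eq. (6) is shown below, with the edge weights W_iN parameterized as learnable per-channel weights per neighbor slot; this parameterization is a simplification for illustration, not necessarily the paper's exact form.

import torch
import torch.nn as nn


class NodeAggregation(nn.Module):
    """Weighted sum of the refined per-neighbor features (Eq. 6). Here each W_iN
    is one learnable per-channel weight per neighbor slot, a simplification of
    the learnable edge weight matrices described in the text."""
    def __init__(self, channels: int, max_neighbors: int = 6):
        super().__init__()
        self.edge_weights = nn.Parameter(torch.ones(max_neighbors, channels, 1, 1))

    def forward(self, refined_features):   # list of (B, C, H, W) tensors
        h_i = 0
        for n, feat in enumerate(refined_features):
            h_i = h_i + self.edge_weights[n] * feat   # weighted contribution of node n
        return h_i                                    # fused ego representation H_i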
3.6 Decoder Network
After the graph-based fusion, the ego agent's feature map H_i is passed into the detection decoder, which decodes it into objects, including classification and regression outputs. This study aims to enhance the feature aggregation methodology using a graph attention-based network and assess its performance in comparison to state-of-the-art techniques. In line with the feature extractor (discussed in Section 3.2), we adopt the same detection decoder networks utilized by DiscoNet, Φ_DiscoNet(·) (Li et al., 2021), and V2VNet, Φ_V2VNet(·) (Wang et al., 2020), to produce the final detection outputs.
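Since GIFF reuses the DiscoNet and V2VNet decoders, the sketch below only illustrates the generic shape of such a head: parallel 1×1-convolution branches for per-anchor classification scores and box regression. The anchor count and box parameterization are assumptions, not the settings of those decoders.

import torch
import torch.nn as nn


class DetectionDecoder(nn.Module):
    """Generic BEV detection head: one 1x1 conv branch for per-anchor class
    scores and one for box regression. This is only a stand-in for the
    DiscoNet / V2VNet decoders reused by GIFF."""
    def __init__(self, channels=256, num_anchors=6, box_params=6):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, num_anchors, kernel_size=1)
        self.reg_head = nn.Conv2d(channels, num_anchors * box_params, kernel_size=1)

    def forward(self, h_i):            # fused map H_i: (B, C, H, W)
        cls = self.cls_head(h_i)       # foreground scores per anchor
        reg = self.reg_head(h_i)       # box offsets per anchor
        return cls, reg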
4 EXPERIMENTAL SETUP
Dataset. We evaluate our work using V2X-Sim
(Li et al., 2022) and OPV2V (Xu et al., 2022b)
datasets. V2X-Sim dataset integrates the SUMO
platform (Krajzewicz et al., 2012) for generating
traffic flow data and the Carla simulator (Dosovitskiy
et al., 2017) to capture sensor data from multiple
agents. V2X-Sim consists of 10,000 frames across
100 scenes, each involving 2-5 collaborative agents.
We split the dataset into training, validation, and
test sets containing 8,000, 1,000, and 1,000 frames,
respectively. Each frame includes data collected
from vehicles and roadside units (RSUs), resulting in
37,200 training samples, 5,000 validation samples,
and 5,000 test samples. This work evaluates object
detection performance in two scenarios: without
RSU (w/o RSU) and with RSU (w/ RSU).
OPV2V is a large-scale V2V perception dataset
created utilizing CARLA (Dosovitskiy et al., 2017)
and OpenCDA (Xu et al., 2021). The dataset con-
sists of around 11,464 LiDAR point cloud frames.
OPV2V is divided into two subsets: the default
CARLA towns and the Culver City digital town. The
default town subset has a total of 10,914 frames.
These frames are divided into train/val/test splits of
6,764/1,980/2,170 frames, respectively. This subset
offers a broad spectrum of scenarios characterized by
varying levels of complexity. In contrast, the Culver
City subset consists of 550 frames used for evaluation
that simulate a real-world urban environment, with a
wide range of objects and structures.
Evaluation Metrics. To supervise the foreground-background classification, we utilize the binary cross-entropy loss (Mannor et al., 2005). For the bounding-box regression loss, we utilize the weighted smooth L1 loss. To assess the collaborative perception detection performance, we utilize average precision (AP) at Intersection over Union (IoU) thresholds of 0.5 and 0.7.
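A sketch of the combined objective is given below: binary cross-entropy for the foreground-background classification and a smooth L1 term on positive anchors for box regression. The flattened tensor layout and the relative loss weighting are assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F


def detection_loss(cls_logits: torch.Tensor, reg_pred: torch.Tensor,
                   cls_target: torch.Tensor, reg_target: torch.Tensor,
                   reg_weight: float = 2.0) -> torch.Tensor:
    """cls_logits, cls_target: (N,) flattened per-anchor scores and 0/1 float labels.
    reg_pred, reg_target: (N, P) flattened box offsets.
    reg_weight is an illustrative value, not the paper's setting."""
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_target)
    pos = cls_target > 0.5                    # regression supervised on positives only
    if pos.any():
        reg_loss = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
    else:
        reg_loss = reg_pred.sum() * 0.0       # no positives: zero loss, graph intact
    return cls_loss + reg_weight * reg_loss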
Training Setup. We utilize the Adam optimizer with an initial learning rate of 10^-3, decayed every 10 epochs by a factor of 0.1. All models are trained on an NVIDIA Tesla V100 GPU with a batch
size of 4. We compare GIFF with no, early, and late
collaboration methods. For the intermediate collab-
oration methods, we benchmark six approaches that report results on V2X-Sim: When2Com (Liu et al., 2020a), Who2Com (Liu et al., 2020b), V2VNet (Wang et al., 2020), DiscoNet (Li et al., 2021), Ahmed et al. (Ahmed et al., 2024b), and CollabGAT (Ahmed et al., 2024a). For OPV2V, the benchmarks are: F-Cooper (Chen et al., 2019), Who2Com,
AttFuse (Xu et al., 2022b), V2VNet, HP3D-V2V
(Chen et al., 2024) and CollabGAT.
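The optimization setup above translates directly into a short training skeleton; the model, data loader, and loss function are placeholders for the reader's own objects, while the optimizer and schedule match the values reported in the text.

import torch


def train(model, train_loader, detection_loss, epochs, device="cuda"):
    """Training skeleton: Adam with initial learning rate 1e-3, decayed by 0.1
    every 10 epochs, as reported above. Not the authors' training script."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    for epoch in range(epochs):
        for batch in train_loader:
            cls_logits, reg_pred = model(batch["bev"].to(device))
            loss = detection_loss(cls_logits, reg_pred,
                                  batch["cls_target"].to(device),
                                  batch["reg_target"].to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()   # step the learning-rate decay once per epoch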
Table 1: Object detection AP on V2X-Sim, reporting results both with and without RSU at IoU thresholds of 0.5 and 0.7. Results in red, blue, and green denote the 1st, 2nd, and 3rd highest AP, respectively.
Method    AP@IoU=0.5 (w/o RSU | w/ RSU)    AP@IoU=0.7 (w/o RSU | w/ RSU)
When2com 44.02 46.39 39.89 40.32
Who2com 44.02 46.39 39.89 40.32
V2VNet 68.35 72.08 63.83 65.85
DiscoNet 69.03 72.87 63.44 66.40
Ahmed et. al 68.97 72.96 63.48 65.94
CollabGAT 69.67 75.57 63.72 73.29
GIFF (Ours) 73.62 78.93 68.37 75.82
No Collaboration 49.90 46.96 44.21 42.33
Late Collaboration 43.99 42.98 39.10 38.26
Early Collaboration 70.43 77.08 67.04 72.57
Table 2: Object detection AP on OPV2V, tested on the Default and Culver City subsets at IoU thresholds of 0.5 and 0.7. Results in red, blue, and green denote the 1st, 2nd, and 3rd highest AP, respectively.
Method    Default (AP@0.5 | AP@0.7)    Culver (AP@0.5 | AP@0.7)
F-Cooper 61.77 49.85 53.79 44.50
Who2Com 62.04 50.52 54.11 44.21
AttFuse 62.86 50.84 54.01 46.37
V2VNet 63.33 51.67 54.54 45.87
HP3D-V2V 67.42 56.50 58.83 50.51
CollabGAT 68.41 58.32 60.01 51.82
GIFF (Ours) 69.60 60.04 61.35 51.93
No Collaboration 49.13 38.38 40.66 26.70
Late Collaboration 59.61 42.53 49.45 39.76
Early Collaboration 52.35 40.66 42.59 35.34
5 RESULTS AND DISCUSSION
Detection Performance. Tables 1 and 2 show the AP object detection performance of GIFF on the V2X-Sim and OPV2V datasets. As shown in Table 1, our method significantly outperforms V2VNet (Wang et al., 2020), DiscoNet (Li et al., 2021), Ahmed et al. (Ahmed et al., 2024b), and CollabGAT (Ahmed et al., 2024a); for instance, at an IoU of 0.7 w/ RSU, our method achieves performance gains of 15.14%, 14.19%, 15%, and 3.45%, respectively. For the OPV2V results shown in Table 2, among all fusion models GIFF consistently achieves the highest AP scores for both driving scenarios. In particular, GIFF's superiority in the Culver City scenario demonstrates its strong generalization ability. These results highlight the
effectiveness of GIFF in enhancing the object detec-
tion AP when compared to other state-of-the-art in-
termediate collaboration methods. This improvement
can be attributed to our proposed iterative attention-
based learning network, which iteratively refines the
attention map, allowing the model to focus more ac-
curately on relevant regions in both ego-centric and
received semantic information. Unlike the attention
mechanisms in CollabGAT and Ahmed et al., which
Table 3: Number of parameters of each model trained on
V2XSim dataset.
Method No. of parameters (M)
V2VNet 21.08
DiscoNet 15.84
Ahmed et. al 15.98
CollabGAT 15.93
GIFF (Ours) 16.12
also incorporate attention within their method, our
iterative attention fusion approach enables superior
feature fusion by progressively learning the signifi-
cance of each feature map in relation to ego-centric
semantic information while preserving spatial rela-
tionships across feature maps. In addition, the channel and spatial attention cooperate within the multi-agent fusion scheme: channel attention directs the model to relevant features across channels, while spatial attention focuses on important spatial locations, enhancing the model's overall feature fusion.
Computational Efficiency. Table 3 presents the parameter counts for each state-of-the-art method. Our proposed method demonstrates a 23.5% reduction in parameter count compared to the V2VNet network. For the other methods, the parameter count of our model is either comparable or marginally higher, with an increase of approximately 0.9% to 1.7%. How-
ever, this slight increase is negligible given the sub-
stantial performance improvements achieved. This
efficiency is attributed to our iterative PwConv attention mechanism, which progressively enhances the atten-
tion map without significantly impacting model size,
thereby supporting performance gains in object detec-
tion. A more detailed analysis is presented in the ab-
lation study (Section 6).
Table 4: Experiment numbers assigned to the different settings of GIFF evaluated in the ablation study.
Experiment No.    Model Base    Aggregation Operation    Depth
1                 DiscoNet      Sum                      256, 128, 64
2 (Default)       DiscoNet      Sum                      256, 128, 64, 32
3                 DiscoNet      Concat                   512, 256, 128, 64
4                 DiscoNet      Concat                   512, 256, 128, 64, 32
5                 V2VNet        Concat                   512, 256, 128, 64
6                 V2VNet        Concat                   512, 256, 128, 64, 32
6 ABLATION STUDY
Table 4 shows the design of each experiment conducted to evaluate the effect of every module of GIFF, with each design carrying the specified experiment tag.
Effect of Deeper Attention Layers. This module is
defined by Eqs. 2b and 3, which govern the learn-
ing of channel and spatial attention maps. As shown
in Table 5, a deeper encoder-decoder architecture
leads to a higher object detection AP. This improve-
ment is attributed to our proposed attention network,
which is based on a PwConv framework. In this net-
work, deeper layers capture higher-level and more ab-
stract representations of the input data. Consequently,
the network learns intricate patterns and correlations
among features, as the deeper layers combine features
learned in earlier stages to create representations that
capture more complex aspects of the input. These
high-level representations are crucial for learning at-
tention weights effectively. However, we observed
that increasing depth beyond the tested level led to
a decrease in AP due to the vanishing gradient prob-
lem, where the gradient signal becomes too weak to
propagate effectively through multiple layers.
Effect of Aggregation Operation. This section ex-
amines the aggregation function “AGG” employed in
Eq. 1. As presented in Table 5, the experimental
setup in experiment “2” achieves the highest detec-
tion AP with a minimal model parameter count, while
experiment “1” attains the second-highest AP, how-
ever, it achieves the lowest parameter count of all
experiments. Quantitative analysis of the proposed
methodology demonstrates that summation slightly
outperforms concatenation. This can be attributed
to summation’s ability to seamlessly integrate infor-
mation, effectively combining low-level details (such
as edges) with high-level semantics (such as object
shapes), thus yielding more cohesive and generaliz-
able features. Additionally, summation aids gradient
flow during backpropagation by preserving feature
map size and channel consistency, which contributes
to stable training—especially in deep networks prone
to gradient degradation. Concatenation increases di-
mensionality and computational requirements, as re-
flected in a higher parameter count in the last column
of Table 5.
Effect of Iterative Fusion. As shown in Table 5, adding another layer of attention further improves performance, as the iterative refinement allows the model to tune its parameters to extract even more information from the initially fused feature map. However, this improvement comes at the cost of a slight increase in the model's number of parameters. Interestingly, we find that additional iterations beyond two do not further boost performance, and two iterations achieve the best results in our experiments.
Table 5: AP and number of parameters for different design considerations of GIFF. “Aggregation Operation” refers to the AGG operation in Eq. 1. “Depth” denotes the feature-map dimensionality reduction used to compute the attention weights. “w/ IAtten” and “w/o IAtten” denote results with and without iterative attention.
Experiment No.    AP@0.5 w/ IAtten (w/o RSU | w/ RSU)    AP@0.5 w/o IAtten (w/o RSU | w/ RSU)    AP@0.7 w/ IAtten (w/o RSU | w/ RSU)    AP@0.7 w/o IAtten (w/o RSU | w/ RSU)    Params (M) (w/ IAtten | w/o IAtten)
1 71.28 77.26 68.56 71.81 66.85 73.29 62.88 65.86 16.10 15.39
2 (Default) 73.62 78.93 68.97 72.96 68.37 75.82 63.48 65.94 16.12 15.98
3 68.53 74.33 68.14 71.61 62.76 67.42 62.36 64.41 16.89 16.20
4 67.15 72.14 68.50 72.25 64.32 68.46 63.32 63.74 16.92 16.21
5 68.56 70.05 67.53 70.0 62.68 71.14 61.55 63.52 17.06 16.78
6 69.93 72.78 68.46 70.94 63.12 68.37 63.10 63.10 17.13 16.85
7 CONCLUSION AND FUTURE
WORK
This paper presents GIFF, a graph iterative attention-
based network designed to address collaborative per-
ception challenges in multi-agent systems. GIFF ef-
fectively facilitates multi-agent collaboration by intel-
ligently fusing perceptual information received from
collaborators. It achieves this by learning the relative
importance of collaborators and identifying the spa-
tial regions within the received semantic information
that require higher attention. The iterative attention
mechanism further enhances the refinement of the
attention-learning process. GIFF achieves superior
performance on the object detection task, as demon-
strated on standard benchmarks such as V2XSim
and OPV2V. Despite these promising results, the ap-
proach has significant potential for future improve-
ments. As part of future work, we aim to address the
impact of transmission delays caused by communica-
tion network characteristics, which hinder the perfor-
mance of collaborative perception.
REFERENCES
Ahmed, A. N., Anwar, A., Mercelis, S., Latré, S., and
Hellinckx, P. (2021). Ff-gat: Feature fusion using
graph attention networks. In IECON 2021–47th An-
nual Conference of the IEEE Industrial Electronics
Society, pages 1–6. IEEE.
Ahmed, A. N., Mercelis, S., and Anwar, A. (2024a). Col-
labgat: Collaborative perception using graph attention
network. IEEE Access.
Ahmed, A. N., Mercelis, S., and Anwar, A. (2024b). Graph
attention based feature fusion for collaborative per-
ception. In 2024 IEEE Intelligent Vehicles Symposium
(IV), pages 2317–2324. IEEE.
Ahmed, A. N., Ravijts, I., de Hoog, J., Anwar, A., Mer-
celis, S., and Hellinckx, P. (2022). A joint perception
scheme for connected vehicles. In 2022 IEEE Sensors,
pages 1–4. IEEE.
Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston,
N. (2018). Variational image compression with a scale
hyperprior. arXiv preprint arXiv:1802.01436.
Chen, H., Wang, H., Liu, Z., Gu, D., and Ye, W.
(2024). Hp3d-v2v: High-precision 3d object detection
vehicle-to-vehicle cooperative perception algorithm.
Sensors, 24(7):2170.
Chen, Q., Ma, X., Tang, S., Guo, J., Yang, Q., and Fu,
S. (2019). F-cooper: Feature based cooperative per-
ception for autonomous vehicle edge computing sys-
tem using 3d point clouds. In Proceedings of the 4th
ACM/IEEE Symposium on Edge Computing, pages
88–100.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and
Koltun, V. (2017). CARLA: An open urban driving
simulator. In Proceedings of the 1st Annual Confer-
ence on Robot Learning, pages 1–16.
Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T.,
Mu, T.-J., Zhang, S.-H., Martin, R. R., Cheng, M.-M.,
and Hu, S.-M. (2022). Attention mechanisms in com-
puter vision: A survey. Computational visual media,
8(3):331–368.
Han, Y., Zhang, H., Li, H., Jin, Y., Lang, C., and Li, Y.
(2023). Collaborative perception in autonomous driv-
ing: Methods, datasets, and challenges. IEEE Intelli-
gent Transportation Systems Magazine.
Hu, Y., Fang, S., Lei, Z., Zhong, Y., and Chen, S. (2022).
Where2comm: Communication-efficient collabora-
tive perception via spatial confidence maps. Advances
in neural information processing systems, 35:4874–
4886.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015).
Spatial transformer networks. Advances in neural in-
formation processing systems, 28.
Kenney, J. B. (2011). Dedicated short-range communica-
tions (dsrc) standards in the united states. Proceedings
of the IEEE, 99(7):1162–1182.
Krajzewicz, D., Erdmann, J., Behrisch, M., and Bieker,
L. (2012). Recent development and applications of
sumo-simulation of urban mobility. International
journal on advances in systems and measurements,
5(3&4).
Li, Y., Ma, D., An, Z., Wang, Z., Zhong, Y., Chen, S.,
and Feng, C. (2022). V2x-sim: Multi-agent col-
laborative perception dataset and benchmark for au-
tonomous driving. IEEE Robotics and Automation
Letters, 7(4):10914–10921.
Li, Y., Ren, S., Wu, P., Chen, S., Feng, C., and Zhang,
W. (2021). Learning distilled collaboration graph for
multi-agent perception. Advances in Neural Informa-
tion Processing Systems, 34:29541–29552.
Liu, Y.-C., Tian, J., Glaser, N., and Kira, Z. (2020a).
When2com: Multi-agent perception via communica-
tion graph grouping. In Proceedings of the IEEE/CVF
Conference on computer vision and pattern recogni-
tion, pages 4106–4115.
Liu, Y.-C., Tian, J., Ma, C.-Y., Glaser, N., Kuo, C.-W., and
Kira, Z. (2020b). Who2com: Collaborative percep-
tion via learnable handshake communication. In 2020
IEEE International Conference on Robotics and Au-
tomation (ICRA), pages 6876–6883. IEEE.
Mannor, S., Peleg, D., and Rubinstein, R. (2005). The cross
entropy method for classification. In Proceedings of
the 22nd international conference on Machine learn-
ing, pages 561–568.
Qian, R., Lai, X., and Li, X. (2022). 3d object detection for
autonomous driving: A survey. Pattern Recognition,
130:108796.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio,
P., and Bengio, Y. (2017). Graph attention networks.
arXiv preprint arXiv:1710.10903.
Wang, T.-H., Manivasagam, S., Liang, M., Yang, B., Zeng,
W., and Urtasun, R. (2020). V2vnet: Vehicle-to-
vehicle communication for joint perception and pre-
diction. In Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part II 16, pages 605–621. Springer.
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). Cbam:
Convolutional block attention module. In Proceed-
ings of the European conference on computer vision
(ECCV), pages 3–19.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip,
S. Y. (2020). A comprehensive survey on graph neural
networks. IEEE transactions on neural networks and
learning systems, 32(1):4–24.
Xu, R., Guo, Y., Han, X., Xia, X., Xiang, H., and Ma,
J. (2021). Opencda: an open cooperative driving
automation framework integrated with co-simulation.
In 2021 IEEE International Intelligent Transportation
Systems Conference (ITSC), pages 1155–1162. IEEE.
Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M.-H., and Ma, J.
(2022a). V2x-vit: Vehicle-to-everything cooperative
perception with vision transformer. In European con-
ference on computer vision, pages 107–124. Springer.
Xu, R., Xiang, H., Xia, X., Han, X., Li, J., and Ma, J.
(2022b). Opv2v: An open benchmark dataset and
fusion pipeline for perception with vehicle-to-vehicle
communication. In 2022 International Conference on
Robotics and Automation (ICRA), pages 2583–2589.
IEEE.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z.,
Wang, L., Li, C., and Sun, M. (2020). Graph neu-
ral networks: A review of methods and applications.
AI open, 1:57–81.
Zhou, Y., Xiao, J., Zhou, Y., and Loianno, G. (2022).
Multi-robot collaborative perception with graph neu-
ral networks. IEEE Robotics and Automation Letters,
7(2):2289–2296.