BEVSeg2TP: Surround View Camera Bird’s-Eye-View Based Joint
Vehicle Segmentation and Ego Vehicle Trajectory Prediction
Sushil Sharma 1,2, Arindam Das 1, Ganesh Sistu 1, Mark Halton 1 and Ciarán Eising 1,2
1 Department of Electronic & Computer Engineering, University of Limerick, Ireland
2 SFI CRT Foundations in Data Science, University of Limerick, Ireland
{firstname.lastname}@ul.ie
Keywords:
Surround-View Camera, Encoder-Decoder Transformer, Segmentation, Trajectory Prediction.
Abstract:
Trajectory prediction is, naturally, a key task for vehicle autonomy. While the number of traffic rules is lim-
ited, the combinations and uncertainties associated with each agent’s behaviour in real-world scenarios are
nearly impossible to encode. Consequently, there is a growing interest in learning-based trajectory prediction.
The method proposed in this paper predicts trajectories by treating perception and trajectory prediction
as a unified system, and we show that this joint treatment has the potential to improve
the performance of perception. To achieve these goals, we present BEVSeg2TP - a surround-view camera
bird’s-eye-view-based joint vehicle segmentation and ego vehicle trajectory prediction system for autonomous
vehicles. The proposed system uses a network trained on multiple camera views. The images are transformed
using several deep learning techniques to perform semantic segmentation of objects, including other vehicles,
in the scene. The segmentation outputs are fused across the camera views to obtain a comprehensive repre-
sentation of the surrounding vehicles from the bird’s-eye-view perspective. The system further predicts the
future trajectory of the ego vehicle using a spatiotemporal probabilistic network (STPN) to optimize trajectory
prediction. This network leverages information from encoder-decoder transformers and joint vehicle segmen-
tation. The predicted trajectories are projected back to the ego vehicle’s bird’s-eye-view perspective to provide
a holistic understanding of the surrounding traffic dynamics, thus achieving safe and effective driving for vehi-
cle autonomy. The present study suggests that transformer-based models that use cross-attention information
can improve the accuracy of trajectory prediction for autonomous driving perception systems. Our proposed
method outperforms existing state-of-the-art approaches on the publicly available nuScenes dataset. The source code is available at https://github.com/sharmasushil/BEVSeg2TP/.
1 INTRODUCTION
Accurate trajectory prediction is a critical capabil-
ity for autonomous driving systems, playing a piv-
otal role in enhancing safety, efficiency, and driv-
ing policies. This technology is increasingly vital
as autonomous vehicles become more prevalent on
public roads, as it enables these vehicles to antic-
ipate the movements of various road users, includ-
ing pedestrians, cyclists, and other vehicles. By do-
ing so, autonomous vehicles can proactively plan and
execute safe manoeuvres, reducing the risk of po-
tential collisions (Li and Guo, 2021; Cheng et al.,
2019) and effectively navigating through complex
traffic scenarios. Moreover, trajectory prediction em-
powers autonomous vehicles to optimise their driv-
ing behaviour, enabling smoother lane changes (Chen
et al., 2020) and seamless merging to improve over-
all traffic flow and reduce congestion (Wei et al.,
2021). Furthermore, trajectory prediction also plays
a crucial role in facilitating effective communication
and interaction between autonomous vehicles, human
drivers, and pedestrians. By behaving predictably, au-
tonomous vehicles can earn the trust of other road
users (Liu et al., 2021; Yang et al., 2021) and support
other extended applications in the ADAS perception
stack, such as pedestrian detection (Das et al., 2023;
Dasgupta et al., 2022), and pose estimation (Das et al.,
2022).
In this paper, we introduce an approach called
BEVSeg2TP for joint vehicle segmentation and ego
vehicle trajectory prediction, leveraging a bird’s-
eye-view perspective from surround-view cameras.
Figure 1: Our proposed BEVSeg2TP framework: surround-view camera joint vehicle segmentation and ego vehicle trajectory prediction in bird's-eye view, consisting of an encoder-decoder transformer and a BEV projection module, followed by segmentation outputs fed to the spatio-temporal probabilistic network to produce the ego vehicle trajectory prediction.
Our proposed system employs a network trained on
surround-view or multi-camera view from the host
vehicle, which it transforms into bird’s-eye-view im-
agery of the surrounding context. These images un-
dergo deep learning-driven processes to perform se-
mantic segmentation on objects, including neighbor-
ing vehicles within the scene. The segmentation out-
comes are then amalgamated across camera perspec-
tives to generate a comprehensive representation of
the surrounding vehicles from a bird’s-eye-view per-
spective (Zhou and Krähenbühl, 2022). Building
upon this segmented data, the proposed system also
anticipates the future trajectories of the host vehicle
using a spatio-temporal probabilistic network (STPN)
(Cui et al., 2019). The STPN learns the spatiotempo-
ral patterns of vehicle motion from historical trajec-
tory data. The predicted trajectories are then projected
back to the ego vehicle’s bird’s-eye-view perspective
to provide a holistic understanding of the surrounding
traffic dynamics. Figure 1 represents the overarch-
ing depiction of our approach. Our principal contri-
butions to the BEVSeg2TP proposal are:
- Our proposed deep architecture offers an approach to jointly accomplish vehicle segmentation and ego vehicle trajectory prediction tasks by combining and adapting the works of (Zhou and Krähenbühl, 2022; Phan-Minh et al., 2020; Cui et al., 2019).
- We propose enhancements to the capabilities of the current encoder-decoder transformer used in the spatio-temporal probabilistic network (STPN) for optimizing trajectory prediction.
- We implemented an end-to-end trainable surround-view camera bird's-eye-view-based network that achieves state-of-the-art results on the nuScenes dataset (Caesar et al., 2020) when jointly trained with segmentation.
2 PRIOR ART
Joint vehicle segmentation and ego vehicle trajectory
prediction using a surround or multi-camera bird’s-
eye view is currently an emerging area of research
with several motivating factors. Firstly, working on
this problem could help advance the field and con-
tribute to the development of more effective and accu-
rate autonomous driving systems. The potential uses
of precise vehicle segmentation and predictions for
ego vehicle trajectories are vast, encompassing do-
mains such as self-driving vehicles, intelligent trans-
portation systems, and automated driving systems,
among others.
Moreover, this problem is complex and challeng-
ing, requiring the integration of information from
multiple sensors and camera views. Addressing the
technical challenges of this problem, such as design-
ing effective deep learning models or developing ef-
ficient algorithms, could be a motivating factor for
researchers interested in solving complex and chal-
lenging problems. Our primary focus is on enhancing
map-view segmentation. It is undeniable that exten-
sive research has been conducted in this field, which
lies at the convergence of 3D recognition (Ma et al.,
2019; Lai et al., 2023; Manhardt et al., 2019), depth
estimation (Eigen et al., 2014; Godard et al., 2019;
Ranftl et al., 2020; Zhou et al., 2017), and mapping
(Garnett et al., 2019; Sengupta et al., 2012; Zhu et al.,
2021).
These are the key areas that can facilitate segmen-
tation construction and improvement. While trajec-
tory prediction or motion planning for autonomous
systems is crucial, we acknowledge the need to con-
sider various aspects of the vehicle state, such as cur-
rent position and velocity, road geometry (Lee and
Kim, 2016; Wiest et al., 2020; Wu et al., 2017), other
vehicles, environmental factors, and driver behaviour
(Zhang et al., 2020; Abbink et al., 2017; McDonald
and Mazumdar, 2020). The architecture previously
described by the authors (Sharma et al., 2023) ex-
plores the utilization of the CNN-LSTM model for
predicting trajectories, covering unique scenarios like
pedestrians crossing roads. While the model adeptly
comprehends these scenarios, it adheres to a model-
driven methodology, thereby carrying inherent lim-
itations. In our pursuit to address these limitations
and devise an alternative approach, we propose the
integration of a transformer-based model into our tra-
jectory prediction methodology. Our strategy entails
a partial adoption of the principles from CoverNet
(Phan-Minh et al., 2020), albeit with notable distinc-
tions. CoverNet’s trajectory prediction relies on raster
maps, whereas our model pivots towards real-time
map view representations.
3 PROPOSED METHODOLOGY
In this section, we present BEVSeg2TP - our
proposed deep architecture designed to efficiently
achieve both vehicle segmentation and ego vehicle
trajectory prediction tasks simultaneously. The pro-
posed method, as depicted in Figure 2, utilizes mul-
tiple cameras to create a comprehensive view of the
environment around the ego vehicle, improving ego
vehicle and object segmentation, based on the work
presented by (Zhou and Krähenbühl, 2022). We ex-
tend this transformer technique to incorporate trajec-
tory prediction using a spatio-temporal probabilistic
network to calculate path likelihoods, as presented in
(Phan-Minh et al., 2020; Cui et al., 2019). This ap-
proach combines multiple sources of information for
more accurate future trajectory predictions, enhanc-
ing self-driving car safety and performance by jointly
learning the segmentation and the trajectory predic-
tion.
3.1 Surround-View Camera Inputs
The dataset used in this paper is nuScenes (Caesar
et al., 2020). It consists of six cameras located on
the vehicle, providing a 360° field of view. All cam-
eras in each scene have extrinsic (R,t) and intrinsic K
calibration parameters provided at every timestamp;
the intrinsic parameters remain unchanged with time.
Other perception sensors in the nuScenes dataset
(radar and lidar) are not used in this work.
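For readers unfamiliar with these calibration parameters, the following is a minimal numpy sketch (for illustration only, not code from this work) of how a world point can be projected into one camera using K, R and t; the extrinsic convention (R mapping world to camera coordinates, t the camera position) is an assumption.

```python
import numpy as np

def project_to_image(x_world: np.ndarray, K: np.ndarray,
                     R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project a 3D world point into pixel coordinates with a pinhole model.
    x_world: (3,) point; K: (3, 3) intrinsics; R: (3, 3), t: (3,) extrinsics.
    Assumes R maps world to camera coordinates and t is the camera position."""
    x_cam = R @ (x_world - t)        # world frame -> camera frame
    x_img = K @ x_cam                # camera frame -> homogeneous pixel coordinates
    return x_img[:2] / x_img[2]      # perspective division -> (u, v)
```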
3.2 Image Encoder
We use the simple and effective encoder-decoder ar-
chitecture for map-view semantic segmentation from
(Zhou and Krähenbühl, 2022). In summary, the au-
thors proposed an image encoder that generates a
multi-scale feature representation {φ} for each input
image, which is then combined into a shared map-
view representation using a cross-view cross-attention
mechanism. This attention mechanism utilizes a po-
sitional embedding {δ} to capture both the geomet-
ric structure of the scene, allowing for accurate spa-
tial alignment, and the sequential information be-
tween different camera views, facilitating temporal
understanding and context integration. All camera-
aware positional embeddings are presented as a sin-
gle key vector $\delta = [\delta_1, \delta_2, \ldots, \delta_6]$. Image features are combined into a value vector $\phi = [\phi_1, \phi_2, \ldots]$. Both
are merged to create the attention keys, and subsequently a softmax cross-attention is used
(Vaswani et al., 2017).
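As a concrete illustration of how these per-camera quantities might be assembled, below is a short PyTorch sketch; the tensor shapes and the additive merge of the positional embedding into the keys are assumptions made for illustration rather than the released implementation.

```python
import torch

# Illustrative shapes only: 6 cameras, P patches per camera, embedding dim D.
num_cams, P, D = 6, 28 * 60, 128

# Per-camera camera-aware positional embeddings delta_k and image features phi_k.
delta = [torch.randn(P, D) for _ in range(num_cams)]
phi = [torch.randn(P, D) for _ in range(num_cams)]

# Single key vector delta = [delta_1, ..., delta_6] and value vector
# phi = [phi_1, phi_2, ...], concatenated across cameras. The additive merge
# of embedding and feature into the keys is an assumption.
keys = torch.cat([d + f for d, f in zip(delta, phi)], dim=0)   # (6 * P, D)
values = torch.cat(phi, dim=0)                                 # (6 * P, D)
```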
3.3 Cross Attention
As illustrated in Figure 2, the cross-view transforma-
tion component aims to establish a connection be-
tween a map view and image features, as presented
by (Zhou and Krähenbühl, 2022). To summarise, pre-
cise depth estimation is not learned; rather, the trans-
former learns a depth proxy through positional em-
bedding $\{\delta\}$ ($x^{\text{world}}$ remains ambiguous). The cosine
similarity is used to express the geometric relation-
ship between the world and unprojected image coor-
dinates:
$$\cos(\theta) = \frac{\left(R_k^{-1} K_k^{-1}\, x^{\text{image}}\right) \cdot \left(x^{\text{world}} - t_k\right)}{\left\lVert R_k^{-1} K_k^{-1}\, x^{\text{image}} \right\rVert \left\lVert x^{\text{world}} - t_k \right\rVert} \quad (1)$$
where $x^{\text{image}} \in \mathbb{P}^3$ is a homogeneous image point for a given world coordinate $x^{\text{world}} \in \mathbb{R}^3$.
The cosine similarity traditionally relies on precise
world coordinates.
However, in this approach, the cosine similarity
is augmented with positional embeddings, thus hav-
ing the capability to learn both geometric and appear-
ance features (Zhou and Krähenbühl, 2022). Direction vectors $d_{k,i} = R_k^{-1} K_k^{-1} x_i^{\text{image}}$ are created for each image coordinate $x_i^{\text{image}}$, serving as a reference point in world coordinates. An MLP is used to convert the direction vector $d_{k,i}$ into a $D$-dimensional positional embedding denoted as $\delta_{k,i} \in \mathbb{R}^D$ (per (Zhou and Krähenbühl, 2022), we have set the value of $D$ to 128).
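A minimal PyTorch sketch of this step is given below, under assumed tensor shapes rather than the exact implementation; the normalisation of the direction vectors is also an assumption.

```python
import torch
import torch.nn as nn

D = 128  # positional embedding dimension, as in the paper

# Small MLP that maps a 3D direction vector to a D-dimensional embedding.
embed_mlp = nn.Sequential(nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))

def camera_positional_embedding(pixels, K, R):
    """pixels: (N, 2) float pixel coordinates; K, R: (3, 3) intrinsics/rotation
    of camera k. Returns (N, D) camera-aware positional embeddings."""
    ones = torch.ones(pixels.shape[0], 1)
    x_image = torch.cat([pixels, ones], dim=1)                # homogeneous (N, 3)
    # Direction vectors d_{k,i} = R_k^{-1} K_k^{-1} x_i^image (depth is ambiguous).
    d = (torch.inverse(R) @ torch.inverse(K) @ x_image.T).T   # (N, 3)
    d = d / d.norm(dim=1, keepdim=True)                       # keep direction only (assumption)
    return embed_mlp(d)                                       # (N, D)
```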
3.4 Joint Vehicle Segmentation
Figure 2: Our proposed BEVSeg2TP architecture: Joint vehicle segmentation and ego vehicle trajectory prediction involves extracting image features {φ} at multiple scales and using a camera-aware positional embedding {δ} to account for perspective distortion. We then use map-view positional embedding and cross-attention layers to capture contextual information from multiple views and refine the vehicle segmentation. This segmentation information is then used as input to a spatio-temporal probabilistic network (STPN) for trajectory prediction based on the surrounding environment.
To enhance the vehicle segmentation, we have designed our segmentation head to be simple, utilizing a series of convolutions on the bird's-eye-view (BEV) feature. Specifically, it consists of four 3×3 convolutions followed by a 1×1 convolution, resulting in
a BEV tensor of size $h \times w \times n$, where $n$ represents the number of categories. In our case, we set $n$ to 1, as we focus solely on vehicles and the other agents related to them, following the approach used in the cross-view transformer (Zhou and Krähenbühl, 2022). To
enhance road and vehicle segmentation in the dataset
using an encoder-decoder transformer, we employ the
following equation:
$$y = f(X_1, X_2)$$
where $y$ is the output segmentation map, $X_1$ is the input image from one sensor modality (e.g., camera), and $X_2$ is the input from another sensor modality (e.g., map information). $f$ is the cross-view trans-
former, which learns to combine the information from
the two modalities to produce a more accurate seg-
mentation map. The cross-attention mechanism can
be implemented using the following equation:
$$M = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \quad (2)$$
where $Q$, $K$, and $V$ are the queries, keys, and values, respectively, for each modality. The dot product between the queries and keys, $Q K^{T}$, is divided by the square root of the dimensionality of the key vectors ($d_k$) to prevent the dot product from becoming too large. Subsequently, the
obtained attention weights are employed to weigh the
values associated with each modality. These weighted
values are then combined to generate the output fea-
ture map M.
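To make the mechanism concrete, the sketch below implements the scaled dot-product cross-attention of Equation (2) (single-head, without learned projections, for clarity) together with a segmentation head of four 3×3 convolutions followed by a 1×1 convolution, as described above. Channel widths and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_attention(Q, K, V):
    """Scaled dot-product cross-attention, Eq. (2).
    Q: (Nq, d_k) map-view queries; K, V: (Nk, d_k) image keys/values."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (Nq, Nk)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # (Nq, d_k) fused map-view features

class SegmentationHead(nn.Module):
    """Four 3x3 convolutions followed by a 1x1 convolution on the BEV feature."""
    def __init__(self, in_channels=128, hidden=128, n_classes=1):
        super().__init__()
        layers = []
        c = in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, hidden, 3, padding=1), nn.ReLU(inplace=True)]
            c = hidden
        layers += [nn.Conv2d(hidden, n_classes, 1)]   # h x w x n output, n = 1 here
        self.head = nn.Sequential(*layers)

    def forward(self, bev_feature):                   # (B, C, h, w)
        return self.head(bev_feature)                 # (B, n_classes, h, w)
```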
3.5 Spatio-Temporal Probabilistic
Network (STPN)
This section describes the spatio-temporal probabilistic network used to predict the future states of the ego vehicle given a high-definition map, assuming that we have access to the state outputs of an object detection and tracking system of sufficient quality for autonomous vehicles, based on (Phan-Minh, 2021). The agents that the ego vehicle interacts with at time $t$ are denoted by the set $I_t$, and $s^i_t$ represents the state of agent $i \in I_t$ at time $t$. The discrete-time trajectory of agent $i$ for times $t = m, \ldots, n$ is denoted by $s^i_{m:n} = [s^i_m, \ldots, s^i_n]$, where $m < n$ and $i \in I_t$.
Additionally, we presume that the high-definition map, as depicted in our proposed method, will be accessible. This includes lane geometry, crosswalks, drivable areas, and other pertinent information. The scene context over the past $m$ steps, which includes the map and the partial history of ego vehicles, is denoted by $C = \left( \bigcup_i s^i_{t-m:t};\ \text{Map Information} \right)$.
Our trajectory prediction layer follows the approach presented in (Cui et al., 2019). To achieve effectiveness in this domain, we employ ResNet-50 (Table 1) (He et al., 2016), as recom-
mended by previous research (Cui et al., 2019; Chai
et al., 2019). Although our network currently gener-
ates predictions for one agent at a time, our approach
has the potential to predict for multiple agents simul-
taneously in a manner similar to (Chai et al., 2019).
However, we limit our focus to single-agent predic-
tions (as in (Cui et al., 2019)) to streamline the pa-
per and emphasize our primary contributions. To rep-
resent probabilistic trajectory predictions in multiple
modes, we utilize a classification technique that se-
lects the relevant trajectory set based on the agent of
interest and scene context C. The softmax distribution
is employed, as is typical in classification literature.
Specifically, the probability of the k-th trajectory is
expressed as follows:
$$p\!\left(s^{k}_{t:t+N} \mid x\right) = \frac{\exp f_k(x)}{\sum_i \exp f_i(x)} \quad (3)$$
where $f_i(x) \in \mathbb{R}$ is the output of the network's proba-
bilistic layer. We have implemented Multi-Trajectory
Prediction (MTP) (Cui et al., 2019) with adjustments
made for our datasets. This model forecasts a set
number of trajectories (modes) and determines their
respective probabilities. Note that we are now focus-
ing on single trajectory prediction (STP) (Djuric et al.,
2020).
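As an illustration of Equation (3), the sketch below scores a fixed set of candidate trajectory modes from context features and converts the scores into mode probabilities with a softmax; the module name, feature dimension, and number of modes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryModeClassifier(nn.Module):
    """Scores K candidate trajectory modes from scene-context features x and
    turns the scores f_i(x) into probabilities via softmax, as in Eq. (3)."""
    def __init__(self, feature_dim=256, num_modes=15):
        super().__init__()
        self.score = nn.Linear(feature_dim, num_modes)  # one logit f_i(x) per mode

    def forward(self, x):                               # x: (B, feature_dim)
        logits = self.score(x)                          # (B, K)
        return torch.softmax(logits, dim=-1)            # p(s^k | x) for each mode k
```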
3.6 Loss Function
The loss function employed for vehicle segmentation
in our transformer-based model is defined as follows:
$$L_{\text{seg}}(m, \hat{m}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ m_i \cdot \log\!\left(p(\hat{m}_i)\right) + (1 - m_i) \cdot \log\!\left(1 - p(\hat{m}_i)\right) \right] \quad (4)$$
where $L_{\text{seg}}(m, \hat{m})$ is the binary cross-entropy loss (Jadon, 2020) for vehicle segmentation, $m$ is the input tensor, and $\hat{m}$ is the target tensor for all $N$ points. This loss function is particularly valuable for binary classification challenges where our model generates logits (unbounded real numbers) as output. It facilitates the computation of the binary cross-entropy loss concerning binary target labels $\hat{m}$, ensuring effective training and performance evaluation for vehicle segmentation in our transformer-based approach.
In terms of trajectory prediction, the loss function
we are considering is one of the most commonly used:
the mean squared error (MSE). This loss function typ-
ically involves measuring the dissimilarity between
the predicted and the ground-truth trajectories.
$$L_{\text{traj}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{y}_i - y_i \right\rVert_2^2 \quad (5)$$
Here, $N$ is the number of training examples, $\hat{y}_i$ is the predicted trajectory for ego vehicle $i$, and $y_i$ is the
corresponding ground truth trajectory. The squared
difference between the two trajectories is calculated
element-wise and then averaged across all elements in
the trajectory. The resulting value is the mean squared
error loss, which measures the overall performance
of the model in predicting the trajectories for the ego
vehicle.
Our final loss function $L_{\text{total}}$ comprises two components, as shown in the equation below:
$$L_{\text{total}} = \alpha L_{\text{seg}} + \beta L_{\text{traj}} \quad (6)$$
Gradients are shared by both tasks up to the initial layers of the network. In the above equation, α and β are hyperparameters that balance the segmentation and trajectory prediction losses.
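The sketch below shows one way the combined objective of Equations (4)-(6) could be computed in PyTorch; the use of BCEWithLogitsLoss and the example values of α and β are assumptions for illustration, not settings reported here.

```python
import torch
import torch.nn as nn

# Segmentation loss: binary cross-entropy on logits, Eq. (4).
seg_criterion = nn.BCEWithLogitsLoss()
# Trajectory loss: mean squared error between predicted and ground-truth
# trajectories, Eq. (5).
traj_criterion = nn.MSELoss()

def total_loss(seg_logits, seg_target, traj_pred, traj_gt,
               alpha=1.0, beta=1.0):
    """Weighted sum of the two task losses, Eq. (6).
    alpha and beta are illustrative hyperparameter values."""
    l_seg = seg_criterion(seg_logits, seg_target)
    l_traj = traj_criterion(traj_pred, traj_gt)
    return alpha * l_seg + beta * l_traj
```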
4 EXPERIMENTAL SETUP
4.1 Dataset
Experiments are carried out on the nuScenes dataset
(Caesar et al., 2020), which comprises 1000 video
sequences gathered in Boston and Singapore. The
dataset is composed of scenes that have a duration of
20 seconds and consist of 40 frames each, resulting
in a total of 40k samples. The dataset is divided into
training, validation, and testing sets, with 700, 150,
and 150 scenes, respectively. The recorded data provides a comprehensive 360° view of the area surrounding the ego vehicle and comprises six cam-
era perspectives. Note that we are employing iden-
tical train-test-validation splits as those used in the
previous works (Zhou and Krähenbühl, 2022; Philion
and Fidler, 2020) for comparison.
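For reference, the scene-level splits can be obtained with the official nuscenes-devkit roughly as sketched below (assuming the devkit is installed and the data lives under a local dataroot); this is an illustrative sketch rather than the exact data pipeline used here.

```python
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.splits import create_splits_scenes

# Load the trainval metadata (assumed local path).
nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=True)

# Official scene-name splits: 700 train / 150 val scenes in the trainval release.
splits = create_splits_scenes()
train_scenes = [s for s in nusc.scene if s['name'] in splits['train']]
val_scenes = [s for s in nusc.scene if s['name'] in splits['val']]
print(len(train_scenes), len(val_scenes))  # expected: 700 150
```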
4.2 Transformer Architecture and
Implementation Details
The initial step of the network involves creating a
camera-view representation for each input image. To
achieve this, we utilize EfficientNet-B4 (Tan and Le,
2019) as the feature extractor and input each im-
age I
i
to obtain a multi-resolution patch embedding
δ
1
1
, δ
2
1
, δ
3
1
, .....δ
R
n
, where R denotes the number of
resolutions that are taken into account.
According to our experimental findings, accurate
results can be achieved when using R = 1 resolution.
However, if we were to increase the value of R to 2, as
suggested by CVT in (Zhou and Krähenbühl, 2022),
the camera-view representation for each input image
in the network would incorporate additional informa-
tion, such as BEV features. While this has the po-
tential to result in a more detailed representation of
the input images, it also comes with drawbacks, in-
cluding increased computational requirements and a
higher risk of overfitting.
The processing for each resolution is carried out
individually, beginning with the lowest resolution.
We employ cross-view attention to map all image fea-
tures to a map-view and refine the map-view embed-
ding, repeating this procedure for higher resolutions.
In the end, we employ three up-convolutional lay-
ers to produce the output at full resolution. Once
we obtain the full-resolution output, we input the ego
vehicle features, which have a resolution of $h_{\text{bev}} \times w_{\text{bev}} \times 256$, into the probabilistic function for trajectory forecasting, resulting in the set of trajectories $[p_1, p_2, p_3, \ldots, p_n]$. Subsequently, we refine and ob-
tain the probabilistic value, which represents our final
trajectory.
To implement the architecture, we employ a pre-
trained EfficientNet-B4 (Tan and Le, 2019) that we
fine-tune. The two scales, (28, 60) and (14, 30), cor-
respond to an 8× and 16× downscaling, respectively.
For the initial map view positional embedding, we
use a tensor of learned parameters with dimensions
w ×h ×D, where D is set to 128. To ensure computa-
tional efficiency, we limit the grid size to w = h = 25,
as the cross-attention function becomes quadratic in
growth with increasing grid size. The encoder com-
prises two cross-attention blocks, one for each scale
of patch features, which utilize multi-head attention
with 4 heads and an embedding size of $d_{\text{head}} = 64$.
The decoder includes three layers of bilinear up-
sampling and convolution, each of which increases
the resolution by a factor of 2 up to the final output
resolution of 200×200, corresponding to a 100 ×100
meter area around the ego-vehicle. The map-view
representation obtained through the cross-attention
transformer is passed through the joint vehicle seg-
mentation module to accurately identify the vehicle’s
segmentation. This segmentation is then utilized as
input to the Spatial-Temporal Probabilistic Network
(STPN), which offers probabilistic predictions. In-
stead of providing a single deterministic trajectory,
the network offers a probability distribution over pos-
sible future trajectories. This information aids in
identifying the motion planning of the ego vehicle.
Precisely segmenting the pixels corresponding to the
ego vehicle enables the system to more accurately es-
timate its position, speed, and orientation in relation
to other objects in the environment. This, in turn,
facilitates improved decision-making during naviga-
tion. Figure 2 offers a comprehensive overview of this
architecture.
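A minimal sketch of such a decoder is given below: three blocks of bilinear upsampling followed by convolution, each doubling the spatial resolution (e.g., from a 25×25 grid up to 200×200). The channel widths are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class BEVDecoder(nn.Module):
    """Three bilinear-upsample + convolution blocks, each doubling resolution."""
    def __init__(self, in_channels=128, hidden=64, out_channels=1):
        super().__init__()
        blocks = []
        c = in_channels
        for _ in range(3):
            blocks += [
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(c, hidden, 3, padding=1),
                nn.ReLU(inplace=True),
            ]
            c = hidden
        blocks += [nn.Conv2d(hidden, out_channels, 1)]  # final per-cell prediction
        self.decoder = nn.Sequential(*blocks)

    def forward(self, x):              # x: (B, in_channels, 25, 25) map-view grid
        return self.decoder(x)         # (B, out_channels, 200, 200)
```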
5 ABLATION STUDY
We perform a detailed ablation experiment to assess the influence of several factors on the performance of our model. We specifically examine the impact of different backbone models and loss functions.
Table 1: Comparison of different standard backbone models employed for trajectory prediction on the nuScenes dataset (Caesar et al., 2020).
Backbone # Params. (M) Features MSE
EfficientNet-B0 1.9 1280 0.3385
DenseNet-121 1.7 1024 0.2079
ResNet-50 1.4 512 0.1062
We performed an ablation on different backbone
models to investigate their impact on the performance
of our target task on the nuScenes dataset, as pre-
sented in Table 1. Notably, the ResNet-50 back-
bone, with 1.4 million trainable parameters and a
feature size of 512, demonstrated promising results,
achieving the lowest MSE of 0.1062. It is likely that
ResNet-50 works well for trajectory prediction on the
nuScenes dataset, as its model parameters align well
with the characteristics of that dataset.
Table 2: Ablation on different loss functions for segmen-
tation task on the nuScenes dataset (Caesar et al., 2020).
Loss Function No. of Class Loss
Binary Cross Entropy 2 0.1848
Binary Focal Loss 2 0.2758
In our task, we utilize the binary cross-entropy
loss function, which aligns well with the inherent
characteristics of our standard binary classification
problem. Additionally, we explore and compare al-
ternative loss functions, including binary focal loss.
However, our findings indicate that the binary cross-
entropy loss function yields superior results, as pre-
sented in Table 2. This is primarily attributed to the
balanced distribution of classes within our dataset,
which favors the effectiveness of binary cross-entropy
in accurately modeling the classification problem.
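For completeness, a common formulation of the binary focal loss used in this comparison is sketched below; the focusing parameter γ and weighting α shown are typical defaults, not necessarily the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples relative to plain BCE.
    alpha and gamma are illustrative defaults, not the paper's settings."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```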
Table 3: Comparison of methods under Setting 1 and Setting 2, considering vehicles with visibility greater than 40%; our method achieves the highest scores.
Visibility > 40%
Method Setting 1 Setting 2
LSS (Philion and Fidler, 2020) - 32.1
CVT (Zhou and Krähenbühl, 2022) 37.5 36.0
BEVSeg2TP (Ours) 37.8 37.9
Table 4: Comparison of vehicle segmentation performance on the nuScenes dataset using different methods, including LSS, CVT, and our proposed method. Results are presented in terms of Intersection over Union (IoU) scores.
Method Resolution R Vehicle IoU
LSS (Philion and Fidler, 2020) - 32.1
CVT (Zhou and Krähenbühl, 2022) 2 36.0
BEVSeg2TP (Ours) 1 37.9
Table 5: Comparison of the minimum average displacement error (MinADE_k) and minimum final displacement error (MinFDE_k) for competing methods on the nuScenes dataset, over a prediction horizon of 6 seconds.
Method MinADE_5 MinADE_10 MinADE_15 MinFDE_5 MinFDE_10 MinFDE_15
Const Vel and Yaw 4.61 4.61 4.61 11.21 11.21 11.21
Physics oracle 3.69 3.69 3.69 9.06 9.06 9.06
CoverNet (Phan-Minh et al., 2020) 2.62 1.92 1.63 11.36 - -
Trajectron++ (Salzmann et al., 2020) 1.88 1.51 - - - -
MTP (Cui et al., 2019) 2.22 1.74 1.55 4.83 3.54 3.05
MultiPath (Chai et al., 2019) 1.78 1.55 1.52 3.62 2.93 2.89
BEVSeg2TP (Ours) 1.63 1.29 1.15 3.85 2.13 1.65
6 RESULTS
We evaluate the BEV map representation and trajec-
tory planning of the BEVSeg2TP model on the pub-
licly available nuScenes dataset. The evaluation is
conducted in two different settings - ’Setting 1’ refers
to a 100m ×50m grid with a 25cm resolution, while
’Setting 2’ refers to a 100m ×100m grid with a 50cm
resolution. During training and validation, vehicles
with a visibility level above the predefined thresh-
old of 40% are considered. Table 3 demonstrates the
comparison of our proposed approach with other ex-
isting works such as LSS (Philion and Fidler, 2020)
and CVT (Zhou and Krähenbühl, 2022).
First, we compare the BEV segmentation obtained
from various methods, including LSS and CVT with
the results from our proposed BEVSeg2TP. Accu-
rately predicting the future motion of vehicles is criti-
cal, as it helps the model gain a comprehensive under-
standing of the environment by capturing the spatial
relationships among pedestrians, vehicles, and obsta-
cles. However, our second contribution focuses on
improving map-view segmentation of vehicles. Our
experimental findings show that employing a resolu-
tion of R = 1 yields promising results. However, in-
creasing the value of R to 2, as recommended by CVT,
would lead to the camera-view representation for each
input image in the network losing information, such
as BEV features. We conducted further evaluations
using various methods, as illustrated in Table 4.
As shown in Table 5, we evaluate our approach against four baselines, (Cui et al., 2019), (Chai et al., 2019), (Phan-Minh et al., 2020) and (Salzmann et al., 2020), and two physics-based approaches. These four baselines are recently proposed models considered to represent the current state of the art in multimodal trajectory prediction. This comparison aims to assess the effectiveness
and accuracy of our model in predicting trajectories in
comparison to existing models. The goal is to deter-
mine if our model performs better than or at least as
well as the state-of-the-art baseline model. By doing
so, we can gain insight into the strengths and weak-
nesses of our model and identify areas for further im-
provement. To evaluate the performance of our model
on the nuScenes dataset, we first obtained the output trajectories $\{y_1, y_2, y_3, \ldots, y_n\}$. We evaluated the per-
formance of the model on this specific dataset for dif-
ferent values of K, where K was set to 5, 10, and 15
respectively.
$$\text{MinADE}_k = \min_{i \in \{1 \ldots K\}} \frac{1}{T_f} \sum_{t=1}^{T_f} \left\lVert y^{\text{gt}}_t - y^{(i)}_t \right\rVert_2 \quad (7)$$
Figure 3: Qualitative results of BEVSeg2TP model for joint vehicle segmentation and ego vehicle trajectory prediction: Six camera views around the vehicle (top three facing forward, bottom three facing backwards) with ground truth segmentation on the right. Our trajectory prediction with improved map-view segmentation (second from right) compared to the CVT method (third from right).
To train the model, we minimized the minimum average displacement error over K (MinADE$_k$) on the
training set. In other words, we aimed to reduce the
error between the predicted trajectories and the actual
trajectories by minimizing the minimum distance be-
tween them for each of the K time steps. This method
allowed us to improve the accuracy of our model’s
predictions and ensure that it performs well on the
nuScenes dataset. Here, $y^{\text{gt}}_t$ represents the ground-truth position of the object at time step $t$, and $y^{(i)}_t$ represents the predicted position of the object at time step $t$ for the $i$-th trajectory in the set of $K$ trajectories.
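A small numpy sketch of this metric (Equation (7)), under assumed array shapes, is given below.

```python
import numpy as np

def min_ade(y_gt: np.ndarray, y_pred: np.ndarray) -> float:
    """Minimum average displacement error over K modes, Eq. (7).
    y_gt: (T_f, 2) ground-truth trajectory.
    y_pred: (K, T_f, 2) predicted trajectories (one per mode)."""
    # Per-mode displacement at each time step, then averaged over the horizon.
    displacements = np.linalg.norm(y_pred - y_gt[None], axis=-1)  # (K, T_f)
    ade_per_mode = displacements.mean(axis=-1)                    # (K,)
    return float(ade_per_mode.min())
```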
We took the output trajectories $\{y_1, y_2, y_3, \ldots, y_n\}$ and used $K = 15$ for the nuScenes dataset, minimizing the minimum-over-$K$ average displacement error (MinADE$_k$) over the training set. As depicted in
Figure 3, on the left-hand side of the image, there
are six camera views surrounding the vehicle. The
top three views are oriented forward, while the bot-
tom three views face backwards. On the right side
of the image, there is ground truth segmentation for
reference. Moving from right to left, the second im-
age from the right displays our trajectory prediction,
along with improved map-view segmentation for ve-
hicles. Lastly, the third image from the right illus-
trates the CVT (Zhou and Krähenbühl, 2022) method,
which we use to conduct a comparison and present the
results.
In Figure 3, the black color corresponds to the results obtained with CVT, the red color corresponds to the results obtained with our model, and the white color corresponds to the nuScenes ground truth, i.e., the true segmentation of the images. The purpose of the comparison is to evaluate the performance of CVT against our model. Figure 3 reveals that our model performs well compared to CVT in both vehicle and road segmentation. For vehicle segmentation, our model identifies the precise positions of vehicles within the image with a high level of accuracy, whereas CVT is slightly less accurate in this regard. This distinction is clearly visible in the figure, where the red markings produced by our model align more closely with the ground-truth markings than the black markings generated by CVT. Similarly,
with regard to road segmentation, our model also ex-
hibits decent performance. To gain further insights,
additional results can be explored via the following
link: https://youtu.be/FNBMEUbM3r8.
7 CONCLUSION
In this paper, we propose BEVSeg2TP - a surround-
view camera bird’s-eye-view-based joint vehicle seg-
mentation and ego vehicle trajectory prediction system built on encoder-decoder transformer-based techniques, which shows promising results in achieving safe and effective driving for autonomous vehicles. The
system processes images from multiple cameras
mounted on the vehicle, performs semantic segmen-
tation of objects in the scene, and predicts the future trajectory of the ego vehicle, taking the surrounding vehicles into account, using a combination of an encoder-decoder transformer and a spatio-temporal probabilistic network (STPN).
The predicted trajectories are projected back to the
ego vehicle’s bird’s-eye-view perspective, providing
a comprehensive understanding of the surrounding
traffic dynamics. Our findings underscore the poten-
tial benefits of employing transformer-based models
in conjunction with spatio-temporal networks, high-
lighting their capacity to significantly enhance trajec-
tory prediction accuracy. Ultimately, these advance-
ments contribute to the overarching goal of achieving
a safer and more efficient autonomous driving experi-
ence.
While the camera configuration of nuScenes is
important, it is not a typical commercially deployed
surround-view system. Commercial surround view
systems, used for both viewing and vehicle automa-
tion and perception tasks (Kumar et al., 2023; Eis-
ing et al., 2022), typically employ a set of four fish-
eye cameras around the vehicle. In the future, we in-
tend to apply the methods discussed here to fisheye
surround-view camera systems.
ACKNOWLEDGEMENTS
This publication has emanated from research sup-
ported in part by a grant from Science Foundation
Ireland under Grant number 18/CRT/6049. For the
purpose of Open Access, the author has applied a CC
BY public copyright licence to any Author Accepted
Manuscript version arising from this submission.
REFERENCES
Abbink, D. A., Mulder, M., and de Winter, J. C. (2017).
Driver behavior in automated driving: Results from a
field operational test. Transportation Research Part F:
Traffic Psychology and Behaviour, 45:93–106.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E.,
Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Bei-
jbom, O. (2020). nuscenes: A multimodal dataset for
autonomous driving. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recogni-
tion, pages 11621–11631.
Chai, Y., Sapp, B., Bansal, M., and Anguelov, D. (2019).
Multipath: Multiple probabilistic anchor trajectory
hypotheses for behavior prediction. arXiv preprint
arXiv:1910.05449.
Chen, D., Jiang, L., Wang, Y., and Li, Z. (2020). Au-
tonomous driving using safe reinforcement learning
by incorporating a regret-based human lane-changing
decision model. In 2020 American Control Confer-
ence (ACC), pages 4355–4361. IEEE.
Cheng, H., Wang, Y., and Wu, J. (2019). Research on
the design of an intelligent vehicle collision avoidance
system. In Journal of Physics: Conference Series, vol-
ume 1239, page 012096.
Cui, H., Radosavljevic, V., Chou, F.-C., Lin, T.-H., Nguyen,
T., Huang, T.-K., Schneider, J., and Djuric, N. (2019).
Multimodal trajectory predictions for autonomous
driving using deep convolutional networks. In 2019
International Conference on Robotics and Automation
(ICRA), pages 2090–2096. IEEE.
Das, A., Das, S., Sistu, G., Horgan, J., Bhattacharya, U.,
Jones, E., Glavin, M., and Eising, C. (2022). Deep
multi-task networks for occluded pedestrian pose es-
timation. Irish Machine Vision and Image Processing
Conference.
Das, A., Das, S., Sistu, G., Horgan, J., Bhattacharya, U.,
Jones, E., Glavin, M., and Eising, C. (2023). Revisit-
ing modality imbalance in multimodal pedestrian de-
tection. In 2023 IEEE International Conference on
Image Processing (ICIP), pages 1755–1759.
Dasgupta, K., Das, A., Das, S., Bhattacharya, U.,
and Yogamani, S. (2022). Spatio-contextual deep
network-based multimodal pedestrian detection for
autonomous driving. IEEE transactions on intelligent
transportation systems, 23(9):15940–15950.
Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou,
F.-C., Lin, T.-H., Singh, N., and Schneider, J. (2020).
Uncertainty-aware short-term motion prediction of
traffic actors for autonomous driving. In Proceedings
of the IEEE/CVF Winter Conference on Applications
of Computer Vision, pages 2095–2104.
Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map
prediction from a single image using a multi-scale
deep network. Advances in neural information pro-
cessing systems, 27.
Eising, C., Horgan, J., and Yogamani, S. (2022). Near-field
perception for low-speed vehicle automation using
surround-view fisheye cameras. IEEE Transactions
on Intelligent Transportation Systems, 23(9):13976–
13993.
Garnett, N., Cohen, R., Pe’er, T., Lahav, R., and Levi, D.
(2019). 3d-lanenet: end-to-end 3d multiple lane detec-
tion. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 2921–2930.
Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J.
(2019). Digging into self-supervised monocular depth
estimation. In Proceedings of the IEEE/CVF interna-
tional conference on computer vision.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Jadon, S. (2020). A survey of loss functions for semantic
segmentation. In 2020 IEEE conference on compu-
tational intelligence in bioinformatics and computa-
tional biology (CIBCB), pages 1–7. IEEE.
Kumar, V. R., Eising, C., Witt, C., and Yogamani, S. K.
(2023). Surround-view fisheye camera perception for
automated driving: Overview, survey & challenges.
IEEE Transactions on Intelligent Transportation Sys-
tems, 24(4):3638–3659.
Lai, X., Chen, Y., Lu, F., Liu, J., and Jia, J. (2023). Spheri-
cal transformer for lidar-based 3d recognition. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 17545–17555.
Lee, J. and Kim, J. (2016). Road geometry recognition for
intelligent vehicles: a survey. International Journal of
Automotive Technology, 17(1):1–10.
Li, Y. and Guo, Q. (2021). Intelligent vehicle collision
avoidance technology and its applications. Journal of
Advanced Transportation, 2021:6623769.
Liu, Y., Li, X., Li, X., Li, Z., Wu, C., and Li, J. (2021).
Autonomous vehicles and human factors: A review of
the literature. IEEE Access, 9:38416–38434.
Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., and Fan,
X. (2019). Accurate monocular 3d object detection
via color-embedded 3d reconstruction for autonomous
driving. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision.
Manhardt, F., Kehl, W., and Gaidon, A. (2019). Roi-10d:
Monocular lifting of 2d detection to 6d pose and met-
ric shape. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition.
McDonald, M. and Mazumdar, S. (2020). Drivers’ per-
ceived benefits and barriers of advanced driver as-
sistance systems (adas) in the uk. Transportation
Research Part F: Traffic Psychology and Behaviour,
73:1–16.
Phan-Minh, T. (2021). Contract-based design: Theories
and applications. PhD thesis, California Institute of
Technology.
Phan-Minh, T., Grigore, E. C., Boulton, F. A., Beijbom, O.,
and Wolff, E. M. (2020). Covernet: Multimodal be-
havior prediction using trajectory sets. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 14074–14083.
Philion, J. and Fidler, S. (2020). Lift, splat, shoot: Encod-
ing images from arbitrary camera rigs by implicitly
unprojecting to 3d. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–
28, 2020, Proceedings, Part XIV 16. Springer.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., and
Koltun, V. (2020). Towards robust monocular depth
estimation: Mixing datasets for zero-shot cross-
dataset transfer. IEEE transactions on pattern anal-
ysis and machine intelligence, 44(3):1623–1637.
Salzmann, T., Ivanovic, B., Chakravarty, P., and Pavone,
M. (2020). Trajectron++: Dynamically-feasible tra-
jectory forecasting with heterogeneous data. In Com-
puter Vision–ECCV 2020: 16th European Confer-
ence, Glasgow, UK, August 23–28, 2020, Proceed-
ings, Part XVIII 16, pages 683–700. Springer.
Sengupta, S., Sturgess, P., Torr, P., et al. (2012). Automatic dense visual semantic mapping from street-level imagery. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 857–862.
Sharma, S., Sistu, G., Yahiaoui, L., Das, A., Halton, M.,
and Eising, C. (2023). Navigating uncertainty: The
role of short-term trajectory prediction in autonomous
vehicle safety. In Proceedings of the Irish Machine
Vision and Image Processing Conference.
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional conference on machine learning. PMLR.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wei, Y., Cheng, S., Wu, Y., and Liu, Y. (2021). Traffic con-
gestion prediction and control using machine learning:
A review. IEEE Transactions on Intelligent Trans-
portation Systems, 22(7):4176–4195.
Wiest, J., Omari, S., Köhler, J., Lützenberger, M., and
Ziegler, J. (2020). Learning to predict the effect of
road geometry on vehicle trajectories for autonomous
driving. IEEE Robotics and Automation Letters,
5(2):2426–2433.
Wu, C., Li, X., Li, X., and Guo, K. (2017). Road geometry
modeling and analysis for vehicle dynamics control.
Mechanical Systems and Signal Processing.
Yang, Y., Chen, Y., and Zhang, J. (2021). A survey on
human-autonomous vehicle interaction: Past, present
and future. IEEE Transactions on Intelligent Vehicles,
6(2):141–154.
Zhang, Y., Liu, H., Shen, S., and Wang, D. (2020). Multi-
modal trajectory prediction with maneuver-based mo-
tion prediction and driver behavior modeling. IEEE
Robotics and Automation Letters, 5(4):5461–5468.
Zhou, B. and Krähenbühl, P. (2022). Cross-view transform-
ers for real-time map-view semantic segmentation. In
Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 13760–
13769.
Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. (2017).
Unsupervised learning of depth and ego-motion from
video. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition.
Zhu, M., Zhang, S., Zhong, Y., Lu, P., Peng, H., and Lenne-
man, J. (2021). Monocular 3d vehicle detection using
uncalibrated traffic cameras through homography. In
2021 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS). IEEE.