BEVFastLine: Single Shot Fast BEV Line Detection for Automated Parking Applications
Praveen Narasappareddygari¹, Venkatesh Moorthi Karunamoorthy¹, Shubham Sonarghare², Ganesh Sistu² and Prasad Deshpande²
¹ Valeo India Pvt. Ltd., Chennai, India
² Valeo Vision Systems, Tuam, Ireland
{firstname.lastname}@valeo.com
Keywords: Road Marking Detection, Autonomous Parking, Multi Camera Line Detection, Birds Eye View.
Abstract: In autonomous parking scenarios, accurate near-field environmental perception is crucial for smooth operations. Parking line detection, unlike the well-understood lane detection, poses unique challenges due to its lack of spatial consistency in orientation and location, and its varied appearances in color, pattern, and background surfaces. Consequently, state-of-the-art models for lane detection, which rely on anchors and offsets, are not directly applicable. This paper introduces BEVFastLine, a novel end-to-end line marking detection architecture in Bird's Eye View (BEV) space, designed for 360° multi-camera perception applications. BEVFastLine integrates our single-shot line detection methodology with advanced Inverse Perspective Mapping (IPM) techniques, notably our fast splatting technique, to efficiently detect line markings in varied spatial contexts. This approach is suitable for real-time hardware in Level-3 automated vehicles. BEVFastLine accurately localizes parking lines in BEV space with up to 10 cm precision. Our methods, including the 4X faster Fast Splat and single-shot detection, surpass LSS and OFT in accuracy, achieving 80.1% precision and 90% recall, and nearly doubling the performance of BEV-based segmentation and polyline models. This streamlined solution is highly effective in complex, dynamic parking environments, offering high-precision localization within 10 meters around the ego vehicle.
1 INTRODUCTION
In the domain of autonomous vehicles, the emergence
of autonomous parking systems marks a significant
stride towards enhanced efficiency, safety, and user
convenience. Unlike other driving scenarios such
as highways and urban roads, parking areas present
complex challenges due to their confined nature and
the presence of various static and dynamic obstacles.
Central to navigating these challenges is the perception task of line detection, which is critical for accu-
rate vehicle positioning and safe maneuvering within
parking environments.
In the spectrum of perception tasks, two pivotal
tasks—line detection and lane identification—often
become conflated. Line detection, regarded as a
lower-level task, is principally devoted to discerning
physical markings within the drivable terrain, em-
ploying either geometric or learning-based method-
ologies. Traditionally, geometric modeling algo-
rithms like the Hough Transform (Hough, 1962) and
Canny edge (Canny, 1986) detection have been uti-
lized to pinpoint pixels in camera imagery corre-
sponding to road lines, culminating in outputs such
as binary images or geometric parameters encapsu-
lating these lines. However, a contemporary shift is
evident towards adopting learning-based algorithms,
marking a progressive trend in this domain (Zakaria
et al., 2023). In contrast, lane identification, a higher-
tier task, endeavors to decipher the semantic implica-
tion of the detected lines. This includes distinguish-
ing which lines demarcate the current driving lane,
adjacent lanes, or other classifications of lanes such
as oncoming lanes or bike lanes.
Line landmarks, exhibited as various line types
on the parking terrain, serve as essential navigational
aids, guiding vehicles into designated spaces. Their
role is paramount in demarcating drivable zones,
delineating parking slot boundaries, and ensuring
safe vehicular maneuvering without collisions or en-
croachments on neighboring spaces (Lee and Park,
2021). Nonetheless, the detection of these line land-
marks presents a challenge due to their elongated
form and diverse visual appearances, which often ren-
der them indiscernible from other similar patterns in
the environment. Such ambiguity poses a challenge
to perception algorithms and complicates the data an-
notation process, potentially resulting in discrepan-
cies in the training data for deep learning models. In
this work, the focus is on line detection for park-
ing systems, given the critical role of precise navi-
gation especially in densely populated parking envi-
ronments. Accurate detection of spatial parameters is
pivotal for optimal space utilization, enabling the ve-
hicle to align itself accurately within a parking spot’s
bounds. This accuracy becomes particularly vital dur-
ing complex parking maneuvers, like parallel parking,
where navigating through tight spaces with minimal
margin for error is imperative.
Traditional line detection techniques, such as
those outlined in Wang et al. (2018) (Wang et al.,
2018), typically involve a two-stage process. The first
stage encompasses either semantic segmentation of
lines or direct regression of polynomial line coeffi-
cients in pixel space. The second stage involves pro-
jecting these lines into Birds Eye View (BEV) space,
often based on a flat-world assumption. However, this
methodology harbors several limitations, including a
lack of geographical awareness of road surface varia-
tions, such as ramps and slopes.
Furthermore, BEV-based lane detection models,
which are predominantly designed for lanes with spa-
tial consistency, cannot be directly applied to line de-
tection in parking scenarios. This is due to their re-
liance on spatial anchors (Garnett et al., 2019) and
offsets (Efrat et al., 2020), which are less effective
for lines that exhibit a diverse spatial spread, as op-
posed to the relatively uniform lanes encountered in
driving contexts. Figure 1 highlights the distribution
of lanes in two prominent public datasets, KITTI and
TuSimple (Wen et al., 2021), in pixel space, and con-
trasts it with the distribution of parking lines in our
dataset within a 25-meter range around the ego vehi-
cle, represented in yellow color. This stark contrast
underscores the necessity for a novel approach in line
detection specifically tailored to the complexities of
parking environments. The heatmap of lines is pre-
sented in the Bird’s Eye View (BEV) space, rather
than in the pixel space. This approach is chosen be-
cause in fisheye images, lines tend to be distributed
throughout, leading to clutter. Visualizing them in
the BEV space, which offers a top-view perspective in
Euclidean space, simplifies the analysis and enhances
clarity.
Figure 1: Left: Lanes Heatmap for KITTI and TuSimple
datasets in Pixel Space. Right: Parking Lines Heatmap in
our dataset in BEV Space with 25 meters around the ego
vehicle.
Addressing the challenges associated with line de-
tection necessitates the development of comprehen-
sive and robust detection strategies. These strategies
should account for the unique characteristics of line
landmarks, requiring the design of context-aware al-
gorithms capable of discerning subtle differences in
visual features. Furthermore, the algorithms need to
be efficient and real-time to accommodate the dy-
namic nature of parking environments and ensure
timely responses for real-time hardware in Level-3
automated vehicles.
To this end, our proposed methods aim to address
these challenges, and our primary contributions are
summarized as follows:
• We introduce a novel single-shot deep detector for parking lines, proficient in capturing the geometric intricacies of parking boundaries. This system inherently exhibits resilience to challenging conditions within parking environments.
• We introduce an alternative to the LSS method's cumsum trick for splatting features to BEV space, utilizing one-hot sparse encoding and matrix multiplication for efficient point data processing, thereby reducing computational complexity and memory usage. The splatting technique works with any Inverse Perspective Mapping based BEV projection model and is demonstrated with the OFT BEV model as well.
• Our approach demonstrates its speed, efficiency, and precision via rigorous testing on a large-scale surround-view perception dataset for parking line detection.
2 RELATED WORK
We delve into prior studies that leverage surround-
view cameras, examine techniques associated with
processing surround-view images, and provide a con-
cise overview of methods used for line landmark de-
tection, as outlined below.
2.1 Surround-View Camera System
Owing to its large field-of-view of 190° (Kumar et al., 2023), the surround-view camera system has gained
significant popularity in the realm of near-field visual
perception (Kumar et al., 2021). Near-field perception involves accurate localization within 10-15 meters around the ego vehicle. A typical such system
(figure 2) in autonomous parking frameworks (Ku-
mar et al., 2020; Yahiaoui et al., 2019), integrates four
fisheye cameras positioned on the ego-vehicle. Such
a configuration ensures a comprehensive capture of
objects in close proximity, facilitating dependable vi-
sual perception. Our dataset, while derived from the
comprehensive Woodscape collection established by
(Yogamani et al., 2021), includes a subset of data not
previously released, reserved for commercial use due
to stringent data protection policies and privacy con-
siderations in line with EU GDPR and Chinese data
laws.
Figure 2: Data capture setup with 4 fisheye surround view
cameras covering 360° around the vehicle.
2.2 Orthographic Feature Transform
(OFT)
The Orthographic Feature Transform (OFT) method
reprojects image-based features into an orthographic
3D space, enhancing spatial configuration under-
standing (Roddick et al., 2018). This technique,
though primarily developed for 3D object detection,
holds significant promise for applications like line
and curb detection due to its unique feature process-
ing approach.
In the initial phase, a Convolutional Neural Net-
work (CNN) serves as the front-end feature extractor,
processing the input image to identify various fea-
tures, mathematically represented as a feature map
f (u, v), where (u, v) coordinates correspond to the im-
age plane. The OFT populates a 3D voxel feature map
g(x, y, z), with each voxel in g corresponding to a 3D
space region and associated with a specific 2D image
feature map area. The mapping from a voxel to the
image plane is approximated by a bounding box, cal-
culated based on intrinsic camera parameters and the
voxel’s 3D world coordinates. The relationship is ex-
pressed as:
u_1 = f \cdot \frac{x - 0.5r}{z + 0.5\,\frac{x}{|x|}\,r} + c_u \qquad (1)

v_1 = f \cdot \frac{y - 0.5r}{z + 0.5\,\frac{y}{|y|}\,r} + c_v \qquad (2)

u_2 = f \cdot \frac{x + 0.5r}{z - 0.5\,\frac{x}{|x|}\,r} + c_u \qquad (3)

v_2 = f \cdot \frac{y + 0.5r}{z - 0.5\,\frac{y}{|y|}\,r} + c_v \qquad (4)
where r represents the voxel size, and [u1, u2] and
[v1, v2] define the corresponding voxel’s bounding
box in the image plane.
Features within the defined bounding box on the
image are aggregated (typically by average pooling)
to assign a feature vector to the voxel, represented as:
g(x, y, z) = \frac{1}{(u_2 - u_1)(v_2 - v_1)} \sum_{u=u_1}^{u_2} \sum_{v=v_1}^{v_2} f(u, v) \qquad (5)
with the summation over [u1, u2] and [v1, v2].
The 3D voxel feature map g undergoes transfor-
mation into an orthographic feature map h(x, z), col-
lapsing the 3D map down to a 2D bird’s-eye view rep-
resentation, removing the original image’s perspec-
tive effects. This process might involve summing fea-
tures along the vertical axis or employing more com-
plex transformation matrices, such as:
h(x, z) = \sum_{y=y_0}^{y_0 + H} W(y)\, g(x, y, z) \qquad (6)

where W(y) are learned weight matrices, and the summation occurs over the voxel map's height y (Roddick et al., 2018).
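To make the voxel-to-image pooling above concrete, the following NumPy sketch evaluates Eqs. (1)-(5) for a single voxel and camera. The function names and the dense per-voxel computation are our illustrative assumptions; the original OFT implementation aggregates features far more efficiently using integral images.

```python
# Minimal NumPy sketch of the OFT voxel projection and pooling (Eqs. 1-5).
# Names (f, cu, cv, r) follow the equations above; array layouts are assumed.
import numpy as np

def voxel_to_bbox(x, y, z, r, f, cu, cv):
    """Approximate image-plane bounding box [u1, u2] x [v1, v2] of the voxel
    centred at (x, y, z) with edge length r (Eqs. 1-4)."""
    sx, sy = np.sign(x), np.sign(y)
    u1 = f * (x - 0.5 * r) / (z + 0.5 * sx * r) + cu
    u2 = f * (x + 0.5 * r) / (z - 0.5 * sx * r) + cu
    v1 = f * (y - 0.5 * r) / (z + 0.5 * sy * r) + cv
    v2 = f * (y + 0.5 * r) / (z - 0.5 * sy * r) + cv
    return sorted((u1, u2)), sorted((v1, v2))

def voxel_feature(feat_map, bbox_u, bbox_v):
    """Average-pool the image feature map f(u, v) over the voxel's box (Eq. 5)."""
    H, W, _ = feat_map.shape                      # assumed (H, W, C) layout
    u1, u2 = np.clip(np.round(bbox_u).astype(int), 0, W - 1)
    v1, v2 = np.clip(np.round(bbox_v).astype(int), 0, H - 1)
    return feat_map[v1:v2 + 1, u1:u2 + 1].mean(axis=(0, 1))
```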
In the context of line detection, OFT’s transfor-
mation of complex 3D information into a simplified
2D orthographic representation is invaluable. The or-
thographic view provides a consistent scale for fea-
tures, crucial for detecting lines and curbs that ap-
pear distorted in perspective images. The OFT en-
sures well-represented features of lines and curbs
through feature aggregation in the voxel and ortho-
graphic maps. This representation highlights the con-
sistent characteristics of lines and curbs, simplify-
ing detection. Furthermore, the orthographic map
enhances spatial reasoning, as the relative position-
ing and orientation of lines and curbs are easily in-
terpreted in orthographic space. This facilitation is
particularly beneficial in complex scenarios like in-
tersections or curved roads. Lastly, the orthographic
feature map can integrate with further neural network
stages specifically designed for line and curb detec-
tion, leading to more accurate and reliable detections
in autonomous driving systems.
2.3 Lift, Splat, Shoot (LSS)
In the domain of autonomous vehicle perception,
the “Lift, Splat, Shoot” (LSS) framework repre-
sents a significant advancement, addressing the chal-
lenge of integrating multi-camera data into a co-
herent bird’s-eye-view (BEV) representation (Philion
and Fidler, 2020). The LSS model adheres to cru-
cial symmetries—translation equivariance, permuta-
tion invariance of camera inputs, and ego-frame isom-
etry equivariance—facilitating a unified scene inter-
pretation in the ego vehicle frame. This approach
overcomes limitations of single-image detectors that
cannot fully exploit the multi-viewpoint sensory in-
formation.
The LSS process begins with the “Lift” stage,
where images are transformed into a shared 3D space
without the reliance on depth sensors, generating a
latent depth distribution for each pixel across a range
of discrete depths. This results in a comprehensive
point cloud that encapsulates the scene from all cam-
eras in a frustum-shaped structure. The subsequent
“Splat” stage involves an efficient pillar pooling tech-
nique that transforms the point cloud into a 2D BEV
grid. This grid is then processed by a conventional
CNN, enabling BEV semantic segmentation or mo-
tion planning. A key innovation in this stage is the
use of a cumulative sum operation that optimizes the
pooling step, significantly enhancing the model’s ef-
ficiency.
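For illustration, the pillar pooling with the cumulative-sum trick can be sketched as follows in PyTorch; the tensor shapes and variable names are our assumptions rather than the reference implementation.

```python
# Sketch of LSS-style sum pooling of lifted point features into BEV bins
# using the cumulative-sum ("cumsum") trick. Shapes are illustrative.
import torch

def cumsum_splat(feats, bin_ids, num_bins):
    """feats: (N, C) point features; bin_ids: (N,) long BEV cell index."""
    order = torch.argsort(bin_ids)                     # sort points by bin id
    feats, bin_ids = feats[order], bin_ids[order]
    csum = feats.cumsum(dim=0)                         # running sum over sorted points
    last = torch.ones_like(bin_ids, dtype=torch.bool)  # last point of each bin section
    last[:-1] = bin_ids[1:] != bin_ids[:-1]
    sums = csum[last]
    sums = torch.cat([sums[:1], sums[1:] - sums[:-1]])  # per-section sums
    bev = torch.zeros(num_bins, feats.shape[1], dtype=feats.dtype)
    bev[bin_ids[last]] = sums                          # scatter into the BEV grid
    return bev
```

The FastSplat technique introduced in Section 3.2 replaces the sort and this scatter-style bookkeeping with a single matrix multiplication.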
Finally, the “Shoot” stage of the LSS model is
tailored for motion planning. By evaluating various
trajectories on the inferred cost map, the model se-
lects the trajectory with the minimum associated cost.
This approach allows for end-to-end differentiability,
where the model learns to fuse camera inputs and op-
timizes for motion planning objectives concurrently.
The LSS model’s capability to learn data-driven fu-
sion across cameras and its backpropagation-friendly
architecture enables a level of adaptability and learn-
ing potential not seen in previous models, which is
empirically demonstrated to be effective in fusing in-
formation and aiding motion planning in autonomous
driving tasks.
2.4 Line Landmarks Detection
Line segment detection has evolved significantly from
traditional image gradient-based methods to more so-
phisticated learning-based techniques. Initially, meth-
ods relied on image gradients, with early approaches
thresholding gradient magnitude to identify strong
edges and aligned pixel sets. These techniques, while
fast and accurate, often faltered in noisy or low-
illumination conditions.
In contrast, recent learning-based methods have
showcased remarkable advancements. Huang et al.
(Huang et al., 2018) and (Huang et al., 2020) intro-
duced end-to-end trainable CNNs, enhancing detec-
tion in man-made environments and developing the
TP-LSD model for real-time detection. Xu et al. (Xu
et al., 2021) leveraged Transformers for line segment
detection, significantly refining the process. Lin et
al. (Lin et al., 2020) merged classical techniques with
deep learning, while Gao et al. (Gao et al., 2022) fo-
cused on camera pose refinement through point and
line optimization. However, these developments have
not directly addressed the unique challenges of park-
ing line detection in BEV space using fisheye cam-
eras.
The ULSD method, as proposed by Li et al. (Li
et al., 2021), offers a significant advancement in line
segment detection across various camera types. This
end-to-end network leverages the Bézier curve model
to achieve a model-free representation of line seg-
ments, which is particularly beneficial for images
with distortion, such as those from fisheye or spher-
ical cameras. The approach is independent of camera
distortion parameters, simplifying the detection pro-
cess for both distorted and undistorted images. De-
spite these advantages, the paper does not extensively
address computational efficiency or real-time pro-
cessing capabilities, which are critical for deployment
in autonomous systems. Further, while the method
performs well on the constructed datasets, its adapt-
ability to a broader range of real-world scenarios re-
mains to be thoroughly assessed.
An important aspect to note is that these methods
operate in the pixel domain, and the line detections
must be projected into the metric (BEV) space for
vehicle navigation planning. This projection poses
a significant challenge as it typically assumes a flat
ground surface, not accounting for varying road con-
ditions like slopes or ramps. Such an assumption can
lead to unacceptable localization errors, particularly
in precise navigation applications such as parking in
tight spaces. This underscores the need for advanced
methodologies capable of accurately interpreting and
adapting to the complex topography within the learn-
ing models. Though BEV perception using both
IPM (Inverse Perspective Mapping) and Transformer-
based methods has made strides in lane detection,
these techniques often rely on anchors and offsets.
This reliance limits their applicability to spatially consistent patterns such as lanes, and makes them less suitable for categories with a more diverse spatial spread, such as parking lines or curbs.
3 METHODS
Our architecture, BEVFastLine, comprises three key
stages: an image encoder, a fast splatting technique
for feature projection into BEV space, and a BEV en-
coder followed by a Single Shot Line Detection net-
work (Figure 3).
3.1 LIFT Features
Images captured by four Surround View System
(SVS) cameras undergo processing via an Efficient-
Net B0 encoder, which facilitates the extraction of
embeddings. This approach aligns with the method-
ology employed in the Lift-Splat-Shoot (LSS) system
(Philion and Fidler, 2020). The images processed have
a resolution of 1024x856 pixels. The EfficientNet-B0
encoder subsequently produces feature maps with a
resolution of 64x54x512. These image features are
further processed by a depth net convolutional net-
work, which is designed to extract 17 depth features
maintaining the same spatial resolution. Additionally,
64 context features are extracted at the same reso-
lution. The depth features from each camera extend
over a spatial range of 10 meters, effectively covering
a 10-meter radius around the central vehicle.
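The lift step described above can be summarized as the outer product of a per-pixel depth distribution and a context feature vector. The sketch below follows the channel counts given in the text (17 depth bins, 64 context channels); the single-convolution head is our simplified assumption in the spirit of LSS, not the exact network used here.

```python
# Illustrative lift head: predict a depth distribution and context features per
# image cell, then form frustum features by an outer product over depth.
import torch
import torch.nn as nn

class LiftHead(nn.Module):
    def __init__(self, in_ch=512, depth_bins=17, ctx_ch=64):
        super().__init__()
        self.head = nn.Conv2d(in_ch, depth_bins + ctx_ch, kernel_size=1)
        self.depth_bins = depth_bins

    def forward(self, feat):                             # feat: (B, in_ch, H, W)
        out = self.head(feat)
        depth = out[:, :self.depth_bins].softmax(dim=1)  # (B, D, H, W) depth weights
        ctx = out[:, self.depth_bins:]                   # (B, C, H, W) context features
        # (B, D, 1, H, W) * (B, 1, C, H, W) -> (B, D, C, H, W) frustum features
        return depth.unsqueeze(2) * ctx.unsqueeze(1)
```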
3.2 FastSplat over Splatting
In the Lift, Splat, Shoot (LSS) approach, the “cumulative sum trick” is a central mechanism employed to
efficiently splat the features from an entire sensor rig
to BEV. This trick operates by sorting all points based
on their bin id (grid location in BEV Space), execut-
ing a cumulative sum over all features, and then sub-
tracting the cumulative sum values at the boundaries
of the bin sections. This method was designed to
enhance memory efficiency, which would otherwise
be used for padding during sum pooling across pil-
lars (see Figure 4). The cumulative sum operation is
central to this approach, effectively aggregating data
for subsequent algorithmic steps. However, it is not
without drawbacks. The necessity for sorting and the
scatter operation for summing are both computation-
ally intensive and not optimally compatible with neu-
ral accelerators, which typically expedite inference by
distributing vector algebraic operations.
In contrast to the LSS framework, our FastSplat
method, depicted in Figure 4, adopts a distinctive ap-
proach by forgoing the cumulative sum trick. Instead,
it utilizes a one-hot sparse encoding technique for
identifying the bin id of each point. In this method,
each bin id is represented by a unique one-hot en-
coded vector, marked by a single ’1’ in its corre-
sponding position, with all other elements set to ’0’.
This efficient encoding simplifies the differentiation
between various bin ids.
The process then involves matrix multiplication of
these one-hot encoded points with the camera features
to compute the Bird’s Eye View (BEV) features. FastSplat offers a 4X speed-up in splatting due to two primary advantages:
• It obviates the need for a sorting operation, which typically has a computational complexity of O(n log n), where n is the number of grid cells in BEV space.
• It replaces the cumsum trick, which at its core is a scatter operation involving summation based on an index vector. This operation is not a standard vector algebraic process, making it unsuitable for neural engines on edge devices. By modeling the process as generalized matrix multiplication, FastSplat aligns well with the capabilities of neural accelerators, which are essentially matrix multiplication engines.
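A minimal PyTorch sketch of this formulation is given below: the assignment of points to BEV bins is written as a one-hot matrix (dense here only for clarity) and the splat reduces to a single matrix multiplication. Shapes and names are our assumptions; a deployed version would store the assignment matrix sparsely.

```python
# FastSplat-style pooling expressed as a generalized matrix multiplication.
import torch

def fast_splat(feats, bin_ids, num_bins):
    """feats: (N, C) lifted point features; bin_ids: (N,) long BEV cell index."""
    one_hot = torch.zeros(num_bins, feats.shape[0], dtype=feats.dtype)
    one_hot[bin_ids, torch.arange(feats.shape[0])] = 1.0   # one '1' per point column
    return one_hot @ feats                                  # (num_bins, C) BEV features
```

This yields the same per-bin sums as the cumsum-based pooling sketched in Section 2.3, but as a standard matrix product that maps directly onto neural accelerators.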
The FastSplat module receives its inputs from the Lift module and the camera frustums, with a significant modification in the approach to camera geometry. Unlike the method of Philion and Fidler (2020), which utilizes
pinhole camera geometry for frustum creation, our
approach adapts the camera frustums to fisheye ge-
ometry. This adaptation is achieved using a polyno-
mial distortion camera model, a technique compre-
hensively described by Usenko et al. (2018).
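We do not reproduce the exact camera model here, but the general idea of a polynomial fisheye frustum can be sketched as follows: the image radius is modeled as a polynomial of the incidence angle, and unprojection inverts that polynomial numerically. The polynomial order, coefficients, and Newton-based inversion below are illustrative assumptions only.

```python
# Hedged sketch: unproject a fisheye pixel to a unit ray under a polynomial
# angle-to-radius model r(theta) = k1*theta + k2*theta^2 + ...
import numpy as np

def pixel_to_ray(u, v, cx, cy, k):
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)                  # radial distance from the principal point
    theta = r / k[0] if r > 0 else 0.0    # initial guess from the linear term
    for _ in range(10):                   # Newton iterations to invert r(theta)
        f = sum(ki * theta ** (i + 1) for i, ki in enumerate(k)) - r
        fp = sum((i + 1) * ki * theta ** i for i, ki in enumerate(k))
        theta -= f / fp
    phi = np.arctan2(dv, du)
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```

Scaling such rays by the candidate depths yields the fisheye frustum points that are later binned into the BEV grid.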
Figure 3: Illustration of the Proposed Method Featuring Fast Lifting and Single Shot Line Detection Head. The notation
used in the method includes D, representing the Number of Depths; C, indicating the Number of Cameras; Cx and Cy, which
denote the Center of the Line; and Ex and Ey, signifying the Endpoints of the Line in the X and Y directions, respectively.
3.3 Single Shot Line Head
Though methods like ULSD offer a solution for di-
rect regression of a parametric line model, they are
not ideally suited for integrating into the BEV archi-
tectures due to computational issues as outlined be-
low:
• ULSD operates as a two-stage method. Initially, a line proposal network generates potential line candidates. These candidates are then used as control points for a Bézier model to compute uniform points. Subsequently, the point indices are employed to extract feature vectors from the encoder block, which are then passed to a binary classifier for line detection.
• Additionally, the concept of junctions and centers in ULSD, and the associated learning process to match them and form line proposals, involve complex operations. These operations are not typically supported by hardware accelerators on embedded systems, posing a challenge for real-time application.
Our proposed model, directly regressing lines in
a grid fashion as shown in Figure 3, marks a signifi-
cant shift in line detection methodologies. By regress-
ing the line center and the lengths extending from it,
and employing a classification node for line existence
probability, our model aligns with the paradigm shift
observed in object detection. Similar to how single-
stage, anchor-free algorithms like (Zhou et al., 2019)
and (Tian et al., 2020) have provided effective al-
ternatives to two-stage methods such as Faster R-CNN (Ren et al., 2016), our single-stage line detection approach offers a comparable alternative. This represents a notable advancement, mir-
roring the evolution seen in object detection towards
more efficient and streamlined models.
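A hedged sketch of such a head is shown below: each BEV grid cell predicts a line-existence score together with the centre (Cx, Cy) and endpoint extents (Ex, Ey) named in Figure 3. The exact channel layout, losses, and decoding are our assumptions for illustration, not the trained configuration.

```python
# Illustrative single-shot line head over the BEV feature map.
import torch
import torch.nn as nn

class SingleShotLineHead(nn.Module):
    def __init__(self, in_ch=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 5, 1))         # [score, Cx, Cy, Ex, Ey] per cell

    def forward(self, bev):                  # bev: (B, in_ch, H, W)
        out = self.conv(bev)
        score = out[:, :1].sigmoid()         # line existence probability
        center = out[:, 1:3]                 # line centre offsets within the cell
        extent = out[:, 3:5]                 # signed extents towards the endpoints
        return score, center, extent
```

At inference, cells whose score exceeds a threshold decode directly into segments with endpoints at centre ± extent in BEV metric coordinates, with no proposal or matching stage.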
Figure 4: Top: Fast Splatting with Sparse Matrix Multiplication. Bottom: Point Pillars based Splatting using Cumulative Sum.
4 EXPERIMENTS
This section presents our methodology and the tech-
nical details behind the evaluation of our parking line
detection work. We discuss the setup used for training
our models, the characteristics of the commercial sub-
set of the dataset utilized, and the specific implemen-
tation strategies adopted. This section also details the
post-processing techniques that refine the output and
underscores the measures ensuring system reliability
across varied scenarios.
4.1 Experimental Setup
The experimental setup for evaluating our models
was conducted using high-performance computing re-
sources. Specifically, the models were trained on an
NVIDIA A100 GPU with 40 GB of memory, which
is well-suited for demanding deep learning tasks. We
chose a batch size of 4 to balance between computa-
tional efficiency and memory constraints.
Training times per epoch for the OFT and LSS approaches are 1.6 and 1.5 hours, respectively. The choice of optimizer was AdamW,
a variant of the Adam optimizer that incorporates a
decoupled weight decay regularization, which often
leads to better performance in deep learning models.
The learning rate was set to 1e-4, providing a
suitable compromise between convergence speed and
stability of the training process. In the interest of fo-
cusing on the models’ ability to learn from raw data,
no augmentations were applied to either the image or
BEV representations during training. This decision
was made to assess the models’ performance under
unaltered data conditions, thus ensuring the evalua-
tion of their fundamental feature extraction and gen-
eralization capabilities.
4.1.1 Dataset
Our dataset, curated for the development and assess-
ment of our parking line detection system, comprises
images from fisheye cameras that encapsulate a 360-
degree view of parking environments. The image
set incorporates views from Front View (FV), Rear
View (RV), Mid-View Right (MVR), and Mid-View
Left (MVL) cameras, each capturing at a resolution
of 1024x856 pixels. The training subset consists of
12,574 timesteps, translating to 50,296 images across
all camera perspectives. For evaluation, the dataset
encompasses 3,529 timesteps, totaling 14,116 im-
ages. Here, a ’timestep’ denotes a singular temporal
instance with simultaneous captures from all cameras,
reflecting various parking scenarios as shown in Fig-
ure 5.
The timesteps are carefully sourced from 602 traces, of which 473 are used for training and 129 for val-
idation. The dataset offers sequences that showcase
continuity in line features, crucial for temporal con-
sistency and algorithmic robustness. Each sequence
represents a complete trajectory of the ego vehicle
from entering to exiting a parking area. Emphasiz-
ing geographic diversity, the dataset includes captures
from the Czech Republic (CZE), Ireland (IRL), and
the United States (USA), with 147, 416, and 45 traces,
respectively. This diversity ensures the adaptability of
our system to various parking layouts and conditions.
Figure 5: Parking lines in our dataset. a) Solid lines, b) Lines on cobblestones, c) Washed-out lines, d) Angled lines, e) Dashed lines, and f) Colored lines.
Data collection was performed using six differ-
ent vehicles to affirm the model’s invariance to spe-
cific vehicle camera rig configurations. The environ-
mental contexts covered include a spectrum of park-
ing settings: city lots, indoor garages, urban parking
structures, and rural parking areas, with respective in-
stances of 5, 61, 491, and 51, thus providing extensive
exposure to various parking situations.
The dataset further addresses a variety of light-
ing conditions critical for parking applications: arti-
ficial lighting for indoor settings (61 instances), low-
light conditions found during dawn or dusk (6 in-
stances), and bright daylight scenarios typical for out-
door parking (511 instances). Street lighting condi-
tions, representing nighttime parking environments,
are included in 30 instances. This extensive range
of lighting conditions ensures the detection system is
thoroughly tested and capable of reliable performance
in real-world parking applications, ready for rigorous
training and evaluation.
4.1.2 Implementation Details
Our system processes four fisheye images by pass-
ing them to an EfficientNet-lite B0 inspired image en-
coder. This encoder consists of five reduction blocks,
reducing spatial dimensions and enhancing feature
depth, with a skip connection between the latter stages
to preserve context in the output features.
For the Orthographic Feature Transform (OFT)
projection, an adaptation for fisheye cameras aligns
with the SVS BEV grid using extrinsic parameters.
The BEV encoder then refines these features through
three reduction and two upsampling layers, culminat-
ing in a feature map tailored for line detection.
The lines decoder, with three branches, predicts
junction points, center points, and line segmentation.
It employs varied loss functions like binary cross en-
tropy and weighted L1 and smooth L1 losses for dif-
ferent prediction types. Post-processing utilizes non-
maximum suppression and max pooling on junction
maps, followed by matching and correction steps to
compute Lines of Interest (LOIs).
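For reference, the non-maximum suppression on junction and centre maps mentioned above is commonly realised with a max-pooling comparison; the kernel size and threshold in the sketch below are illustrative assumptions.

```python
# Max-pooling based peak picking on a predicted heatmap.
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=3, thresh=0.3):
    """heatmap: (B, 1, H, W) in [0, 1]; returns a boolean mask of local maxima."""
    pooled = F.max_pool2d(heatmap, kernel_size=k, stride=1, padding=k // 2)
    return (heatmap == pooled) & (heatmap > thresh)
```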
This streamlined process, emphasizing computa-
tional efficiency and modularity, ensures real-time ap-
plicability across diverse computational setups, main-
taining precision without excessive computational
load. In comparison to the standard semantic segmen-
tation model (LSS), which requires 14.8 TOPS (Tril-
lion Operations Per Second) for a 10 FPS (Frames Per
Second) throughput and incurs a post-processing time
of 4.8 seconds on a typical Intel Xeon PC equipped
with an Nvidia A100 GPU, our proposed BEVFast-
Line model demonstrates a similar computational de-
mand at 14.81 TOPS. However, it significantly re-
duces post-processing time to just 0.4 seconds. This
reduction in processing time translates into decreased
CPU cycle requirements and enhances overall performance efficiency, as detailed in Table 1.
The post-processing phase is only needed for the
baseline models (LSS & OFT) to convert the line seg-
mentation output to poly-line representation.
The post-processing phase is pivotal in transform-
ing the segmentation masks produced by the decoder
into vectorized line representations. Following bi-
nary thresholding to enhance the contrast between
line features and the background, the masks are skele-
tonized to single-pixel thickness, preserving the es-
sential structure of the lines.
Subsequent to skeletonization, any intersecting
lines are segmented at junctions to facilitate individ-
ual analysis. Connected Component Analysis (CCA)
is then utilized to identify and isolate these distinct
line segments. At this juncture, we employ Bézier
curve fitting to model the geometry of each line seg-
ment accurately.
Bézier curves offer a parametric approach to
defining curves and are particularly advantageous for
their scalability and the ease with which they can
model complex shapes. For relatively straight paths or
gentle curves, quadratic Bézier curves are fitted using
the endpoints and a calculated control point from the
segmented lines. For more complex or sharply curv-
ing paths, cubic Bézier curves are utilized, requiring
two control points in addition to the endpoints to ac-
curately trace the line’s trajectory.
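The quadratic case reduces to a small least-squares problem: with the two endpoints fixed, only the middle control point is solved for. The chord-length parameterisation in the sketch below is our assumption; the cubic case adds a second unknown control point in the same manner.

```python
# Fit a quadratic Bezier curve to one ordered, skeletonised line segment.
import numpy as np

def fit_quadratic_bezier(points):
    """points: (N, 2) ordered coordinates; returns control points P0, P1, P2."""
    d = np.cumsum(np.r_[0.0, np.linalg.norm(np.diff(points, axis=0), axis=1)])
    t = (d / d[-1])[:, None]                   # chord-length parameter in [0, 1]
    p0, p2 = points[0], points[-1]
    # B(t) = (1-t)^2 P0 + 2 t (1-t) P1 + t^2 P2  ->  least squares for P1
    w = 2.0 * t * (1.0 - t)
    rhs = points - (1.0 - t) ** 2 * p0 - t ** 2 * p2
    p1 = (w * rhs).sum(axis=0) / (w ** 2).sum()
    return p0, p1, p2
```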
By fitting Bézier curves to the segmented lines,
our system generates a smooth and continuous repre-
sentation of road markings. These vectorized lines
are mathematically defined and easily manipulated,
which is essential for applications that rely on pre-
cise pathing, such as autonomous vehicle navigation
systems and advanced mapping services.
The fitting of Bézier curves marks the final step
in the line detection process, concluding with the ex-
traction of endpoints for each curve. This compre-
hensive post-processing strategy ensures that our sys-
tem’s outputs are not only accurate in terms of spatial
positioning but also robust against the variability of
real-world driving environments.
5 PERFORMANCE EVALUATION
The effectiveness of our line landmark detection sys-
tem is assessed quantitatively through precision and
recall metrics, while also incorporating a spatial val-
idation method to determine the veracity of detected
line endpoints.
Figure 6: Qualitative Results on four samples from the test set.
5.1 Evaluation Criteria
True positive detections are validated if the predicted
line endpoint falls within a circle whose radius is a certain percentage of the distance from the ego vehicle to the predicted endpoint (the Circular Radius column in Table 1). This
radius scales with distance to account for the relative
position of line features, recognizing that distant lines
allow for a larger margin of error due to perspective
effects (see Figure 7).
False positives are identified when predicted end-
points do not coincide with any ground truth line
within the scaled radius, indicating a detection where
no line feature exists. These criteria ensure that
our evaluation accounts for both the accuracy of line
placement and the distance-dependent tolerance for positional errors.
Table 1: Comparative Performance Metrics for Line Detection Approaches.

Circular Radius | Metric    | LSS + Lines Derived (Baseline) | BEV FastLine (LSS) (Proposed) | OFT + Lines Derived (Baseline) | OFT + Single Shot (Proposed)
20%             | Precision | 45   | 80.1 | 28.8 | 80.2
20%             | Recall    | 46.2 | 90   | 12   | 25.6
25%             | Precision | 51.1 | 81.3 | 33.7 | 82.4
25%             | Recall    | 52.5 | 71.1 | 13.8 | 26
30%             | Precision | 55.3 | 82.3 | 40.5 | 83.6
30%             | Recall    | 56.8 | 71.9 | 16.7 | 26.6
35%             | Precision | 59.6 | 83.4 | 47.3 | 84.8
35%             | Recall    | 61.2 | 72.6 | 19.6 | 27.2
Figure 7: Illustration of line detection evaluation criteria:
On the left, a True Positive (TP) where a predicted line’s
endpoints are within a set radius. On the right, a False Pos-
itive (FP) where any endpoint is outside this radius.
5.2 Precision
Precision is quantified by the ratio of true positive
detections to the total number of positive predictions
made by the model. The formula is as follows:
P = \frac{TP}{TP + FP} \qquad (7)
Here, TP represents true positives, corroborated
by the spatial validation method, and FP indicates
false positives, where predicted endpoints fall outside
the defined evaluation radius from any ground truth
line. High precision reflects the model’s ability to ac-
curately detect lines without triggering excessive false
alarms, crucial for reliable autonomous navigation.
5.3 Recall
Recall measures the model’s capacity to detect all per-
tinent line features present in the data:
R = \frac{TP}{TP + FN} \qquad (8)
FN denotes false negatives, where the model fails
to identify an actual line within the evaluation radius.
High recall signifies comprehensive detection cover-
age, vital for the completeness of navigational cues.
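Putting the scaled-radius criterion and the two metrics together, the sketch below evaluates one frame of predicted endpoints against ground truth; the greedy nearest-neighbour matching is our illustrative assumption.

```python
# Precision/recall of predicted line endpoints under the scaled-radius rule.
import numpy as np

def precision_recall(pred_pts, gt_pts, ratio=0.20):
    """pred_pts: (N, 2), gt_pts: (M, 2) endpoint coordinates in the ego frame (m)."""
    matched = np.zeros(len(gt_pts), dtype=bool)
    tp = 0
    for p in pred_pts:
        radius = ratio * np.linalg.norm(p)        # tolerance grows with distance
        dists = np.linalg.norm(gt_pts - p, axis=1)
        j = int(np.argmin(dists)) if len(gt_pts) else -1
        if j >= 0 and dists[j] <= radius and not matched[j]:
            matched[j] = True
            tp += 1
    fp, fn = len(pred_pts) - tp, len(gt_pts) - matched.sum()
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)
```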
This evaluation methodology, encompassing both
precision and recall augmented with spatial valida-
tion, provides a robust measure of our system’s per-
formance, ensuring that detected lines are precise and
consistent with real-world locations relative to the ego
vehicle.
6 DISCUSSION
The comparative analysis of line detection methods,
as illustrated in Table 1, reveals a promising direc-
tion for the proposed BEV FastLine and OFT + Sin-
gleShot approaches. Both methods demonstrate sig-
nificant precision rates, with BEV FastLine achieving
80.1% and OFT + SingleShot reaching 80.2%, which
is approximately 2X higher than their respective base-
lines of 45% and 28.8%. This high precision suggests
a reduced rate of false positives, a crucial factor in au-
tonomous navigation where inaccurate line detection
could lead to misguidance or safety hazards.
Table 1 underscores the overall 2X improve-
ment (averaged over all metrics) achieved by single-
stage direct regression of lines as polylines in BEV
space. This approach contrasts with methods that per-
form segmentation followed by line fitting as a post-
processing operation. These improvements are con-
sistent across both architecture styles (LSS, OFT).
Figure 6 demonstrates the limitations of segmentation
models in fully regressing line structures, even with
extensive post-processing.
When considering recall, the BEV FastLine
method demonstrates a superior ability to detect rele-
vant line features with a recall of 90%, compared to
the OFT + SingleShot method which stands at 25.6%.
The higher recall of the BEVFastLine method stems from LSS itself being superior to OFT: it is more effective at identifying the presence of lines,
a quality that could prove crucial in complex driving
scenarios where every line marking is significant for
vehicle guidance and decision-making processes.
The results indicate that while OFT provides a
framework for accurate line prediction when it de-
tects them, its overall line feature detection rate needs
enhancement. On the other hand, the BEV FastLine
demonstrates an improved balance between detecting
lines and maintaining a lower false positive rate.
These observations warrant further investigation
into the underlying factors contributing to the perfor-
mance disparities. It is conceivable that the intrinsic
characteristics of the LSS method, such as its focus
on learning from a comprehensive point cloud, may
afford it a broader detection capability. In contrast,
the OFT method’s dependence on transforming image
features to an orthographic view might limit its sensi-
tivity to certain types of line markings or variances in
environmental conditions.
7 CONCLUSIONS
In conclusion, our proposed methodologies intro-
duced in this study provide a significant advance-
ment in line landmark detection for autonomous park-
ing systems. The empirical results underscore the
precision of both methods, with the BEV FastLine
approach demonstrating a commendable balance be-
tween precision and recall. This balance is crucial for
real-world applications where accurate line detection
is instrumental in safe and reliable vehicle navigation.
The OFT + SingleShot method is also superior in precision when compared to the baseline OFT-based segmentation model. The current work lays the founda-
tion for future enhancements in detection rates and
suggests that a hybrid approach may yield a more op-
timal solution. Such improvements are vital for nav-
igating complex environments and ensuring compre-
hensive line detection coverage.
The Fast Splatting technique introduced in our
work requires 4X less computational time when compared to the standard cumsum operation and is also suitable for neural engines on embedded systems.
Finally, the work delineated in this paper signif-
icantly enriches the evolving domain of autonomous
vehicle technologies. By highlighting the strengths
and areas for development in line landmark detection,
it steers future efforts towards creating more sophisti-
cated and robust systems. These systems will be es-
sential in realizing the full potential of autonomous
vehicles, ensuring safety, efficiency, and reliability in
automated parking and beyond.
ACKNOWLEDGEMENTS
The authors thank Valeo Vision System for their sup-
port, resources, and the opportunity to contribute to
the broader research community with this work.
REFERENCES
Canny, J. (1986). A computational approach to edge de-
tection. IEEE Transactions on pattern analysis and
machine intelligence, (6):679–698.
Efrat, N., Bluvstein, M., Oron, S., Levi, D., Garnett, N., and
Shlomo, B. E. (2020). 3d-lanenet+: Anchor free lane
detection using a semi-local representation.
Gao, S., Wan, J., Ping, Y., Zhang, X., Dong, S., Yang, Y.,
Ning, H., Li, J., and Guo, Y. (2022). Pose refinement
with joint optimization of visual points and lines. In
2022 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS), pages 2888–2894.
IEEE.
Garnett, N., Cohen, R., Pe’er, T., Lahav, R., and Levi, D.
(2019). 3d-lanenet: End-to-end 3d multiple lane de-
tection.
Hough, P. V. (1962). Method and means for recognizing
complex patterns. US Patent 3,069,654.
Huang, K., Wang, Y., Zhou, Z., Ding, T., Gao, S., and Ma,
Y. (2018). Learning to parse wireframes in images of
man-made environments. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 626–635.
Huang, S., Qin, F., Xiong, P., Ding, N., He, Y., and Liu,
X. (2020). Tp-lsd: Tri-points based line segment de-
tector. In European Conference on Computer Vision,
pages 770–785. Springer.
Kumar, V. R., Eising, C., Witt, C., and Yogamani, S. (2023).
Surround-view fisheye camera perception for auto-
mated driving: Overview, survey & challenges. IEEE
Transactions on Intelligent Transportation Systems.
Kumar, V. R., Hiremath, S. A., Bach, M., Milz, S., Witt,
C., Pinard, C., Yogamani, S., and Mäder, P. (2020).
Fisheyedistancenet: Self-supervised scale-aware dis-
tance estimation using monocular fisheye camera for
autonomous driving. In 2020 IEEE international con-
ference on robotics and automation (ICRA), pages
574–581. IEEE.
Kumar, V. R., Yogamani, S., Rashed, H., Sitsu, G., Witt,
C., Leang, I., Milz, S., and M
¨
ader, P. (2021). Om-
nidet: Surround view cameras based multi-task visual
perception network for autonomous driving. IEEE
Robotics and Automation Letters, 6(2):2830–2837.
Lee, Y. and Park, M. (2021). Around-view-monitoring-
based automatic parking system using parking line de-
tection. Applied Sciences, 11(24):11905.
Li, H., Yu, H., Wang, J., Yang, W., Yu, L., and Scherer,
S. (2021). Ulsd: Unified line segment detection
across pinhole, fisheye, and spherical cameras. IS-
PRS Journal of Photogrammetry and Remote Sensing,
178:187–202.
Lin, Y., Pintea, S. L., and van Gemert, J. C. (2020). Deep
hough-transform line priors. In Computer Vision–
ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part XXII 16,
pages 323–340. Springer.
Philion, J. and Fidler, S. (2020). Lift, splat, shoot: Encod-
ing images from arbitrary camera rigs by implicitly
unprojecting to 3d. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–
28, 2020, Proceedings, Part XIV 16, pages 194–210.
Springer.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster
r-cnn: Towards real-time object detection with region
proposal networks.
Roddick, T., Kendall, A., and Cipolla, R. (2018). Ortho-
graphic feature transform for monocular 3d object de-
tection. arXiv preprint arXiv:1811.08188.
Tian, Z., Shen, C., Chen, H., and He, T. (2020). Fcos: A
simple and strong anchor-free object detector.
Wang, Z., Ren, W., and Qiu, Q. (2018). Lanenet: Real-
time lane detection networks for autonomous driving.
arXiv preprint arXiv:1807.01726.
Wen, T., Yang, D., Jiang, K., Yu, C., Lin, J., Wijaya, B.,
and Jiao, X. (2021). Bridging the gap of lane detec-
tion performance between different datasets: Unified
viewpoint transformation. IEEE Transactions on In-
telligent Transportation Systems, 22(10):6198–6207.
Xu, Y., Xu, W., Cheung, D., and Tu, Z. (2021). Line seg-
ment detection using transformers without edges. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4257–
4266.
Yahiaoui, M., Rashed, H., Mariotti, L., Sistu, G., Clancy,
I., Yahiaoui, L., Kumar, V. R., and Yogamani, S.
(2019). Fisheyemodnet: Moving object detection on
surround-view cameras for autonomous driving. arXiv
preprint arXiv:1908.11789.
Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Var-
ley, P., O’Dea, D., Uricar, M., Milz, S., Simon, M.,
Amende, K., Witt, C., Rashed, H., Chennupati, S.,
Nayak, S., Mansoor, S., Perroton, X., and Perez, P.
(2021). Woodscape: A multi-task, multi-camera fish-
eye dataset for autonomous driving.
Zakaria, N. J., Shapiai, M. I., Ghani, R. A., Yasin, M.,
Ibrahim, M. Z., and Wahid, N. (2023). Lane detection
in autonomous vehicles: A systematic review. IEEE
Access.
Zhou, X., Wang, D., and Krähenbühl, P. (2019). Objects as points.