Calibration-Accuracy Measurement in Railway Overlapping
Multi-Camera Systems
Martí Sánchez 1,2 a, Nerea Aranjuelo 1 b, Jon Ander Iñiguez de Gordoa 1 c, Pablo Alonso 1 d,
Mikel García 1,2 e, Marcos Nieto 1 f and Mikel Labayen 3 g
1 Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
2 University of the Basque Country (UPV/EHU), Donostia-San Sebastián, Spain
3 CAF Signalling, Donostia-San Sebastián, Spain
a https://orcid.org/0009-0007-9103-2364  b https://orcid.org/0000-0002-7853-6708  c https://orcid.org/0000-0002-9008-5620  d https://orcid.org/0000-0001-9650-0525
e https://orcid.org/0000-0002-3973-7267  f https://orcid.org/0000-0001-9879-0992  g https://orcid.org/0000-0001-8136-5324
Keywords:
Calibration-Accuracy Degree, Radial Grid, Railway Transportation, Multi-Camera Systems.
Abstract:
This paper presents a method for assessing calibration quality in overlapping multi-camera systems used in
railway transportation. We propose a novel approach that considers the extrinsic and intrinsic parameters of
the cameras and extracts features from their images, providing relevant patterns regarding the pose of the
cameras to detect camera calibration misalignment. Three feature extractors, including traditional image
processing techniques and deep learning approaches, are evaluated and compared. The extracted features are
used to provide a calibration quality metric, enabling real-time detection of camera calibration degradation.
Additionally, we introduce a radial grid design that weights the contribution of pixels based on their distance
from the camera’s optical center. The results demonstrate the effectiveness of our method in assessing the
calibration degree between camera pairs. The findings highlight the superior performance of the deep learning
approaches in analyzing the similarity degree between captured images. Overall, our method lays a solid
foundation for the development of an online camera calibration pipeline.
1 INTRODUCTION
The railway industry is increasingly adopting artifi-
cial intelligence to enhance traditional manual driving
systems, aiming to improve efficiency, reduce costs,
and upgrade the quality of service (Yin et al., 2017).
Intelligent transport systems (ITS) in this context rely
on a variety of sensors, including cameras, LiDARs,
and global navigation satellite systems (GNSS), to
collect and process data. Computer vision and ma-
chine learning algorithms are employed to process
this data, performing tasks such as obstacle detection,
track detection, and train positioning (Yanan et al.,
2018; He et al., 2021). Sensor calibration plays a
crucial role in ensuring the accurate functioning of
these algorithms and enabling effective data fusion
across different sensors (Heng et al., 2019). How-
ever, the calibration accuracy between sensors might
get compromised due to poor installation procedures,
degradation of mechanical components, or other fac-
tors like vibrations or temperature changes (Georgiev
et al., 2010). Inaccurate sensor calibration can re-
sult in inconsistent data and unreliable object detec-
tion or recognition, compromising the overall per-
formance and safety of the system, particularly in
railway driving environments that commonly employ
multi-sensor set-ups. Consequently, awareness of cal-
ibration quality is essential for taking appropriate ac-
tions, such as recalibration, and for avoiding reliance
on inaccurate data for automated driving.
In this paper, we present a method for assessing
calibration accuracy in overlapping multi-camera sys-
tems. Our method takes into account the cameras’ ex-
trinsic and intrinsic parameters and extracts features
from their respective images accordingly to provide a
calibration quality metric. Thus, our method enables
real-time detection of camera calibration degradation.
The contributions of this paper are:
• Introducing a calibration-accuracy measurement approach for overlapping multi-camera set-ups.
• Comparative analysis of traditional image processing techniques and deep learning approaches to identify the optimal feature extractor for the targeted task.
• Development of a radial grid design that weights the contribution of pixels to the similarity metric based on their distance from the camera's optical center.
2 RELATED WORK
In recent years, several methods have been proposed
to re-estimate the pose of out-of-calibration cam-
eras, providing promising results. These methods can
be broadly categorized as photometric-based or neu-
ral network-based approaches. The former methods
(Zhang et al., 2020; Liu et al., 2019; Zhang et al.,
2021) rely on minimizing the photometric discrep-
ancy between overlapping areas of adjacent cameras
extracted from features present in common scenarios
(such as lane marking corners). In contrast, a weakly-supervised framework is proposed in (Chen et al., 2022), which uses a deep neural network to minimize
geometric misalignment between cameras in common
regions. Weakly-supervised information from cali-
bration patterns is used to train the network, and fur-
ther fine-tuning is performed using photometric dis-
crepancy of overlapping regions between cameras.
However, the main drawback of both approaches is
that in low-textured environments, the photometric er-
ror is mainly determined by noise rather than an in-
accurate camera pose. It is important to note that
these approaches often neglect to evaluate the prac-
tical aspects of performing online camera calibration.
Specifically, it remains unclear how to decide when
recalibration is necessary. In contrast, (Sánchez et al., 2023) proposes a basic method to evaluate the
quality of adjacent cameras’ calibration by extracting
features from the bird’s-eye-view (BEV) projection of
common areas and analyzing their similarities. Most
of the existing methods related to online camera cali-
bration are designed specifically for the calibration of
cameras embedded in cars. However, the method pro-
posed in this paper takes a more generic approach that
can be extended to other domains, as demonstrated by
the experiments conducted on railway data. The pro-
posed method lays the foundation for a generic online
camera calibration pipeline applicable to multiple do-
mains.
3 METHODOLOGY
The set-up considered in this paper consists of two
pinhole cameras with different focal lengths and fields
of view (FOV) located at a given height above the
ground plane in the frontal part of a train. The dif-
ferent stages of the proposed method are depicted in
Fig. 1, where the images captured by each camera are
first transformed into a BEV perspective to analyze
their overlapping regions (detailed in Section 3.1).
This transformation to a BEV perspective is important
not only for analyzing overlapping regions but also for reducing dimensionality and uncertainty in the initial
image domain. Then, a radial grid is designed and
applied based on the intersection between a sphere,
which models the coordinate system of a camera, and
the ground plane (Section 3.2). Features are extracted
from each radial patch of both images (Section 3.3),
and finally, their similarity degree is calculated by an-
alyzing the extracted features from corresponding ra-
dial patches (Section 3.4).
3.1 Area of Overlap
To accurately determine the degree of similarity be-
tween two images, it is necessary to identify their area
of overlap (AOO). Given both cameras’ FOVs and
focal distances, their common regions can be deter-
mined and bounded. In our case, we project both im-
ages to the BEV representation by using the cameras’
extrinsic and intrinsic parameters and estimate their
overlapping region (see Fig. 2). Since the distance be-
tween the cameras’ optical center and objects cannot
be estimated directly, the ground projection becomes
a valuable resource for precisely assigning weights to
distances within the overlapping region.
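The paper does not give implementation details for this projection; the following is a minimal sketch, assuming OpenCV-style pinhole intrinsics K and world-to-camera extrinsics (R, t) with the world frame anchored on the ground plane (all function names are illustrative, not the authors' code):

```python
import numpy as np
import cv2

def ground_homography(K, R, t):
    """Homography mapping ground-plane points (X, Y, Z=0 in the world frame)
    to image pixels for a pinhole camera with intrinsics K and extrinsics (R, t)."""
    # For Z = 0, x ~ K (R[:,0] X + R[:,1] Y + t), i.e. H = K [r1 r2 t].
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def to_bev(image, K, R, t, x_range, y_range, px_per_m):
    """Warp a camera image onto a metric bird's-eye-view grid of the ground plane.
    x_range / y_range are (min, max) extents in metres; px_per_m sets the BEV resolution."""
    w = int((x_range[1] - x_range[0]) * px_per_m)
    h = int((y_range[1] - y_range[0]) * px_per_m)
    # Affine map from BEV pixel coordinates (u, v) to metric ground coordinates (X, Y).
    S = np.array([[1.0 / px_per_m, 0.0, x_range[0]],
                  [0.0, 1.0 / px_per_m, y_range[0]],
                  [0.0, 0.0, 1.0]])
    H_bev_to_img = ground_homography(K, R, t) @ S
    # warpPerspective maps source pixels to destination pixels, so invert.
    return cv2.warpPerspective(image, np.linalg.inv(H_bev_to_img), (w, h))
```

Projecting both images onto the same metric grid makes the AOO simply the region where both warped images contain valid pixels.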
3.2 Radial Grid Design
The BEV projection is a projective transformation
that provides an orthographic view of the captured
scene. This transformation is achieved by mapping
each pixel from the BEV domain to its correspond-
ing pixel in the source image. However, this process
inherently results in fewer correspondences for pixels
located far from the optical center of the camera. As
a consequence, pixels that do not have a direct corre-
spondence in the source image must be interpolated,
leading to a decay in the resolution details of the re-
sulting BEV image over distance. To address this is-
sue, it is important to consider that pixels in the BEV
image located close to the camera’s optical center in
the real world should have a higher confidence level
than those located far from the camera’s optical cen-
ter. Thus, a radial grid is proposed to weigh pixels’
contribution to the final similarity metric depending
on their distance to the camera. The design of this
radial grid is defined by modeling the camera as a
sphere.
Figure 1: The pipeline of the proposed method. Images from the cameras are first transformed into a bird’s-eye-view projec-
tion, then a radial grid is applied to discretize the overlapping areas between cameras’ FOVs and extract features from each
image. Finally, the similarity between the extracted features is calculated to evaluate the quality of the cameras’ calibration.
Figure 2: Diagram of the BEV projection module from the
main pipeline, where each image captured by each cam-
era is transformed into a BEV perspective. The AOO is
bounded by analyzing both FOVs, ensuring that the com-
mon regions seen by both cameras are considered.
Figure 3: Radial grid creation: camera’s coordinate system
is modeled as a sphere, where slices are determined by the
sphere-ground intersections.
3.2.1 Sphere-Plane Intersection
The camera’s coordinate system is modeled as a
sphere for the purpose of designing the radial grid.
The sphere is centered at the camera’s location and
its radius is increased in steps. As the sphere grows, it
intersects with the ground plane, and the points where
the sphere intersects with it are used to create the ra-
dial slices that form the grid. The size of those steps
is determined by the scene’s maximum distance pro-
jected to the BEV image and the desired number of
radial slices (see Fig. 3).
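Geometrically, a sphere of radius r centred at a camera mounted at height h above the ground intersects the plane in a circle of radius sqrt(r^2 - h^2) around the camera's ground projection. Below is a small sketch of the slice boundaries, under the assumption that the sphere radius grows in equal steps (helper names are hypothetical):

```python
import numpy as np

def radial_slice_radii(cam_height, max_ground_dist, n_slices):
    """Ground-plane radii of the circles where growing spheres centred at the camera
    cut the ground: a sphere of radius r at height h intersects it in a circle of
    radius sqrt(r^2 - h^2)."""
    max_r = np.hypot(max_ground_dist, cam_height)             # sphere radius reaching the far edge
    radii = np.linspace(cam_height, max_r, n_slices + 1)[1:]  # equal steps in sphere radius
    return np.sqrt(radii**2 - cam_height**2)

def slice_index(ground_dist, slice_radii):
    """Radial slice that a ground-plane distance falls into (clamped to the last slice)."""
    return int(min(np.searchsorted(slice_radii, ground_dist), len(slice_radii) - 1))
```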
3.2.2 Information Ratio Calculation
In the BEV image, the resolution of pixels that are
located far away from the camera’s center is lower
than those located closer to the center. To address
this issue, it is necessary to adjust the contribution of
each radial slice to the overall image based on how
far away it is from the center. In other words, slices
closer to the center should have a greater contribution
than those farther away. To achieve this, the informa-
tion ratio (IR) of each BEV pixel is calculated. This
metric quantifies the area of pixels in the source im-
age that are represented at a specific distance from
the camera within the BEV image, based on the cam-
era’s internal geometry. To calculate this ratio, the
four corners of a given pixel in the BEV image are
back-projected to the source image, providing the to-
tal number of pixels that are mapped and represented
at a certain distance in the BEV image. Calculating
this ratio for each radial slice makes it possible to de-
termine its relative contribution based on its average
distance to the camera’s optical center. This serves as
a regularization factor, ensuring that slices closer to
the camera’s optical center have a higher weight than
those located further away.
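A possible implementation of this ratio, assuming the BEV-to-image homography from Section 3.1 and a precomputed per-pixel map of radial-slice labels (both inputs and the helper names are assumptions for illustration):

```python
import numpy as np

def information_ratio(H_bev_to_img, bev_shape):
    """Per-pixel information ratio: the source-image area (in pixels) covered by each
    BEV pixel, obtained by back-projecting its four corners through the BEV->image
    homography and taking the area of the resulting quadrilateral (shoelace formula)."""
    h, w = bev_shape
    ir = np.zeros((h, w))
    for v in range(h):
        for u in range(w):
            corners = np.array([[u, v, 1], [u + 1, v, 1],
                                [u + 1, v + 1, 1], [u, v + 1, 1]], dtype=float).T
            p = H_bev_to_img @ corners          # project the four corners to the source image
            x, y = p[0] / p[2], p[1] / p[2]     # dehomogenise
            ir[v, u] = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return ir

def slice_weights(ir, slice_labels, n_slices):
    """Average IR per radial slice, normalised to [0, 1]; these act as the weights phi_i."""
    means = np.array([ir[slice_labels == i].mean() if np.any(slice_labels == i) else 0.0
                      for i in range(n_slices)])
    return means / means.max() if means.max() > 0 else means
```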
3.3 Feature Extraction
After designing and applying the radial grid to the
bounded regions of the BEV projection, the next step
is to extract features that encode the texture and pat-
terns present in each radial slice. Various feature ex-
traction techniques are used to identify and extract im-
portant information from images, which can be based
on traditional image processing techniques or deep
learning-based approaches. Hand-crafted feature ex-
traction algorithms identify and extract relevant features
from the image data. Deep learning-based approaches
use neural networks to automatically learn and extract
features from the data. Concerning deep learning-
based approaches, different features are encoded in
each layer of the network. Typically, the first few lay-
ers of a neural network encode low-level features such
as edges, lines, and simple shapes. These low-level
features provide fine-grained information to compare
the degree of similarity between images and detect
possible geometric misalignment, which is the phe-
nomenon we expect when the calibration of the sen-
sors is not accurate. A detailed comparison of the
strengths and weaknesses of both approaches is pro-
vided in Section 4.2.
3.4 Similarity Calculation
As shown in the pipeline in Fig. 1, the next step in-
volves comparing the features extracted from each ra-
dial slice for both images. We use the cosine sim-
ilarity to compare the extracted features encoded in
feature vectors, regardless of the method used to extract
features. This metric depends only on the angle be-
tween the two vectors, being 1 when they point in the same direction and 0 when they are orthogonal. Thus, two ra-
dial slices that contain the same features’ distribution
should yield a cosine similarity value close to 1.
We use the cosine similarity to assess the disparity
degree between two images captured from cameras
with a common area of overlap. In this context, their
similarity measurement (σ) is calculated by summing
the weighted cosine similarity degree for each slice by
its associated normalized IR value, as shown in (1). If
the features have been extracted using a deep neural
network, an additional summation must be performed
to average the calculated features of the M filters in
the selected layer, as shown in (2).

$$\sigma = \sum_{i=1}^{N} \rho_i \cdot \phi_i \qquad (1)$$

$$\sigma = \sum_{j=1}^{M} \sum_{i=1}^{N} \rho_{j,i} \cdot \phi_i \qquad (2)$$

In these equations, $\rho_i$ denotes the cosine similarity degree for slice $i$ (or $\rho_{j,i}$ for slice $i$ in filter $j$ in (2)), while $\phi_i$ represents the IR regularization term for slice $i$, normalized between 0 and 1.
Therefore, the calibration quality between a cam-
era pair is denoted by this similarity degree between
the two images.
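For concreteness, equations (1) and (2) could be implemented as follows; this is a sketch that assumes each slice's features are available as flat vectors (HOG case) or as one vector per filter and slice (DNN case), with names chosen for illustration:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two flattened feature vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def similarity_eq1(feats1, feats2, phi):
    """Eq. (1): IR-weighted sum of per-slice cosine similarities.
    feats1/feats2: lists of N per-slice feature vectors; phi: N normalised IR weights."""
    return sum(cosine(f1, f2) * w for f1, f2, w in zip(feats1, feats2, phi))

def similarity_eq2(maps1, maps2, phi):
    """Eq. (2): as (1), additionally summed over the M filters of the selected layer.
    maps1/maps2: arrays of shape (M, N, D) with one feature vector per filter and slice."""
    M, N, _ = maps1.shape
    return sum(cosine(maps1[j, i], maps2[j, i]) * phi[i]
               for j in range(M) for i in range(N))
```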
4 EXPERIMENTS AND RESULTS
We have conducted a series of experiments to evalu-
ate the effectiveness of the proposed method in var-
ious environments and determine its robustness and
ability to generalize. A dataset of out-of-calibration
cameras has been created including different weather
conditions such as sunny, foggy, and rainy in both ru-
ral and urban scenes. Further details on this dataset
can be found in Section 4.1. The experiments them-
selves are described in Section 4.2, while the obtained
results are presented in Section 4.3 along with a de-
tailed analysis of the method’s reliability.
4.1 Synthetic Dataset
We have created a synthetic dataset to evaluate the
proposed method. This dataset (Iñiguez de Gordoa
et al., 2023) is built from a railway virtual scenario
using the game development engine Unreal Engine 4.
To create it, two pinhole cameras are located in the frontal part of the train at a given height. Moreover, the cameras differ considerably in focal length, which makes the set-up challenging for the proposed method. Fig. 4 shows
the mentioned set-up from three different perspectives
(frontal view, side view, and top-down view).
Figure 4: Different points of view of a train with an onboard camera pair.
The dataset comprises a collection of images cap-
tured from two calibrated cameras that are closely po-
sitioned to each other. By adjusting the extrinsic pa-
rameters of the cameras, such as rotation and transla-
tion, various levels of decalibration are induced. It is
worth noting that the modifications are independent
along a single axis for both rotation and translation,
allowing for a thorough analysis of the method’s per-
formance under different conditions. These modifica-
tions produce scale changes, small translations, and
non-coplanar relationships between the transformed
points. Table 1 provides further details on the specific
ranges of rotation and translation applied to create the
dataset. In total, 1152 images have been created, based
on urban or rural environments, and different weather
conditions (sunny, foggy, and rainy).
Table 1: Translation (centimeters) and rotation (degrees)
axes modification ranges.
Modified ranges x-axis y-axis z-axis step
Translation [-5, 6] [-5, 6] [-5, 6] 1
Rotation [-3, 3] [-3, 3] [-3, 3] 0.25
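As an illustration of how such single-axis decalibration cases might be enumerated and applied, the sketch below follows the Table 1 ranges; whether the rotational offset is composed in the camera or world frame is an assumption, as are all names:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def decalibration_cases():
    """Single-axis perturbations matching Table 1: translations of -5..6 cm in 1 cm steps
    and rotations of -3..3 degrees in 0.25 degree steps, one axis at a time."""
    for axis in "xyz":
        for t_cm in np.arange(-5, 7, 1):
            yield axis, "translation", float(t_cm)
        for r_deg in np.arange(-3, 3.25, 0.25):
            yield axis, "rotation", float(r_deg)

def perturb_pose(R_cam, t_cam, axis, kind, delta):
    """Apply one perturbation to a camera pose (rotation matrix, translation in metres)."""
    if kind == "translation":
        t_new = t_cam.copy()
        t_new["xyz".index(axis)] += delta / 100.0      # centimetres -> metres
        return R_cam, t_new
    dR = R.from_euler(axis, delta, degrees=True).as_matrix()
    return dR @ R_cam, t_cam                           # assumed composition order
```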
4.2 Experiments
This section outlines the various experiments that
have been performed to determine if the proposed
method can distinguish between a calibrated set of
cameras and an out-of-calibration one. To achieve
this, a baseline case has been established using a cal-
ibrated camera set from the sunny rural scene. This
allows for measuring the similarity degree from the
baseline set and formulating a hypothesis: any sim-
ilarity degree measured within an out-of-calibration
set should provide a lower similarity degree than the
one obtained in the baseline case. The produced out-
of-calibration cases are created by fixing one cam-
era to help the analysis. In other words, one cam-
era retains its calibrated extrinsic configuration while
the other slightly modifies its pose based on the val-
ues shown in Table 1. To assess whether the pro-
posed method performs well regardless of which cam-
era falls out of calibration, experiments are conducted
in both ways: keeping the baseline extrinsic param-
eters for camera 1 and modifying camera 2’s pose,
and vice versa. The following steps involve compar-
ing the similarity degree between the baseline case
and the extrinsic pose modifications considered in the
dataset. Additionally, this section presents the differ-
ent feature extractors used to obtain feature vectors
from each radial slice for each image.
4.2.1 Scenes
Two different scenes (rural and urban) have been
considered under three distinct weather conditions
(sunny, foggy, and rainy). Fig. 5 shows a sample im-
age for each case.
Figure 5: Sample images from rural and urban scenes under
sunny, foggy, and rainy weather conditions for one camera.
4.2.2 Feature Extraction
As described in Section 3.3, the proposed method al-
lows for the application of multiple feature extractors
to get relevant patterns and features from each slice
of the radial grid. We explore the use of both traditional image-processing feature extraction methods and techniques based on deep learning.
Traditional Feature Extractor. Our aim is to ana-
lyze the visual properties of images that may be per-
ceived differently based on camera calibration, such
as edge and line orientations, to provide valuable in-
sights into their calibration status. It is crucial to use
a feature extractor robust to photometric variations
for this purpose.
Figure 6: Structural dissimilarities between a calibrated (left) and a decalibrated (right) BEV image using the HOG feature descriptor. The odd slices of the radial grid from the same camera are shown for comparison.
The histogram of oriented gradients (HOG) feature descriptor is an adequate choice since
it captures the orientation information of edges and
lines in images, making it an ideal tool for detecting
changes in camera calibration. Fig. 6 demonstrates
the suitability of the HOG feature descriptor for an-
alyzing the structural dissimilarities between two im-
ages. In this figure, the odd slices from the radial grid
of the same camera are displayed in both a calibrated
(left) and an out-of-calibration (right) case. The HOG
feature descriptor’s ability to capture edge and line
orientations in images makes it effective in detecting
changes in their geometric structure caused by differ-
ences in cameras’ calibrations.
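As an illustrative sketch, one way to obtain a per-slice HOG descriptor is to zero out pixels outside each radial slice before computing the descriptor on the grayscale BEV image; the masking strategy and the HOG hyperparameters below are assumptions rather than the authors' settings:

```python
import numpy as np
from skimage.feature import hog

def hog_slice_features(bev_gray, slice_masks):
    """One HOG descriptor per radial slice of a grayscale BEV image: pixels outside the
    slice are zeroed so the descriptor encodes only that slice's edge/line orientations."""
    feats = []
    for mask in slice_masks:                 # one boolean mask per radial slice
        patch = np.where(mask, bev_gray, 0)
        feats.append(hog(patch, orientations=9, pixels_per_cell=(16, 16),
                         cells_per_block=(2, 2), feature_vector=True))
    return feats
```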
Deep Learning-Based Approaches. In deep neural
networks (DNNs), the initial layers capture low-level
features such as corners, lines, and edges. There-
fore, these are the layers we are interested in. Two
widely-used DNN architectures are examined: the
ResNet-18 network (He et al., 2016), pre-trained on
the generalist ImageNet dataset (Deng et al., 2009),
and a modified version of the Yolo-P network (Wu
et al., 2022) trained specifically on railway environ-
ment data. The objective is to assess the perfor-
mance of both approaches in detecting calibration is-
sues and generalization capabilities. In Fig. 7, exam-
ples of feature maps from some of the filters of the
first layer in ResNet-18 (left) and Yolo-P (right) are
displayed. Particularly, in the Yolo-P case, lines and
edges are more pronounced compared to the ResNet-
18 case, where lower-level features are more promi-
nent. This difference can be attributed to ResNet-18
being pre-trained on a generalist dataset, while Yolo-
P is trained specifically for rail track detection. Ad-
ditionally, Yolo-P’s first layer is located within a fo-
cus block, which selectively enhances significant in-
formation while discarding irrelevant details based on
the domain-specific knowledge acquired during train-
ing.
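For the ResNet-18 branch, the first-layer responses can be taken directly from the pretrained torchvision model, as sketched below; the railway-specific Yolo-P variant is not publicly available, so only the ResNet-18 path is shown, and the pooling of these maps into per-slice vectors is left to the caller:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# ImageNet-pretrained ResNet-18; only its first convolution is needed, since its
# 64 filters respond to low-level structure (edges, lines, simple shapes).
model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
first_conv = model.conv1                     # Conv2d(3, 64, kernel_size=7, stride=2)

@torch.no_grad()
def first_layer_maps(bev_tensor):
    """bev_tensor: (1, 3, H, W) float image; returns (64, H/2, W/2) feature maps,
    one per filter, to be pooled per radial slice before the cosine comparison."""
    return first_conv(bev_tensor)[0]
```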
4.3 Results and Analysis
This section presents the most relevant results from
the previously explained experiments. We analyze
the suitability of our method based on the different feature extractors and their similarity threshold to determine if the analyzed camera pair has become out-of-calibration. We also analyze the performance of the proposed methodology and its ability to accurately detect changes in camera calibration.
Figure 7: Comparison of feature maps, output from the first layer of ResNet-18 (left) and Yolo-P (right).
4.3.1 Results
We evaluate the performance of HOG, ResNet-18,
and Yolo-P feature extractors in generating robust fea-
tures for comparing the similarity degree between im-
ages from two cameras within the same scene. Specif-
ically, we assess their performance in a sunny rural
scene while modifying the rotation axes of one cam-
era and keeping the other calibrated (Fig. 8, first col-
umn). Additionally, we analyze modifications to the
translation axes (Fig. 8, second column). We also
present a similar comparison for foggy and rainy en-
vironments in both rural and urban scenes, as depicted
in Fig. 9, focusing on the Yolo-P feature extractor.
4.3.2 Analysis
The effectiveness of the method is illustrated in Fig.
8 and Fig. 9. Regardless of the weather condi-
tions or scene environment, these figures show how the method accurately captures the decay in similarity when rotation or translation noise is introduced to one of the cameras from the calibrated
camera pair. Ideally, the decay in similarity should
follow a smooth, decreasing, and exponential trend.
Since the level of camera pair calibration quality is
directly related to the cameras’ application context,
the recalibration trigger of the method can be cus-
tomized. This trigger is represented by the similar-
ity threshold value, which can be determined empiri-
cally by observing the behaviour of the method on the
dataset. An in-depth analysis of the feature extrac-
tors employed in our work is performed to identify
the most suitable one that aligns with the mentioned
decay requirements. Thus, we need to examine two
key aspects: the decay trend of the similarity degree
when extrinsic modifications are introduced to one
of the camera pairs, and the deviation of each mea-
sured similarity from the theoretical ideal decay line.
Figure 8: Similarity degree comparison of feature extrac-
tors in a sunny rural scene. The first column displays ro-
tation axes modifications, while the second column repre-
sents translation modifications. Rows correspond to feature
extractors: HOG, ResNet-18, and Yolo-P.
To establish a baseline, an ideal decay line is used,
which must meet the previously mentioned require-
ments: smoothness, a decrease from the baseline cal-
ibrated similarity, and an exponential decay pattern.
Fig. 10 illustrates two examples from an urban scene,
captured under foggy (left) and sunny (right) condi-
tions. The similarity values in these examples have
been calculated using the HOG and Yolo-P feature
extractors for a roll and a translation in the z-axis,
respectively. Features extracted from the ResNet-
18 approach present similar behaviours. The orange
lines represent the ideal decay trends, while the green-
dashed lines correspond to the similarity threshold
values empirically set. Samples exceeding the simi-
larity threshold are considered part of the calibrated
set and are depicted in green, while red samples in-
dicate potential decalibrated cases. To assess the de-
cay trend for each feature extractor, it is important to
ensure a consistent decrease in the measured similar-
ities; otherwise, penalization is applied. Furthermore,
we compute and evaluate the distance between each
sample and the ideal decay line, aiming for minimal
deviation.
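A small sketch of these two checks, monotonic decay and mean deviation from the ideal curve, follows; the exponential reference and its rate are illustrative assumptions, since the paper sets the threshold and ideal line empirically:

```python
import numpy as np

def ideal_decay(sigma_baseline, n_steps, rate=0.5):
    """Smooth exponential reference curve starting at the calibrated baseline similarity."""
    return sigma_baseline * np.exp(-rate * np.arange(n_steps))

def decay_metrics(similarities, ideal):
    """similarities: measured sigma values ordered by increasing decalibration magnitude;
    ideal: the reference decay sampled at the same points. Returns whether the measured
    decay is monotonically non-increasing and its mean absolute deviation from the ideal."""
    s = np.asarray(similarities, dtype=float)
    monotonic = bool(np.all(np.diff(s) <= 0))
    deviation = float(np.mean(np.abs(s - np.asarray(ideal, dtype=float))))
    return monotonic, deviation
```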
Figure 9: Similarity degrees for foggy urban (top row) and rural rainy (bottom row) cases using the Yolo-P feature extractor. The left column represents modifications in rotation axes, while the right column represents variations in translation.
Figure 10: Comparison of foggy (left) and sunny (right) conditions in an urban scene using HOG and Yolo-P feature extractors for roll modification and z-axis translation, respectively.
Table 2 presents the percentage at which each feature extractor provides similarity values that align with the explained criteria for similarity decay. Also, it indicates the percentage of times a specific extractor
provides features whose similarity values are closer
to the ideal decay. From the table it becomes evident
that Yolo-P consistently demonstrates the highest per-
centage of occurrences in both cases, making it the
most suitable feature extractor within the context of
the method presented in this paper. In contrast, HOG
performs poorly in comparison to the other extractors.
In general, the similarity decay associated with HOG
tends to be slow, with occasional fluctuations that may
appear as side lobes. These observations suggest that
HOG may not be the most effective choice for the pur-
pose of this study.
Table 2: Extractor performance summary based on similar-
ity decay.
HOG ResNet-18 Yolo-P
Trend % 0 42.5 42.5
Distance % 0 16.7 83.3
It can also be observed that detecting translation
modifications poses a greater challenge than rotation
modifications for the three extractors. Specifically,
calibration issues are less noticeable when they come
from translations along the x and y-axes within the
camera pair. This is because these transformations
do not introduce non-coplanar relationships between
the projected points in the images. However, modi-
fications applied to the z-axis of either camera in the
pair result in scale changes in the BEV image. These
scale changes lead to more noticeable geometric mis-
alignment in the analyzed pair, clearly indicating a de-
crease in calibration accuracy between the cameras.
5 CONCLUSIONS
Monitoring calibration accuracy for a camera set em-
bedded in a sensor set-up is crucial to ensure its re-
liable use. In this paper, we propose a method to
measure the calibration accuracy between a pair of
cameras. Our method takes as input the extrinsic and
intrinsic parameters of the cameras, along with cap-
tures of the scene observed by both cameras. Each
image is then projected into a BEV perspective, and
a radial grid is applied to weigh the contribution of
pixels based on their distance from the camera’s opti-
cal center. Low-level features are extracted from each
slice of the transformed images, and the disparity be-
tween corresponding slices is measured. These dis-
parity measurements determine the calibration accu-
racy of the camera system.
The results and subsequent analysis demonstrate
that our proposed method effectively provides the cal-
ibration accuracy of the camera pair. We analyze
the suitability of different feature extractors for our
method and conclude that a DNN (Yolo-P) trained on
domain-specific data yields the most relevant features
for analyzing the similarity degree between the cap-
tured images. Additionally, based on its extracted
features, the calculated similarity decay exhibits a
smooth, decreasing trend that follows an exponential
shape.
Overall, the satisfactory results achieved by our
method lay a solid foundation for developing a com-
prehensive online camera calibration pipeline, as our
method can be transformed into an optimization prob-
lem. As part of our future work, we aim to extend
the capabilities of our method by incorporating au-
tomatic camera pose correction for out-of-calibration
cameras.
ACKNOWLEDGEMENTS
This work has received funding from the Basque Government under project AutoTrust of the program
ELKARTEK-2023.
REFERENCES
Chen, Y., Zhang, L., Shen, Y., Zhao, B. N., and Zhou, Y.
(2022). Extrinsic self-calibration of the surround-view
system: A weakly supervised approach. IEEE Trans-
actions on Multimedia.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
IEEE.
Georgiev, M., Boev, A., Gotchev, A., and Hannuksela, M.
(2010). Influence of camera parameters on the qual-
ity of mobile 3d capture. In Multimedia on Mobile
Devices 2010, volume 7542, pages 246–257. SPIE.
He, D., Zou, Z., Chen, Y., Liu, B., Yao, X., and Shan, S.
(2021). Obstacle detection of rail transit based on deep
learning. Measurement, 176:109241.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Heng, L., Choi, B., Cui, Z., Geppert, M., Hu, S., Kuan,
B., Liu, P., Nguyen, R., Yeo, Y. C., Geiger, A., et al.
(2019). Project autovision: Localization and 3d scene
perception for an autonomous vehicle with a multi-
camera system. In 2019 International Conference on
Robotics and Automation (ICRA), pages 4695–4702.
IEEE.
Iñiguez de Gordoa, J. A., García, S., Pérez, L., Urbieta, I., Aranjuelo, N., Nieto, M., and Ochoa, D. (2023). Scenario-based validation of automated train systems using a 3d virtual railway environment. In press.
Liu, X., Zhang, L., Shen, Y., Zhang, S., and Zhao, S. (2019).
Online camera pose optimization for the surround-
view system. In Proceedings of the 27th ACM Inter-
national Conference on Multimedia, pages 383–391.
Sánchez, M., Iñiguez de Gordoa, J. A., Nieto, M., and Carballeira, P. (2023). Method for the automatic measurement of camera-calibration quality in a surround-view system. In Fifteenth International Conference on Machine Vision (ICMV 2022), volume 12701, pages 428–436. SPIE.
Wu, D., Liao, M.-W., Zhang, W.-T., Wang, X.-G., Bai, X.,
Cheng, W.-Q., and Liu, W.-Y. (2022). Yolop: You
only look once for panoptic driving perception. Ma-
chine Intelligence Research, 19(6):550–562.
Yanan, S., Hui, Z., Li, L., and Hang, Z. (2018). Rail surface
defect detection method based on yolov3 deep learn-
ing networks. In 2018 chinese automation congress
(CAC), pages 1563–1568. IEEE.
Yin, J., Tang, T., Yang, L., Xun, J., Huang, Y., and Gao,
Z. (2017). Research and development of automatic
train operation for railway transportation systems: A
survey. Transportation Research Part C: Emerging
Technologies, 85:548–572.
Zhang, T., Zhang, L., Shen, Y., Ma, Y., Zhao, S., and Zhou,
Y. (2020). Oecs: Towards online extrinsics correction
for the surround-view system. In 2020 IEEE Interna-
tional Conference on Multimedia and Expo (ICME),
pages 1–6. IEEE.
Zhang, T., Zhao, N., Shen, Y., Shao, X., Zhang, L., and
Zhou, Y. (2021). Roecs: A robust semi-direct pipeline
towards online extrinsics correction of the surround-
view system. In Proceedings of the 29th ACM Interna-
tional Conference on Multimedia, pages 3153–3161.