Synthetic Data-Driven Object Detection for Rail Transport: YOLO vs RT-DETR in Train Loading Operations

Thiago Leonardo Maria [1][a], Saul Delabrida [1,2][b] and Andrea Gomes Campos [1,2][c]

[1] Graduate Program in Instrumentation, Control and Automation of Mining Processes (PROFICAM), Federal University of Ouro Preto (UFOP) and Vale Institute of Technology (ITV), Minas Gerais, Brazil

[2] Department of Computing (DECOM), Federal University of Ouro Preto (UFOP), Ouro Preto, Brazil

Keywords:

Wagon, Object Detection, Synthetic Data.

Abstract:

Efﬁcient wagon loading plays a crucial role in logistic efﬁciency and supplying essential raw materials to

various industries. However, ensuring the cleanliness of the wagons before loading is a critical aspect of this

process as it directly impacts the quality and integrity of the transported item. Early detection of objects inside

empty wagons before loading is a key component in this logistic puzzle. This study proposes a computer

vision approach for object detection in train wagons before loading and performs a comparison between two

models: YOLO (You Only Look Once) and RT-DETR (Real-Time Detection Transformer), which are based on

Convolutional Neural Networks (CNNs) and Transformers, respectively. Additionally, the research addresses

the generation of synthetic data as a strategy for model training, using the Unity platform to create virtual

environments that simulate real conditions of wagon loading. Therefore, the ﬁndings highlight the potential of

combining computer vision and synthetic data to improve the safety, efﬁciency, and automation of train loading

processes, offering valuable insights into the application of advanced vision models in industrial scenarios.

1 INTRODUCTION

Computer vision has emerged as a powerful tool in the

industry, transforming the way object inspection and

detection are carried out in industrial environments,

all due to the exponential advancement of technol-

ogy (Zhang et al., 2014). Combining image process-

ing algorithms and artiﬁcial intelligence techniques,

computer vision enables systems to interpret and un-

derstand the visual environment, enabling a variety

of applications in industrial automation (Sonka et al.,

2014).

Autonomy in industrial inspection is fundamental to guarantee the efficiency, quality, and safety of production processes. Traditionally, inspection is carried out manually, which is susceptible to environmental influences, often leading to human errors (ElMasry et al., 2012), in addition to requiring significant time and resources, and its accuracy is not guaranteed (Park et al., 1996). However, advances in computer vision have enabled automated inspection systems capable of analyzing and interpreting images quickly and accurately. These systems can identify defects, anomalies, and objects of interest in real time, all without requiring physical contact with the environment (Aydin et al., 2014).

[a] https://orcid.org/0009-0002-8815-4989

[b] https://orcid.org/0000-0002-8961-5313

[c] https://orcid.org/0000-0001-7949-1188

Object detection in industry is a key application of

computer vision, with diverse applications including

product tracking, automated selection, quality con-

trol, and workplace safety (Megahed and Camelio,

2012). The ability to detect and recognize objects

in complex and dynamic industrial environments is

essential to ensure process efﬁciency and prevent ac-

cidents and contamination. As a result, foreign ob-

ject detection makes industrial processes more efﬁ-

cient and safer.

Rail transport plays an essential role in the global

supply chain, providing a viable and sustainable alter-

native to road transport. In the United States and the

European Union, the average volume of rail freight

transported between 2006 and 2019 was approxi-

mately 2.8 trillion tonne-kilometers (Gholamizadeh

et al., 2024). Statistics show that expanding rail net-

works and using them efﬁciently are crucial to meet-

ing global logistics demand and supporting sustain-

able economic and environmental practices.

Maria, T. L., Delabrida, S. and Campos, A. G.

Synthetic Data-Driven Object Detection for Rail Transport: YOLO vs RT-DETR in Train Loading Operations.

DOI: 10.5220/0013402500003929

In Proceedings of the 27th International Conference on Enterprise Information Systems (ICEIS 2025) - Volume 1, pages 873-880

ISBN: 978-989-758-749-8; ISSN: 2184-4992


Railroads often pass through low-income residential areas. On single-track sections that carry traffic in both directions, a train must wait until its route is clear. Trains that are not operating are therefore often stopped at intersections and in areas adjacent to cities. It is not uncommon for waste to be improperly discarded into the wagons, posing challenges to operational efficiency and cargo integrity.

Early detection of objects inside these empty wag-

ons before loading is a critical necessity to ensure op-

erational efﬁciency and safety of the logistics process.

The presence of unwanted objects can compromise

not only the logistical organization but also the in-

tegrity of the transported item.

In the mining industry, trains must typically maintain a speed of approximately 10 km/h when entering a loading terminal to perform the tare process. However, when the foreign object detection process is carried out visually by the operator, this speed is reduced in some terminals. Consequently, the tare time increases, which delays the start of loading.

The relevance of keeping wagons clean before

loading goes beyond logistical organization. Clean-

ing plays a crucial role in preserving the quality of the

transported item, especially in the mining industry,

preventing unwanted contamination that could com-

promise the integrity of the ﬁnal product. Efﬁcient

object detection inside wagons not only optimizes the

loading process but also ensures the quality of the ore

from its origin to its ﬁnal destination, affecting both

its value and reliability.

It is often difficult to obtain real data for training object detection models. The obstacles include dangerous or hard-to-access environments, rare or highly specific scenarios, applications involving private or sensitive data, the cost of data collection, and ethics and consent requirements. These difficulties motivate the use of synthetic data. One way to obtain such data is to generate it with AI. However, this method has drawbacks of its own: imperfect realism, high computational cost, possible bias in the generated data, problems with dynamic scenarios, and difficulty in generating extreme cases. An alternative is to create a realistic virtual environment that simulates the situation from which the data is needed.

The main objective of this study is to develop a

computer vision-based approach for detecting objects

in train wagons before loading operations.

The speciﬁc objectives include:

• Address the generation of synthetic data as a strat-

egy for training the models, using the Unity plat-

form to create virtual environments that faithfully

simulate the real conditions of wagons.

• Compare two object detection models, the widely recognized YOLO (You Only Look Once) v10 and RT-DETR (Real-Time Detection Transformer), based on convolutional neural networks (CNNs) and Transformers, respectively, highlighting their particularities and advantages for the detection of objects specifically in train cars.

• Contribute to the efﬁciency and safety of the

wagon loading process, exploring the capabilities

of computer vision, the comparison between es-

tablished models, and the generation of realistic

synthetic data.

By comparing two well-known computer vision models trained on realistic synthetic data generated by 3D simulation, this paper aims to determine which object detection system is best suited to train loading, thereby increasing efficiency and safety by assisting the loading operator in decision-making. Progress in this area not only improves operational practices but also encourages a smarter and more technologically advanced approach to rail transportation and logistics.

2 RELATED WORKS

The use of computer vision and artificial intelligence can solve many practical problems, including those currently faced in the industrial environment. This section examines several ways in which these technologies can be used to detect foreign objects and anomalies in the industrial environment.

(Yang et al., 2014) present a real-time inspection system for conveyor belts using computer vision. The objective of the system is to find and classify unwanted objects that may impair the efficiency or safety of the conveyor process in industrial production lines. The proposed system analyzes the images captured by cameras installed along the conveyor belt and, using a combination of machine learning and image processing methods, classifies the objects to make decisions. The experimental results show the effectiveness of the proposed system in detecting and classifying objects in real time, highlighting its potential to automate and improve inspection processes in industrial production lines.

(Yao et al., 2023) present an improved method for detecting foreign objects on industrial conveyor belts based on a modified YOLOX model. This improves industrial inspection systems by ensuring the safety and efficiency of transportation processes. However, such a system often needs adjustments for different belt types, materials, or defect patterns, which limits generalization and requires specific settings for each application environment.

In the railway domain, (Chen W, 2022) describes a system for detecting foreign objects in images using a two-stage convolutional neural network. The authors propose an approach to identify and classify unwanted objects in railway track images, aiming to improve the safety and efficiency of railway operations. The system is trained on a large dataset of railway images, allowing it to learn to detect and classify foreign objects with high accuracy.

(Hütten et al., 2022) explore the application of the Transformer architecture in industrial visual inspection, trained on three different use cases in the domain of visual damage assessment for railway freight car maintenance. The paper explains how the Transformer approach was modified and used for visual inspection tasks, and discusses the advantages and disadvantages of the Vision Transformer (ViT) approach compared to traditional convolutional neural network (CNN) architectures. The experimental results show that the Vision Transformer architecture can be competitive in several industrial visual inspection tasks, matching or outperforming conventional architectures.

(Salas et al., 2020) present an approach to train object detection and segmentation models on images of industrial machinery. Rather than relying solely on real-world datasets, the authors propose using computer-generated synthetic images to train and improve the performance of models. The paper describes the process of synthetic image generation, which involves creating 3D models of industrial machines and objects of interest, rendering these models in different virtual environments, and capturing images from these environments. After generating the synthetic images, the authors conduct experiments comparing the performance of models trained with synthetic data against models trained only with real data.

Most of the articles presented use CNNs for object detection. (Yang et al., 2014) and (Yao et al., 2023) mention some limitations of this type of approach on conveyor belts under adverse environmental conditions such as dust, variable lighting, and complex background noise. These factors can affect the detection accuracy and robustness of the model, especially in scenarios with poor visibility or partially obscured objects. (Hütten et al., 2022) validate the use of the Transformer architecture, reporting that it performs as well as or better than traditional architectures in industrial visual inspection. This highlights the potential of the Transformer approach to boost automation and efficiency in industrial environments through advanced visual inspection. (Salas et al., 2020) demonstrate that training with synthetic data, in addition to real data, can lead to significant improvements in the accuracy and robustness of the models, especially in challenging conditions that were not adequately covered by real data.

3 METHODOLOGY

The methodology of this work consists of the following steps: 3D modeling and texturing, scene creation in Unity, synthetic data generation, training YOLO and RT-DETR detection networks, and comparing results.

Figure 1: Steps of the proposed methodology.

3.1 3D Modeling and Texturing

3D models are necessary for the creation of the simulated 3D virtual environment. The models used in this work were acquired from online marketplaces: blendswap.com, free3d.com, assetstore.unity.com and mixamo.com. Most of the objects were obtained for free, and those that could not be found for free were purchased. Models that could not be found in any marketplace are modeled and textured in Blender, a free modeling tool.

The objects will be divided into two categories:

main objects and foreign objects. The main objects

will be all the objects that make up the scene, such

as the train car, the tracks, vegetation, and terrain.

The foreign objects will be the unwanted objects that

will appear inside the cars. The foreign objects will

be divided into two categories: inanimate objects and

living objects: animals and humans. A total of 200

objects and variations will be selected, with 100 inan-

imate objects, 50 humans, and 50 animals.

Inanimate objects and animals will be chosen according to how likely each type of object is to appear inside the wagons, and the number of objects is chosen to provide sufficient variability in the generated synthetic data. The human category will have 10 different humans in 5 different poses.

Figure 2: Example of objects, animal and human category

respectively.

3.2 Scene Creation in Unity

Unity is a game engine widely used for simulations

and creation of interactive virtual environments.

The construction of the simulated scene in Unity is a crucial step. It consists of assembling a virtual environment that simulates train loading with reasonable realism, using the acquired and modeled objects and incorporating several techniques to increase the fidelity of the simulation, such as:

• Using ray tracing to enhance visual realism, al-

lowing for a more accurate representation of light-

object interaction in the scene (Skala et al., 2024).

• Development of a dynamic lighting system, capa-

ble of adjusting to the variable conditions of the

simulated environment, providing realistic varia-

tions in luminosity.

• Implementation of a rain system to add climate

variations to the simulation.

• Addition of wagons representing mineral waste to

incorporate realistic conditions.

• Implementation of a random arrangement of ob-

jects inside the wagons to increase the variability

of the synthetic data.

The Unity tool has three rendering pipelines, one

of which is the High Deﬁnition Render Pipeline

(HDRP), which prioritizes graphic quality using sev-

eral tools. One of these tools is ray tracing, a tech-

nology that simulates the behavior of light in the real

world to create more realistic and immersive graphics.

The use of this pipeline also facilitates the creation

of a climate system with rain cycles and has many

post-processing tools that allow us to add effects to

the camera to improve the visuals.

Unity uses C# for scripting, and these scripts allow the manipulation of almost everything within the tool. With these scripts, the entire dynamic system of the simulated scene will be developed, including the movement of the trains, the day and night cycle, climate control, and the random appearance of objects inside the wagons, among other tasks.

3.3 Synthetic Data Generation

From the virtual loading environment previously cre-

ated in Unity, with all simulations and techniques to

increase realism, some additional conﬁgurations will

be made to generate synthetic data. This process will

be conducted following the steps:

• Careful selection of camera settings, including

resolution, aspect ratio, FOV, and position/tilt,

aligned with the usual speciﬁcations in the indus-

try.

• Conﬁgure post-processing effects such as bloom,

ambient occlusion, motion blur and vignette.

• Creation of a script to generate tag ﬁles to la-

bel objects in the simulated scene, facilitating the

training of detection models.

• Generation of images to compose the training and

validation data set, considering a comprehensive

representation of the object detection conditions.

A set of 6600 images will be generated, 5280 for training and 1320 for validation, with 10% background images in each subset. These images will cover different settings: day, night, day with rain, and night with rain. This number of images was chosen based on the recommendations in the Ultralytics documentation.

3.4 Training YOLO and RT-DETR Models

The training phase of the YOLO and RT-DETR detection networks involves preparing the generated synthetic data. This step will be performed using training and testing sets from the Unity simulation of the different experiments, with the appropriate object annotations. For this training, the RT-DETR-l and RT-DETR-x models will be used, which are the large and extra-large models, respectively. In addition, the YOLO models to be used are YOLOv10l and YOLOv10x, which are comparable in size to the RT-DETR models used. These models were chosen according to the availability of models provided by Ultralytics.

The training will be performed on a physical ma-

chine equipped with a 12-core AMD Ryzen 9 5900X

processor, an NVIDIA GeForce RTX3090 graphics

card with 24GB of VRAM and 64GB of DDR4 RAM

at 3200MHz.

3.5 Comparing Results

The ﬁnal phase of the research involves a comparison

between the YOLO and RT-DETR networks consid-

ering the different experiments.

The mAP (mean Average Precision) metric will be used. It is commonly used to evaluate the accuracy and performance of object detection algorithms in computer vision. It is especially relevant in object detection tasks where multiple objects can be present in a single image and it is important to evaluate the detection accuracy for each object class.

The mAP50 and mAP50-95 metrics will be used to evaluate the object detection models at different accuracy thresholds. mAP50 measures model performance under a relaxed intersection-over-union (IoU) criterion, counting a detection as correct when its IoU with the ground truth is at least 0.50. It is useful for evaluating whether the model can detect objects at approximate positions, even if the bounding box is not highly accurate. mAP50-95 is a more rigorous criterion: it averages the precision over IoU thresholds from 0.50 to 0.95, so it rewards bounding boxes that are closely aligned with the objects.
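To make the role of the IoU threshold concrete, the following minimal sketch computes the IoU of one predicted box against one ground-truth box and checks it against a relaxed threshold (0.50) and a strict one (0.90). The box coordinates and the helper names `iou` and `is_true_positive` are illustrative assumptions, not values or code from this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold):
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


# Hypothetical boxes: the prediction is shifted 10 pixels from the ground truth.
truth = (100, 100, 200, 200)
pred = (110, 110, 210, 210)
overlap = iou(pred, truth)  # 8100 / 11900, roughly 0.68

print(f"IoU = {overlap:.3f}")
print("correct at 0.50:", is_true_positive(pred, truth, 0.50))
print("correct at 0.90:", is_true_positive(pred, truth, 0.90))
```

The same detection is accepted under the relaxed criterion and rejected under the strict one, which is why a model can score well on mAP50 while scoring lower on stricter metrics.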

4 RESULTS

During the course of the project, several versions of the synthetic data generator were developed. The first was a proof of concept: a model of moving wagons into which simple objects such as cubes and spheres were randomly inserted, replicating the conditions found in a railway environment. These objects were later replaced by items found inside wagons, increasing the fidelity of the simulation.

To control the movements of the wagons and the

items that appear inside them, a dedicated script was

developed. This script allows the variation of parame-

ters such as the speed of the wagons and the chance of

a wagon having an item inside. In addition, the strate-

gic addition of a camera positioned above the wagons

allowed for the accurate capture of images, essential

for object detection and subsequent analysis.

As an integral part of the simulation, a dynamic

day-night cycle system was implemented, contribut-

ing to diversifying the lighting conditions and increas-

ing the realism of the scene. This approach enriches

the simulated environment, making it more complex

and closer to the real conditions encountered during

loading operations.

Figure 3: Proof of concept scene in Unity.

4.1 Acquisition of 3D Objects

The next step was to obtain the 3D objects. Scene composition objects such as train tracks, trees, and terrain were modeled and textured in Blender. The train car, one of the main models and one requiring more detail, was acquired from a 3D object marketplace and then given a custom texture in Blender.

The models of inanimate objects, animals, and humans were mostly acquired for free from various marketplaces; all models have a CC0 license.

All models went through an adaptation process

to be used in Unity, which consisted of reducing the

mesh size for objects with a very high polygon count

and retexturing for some objects.

4.2 Conﬁguration Script

After collecting the 3D models, the development of

the synthetic data generator began. Based on the

proof of concept, new 3D models were added, cre-

ating a more realistic virtual environment.

A script called “GameManager” was created to manage and configure all the settings for the synthetic data generator. This script has multiple settings that allow for the customization of the loading environment, the objects that appear randomly inside the wagon, and the dynamic weather and climate conditions. Additionally, the script automates the creation of the folder structure needed to separate and save the generated data.

Figure 4: Part of the configuration script.

4.3 Positioning of Objects

The position of each object inserted into the wagon is chosen randomly from eight positions along the wagon, and the rotation of the object about the vertical axis is also defined randomly. Objects are placed slightly above the wagon, and the Unity physics engine simulates the item falling into the wagon. This process allows objects to rotate and settle in various orientations, greatly increasing the diversity of their positions within the wagon.

The number of objects inside the wagon ranges

from 1 to 3. The ﬁrst object will always be inserted,

the second object has a 50% chance of appearing, and

the third object has a 25% chance, ensuring variability

in the dataset while maintaining a realistic scenario.
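The placement rule described above can be sketched as follows. This is only an illustration in Python, not the Unity C# script used in this work. It assumes that the 50% and 25% chances are independent draws and that two objects never share the same position; neither detail is stated in the text. The names `NUM_SLOTS` and `sample_wagon_contents` are hypothetical.

```python
import random

NUM_SLOTS = 8  # eight candidate positions along the wagon (from the text)


def sample_wagon_contents(rng):
    """Return a list of (slot_index, yaw_degrees) pairs for one wagon."""
    count = 1                    # the first object is always inserted
    if rng.random() < 0.50:      # second object: 50% chance
        count += 1
    if rng.random() < 0.25:      # third object: 25% chance (assumed independent)
        count += 1
    # Assumption: each object gets its own slot, so slots are drawn without reuse.
    slots = rng.sample(range(NUM_SLOTS), count)
    # Random rotation about the vertical axis for each object.
    return [(slot, rng.uniform(0.0, 360.0)) for slot in slots]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
wagons = [sample_wagon_contents(rng) for _ in range(10_000)]
mean_count = sum(len(w) for w in wagons) / len(wagons)
print(f"mean objects per wagon: {mean_count:.2f}")
```

Under the independence assumption, the expected number of objects per wagon is 1 + 0.50 + 0.25 = 1.75, which the simulated mean should approach.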

4.4 Day and Night and Rain Cycles

The day and night cycle implemented in the proof

of concept uses Unity’s lighting systems to conﬁgure

light emission, activate dynamic shadows, and rotate

sun objects.

In addition to the day and night cycle, a rain cy-

cle was implemented. This rain cycle uses Unity’s

particle system to simulate rain and has a random in-

tensity of rain, fog, and lighting every time the rain

starts. Furthermore, textures for the wagon, environ-

ment, and objects are dynamically modiﬁed to appear

wet when the rain cycle starts.

To enhance the realism of the simulated environ-

ment, Ray Tracing and other ﬁlters such as Bloom,

Screen Space Ambient Occlusion, Screen Space Re-

ﬂection and several others were used. These ﬁlters

are commonly used in games with realistic graphics,

which helps maintain ﬁdelity to real environments.

The combination of these ﬁlters with day and night

and rain cycles helps to diversify the synthetic data

generated.

Figure 5: The cycles in the simulation.

4.5 3D Object Separation

In addition to the division into objects, humans, and animals, another general division of the 3D objects is made. They are randomly divided into five groups using a seed to enable repeatability. This division is necessary so that the training data and validation data contain different 3D models, promoting better generalization of the trained models. Four of these five groups will be used for training and the last for validation.
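A seeded split of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming placeholder model names and an arbitrary seed value; the actual grouping logic lives in the Unity project and may differ, for instance by splitting each category separately.

```python
import random


def split_into_groups(model_names, num_groups=5, seed=0):
    """Shuffle the models with a fixed seed and deal them into equal groups."""
    shuffled = sorted(model_names)         # sort first so input order does not matter
    random.Random(seed).shuffle(shuffled)  # the seed makes the shuffle repeatable
    return [shuffled[i::num_groups] for i in range(num_groups)]


# Placeholder names standing in for the 200 objects and variations.
models = [f"model_{i:03d}" for i in range(200)]
groups = split_into_groups(models, seed=1234)

train_models = [m for group in groups[:4] for m in group]  # four groups for training
val_models = groups[4]                                     # last group for validation

print(len(train_models), len(val_models))
```

Because no 3D model appears in both sets, validation scores reflect how well the detector handles shapes it has not seen during training.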

4.6 Label File and Synthetic Data Generation

One of the advantages of using synthetic data is the

ability to generate labels automatically. In Unity we

used the list of vertex positions of the mesh of the ob-

jects inside the wagon. This list of vertex positions

is collected and converted to the camera space. By

ﬁnding the maximum and minimum horizontally and

vertically, the bounding box of the object can be de-

termined. After that, it is necessary to apply some

limiting ﬁlters so that the values do not exceed the

screen limits and do not fall behind the wagon wall.

With the label data collected, all that remains is to

convert them to the label system that will be used. In

this case, the YOLO system will be used.
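The conversion from projected vertices to a YOLO label line can be sketched as below. This is an illustrative Python version, not the Unity script from this work. It assumes the vertices have already been projected to pixel coordinates with the origin at the top-left corner of the image, and it only applies the screen-limit filter; the wagon-wall filter is omitted. The function name `yolo_label` and the sample points are hypothetical.

```python
def yolo_label(class_id, screen_points, img_w, img_h):
    """Build one YOLO label line from projected vertices in pixel coordinates."""
    # Clamp every vertex to the image so the box never exceeds the screen limits.
    xs = [min(max(x, 0.0), img_w) for x, _ in screen_points]
    ys = [min(max(y, 0.0), img_h) for _, y in screen_points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # YOLO format: class, box center and box size, all normalized to [0, 1].
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical projected vertices; one of them falls outside the left edge.
points = [(270.0, 540.0), (540.0, 810.0), (405.0, 600.0), (-30.0, 700.0)]
line = yolo_label(0, points, 1080, 1080)
print(line)  # 0 0.250000 0.625000 0.500000 0.250000
```

Each image then gets a text file with one such line per visible object.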

To generate the synthetic data, the generator checks whether there are objects within the part of the camera's field of view that lies inside the wagon. If there are, an image is generated and the label for that image is created. The image and the label are then saved in their respective folders. The generator then waits for a period of time before the next iteration, and the process is repeated until the defined amount of data is reached.


Figure 6: Example of data generated.

A total of 5,280 training images were generated, 480 of which are background-only images, along with 1,320 validation images, 120 of which are background-only images. The images have a resolution of 1080x1080 pixels and were generated in approximately 6 hours. Figure 6 shows an example with the bounding boxes already drawn on the generated data.
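As a quick check of the dataset composition, the short calculation below uses only the image counts reported above. It suggests that the background images amount to 10% of the images that contain objects (about 9% of each subset as a whole), and that the training subset is 80% of the data. Reading the 10% figure this way is an interpretation of the counts, not something stated explicitly.

```python
# (total images, background-only images) per subset, as reported in the text.
splits = {"train": (5280, 480), "val": (1320, 120)}

summary = {}
for name, (total, background) in splits.items():
    with_objects = total - background
    summary[name] = {
        "with_objects": with_objects,
        "bg_over_objects": background / with_objects,  # background vs. object images
        "bg_over_total": background / total,           # background vs. whole subset
    }

all_images = sum(total for total, _ in splits.values())
train_fraction = splits["train"][0] / all_images

for name, stats in summary.items():
    print(name, stats)
print(f"training share of all images: {train_fraction:.0%}")
```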

4.7 Model Training

Before the main model training, preliminary training runs were performed to define the training image resolution and the ideal number of epochs for this dataset. Four training sessions of 50 epochs each were performed at resolutions of 1080p, 640p, 480p, and 384p. The results are shown in Table 1.

Table 1: Resolution pre-training result.

Resolution   mAP50   mAP50-95   Train Time (h)
1080p        0.811   0.573      2.28
640p         0.809   0.561      0.85
480p         0.800   0.551      0.53
384p         0.765   0.523      0.38

After analyzing the results, the training resolution was set to 480p. Higher resolutions brought little improvement in mAP50 (0.800 at 480p versus 0.811 at 1080p), while training at 480p was considerably faster (0.53 h versus 2.28 h). Inference time is also reduced at lower resolution. These tests were performed for both YOLO and RT-DETR, obtaining comparable results.
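The trade-off behind this choice can be made explicit with a small calculation over the values in Table 1. The sketch below compares each resolution against 1080p in terms of mAP50 lost and training time saved; the dictionary layout and variable names are illustrative.

```python
# resolution -> (mAP50, training time in hours), taken from Table 1.
results = {
    1080: (0.811, 2.28),
    640: (0.809, 0.85),
    480: (0.800, 0.53),
    384: (0.765, 0.38),
}

ref_map, ref_time = results[1080]
tradeoff = {}
for resolution, (map50, hours) in results.items():
    tradeoff[resolution] = {
        "map50_drop": ref_map - map50,  # absolute mAP50 lost versus 1080p
        "speedup": ref_time / hours,    # how many times faster training is
    }

for resolution, stats in tradeoff.items():
    print(f"{resolution}p: drop={stats['map50_drop']:.3f}, "
          f"speedup={stats['speedup']:.2f}x")
```

At 480p the model gives up about 0.011 mAP50 while training roughly 4.3 times faster than at 1080p, whereas dropping further to 384p costs about 0.046 mAP50.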

Another preliminary training session was performed to define the number of epochs for training the models. Training was run without an epoch limit and with a patience of 50, meaning that training ends after 50 epochs without any improvement. The training finished after 246 epochs. Based on the loss and mAP graphs in Figure 7, a limit of 250 epochs was defined for training.

Figure 7: Epoch pre-training result.
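The patience rule can be illustrated with a minimal early-stopping loop. This is a generic sketch of the idea, not the Ultralytics implementation, and the fitness curve is a toy example with a small patience value rather than data from this work.

```python
def train_with_patience(fitness_per_epoch, patience=50):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch fitness values."""
    best, best_epoch = float("-inf"), 0
    for epoch, fitness in enumerate(fitness_per_epoch, start=1):
        if fitness > best:
            best, best_epoch = fitness, epoch  # new best: reset the waiting period
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch           # no improvement for `patience` epochs
    return len(fitness_per_epoch), best_epoch  # ran out of epochs before stopping


# Toy curve: fitness improves for 30 epochs and then stays flat.
toy_curve = [min(epoch, 30) / 100 for epoch in range(1, 101)]
stop_epoch, best_epoch = train_with_patience(toy_curve, patience=10)
print(stop_epoch, best_epoch)  # 40 30
```

With a patience of 10, the toy run stops 10 epochs after its last improvement; the same logic with a patience of 50 governs the run described above.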

After the training configurations were defined, the generated synthetic data was separated: the fifth folder was selected for validation and the other four folders were selected for training. Subsequently, the four models, YOLOv10l, YOLOv10x, RT-DETR-l and RT-DETR-x, were trained using the models provided by the Ultralytics library. Each training session took an average of 6 hours to complete. The results are presented in Table 2 and Table 3.
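A training setup along these lines might look like the sketch below, which uses the `YOLO` and `RTDETR` classes of the Ultralytics Python package with the resolution and epoch limit chosen above. The dataset file name `wagon_dataset.yaml` is hypothetical, and starting from the published checkpoint files is an assumption; the exact arguments used in this work are not given in the text.

```python
# Model name -> (Ultralytics class name, checkpoint file).
RUNS = {
    "YOLOv10l": ("YOLO", "yolov10l.pt"),
    "YOLOv10x": ("YOLO", "yolov10x.pt"),
    "RT-DETR-l": ("RTDETR", "rtdetr-l.pt"),
    "RT-DETR-x": ("RTDETR", "rtdetr-x.pt"),
}

# Settings chosen in the preliminary runs: 480-pixel images, 250 epochs.
# "wagon_dataset.yaml" is a hypothetical dataset description file.
TRAIN_ARGS = {"data": "wagon_dataset.yaml", "imgsz": 480, "epochs": 250}


def train_all():
    """Train the four models one after another (requires the ultralytics package)."""
    import ultralytics  # imported here so the configuration can be inspected without it

    for run_name, (class_name, weights) in RUNS.items():
        model_class = getattr(ultralytics, class_name)  # ultralytics.YOLO or ultralytics.RTDETR
        model = model_class(weights)
        model.train(name=run_name, **TRAIN_ARGS)


# Call train_all() to start training; it is not invoked here because each run takes hours.
```

Keeping the shared settings in one dictionary ensures that all four models are trained under the same conditions, which is what makes the comparison fair.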

Table 2: mAP50 values for classes and models.

Model       All     Obj     Animal  Human
YOLOv10l    0.902   0.890   0.832   0.981
YOLOv10x    0.909   0.890   0.855   0.982
RT-DETR-l   0.817   0.776   0.717   0.956
RT-DETR-x   0.800   0.747   0.702   0.950

Table 3: mAP50-95 values for classes and models.

Model       All     Obj     Animal  Human
YOLOv10l    0.727   0.773   0.593   0.814
YOLOv10x    0.729   0.765   0.612   0.811
RT-DETR-l   0.620   0.637   0.470   0.755
RT-DETR-x   0.589   0.595   0.435   0.736

Tables 2 and 3 compare the YOLOv10l, YOLOv10x, RT-DETR-l, and RT-DETR-x detection models on two performance metrics, mAP50 (Table 2) and mAP50-95 (Table 3). On mAP50, YOLOv10x performed best overall with 0.909, narrowly ahead of YOLOv10l with 0.902. The RT-DETR models have lower values, with RT-DETR-l achieving 0.817 and RT-DETR-x 0.800.

On the mAP50-95 metric, YOLOv10x continues to lead with 0.729 in overall performance, closely followed by YOLOv10l with 0.727. Although YOLOv10l and YOLOv10x maintain high and similar performance, the RT-DETR models score considerably lower on mAP50-95 (0.620 and 0.589), indicating weaker performance across the multiple accuracy thresholds that this metric covers.
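The size of the gap between the two model families can be read directly from the "All" columns of Tables 2 and 3. The short calculation below pairs each YOLO model with the RT-DETR model of the same size; the pairing by size is an assumption made for illustration.

```python
# Overall ("All") scores from Table 2 and Table 3.
map50 = {"YOLOv10l": 0.902, "YOLOv10x": 0.909, "RT-DETR-l": 0.817, "RT-DETR-x": 0.800}
map50_95 = {"YOLOv10l": 0.727, "YOLOv10x": 0.729, "RT-DETR-l": 0.620, "RT-DETR-x": 0.589}

gaps = {}
for metric_name, table in (("mAP50", map50), ("mAP50-95", map50_95)):
    for size in ("l", "x"):
        # Difference between the YOLO and RT-DETR models of the same size.
        gap = table[f"YOLOv10{size}"] - table[f"RT-DETR-{size}"]
        gaps[(metric_name, size)] = round(gap, 3)

for key, gap in gaps.items():
    print(key, gap)
```

The differences range from about 0.085 to 0.140, and in both metrics the gap is wider for the extra-large models than for the large ones.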

5 CONCLUSIONS

This work developed a synthetic data generator within a fully virtual environment created using the Unity tool. The environment reproduces the inspection of wagons before train loading and was used to evaluate two computer vision models: YOLO, a well-known and more traditional model based on convolutional neural networks, and RT-DETR (Real-Time Detection Transformer), a real-time detector based on Transformers.

From the training results of the four models, two from YOLO and two from RT-DETR of comparable sizes, it was observed that, for this environment and in the context of train loading, YOLO outperforms RT-DETR by approximately 10 percentage points of mAP. Additionally, no significant improvement in performance was seen with the larger model sizes of either YOLO or RT-DETR, indicating that increasing the model size did not yield better results for this specific task.

In a real-world scenario, the insights from this study show that a YOLO network can be applied in the area of train loading to improve automation and monitoring processes in loading operations. The ability of computer vision models to detect and analyze loading conditions in real time can significantly improve efficiency and become a great ally for the operator, reducing human error and increasing safety.

Grounds for future work include expanding the

analysis to other vision networks, comparing perfor-

mance on different hardware, and comparing the use

of synthetic data, real data, and mixed data in training

the networks.

ACKNOWLEDGEMENTS

The authors thank the entire team of the Master’s Program in Instrumentation, Control and Automation of Mining Processes (PROFICAM), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), Vale Technological Institute (ITV) and Federal University of Ouro Preto (UFOP). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPQ) financing code 306101/2021-1, FAPEMIG financing code APQ-00890-23 and APQ-01306-22, the Instituto Tecnológico Vale (ITV) and the Universidade Federal de Ouro Preto (UFOP).

REFERENCES

Aydin, I., Karakose, M., and Akin, E. (2014). A new contactless fault diagnosis approach for pantograph-catenary system using pattern recognition and image processing methods. Advances in Electrical and Computer Engineering, 14(3):79–89.

Chen W, Meng S, J. Y. (2022). Foreign object detection in railway images based on an efficient two-stage convolutional neural network. Comput Intell Neurosci.

ElMasry, G., Cubero, S., Moltó, E., and Blasco, J. (2012). In-line sorting of irregular potatoes by using automated computer-based machine vision system. Journal of Food Engineering, 112(1):60–68.

Gholamizadeh, K., Zarei, E., and Yazdi, M. (2024). Railway Transport and Its Role in the Supply Chains: Overview, Concerns, and Future Direction, pages 769–796. Springer International Publishing, Cham.

Hütten, N., Meyes, R., and Meisen, T. (2022). Vision transformer in industrial visual inspection. Applied Sciences, 12(23).

Megahed, F. and Camelio, J. (2012). Real-time fault detection in manufacturing environments using face recognition techniques. Journal of Intelligent Manufacturing, 23:1–16.

Park, B., Chen, Y.-R., Nguyen, M., and Hwang, H. (1996). Characterizing multispectral images of tumorous, bruised, skin-torn, and wholesome poultry carcasses. Transactions of the ASABE, 39:1933–1941.

Salas, A. J. C., Meza-Lovon, G., Fernández, M. E. L., and Raposo, A. (2020). Training with synthetic images for object detection and segmentation in real machinery images. IEEE.

Skala, T., Maričević, M., and Čule, N. (2024). Optimization and application of ray tracing algorithms to enhance user experience through real-time rendering in virtual reality. Acta Graphica, 32(2):108–129.

Sonka, M., Hlavac, V., and Boyle, R. (2014). Image Processing, Analysis, and Machine Vision. Cengage Learning.

Yang, Y., Miao, C., Li, X., and Mei, X. (2014). On-line conveyor belts inspection based on machine vision. Optik, 125(19):5803–5807.

Yao, R., Qi, P., Hua, D., Zhang, X., Lu, H., and Liu, X. (2023). A foreign object detection method for belt conveyors based on an improved yolox model. Technologies, 11(5).

Zhang, B., Huang, W., Li, J., Zhao, C., Fan, S., Wu, J., and Liu, C. (2014). Principles, developments and applications of computer vision for external quality inspection of fruits and vegetables: A review. Food Research International, 62:326–343.
