3.2.3  Data Augmentation 
To improve the dataset's quality, data augmentation
techniques such as flipping, rotation, and scaling were
applied to increase the diversity of the dataset.
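The three augmentation operations named above can be illustrated with a minimal sketch. It assumes an image is stored as a plain list of pixel rows; the function names, the 90-degree rotation angle, and the integer scale factor are illustrative choices, not details taken from this work.

```python
# Minimal sketch: an image is a list of rows, each row a list of pixel values.
# All names and parameter choices below are illustrative assumptions.

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise rotation.
    return [list(column) for column in zip(*image[::-1])]

def scale_nearest(image, factor):
    """Enlarge the image by an integer factor using nearest-neighbour repetition."""
    scaled = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled

def augment(image):
    """Return the original image plus one flipped, one rotated, and one scaled variant."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        scale_nearest(image, 2),
    ]

sample = [[1, 2],
          [3, 4]]
variants = augment(sample)
```

Each source image yields several variants that keep the same damage label, which is how augmentation increases diversity without new labeling effort.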
3.2.4  Data Splitting 
Data splitting is essential for detecting and limiting
overfitting, which occurs when a model is too closely
tailored to the training data. The model needs to be
trained to recognize and classify the different types of
damage accurately, such as dents, scratches, and
cracks, and to differentiate between different levels of
severity. This is a complex task that requires a large
and diverse dataset, which must be split into
appropriate subsets for training, validation, and testing.
The  training  subset  is  the  largest  of  the  three 
subsets.  It  is  used  to  train  the  model  to  recognize 
patterns  and  features  in  the data  that  correspond  to 
different types and levels of damage.  
The validation subset is used to tune the model's
hyperparameters, such as the learning rate, batch size,
and number of epochs. Hyperparameters control how
the model learns from the training data and can
significantly affect its performance. Tuning them on
data held out from training helps the model
generalize better to new data.
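As a hedged illustration of how a validation set can drive hyperparameter selection, the sketch below tries every combination in a small grid and keeps the one with the highest validation score. The grid values and the `toy_validation_score` function are hypothetical placeholders: the candidate values are not reported here, and a real scoring function would train the model on the training subset and measure accuracy on the validation subset.

```python
from itertools import product

def select_hyperparameters(grid, validation_score):
    """Return the configuration in `grid` with the highest validation score."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical candidate values, chosen only for illustration.
grid = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [16, 32],
    "epochs": [10, 20],
}

def toy_validation_score(config):
    # Placeholder for "train on the training subset, score on the validation subset".
    # This toy function simply peaks at learning_rate=0.001, batch_size=32, epochs=20.
    return (
        -abs(config["learning_rate"] - 0.001)
        - abs(config["batch_size"] - 32) / 100
        - abs(config["epochs"] - 20) / 100
    )

best_config, best_score = select_hyperparameters(grid, toy_validation_score)
```

Because the testing subset plays no part in this loop, it remains available for an unbiased final evaluation.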
The testing subset is used to evaluate the final
model's  performance.  It  is  kept  separate  from  the 
training  and  validation  sets  and  is  used  to  simulate 
how the model will perform on new, unseen data. The 
performance on the testing set provides an unbiased 
estimate of how  well  the model will perform  in the 
real world. 
The dataset comprises 1631 images of vehicle
damage with corresponding labels indicating the type
of damage (e.g., scratches, dents, cracks). This
dataset is randomly divided into training, validation,
and testing subsets with a 70-15-15 split: 70% of the
images are used for training, 15% for validation, and
15% for testing.
Table 2 shows the resulting split:
Table 2: Training and testing results. 
DATASET   NUMBER OF IMAGES   PERCENTAGE 
Training Set   1141   70%  
Validation Set   245   15%  
Testing Set   245   15%  
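The subset sizes in Table 2 can be reproduced with a short sketch of a random 70-15-15 split. The random seed and the rounding rule (truncate the training share, then divide the remainder evenly) are assumptions made for illustration; the exact procedure used here is not stated.

```python
import random

def split_dataset(items, train_frac=0.70, seed=0):
    """Shuffle `items` and split them into training, validation, and testing subsets.

    The training share is truncated to a whole number of items, and the
    remainder is divided evenly between validation and testing.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumed seed)
    n_train = int(len(shuffled) * train_frac)
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

image_ids = list(range(1, 1632))  # stand-ins for the 1631 images
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 1141 245 245
```

Shuffling before slicing keeps the three subsets disjoint while giving every image the same chance of landing in each one.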
After splitting the dataset, the training set was used to
train the model, and the validation set was used to
adjust the model's hyperparameters. Once the model's
performance was optimized, the testing set was used
to evaluate its accuracy.
3.2.5  Data Encoding 
Data encoding is necessary to transform the
categorical labels of vehicle damage types into
numerical values that machine learning algorithms can
understand.
The dataset consists of images of damaged vehicles
with corresponding labels indicating the type of damage.
The labels include categories such as "Scratch",
"Dent", "Crack", "Tear", "Chip", "Glass Damage",
"Spider Crack", "Large range glass damage",
"Miscellaneous damage", and "Broken Windows". To
use this data with machine learning algorithms, these
categorical labels must be encoded as numerical
values.
One standard data encoding method, used here, is
one-hot encoding, where each category is assigned a
unique index and represented as a binary vector whose
length equals the number of categories, with a 1 at
that index and 0 everywhere else.
Table 3 shows a sample of the 1631 labeled images
and the corresponding encoded labels using one-hot
encoding:
Table  3:  Sample  of  the  dataset  and  the  corresponding 
encoded labels using one-hot encoding. 
IMAGE   LABEL   ENCODED LABEL  
Image 1   Scratch   [1, 0, 0, 0, 0, 0, 0, 0, 0,0] 
Image 2   Dent   [0, 1, 0, 0, 0, 0, 0, 0, 0,0] 
Image 3   Crack   [0, 0, 1, 0, 0, 0, 0, 0, 0,0] 
Image 4   Broken Window   [0, 0, 0, 1, 0, 0, 0, 0, 0,0] 
Image 5   Tear   [0, 0, 0, 0, 1, 0, 0, 0, 0,0] 
Image 6   Chip   [0, 0, 0, 0, 0, 1, 0, 0, 0,0] 
Image 7   Spider Crack   [0, 0, 0, 0, 0, 0, 1, 0, 0,0] 
Image 8   Miscellaneous Damage  [0, 0, 0, 0, 0, 0, 0, 1, 0,0] 
Image 9   Large Range Glass Damage   [0, 0, 0, 0, 0, 0, 0, 0, 1,0] 
Image 10   Metal Damage   [0, 0, 0, 0, 0, 0, 0, 0, 0,1] 
…   …   …  
Image 1627   Scratch   [1, 0, 0, 0, 0, 0, 0, 0, 0,0] 
Image 1628   Scratch   [0, 1, 0, 0, 0, 0, 0, 0, 0,0] 
Image 1629   Crack   [0, 0, 1, 0, 0, 0, 0, 0, 0,0] 
Image 1630   Broken Window   [0, 0, 0, 1, 0, 0, 0, 0, 0,0] 
Image 1631   Scratch   [0, 0, 0, 0, 1, 0, 0, 0, 0,0] 
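The mapping from label to vector in Table 3 can be sketched as follows. The class order is taken from the first ten rows of Table 3. The names `CLASSES`, `one_hot`, and `decode` are illustrative, and this is only one possible way to implement the encoding.

```python
# Class order follows the first ten rows of Table 3.
CLASSES = [
    "Scratch",
    "Dent",
    "Crack",
    "Broken Window",
    "Tear",
    "Chip",
    "Spider Crack",
    "Miscellaneous Damage",
    "Large Range Glass Damage",
    "Metal Damage",
]

def one_hot(label, classes=CLASSES):
    """Return a binary vector with a single 1 at the index of `label`."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1  # raises ValueError for an unknown label
    return vector

def decode(vector, classes=CLASSES):
    """Recover the label from a one-hot vector."""
    return classes[vector.index(1)]

print(one_hot("Crack"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Each vector has exactly one nonzero entry, so the encoding imposes no ordering among the damage categories.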
IoTBDS 2024 - 9th International Conference on Internet of Things, Big Data and Security