concerning  ML  modelling  in  phishing  website 
classification  can  be  categorized  by  the  type  of 
features  included  in  their  datasets.  Gupta  et  al. 
provided  a  parsimonious  URL-based  classification 
model of 9 variables extracted from URL text using 
Random Forest (RF), with an excellent performance 
of 99.57% accuracy (Gupta et al., 2021). Karim et al. 
designed a hybrid ML model combining Logistic
Regression (LR), Support Vector Machine (SVM)
and Decision Tree (DT) methods, which outperformed
other traditional ML methods with an accuracy of
95.23%, precision of 95.15%, recall of 96.38%,
specificity of 93.77%, and F1-score of 95.77%
(Karim et al., 2023).
Ahammad  et  al.  compared  the  performance  of  ML 
models  such  as  DT,  RF  and  Gradient  Boosting 
Machines  (GBM).  Their  GBM  model  achieved  a 
training accuracy of 0.895 and a testing accuracy of 
0.860, while the RF approach scored 0.883 and 0.853,
respectively, and DT scored 0.850 (Ahammad et al., 2022). In
addition, Almomani et al. collected semantic features 
including  URL  structure,  HTML,  JavaScript 
behaviors, and WHOIS metadata, and evaluated
16 classifiers. The top accuracy scores were
around 97%, achieved by Gradient Boosting
(GB) and RF (Almomani et al., 2022). Wei and
Sekiya introduced deep learning methods into the
modelling and found that models using ensemble
ML techniques outperformed the others on their
datasets, even with reduced feature sets
(Wei and Sekiya, 2022). Najjar-Ghabel et al.
compared six ML models using a dataset of 47
features (content, behavior, domain info) and
found that the RF model performed best, with 96.7%
accuracy and a high F1-score (Najjar-Ghabel et al.,
2024).
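For reference, the metrics quoted in the studies above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch, assuming phishing is treated as the positive class; the function name and the example counts are illustrative and are not taken from any of the cited studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives (positive class = phishing).
    """
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, chosen only to illustrate the calculation.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting specificity alongside recall, as Karim et al. do, shows how a model treats legitimate sites as well as phishing ones.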
Therefore, the objective of this paper is to
evaluate the performance of DT, RF, and GB
methods in classifying phishing websites.
The models are compared in terms of predictive
accuracy and robustness across training and
testing sets. The findings contribute to the
development  of  more  accurate  ML-based  phishing 
detection methods and their future application to web 
security tools. 
2  METHODS 
2.1   Data Source 
This study utilizes a dataset from the Mendeley Data 
website,  named  Web  Page  Phishing  Detection  and 
published on 25 June 2021. The original dataset was
built by Hannousse and Yahiouche based on their
proposed guidelines. It contains 11430 observations
and 89 variables, including raw URLs, status labels,
and 87 extracted features (Hannousse and Yahiouche,
2021). Features are sorted into three categories: URL-
based features, features based on page contents, and 
features extracted via external services. The original 
dataset is in .csv format. 
2.2  Data Preprocessing 
Due to the large size of the original dataset, only
5000 observations (2500 legitimate, 2500 phishing)
selected by stratified sampling are used in this study.
To ensure the repeatability of the selection,
the random seed is set to 42; the same seed is
used for the later feature selection and model tuning.
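The sampling step can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming each observation is a dictionary with a label field; the field names, the toy data and the helper name are hypothetical, and the study does not state which tooling was actually used.

```python
import random


def stratified_sample(rows, label_key, per_class, seed=42):
    """Draw the same number of rows from every class, reproducibly."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    sample = []
    for label in sorted(by_class):  # sorted for a deterministic class order
        sample.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(sample)  # avoid leaving the rows grouped by class
    return sample


# Toy stand-in for the full dataset (hypothetical values).
rows = [
    {"url_length": i, "status": "phishing" if i % 2 else "legitimate"}
    for i in range(200)
]
subset = stratified_sample(rows, "status", per_class=25)
print(len(subset))
```

With the real data, `per_class` would be 2500, giving the 5000-observation sample described above.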
Several data preprocessing steps are then
applied to the extracted sample. The statuses of
websites are encoded (0 for legitimate, 1 for phishing).
Features that have constant values or are no longer
relevant (such as the raw URL and the web traffic
feature based on the now-defunct Alexa ranking
service) are removed. For the external features
concerning the domain registration length and the
domain age, erroneous values are found, represented
by negative values (-1/-2). To handle them properly,
the two features are first manually sorted into four
categories: "error" (value < 0), "zero" (value = 0),
"small_positive" (0 < value ≤ 365, namely within a
year), and "large_positive" (value > 365, namely
longer than a year). Afterwards, these external
features are encoded by the one-hot encoding (OHE)
method.
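The binning and one-hot encoding of the two domain features can be illustrated with a short sketch. The thresholds follow the four categories described above; the column names and the example row are assumptions made for illustration only.

```python
CATEGORIES = ("error", "zero", "small_positive", "large_positive")


def bin_days(value):
    """Map a day count to one of the four categories described in the text."""
    if value < 0:
        return "error"           # erroneous values such as -1 or -2
    if value == 0:
        return "zero"
    if value <= 365:
        return "small_positive"  # within a year
    return "large_positive"      # longer than a year


def one_hot(feature_name, value):
    """Return one indicator column per category for a single feature."""
    category = bin_days(value)
    return {f"{feature_name}_{c}": int(c == category) for c in CATEGORIES}


# Hypothetical row; the column names are assumed, not taken from the paper.
row = {"domain_registration_length": -1, "domain_age": 400}
encoded = {}
for name, value in row.items():
    encoded.update(one_hot(name, value))
print(encoded)
```

Treating "error" as its own category keeps the affected rows in the sample rather than discarding them or imputing a value.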
2.3   Feature Selection 
After  cleaning  and  encoding  the  data,  feature 
selection  is  performed  via  a  model-based  approach. 
Three individual classifiers based on the DT, RF and
GB techniques are trained on the preprocessed
dataset to extract feature importances.
Afterwards, the 25 most informative features from each
model are identified, and the final subset of selected
features is taken as the intersection of the three
selections, retaining only features that are
consistently important for all three models.
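The intersection step can be sketched as below, assuming the importance scores of each fitted model are already available as a mapping from feature name to score. The feature names and scores are made up for illustration, and a small k is used so the toy example stays readable; the study uses k = 25.

```python
def top_k(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def consensus_features(importance_by_model, k=25):
    """Keep only features that appear in every model's top-k list."""
    selections = [top_k(scores, k) for scores in importance_by_model.values()]
    return set.intersection(*selections)


# Hypothetical importance scores for three models.
importance_by_model = {
    "DT": {"feat_a": 0.50, "feat_b": 0.30, "feat_c": 0.15, "feat_d": 0.05},
    "RF": {"feat_a": 0.40, "feat_c": 0.30, "feat_b": 0.20, "feat_d": 0.10},
    "GB": {"feat_c": 0.45, "feat_a": 0.35, "feat_d": 0.15, "feat_b": 0.05},
}
selected = consensus_features(importance_by_model, k=2)
print(sorted(selected))
```

Because the intersection can only shrink the selection, the final subset may contain fewer than k features.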
As a result, the final dataset used in this study
contains 5000 observations and 18 variables (17
features and a label). The names and explanations of
the features, grouped into the three categories, are
shown in Table 1.