concerning ML modelling in phishing website
classification can be categorized by the type of
features included in their datasets. Gupta et al.
provided a parsimonious URL-based classification
model of 9 variables extracted from URL text using
Random Forest (RF), with an excellent performance
of 99.57% accuracy (Gupta et al., 2021). Karim et al.
designed a hybrid ML model combining Logistic
Regression (LR), Support Vector Machine (SVM)
and Decision Tree (DT) methods, which outperforms
other traditional ML methods (accuracy of 95.23%,
precision of 95.15%, recall of 96.38%, specificity of
93.77%, and F1-score of 95.77%) (Karim et al., 2023).
Ahammad et al. compared the performance of ML
models such as DT, RF and Gradient Boosting
Machines (GBM). Their GBM model achieved a
training accuracy of 0.895 and a testing accuracy of
0.860, while the RF approach scored 0.883 and 0.853
and DT scored 0.850 (Ahammad et al., 2022). In
addition, Almomani et al. collected semantic features
including URL structure, HTML, JavaScript
behaviors, and WHOIS metadata, and evaluated 16
classifiers on them. The top accuracy scores are
around 97%, achieved by Gradient Boosting (GB)
and RF (Almomani et al., 2022). Wei and
Sekiya introduced deep learning methods into
modelling and found that models using ensemble
ML techniques outperformed others when applied
to their datasets, even with reduced feature
sets (Wei and Sekiya, 2022). Najjar-Ghabel et al.
compared six ML models using a dataset of 47
features (content, behavior, domain info) and found
that the RF model performed best, with 96.7%
accuracy and a high F1-score (Najjar-Ghabel et al.,
2024).
Therefore, the objective of this paper is to
evaluate the classification performance of DT, RF,
and GB methods in the classification of phishing
websites. The models are compared in terms of
predictive accuracy and robustness across training
and testing sets. The findings contribute to the
development of more accurate ML-based phishing
detection methods and their future application to web
security tools.
2 METHODS
2.1 Data Source
This study utilizes a dataset from the Mendeley Data
website, named Web Page Phishing Detection and
published on 25 June 2021. The original dataset was
built by Hannousse and Yahiouche following their
proposed guidelines. It contains 11430 records
and 89 variables, including raw URLs, status labels,
and 87 extracted features (Hannousse and Yahiouche,
2021). Features are sorted into three categories: URL-
based features, features based on page contents, and
features extracted via external services. The original
dataset is in .csv format.
2.2 Data Preprocessing
Due to the large size of the original dataset, a
subsample of 5000 observations (2500 legitimate,
2500 phishing) drawn by stratified sampling is used
for this study. To ensure the repeatability of the
selection, the random seed is set to 42; the same
setting is employed for later feature selection and
model tuning.
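The balanced subsampling step can be sketched as follows. The toy DataFrame stands in for the full 11430-row dataset, and the column names (`url`, `status`) are assumptions based on the dataset description:

```python
import pandas as pd

# Synthetic stand-in for the full dataset; column names are assumptions.
df = pd.DataFrame({
    "url": [f"http://site{i}.com" for i in range(200)],
    "status": ["legitimate"] * 100 + ["phishing"] * 100,
})

# Stratified subsample: an equal number of rows per class,
# with a fixed seed (42) for repeatability.
sample = (
    df.groupby("status", group_keys=False)
      .sample(n=50, random_state=42)
      .reset_index(drop=True)
)
```

In the actual study the per-class count would be 2500 rather than 50; grouping by the label before sampling is what guarantees the exact 50/50 class balance.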
Several data preprocessing steps are then
applied to the extracted sample. The statuses of
websites are encoded (0 for legitimate, 1 for phishing).
Features with constant values, as well as features
that are no longer relevant (such as the raw URL and
the web traffic feature based on the now-defunct
Alexa ranking service), are removed. The two
external features concerning domain registration
length and domain age contain erroneous values
represented by negative numbers (-1/-2). To handle
them, the two features are first binned into four
categories: "error" (value < 0), "zero" (value = 0),
"small_positive" (0 < value ≤ 365, i.e. within a
year), and "large_positive" (value > 365, i.e. longer
than a year). These binned features are then encoded
with the one-hot encoding (OHE) method.
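The binning and one-hot encoding of the two external features can be sketched as below. The feature names and the toy values are assumptions for illustration; the category thresholds follow the definitions above:

```python
import pandas as pd

def bin_domain_feature(value):
    """Map raw day counts (including -1/-2 error codes) to the four categories."""
    if value < 0:
        return "error"            # lookup failure codes such as -1/-2
    if value == 0:
        return "zero"
    if value <= 365:
        return "small_positive"   # within one year
    return "large_positive"       # longer than one year

# Toy values; the column names are assumptions based on the text.
df = pd.DataFrame({
    "domain_registration_length": [-1, 0, 120, 730],
    "domain_age": [-2, 45, 400, 0],
})

for col in ["domain_registration_length", "domain_age"]:
    df[col + "_cat"] = df[col].apply(bin_domain_feature)

# One-hot encode the binned categories.
encoded = pd.get_dummies(df[["domain_registration_length_cat", "domain_age_cat"]])
```

Each binned feature expands into up to four indicator columns, so the error codes become an explicit "error" category rather than being treated as numeric values.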
2.3 Feature Selection
After cleaning and encoding the data, feature
selection is performed via a model-based approach.
The process involves training three individual
classifiers based on the DT, RF and GB techniques on
the preprocessed dataset to extract feature importances.
The top 25 most informative features from each
model are then identified, and the final subset of
selected features is determined by the intersection of
the three selections, retaining only features that are
consistently important across all three models.
As a result, the sample dataset used for this study
contains 5000 observations and 18 variables (17
features and a label). The names and explanations of
the features in the three categories are shown in
Table 1.