In this study, RF is optimized using GridSearchCV with four parameters. n_estimators, the number of trees in the forest, ranges from 100 to 1000; more trees can improve accuracy but also increase computation time. The maximum depth of each tree (max_depth) is either set to None (unlimited) or restricted to 10, 20, or 30. The minimum number of samples required to split an internal node (min_samples_split) is set to 2, 5, or 10, which influences the model's complexity. min_samples_leaf, the minimum number of samples required at a leaf node, ranges from 10 to 100 and controls overfitting by ensuring each leaf has enough data.
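A minimal sketch of this search, assuming scikit-learn, 5-fold cross-validation, and synthetic placeholder data; the candidate values inside each stated range are illustrative, not the study's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the pet-adoption features.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       random_state=42)

# The four RF hyperparameters described above; candidate values
# within each stated range are illustrative.
param_grid = {
    "n_estimators": [100, 300, 500, 1000],   # number of trees
    "max_depth": [None, 10, 20, 30],         # unlimited or capped depth
    "min_samples_split": [2, 5, 10],         # samples needed to split a node
    "min_samples_leaf": [10, 50, 100],       # samples required at each leaf
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```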
2.2.2 Decision Tree
DT is a tree-like structure in which each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents the final output. As mentioned above for RF, the data are split into subsets until the outputs at the leaf nodes are pure. Splitting usually starts at the root, dividing the data into two or more homogeneous subsets based on the most significant feature, and proceeds recursively on each derived subset. After the tree has been fully grown, it can, if needed, be pruned by removing branches of little importance to avoid overfitting. Moreover, DT can rank features by importance, and missing values can be handled either by ignoring them or by considering multiple possible outcomes.
As stated, DT is the building block of RF, so the parameters are similar. For DT in this study, max_depth ranges from 10 to 70 or is set to None, min_samples_split varies between 2 and 20, and min_samples_leaf from 1 to 8. In addition, the split strategy (splitter) is either "best" for optimal splits or "random" for random splits.
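A corresponding sketch for the DT grid, again with illustrative candidate values and placeholder data; the final line also shows the feature ranking mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the pet-adoption features.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       random_state=42)

# DT hyperparameters described above; candidate values illustrative.
param_grid = {
    "max_depth": [None, 10, 30, 50, 70],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 4, 8],
    "splitter": ["best", "random"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_estimator_.feature_importances_)  # feature ranking
```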
2.2.3 Logistic Regression
LR is commonly used in binary classification tasks, which fits the pet adoption prediction. The model assumes a linear relationship between the independent variables (the inputs) and the log-odds of the dependent variable (the outcome). The logistic function, also called the sigmoid, transforms any real-valued input into a value between 0 and 1 via the formula σ(z) = 1 / (1 + e^(-z)), where z is the linear combination of the input features and their corresponding coefficients.
Usually, and in this study as well, the threshold is set to 0.5: a result greater than 0.5 predicts the positive class, and otherwise the negative class. Beyond binary classification, LR can be extended to multiclass classification using techniques such as One-vs-Rest (OvR) or softmax regression.
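As a small worked example of the sigmoid and the 0.5 threshold (the z values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued input to a value in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# z is the linear combination of input features and coefficients.
z = np.array([-2.0, 0.0, 1.5])        # illustrative values
probs = sigmoid(z)                    # approx. [0.119, 0.500, 0.818]
preds = (probs > 0.5).astype(int)     # 0.5 threshold -> [0, 0, 1]
print(probs, preds)
```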
In this study, the inverse of regularization strength (C) is set in the range from 0.01 to 100. The optimization options (solver) include 'newton-cg', 'lbfgs', and 'sag', which affect convergence speed and model accuracy.
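A minimal sketch of this search, with illustrative C candidates inside the stated range and an assumed max_iter to help the solvers converge on the placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the pet-adoption features.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       random_state=42)

# C is the inverse of regularization strength: smaller C,
# stronger regularization. Candidate values illustrative.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["newton-cg", "lbfgs", "sag"],
}

search = GridSearchCV(LogisticRegression(max_iter=1000),  # max_iter assumed
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```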
2.2.4 Artificial Neural Network
ANN is inspired by the structure of the human brain, usually consisting of multiple interconnected layers of nodes, or "neurons," that process and transmit information. The model simulates how the human brain processes input information and produces reactions and decisions. Each neuron receives input, passes it through an assigned activation function, which allows the ANN to model non-linear relationships, and then conveys the output to the next layer. Connections between neurons carry weights, which determine the strength of each connection. These weights, along with the biases added to the neurons' inputs, are updated continually as the error is propagated backward through the network.
The ANN model is constructed with three Rectified Linear Unit (ReLU) layers followed by a sigmoid output layer. The first and last ReLU layers have the same number of neurons (20) as the input shape, while the number of neurons in the middle ReLU layer is tuned between 40 and 80. Adaptive Moment Estimation (Adam) is chosen as the optimizer for faster convergence, accelerating updates in the relevant direction and dampening oscillations; the loss is set to 'binary_crossentropy' and the metric to 'accuracy'. The number of epochs varies from 30 to 90.
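A minimal Keras sketch of this architecture, using random placeholder data, an illustrative middle-layer width of 64 (within the stated 40 to 80 range), and the lower end of the epoch range:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data with the 20 input features mentioned above.
X_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

# Three ReLU layers (20 -> middle -> 20) and a sigmoid output.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(20, activation="relu"),
    layers.Dense(64, activation="relu"),   # middle layer, tuned in [40, 80]
    layers.Dense(20, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=30)     # epochs varied from 30 to 90
```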
3 RESULTS AND DISCUSSION
The training and testing accuracies for the RF, DT, and LR models are shown in Table 1, and the test results for the ANN are listed in Table 2 and Table 3.
Table 1: Accuracy and AUC in training and testing.

                        RF              DT              LR
Accuracy in Training    0.9495327103    0.9426791277    0.9158878505
Accuracy in Testing     0.9129353234    0.9179104478    0.8855721393
AUC in Training         0.9668213326    0.9842749416    0.9343185955
AUC in Testing          0.9207912458    0.9102553311    0.9225028058