Comparative Analysis of Generalization for Machine Learning

Application in Loan Default Binary Classification

Jiayuan Ma

Faculty of Engineering and Computer Science, The University of Sydney, Sydney, Australia

Keywords: Machine Learning, Generalization, Loan Default.

Abstract: Predicting loan defaults is critical for banks because it enables financial institutions to assess the risks

associated with loan approval. While traditional methods of evaluating loan applicants rely on subjective

assessments, Machine Learning (ML) has emerged as a powerful alternative to predicting defaults. This study

conducted a comparative analysis of six ML models on a dataset of 255,347 loan applicants. The goal is to

evaluate the generalization ability of each model when exposed to different data distributions. The results

show significant performance degradation across all models when tested on unseen data, highlighting the

issue of distribution shifts between training and testing sets. Some techniques such as domain adaptation and

distribution alignment are discussed as potential solutions to improve model robustness. These findings

provide valuable insights that could guide financial institutions in selecting more reliable, adaptable, and

robust models for accurately predicting loan defaults, thus improving decision-making processes and reducing

financial risk.

1 INTRODUCTION

In banks, predicting loan default is a critical

component of deciding to accept a loan application,

which is a need of becoming more and more

commonplace in society because although it is a win-

win financial activity for both lenders and borrowers,

defaulting on a loan may lead to severe consequences,

such as decrease of profit in banks, obstacles in the

academic careers of the uninformed student

population (Looney, 2022; Looney, 2019;

Lakshmanarao, 2023; Baesens, 2003).

Traditionally, banks assess loan applicants based

on educational background, occupation, and income.

However, quantifying the importance of these factors

is complex, often relies on the expertise and

subjective judgment of professional bankers, and

requires tons of labour-intensive and time-consuming

(Lakshmanarao, 2023; Wu, 2019). In contrast,

applying Machine Learning (ML) algorithms

enhances the efficiency of solving the binary

classification problem of loan approval, facilitating

the selection of appropriate predictive models (Athey,

2018). For instance, Extensive research in this field

provides evidence that multivariable regression often

https://orcid.org/0009-0009-8252-3713

outperforms traditional methods (Serrano-Cinca,

2016). In addition, while although logistic regression

is designed for distinguishing binary targets by using

probability between 0 and 1, there is a study showing

that it may not be suitable for situations involving

nonlinear data, decision trees have demonstrated

strong results, albeit with some risk of overfitting due

to the model's principle that tries to divide the dataset

repeatedly into nodes base on the level of entropy in

different classes causing better performance in

training set and worse in testing set (Lakshmanarao,

2023; Jin, 2015; Shih, 2014). In addition, random

forests are a powerful and popular ensemble learning

technique that consists of multiple decision trees.

Each decision tree is trained on a different subset of

the data set, providing individual estimates that are

combined together to form a result. Random forest is

more stable and accurate than single decision tree. It

can effectively solve the overfitting problem of

decision tree and enhance the generalization ability of

the model to the unknown data in the future (Baesens,

2003). Moreover, the Naive Bayes algorithm

performs well with small datasets, and K-Nearest

Neighbours, a simple but effective model, tends to

excel when the features in the dataset exhibit high

Ma, J.

Comparative Analysis of Generalization for Machine Learning Application in Loan Default Binary Classiﬁcation.

DOI: 10.5220/0013224800004568

In Proceedings of the 1st International Conference on E-commerce and Artiﬁcial Intelligence (ECAI 2024), pages 319-323

ISBN: 978-989-758-726-9

319

correlation (Zareapoor, 2015). Furthermore, the

advent of neural networks and deep learning has

introduced even more promising outcomes (Wu, 2019;

Chong, 2017).

Many researchers have compared the

performance of various ML models in predicting loan

defaults. Despite these efforts, there remains a gap

regarding the generalization ability of these models

across diverse datasets. Addressing this gap is

essential for developing more reliable predictive

tools, which is crucial for banks to make better-

informed decisions when selecting predictive models.

This study will utilize the Loan Default Prediction

dataset from Coursera, which includes 255,347

samples and 16 features related to borrowers,

providing a comprehensive basis for analysis. The

results of this study could provide insights for banks,

enabling them to adopt more robust and generalizable

models for loan default prediction, ultimately making

to better risk management and decision-making in the

financial industry. Therefore, the primary objective of

this study is to evaluate and compare the

generalization ability of different models from ML in

the context of predicting loan default. Various

techniques of machine learning will be employed to

achieve this objective, and performance indicators

will be used.

2 METHOD

2.1 Dataset Setup

The dataset used in this study is sourced from

Coursera's educational materials and contains

information on bank loan defaults. It includes

255,347 samples, each representing 16 borrower

features, such as age, income, education, marital

status, and employment, along with a record of

whether the borrower defaulted. The dataset

underwent preprocessing before the experiments to

enhance the prediction accuracy of loan defaults. This

included normalizing quantitative features and

encoding categorical variables. After data

preprocessing, K-means clustering was employed to

divide the dataset into two categories, labelled Group

A and Group B. Group A represents the subset of data

available for model training in banks. At the same

time, Group B serves as an unseen sample set.

Therefore, to address the class imbalance, only the

Group A dataset was subjected to Synthetic Minority

Over-sampling Technique (SMOTE) for

oversampling. This approach ensures a more robust

model performance evaluation in predicting loan

defaults.

2.2 Machine Learning Models-based

Prediction

This study employs six fundamental machine

learning models. These models are integrated into the

sklearn library, providing easy access for researchers

to implement and use. Various performance

evaluation metrics are also available to assess the

models' effectiveness. This setup allows for

comprehensive comparisons and evaluations of

model performance in predicting outcomes based on

the selected dataset.

2.2.1 K Nearest Neighbour

The K nearest Neighbour (KNN) algorithm is a

simple and is commonly used for non-parametric

classification (this study) and regression tasks. It

calculates the distance between samples to be

classified and all the samples in the training set select

the nearest number of K neighbours and predicts the

class of the new sample by voting or weighted

average according to the class of these neighbours.

The advantage of KNN is interpretable and adaptable

to multi-class problems. However, it is

computationally inefficient in high dimensional

Spaces and sensitive to noise (Zhang, 2021).

2.2.2 Logistic Regression

Logistic regression (LR) is a widely used

classification algorithm, especially for binary

classification problems. It predicts the class by

mapping a linear combination of the input variables

to an interval from 0 to 1 and setting a threshold. LR

assumes there is a linear relationship between input

variables and output variables. It is easily

interpretable and computationally efficient. However,

LR has limited performance when facing complex

nonlinear problems (Fernandes, 2020).

2.2.3 Support Vector Machine

Support Vector machines (SVM) are also supervised

for classification and regression that perform

classification by finding the optimal decision

boundary (maximizing the margin between two

classes). SVM performs well in dealing with small

samples and high-dimensional data and can solve

nonlinear problems through kernel functions.

However, its computational cost is high on large-

scale datasets (Abdullah, 2021).

ECAI 2024 - International Conference on E-commerce and Artiﬁcial Intelligence

320

2.2.4 Naïve Bayes

Based on Bayes theorem, Naïve Bayes is an algorithm

of probabilistic classification. It assumes that the

features are independent. Although this assumption is

often not true, Naive Bayes still shows good

classification results in practice. The advantage of the

proposed algorithm is that it is computationally

efficient and insensitive to noise, making it suitable

for processing large-scale data sets (Saritas, 2019).

2.2.5 Decision Tree

The decision tree (DT) is another classification and

regression algorithm that generates a DT by

recursively selecting the best features to partition the

data. Its advantage is that the model has good

interpretability and can deal with multi-class

problems and nonlinear data. However, DT is prone

to overfitting and sensitive to data noise (Charbuty,

2021).

2.2.6 Random Forest

Random forest (RF) is an ensemble learning

algorithm that builds multiple DTs and combines

their predictions to improve the accuracy of

classification or regression. It reduces the overfitting

problem of single DT by introducing randomness and

has strong adaptability to high-dimensional data. RF

performs well in classification accuracy and stability

(Speiser, 2019).

3 RESULTS AND DISCUSSION

After analysing 6 model performances across groups

A and B, a notable decline in F1 scores is observed

for all models, indicating a significant reduction in the

ability of models to balance precision and recall when

applied to group B, as shown in Table 1 below.

The reason why the performance of the six

machine learning models deteriorates severely when

moving from Group A (training and testing) to Group

B (testing) can be attributed to several key factors. A

major reason may be that there is a distribution shift

Table 1: Performance Comparison on 6 Models between Group A and B.

MODEL INDICATOR GROUP A GROUP B

KNN

Precision 0.7570 0.1512

Recall 0.9654 0.4504

F1-score 0.8486 0.2264

Logistic

Regression

Precision 0.6827 0.2054

Recall 0.6857 0.6937

F1-score 0.6842 0.3170

SVM

Precision 0.7038 0.1949

Recall 0.6994 0.6711

F1-score 0.7016 0.3020

Naïve Bayes

Precision 0.6806 0.2128

Recall 0.6943 0.6665

F1-score 0.6874 0.3226

Decision Tree

Precision 0.8870 0.1823

Recall 0.9990 0.2331

F1-score 0.9397 0.2046

Random Forest

Precision 0.9790 0.4365

Recall 0.9984 0.1291

F1-score 0.9886 0.1993

Comparative Analysis of Generalization for Machine Learning Application in Loan Default Binary Classiﬁcation

321

between the two groups; that is, the data distribution

in Group B is significantly different from that in

Group A. KNN depends on the distance between data

points, and when the distribution changes, it is

challenging to find suitable neighbouring points in

Group B, causing its performance to slide. Similarly,

logistic regression, as a linear model, is very sensitive

to nonlinear relationships and complex category

boundaries in Group B and cannot maintain high

accuracy. The ability of SVM to find the optimal

hyperplane is also frustrated when the boundary data

points of Group B deviate from Group A. Naive

Bayes assumes feature independence, and

performance suffers when feature relationships in

Group B are more complex than in Group A. Decision

trees, due to their tendency to overfit, may have

learned some specific patterns in Group A that do not

generalize well to Group B, resulting in poor

performance. Finally, while random forests enhance

generalization by integrating multiple decision trees,

they fail to fully meet this challenge in the face of

significant distribution shifts.

To address these issues, an effective technique to

improve model robustness in the presence of

distribution shifts is domain adaptation distribution

alignment. The focus of the method is to align the

feature distribution between the source domain

(group A) and the target domain (Group B). By

minimizing the discrepancy between these

distributions, models can generalize better to unseen

data. Domain adaptation techniques can be

particularly effective when the distribution

differences are significant, as seen in this study.

Using this approach, the model will be better to

capture the underlying structure of both datasets,

improving its ability to handle new, unseen data from

Group B. As can be seen from the sharp drop in recall,

the model struggles to identify the true positive class

in the new group. Solving these problems may require

regularization and data augmentation to improve the

robustness of the model to distribution shifts.

4 CONCLUSIONS

This study compares the generalization performance

of six machine learning models in predicting loan

defaults. The results reveal that model performance,

as measured by the F1 score, significantly declines

when applied to unseen data due to distribution shifts

between training and testing sets. Among the models,

Random Forest showed the highest performance in

the training set but experienced a sharp decline in

unseen data, indicating overfitting. Techniques like

domain adaptation and distribution alignment are

suggested to address these issues. In conclusion,

while machine learning models offer enhanced

predictive capabilities over traditional methods, their

generalization ability remains challenging. Future

research should focus on improving model

robustness, particularly in the face of distribution

shifts, to ensure better risk management in financial

settings.

REFERENCES

Abdullah, D. M. & Abdulazeez, A. M. 2021. Machine

learning applications based on svm classification a

review. Qubahan Academic Journal.

Athey, S. et al. 2018. The impact of machine learning on

economics. The economics of artificial intelligence: An

agenda, pp. 507–547.

Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. 2003.

Using neural network rule extraction and decision

tables for credit-risk evaluation. Manag. Sci., vol. 49,

pp. 312–329.

Charbuty, B. & Abdulazeez, A. M. 2021. Classification

based on decision tree algorithm for machine learning.

Journal of Applied Science and Technology Trends.

Chong, E., Han, C., & Park, F. C. 2017. Deep learning

networks for stock market analysis and prediction:

Methodology, data representations, and case studies.

Expert Systems with Applications, vol. 83, pp. 187–

205.

Fernandes, A. A. T., Filho, D. B. F., da Rocha, E. C., &

Nascimento, W. 2020. Read this paper if you want to

learn logistic regression. Revista de Sociologia e

Política.

Jin, Y. & Zhu, Y. 2015. A data-driven approach to predict

default risk of loan for online peer-to-peer (p2p)

lending. In 2015 Fifth international conference on

communication systems and network technologies, pp.

609–613, IEEE.

Lakshmanarao, A., Gupta, C., Koppireddy, C. S., Ramesh,

U., & Dev, D. 2023. Loan default prediction using

machine learning techniques and deep learning ann

model. 2023 Annual International Conference on

Emerging Research Areas: International Conference on

Intelligent Systems (AICERA/ICIS), pp. 1–5, 2023.

Looney, A. & Yannelis, C. 2022. The consequences of

student loan credit expansions: Evidence from three

decades of default cycles. Journal of Financial

Economics, vol. 143, no. 2, pp. 771–793.

Looney, A. & Yannelis, C. 2019. How useful are default

rates? borrowers with large balances and student loan

repayment. Economics of Education Review, vol. 71,

pp. 135–145.

Saritas, M. M. & Yas¸ar, A. B. 2019. Performance analysis

of ann and naive bayes classification algorithm for data

classification. International Journal of Intelligent

ECAI 2024 - International Conference on E-commerce and Artiﬁcial Intelligence

322

Systems and Applications in Engineering, vol. 7, pp.

88–91.

Shih, J.-Y., Chen, W.-H., & Chang, Y.-J. 2014. Developing

target marketing models for personal loans. In 2014

IEEE International Conference on Industrial

Engineering and Engineering Management, pp. 1347–

1351, IEEE.

Speiser, J. L., Miller, M. I., Tooze, J. A., & Ip, E. H.-S.

2019. A comparison of random forest variable selection

methods for classification prediction modeling. Expert

systems with applications, vol. 134, pp. 93–101.

Wu, M., Huang, Y., & Duan, J. 2019. Investigations on

classification methods for loan application based on

machine learning. In 2019 International Conference on

Machine Learning and Cybernetics (ICMLC), pp. 1–6,

IEEE.

Zareapoor, M., Shamsolmoali, P., et al. 2015. Application

of credit card fraud detection: Based on bagging

ensemble classifier. Procedia computer science, vol. 48,

no. 2015, pp. 679–685.

Zhang, S. 2021. Challenges in knn classification. IEEE

Transactions on Knowledge and Data Engineering, vol.

34, pp. 4663–4675.

Comparative Analysis of Generalization for Machine Learning Application in Loan Default Binary Classiﬁcation

323