between the two groups; that is, the data distribution
in Group B is significantly different from that in
Group A. KNN depends on the distance between data
points, and when the distribution changes, it is
challenging to find suitable neighbouring points in
Group B, causing its performance to degrade. Similarly,
logistic regression, as a linear model, cannot capture
the nonlinear relationships and complex class
boundaries in Group B and so cannot maintain high
accuracy. The separating hyperplane that SVM fits to
Group A is likewise no longer optimal when the data
points near the boundary in Group B deviate from
those in Group A. Naive
Bayes assumes feature independence, and
performance suffers when feature relationships in
Group B are more complex than in Group A. Decision
trees, due to their tendency to overfit, may have
learned some specific patterns in Group A that do not
generalize well to Group B, resulting in poor
performance. Finally, while random forests enhance
generalization by aggregating multiple decision trees,
they still cannot fully compensate for significant
distribution shifts.
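One way to check whether such a shift is present before comparing models is to measure, feature by feature, how far apart the empirical distributions of the two groups are. The sketch below is not part of the study. It computes the two-sample Kolmogorov-Smirnov statistic with NumPy on synthetic stand-ins for Group A and Group B; the arrays `X_a` and `X_b`, the shift of 1.5 applied to the second feature, and the function names are illustrative assumptions.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def shift_report(X_a, X_b):
    """KS statistic for each feature column of two 2-D arrays."""
    return [ks_statistic(X_a[:, j], X_b[:, j]) for j in range(X_a.shape[1])]


# Synthetic stand-ins (assumption): feature 0 is unchanged between groups,
# feature 1 has its mean shifted by 1.5 in Group B.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(2000, 2))
X_b = rng.normal(size=(2000, 2))
X_b[:, 1] += 1.5

report = shift_report(X_a, X_b)
for j, stat in enumerate(report):
    print(f"feature {j}: KS statistic = {stat:.3f}")
```

A statistic close to 0 suggests the feature is distributed similarly in both groups, while a large value flags a feature whose distribution has moved. Such features are natural candidates for the alignment or re-weighting steps considered when a model is transferred from Group A to Group B.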
To address these issues, one effective technique for
improving model robustness in the presence of
distribution shifts is domain adaptation through
distribution alignment. The method aligns the
feature distributions of the source domain
(Group A) and the target domain (Group B). By
minimizing the discrepancy between these
distributions, models can generalize better to unseen
data. Domain adaptation techniques can be
particularly effective when the distribution
differences are significant, as seen in this study.
With this approach, the model is better able to
capture the underlying structure of both datasets,
improving its ability to handle new, unseen data from
Group B. As can be seen from the sharp drop in recall,
the model struggles to identify true positives in the
new group. Addressing this may also require
regularization and data augmentation to improve the
robustness of the model to distribution shifts.
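The text does not name a specific alignment algorithm, so the sketch below uses one common choice as an assumption: a CORAL-style (correlation alignment) transform that whitens the source features and re-colours them with the target covariance, with an added mean-matching step. The data, the mixing matrix, and the names `coral_align` and `sym_matrix_power` are hypothetical and are not taken from the study.

```python
import numpy as np


def sym_matrix_power(C, p):
    """Raise a symmetric positive-definite matrix to the power p
    using its eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, 1e-12, None)  # guard against tiny negative eigenvalues
    return (vecs * vals ** p) @ vecs.T


def coral_align(X_src, X_tgt, eps=1e-6):
    """Map source features so that their mean and covariance match the target's.

    Only unlabeled target features are needed; eps regularises the covariances.
    """
    d = X_src.shape[1]
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    C_s = np.cov(X_src, rowvar=False) + eps * np.eye(d)
    C_t = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
    # Whiten with the source covariance, then re-colour with the target covariance.
    A = sym_matrix_power(C_s, -0.5) @ sym_matrix_power(C_t, 0.5)
    return (X_src - mu_s) @ A + mu_t


# Synthetic stand-ins (assumption): Group B is a scaled, correlated,
# and shifted version of the kind of data seen in Group A.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(1000, 3))
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X_b = rng.normal(size=(800, 3)) @ mix + np.array([1.0, -2.0, 0.5])

X_a_aligned = coral_align(X_a, X_b)
print("target mean: ", np.round(X_b.mean(axis=0), 3))
print("aligned mean:", np.round(X_a_aligned.mean(axis=0), 3))
```

In this setup a classifier would be trained on `X_a_aligned` with the Group A labels and then applied to `X_b`. The transform matches only first- and second-order statistics, so it should be read as a starting point rather than a guarantee that recall on Group B recovers.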
4 CONCLUSIONS
This study compares the generalization performance
of six machine learning models in predicting loan
defaults. The results reveal that model performance,
as measured by the F1 score, significantly declines
when applied to unseen data due to distribution shifts
between training and testing sets. Among the models,
Random Forest showed the highest performance on
the training set but experienced a sharp decline on
unseen data, indicating overfitting. Techniques like
domain adaptation and distribution alignment are
suggested to address these issues. In conclusion,
while machine learning models offer enhanced
predictive capabilities over traditional methods, their
generalization ability remains a challenge. Future
research should focus on improving model
robustness, particularly in the face of distribution
shifts, to ensure better risk management in financial
settings.