
forests, LightGBM yielded more balanced results. Although random forests produced a marginally higher recall, this came at the expense of a significantly larger number of false positives (FPs), a trade-off deemed unfavourable, since a high false-positive rate can lead to increased operational costs and customer dissatisfaction. Hence, the LightGBM model with cost-sensitive learning emerges as the preferred choice due to its enhanced fraud-capturing capability and balanced performance. Cost-sensitive learning also emerged as the superior way to address the class imbalance in this dataset. Moreover, comparing feature importance on the imbalanced dataset with feature importance once the imbalance is addressed shows that variables related to the claim type, claim processing and the number of injuries become more prominent in the latter. Indeed, injury-related claims have been found to be more prone to fraud.
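The inverse-frequency weighting that underlies cost-sensitive learning can be sketched as follows. This is an illustrative example, not the exact configuration used in this study; the `balanced_class_weights` helper and the 95/5 class split are assumptions for demonstration, mirroring the common "balanced" heuristic that libraries such as LightGBM and scikit-learn expose.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = n / (k * n_c),
    so rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# hypothetical portfolio: 95 legitimate (0) vs 5 fraudulent (1) claims
y = [0] * 95 + [1] * 5
weights = balanced_class_weights(y)
# the fraud class receives 19x the weight of the legitimate class,
# so misclassifying a fraudulent claim costs the learner far more
```

Passing such weights to a tree-based learner (e.g. via a sample-weight or class-weight argument) penalises missed fraud cases more heavily during training, without altering the data itself.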
It is important to recognise the limitations of the
dataset, particularly with regard to the accuracy and
completeness of the fraud labels. As the data is
sourced from a single motor insurance company, it
is subject to the specific methods and procedures
used by that company to identify and report fraud-
ulent claims. Consequently, the dataset may not be
fully representative of the true incidence of fraudu-
lent claims within the wider motor insurance indus-
try. This could potentially impact the reliability and
generalisability of any results obtained from the data,
particularly if the sample is biased in any way towards
certain types of claims or customers. Nonetheless, further research can determine which classification techniques identify fraud correctly and cost-effectively, and whether cost-sensitive learning is truly superior to synthetic minority oversampling as a means of addressing class imbalance. Furthermore, this research could be enhanced by incorporating principal component analysis to reduce the dimensionality of the dataset. This would allow information from all the features to be used and could improve predictive performance; however, it may come at the expense of the interpretability of the features.
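For contrast with cost-sensitive weighting, the synthetic minority oversampling idea can be sketched in a few lines of numpy. This is a minimal illustration of the interpolation step only, not a production SMOTE implementation; the `smote_sample` function, its parameters, and the toy data are assumptions for demonstration.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=None):
    """Create synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                     # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)

rng = np.random.default_rng(42)
X_fraud = rng.normal(size=(20, 4))             # minority (fraud) claims
X_synth = smote_sample(X_fraud, n_new=60, seed=1)
```

Unlike cost-sensitive learning, which reweights the loss, this approach rebalances the training set itself, which is precisely the trade-off the proposed comparison would examine.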
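The proposed principal component analysis extension could be sketched as below, using an SVD-based projection in numpy. This is illustrative only; the `pca_reduce` helper, the synthetic correlated features, and the choice of five components are assumptions, not part of the study's pipeline.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()   # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # stand-in for claim features
X[:, 1] = 3 * X[:, 0] + 0.1 * X[:, 1]       # two strongly correlated columns
Z, ratio = pca_reduce(X, n_components=5)    # Z: reduced 5-dimensional data
```

The reduced representation retains a linear combination of every original feature, which is the source of both the predictive gain and the interpretability loss noted above: the components no longer correspond to individual claim variables.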
Comparison of Tree-Based Learning Methods for Fraud Detection in Motor Insurance