From Payment Services Directive 2 (PSD2) to Credit Scoring: A Case
Study on an Italian Banking Institution
Roberto Saia, Alessandro Giuliani, Livio Pompianu and Salvatore Carta
Department of Mathematics and Computer Science,
University of Cagliari, Via Ospedale 72, 09124 Cagliari, Italy
Keywords:
Business Intelligence, Decision Support System, Machine Learning, Credit Scoring, PSD2.
Abstract:
The Payment Services Directive 2 (PSD2), recently issued by the European Union, allows banks to share
their customers' data, provided that the customers authorize the operation. On the one hand, this opportunity
offers interesting perspectives to financial operators, allowing them to evaluate the customers' reliability
(Credit Scoring) even in the absence of the canonical information typically used (e.g., age, current job, total
income, or previous loans). On the other hand, state-of-the-art approaches and strategies still train their Credit
Scoring models using the canonical information. This scenario is further worsened by the scarcity of proper
datasets for research purposes and by the class imbalance between reliable and unreliable cases, which biases
the reliability of the classification models trained on this information. The proposed work experimentally
investigates the possibility of defining a Credit Scoring model based on the bank transactions of a customer
instead of the canonical information, comparing the performance of the two models (canonical and
transaction-based), and proposing an approach to improve the performance of the transaction-based model.
The obtained results show the feasibility of a Credit Scoring model based only on banking transactions,
and the possibility of improving its performance by introducing simple meta-features.
1 INTRODUCTION
Nowadays, risk management has become a key factor
in financial business scenarios. An appropriate credit
risk management is crucial for supporting financial in-
stitutions that provide lending services, which may be
affected by substantial economic losses due to loan
defaults. Estimating the probability of default is a
common way to assess the risk that borrowers cannot re-
pay their loans. In such a context, to support finan-
cial businesses in facing the competitive marketplace,
defining a reliable and effective Credit Scoring model
is essential to predict and avoid the aforementioned
issue. Credit Scoring can be defined as the set of
models and statistical methods aimed at automatically
evaluating consumer credit, estimating the likelihood
of default (Thomas et al., 2017; Gup, 2005).
Recently, Credit Scoring systems have rapidly
grown due to the increase of both consumer credit
requests and financial operators who flank the tradi-
tional credit circuits (i.e., banks) (Siddiqi, 2017). In-
deed, the massive number of requests does not allow
processing them manually, requiring effective auto-
mated systems to establish the reliability of a given
user, according to a binary (by classifying the user as
reliable or unreliable) or continuous (by assigning a
solvency score to the user) criterion.
In such a context, many people cannot access
consumer credit because of the strict conditions re-
quired by financial institutions, e.g., in many coun-
tries, banks ask customers to provide formal docu-
mentation proving they have a permanent employ-
ment contract. Clients who do not meet these cri-
teria cannot access consumer credit, even if they
have other sources of income that would allow them to repay
a loan. For these reasons, there has also been a growth
of investments and efforts in Credit Scoring research,
involving an ever-increasing number of researchers.
The focus is on defining strategies and algorithms ca-
pable of correctly evaluating new cases (users) based
on the data previously collected.
A common way to define Credit Scoring models
is to rely on Machine Learning techniques to quan-
titatively assess the default risk, typically based on
personal information (e.g., age, job, income, or outstanding
debts) obtained from loan applicants
(Lee and Sohn, 2017; Kim and Sohn, 2012). In detail,
the aim is to develop systems able to classify a
user as reliable or unreliable by exploiting the available
information, which from now on we refer to as instances.
Research on Credit Scoring has been heavily con-
ditioned by the scarcity of publicly available datasets,
due to the privacy policies of most financial com-
panies. A key issue of the state-of-the-art approaches
trained on the canonical datasets is the high level of
class imbalance, as, typically, the
unreliable cases (e.g., users that did not repay, fully
or partially, a loan) occur less frequently than the re-
liable ones (e.g., users that repaid a loan). The data
imbalance, which represents a very common scenario
in several domains, e.g., in Fraud Detection (Carta
et al., 2019b; Saia et al., 2017; Saia and Carta, 2017a)
or in Intrusion Detection (Saia et al., 2018b; Saia
et al., 2019b), is typically addressed by adopting sev-
eral re-sampling strategies (Leevy et al., 2018; Jun-
somboon and Phienthrakul, 2017). It should be observed
that such balancing techniques often cannot effectively
face some types of problems, since the generation of
synthetic unreliable instances based on the existing ones
does not solve the heterogeneity problem: if similar
instances belong to both information classes (reliable
and unreliable), they will continue to exist, even in
larger numbers, despite the data balancing. This means
that very similar feature patterns can characterize both
categories of users, reliable and unreliable. Consequently,
the traditional information set may need to be enlarged.
To address the aforementioned issue, let us con-
sider an emerging perspective. Currently, the Credit
Scoring environment is strongly affected by the Pay-
ment Services Directive 2 (PSD2), i.e., the new regu-
latory framework, introduced by the European Com-
mission, to regulate payment services and providers
throughout the European Union (EU) and European
Economic Area (EEA). The PSD2 aims to provide a
more integrated European payments market, in which
all players may be either banks or non-banks while
ensuring a more secure and protected platform for
consumers. In particular, the new directive encour-
ages customers to exploit innovative online and mo-
bile payments, such as through Open Banking, as the
current rules better protect customers in online pay-
ments and make cross-border European payment ser-
vices safer. In doing so, the PSD2 regulation al-
lows third parties to obtain free access to client ac-
counts and their payment transactions through bank
APIs. This new regulation can be exploited in real-
world scenarios, as models also based on bank trans-
action information may offer interesting application
prospects. However, there are still no clear policies on
which technologies should be used and, in particular,
which types of data the banks must share. To this end,
there is the need to investigate the explanatory value
of transaction data to estimate the potential usefulness
in a Credit Scoring scenario.
This paper aims to exploit the transaction infor-
mation to improve the characterization of the user in-
stances for classic Credit Scoring models. Further-
more, building on a state-of-the-art study, we also ex-
pand the feature space by adding a series of meta-
information. To our knowledge, no peer-reviewed publi-
cations exist on using transaction data to define Credit
Scoring models. Our scientific contribution can be re-
capped as follows:
- the analysis of a real-world dataset provided by an
Italian banking institution, built from client account
information and payment transaction data,
compliant with the new PSD2 regulation;
- assessment of the usefulness of transaction data for
Credit Scoring models;
- enrichment of the feature space with the introduc-
tion of a series of meta-features;
- evaluation of the related improvements for the
Credit Scoring models.
This paper is structured as follows: Section 2
presents the background and the related work of the
domain taken into account in this paper; Section 3
formalizes the notation adopted in this paper, also
providing information about the exploitation of the
additional meta-features; Section 4 describes the
performed experiments in terms of environment, datasets,
strategy, and adopted metrics, discussing the experimental
results; finally, Section 5 concludes the paper with some
remarks and directions for future work.
2 BACKGROUND AND RELATED
WORK
The literature identifies three different risk models
based on the default condition (i.e., the failure to re-
spect the legal obligations/conditions related to a fi-
nancial service, such as a loan): (i) Probability of De-
fault (PD), a model aimed at assessing the likelihood of
a default over a certain period; (ii) Exposure At De-
fault (EAD), a model aimed at assessing the total value
a financial operator is exposed to in case of default; (iii)
Loss Given Default (LGD), a model aimed at assessing
the amount of money a financial operator loses in case
of default.
For the purposes of this paper, we consider the
PD model, since our objective is a dichotomous
classification of the new instances into two classes,
reliable or unreliable.
Approaches: The literature offers a considerable
number of techniques and strategies, such as:
- Statistical: in (Sohn et al., 2016) the authors ex-
ploit Logistic Regression (LR) to design a fuzzy
Credit Scoring model aimed at assessing the de-
fault probability of a loan. In (Khemais et al.,
2016) the authors use the Linear Discriminant Anal-
ysis (LDA) in order to achieve this result. A
recent study (Roy and Shaw, 2021) proposes a
model based on multiple-criteria decision-making
(MCDM) that can be adopted as an internal scoring
model to preliminarily screen loan applications,
and it can be initially applied to reduce costs;
- Transformed Domain: in (Saia and Carta, 2017b),
a Credit Scoring approach based on the Fourier transform
has been proposed, similarly to the work done
in (Saia et al., 2018a), where the Wavelet
transform has been exploited instead;
- Machine Learning (ML): in (Roy and Urolagin,
2019) a Credit Scoring approach based on Decision
Tree (DT) and Support Vector Machine (SVM) al-
gorithms has been defined, whereas in (Zhang et al.,
2018) a Random Forest (RF) algorithm has been
adopted. Another work is based on a survival gra-
dient boosting decision tree (GBDT) approach (Xia
et al., 2020);
- Deep Learning (DL): recent works focus on exploit-
ing DL models also in the field of Credit Scoring.
In (Liu et al., 2019) the authors exploit an Artificial
Neural Network (ANN) to perform the Credit Scor-
ing task, as well as in (Lei et al., 2019), where an
Imbalanced Generative Adversarial Fusion Network
(IGAFN), based on a feed-forward neural network (FNN)
and a Bidirectional Long Short-Term Memory (Bi-LSTM)
network, has been defined. An ap-
proach based on Generative Adversarial Networks
(GAN) for data oversampling is proposed in (En-
gelmann and Lessmann, 2021);
- Others: approaches that focus on other factors.
For example, the entropy factor has been taken
into account in (Saia and Carta, 2016a), the linear-
dependence of the involved data has been consid-
ered in (Saia and Carta, 2016c; Saia and Carta,
2016b), a discretized enriched technique has been
proposed in (Saia et al., 2019a), whereas some
hybrid approaches that combine different methods
have been proposed in (Tripathi et al., 2018; Zhang
et al., 2019).
Let us remark that, in this preliminary work, we
focus on developing a model aimed at highlighting
the usefulness of transaction data rather than compar-
ing the model with all state-of-the-art systems. In do-
ing so, we compare our model with the best-known
classical ML models.
Open Problems: Although there are numerous state-
of-the-art approaches for credit scoring, they have to
face some well-known problems, such as:
- Data Scarcity: the scarcity of real-world datasets
has affected the research activity in the Credit Scor-
ing domain. Otherwise, the development of this
research area would undoubtedly have been more
consistent and effective, and it is curious to ob-
serve how this kind of problem can be considered
a side effect of the security and privacy poli-
cies that regulate many public and private compa-
nies (Sloan and Warner, 2018).
- Data Imbalance: the prediction models used in the
Credit Scoring domain are commonly defined on
the basis of datasets with a high degree of data
imbalance, i.e., data characterized by unbalanced
distributions of the events of interest (unreliable
cases), which are significantly fewer than the other
ones (reliable cases). Class imbalance is the most
critical problem to face in developing Credit Scor-
ing solutions, since a model trained by using un-
balanced data underestimates the probability of rare
events, tending to be biased towards the most com-
mon events (King and Zeng, 2001), reducing the
performance dramatically. The literature offers sev-
eral methods to balance the data, mainly accord-
ing to one of the following strategies: introducing
synthetic instances (oversampling); removing ex-
isting instances (undersampling); combining both
the oversampling and undersampling strategies. In
more detail, in the Credit Scoring context, the
oversampling creates synthetic unreliable instances
based on the existing ones, whereas the undersam-
pling removes several reliable existing instances to
balance their number with respect to the unreliable
ones. Both oversampling and undersampling have
their shortcomings. The former can lead to over-
fitting, as duplicating “bad” records may underesti-
mate the likelihood of observations belonging to the
minority class, whereas the latter may discard rele-
vant cases from the majority class, overestimating
the probability of “bad” samples (Weiss, 2004). An
empirical study on the balancing techniques applied
to Credit Scoring models highlighted that, albeit
larger datasets require longer training times, over-
sampling significantly increases the accuracy rel-
ative to undersampling (Crone and Finlay, 2012).
For this reason, in this paper, we adopted oversampling
to balance the data (a minimal balancing sketch is given
right after this list).
- Cold Start: the cold start problem concerns the
evaluation model definition when such a process
cannot use samples of one of the classes of infor-
mation involved, and it is a problem shared by many
domains. In the Credit Scoring one, this typically
happens when no unreliable cases are available, so
that the evaluation model cannot be trained in the
canonical two-class way, having only reliable cases
at disposal.
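To make the balancing strategies referenced in the Data Imbalance point above concrete, the following minimal sketch contrasts random over- and under-sampling with the imbalanced-learn library; the library choice and the toy data are assumptions, not details taken from the paper.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced data: class 1 ("reliable") dominates class 0 ("unreliable").
X, y = make_classification(n_samples=2000, weights=[0.05, 0.95], random_state=1)
print("original:", Counter(y))

# Oversampling: add copies of minority (unreliable) instances.
X_over, y_over = RandomOverSampler(random_state=1).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# Undersampling: discard part of the majority (reliable) instances.
X_under, y_under = RandomUnderSampler(random_state=1).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```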
Evaluation Metrics: As already pointed out, we fall
in a binary classification domain, in which the sys-
tem should predict if a given user will be reliable
or unreliable. In such a context, we are focused on
confusion-matrix-based metrics, such as, for instance,
accuracy, sensitivity, specificity, and the Matthews cor-
relation coefficient (MCC), which are widely used in
the literature across several domains. All the reported met-
rics are based on the confusion-matrix, i.e., a matrix
of size 2x2 that reports the numbers of True Nega-
tives (TN), False Negatives (FN), True Positives (TP),
and False Positives (FP). In the Credit Scoring lit-
erature, the confusion-matrix-based metrics are usu-
ally combined with other metrics based on the Re-
ceiver Operating Characteristic (ROC) curve (Green
and Swets, 1966). One of the metrics primarily used
is the Area Under the ROC Curve (AUROC): the
ROC curve plots the confusion-matrix-based Sensi-
tivity against the confusion-matrix-based Fallout, re-
spectively on the y-axis and the x-axis, giving us the
separability measure of a binary classifier (i.e., the
capability to correctly discriminate between the reliable
and unreliable cases).
3 PROPOSED APPROACH
Before formalizing the proposed approach, we in-
troduce the adopted formal notation: given a set of classified instances $I = \{i_1, i_2, \ldots, i_X\}$, composed of a subset of reliable ones $I^{+} = \{i^{+}_1, i^{+}_2, \ldots, i^{+}_Y\}$ (with $I^{+} \subseteq I$) and a subset of unreliable ones $I^{-} = \{i^{-}_1, i^{-}_2, \ldots, i^{-}_W\}$ (with $I^{-} \subseteq I$), we denote as $\hat{I} = \{\hat{i}_1, \hat{i}_2, \ldots, \hat{i}_Z\}$ another set of unclassified instances, each instance being characterized by a series of features $F = \{f_1, f_2, \ldots, f_N\}$ and by a destination class $C = \{reliable, unreliable\}$.
Data Aggregation: In order to train the classification
models of the involved algorithms, as the first step,
we aggregate the transaction information of each user
in a single data vector (record) on the basis of sev-
eral criteria. In more detail, the adopted Data Aggre-
gation Approach (DAA) aggregates the transactions of
each user in terms of: number of transactions, activ-
ity days, minimum handled amount, maximum han-
dled amount, total handled amount, mean handled
amount, and standard deviation measured in all the
transactions.
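As an illustration of the DAA step, the following sketch collapses a toy per-transaction log into one record per user with pandas, mirroring features F02-F08 of Table 1; the raw column names (user_id, date, amount) are assumed, since the original transaction schema is not reported, and "activity days" is interpreted here as the number of distinct days with at least one transaction.

```python
import pandas as pd

# Hypothetical raw transaction log: one row per credit card transaction.
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2017-01-03", "2017-01-05", "2017-02-01",
                            "2017-01-10", "2017-03-15"]),
    "amount": [120.0, 35.5, 80.0, 15.0, 250.0],
})

# DAA: one record per user, mirroring features F02-F08 of Table 1.
daa = tx.groupby("user_id").agg(
    n_transactions=("amount", "count"),   # F02: number of transactions
    activity_days=("date", "nunique"),    # F03: days of user activity
    min_amount=("amount", "min"),         # F04: minimum handled amount
    max_amount=("amount", "max"),         # F05: maximum handled amount
    total_amount=("amount", "sum"),       # F06: total handled amount
    mean_amount=("amount", "mean"),       # F07: average handled amount
    std_amount=("amount", "std"),         # F08: standard deviation of the amounts
).reset_index()

print(daa)
```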
Meta-feature Addition: Similarly to other previ-
ous works (Saia et al., 2019b; Carta et al., 2020;
Carta et al., 2019a; Saia et al., 2019a), which ex-
ploit preprocessing techniques in order to improve the
performance of the machine learning algorithms, we
propose a Meta-features Addition Approach (MAA)
aimed to better characterize each instance in the
sets $I$ and $\hat{I}$, improving the credit scoring performance.
In more detail, we added a series of meta-features
$MF = \{mf_1, mf_2, mf_3, mf_4\}$, where $mf_1$ = minimum,
$mf_2$ = maximum, $mf_3$ = average, and $mf_4$ = standard
deviation, all of them calculated on the set of features $F$
related to each instance, as formalized in Equation 1;
such a process involves both the set $I$ and the set $\hat{I}$.

\begin{equation}
MF =
\begin{cases}
mf_1 = \min(f_1, f_2, \ldots, f_N) \\
mf_2 = \max(f_1, f_2, \ldots, f_N) \\
mf_3 = \frac{1}{N} \sum_{n=1}^{N} f_n \\
mf_4 = \sqrt{\frac{1}{N-1} \sum_{n=1}^{N} \left( f_n - \bar{f} \right)^2}
\end{cases}
\tag{1}
\end{equation}
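A minimal sketch of the MAA step of Equation 1, computed row-wise with NumPy over the aggregated feature vectors; it is only illustrative and assumes the instances are stored as a 2-D array whose columns are the features in F.

```python
import numpy as np

def add_meta_features(X: np.ndarray) -> np.ndarray:
    """Append mf1..mf4 (min, max, mean, sample std) of each row's features."""
    mf1 = X.min(axis=1, keepdims=True)           # mf1: minimum of F
    mf2 = X.max(axis=1, keepdims=True)           # mf2: maximum of F
    mf3 = X.mean(axis=1, keepdims=True)          # mf3: average of F
    mf4 = X.std(axis=1, ddof=1, keepdims=True)   # mf4: std with 1/(N-1), as in Eq. 1
    return np.hstack([X, mf1, mf2, mf3, mf4])

# Toy example: two aggregated instances with N = 3 features each.
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 10.0, 40.0]])
print(add_meta_features(X))
```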
Instances Classification: Each instance $\hat{i} \in \hat{I}$ will be classified as reliable or unreliable, in accordance with the criteria formalized in Algorithm 1. In order to simplify the notation, with regard to each user, we have not used a different notation for the multiple instances (i.e., the related bank transactions) and for those aggregated into a single vector through the DAA process.

Algorithm 1: Instance classification.
Require: A = classifier, I = set of classified instances, î = instance to evaluate
Ensure: c = classification of the î instance
1: procedure GETEVALUATION(A, I, î)
2:   I ← aggregateData(I)          ▷ DAA process on the I set
3:   î ← aggregateData(î)          ▷ DAA process on the î instance
4:   I ← addMetafeatures(I)        ▷ MAA process on the I set
5:   î ← addMetafeatures(î)        ▷ MAA process on the î instance
6:   model ← trainModel(A, I)      ▷ Classification model training
7:   c ← getPrediction(model, î)   ▷ Instance classification
8:   return c
9: end procedure
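Algorithm 1 can be rendered, under simplifying assumptions, as the short scikit-learn script below; the DAA step is assumed to have already produced the aggregated features, the MAA step is implemented as a FunctionTransformer, and the random data, labels, and the Random Forest choice are placeholders rather than the authors' actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def add_meta_features(X):
    """MAA step (Equation 1): append per-row min, max, mean, and sample std."""
    X = np.asarray(X, dtype=float)
    extra = np.column_stack([X.min(axis=1), X.max(axis=1),
                             X.mean(axis=1), X.std(axis=1, ddof=1)])
    return np.hstack([X, extra])

# Placeholder DAA output: aggregated instances (rows) with 7 features each,
# plus labels (1 = reliable, 0 = unreliable) and one unclassified instance.
rng = np.random.default_rng(1)
I = rng.normal(size=(200, 7))
y = rng.integers(0, 2, size=200)
i_hat = rng.normal(size=(1, 7))

# Algorithm 1, lines 4-7: MAA on both sets, model training, prediction.
model = make_pipeline(FunctionTransformer(add_meta_features),
                      RandomForestClassifier(n_estimators=10, random_state=1))
model.fit(I, y)
c = model.predict(i_hat)  # classification of the unseen instance
print("predicted class:", c[0])
```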
4 EXPERIMENTS
All the involved code has been developed in the
Python language, exploiting the scikit-learn library
(http://scikit-learn.org).
In order to ensure the reproducibility of the exper-
iments carried out, the seed of the pseudo-random
number generator has been fixed to 1. We also per-
formed independent-samples two-tailed Student’s t-
tests, highlighting no statistically significant difference
between the results (p > 0.05).
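The reproducibility and significance-testing setup described above could look like the following hedged sketch, which fixes the seed and runs an independent-samples two-tailed Student's t-test with SciPy; the per-fold scores are placeholders, since the paper does not report which result pairs were compared.

```python
import numpy as np
from scipy import stats

np.random.seed(1)  # fix the pseudo-random number generator seed, as in the paper

# Placeholder per-fold scores of two models to compare (e.g., AUC over 10 folds).
scores_a = np.array([0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.92, 0.95, 0.93, 0.92])
scores_b = np.array([0.91, 0.94, 0.92, 0.93, 0.93, 0.92, 0.93, 0.94, 0.92, 0.93])

# Independent-samples, two-tailed Student's t-test.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f} (p > 0.05 -> no significant difference)")
```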
Datasets: The experiments have been performed us-
ing real-world datasets provided by a large Italian
bank. In this regard, it should be noted that the data
provided by the bank allow us to compare the per-
formance of the canonical credit scoring models (i.e.,
those based on the information usually exploited in
literature) with those based only on bank transactions,
as both the datasets used during the experiments refer
to the same set of users. Each user has been labeled as
reliable or unreliable. In detail, two types of datasets
have been provided by the bank, one containing trans-
actions made by a set of bank clients and one contain-
ing, for the same clients, the canonical information
used in the literature for the Credit Scoring tasks (e.g.,
age, gender, job, incomes, or loans). Similarly to other
works that use real data from private companies, for
confidentiality reasons the datasets cannot be made
public, not even in an anonymized form; however, we
report their characteristics, which should allow the
reproducibility of the experiments on other similar datasets.
The first dataset, named Credit Card Transactions
(CCT), contains 3,376,573 transactions related to the
use of a revolving credit card, carried out by 49,847
bank customers in the period from 01/01/2017 to
31/12/2018. The CCT dataset has been preprocessed
according to the criteria described in Section 3, which
leads to the features reported in Table 1.
In addition, we use another dataset, named Bank Cus-
tomers Information (BCI), that contains the Credit
Scoring information usually exploited in the litera-
ture. It refers to the same 49,847 customers of the
CCT dataset, and its features are reported in Table 2.
Each dataset contains 49,459 reliable cases
(99.23%) and 388 unreliable ones (0.77%). There-
fore, they are characterized by a high level of
data imbalance, similarly to the other datasets typ-
ically available in the Credit Scoring field. In
order to evaluate the performance as correctly
as possible, avoiding the bias introduced by the
class imbalance in defining the evaluation model, both
datasets have been preprocessed using an oversam-
pling technique, the Adaptive Synthetic Sampling Ap-
proach (ADASYN) (He et al., 2008), keeping its
default parameters (i.e., sampling_strategy='auto',
n_neighbors=5, n_jobs=None).
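The ADASYN oversampling step could be reproduced, for instance, with the imbalanced-learn implementation, as in the sketch below; the library choice is an assumption (the paper does not name it), and the synthetic data merely mimics the strong class imbalance of the real datasets.

```python
from collections import Counter

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# Toy data mimicking the roughly 99%/1% reliable/unreliable split of the paper.
X, y = make_classification(n_samples=5000, n_features=8, weights=[0.99, 0.01],
                           random_state=1)
print("before:", Counter(y))

# ADASYN with the parameters reported in the paper (n_jobs left at its default).
ada = ADASYN(sampling_strategy="auto", n_neighbors=5, random_state=1)
X_res, y_res = ada.fit_resample(X, y)
print("after:", Counter(y_res))
```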
Table 1: CCT Dataset Features.
Feature Description Type
F01 Unique identifier of the user Integer
F02 Number of user transactions Integer
F03 Days of user activity Integer
F04 Minimum amount handled by the user Real
F05 Maximum amount handled by the user Real
F06 Total amount handled by the user Real
F07 Average amount handled by the user Real
F08 Standard deviation of the user transactions Real
Table 2: BCI Dataset Features.
Feature Description Type
F01 User from Date
F02 Family members Integer
F03 Family income Real
F04 Resident from Date
F05 House type String
F06 Job type String
F07 Job sector String
F08 Employed from Date
F09 Real estates Real
F10 Annual income Real
F11 Other incomes Real
F12 Loans amount Real
F13 Mortgages amount Real
F14 Rent amount Real
Metrics: In order to evaluate the performance, we
rely on three different metrics. Two of them, Sensi-
tivity and Specificity, are both based on the confusion
matrix, indicating, respectively, the true positive rate
and the true negative rate. They estimate the capabil-
ity to correctly classify the reliable and unreliable
instances. The other one, the Area Under the Re-
ceiver Operating Characteristic curve (AUC), is in-
stead derived from the Receiver Operating Character-
istic (ROC) curve, and provides us information about
the predictive performance of an evaluation model, re-
gardless of the balancing of classes (reliable and un-
reliable) in the dataset (number of examples available
for the two classes).
Equation 2 formalizes the Sensitivity and Specificity metrics, where TP indicates the instances correctly classified as reliable, TN indicates the instances correctly classified as unreliable, and FN and FP indicate, respectively, the reliable instances wrongly classified as unreliable and the unreliable instances wrongly classified as reliable. These metrics give us the measure of how many instances have been correctly classified by an evaluation model.

\begin{equation}
Sensitivity = \frac{TP}{TP + FN}, \qquad Specificity = \frac{TN}{TN + FP}
\tag{2}
\end{equation}
The Area Under the Receiver Operating Char-
acteristic curve (AUC) metric is largely used in the
Credit Scoring literature, since it allows us to assess
the capabilities of an evaluation model, regardless of
the data balancing. Formally, considering the subsets of reliable ($I^{+}$) and unreliable ($I^{-}$) instances in the set $I$, Equation 3 formalizes all the possible comparisons $\alpha$ between the scores of the instances, and the AUC value (in the range $[0, 1]$, where 1 indicates the best performance) is given by averaging over them.

\begin{equation}
\alpha(i^{+}, i^{-}) =
\begin{cases}
1,   & \text{if } i^{+} > i^{-} \\
0.5, & \text{if } i^{+} = i^{-} \\
0,   & \text{if } i^{+} < i^{-}
\end{cases}
\qquad
AUC = \frac{1}{|I^{+}| \cdot |I^{-}|} \sum_{i^{+} \in I^{+}} \sum_{i^{-} \in I^{-}} \alpha(i^{+}, i^{-})
\tag{3}
\end{equation}
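To make the metric definitions concrete, the following didactic sketch computes Sensitivity, Specificity, and the pairwise AUC of Equation 3 on toy scores, cross-checking the latter against scikit-learn's roc_auc_score; the data are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy ground truth (1 = reliable, 0 = unreliable) and classifier scores.
y_true  = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.5])
y_pred  = (y_score >= 0.5).astype(int)

# Equation 2: Sensitivity (TPR) and Specificity (TNR) from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Equation 3: average of alpha over all (reliable, unreliable) score pairs.
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
alpha = (pos[:, None] > neg).astype(float) + 0.5 * (pos[:, None] == neg)
auc = alpha.mean()

print(sensitivity, specificity, auc, roc_auc_score(y_true, y_score))
```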
Strategy: The experiments involve five of the best-
performing algorithms in the Credit Scoring literature,
i.e., Gradient Boosting (GB) (Chopra and Bhilare,
2018), AdaBoost (AB) (Freund and Schapire, 1999),
Random Forests (RF) (Malekipirbazari and Aksakalli,
2015), Multilayer Perceptron (MP) (Luo et al., 2017),
and Decision Tree (DT) (Damrongsakmethee and
Neagoe, 2019). Table 3 reports their parameters.
Table 3: Algorithms Parameters.
Gradient Boosting (GB): n_estimators = 100, learning_rate = 0.1, max_depth = 3
AdaBoost (AB): n_estimators = 50, learning_rate = 0.1, algorithm = SAMME.R
Random Forests (RF): n_estimators = 10, max_depth = None, min_samples_split = 2
Multilayer Perceptron (MP): alpha = 0.0001, max_iter = 200, solver = adam
Decision Tree (DT): min_samples_split = 2, max_depth = None, min_samples_leaf = 1
The performance comparison has been made by
taking into account the average value of the three
considered metrics (i.e., Sensitivity, Specificity, and
AUC). In addition, in order to reduce the impact of
the data dependency, all the experiments have been
performed according to a k-fold cross-validation cri-
terion, with k=10.
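The evaluation protocol (10-fold cross-validation over the five classifiers of Table 3, averaging Sensitivity, Specificity, and AUC) could be set up along the lines of the sketch below; the dataset is a synthetic placeholder, only the parameters explicitly listed in Table 3 are set, and the AdaBoost algorithm option reported in the paper is left as a comment because recent scikit-learn releases have removed SAMME.R.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset standing in for the (oversampled) BCI/CCT feature matrices.
X, y = make_classification(n_samples=2000, n_features=12, random_state=1)

scoring = {
    "sensitivity": make_scorer(recall_score, pos_label=1),  # true positive rate
    "specificity": make_scorer(recall_score, pos_label=0),  # true negative rate
    "auc": "roc_auc",
}
models = {
    "GB": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3),
    # The paper also reports algorithm=SAMME.R, the scikit-learn default at the time.
    "AB": AdaBoostClassifier(n_estimators=50, learning_rate=0.1),
    "RF": RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2),
    "MP": MLPClassifier(alpha=0.0001, max_iter=200, solver="adam"),
    "DT": DecisionTreeClassifier(min_samples_split=2, max_depth=None, min_samples_leaf=1),
}
for name, model in models.items():
    cv = cross_validate(model, X, y, scoring=scoring, cv=10)  # 10-fold cross-validation
    sens = cv["test_sensitivity"].mean()
    spec = cv["test_specificity"].mean()
    auc = cv["test_auc"].mean()
    print(f"{name}: sensitivity={sens:.4f} specificity={spec:.4f} "
          f"auc={auc:.4f} average={np.mean([sens, spec, auc]):.4f}")
```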
The strategy adopted for the experiments follows
four steps: (i) evaluation of a model trained using the
canonical user information (BCI dataset); (ii) evalu-
ation of a model trained using only the bank trans-
actions (CCT dataset); (iii) comparison of the per-
formance of the two aforementioned evaluation mod-
els, both applied to the same set of users; (iv) as-
sessing the benefits of the proposed preprocessing ap-
proach (meta-features enrichment) in the context of
the model trained only with the bank transactions.
Results: According to our experimental strategy, the
first set of experiments has been aimed at evaluat-
ing the state-of-the-art algorithms in the context of an
evaluation model trained using the canonical Credit
Scoring information, then without using the bank
transactions (BCI dataset). The results are reported
in Table 4. In Table 5 are instead reported the per-
Table 4: Canonical Credit Scoring Model Performance.
Algorithm Dataset Sensitivity Specificity AUC Average
GB BCI 0.9320 0.9295 0.9307 0.9307
AB BCI 0.8933 0.8815 0.8873 0.8874
RF BCI 0.9560 0.9992 0.9776 0.9776
MP BCI 0.4836 0.7899 0.6367 0.6367
DT BCI 0.9613 0.9712 0.9715 0.9715
formance related to the same users of the BCI dataset,
but evaluated on the basis of a model trained using the
CCT dataset. It should be observed how, apart from
some algorithms (i.e., AB and MP), all the other ones
reach good performances, especially DT and GB, in
accordance with many literature works in this domain.
Table 5: Transactions-based Credit Scoring Model Performance.
Algorithm Dataset Sensitivity Specificity AUC Average
GB CCT 0.7920 0.8477 0.8173 0.8190
AB CCT 0.6907 0.7123 0.7008 0.7013
RF CCT 0.9803 0.8253 0.8883 0.8980
MP CCT 0.6311 0.6157 0.6231 0.6233
DT CCT 0.9703 0.9054 0.9356 0.9371

The average values reported in Table 5, related to the
model trained using the credit card transactions, have
been compared in Table 6 to the results obtained after
adding the meta-features (the best performances are
highlighted in bold). Furthermore, to better highlight
the impact of the meta-features, the last column of the
table reports the increment, expressed as a percentage,
of the average values.
Table 6: Transactions-based Credit Scoring Model Perfor-
mance, Before and After the Meta-features Addition.
Algorithm Dataset Average before Average after Increment (%)
GB CCT 0.8190 0.8208 +0.22%
AB CCT 0.7013 0.7056 +0.62%
RF CCT 0.8980 0.9118 +1.54%
MP CCT 0.6233 0.6354 +1.94%
DT CCT 0.9371 0.9397 +0.28%
Discussion: The experimental results lead to the
following considerations:
- the adoption of the canonical customer information
for the model training leads to better performance
than that obtained using only the bank transactions.
In any case, to face those scenarios where the canonical
information is not available, the latter evaluation model
offers an interesting opportunity, as it allows the
financial operators to extend the set of potential
customers, with all the related advantages;
- according to the previous observation, we experimentally
showed how the addition of simple meta-features
can improve the Credit Scoring performance of all
the classification algorithms taken into account dur-
ing the experiments;
- it is therefore clear that both models can be exploited
in real-world scenarios, and that those based on bank
transactions offer interesting application prospects,
albeit presenting lower performance than that obtained
by the canonical models;
- although the improvements introduced by the addition
of the meta-features appear slight, they can be
considered an interesting result, given their impact
in real-world scenarios, where a huge number of
customers is usually involved;
- the previous observation is supported by the fact
that in the context of the used dataset (49,847 bank
customers), the improvements related to the GB,
AB, RF, MP, and DT algorithms correspond, respectively,
to 60, 169, 837, 613, and 135 additional correctly
classified customers;
- summarizing, the results demonstrate both the fea-
sibility of a model based only on banking transac-
tions, and the possibility of improving its perfor-
mance by introducing simple meta-features.
5 CONCLUSIONS AND FUTURE
WORK
The recent Payment Services Directive 2 (PSD2), issued
by the European Union, which enables banks
to share users' data with their prior consent, makes
it essential to revise the canonical methodologies for
defining Credit Scoring models. Indeed, the state-of-
the-art Credit Scoring models are usually trained us-
ing different information about the customers, mainly
based on personal data such as, for instance, age, gen-
der, current job, total incomes, or previous loans. To
this end, we investigated the feasibility of defining a
Credit Scoring model based on bank transactions of
customers instead of using the canonical information.
Transaction data has also been enhanced through the
introduction of suitable meta-features. The performed
experiments indicate a performance reduction when
the Credit Scoring models are trained using the bank
transactions only (compare Tables 4 and 5). However,
they also indicate that it is possible to get acceptable
performance by using the bank transactions alone,
grouped according to simple criteria (Table 5), and that
this performance can be improved by adding
meta-features (Table 6). These results open
up interesting scenarios, since they allow the financial
operators to extend their pool of potential customers in
many contexts, such as consumer credit, by virtue of
the fact that it is possible to evaluate them even in the
absence of the canonical information used up to now.
As future work, we plan to extend the proposed
approach, testing more sophisticated methodologies
able to further improve the Credit Scoring perfor-
mance, according to the available user data.
ACKNOWLEDGEMENTS
This research was partially funded by the “Bando
Aiuti per progetti di Ricerca e Sviluppo”, POR FESR
2014-2020, Asse 1, Azione 1.1.3, project “IntelliCredit:
AI-powered digital lending platform”.
REFERENCES
Carta, S., Fenu, G., Ferreira, A., Recupero, D. R., and Saia,
R. (2019a). A two-step feature space transforming
method to improve credit scoring performance. In In-
ternational Joint Conference on Knowledge Discov-
ery, Knowledge Engineering, and Knowledge Man-
agement, pages 134–157. Springer.
Carta, S., Fenu, G., Recupero, D. R., and Saia, R. (2019b).
Fraud detection for e-commerce transactions by em-
ploying a prudential multiple consensus model. Jour-
nal of Information Security and Applications, 46:13–
22.
Carta, S., Podda, A. S., Reforgiato Recupero, D. R., and
Saia, R. (2020). A local feature engineering strategy to
improve network anomaly detection. Future Internet,
12(10):177.
Chopra, A. and Bhilare, P. (2018). Application of ensemble
models in credit scoring models. Business Perspec-
tives and Research, 6(2):129–141.
Crone, S. F. and Finlay, S. (2012). Instance sampling in
credit scoring: An empirical study of sample size
and balancing. International Journal of Forecasting,
28(1):224–238.
Damrongsakmethee, T. and Neagoe, V.-E. (2019). Principal
component analysis and ReliefF cascaded with decision
tree for credit scoring. In Computer Science On-line
Conference, pages 85–95. Springer.
Engelmann, J. and Lessmann, S. (2021). Conditional
Wasserstein GAN-based oversampling of tabular data
for imbalanced learning. Expert Systems with Appli-
cations, 174.
Freund, Y. and Schapire, R. E. (1999). A short introduction
to boosting. In Proceedings of the Sixteenth In-
ternational Joint Conference on Artificial Intelligence,
pages 1401–1406. Morgan Kaufmann.
Green, D. M. and Swets, J. A. (1966). Signal Detection
Theory and Psychophysics. Wiley, New York.
Gup, B. E. (2005). Commercial banking: The management
of risk. J. Wiley.
He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN:
Adaptive synthetic sampling approach for imbalanced
learning. In 2008 IEEE international joint conference
on neural networks (IEEE world congress on compu-
tational intelligence), pages 1322–1328. IEEE.
Junsomboon, N. and Phienthrakul, T. (2017). Combining
over-sampling and under-sampling techniques for im-
balance dataset. In Proceedings of the 9th Interna-
tional Conference on Machine Learning and Comput-
ing, pages 243–247.
Khemais, Z., Nesrine, D., Mohamed, M., et al. (2016).
Credit scoring and default risk prediction: A compar-
ative study between discriminant analysis & logistic
regression. International Journal of Economics and
Finance, 8(4):39.
Kim, Y. and Sohn, S. (2012). Stock fraud detection us-
ing peer group analysis. Expert Systems with Applica-
tions, 39(10):8986–8992.
King, G. and Zeng, L. (2001). Logistic regression in rare
events data. Political analysis, 9(2):137–163.
Lee, B. K. and Sohn, S. Y. (2017). A credit scoring model
for SMEs based on accounting ethics. Sustainability,
9(9).
Leevy, J. L., Khoshgoftaar, T. M., Bauder, R. A., and Seliya,
N. (2018). A survey on addressing high-class imbal-
ance in big data. Journal of Big Data, 5(1):42.
Lei, K., Xie, Y., Zhong, S., Dai, J., Yang, M., and Shen,
Y. (2019). Generative adversarial fusion network for
class imbalance credit scoring. Neural Computing and
Applications, pages 1–12.
Liu, C., Huang, H., and Lu, S. (2019). Research on personal
credit scoring model based on artificial intelligence. In
International Conference on Application of Intelligent
Systems in Multi-modal Information Analytics, pages
466–473. Springer.
Luo, C., Wu, D., and Wu, D. (2017). A deep learn-
ing approach for credit scoring using credit default
swaps. Engineering Applications of Artificial Intel-
ligence, 65:465–470.
Malekipirbazari, M. and Aksakalli, V. (2015). Risk assess-
ment in social lending via random forests. Expert Sys-
tems with Applications, 42(10):4621–4631.
Roy, A. G. and Urolagin, S. (2019). Credit risk assess-
ment using decision tree and support vector machine
based data analytics. In Creative Business and So-
cial Innovations for a Sustainable Future, pages 79–
84. Springer.
Roy, P. and Shaw, K. (2021). A credit scoring model for
SMEs using AHP and TOPSIS. International Journal of
Finance and Economics.
Saia, R. and Carta, S. (2016a). An entropy based algorithm
for credit scoring. In International Conference on Re-
search and Practical Issues of Enterprise Information
Systems, pages 263–276. Springer.
Saia, R. and Carta, S. (2016b). Introducing a vector space
model to perform a proactive credit scoring. In In-
ternational Joint Conference on Knowledge Discov-
ery, Knowledge Engineering, and Knowledge Man-
agement, pages 125–148. Springer.
Saia, R. and Carta, S. (2016c). A linear-dependence-based
approach to design proactive credit scoring models. In
KDIR, pages 111–120.
Saia, R. and Carta, S. (2017a). Evaluating credit card trans-
actions in the frequency domain for a proactive fraud
detection approach. In SECRYPT, pages 335–342.
Saia, R. and Carta, S. (2017b). A fourier spectral pattern
analysis to design credit scoring models. In Proceed-
ings of the 1st International Conference on Internet of
Things and Machine Learning, page 18. ACM.
Saia, R., Carta, S., et al. (2017). A frequency-domain-
based pattern mining for credit card fraud detection.
In IoTBDS, pages 386–391.
Saia, R., Carta, S., and Fenu, G. (2018a). A wavelet-based
data analysis to credit scoring. In Proceedings of the
2nd International Conference on Digital Signal Pro-
cessing, pages 176–180. ACM.
Saia, R., Carta, S., and Recupero, D. R. (2018b). A
probabilistic-driven ensemble approach to perform
event classification in intrusion detection system. In
KDIR, pages 139–146.
Saia, R., Carta, S., Recupero, D. R., Fenu, G., and Saia, M.
(2019a). A discretized enriched technique to enhance
machine learning performance in credit scoring. In
KDIR, pages 202–213.
Saia, R., Carta, S., Recupero, D. R., Fenu, G., and Stan-
ciu, M. (2019b). A discretized extended feature space
(DEFS) model to improve the anomaly detection per-
formance in network intrusion detection systems. In
KDIR, pages 322–329.
Siddiqi, N. (2017). Intelligent credit scoring: Building and
implementing better credit risk scorecards. John Wi-
ley & Sons.
Sloan, R. H. and Warner, R. (2018). When is an algorithm
transparent? predictive analytics, privacy, and public
policy. IEEE Security & Privacy, 16(3):18–25.
Sohn, S. Y., Kim, D. H., and Yoon, J. H. (2016). Technol-
ogy credit scoring model with fuzzy logistic regres-
sion. Applied Soft Computing, 43:150–158.
Thomas, L., Crook, J., and Edelman, D. (2017). Credit
Scoring and Its Applications, Second Edition. Society
for Industrial and Applied Mathematics, Philadelphia,
PA.
Tripathi, D., Edla, D. R., and Cheruku, R. (2018). Hy-
brid credit scoring model using neighborhood rough
set and multi-layer ensemble classification. Journal
of Intelligent & Fuzzy Systems, 34(3):1543–1549.
Weiss, G. M. (2004). Mining with rarity: A unifying frame-
work. SIGKDD Explor. Newsl., 6(1):7–19.
Xia, Y., He, L., Li, Y.-G., Fu, Y., and Xu, Y. (2020). A dy-
namic credit scoring model based on survival gradient
boosting decision tree approach. Technological and
Economic Development of Economy, pages 1–24.
Zhang, W., He, H., and Zhang, S. (2019). A novel multi-
stage hybrid model with enhanced multi-population
niche genetic algorithm: An application in credit scor-
ing. Expert Systems with Applications, 121:221–232.
Zhang, X., Yang, Y., and Zhou, Z. (2018). A novel credit
scoring model based on optimized random forest. In
2018 IEEE 8th Annual Computing and Communica-
tion Workshop and Conference (CCWC), pages 60–
65. IEEE.