
and have more power to measure default were selected 
for the modelling part of the project.  
2  DATASET DESCRIPTION  
The dataset used in this study consists of 16000
samples, each represented by a feature vector of 18
variables and an associated class label. The variable
names and types, along with their ranges, are shown
in Table 1. The dataset comes from the individual loan
applications of a financial institution.
The variables fall into 2 main categories:
finance-related information and personal information.
In this section, we briefly introduce the input
variables under these 2 categories, together with the
class variable, to clarify the information represented
by each variable in our dataset.
2.1  Finance-related Information  
The variable denoted ‘housingMaturity’ in Table 1
represents the number of months over which the
customer pays the instalments of a housing credit. In
the Turkish finance system, the maturity of a housing
credit can take a value between 6 and 240 months.
Similarly, vehicle maturity gives the number of
months of credit instalments for a vehicle loan. This
is also an integer variable, with a range from 0 to 60
months. The number of months of credit instalments
for a consumer loan is stored in the consumer
maturity variable, which has a range from 0 to 120.
The variable referred to as ‘ProductNumber’ in Table 1
represents the total number of different products the
customer has taken so far, including the currently
active loan. It is an integer variable with a range from
1 to 113. The ‘workingTime’ and ‘workplace’ variables
show the term of employment and the status of the
workplace of the credit customer, respectively.
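The ranges quoted above lend themselves to a simple sanity check on each record. The following is a minimal sketch rather than part of the study: the column names `vehicleMaturity` and `consumerMaturity` are hypothetical, and ‘housingMaturity’ is left out because the text gives the 6 to 240 month bound for housing credits without stating what value the variable holds for customers with other loan types.

```python
# Ranges as quoted in the text. Apart from "ProductNumber", the keys are
# hypothetical column names chosen for illustration.
STATED_RANGES = {
    "vehicleMaturity": (0, 60),
    "consumerMaturity": (0, 120),
    "ProductNumber": (1, 113),
}


def out_of_range(record):
    """Return the names of variables that are non-integer or outside their stated range."""
    bad = []
    for name, (low, high) in STATED_RANGES.items():
        value = record[name]
        if not isinstance(value, int) or not low <= value <= high:
            bad.append(name)
    return bad
```

A record that passes returns an empty list, so the function can be used as a filter before modelling.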
While the working time information is represented
with an integer variable, the workplace is a categorical
variable that takes 3 different values: “Public”,
“Private or Corporate” and “Other”. The other
variable related to the workplace of the customer is
‘Ownership’, a categorical variable that takes 4
different values indicating the owner of the workplace
the customer is working for. The possible values of
this variable are “personal”, “rental”, “family-owned”
and “other”. The ‘insuranceCode’ variable represents
the type of social security of the credit customer. It is
a categorical variable that can take 5 different values.
Loan Type is an indicator for the consumer maturity,
vehicle maturity and housing maturity variables. It is
a factor variable and is kept in the financial
institution’s system as an integer. The variable takes
the values “consumer loan”, “housing loan” and
“vehicle loan”, which are stored as 1, 2 and 3 in the
system. The financial institution uses this variable to
analyze the relationship between the number of
instalments and whether the credit will end in default
or not. Most of the credits given by the financial
institution are consumer credits rather than housing
or vehicle credits.
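The integer coding of Loan Type can be illustrated with a small lookup. This is a sketch under assumptions: it reads the listed values and codes in the order given (1 for consumer, 2 for housing, 3 for vehicle), and the field names `loanType`, `consumerMaturity` and `vehicleMaturity` are hypothetical; only ‘housingMaturity’ is a name taken from the text.

```python
# Assumed code-to-label mapping, read in the order the values are listed.
LOAN_TYPE_LABELS = {1: "consumer loan", 2: "housing loan", 3: "vehicle loan"}

# Hypothetical field names for the maturity variable each loan type points to.
MATURITY_FIELD = {
    1: "consumerMaturity",
    2: "housingMaturity",
    3: "vehicleMaturity",
}


def active_maturity(record):
    """Return (loan type label, number of instalment months) for one record."""
    code = record["loanType"]
    return LOAN_TYPE_LABELS[code], record[MATURITY_FIELD[code]]
```

Decoding the factor this way keeps the stored integers intact while making the instalment count for the active loan directly available for analysis.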
There is a "due date" in every kind of credit
settlement, such as a credit card, a credit deposit
account or the different loan types. If a payment is
made 1 or 2 days after its due date, the delay is
referred to as 1 term. If two consecutive loan
repayments have been made late, it is a two-term
delay. The “DefaultNumber” variable refers to
customers who have experienced the legal default
process before. Credits whose repayment is delayed
for 3 terms go into the default process and are closed
after completion of repayment.
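The term-counting rule can be sketched in a few lines. This is an illustration under assumptions, not the institution's actual procedure: it treats each late instalment as one term, counts terms as runs of consecutive late instalments, and assumes the count resets after an on-time payment, which the text does not state explicitly.

```python
def delay_terms(late_flags):
    """Length of the longest run of consecutive late instalments."""
    longest = current = 0
    for late in late_flags:
        # Assumption: an on-time payment resets the running count of terms.
        current = current + 1 if late else 0
        longest = max(longest, current)
    return longest


def enters_default(late_flags, threshold=3):
    """True if the delay reaches the 3-term threshold described in the text."""
    return delay_terms(late_flags) >= threshold
```

For example, a payment history flagged as late, late, on time, late gives a two-term delay and does not reach the default threshold.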
There are 2 important credit scores determined by
the Consumer Reporting Agency (CRA) for each
customer. One of these, referred to as CRA in Table 1,
is an integer variable with a range from 0 to 1612.
The CRA calculates this value according to its
internal rating system and provides it to the financial
institution when required. A value of 0 (zero) means
that the score cannot be calculated by the CRA for
that customer. The higher the score, the more
creditworthy the customer is. The other important
credit score included in our dataset is the individual
indebtedness index (III), which is designed to predict
the risks arising from high indebtedness. The main
difference between the CRA score and the III value is
that, while the CRA score aims to determine the risk
based on past or current payment problems, the III
value is used to identify people who have not suffered
any difficulties but are likely to suffer in the future
due to excessive borrowing.
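Because a CRA score of 0 marks a score that could not be calculated rather than the lowest possible score, it may be useful to separate the two cases before modelling. The helper below is a hypothetical sketch of one way to do this; the text does not say how zero scores were handled in the study.

```python
from typing import Optional

CRA_MIN, CRA_MAX = 0, 1612  # range quoted in the text


def clean_cra_score(raw: int) -> Optional[int]:
    """Map the sentinel value 0 to None and pass valid scores through."""
    if not CRA_MIN <= raw <= CRA_MAX:
        raise ValueError(f"CRA score {raw} is outside {CRA_MIN}..{CRA_MAX}")
    # 0 means the CRA could not calculate a score for this customer.
    return None if raw == 0 else raw
```

Keeping the missing case explicit avoids treating customers without a score as if they had the worst creditworthiness.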
2.2  Personal Information  
In addition to the variables related to the financial
status of the customers, the dataset contains some
personal information that might be important for the
creditworthiness of the customer. These are marital
status, occupation, education status, and age.
The  marital  status  variable  specifies  the  marital 
status  of  the  customer  as  of  the  date  of  credit 
application.  This  is  a  categorical  variable  with  5 
different  values.  The  occupation  information  is 
represented  with  8  different  categories  each  one  
   
Variable Importance Analysis in Default Prediction using Machine Learning Techniques