Assessment of the Relationship Between Attribute Coding and the

Interpretability of Machine Learning Models: An Analysis in the

Context of Children and Adolescents with Depression

Ludmila B. S. Nascimento

1 a

, Marcelo de S. Balbino

1,2 b

, Maycoln L. M. Teodoro

3 c

and Cristiane N. Nobre

1 d

Institute of Exact Sciences and Informatics, Pontiﬁcal Catholic University of Minas Gerais,

Dom Jos

e Gaspar, Belo Horizonte, Brazil

Department of Computing and Civil Construction, Federal Center for Technological Education of Minas Gerais,

Belo Horizonte, Brazil

Department of Psychology, Federal University of Minas Gerais, Belo Horizonte, Brazil

Keywords:

Children and Adolescents, Depression, Encoding, Interpretability.

Abstract:

Depression is a global public health challenge that affects approximately 300 million people. Artiﬁcial Intel-

ligence and Machine Learning have revolutionized the healthcare sector, allowing the development of models

to diagnose depression. Tabular data, shared in healthcare, requires preprocessing, including encoding cat-

egorical attributes into numeric values, as many Machine Learning algorithms only support numeric data.

This study aims to investigate different coding methods for non-ordinal nominal categorical attributes in a

dataset related to depression in children and adolescents suffering from Major Depressive Disorder (MDD).

The comparison results revealed that the XGBoost algorithm with the Hash Encoding, Customized One Hot,

Frequency, and Dummy coding techniques were more effective for the analyzed data set. However, not all

of these encodings are interpretable. These results provide signiﬁcant insights, highlighting the importance

of choosing appropriate coding methods to improve the accuracy of Machine Learning models and the inter-

pretability of these models in healthcare.

1 INTRODUCTION

Depressive disorder, called depression, represents a

signiﬁcant public health challenge on a global scale

(Herrman et al., 2019). Around 300 million individu-

als worldwide face depression (PAHO, 2022).

The World Health Organization (WHO) states that

depressive disorders, such as mental illness, are dif-

ferent from usual variations in mood or routine feel-

ings. Major Depressive Disorder (MDD) is the most

common type of depression in adolescents. It is char-

acterized by sadness, loss of interest, tiredness, exces-

sive guilt, sleep and appetite disturbances, moderate

or sluggishness, and suicidal thoughts. In this article,

the term depression refers to this form of MDD.

Depression in children and adolescents is a se-

https://orcid.org/0009-0004-9133-9671

https://orcid.org/0000-0003-4154-8518

https://orcid.org/0000-0002-3021-8567

https://orcid.org/0000-0001-8517-9852

vere problem, with estimates of approximately 1.1%

among adolescents aged 10 to 14 and 2.8% among

adolescents aged 15 to 19 (WHO, 2021). In Brazil,

the prevalence of depression in children over 14 years

old varies from 0.2% to 7.5% (Coutinho et al., 2014).

Depression has several symptoms that affect so-

matic and psycho-emotional processes. This can be

complicated during adolescence, as symptoms can be

confused with age-typical behaviors, such as irritabil-

ity, family conﬂicts, school problems, substance use,

and problematic behavior (Bernaras et al., 2019).

Early diagnosis of depression is crucial to pre-

vent the condition from worsening and to improve

the effectiveness of treatment (Bal

azs et al., 2013). In

Brazil, only 19.2% of children and adolescents treated

in the Uniﬁed Health System (SUS) and diagnosed

with depression sought or received any (Coutinho

et al., 2014) treatment.

In an attempt to obtain a more accurate and early

diagnosis as possible, many studies have focused

on ﬁnding additional evidence, whether verbal, non-

482

Nascimento, L., Balbino, M., Teodoro, M. and Nobre, C.

Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context of Children and Adolescents with Depression.

DOI: 10.5220/0012386200003657

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2024) - Volume 2, pages 482-489

ISBN: 978-989-758-688-0; ISSN: 2184-4305

verbal, behavioral, linguistic, gestural, or social, that

can help diagnose or alert to the future development

of the depression in adolescence (Bal

azs et al., 2013).

On the other hand, the application of Artiﬁcial In-

telligence (AI) has achieved signiﬁcant advances in

different areas of health (Yu et al., 2018). In the area

of depression, we have several studies classifying pa-

tients with depression (Chu et al., 2018; Kim et al.,

2019). Therefore, computational methods can pro-

vide valuable help in diagnosing and classifying peo-

ple with depression.

Using Machine Learning (ML) algorithms is pos-

sible to assist in the prediction and decision-making

processes related to depression, using data such as im-

age, text, audio, and tabular data(Yu et al., 2018).

Tabular data, most common in healthcare, in-

cludes measurements and counts of clinical, labora-

tory, and historical values and must be represented

appropriately so that results are consistent, organized,

and can be used to create efﬁcient ML models (Yu

et al., 2018). In this case, the databases are composed

of instances (observations) and attributes (character-

istics), these can be numeric or nominal.

Data quality is essential for the excellent perfor-

mance of ML algorithms, so data preprocessing is

crucial due to common imperfections in raw data such

as missing values outliers, high attribute dimensional-

ity, and data balancing (Kotsiantis et al., 2006). An es-

sential step in this process is encoding, which converts

categorical attributes into numeric values, as most ML

algorithms require this input type.

The distribution of categorical characteristics in

health data plays an essential role in the interpretabil-

ity of results(Vellido, 2020). This is because choos-

ing appropriate coding can directly affect the ability

to understand and extract meaningful insights from

data, and some coding approaches can result in the

loss of important information.

Given the above, this article aims to investigate

different types of coding for nominal categorical at-

tributes in a healthcare database. The aim is to an-

alyze the performance and effectiveness of different

encodings in this data set and investigate whether

there is an encoding that presents more effective re-

sults, in addition to observing whether or not it helps

with the interpretability of the results.

To carry out this analysis, we used a database of

children and adolescents with different symptomatic

presentations of depression and eight different cod-

ings. One Hot (OHE), Dummy, and Customized One

Hot encodings ensure explicit representations of cat-

egories, while Frequency and Count ensure the fre-

quency of attribute options. Ordinal and Rank Hot

maintain the order and relationship between cate-

gories, and Hash reduces dimensionality while pre-

serving part of the information.

The remainder of this text follows the following

structure: Section 2 brings the reference, containing

the techniques usually used to convert nominal at-

tributes. Section 3 reviews some studies related to

coding techniques for categorical attributes and car-

dinality. Section 4 presents the methodology adopted

in this study. Section 5 explores the results achieved

during the experiments. Finally, Section 6 brings ﬁnal

considerations and points to future work.

2 BACKGROUND

2.1 Depression in Children and

Adolescents

MDD is characterized by sadness much of the day,

almost every day. Symptoms include signiﬁcant loss

of interest or pleasure in activities, changes in weight,

sleep problems, agitation or psychomotor retardation,

fatigue, feelings of worthlessness or excessive guilt,

difﬁculty concentrating, thoughts of death, and sui-

cidal ideation. MDD not only affects adults, but also

children and adolescents. Although they may vary de-

pending on age, symptoms are similar to those seen in

adults (Bernaras et al., 2019).

Depression in children leads to adverse effects on

socioemotional skills, resulting in relationship irri-

tability, social isolation, low school attendance, and

self-harm risk. Additionally, it disrupts personality

development, potentially leading to personality disor-

ders (Herrman et al., 2019).

The origin of depression in adolescence is multi-

factorial, involving several processes and risk factors

of a biological, interpersonal, cognitive, emotional

and personality. There is no single necessary and suf-

ﬁcient cause for depression, and therefore, a complete

understanding requires the simultaneous assessment

of multiple elements (Hankin, 2006).

For adequate treatment of depression, an accurate

diagnosis is necessary. Therefore, professionals who

care for children and adolescents and treat depression

should be familiar with the Diagnostic and Statistical

Manual of Mental Disorders V (DSM-V). This man-

ual presents essential criteria for diagnosing depres-

sion. The primary feature for diagnosing MDD is the

presence of 5 of 9 critical symptoms on most days for

two weeks, with at least 1 being depressed mood or

loss of interest/pleasure (Association et al., 2014).

Treating and preventing depression in children

and adolescents is crucial. Many of the treatments

Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context

of Children and Adolescents with Depression

483

for depression in this age group were initially devel-

oped based on treating adults. Essential evidence-

based treatments include pharmacotherapy, cognitive

behavioral therapy, and interpersonal therapy. A com-

bination of therapies and medications may also be

considered (Bal

azs et al., 2013).

2.2 Techniques for Encoding

Categorical Attributes

Often, datasets are organized in a tabular or matrix

structure in which rows and columns represent in-

stances and attributes, respectively.

For the encodings described in the following sub-

sections, consider the following notation: Let a

database be B = (I, A, C), where I is a set of instances,

A is a set of attributes, and C is the set of target classes.

The total number of instances is denoted by m.

Furthermore, another term used is the cardinal-

ity of the categorical attribute to evaluate the perfor-

mance of coding methods.

Cardinality in categorical variables indicates how

many unique values an attribute can have (card(a

) =

N). High cardinalities, with many different values,

are challenging in ML modeling.

2.2.1 Indicator Encoding

It is a generic concept that refers to two standard

methods of encoding categorical attributes, One Hot

Encoding and Dummy Encoding (Pargent et al.,

2022).

• One Hot Encoding: The method creates a new

column for each N level of a categorical attribute,

representing them as binary values (0 or 1) in

each row. The process generates a binary ma-

trix with columns corresponding to the number

of variations of this attribute (N) (cardinality).

In other words, high attribute cardinality results

in high-dimensional feature vectors in (Pargent

et al., 2022) databases.

Dummy Encoding: Dummy is an improved ver-

sion of OHE. However, unlike OHE, it uses N − 1

columns to represent an attribute with cardinality

2.2.2 Rank Hot Encoding

Rank Hot encoding is a variation of OHE in which

all attributes, including the current and previous cat-

egories, are set to ”hot” (1), and the remaining cat-

egories are set to cold (0). This technique is proper

when there is some ordinal or hierarchical relation-

ship between categories, and the order of the cate-

gories plays a relevant role (Buckman et al., 2018).

2.2.3 Frequency Encoding

This coding assigns a numerical value to each level

(N) of the categorical attribute according to its fre-

quency about the total occurrences (m) to that data set

column. Therefore, the most frequent values receive

higher numerical values, and the less frequent values

receive lower values. However, according to (Roy,

2019), it is possible to lose important information if

two distinct categories have the same occurrence rate.

2.2.4 Count Encoding

Count coding consists of counting the number of oc-

currences of a given categorical feature (N) in a spe-

ciﬁc attribute and replacing the names of these fea-

tures with the count performed. As with coding by

frequency, it is possible to lose important information

if two distinct categories have the same value of oc-

currences (Roy, 2019).

2.2.5 Ordinal Encoding

This coding transforms options for a categorical at-

tribute into a column of integers based on knowledge

of the number of existing categories (Pargent et al.,

2022). It is possible to provide a mapping to deﬁne a

speciﬁc order; otherwise, integer values are assigned

randomly. The result is a column of integers ranging

from 1 to N.

2.2.6 Hashing Encoding

The method transforms categorical data into a numer-

ical value using a hash function. It reduces extensive

data sets into smaller structures of ﬁxed size. How-

ever, hash functions are unidirectional, making it im-

possible to recover the original values after the hash

values are generated. Furthermore, the risk of com-

plications when different keys result in the same hash

value can compromise data interpretability in critical

situations (Kuhn and Johnson, 2019).

3 RELATED WORKS

Kovalerchuk and McCoy (2023) discuss the challenge

of generating accurate and interpretable ML models

for datasets with numeric and categorical attributes.

The authors suggest using numerical coding for non-

numeric attributes and present an algorithm called

HEALTHINF 2024 - 17th International Conference on Health Informatics

484

Sequential Rule Generation (SRG). The SRG algo-

rithm is successfully introduced to generate explain-

able rules in categorical data and is evaluated in sev-

eral computational experiments.

In another study, the authors evaluate the inﬂu-

ence of encoding numeric attributes on two highly

unbalanced fraudulent transaction data sets. Six cod-

ing methods were employed, belonging to Agnostic

Methods and Target-Based Methods. The study high-

lights the importance of choosing the appropriate en-

coding technique for imbalanced datasets, as target-

based encodings can have signiﬁcant performance in

the (Breskuvien

e and Dzemyda, 2023) model.

Cerda and Varoquaux (2022) investigated statis-

tical coding approaches for categorical attributes for

automated methods suitable for categories with high

cardinality. Two coding approaches, Min-hash En-

coder, and Gamma-Poisson Factorization, were com-

pared with other approaches on different databases.

The results showed that if interpretability is required,

Gamma-Poisson factorization is the most suitable al-

ternative, while if scalability is essential, the Min-

hash encoder is the most efﬁcient option.

4 MATERIALS AND METHODS

4.1 Database Description

To conduct this analysis, we used a database provided

by the Federal University of Minas Gerais related to

depression in children and adolescents aged between

10 and 16 years old, 158 males and 219 females, to-

taling 377 instances. The database comprises 74 at-

tributes that contain essential information for analyz-

ing depression in children and adolescents between

10 and 16 years old. These attributes cover several

aspects, including demographic, social, and mental

health-related data.

Furthermore, we categorized depression into two

groups based on percentile calculation, where 63 in-

dividuals were classiﬁed as having high depression,

while the rest were classiﬁed as having low depres-

sion. The cutoff point for this classiﬁcation was de-

ﬁned as the 85th percentile.

During pre-processing, two textual attributes and

27 attributes related to the CDI questionnaire totaled

45 input attributes. The description of the attributes is

detailed in Table 1.

4.2 Description of Methods

To evaluate different coding techniques for categor-

ical attributes, we created a pipeline for ML in the

language Python, employing several libraries widely

used in data science projects, such as Scikit-learn

Pandas

and Category Encoders.

. Figure 1 illus-

trates the pipeline used to carry out this study.

Figure 1: Pipeline adopted.

4.2.1 Preprocessing

1. Attribute Removal

Initially, the database contained two attributes

with insufﬁcient information, “What medicine

does the father take?” and “What medicine does

the mother take?”. Due to this limitation in data

quality, we decided to exclude these attributes.

Furthermore, textual attributes that referred to the

medication the individual’s parents took were re-

moved. We made this decision considering the

speciﬁc nature of this data and the complexity of

integrating it into our analysis.

Other attributes excluded from the analysis were

related to the CDI (Children’s Depression Inven-

tory) questionnaire. This exclusion was due to the

high correlation between these attributes and the

class attribute.

2. Categorical Attribute Coding

We use a speciﬁc coding, Ordinal Encoder, to

carry out the coding for ordinal categorical at-

tributes.

To deal with the non-ordinal nominal categorical

attribute “Lives with Who”, we employ eight un-

supervised categorical coders. Furthermore, we

adapted the OHE for coding that would trans-

form each family member option into a new at-

tribute, assigning the value one if the individual

Ofﬁcial documentation can be found at: Scikit-learn,

the version used was 1.2.2

The ofﬁcial documentation can be found at: Pandas,

the version used was 2.0 .2

Ofﬁcial documentation can be found at: Category En-

coders, the version used was 2.6.1.

Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context

of Children and Adolescents with Depression

485

Table 1: Description of the dataset and used attributes.

Description Attribute Type Possible Values

Demographics

Time spent with the mother/father per day during the week?

Discrete

Minimum = 0, Maximum = 24

Time spent with the mother/father per day on the weekend?

Discrete

Minimum = 0, Maximum = 24

Patient in psychological or psychiatric treatment?

Binary Yes, no

Parents live together or separated Binary Together, Separated/divorced

Father’s/mother’s education level? Ordinal Categorical

Did Not Study, Incomplete Primary School,

Complete Primary School, Incomplete High School,

Complete High School,Incomplete College, Complete College,

Don’t Know

Has the mother/father or anyone in their family been in

psychological or psychiatric treatment?

Binary Yes, no

Is the mother/father taking any continuous medication?

Binary Yes, no

Time the mother spends with her child(ren) per day

during the week

Discrete

Minimum = 2, Maximum = 24

Time the mother spends with her child(ren) per day

on the weekend

Discrete

Minimum = 4, Maximum = 24

Time the father spends with his child(ren) per day

during the week

Discrete

Minimum = 0 , Maximum = 24

Time the father spends with his child(ren) per day

on the weekend

Discrete

Minimum = 0 , Maximum = 24

Social Gender Binary Male, female

Age Discrete

Minimum = 10, Maximum = 16

Mother’s age Discrete

Minimum = 26, Maximum = 54

Is the mother/father working? Binary Yes, no

Father’s age Discrete

Minimum = 25, Maximum = 78

Who lives with the patient in their home? Nominal Categorical

[Father], [Mother], [Siblings], [Grandparents], [Father and Mother],

[Father, Mother, and Siblings], [Mother and Siblings],

[Mother, Siblings, and Others], [Father, Sibling, and Others],

[Mother and Others], [Father, Mother, Siblings, and Others],

[Father and Siblings], [Father, Mother, and Others],

[Siblings and Others], [Others], [Father and Stepmother],

[Father, Stepmother, and Others], [Mother and Stepfather],

[Mother, Stepfather, and Others], [Father and Others]

Questionnaire Youth Self-Report (YSR) Discrete

Minimum = 25, Maximum = 100

lives with that member and 0 if he does not. To

clarify, let us take as an example the option I

live with “father, mother and siblings”: each of

them was transformed into three new separate at-

tributes, with an exclusive column for “father”,

another for “mother” and a third for “siblings”.

Table 2 presents the Python coders and libraries

used to implement them.

Table 2: Libraries and methods used to implement the en-

codings.

Encoding Library Methods

Count Encoding Category Encoders CountEncoder

Customized One Hot Pandas -

Dummy Encoding Pandas get dummies

Frequency Encoding Pandas -

Hashing Encoding Category Encoders HashingEncoder

One Hot Encoding Pandas get dummies

Ordinal Encoding Category Encoders OrdinalEncoder

Rank Hot Encoding Category Encoders RankHotEncoder

During the coding process of categorical at-

tributes, we use the default settings of the cod-

ing methods except for the Dummy Encoding cod-

ing. For the Dummy Encoding technique, the

drop ﬁrst pattern was changed, which removes

the ﬁrst level of each column. This change results

in N-1 columns representing an attribute with a N

cardinality.

4.3 Learning Algorithms

This research employed six machine learning algo-

rithms to compare and determine the most effective

organization method among those presented. The al-

gorithms and their respective approaches were: De-

cision Tree - DecisionTreeClassiﬁer, Random Forest

- RandomForestClassiﬁer, Support Vector Machine

(SVM) - svm, Adaboost - AdaBoostClassiﬁer, XG-

Boost - XGBClassiﬁer, and Neural Network - MLP-

Classiﬁer. During testing, all settings were changed to

their default settings, as the focus is on evaluating the

encodings, not the learning algorithms themselves.

We adopted a proportion of 80% for training and

20% for testing to divide the data.

We evaluated the model’s generalization using a

10-fold cross-validation. We use Python 3.10.9 at ev-

ery stage, from pre-processing to hypothesis testing.

Taking into account that the problem analyzed is

classiﬁcation, the evaluation metric F1 Score

4.4 Interpretability

To assist in the interpretability of the model, we

adopted the SHAP (Lundberg and Lee, 2017) method,

through which it was possible to understand how the

model’s predictions are inﬂuenced by different re-

sources (attributes). We use the Dependence Plot as

F1-Score = 2 ×

Precision×Recall

Precision+Recall

HEALTHINF 2024 - 17th International Conference on Health Informatics

486

Table 3: Machine Learning Algorithm Metrics Results for Each Encoding.

Results for F1 Score (Mean/Standard Deviation)

Encoding AdaBoost Decision Tree Neural Network Random Forest SVM XGBoost

Count 0.6778/0.127 0.7106/0.1054 0.4547/0.0005 0.7698/0.1435 0.5725/0.1031 0.7202/0.1323

Customized One Hot 0.6993/0.0736 0.7475/0.1228 0.6752/0.164 0.7722/0.1565 0.6297/0.1056 0.7819/0.1011

Dummy 0.7119/0.0956 0.7390/0.1234 0.7022/0.1488 0.7687/0.1504 0.6297/0.1056 0.7761/0.0988

Frequency 0.6846/0.0746 0.7405/0.1017 0.4547/0.0005 0.7479/0.1471 0.6395/0.1205 0.7781/0.1071

Hash 0.7082/0.0894 0.7570/0.1159 0.6869/0.1438 0.7403/0.1640 0.6297/0.1056 0.7883/0.0935

One Hot 0.7123/0.0735 0.7395/0.1321 0.656/0.1034 0.7375/0.1563 0.6297/0.1056 0.761/0.1017

Ordinal 0.6954/0.0921 0.7605/0.1045 0.4547/0.0005 0.7585/0.1421 0.6395/0.1205 0.764/0.1013

Rank Hot 0.7036/0.1136 0.7366/0.1071 0.6879/0.1158 0.7646/0.1529 0.6297/0.1056 0.765/0.0981

a visual tool to deepen understanding of interpretable

and non-interpretable encodings.

5 RESULTS AND DISCUSSIONS

Table 3 presents the results of tests with six categor-

ical attribute encoding methods and ML algorithms.

The proper choice of encoding is crucial for accurate

outcomes, given the signiﬁcance of nominal attributes

and the necessity of numeric data for ML algorithms.

The results are organized alphabetically based on

the encodings and learning algorithms. Additionally,

we highlight in bold the best encodings for each ML

algorithm. Considering the t-test, all tests were per-

formed with 95% conﬁdence.

The study analyzed different encodings of cate-

gorical attributes in a database related to depression in

children and adolescents provided by the Federal Uni-

versity of Minas Gerais. The algorithms used were

AdaBoost, Decision Tree , Neural Networks, Random

Forest, SVM, and XGBoost.

”The learning algorithm that proved most effec-

tive in achieving the best encoding results was XG-

Boost. The encodings Hash, Customized One Hot,

Frequency, and Dummy yielded the highest F1 scores,

as conﬁrmed through a t-test

with this algorithm.”

We use a method to assist with interpretability, as

it is essential to understand the meaning of each at-

tribute. In this case, how the attribute was encoded

can contribute to or harm the interpretability of the

explanations: both Hash, Frequency, and Count en-

coding have problems in terms of interpretability. For

example, in Figure 2, we present one of the SHAP

graphs, the Dependence Plot, with possible encodings

for the attribute “Lives with Who” from the Depres-

sion base. This graph aims to analyze the impact of a

given attribute on a set of database instances. On the

graph, each point presented represents an instance of

the base impacted by the attribute in the different en-

Hash: p-value: [0-0.0029], Customized One Hot: p-

value: [0-0.0375], Frequency: p-value: [0-0.0447], and

Dummy: p-value: [0]

codings. Such impact, measured by the SHAP value,

can be positive, negative, or null. In this case, the

instances whose SHAP value is negative are those in

which the attribute value hurts high depression. On

the other hand, instances with a positive SHAP value

are those whose attribute value contributes to high de-

pression symptoms.

In Figure 2a, in which the attribute was coded

with Ordinal Encoder, it is possible to notice the cases

in which the attribute “Lives with Who” always con-

tributed to the classiﬁcation as low symptomatology

for depression (cases 1 – Mother and brothers, 2 - Fa-

ther, mother and brothers, 5 - Father and brothers, 8 -

Mother) and the cases in which always favored classi-

ﬁcation as high symptoms (cases 9 - Mother, brothers

and others, 10 - Father, mother and others, 11 - Fa-

ther, stepmother and others, 12 - Mother, stepfather

and others, 13 - Brothers and others, 14 - Father, 15

- Others, 16 - Father and others, 17 - Father, broth-

ers and others). In the remaining cases (3 – Grandfa-

ther and grandmother, 4 – Mother, siblings, and oth-

ers, 6 – Father and mother, 7 – Mother and others),

the attribute sometimes contributes to low symptoms

and, in others, to high symptoms. Primarily, we high-

light that in this encoding, it is possible to interpret

the model results with the behavior of the attribute.

In Figure 2b, observing the entry “col 1” gen-

erated in the Hash encoding, it is clear that when

“col 1” is equal to 0 (zero), we have negative val-

ues of SHAP value indicating favor low symptomatol-

ogy for depression. On the other hand, when “col 1”

is equal to 1 (one), the attribute contributes to high

depression symptoms. However, Hash functions are

unidirectional, making it impossible to revert to the

original values, which leads to the fact that “col 1”

has no meaning in the context of the problem, com-

promising the interpretability of the explanations to

be presented to the user and, possibly, its acceptance.

Finally, we work with the frequency and count en-

codings in Figures 2c and 2d. It is possible to ex-

tract interpretations of the attribute values as long as

different values represent different response options.

In frequency coding, each option is coded based on

how often it appears in the attribute, while in count

Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context

of Children and Adolescents with Depression

487

(a) Attribute encoded with ordinal encoder (b) Hash encoded attribute

Figure 2: Dependence plot with different encodings for the attribute “Lives with Who”

coding, the value represents the number of times the

answer option appears in the attribute. For example,

when we ﬁnd the value 0.0158 in the frequency cod-

ing and the value 6 in the count coding for the at-

tribute “Lives

with Who” it could be any of the re-

sponse options: ‘Grandfather and Grandmother’, ‘Fa-

ther, Mother and Others’, ‘Mother, Stepfather and

Others’ and ‘Siblings and Others’, and it is not pos-

sible to determine which speciﬁc response option is

represented by these values in their respective encod-

ings.

Regarding cardinality, it is essential to note that

Hash, Customized One Hot, and Dummy coding tech-

niques increase the dimensionality of the data set. For

example, Hash encoding turns a single column into

eight columns, Customized One Hot encoding into

nine columns, and Dummy encoding into 19 columns.

This increase in dimensionality can become a prob-

lem due to increased computational cost, overﬁtting,

and analysis complexity.

The Neural Network and SVM learning algo-

rithms did not perform satisfactorily for this database,

considering the majority of encodings used.

Given the above, when it comes to coding, in the

case of this speciﬁc database, the best algorithm to

use is XGBoost. Furthermore, the user can choose

between Hash, Customized One Hot, Frequency, and

Dummy encodings, depending on what they want: a

higher F1 Score or the interpretability of the results.

6 FINAL CONSIDERATIONS

This study aimed to investigate the existence of a

technique for coding non-ordinal nominal attributes

that demonstrates consistent results when applied to

ML algorithms in a healthcare database linked to de-

pression in children and adolescents, as well as ob-

serve whether it is interpretable. Several unsupervised

coding techniques were explored, each with its advan-

tages and limitations.

Analyzing each algorithm’s metrics and applying

the t-test made it possible to evaluate the performance

of the different coding techniques. In the tests carried

out, it was observed that the Hash, Customized One

Hot, Frequency, and Dummy encodings achieved su-

perior performance, although not all of them were in-

terpretable. However, there is room to improve the

pipeline used in order to obtain more satisfactory re-

sults. This includes ﬁne-tuning the parameters of ML

encodings and algorithms. Pay special attention to

hash encoding, exploring different types of Hash en-

coding, and adjusting the number of bits used.

Encoding nominal categorical attributes is an ef-

ﬁcient solution to deal with the qualitative nature of

these attributes. It transforms categories into suitable

numeric representations, effectively using this infor-

mation in ML algorithms. This leads to more accurate

and efﬁcient solutions, advances in data analysis, and

assists in decision-making, especially in the health-

HEALTHINF 2024 - 17th International Conference on Health Informatics

488

care sector.

For future research, it is recommended to investi-

gate broader healthcare datasets to validate different

hit techniques, varying in size and complexity. These

analyses should include databases with high cardi-

nality to compare supervised and unsupervised ap-

proaches, improving programming techniques in ma-

chine learning projects in healthcare.

ACKNOWLEDGEMENTS

The authors thank the National Council for Scien-

tiﬁc and Technological Development of Brazil (CNPq

- Conselho Nacional de Desenvolvimento Cient

ıﬁco

e Tecnol

ogico – Code: 311573/2022-3), the Pon-

tif

ıcia Universidade Cat

olica de Minas Gerais –

PUC-Minas, the Coordination for the Improvement

of Higher Education Personnel - Brazil (CAPES –

Grant PROAP 88887.842889/2023-00 – PUC/MG,

Grant PDPG 88887.708960/2022-00 – PUC/MG -

Inform

atica and Finance Code 001), and the Foun-

dation for Research Support of Minas Gerais State

(FAPEMIG – Code: APQ-03076-18).

REFERENCES

Association, A. P. et al. (2014). DSM-5: Manual

diagn

ostico e estat

ıstico de transtornos mentais.

Artmed Editora.

Bal

azs, J., Mikl

osi, M., Kereszt

eny,

A., Hoven, C. W., Carli,

V., Wasserman, C., Apter, A., Bobes, J., Brunner, R.,

Cosman, D., et al. (2013). Adolescent subthreshold-

depression and anxiety: Psychopathology, functional

impairment and increased suicide risk. Journal of

child psychology and psychiatry, 54(6):670–677.

Bernaras, E., Jaureguizar, J., and Garaigordobil, M. (2019).

Child and adolescent depression: A review of theo-

ries, evaluation instruments, prevention programs, and

treatments. Frontiers in psychology, 10:543.

Breskuvien

e, D. and Dzemyda, G. (2023). Categori-

cal feature encoding techniques for improved classi-

ﬁer performance when dealing with imbalanced data

of fraudulent transactions. International Journal of

Computers Communications & Control, 18(3).

Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. (2018).

Thermometer encoding: One hot way to resist adver-

sarial examples. In International conference on learn-

ing representations.

Cerda, P. and Varoquaux, G. (2022). Encoding high-

cardinality string categorical variables. IEEE Transac-

tions on Knowledge and Data Engineering, 34:1164–

1176.

Chu, S.-H., Lenglet, C., Schreiner, M. W., Klimes-Dougan,

B., Cullen, K., and Parhi, K. K. (2018). Classifying

treated vs. untreated mdd adolescents from anatomical

connectivity using nonlinear svm. In 2018 40th An-

nual International Conference of the IEEE Engineer-

ing in Medicine and Biology Society (EMBC), pages

1–4.

Coutinho, M. P. L., Oliveira, M. X., Pereira, D. R., and

Santana, I. O. (2014). Indicadores psicom

etricos do

invent

ario de depress

ao infantil em amostra infanto-

juvenil. Avaliac¸ao Psicologica: Interamerican Jour-

nal of Psychological Assessment, 13:269–276.

Hankin, B. L. (2006). Adolescent depression: Descrip-

tion, causes, and interventions. Epilepsy & Behavior,

8(1):102–114.

Herrman, H., Kieling, C., McGorry, P., Horton, R., Sargent,

J., and Patel, V. (2019). Reducing the global burden

of depression: a lancet–world psychiatric association

commission. The Lancet, 393(10189):e42–e43.

Kim, D., Kang, P., Kim, J., Kim, C. Y., Lee, J.-H., Suh,

S., and Lee, M.-S. (2019). Machine learning classi-

ﬁcation of ﬁrst-onset drug-naive mdd using structural

mri. IEEE Access, 7:153977–153985.

Kotsiantis, S. B., Kanellopoulos, D., and Pintelas, P. E.

(2006). Data preprocessing for supervised leaning.

International journal of computer science, 1(2):111–

117.

Kovalerchuk, B. and McCoy, E. (2023). Explain-

able machine learning for categorical and mixed

data with lossless visualization. arXiv preprint

arXiv:2305.18437.

Kuhn, M. and Johnson, K. (2019). Feature Engineering

and Selection: A Practical Approach for Predictive

Models. CRC Press.

Lundberg, S. M. and Lee, S.-I. (2017). A uniﬁed approach

to interpreting model predictions. In Advances in neu-

ral information processing systems, pages 4765–4774.

PAHO (2022). Depression. https://www.paho.org/en/topics

/depression.

Pargent, F., Pﬁsterer, F., Thomas, J., and Bischl, B.

(2022). Regularized target encoding outperforms tra-

ditional methods in supervised machine learning with

high cardinality features. Computational Statistics,

37:2671––2692.

Roy, B. (2019). All about categorical variable encoding.

Vellido, A. (2020). The importance of interpretability and

visualization in machine learning for applications in

medicine and health care. Neural computing and ap-

plications, 32(24):18069–18083.

WHO (2021). Depression. https://www.who.int/news-roo

m/fact-sheets/detail/adolescent-mental-health.

Yu, K.-H., Beam, A. L., and Kohane, I. S. (2018). Artiﬁ-

cial intelligence in healthcare. Nature biomedical en-

gineering, 2(10):719–731.

Assessment of the Relationship Between Attribute Coding and the Interpretability of Machine Learning Models: An Analysis in the Context

of Children and Adolescents with Depression

489