Utility of Anonymised Data in Decision Tree Derivation
Jack R. Davies and Jianhua Shao
School of Computer Science & Informatics, Cardiff University, U.K.
Keywords:
Data Anonymisation, Data Utility, Decision Tree.
Abstract:
Privacy Preserving Data Publishing (PPDP) is a practice for anonymising microdata such that it can be publicly
shared. Much work has been carried out on developing methods of data anonymisation, but relatively little
work has been done on examining how useful anonymised data is in supporting data analysis. This paper eval-
uates the utility of k-anonymised data in decision tree derivation and examines how accurate some commonly
used metrics are in estimating this utility. Our results suggest that whilst classification accuracy loss is mini-
mal in most common scenarios, using a small selection of simple metrics when calibrating a k-Anonymisation
could help significantly improve decision tree classification accuracy for anonymised data.
1 INTRODUCTION
With the increase in personal data collection, storage
and use by a growing number of corporations and or-
ganisations, there has been a corresponding rise in the
population’s concern for their privacy. To alleviate
these concerns, governments require that individuals’
privacy is protected in the sharing of sensitive data. To
comply with these requirements, data publishers use
a process known as Privacy Preserving Data Publish-
ing (PPDP) (Fung et al., 2010). The major challenge
with PPDP is to ensure that the privacy of individu-
als is maintained, whilst also retaining the usefulness
of the original data. Naïve approaches such as sim-
ply removing explicit identifiers (e.g., driving license
number) from the data set, or redacting contextual in-
formation, are not sufficient. A more sophisticated
approach is needed whereby the data satisfies given
privacy requirements defined in a privacy model, to
protect against potential attacks.
One such model is k-Anonymity (Samarati and
Sweeney, 1998). k-Anonymity requires each record
in a data set to be indistinguishable from at least k − 1
other records over the set of quasi-identifiers (QIDs).
QIDs are attributes in the data set that are externally
available and can be used to link a record to a spe-
cific individual – an attack known as Record Link-
age. For example, Age and Occupation are possible
QIDs that could be used to identify a specific per-
son in a data set if there is a unique combination of
their values within it. A given data set will rarely sat-
isfy k-Anonymity; thus, the data will need to be mod-
ified through anonymisation operations. For exam-
ple, if the Occupation values ‘Lawyer’ and ‘Doctor’
do not appear k times in a data set individually, we
can choose to generalise both into ‘Professional’ or
simply ‘{Lawyer, Doctor}’. The data set will then
contain at least k records with the generalised value
instead, thereby satisfying k-Anonymity.
Whilst k-Anonymity ensures anonymity in the
data, it must retain utility for recipients of the data
too. The degree to which this utility is maintained
has not been comprehensively studied; this paper
examines it in the context
of one particular area of data analysis – classification
using decision trees. We adapt the basic ID3 algo-
rithm (Quinlan, 1986) to derive decision trees from
anonymised data and we then use this adapted algo-
rithm to train and test decision trees in four differ-
ent scenarios, comparing and evaluating the results
from each scenario to measure the data utility in terms
of classification accuracy. In addition, we measure
the utility of the anonymised data using some metrics
commonly found in the literature and compare these
measures to classification accuracy. This grants in-
sight into the reliability of these metrics in estimating
the utility of anonymised data in decision tree classi-
fication.
The rest of the paper is organised as follows: In
Section 2, we briefly discuss some related work; Sec-
tion 3 presents essential background information; in
Section 4, we present our experiments and report our
results in Section 5; finally, in Section 6, we conclude
the paper.
2 RELATED WORK
Much of the work done on evaluating the utility of
anonymised microdata has been performed in order to
evaluate a novel algorithm or method for achieving a
certain privacy model (LeFevre et al., 2006; Li et al.,
2011; Tang et al., 2010). There has been much less
work performed on comprehensively evaluating the
utility of data anonymised using existing methods.
Ayala-Rivera et al. (Ayala-Rivera et al., 2014)
performed experiments on several k-Anonymity al-
gorithms, including the Mondrian algorithm used
for this paper, to measure the efficiency and utility
retention of the different algorithms for practition-
ers. However, the experiments performed only utilise
general-purpose metrics; as such, it lacks the depth of
investigation proposed in this paper. Primarily basing
evaluation of utility on measurements of metrics is a
common theme among much of the related work. In
contrast, this paper attempts to link the measurements
of these metrics to concrete experimental results.
Shao & Beckford (Shao and Beckford, 2017) eval-
uated the decision tree classification accuracy of data
anonymised using the Mondrian algorithm. They did
so by performing a series of experiments using the
ADULT data set to train and test an ID3 decision tree.
Their results show that whilst there is a degradation of classi-
fication accuracy for anonymised data sets compared
to non-anonymised, the degradation is minimal. This
is a similar study to the one described in this paper.
However, it differs in that this paper considers multi-
ple possible scenarios for training and testing of the
ID3 decision tree. It also differs in the implemen-
tation of the ID3 algorithm, and in the utilisation of
the two modes of the Mondrian algorithm. Furthermore,
their work does not examine metrics in relation to the
utility results as this paper proposes.
It is notable that the majority of related work uses
metrics as part of the evaluation in some capacity. It is
clear that these metrics are heavily relied upon in the
evaluation of anonymised data, providing obvious motiva-
tion for the work described in this paper.
3 BACKGROUND AND METHODS
3.1 k-Anonymity
k-Anonymity is a privacy model first introduced by
Samarati & Sweeney (Samarati and Sweeney, 1998).
A data table is said to provide k-Anonymity “if at-
tempts to link explicitly identifying information to its
contents ambiguously map the information to at least
k entities” (Samarati and Sweeney, 1998). This can be
achieved by ensuring that each unique record is iden-
tical to at least k − 1 other records over the QIDs. This
set of identical records is referred to as an equivalence
class (EC).
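As an illustration, the following Python sketch (ours, not from the original paper) checks whether a table satisfies k-Anonymity over a chosen set of QIDs; each group of records sharing identical QID values is an equivalence class.

from collections import Counter

def is_k_anonymous(records, qids, k):
    """records: list of dicts; qids: list of QID attribute names."""
    # Count the size of each equivalence class, i.e. each distinct QID tuple.
    ec_sizes = Counter(tuple(r[q] for q in qids) for r in records)
    return all(size >= k for size in ec_sizes.values())

table = [{"Age": "30-40", "Occupation": "Professional"},
         {"Age": "30-40", "Occupation": "Professional"},
         {"Age": "20-30", "Occupation": "Trade"},
         {"Age": "20-30", "Occupation": "Trade"}]
print(is_k_anonymous(table, ["Age", "Occupation"], 2))  # True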
3.2 The Mondrian Algorithm
The algorithm we use to anonymise a raw data set
in this paper is the Mondrian Multidimensional algo-
rithm (LeFevre et al., 2006). The algorithm is imple-
mentable in two modes: Strict or Relaxed. The pseu-
docode for the Strict algorithm can be seen in Algo-
rithm 1.
Algorithm 1: Mondrian - Strict.
1: function Anonymise(D)
2:   if D cannot be partitioned then
3:     return D
4:   else
5:     Xi ← chooseAttribute(D)
6:     F ← frequencySet(D, Xi)
7:     pv ← median(F)
8:     lhs ← {t ∈ D | t.Xi ≤ pv}
9:     rhs ← {t ∈ D | t.Xi > pv}
10:  end if
11:  return Anonymise(lhs) ∪ Anonymise(rhs)
12: end function
The Mondrian algorithm is recursive. The data set
D used as input is initially the entire data set requir-
ing anonymisation. The algorithm in both Strict and
Relaxed modes will first determine if D can be par-
titioned (line 2) by ensuring D is large enough to be
split into two subsets of minimum size k. Xi in this
implementation is simply the QID attribute with the
widest normalised range of values (line 5). The fre-
quency of each unique value in Xi is counted and the
values ordered in a frequency set F (line 6). For this
implementation, the ordering used was alphabetical
for categorical values and ascending for numerical.
Partitioning of records is then performed based on the
position of the values of Xi in F relative to the median
value of F, the “pivot value” pv ∈ Xi (line 7). The
Strict and Relaxed algorithms differ in how the
partitioning is performed. A Strict partitioning does
not allow intersecting values between the two subsets
of D resulting from a cut (lhs & rhs); whereas, the
Relaxed algorithm allows this by modifying the parti-
tioning (lines 8-9) to that shown below:
lhs ← {t ∈ D | t.Xi < pv}
med ← {t ∈ D | t.Xi = pv}
rhs ← {t ∈ D | t.Xi > pv}
The records in med are then distributed into lhs
and rhs such that neither exceeds the median count of
values in F.
Following this, recursive calls are made to the al-
gorithm with lhs & rhs as input D (line 11), result-
ing in a partition of the initial data set with minimally
sized subsets. All that remains is the generalisation
of the values in each subset for all selected QIDs.
The generalised subsets are our equivalence classes,
the union of which is a k-Anonymisation of the initial
data set.
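To make the recursion concrete, the following Python sketch (ours, not the authors' implementation) follows Algorithm 1 for numerical QIDs only; it omits the final generalisation step and the normalisation of ranges, and all helper names are our own.

def anonymise(D, qids, k):
    """Recursively partition D (a list of dicts) into groups of size >= k."""
    cut = choose_cut(D, qids, k)
    if cut is None:                              # D cannot be partitioned
        return [D]                               # one equivalence class
    attr, pv = cut
    lhs = [t for t in D if t[attr] <= pv]        # strict cut: no overlap
    rhs = [t for t in D if t[attr] > pv]
    return anonymise(lhs, qids, k) + anonymise(rhs, qids, k)

def choose_cut(D, qids, k):
    """Try QIDs from widest to narrowest range; return (attribute, pivot)
    if a median cut leaves at least k records on each side, else None."""
    if len(D) < 2 * k:
        return None
    for attr in sorted(qids, key=lambda a: -value_range(D, a)):
        values = sorted(t[attr] for t in D)
        pv = values[(len(values) - 1) // 2]      # median pivot value
        lhs_size = sum(1 for v in values if v <= pv)
        if k <= lhs_size <= len(D) - k:
            return attr, pv
    return None

def value_range(D, attr):
    vals = [t[attr] for t in D]
    return max(vals) - min(vals)

ages = [{"Age": a} for a in [21, 25, 30, 34, 41, 45]]
print([[t["Age"] for t in ec] for ec in anonymise(ages, ["Age"], 2)])
# [[21, 25, 30], [34, 41, 45]]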
3.3 ID3 Decision Trees and Induction
The ID3 algorithm (Quinlan, 1986) was selected as
the method of building decision trees in this pa-
per. The implementation is as described in (Mitchell,
1997), with some modifications.
The first modification relates to the calculation of
entropy. The entropy of a set of records D with an ar-
ray of different classification values in class attribute
φ is given by the following:
Entropy(D) = −∑_{i∈φ} P_i log_2 P_i

where P_i is the proportion of D with class i in class
attribute φ.
The entropy can then be used in the following cal-
culation of Information Gain (IG):

IG(D, A) = Entropy(D) − ∑_{v∈A} (|D_v| / |D|) · Entropy(D_v)

where A is the selected attribute and D_v is the subset
of records with value v in A.
We must consider how to deal with generalised
values when calculating entropy. P_i can be written
as |D_{φ=i}| / |D|, where D could be the entire data set
or a subset, which poses no issue where generalised
values are concerned. However, in the information
gain calculation, we need to also find the entropy
of D_v. In an anonymised data set, for any given
record, value v may be generalised. To deal with
this, we use the same method as in Shao & Beckford
(Shao and Beckford, 2017). Each value in a generali-
sation value-set is equally likely to be the true value.
In a generalisation containing r values, we can con-
sider each value in the generalisation to be worth 1/r.
An example of this is shown below. The table
shows records for a generic attribute and correspond-
ing class, followed by entropy calculations for each
attribute value.
Attribute   Class
A           X
B           X
B           Y
{A, B}      Y
{A, B}      Y
{A, B}      X

Entropy(Attrib_A) = −(1.5/2.5) log_2(1.5/2.5) − (1/2.5) log_2(1/2.5)

Entropy(Attrib_B) = −(1.5/3.5) log_2(1.5/3.5) − (2/3.5) log_2(2/3.5)
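The calculation above can be reproduced with a short Python sketch (ours); a generalised value is represented as a tuple and contributes a fractional weight of 1/r to each value it contains.

import math
from collections import defaultdict

rows = [("A", "X"), ("B", "X"), ("B", "Y"),
        (("A", "B"), "Y"), (("A", "B"), "Y"), (("A", "B"), "X")]

# value -> class -> fractional weight
weights = defaultdict(lambda: defaultdict(float))
for value, cls in rows:
    candidates = value if isinstance(value, tuple) else (value,)
    for v in candidates:
        weights[v][cls] += 1 / len(candidates)   # each value worth 1/r

for v, by_class in weights.items():
    total = sum(by_class.values())               # 2.5 for A, 3.5 for B
    entropy = -sum((w / total) * math.log2(w / total)
                   for w in by_class.values())
    print(v, round(entropy, 3))                  # A: 0.971, B: 0.985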
In addition to the calculation of entropy, we also
need to consider how to integrate generalised values
as branches in the decision tree. We cannot simply use
the generalised value itself. Consider the generalised
value {Doctor, Lawyer} used as a branch in a tree.
This implies that we are posing the logical question
“is the Occupation value Doctor or Lawyer?” when
classifying a record. This may seem fine, but if there
exists another branch at the same level of the tree la-
belled with {Doctor, Mechanic}, how would we de-
cide which branch to traverse when the record being
classified has the value ‘Doctor’? Further problems
are encountered if the record being classified itself has
generalised values.
It would be much simpler to have exclusively spe-
cific values as branches in the tree. To do this, we map
back the generalisations using two different methods:
Random: We take a specific value from the gen-
eralised value-set at random and use this as the
branch value in the tree.
Statistical: We consider the possibility of re-
leasing anonymisations with a frequency distri-
bution of the QID values from the original data.
This statistical information could then be used
to potentially recreate the original data more ac-
curately, providing better classification accuracy.
We weight each value by its frequency in the orig-
inal data. This does not guarantee that the correct
value will be mapped back; however, this is to be
expected, as otherwise anonymisation would be re-
dundant.
In this paper, we examine both methods to deter-
mine if there is any utility in releasing statistical in-
formation with an anonymisation.
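The following Python sketch (ours) illustrates the two mapping methods; the frequency table passed to map_statistical stands in for the per-attribute statistics a publisher would release alongside the anonymised data.

import random

def map_random(value_set):
    """Pick a specific value from a generalised value-set uniformly."""
    return random.choice(sorted(value_set))

def map_statistical(value_set, freq):
    """Pick a specific value weighted by its original-data frequency."""
    values = sorted(value_set)
    weights = [freq.get(v, 0) for v in values]
    return random.choices(values, weights=weights, k=1)[0]

occupation_freq = {"Lawyer": 120, "Doctor": 30}   # illustrative counts
print(map_random({"Lawyer", "Doctor"}))
print(map_statistical({"Lawyer", "Doctor"}, occupation_freq))
# 'Lawyer' is drawn roughly 80% of the time by the statistical mapping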
The final modification we make to the ID3 algo-
rithm regards numerical values. The data set we use
in our experiments includes numerical values as im-
portant determiners. The standard ID3 algorithm does
not deal with numerical values, so we use the method
suggested in (Mitchell, 1997) to allow them to be
considered.
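The method in Mitchell (1997) turns a numerical attribute into candidate boolean tests of the form A < t: records are sorted by the attribute and a candidate threshold is placed midway between adjacent values whose class labels differ, with each candidate then evaluated by information gain like any other attribute. A minimal sketch (ours) of the candidate generation:

def candidate_thresholds(examples):
    """examples: list of (numeric value, class label) pairs."""
    pairs = sorted(examples)
    return sorted({(a + b) / 2                       # midpoint threshold
                   for (a, ca), (b, cb) in zip(pairs, pairs[1:])
                   if ca != cb and a != b})          # class changes here

print(candidate_thresholds([(40, "N"), (48, "N"), (60, "Y"), (80, "Y")]))
# [54.0]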
4 EXPERIMENT DESIGN
4.1 Experimental Setup
We want to evaluate how useful an anonymised data
set is in decision tree classification. To do this, we
first find classification accuracy results from the four
possible scenarios:
1. Classification of Non-Anonymised data using a
decision tree trained on Non-Anonymised data
2. Classification of Anonymised data using a deci-
sion tree trained on Anonymised data
3. Classification of Non-Anonymised data using a
decision tree trained on Anonymised data
4. Classification of Anonymised data using a deci-
sion tree trained on Non-Anonymised data
For brevity, we refer to each scenario in the rest of
this paper by its number. We use Scenario 1 as
the baseline for our experiments; it is the scenario in
which maximal information is available to the clas-
sifier. Comparisons between these scenarios should
provide insight into how much information is lost
through anonymisation, allowing an evaluation on
utility.
In this paper, we use the ADULT data set (Dua and
Graff, 2017) for all experiments to train and test deci-
sion trees and classify records. It is the de facto stan-
dard for the evaluation of anonymisation algorithms.
Of the 15 attributes in the data set, the income at-
tribute is the class attribute. We omit the attributes fnl-
wgt (not useful for our purpose) and education-num
(enumeration of education). This leaves us with 12
attributes that can be used to classify a record. We
also remove records with missing values, resulting in
45,222 records for our experiments.
We run a series of experiments using the following
process:
1. k-Anonymise ADULT data set with given QIDs &
k-value.
2. Train and test the data according to one of the four
scenarios explained above.
3. Use 6-fold cross-validation testing to measure
classification accuracy.
We repeat this process for the two modes of the Mon-
drian algorithm and for the two types of generalisa-
tion mapping. Note that this means that the Scenario
1 experiment will only be performed once as there is
no anonymisation to vary the k-value, and no require-
ment to change the algorithm mode or mapping type.
Furthermore, the Scenario 4 experiments will not be
repeated for the two mapping types as training of the
decision tree is done on Non-Anonymised data.
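A condensed sketch (ours) of this evaluation loop is shown below; anonymise_fn, train_fn and accuracy_fn are placeholders for the Mondrian anonymiser, the adapted ID3 trainer and the accuracy measure, and the scenario number selects which side of the split is anonymised.

def evaluate_scenario(data, scenario, k, anonymise_fn, train_fn,
                      accuracy_fn, folds=6):
    """Return classification accuracy averaged over 6-fold cross-validation."""
    size = len(data) // folds
    scores = []
    for i in range(folds):
        test = data[i * size:(i + 1) * size]
        train = data[:i * size] + data[(i + 1) * size:]
        if scenario in (2, 3):              # tree trained on anonymised data
            train = anonymise_fn(train, k)
        if scenario in (2, 4):              # anonymised records are classified
            test = anonymise_fn(test, k)
        tree = train_fn(train)
        scores.append(accuracy_fn(tree, test))
    return sum(scores) / folds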
4.2 Metric Correlations
Once results have been collected from the utility eval-
uation, we try to find a correlation between those re-
sults and a selection of relevant metrics. The metrics
chosen for this paper are the Discernability Metric,
ILoss and the Classification Metric.
Discernability Metric (DM) (Slowinski, 1992) is
almost ubiquitous in the related literature. It assigns
a penalty to an anonymisation based on the size of
equivalence classes; a higher penalty suggests that
records are less discernible. The penalty is given by
the following formula:
DM = ∑_{E∈T} |E|²
where E is an equivalence class and T is the data table
being evaluated.
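Computing DM for an anonymised table is straightforward, as this Python sketch (ours) shows; records are grouped into equivalence classes by their generalised QID values, which must be hashable (e.g. strings or frozensets).

from collections import Counter

def discernability(records, qids):
    """Sum of squared equivalence-class sizes over the anonymised table."""
    ec_sizes = Counter(tuple(r[q] for q in qids) for r in records)
    return sum(size ** 2 for size in ec_sizes.values())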
ILoss (Tao and Xiao, 2008) is a metric that tries
to consider the information loss from the generalisa-
tion of values. Fung et al. (Fung et al., 2010) state:
“ILoss measures the fraction of domain values gener-
alized by [a generalised value] v_g.” The ILoss mea-
surement for a specific generalised value is given by:

ILoss(v_g) = (|v_g| − 1) / |D_A|

where A is the attribute of the value v_g, |v_g| is the num-
ber of values within the domain of A that are descen-
dants of v_g, and |D_A| is the total number of values in
the domain of A. The total ILoss for a given table can
then be found by simply summing the measurements
for each generalised value in the table.
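A sketch (ours) of this calculation, where a generalised value is represented as a frozenset of the domain values it covers and a specific value incurs no penalty:

def iloss_table(records, domains):
    """domains: attribute name -> domain size |D_A| for that attribute."""
    total = 0.0
    for r in records:
        for attr, value in r.items():
            # |v_g| is 1 for a specific value, so its penalty is zero.
            vg_size = len(value) if isinstance(value, frozenset) else 1
            total += (vg_size - 1) / domains[attr]
    return total

row = {"Occupation": frozenset({"Lawyer", "Doctor"})}
print(iloss_table([row], {"Occupation": 10}))   # (2 - 1) / 10 = 0.1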
Finally, we examined the Classification Metric (CM)
(Iyengar, 2002). This is a specialised metric designed
specifically for measuring utility regarding classifica-
tion. The metric is defined:
CM = (∑_{r∈T} penalty(r)) / N

where r is a row in the table T, N is the total number
of rows, and the penalty function is defined:

penalty(r) = 1, if r is suppressed
             1, if class(r) ≠ majority(E(r))
             0, in all other cases
Here E(r) is the equivalence class record r belongs to.
In this paper, we can ignore the first penalty case as
suppression, another method of anonymisation differ-
ent to generalisation, is not used. The case where the
class of record r is not the majority of its EC adds a
penalty if we have an EC that is not homogeneous in
its classification.
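A sketch (ours) of CM without the suppression case: records are grouped into equivalence classes by their generalised QID values, and each record whose class differs from its EC's majority class contributes a penalty of 1, normalised by the table size.

from collections import Counter, defaultdict

def classification_metric(records, qids, class_attr):
    """Fraction of records disagreeing with their EC's majority class."""
    ecs = defaultdict(list)
    for r in records:
        ecs[tuple(r[q] for q in qids)].append(r[class_attr])
    penalty = 0
    for classes in ecs.values():
        majority = Counter(classes).most_common(1)[0][0]
        penalty += sum(1 for c in classes if c != majority)
    return penalty / len(records)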
5 EXPERIMENTAL RESULTS
As mentioned, we get our baseline from the Scenario
1 experiment. We found the classification accuracy in
this case to be 81.82%. It is expected that this will be
the highest measured accuracy, as maximal informa-
tion is available to the classifier.
5.1 Classification Accuracy
We now discuss the effect of k-anonymisation on clas-
sification accuracy by analysing the results for other
scenarios.
Classification Accuracy Degradation
Here, we examine the reduction in classification ac-
curacy from the measured baseline in each scenario
involving anonymisation. The results were
averaged over all QID counts, mapping and algorithm
types. For Scenarios 2 & 3 we saw similar results to
those measured in Shao & Beckford (Shao and Beck-
ford, 2017), that is, degradation in classification ac-
curacy from the baseline increases with the k-value.
The fall for these two scenarios ranged from 2.15%
to 19.15%, with an average for the more commonly
used values of k (2-10) (Emam and Dankar, 2008) be-
ing 11.72%.
For Scenario 4, we saw the fall in classification ac-
curacy stabilise across the board, with the measured
fall ranging from 5.51% to 8.01%. This would sug-
gest that anonymisation in the training phase of deci-
sion tree classification has a greater impact on classi-
fication accuracy than during the testing phase. Also,
echoing the conclusions from Shao & Beckford (Shao
and Beckford, 2017), it would benefit researchers
performing decision tree classification if data pub-
lishers kept the k-value in the range 2-10, regardless
of the applicable scenario.
Mapping Type Comparison
Figure 1 shows the classification accuracy compari-
son between the two types of generalisation mapping
used in this paper. These results are for Scenario 2,
with the result from Scenario 1 shown for compari-
son.
Figure 1: Generalisation Mapping Comparison - Scenario 2. (Plot: classification accuracy % against k ∈ {2, 5, 10, 25, 50, 100} for Random, Statistical and Non-Anonymised.)
We can see a slight improvement in classification
accuracy for the statistical method of generalisation
mapping compared to the random method. However,
this increase is minimal, with an improvement of just
1.4% in the best case. Furthermore, in the commonly
used k-value range (2-10), the improvement is neg-
ligible. The expectation was that the statistical ap-
proach would show an improvement in classification
accuracy over the random approach; these results are
in line with that, but the increase is so small that it
would be better for the data publisher not to provide
frequency statistics with their anonymised data so that
it is better protected. We found similar results for Sce-
nario 3.
Algorithm Comparison
Here, we compare the results for the two modes of
the Mondrian algorithm. These results are compiled
from average classification accuracy over all QID
counts for the given k-value, disregarding the map-
ping method.
Figure 2 shows the comparison of algorithm
modes for Scenario 3. We see a minor improve-
ment for the strict algorithm over the relaxed. The
relaxed algorithm tends to result in anonymisations
with smaller ECs. Hence, for decision tree classifica-
tion, the size of ECs seems less of a factor than the size
and number of generalisations within.
The results for Scenarios 2 & 4 showed a negli-
gible difference between the two algorithm modes re-
garding classification accuracy. It would appear to be
a better choice for the publisher to opt for the strict
algorithm when the known use of the data is decision
tree classification.
Figure 2: Algorithm Mode Comparison - Scenario 3. (Plot: classification accuracy % against k for Relaxed, Strict and Non-Anonymised.)
Scenario Comparison
To complete the classification accuracy results, we
compare the average classification accuracy of all
given scenarios described in Section 4.1.
Figure 3: Scenario Classification Accuracy Comparison. (Plot: average classification accuracy % for Scenarios 1-4.)
Figure 3 shows the average classification accuracy
by scenario, all else being equal. We can see more
clearly here the degradation in accuracy relative to
classification where no anonymisation is used. These re-
sults highlight that degradation is expected when us-
ing anonymised data in decision tree classification;
however, it can be limited by restricting the use of
anonymised data to either the training or classifica-
tion phase only. This is particularly so in the case
described in Scenario 4.
5.2 Metric Correlation
This section will analyse the results from measure-
ments of the three well-established metrics tradition-
ally used in the literature to measure the effectiveness
and utility of anonymisations. As stated, the three
metrics measured were Discernability Metric, ILoss,
and Classification Metric. We took measurements for
each of these metrics on the same anonymised data
sets as used in the previous experiments, then tried to
establish a correlation between the metric measure-
ments and the corresponding classification accuracy
measurements.
Discernability Metric
Firstly, as an aside from the correlation results, Fig-
ure 4 shows a comparison of the Discernability Met-
ric (DM) penalty for the two modes of Mondrian al-
gorithm. It is notable that the strict algorithm showed
a much higher measured DM in all cases. As estab-
lished, DM is a measure of the size of equivalence
classes in an anonymisation. These measurements
provide evidence to show that the relaxed algorithm
results in smaller equivalence classes on average, ev-
erything else being equal. It is worth noting that the
disparity here is in contrast with Figure 2, where the
two modes of Mondrian were found to show only a
small difference in accuracy.
Figure 4: Discernability Penalty - Algorithm Comparison. (Plot: discernability penalty, ×10⁷, against k for Relaxed and Strict.)
Figure 5 shows a scatter graph illustrating the cor-
relation between the classification accuracy and the
DM for each scenario involving anonymisation. We
can see from these results that the correlation between
DM and classification accuracy is relatively mediocre
in all cases; ILoss and the Classification Metric show
much stronger correlations. Notably, the correlation
is clearly negative, indicating that an increase in DM
suggests a decrease in classification accuracy. How-
ever, the correlation is not strong enough for DM to be
considered a reliable estimator for classification accu-
racy in decision tree classification.
Figure 5: Discernability Penalty / Classification Accuracy Correlation. (Plots: discernability penalty, ×10⁶, against classification accuracy %; Scenario 2: r = −0.81, Scenario 3: r = −0.83, Scenario 4: r = −0.58.)
ILoss
Similar to the previous subsection, Figure 6 shows the
correlation between the classification accuracy and
the ILoss.
In comparison to DM, we can see a much stronger
correlation between this metric and the measured
classification accuracy, with Scenarios 2 & 3 showing
an impressive correlation coefficient r of −0.98. This
would suggest that ILoss is a good estimator for clas-
sification accuracy in decision tree classification.
As discussed, ILoss considers individual records
when calculating the penalty, unlike DM which sim-
ply considers the size of the ECs containing said
records. Specifically, ILoss considers the size of gen-
eralisations in a given attribute compared to the do-
main of that attribute. The strong correlation shown
here would suggest that the number and relative size
of generalisations in an anonymisation have a greater
effect on classification accuracy than the size of ECs.
Classification Metric
Finally, we consider the Classification Metric (CM).
Figure 7 shows the correlation between the classifica-
tion accuracy and the CM penalty.
CM is a special purpose metric, designed to give
insight into the utility of an anonymisation in classi-
fication tasks. Therefore, we expected the correlation
shown here to be the strongest of the three metrics.
This is indeed the case, with the weakest correlation
Figure 6: ILoss / Classification Accuracy Correlation. (Plots: ILoss, ×10⁵, against classification accuracy %; Scenario 2: r = −0.98, Scenario 3: r = −0.98, Scenario 4: r = −0.82.)
coefficient being −0.95 in Scenario 4 – still much
stronger than with the other metrics tested. As with ILoss, CM
would appear to be a very reliable metric for estimat-
ing classification accuracy in decision tree classifica-
tion.
It is notable that for each metric, the correlation
in Scenario 4 is the weakest. This would seem to
be evidence for the suggestion that classification ac-
curacy is affected to a greater degree by anonymisa-
tion in the training of the decision tree, compared to
the classification of values using it. In Scenario 4,
anonymisation in the training part of the algorithm is
eliminated, and the classification results became more
stable whilst the metric measures showed a dispropor-
tionate change.
6 CONCLUSIONS
We can see from these results that the anonymisation
of data can be expected to be detrimental to its utility
regarding decision tree classification, but the magni-
tude of that detriment is minimal in most cases. Of
course, these results only relate to the ID3 decision
tree algorithm; more research is necessary to draw
general conclusions on machine learning tasks.
There are clear factors that affect the loss of util-
ity from anonymisation, primarily the scenario in
which the data is used and the value of k. We saw
that, generally, Scenario 4 had the best results in this
regard, although Scenario 3 was similar for the more
Figure 7: Classification Metric / Classification Accuracy Correlation. (Plots: CM penalty against classification accuracy %; Scenario 2: r = −0.98, Scenario 3: r = −0.99, Scenario 4: r = −0.95.)
commonly used values of k. Limiting the amount of
anonymisation where possible is therefore advisable.
In addition, a value of k between 2 and 10 is generally
recommended based on these results. In most cases,
a value in this range would provide sufficient privacy,
whilst maintaining utility.
There appeared to be no benefit to using frequency
statistics to aid the mapping of generalised values
when building decision trees. Furthermore, there was
very little difference between the two modes of Mon-
drian algorithm – although the Strict mode performed
slightly better in some cases.
Data publishers who want to maximise the util-
ity of their data for classification could consider the
use of the ILoss metric and the Classification Metric to
provide an estimation of decision tree classification accuracy.
These metrics are easily calculated, and the evidence
provided in this paper suggests that they are reliable
estimators. By utilising these metrics to calibrate their
anonymisation parameters, data publishers can pro-
vide more useful data to researchers without compro-
mising privacy.
REFERENCES
Ayala-Rivera, V., McDonagh, P., Cerqueus, T., and Murphy,
L. (2014). A systematic comparison and evaluation of
k-anonymization algorithms for practitioners. Trans-
actions on Data Privacy, 7(3):337–370.
Dua, D. and Graff, C. (2017). UCI machine learning repos-
itory. [http://archive.ics.uci.edu/ml]. Irvine, CA: Uni-
versity of California, School of Information and Com-
puter Science.
Emam, K. and Dankar, F. (2008). Protecting privacy using
k-anonymity. Journal of the American Medical Infor-
matics Association, 15:627–37.
Fung, B. C. M., Wang, K., Chen, R., and Yu, P. S. (2010).
Privacy-preserving data publishing: a survey of recent
developments. ACM Computing Surveys, 42(4):14:1–
14:53.
Iyengar, V. S. (2002). Transforming data to satisfy pri-
vacy constraints. In Proceedings of the Eighth
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, page 279–288.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006).
Mondrian multidimensional k-anonymity. In Proceed-
ings of the 22nd International Conference on Data En-
gineering, pages 25–25.
Li, J., Liu, J., Baig, M. M., and Wong, R. C.-W. (2011).
Information based data anonymization for classifi-
cation utility. Data & Knowledge Engineering,
70(12):1030–1045.
Mitchell, T. (1997). Machine Learning, chapter 3. McGraw
Hill.
Quinlan, J. R. (1986). Induction of decision trees. Machine
Learning, 1:81–106.
Samarati, P. and Sweeney, L. (1998). Protecting privacy
when disclosing information: k-anonymity and its
enforcement through generalization and suppression.
Technical report.
Shao, J. and Beckford, J. (2017). Learning decision trees
from anonymized data. In 8th Annual International
Conference on ICT: Big Data, Cloud and Security.
Slowinski, R. (1992). Intelligent decision support: Hand-
book of applications and advances of the rough sets
theory.
Tang, Q., Wu, Y., Liao, S., and Wang, X. (2010). Utility-
based k-anonymization. In The 6th International Con-
ference on Networked Computing and Advanced In-
formation Management, pages 318–323.
Tao, Y. and Xiao, X. (2008). Personalized Privacy Preser-
vation, pages 461–485.