On the Evaluation of the Privacy Breach in Disassociated Set-valued Datasets

Sara Barakat¹, Bechara Al Bouna¹, Mohamed Nassar² and Christophe Guyeux³
¹TICKET Lab., Antonine University, Hadat-Baabda, Lebanon
²Department of Computer Science, MUBS, Hamra, Lebanon
³University of Bourgogne Franche-Comté, Belfort, France

Keywords: Disassociation, Privacy Breach, Data Privacy, Set-valued, Privacy Preserving.
Abstract: Data anonymization is gaining much attention these days, as it provides the fundamental requirements to safely outsource datasets containing identifying information. While some techniques add noise to protect privacy, others use generalization to hide the link between sensitive and non-sensitive information, or separate the dataset into clusters to gain more utility. In the latter, often referred to as bucketization, data values are kept intact; only the link is hidden to maximize the utility. In this paper, we showcase the limits of disassociation, a bucketization technique that divides a set-valued dataset into k^m-anonymous clusters. We demonstrate that a privacy breach might occur if the disassociated dataset is subject to a cover problem. We finally evaluate the privacy breach using the quantitative privacy breach detection algorithm on real disassociated datasets.
1 INTRODUCTION
Privacy preservation is an important requirement that
must be considered carefully before the release of
datasets containing valuable information. Anonymiz-
ing a dataset by simply removing the personally iden-
tifying information (PII) is insufficient to prevent
a privacy breach (Samarati, 2001; Sweeney, 2002).
An attacker (i.e., a person who intentionally tries to link individuals to their sensitive information) may still succeed in associating his/her background knowledge with one or multiple records via non-sensitive information in the dataset to eventually reveal individuals' sensitive information. The
same problem arises when releasing set-valued data
(e.g., shopping and search logs) consisting of a set
of records where each record links an individual to
his/her set of distinct items. The AOL search data
leak in 2006 (Barbaro and Zeller, 2006) is an explicit
example that shows the implications of inappropri-
ate anonymization on data privacy. The query logs
of 650k individuals were released after omitting all
explicit identifiers. They were later withdrawn due to multiple reports of privacy breaches.
To better illustrate the problem, let us consider a dataset of individuals and their corresponding sets of searched data items. Let us assume that an attacker knows that Alice, an individual whose items figure in a publicly available dataset, has used two search items: {Side Effects, Vomiting}. The attacker's background knowledge consists of the set of these two items. If it happens that one and only one record exists in the released dataset where both Vomiting and Side Effects belong to the same itemset, such as the case of record r_1 in the example shown in Figure 1, the attacker can link the individual Alice to r_1. Here, the attacker, who has partial knowledge of some of the data items searched by the individual, is able to determine the complete itemset and link it to the corresponding individual.

Figure 1: Example scenario. (a) Original set-valued dataset; (b) disassociated dataset with record chunks C_1 and C_2.
There has been significant work in anonymization
(Samarati, 2001; Sweeney, 2002; Machanavajjhala
et al., 2006; Xiao and Tao, 2006; Dwork et al., 2006;
Terrovitis et al., 2012) to cope with this problem.
Categorization, on the one hand, separates the data
into sensitive and non-sensitive categories of values,
which unfortunately can be hard to adopt in real-world applications due to the elastic meaning of sensitivity (Terrovitis et al., 2012). Generalization, on the
other hand, transforms values into a general form and
creates groups to eliminate possible linking attacks.
Despite its efficiency in privacy preservation, general-
ization has a major drawback related to the loss in data
utility (Xiao and Tao, 2006), making it less efficient
for aggregate analysis. Anonymization by disassoci-
ation (Terrovitis et al., 2008; Loukides et al., 2014a;
Loukides et al., 2015) is one recent method that pre-
serves the original items keeping them intact, with-
out applying generalization or substitution. Instead,
it uses some form of bucketization and separates the
records into clusters and chunks to hide their associ-
ations. More specifically, disassociation transforms
the original data into k^m-anonymous (Terrovitis et al., 2008) records by dividing the dataset 1) horizontally, creating clusters of similar records, and 2) vertically, creating k^m-anonymous chunks of similar sub-records (e.g., record chunks C_1 and C_2 in Figure 1(b)) and an item chunk containing items that appear less than k times.
Disassociation lies under a general category of
techniques (Li et al., 2012; Xiao and Tao, 2006; Ciri-
ani et al., 2010) that preserve privacy by dividing the
dataset to maximize the utility. It provides strong the-
oretical guarantees but unfortunately remains vulner-
able when the dataset is subject to what we call a
cover problem, which is defined in Section 5.
In this article, we evaluate the privacy provided by
the disassociation of a set-valued dataset and demon-
strate that it might be breached whenever this dataset
is subject to a cover problem. We show that attackers with different levels of partial background knowledge (weak, moderate, or strong) are able to compromise the privacy of a disassociated dataset. We also show that, as with many other bucketization techniques (Xiao and Tao, 2006; Li et al., 2012) that attempt to protect against attribute disclosure, the privacy preservation it provides tends to be overly optimistic (Wong et al., 2011; Kifer, 2009; al Bouna et al., 2015a; al Bouna et al., 2015b).
The contributions of this research work can be
summarized as follows:
- We present a formal model of set-valued data and the disassociation technique.
- We define the cover problem and investigate its repercussions on the disassociation of the dataset.
- We propose the so-called quantitative privacy breach detection algorithm and study its efficiency. The proposed algorithm shows to what extent the privacy of a disassociated dataset can be breached.
The remainder of this paper is organized as follows.
In the next section, we present prior work in data
anonymization and discuss some of the techniques
used to breach well known privacy constraints. In
Section 3, we give some insights on privacy preserv-
ing using the disassociation technique. Section 4
defines the attacker model including the three types
of attackers: strong, moderate, and weak. In Sec-
tion 5, we present the cover problem and the result-
ing privacy breach in a disassociated dataset. We fi-
nally present a set of experiments to evaluate the de-
anonymizing algorithm in Section 6.
2 RELATED WORK
Anonymization Techniques have been a topic of in-
tense research in the last decade. Several have been
proposed as a way to protect privacy by guaranteeing
that the probability of linking an individual to a spe-
cific record will not be greater than a certain thresh-
old (e.g., 1/k or 1/l). Some rely on generalization (Samarati, 2001; Sweeney, 2002; Machanavajjhala et al., 2006), replacing values with more general ones based on a given taxonomy, or with an interval for numerical values, while others rely on bucketization (Xiao and Tao, 2006; Ciriani et al., 2010; Biskup et al., 2011), separating what is sensitive from what is not. Disassociation falls under this category of bucketization techniques: it preserves utility by keeping the values intact, without any alteration, but it does not divide the attributes into sensitive and non-sensitive.
Dwork defined the notion of differential privacy (Dwork et al., 2006) to handle private data publishing efficiently. It gained much popularity among computer scientists by providing strong guarantees on the way data should be released. It is based on a strong mathematical foundation and guarantees that an attacker will not be able to learn more about an individual than he/she could have learned if that individual's record were removed from the dataset. To achieve differential privacy, approaches tend to release privatized data by introducing noise to query results. Here, despite the efficiency provided by differential privacy, we opt for disassociation since it publishes truthful datasets.
He and Naughton (He and Naughton, 2009) extended k-anonymity to the set-valued context, in which there is no distinction between quasi-identifying, sensitive, or non-sensitive values. In their approach, the authors generate a k-anonymous dataset via a top-down partitioning technique based on local generalization. In a similar approach, the authors in (Fard and
Wang, 2010) adopted the k-anonymity privacy con-
straint for a transactional dataset. They proposed a
clustering based technique to group similar transac-
tions into clusters, in order to reduce generalizations
and suppression thus reducing information loss. A
common problem with these methods is that a large
part of the initial items are usually missing from the
anonymized data due to generalization or suppres-
sion. In (Jia et al., 2014), the authors propose a pri-
vacy notion ρ-uncertainty to ensure that there are no
sensitive association rules that can be inferred with
confidence higher than ρ. Truthful association rules
however can still be derived, but this requires, in a
similar manner to bucketization techniques, distinc-
tion between the sensitive and non-sensitive values.
Preserving privacy by disassociation is a promis-
ing technique in which the values are kept intact with-
out any alteration: neither generalization nor suppres-
sion. It thus falls under the general category of bucke-
tization techniques but again, without the need to dis-
tinguish between sensitive and non-sensitive values.
Attacks on Anonymization focus on the identification of sensitive information, i.e., information that is meant to be private, by exploiting defects in the privacy constraints, bugs in the mechanisms that implement them, or even by assuming knowledge of the underlying anonymization algorithm, such as the minimality attack (Cormode et al., 2010; Wong et al., 2007).
We discuss here some of the interesting attacks that
have been reported in the data anonymization literature. The homogeneity attack (Samarati, 2001) is considered successful when there is a lack of diversity among the values of the sensitive attribute. The background knowledge attack (Machanavajjhala et al., 2006) compromises anonymization when the attacker's background knowledge includes information about the sensitive values of a specific individual or partial knowledge of the distribution of sensitive and non-sensitive values. The unsorted matching attack (Sweeney, 2001) relates to the order in which the tuples appear in the released table, which can leak sensitive information if it is preserved in the anonymized dataset.
Table 1: Notations.
T: a table containing individuals' related records
r: a record of T; the set of items associated with a specific individual of a population
[r_i]^m: the set of all m-combinations of items in the record r_i
I: an itemset of D that may or may not be grouped together in a record of T
R_T: a set of records in T grouped as a cluster
s(I, T): the number of records in T that contain I
T*: a table anonymized using the disassociation technique
C: a chunk in a disassociated dataset; a set of sub-records of T in T*
C_T: an item chunk in a disassociated dataset; a quasi-identifier group
R_C: a record chunk; a set of sub-records of R in T*
B: the background knowledge of an attacker

The attacks mentioned previously present some insights on the privacy problems that might be encountered when anonymizing sensitive information in an outsourcing scenario. To the best of our knowledge, none has explored the privacy breach in a disassociated dataset. In this paper, we highlight and evaluate the ability of an attacker to link his/her partial background knowledge, represented by the m items that he/she is allowed to have according to the k^m-anonymity privacy constraint, to less than k records.
The works described in (Wong et al., 2011; Res-
sel, 1985) are good examples of attacks that fol-
low the same convention by identifying flaws in the
anonymization techniques allowing attackers to use
their background knowledge to breach privacy or ex-
tract foreground knowledge. The authors assume that
the correlation between the values in an anonymized
dataset could be used to breach the privacy of the
anonymization techniques that use l-diversity in their
underlying privacy mechanism.
3 PRESERVING PRIVACY BY
DISASSOCIATION
3.1 Formalization
Let D = {x_1, ..., x_d} be a set of items (e.g., supermarket products, query logs, or search keywords). Any subset I ⊆ D is an itemset (e.g., items searched together). Let T = (r_1, ..., r_n) be a dataset of records, where r_i ⊆ D, for 1 ≤ i ≤ n, is a record in T and r_i is associated with a specific individual of a population. We use [r_i]^m to denote the set of all m-combinations of the record r_i. Equivalently, [r_i]^m is the set of m-item subsets of r_i. Note that D is no more than the set of items in T: D = ∪_{i=1}^{n} r_i.

We use R_T to denote a subset of records in T. s(T) and s(R_T) denote the numbers of records in T and R_T respectively, and s(I, T) denotes the number of records in T that contain I.
Table 1 defines the basic concepts and notations
used in the paper.
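As a quick illustration of these notations (a toy dataset of our own, not one used later in the paper), consider three records over the items a, b, and c:

\[ T = (r_1, r_2, r_3), \quad r_1 = \{a,b,c\}, \; r_2 = \{a,b\}, \; r_3 = \{b,c\}, \qquad D = \{a,b,c\}. \]
\[ [r_1]^2 = \{\{a,b\},\{a,c\},\{b,c\}\}, \qquad s(\{a,b\}, T) = |\{r \in T : \{a,b\} \subseteq r\}| = 2. \]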
3.2 Disassociation
Disassociation works under the assumption that the items should neither be altered, suppressed, nor generalized, but at the same time the dataset should satisfy the k^m-anonymity privacy constraint (Terrovitis et al., 2008). k^m-anonymity guarantees that an attacker who knows up to m items, grouped together in an itemset I ⊆ D, cannot associate them with less than k records from T, where k ≥ 2 is a user-defined constant. Formally, k^m-anonymity is defined as follows:

Definition 1 (k^m-anonymity). Given a dataset of records T whose items belong to a given set of items D, we say that T is k^m-anonymous if, for every I ⊆ D with |I| ≤ m, the number of records that contain I in T is greater than or equal to k, i.e., s(I, T) ≥ k.
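To make the definition concrete, here is a minimal Java sketch (toy code of our own, not the implementation used in Section 6) that checks whether every m-combination actually occurring in a dataset is contained in at least k records; combinations that occur in no record are ignored, which corresponds to an attacker who only knows items an individual has really used.

import java.util.*;

public class KmAnonymityCheck {

    // True if every m-item combination occurring in some record has support >= k.
    static boolean isKmAnonymous(List<Set<String>> T, int k, int m) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Set<String> record : T) {
            for (Set<String> combo : combinations(record, m)) {
                support.merge(combo, 1, Integer::sum);
            }
        }
        return support.values().stream().allMatch(s -> s >= k);
    }

    // All subsets of size exactly m of the given record.
    static List<Set<String>> combinations(Set<String> record, int m) {
        List<String> items = new ArrayList<>(record);
        List<Set<String>> result = new ArrayList<>();
        collect(items, m, 0, new ArrayList<>(), result);
        return result;
    }

    private static void collect(List<String> items, int m, int start,
                                List<String> current, List<Set<String>> out) {
        if (current.size() == m) { out.add(new HashSet<>(current)); return; }
        for (int i = start; i < items.size(); i++) {
            current.add(items.get(i));
            collect(items, m, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Hypothetical records, loosely inspired by Figure 1(a).
        List<Set<String>> T = List.of(
            Set.of("Oncologist", "Treatment", "Cancer"),
            Set.of("Oncologist", "Treatment", "Vomiting"),
            Set.of("Oncologist", "Treatment", "SideEffects"));
        // {Oncologist, Treatment} occurs 3 times, but {Treatment, Cancer} only once,
        // so this dataset is not 2^2-anonymous.
        System.out.println(isKmAnonymous(T, 2, 2)); // prints false
    }
}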
In a practical scenario, k^m-anonymity cannot be achieved without a severe loss in data utility. The disassociation technique (Terrovitis et al., 2012; Loukides et al., 2014b), for its part, ensures privacy through bucketization to provide better utility when it comes to frequent itemset discovery and aggregate analysis.
Disassociation separates the dataset into clusters of k^m-anonymous record chunks and an item chunk. The key idea is to sort records based on the most frequent items and then group them horizontally into smaller disjoint clusters {R_1, ..., R_n}. It then partitions the clusters vertically into k^m-anonymous record chunks {C_1, ..., C_n} and an item chunk C_T to hide infrequent combinations. The record chunks are created successively as long as there are items that can be grouped together in a way that satisfies the k^m-anonymity privacy constraint. The remaining items, the ones that occur less than k times, are moved to the item chunk C_T. (The shared chunk defined in the original paper (Terrovitis et al., 2012) is omitted here for simplicity.)
Formally, given a dataset T, applying disassociation on T will produce an anonymized dataset T* composed of n clusters, where each is divided into a set of record chunks and an item chunk:

T* = ({R_1 C_1, ..., R_1 C_t, R_1 C_T}, ..., {R_n C_1, ..., R_n C_t, R_n C_T})

such that, for every R_i C_j ∈ T*, R_i C_j is k^m-anonymous, where:
- R_i C_j represents the itemsets of the i-th cluster that are contained in the record chunk C_j;
- R_i C_T is the item chunk of the i-th cluster, containing items that occur less than k times.
The example in Figure 1(b) shows that the disassociated dataset contains only one cluster with two k^m-anonymous record chunks (k = 3 and m = 2) and an item chunk; thus, T* = ({R_1 C_1, R_1 C_2, R_1 C_T}).
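To make this structure concrete, the following short Java fragment encodes one possible cluster in the spirit of Figure 1(b). The exact sub-records of the figure are not reproduced in the text, so the contents below are illustrative only; they are, however, consistent with k = 3 and m = 2 and with the cover problem discussed in Section 5. This nested-list representation (a cluster as a list of record chunks, a record chunk as a list of item sets) is also the one assumed by the sketches given later.

import java.util.*;

public class DisassociatedClusterExample {
    public static void main(String[] args) {
        // Record chunk R_1 C_1: every 2-combination of its items occurs k = 3 times.
        List<Set<String>> c1 = List.of(
            Set.of("Oncologist", "Treatment", "Cancer"),
            Set.of("Oncologist", "Treatment", "Cancer"),
            Set.of("Oncologist", "Treatment", "Cancer"));
        // Record chunk R_1 C_2.
        List<Set<String>> c2 = List.of(
            Set.of("SideEffects"),
            Set.of("SideEffects"),
            Set.of("SideEffects"));
        // Item chunk R_1 C_T: items occurring less than k times in the cluster (illustrative).
        Set<String> itemChunk = Set.of("Vomiting");
        // The single cluster of T* = ({R_1 C_1, R_1 C_2, R_1 C_T}).
        List<List<Set<String>>> recordChunks = List.of(c1, c2);
        System.out.println(recordChunks.size() + " record chunks, item chunk = " + itemChunk);
    }
}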
In a disassociated dataset, privacy is preserved by assuming that the record chunks are k^m-anonymous and that the records that are not k^m-anonymous in the original dataset are k^m-anonymous in at least one of the datasets resulting from the inverse transformation of the disassociated dataset. This privacy guarantee is formally expressed as follows:

Definition 2 (Disassociation Guarantee). Let G be the inverse transformation of T*: G takes the anonymized dataset T* and outputs the set of all possible original datasets G(T*) = {T'}. For every I ⊆ D with |I| ≤ m, there exists T' ∈ G(T*) such that s(I, T') ≥ k.

The Disassociation Guarantee ensures that for any individual with a complete record r, and for an attacker who knows up to m items of r, at least one of the datasets reconstructed by the inverse transformation has r hidden by repeating its existence k times. That is, given m items of r, if r exists less than k times in the original dataset, the attacker will not be able to link the record back to the individual with a probability greater than 1/k in the disassociated dataset. This record r, as all other records, exists in at least one of the inverse transformations k times.
4 ATTACKER MODEL
In accordance with the work in (Terrovitis et al., 2012;
Loukides et al., 2014b; Terrovitis et al., 2008), we
consider that the attacker intends to link individuals to
their anonymized records (e.g., complete set of search
logs or purchased supermarket products). This can
be done, in a trivial anonymization, by tracing unique
combinations of items. For example, an attacker who
knows that Alice has searched for side effects and
vomiting is represented by a background knowledge
set of items {Side Effects, Vomiting}. Since one and
only one such record exists in the dataset of Figure
1(a), the attacker can easily link this record back to
Alice. In addition, we assume that the attacker has
knowledge of the disassociation technique.
We also assume that the attacker may have background knowledge consisting of at most m items; knowing these m items, he/she will not be able to link them to less than k records in the dataset (i.e., adopting k^m-anonymity, the same privacy guarantee as in (Terrovitis et al., 2012)). Some models add to it negative background knowledge (e.g., the attacker knows that an individual has not posed a specific query, that an item is not purchased by a given individual, etc.). Again, as in (Terrovitis et al., 2012) and many other bucketization techniques (Xiao and Tao, 2006; Li et al., 2012), we do not assume this kind of negative knowledge. Here, we classify attackers into three categories: strong, moderate, and weak, depending on their level of background knowledge and consequently their ability to associate individuals with their records in T.

Figure 2: Associations between the background knowledge of attackers, based on their category, and the records in T. (a) Strong attackers; (b) moderate attackers; (c) weak attackers.
Strong Attackers have a background knowledge B = {I_1, ..., I_n} such that every I_i ∈ B is of size m, |I_i| = m, and is well picked from the set of m-combinations [r_i]^m of a record r_i, I_i ∈ [r_i]^m (more details on how these itemsets are picked are provided in the experiments section). For these attackers, as shown in Figure 2(a), there exists a bijective function that links itemsets in B to records in T.

Moderate Attackers have a background knowledge B = {I_1, ..., I_z}, z < n, such that every I_i ∈ B is of size m, |I_i| = m, and is chosen at random from a strict subset I_v of the m-combinations [r_v]^m of a record r_v, I_i ∈ I_v, where I_v ⊂ [r_v]^m. For these attackers, as shown in Figure 2(b), there exists an injective (one-to-one) function that links itemsets in B to records in T.

Weak Attackers have a background knowledge B = {I_1, ..., I_z}, z < n, such that every I_i ∈ B is of size m, |I_i| = m, and is chosen at random from a set of association rules and frequent itemsets I, I_i ∈ I, that are not necessarily in D. The goal of weak attackers is limited to reconstructing the disassociated dataset to its original form; they are unable to link records to individuals (see Figure 2(c)).
5 PRIVACY EVALUATION IN A
DISASSOCIATED DATASET
In this section, we define the cover problem and demonstrate its repercussions on data privacy. More specifically, we evaluate the privacy breach that might occur in a disassociated dataset due to the cover problem.
5.1 Cover Problem
A cover problem is defined by the ability to associate, one-to-one or one-to-many, items in two subsequent record chunks in the disassociated data. The target item from the first chunk has equal or higher support, and we call it the covered item. Formally, the cover problem is defined as follows:

Definition 3 (Cover Problem). Given the set of items in R_i C_{j-1} (j ≥ 2) that have a support greater than or equal to the support of an item x_{i,j} ∈ R_i C_j, I_{i,j-1} = {y : y ∈ R_i C_{j-1} and s(y, R_i C_{j-1}) ≥ s(x_{i,j}, R_i C_j)}, we say that a cover problem exists if there exists y_{i,j-1} ∈ I_{i,j-1} such that

s(y_{i,j-1}, R_i C_{j-1}) = s(I_{i,j-1}, R_i C_{j-1}) = min_{y ∈ I_{i,j-1}} s(y, R_i C_{j-1}).
To give an example of the cover problem, consider Figure 3(a). The set of items in the previous record chunk R_1 C_1 having support greater than or equal to that of e is I_{1,1} = {a, b, c}. The support of the itemset I_{1,1} in R_1 C_1, which is s(I_{1,1}, R_1 C_1), is equal to 2. In turn, it is equal to the minimum support of the values in I_{1,1}, which, in our example, is s(c, R_1 C_1). Therefore, we say that the item c is covered by the items a and b.

Another example can be extracted from Figure 1(b): the item Cancer in I_1 is covered by the remainder of the items in I_1. In fact, all of the occurrences of the item Cancer are associated with the items Treatment and Oncologist and therefore, by associating the item Side Effects with Cancer, Side Effects will eventually be linked to both Treatment and Oncologist.
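The following Java sketch (again toy code of our own, using the same chunk representation as in the earlier fragments) tests Definition 3 for a given item x of chunk C_j against the preceding chunk C_{j-1}; the main method reproduces, with hypothetical sub-records, the situation described for Figure 3(a), where c is covered by a and b.

import java.util.*;

public class CoverProblemCheck {

    // Support of a single item in a chunk.
    static long support(String item, List<Set<String>> chunk) {
        return chunk.stream().filter(r -> r.contains(item)).count();
    }

    // Support of an itemset in a chunk (sub-records containing all of its items).
    static long support(Set<String> itemset, List<Set<String>> chunk) {
        return chunk.stream().filter(r -> r.containsAll(itemset)).count();
    }

    // True if item x of chunkJ raises a cover problem w.r.t. chunkJminus1 (Definition 3).
    static boolean hasCoverProblem(String x, List<Set<String>> chunkJ,
                                   List<Set<String>> chunkJminus1) {
        long sx = support(x, chunkJ);
        // I_{i,j-1}: items of the previous chunk with support >= s(x, C_j)
        Set<String> covering = new HashSet<>();
        for (Set<String> rec : chunkJminus1) covering.addAll(rec);
        covering.removeIf(y -> support(y, chunkJminus1) < sx);
        if (covering.isEmpty()) return false;
        long minSupport = covering.stream()
                                  .mapToLong(y -> support(y, chunkJminus1))
                                  .min().getAsLong();
        // Cover problem: the whole itemset I_{i,j-1} is as frequent as its rarest member.
        return support(covering, chunkJminus1) == minSupport;
    }

    public static void main(String[] args) {
        // Hypothetical chunks in the spirit of Figure 3(a): c is covered by a and b.
        List<Set<String>> c1 = List.of(Set.of("a", "b", "c"),
                                       Set.of("a", "b", "c"),
                                       Set.of("a", "b"));
        List<Set<String>> c2 = List.of(Set.of("e"), Set.of("e"));
        System.out.println(hasCoverProblem("e", c2, c1)); // prints true
    }
}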
5.2 Privacy Breach
Recall from Subsection 3.2 and Definition 2 that the disassociation technique hides infrequent records, i.e., records that occur less than k times in the original dataset for a given m items, by 1) dividing them into k-anonymous sub-records in record chunks and 2) ensuring that all the records reconstructed by the inverse transformation are k^m-anonymous in at least one of the resulting datasets.

Figure 3: Cover problem and privacy breach examples. (a) Cover problem example in T*; (b) datasets reconstructed by the inverse transformation of T*; (c) datasets in which items e and c are associated.

Intuitively, a privacy breach occurs if an attacker is able to link m items, which he/she already knows about an individual, to less than k records in all the datasets reconstructed by the inverse transformation. More subtle is the case where these records contain the same set of items in all the reconstructed datasets, thus linking more than m items to the individual or, worse, leading to a complete de-anonymization: linking, with certainty, the complete set of items to the individual. Figure 3(c) is an example of such a de-anonymization: the attacker already knows that an individual has searched for items e and c and can thus relate him/her to the same items a and b in all the reconstructed datasets.
We will show in the following that this privacy
breach might occur whenever the dataset is subject to
a cover problem. Formally speaking:
Lemma 1. Let T* be a disassociated dataset subject to a cover problem involving an item x_{i,j} ∈ R_i C_j and a covered item y_{i,j-1} ∈ R_i C_{j-1}. We say that a privacy breach might occur if {x_{i,j}, y_{i,j-1}} ⊆ I, where I ∈ B, the attacker's background knowledge.
We show in what follows that an attacker is able
to breach the privacy provided by the disassociation
technique, if his/her background knowledge contains
the items involved in the association that led to the
cover problem.
Proof. The dataset T* is subject to a cover problem and therefore there exists an item x_{i,j} ∈ R_i C_j (j ≥ 2) that can be associated with a covered item y_{i,j-1} ∈ I_{i,j-1}, where I_{i,j-1} = {y : y ∈ R_i C_{j-1} and s(y, R_i C_{j-1}) ≥ s(x_{i,j}, R_i C_j)} is the set of items in R_i C_{j-1} having support greater than or equal to the support of x_{i,j}.

Let us assume that the items x_{i,j} and y_{i,j-1} are associated together in k records in at least one of the datasets reconstructed by the inverse transformation of T*. Since y_{i,j-1} is a covered item, x_{i,j} will also be associated k times with all the items covering y_{i,j-1} in I_{i,j-1}. While this is correct from a privacy perspective, it cannot be considered for disassociation. In fact, these items, x_{i,j}, y_{i,j-1} and the covering items in I_{i,j-1}, are considered k-anonymous and therefore should have been allotted to the record chunk R_i C_{j-1} according to disassociation (vertical partitioning, i.e., the creation of record chunks, preserves k-anonymous itemsets in the same record chunks).

Now, if the attacker's background knowledge B consists of itemsets of size m and there exists an itemset I that contains both x_{i,j} and y_{i,j-1}, {x_{i,j}, y_{i,j-1}} ⊆ I, a privacy breach will occur. In fact, these m items will never be associated together in k records in any of the datasets reconstructed by the inverse transformation of T*.
Figure 3(b) describes six possible datasets, T'_1, ..., T'_6, reconstructed by the inverse transformation of the disassociated dataset shown in Figure 3(a). In this example, only five datasets, T'_1, ..., T'_5, are valid ones. T'_6 is ruled out due to the cover problem and the knowledge of the disassociation algorithm: given that the item e is associated with the covered item c k = 2 times, e is consequently associated with a and b in k records and thus, according to the disassociation technique, the item e should have remained in the first record chunk R_1 C_1.
If the attacker's background knowledge contains the items c and e, he/she will be able to link every record containing c and e to less than k records in the record chunk, thus breaching the privacy of disassociation (see Figure 3(c)).
This privacy breach is also detected in the example of Figure 1(b). The item Cancer in R_1 C_1 is one of the least frequent items in the set of items having support greater than or equal to that of the item Side Effects in the record chunk R_1 C_2: I_1 = {Oncologist, Cancer, Treatment}. Cancer is covered by the remainder of the items in I_1. If Side Effects is associated with Cancer in k records, Side Effects will be associated with Treatment and Oncologist in k records or, alternatively, the itemset {Side Effects, Cancer, Treatment, Oncologist} is k-anonymous. As a consequence, according to the disassociation technique, these items should have belonged to the same record chunk, which does not correspond to the disassociated dataset of Figure 1(b).
5.3 Quantitative Privacy Breach
Detection Algorithm
We present, in the following, the quantitative privacy breach detection algorithm to evaluate the privacy breach in a disassociated dataset based on the cover problems. The algorithm takes a disassociated dataset T* and the attacker's background knowledge B, which includes itemsets of size m, and evaluates the privacy breach represented by the number of vulnerable records in the disassociated dataset T*. It iterates, in Step 3, over each cluster R_i in T* in ascending order to evaluate, for each item x_{i,j} in record chunk R_i C_j, the number of privacy breaches in all the previous record chunks. To do so, the algorithm 1) retrieves in Step 10 the itemset I_{i,l} of items having support greater than or equal to the support of x_{i,j} and 2) verifies in Step 11 whether x_{i,j} and any item y in I_{i,l} are subject to a cover problem. A privacy breach is noted if both items x_{i,j} and y are included in any of the itemsets I of the attacker's background knowledge B. The algorithm returns the total number of vulnerable records, computed from the sum of the maximum number of privacy breaches determined per item.
Although the proposed algorithm covers a broader
aspect of the privacy breach by estimating the number
of records that are vulnerable due to the cover prob-
lem, it can be extended to provide better insights on
how many records are de-anonymized, i.e., linked to the individual with certainty. In fact, as discussed in Sub-
section 5.2, a privacy breach is found not only when
completely de-anonymizing a record but also when an
attacker is able to link more than m items to the indi-
vidual. This is noted in the algorithm every time the
items involved in the association that led to the cover
problem are part of the attacker’s knowledge.
Algorithm 1: Quantitative Privacy Breach Detection algorithm.
Require: T*, B
Ensure: total
 1: max ← 0;
 2: total ← 0;
 3: for each R_i ∈ T* do
 4:   get j, the number of record chunks in R_i;
 5:   for each R_i C_j ∈ R_i do
 6:     breach ← 0;
 7:     l ← j - 1;
 8:     for each item x_{i,j} ∈ R_i C_j do
 9:       while (l > 0) do
10:         get I_{i,l};   /* I_{i,l} = {y : y ∈ R_i C_l and s(y, R_i C_l) ≥ s(x_{i,j}, R_i C_j)} */
11:         if s(I_{i,l}, R_i C_l) = min_{y ∈ I_{i,l}} s(y, R_i C_l) and {y_{i,l}, x_{i,j}} ⊆ I then   /* I ∈ B */
12:           breach++;
13:         end if
14:         l ← l - 1;
15:       end while
16:     end for
17:     if breach > max then
18:       max ← breach;
19:     end if
20:   end for
21:   total ← total + max;
22: end for
23: return total;
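To complement the pseudocode, the following rough Java sketch shows one way Algorithm 1 could be implemented over the nested-list representation used in the earlier snippets (a cluster is given as a list of record chunks only; the item chunk is left out). This is our reading of the algorithm rather than the implementation evaluated in Section 6; in particular, l is re-initialized for every item and the per-cluster maximum is reset at the start of every cluster, which we believe matches the intent of Steps 1, 7, and 17-19.

import java.util.*;

public class QuantitativeBreachDetection {

    static long support(String item, List<Set<String>> chunk) {
        return chunk.stream().filter(r -> r.contains(item)).count();
    }

    static long support(Set<String> itemset, List<Set<String>> chunk) {
        return chunk.stream().filter(r -> r.containsAll(itemset)).count();
    }

    // Number of vulnerable records ("total" in Algorithm 1).
    static int detect(List<List<List<Set<String>>>> clusters, List<Set<String>> B) {
        int total = 0;
        for (List<List<Set<String>>> cluster : clusters) {           // Step 3
            int max = 0;                                             // reset per cluster in this sketch
            for (int j = 1; j < cluster.size(); j++) {               // Step 5
                int breach = 0;                                      // Step 6
                List<Set<String>> chunkJ = cluster.get(j);
                Set<String> itemsJ = new HashSet<>();
                chunkJ.forEach(itemsJ::addAll);
                for (String x : itemsJ) {                            // Step 8
                    long sx = support(x, chunkJ);
                    for (int l = j - 1; l >= 0; l--) {               // Steps 7, 9, 14 (l reset per item)
                        List<Set<String>> chunkL = cluster.get(l);
                        Set<String> I = new HashSet<>();             // Step 10: I_{i,l}
                        chunkL.forEach(I::addAll);
                        I.removeIf(y -> support(y, chunkL) < sx);
                        if (I.isEmpty()) continue;
                        long minSup = I.stream().mapToLong(y -> support(y, chunkL)).min().getAsLong();
                        if (support(I, chunkL) != minSup) continue;  // no cover problem (Step 11)
                        for (String y : I) {                         // covered item known to the attacker?
                            if (support(y, chunkL) == minSup
                                    && B.stream().anyMatch(bk -> bk.contains(x) && bk.contains(y))) {
                                breach++;                            // Step 12
                                break;
                            }
                        }
                    }
                }
                max = Math.max(max, breach);                         // Steps 17-19
            }
            total += max;                                            // Step 21
        }
        return total;
    }
}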
6 EXPERIMENTS
We conducted a series of experiments on the real datasets described in Table 2: the click-stream datasets BMS-WebView-1 and BMS-WebView-2, and the AOL search query logs. We implemented all methods in Java and performed the experiments on an Intel Core i7-4702MQ CPU at 2.2 GHz with 8 GB of RAM. The goal of these experiments is to:
- evaluate the privacy breach, represented by the number of vulnerable records, using the quantitative privacy breach detection algorithm on the aforementioned datasets with various attackers (strong, moderate, and weak);
- evaluate the performance of the quantitative privacy breach detection algorithm.
Table 2: Dataset properties.
Dataset        | # of distinct individuals | # of distinct items
AOL            | 65517                     | 1216655
BMS-WebView-1  | 59602                     | 497
BMS-WebView-2  | 77512                     | 3340
Figure 4: Privacy breach and performance evaluations. (a) Strong attacker privacy breach evaluation; (b) moderate attacker privacy breach evaluation; (c) weak attacker privacy breach evaluation.

The tests were performed using three types of attackers, strong, moderate, and weak, each of which has a background knowledge simulated as follows:

Strong Attacker: the background knowledge of this attacker consists of itemsets of size 2 (m = 2) picked from the 2-combinations of each and every record in the dataset. Since it is a strong attacker, we assume that these itemsets contain the items that are involved in the association that will lead to a cover problem (i.e., well picked). Therefore, a privacy breach occurs whenever a cover problem is found.

Moderate Attacker: the background knowledge of this attacker consists of itemsets of size 2 (m = 2) chosen at random from the 2-combinations of a strict subset of D (records in T).

Weak Attacker: to simulate the background knowledge of this attacker, we chose at random 10 items from the dataset (for each test) and another 10 items from WordNet (Miller, 1995) and generated their 2-combinations. This gave us a background knowledge B of size 190.
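The reported size of B for the weak attacker follows directly from the number of 2-combinations of the 20 sampled items:

\[ \binom{20}{2} = \frac{20 \times 19}{2} = 190. \]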
Privacy Breach Evaluation. In this test, we evaluate the privacy breach caused by the cover problem using the quantitative privacy breach detection algorithm. Three test cases were studied. In each of the test cases, the algorithm evaluates the aforementioned datasets by varying the maximum cluster size with fixed k = 3 and m = 2, according to the attacker's background knowledge. The results are shown in Figure 4. It is not surprising that for strong attackers the privacy breach is most explicit. In fact, the strong attackers' background knowledge, consisting of all the itemsets of size m = 2, is linked to each of the individuals in the dataset. Therefore, whenever a cover problem is found, a privacy breach is noted. In addition, we can notice that for the strong attacker, as exhibited in the three datasets, the privacy breach increases linearly with the cluster size: larger clusters include more cover problems and thus more privacy breaches. This is not reflected by moderate and weak attackers. These attackers have background knowledge generated from a strict subset of randomly chosen items from the dataset and from other items from external sources. Moreover, and unlike the strong attackers' background knowledge, moderate and weak attackers' background knowledge is linked to only a subset of records in the datasets. This fact justifies the low number of violations found in the dataset for these attackers.

Figure 5: Quantitative privacy breach detection algorithm performance evaluation.
Performance Evaluation. Here, we evaluate the performance of the quantitative privacy breach detection algorithm on the BMS-WebView-1 dataset for the three types of attackers: strong, moderate, and weak. The results shown in Figure 5 demonstrate that the algorithm performs in a generally stable fashion when increasing the cluster size. The best performance is obtained when dealing with strong attackers: the algorithm treats every cover problem as a privacy breach and therefore, unlike for moderate and weak attackers, there is no need to search the attackers' background knowledge for the items involved in the association that led to a cover problem.
7 CONCLUSION
In this article, we have proposed a formal model of set-valued data and the disassociation technique. We have defined the cover problem and examined its effects on the disassociation of the dataset. These investigations have led to the quantitative privacy breach detection algorithm, whose efficiency has been studied. In this way, we have shown to what extent a disassociated dataset can be vulnerable. In the near future, we intend to develop partial suppression in disassociated datasets and to evaluate the gain in utility that disassociation provides.
REFERENCES
al Bouna, B., Clifton, C., and Malluhi, Q. M. (2015a).
Anonymizing transactional datasets. Journal of Com-
puter Security, 23(1):89–106.
al Bouna, B., Clifton, C., and Malluhi, Q. M. (2015b). Effi-
cient sanitization of unsafe data correlations. In Pro-
ceedings of the Workshops of the EDBT/ICDT 2015
Joint Conference (EDBT/ICDT), Brussels, Belgium,
March 27th, 2015., pages 278–285.
Barbaro, M. and Zeller, T. (2006). A face is exposed for AOL searcher no. 4417749. The New York Times.
Biskup, J., Preuß, M., and Wiese, L. (2011). On the inference-proofness of database fragmentation satisfying confidentiality constraints. In Proceedings of the 14th Information Security Conference, Xi'an, China.
Ciriani, V., Vimercati, S. D. C. D., Foresti, S., Jajodia,
S., Paraboschi, S., and Samarati, P. (2010). Combin-
ing fragmentation and encryption to protect privacy in
data storage. ACM Trans. Inf. Syst. Secur., 13:22:1–
22:33.
Cormode, G., Li, N., Li, T., and Srivastava, D. (2010). Mini-
mizing minimality and maximizing utility: Analyzing
method-based attacks on anonymized data. In Pro-
ceedings of the VLDB Endowment, volume 3, pages
1045–1056.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006).
Calibrating noise to sensitivity in private data analysis.
In Proceedings of the Third Conference on Theory of
Cryptography, TCC’06, pages 265–284, Berlin, Hei-
delberg. Springer-Verlag.
Fard, A. M. and Wang, K. (2010). An effective cluster-
ing approach to web query log anonymization. In
Security and Cryptography (SECRYPT), Proceedings
of the 2010 International Conference on, pages 1–11.
IEEE.
He, Y. and Naughton, J. F. (2009). Anonymization of set-
valued data via top-down, local generalization. Proc.
VLDB Endow., 2(1):934–945.
Jia, X., Pan, C., Xu, X., Zhu, K., and Lo, E. (2014). ρ-uncertainty anonymization by partial suppression. In
Bhowmick, S., Dyreson, C., Jensen, C., Lee, M.,
Muliantara, A., and Thalheim, B., editors, Database
Systems for Advanced Applications, volume 8422 of
Lecture Notes in Computer Science, pages 188–202.
Springer International Publishing.
Kifer, D. (2009). Attacks on privacy and de Finetti's theorem.
In SIGMOD Conference, pages 127–138.
Li, T., Li, N., Zhang, J., and Molloy, I. (2012). Slicing: A
new approach for privacy preserving data publishing.
IEEE Trans. Knowl. Data Eng., 24(3):561–574.
Loukides, G., Liagouris, J., Gkoulalas-Divanis, A., and
Terrovitis, M. (2014a). Disassociation for electronic
health record privacy. Journal of Biomedical Infor-
matics, 50:46–61.
Loukides, G., Liagouris, J., Gkoulalas-Divanis, A., and
Terrovitis, M. (2014b). Disassociation for electronic
health record privacy. Journal of Biomedical Infor-
matics, 50(0):46–61. Special Issue on Informatics
Methods in Medical Privacy.
Loukides, G., Liagouris, J., Gkoulalas-Divanis, A., and
Terrovitis, M. (2015). Utility-constrained electronic
health record data publishing through generalization
and disassociation. In Gkoulalas-Divanis, A. and
Loukides, G., editors, Medical Data Privacy Hand-
book, pages 149–177. Springer International Publish-
ing.
Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasub-
ramaniam, M. (2006). l-diversity: Privacy beyond k-
anonymity. In Proceedings of the 22nd IEEE Interna-
tional Conference on Data Engineering (ICDE 2006),
Atlanta Georgia.
Miller, G. A. (1995). Wordnet: A lexical database for en-
glish. Commun. ACM, 38(11):39–41.
Ressel, P. (1985). De Finetti-type theorems: an analytical
approach. Ann. Probab., 13(3):898–922.
Samarati, P. (2001). Protecting respondents’ identities in
microdata release. IEEE Trans. Knowl. Data Eng.,
13(6):1010–1027.
Sweeney, L. (2001). Computational disclosure control - a
primer on data privacy protection. Technical report,
Massachusetts Institute of Technology.
Sweeney, L. (2002). k-anonymity: a model for protecting
privacy. International Journal on Uncertainty, Fuzzi-
ness and Knowledge-based Systems, 10(5):557–570.
Terrovitis, M., Mamoulis, N., and Kalnis, P. (2008).
Privacy-preserving anonymization of set-valued data.
PVLDB, 1(1):115–125.
Terrovitis, M., Mamoulis, N., Liagouris, J., and Skiadopou-
los, S. (2012). Privacy preservation by disassociation.
Proc. VLDB Endow., 5(10):944–955.
Wong, R. C.-W., Fu, A. W.-C., Wang, K., and Pei, J. (2007).
Minimality attack in privacy preserving data publish-
ing. In VLDB, pages 543–554.
Wong, R. C.-W., Fu, A. W.-C., Wang, K., Yu, P. S., and Pei,
J. (2011). Can the utility of anonymized data be used
for privacy breaches? ACM Trans. Knowl. Discov.
Data, 5(3):16:1–16:24.
Xiao, X. and Tao, Y. (2006). Anatomy: Simple and effective
privacy preservation. In Proceedings of 32nd Interna-
tional Conference on Very Large Data Bases (VLDB
2006), Seoul, Korea.