Randomized Addition of Sensitive Attributes for l-diversity

Yuichi Sei and Akihiko Ohsuga

Graduate School of Information Systems, The University of Electro-Communications, Tokyo, Japan

Keywords:

Privacy, Data mining, Anonymization.

Abstract:

When a data holder wants to share databases that contain personal attributes, individual privacy needs to be

considered. Existing anonymization techniques, such as l-diversity, remove identiﬁers and generalize quasi-

identiﬁers (QIDs) from the database to ensure that adversaries cannot specify each individual’s sensitive at-

tributes. Usually, the database is anonymized based on one-size-ﬁts-all measures. Therefore, it is possible

that several QIDs that a data user focuses on are all generalized, and the anonymized database has no value

for the user. Moreover, if a database does not satisfy the eligibility requirement, we cannot anonymize it by

existing methods. In this paper, we propose a new technique for l-diversity, which keeps QIDs unchanged and

randomizes sensitive attributes of each individual so that data users can analyze it based on QIDs they focus

on and does not require the eligibility requirement. Through mathematical analysis and simulations, we will

prove that our proposed method for l-diversity can result in a better tradeoff between privacy and utility of the

anonymized database.

1 INTRODUCTION

In recent years, numerous organizations have begun

to provide services related to Cloud computing that

collect a great deal of personal information. This per-

sonal information can be shared with other organiza-

tions so that they can subsequently create new ser-

vices. However, the data holder should not publish

any information that identiﬁes an individual’s sensi-

tive attributes.

A lot of research studies regarding anonymized

databases of personal information have been proposed

(e.g., (Samarati, 2001; LeFevre et al., 2008; Wu et al.,

2013)). Most of existing methods consider that the

data holder has a database of the form explicit iden-

tiﬁers, quasi-identiﬁers (QIDs), sensitive attributes,

where explicitly identiﬁers are attributes that explicit

identify individuals (e.g., name and phone number),

QIDs are attributesthat could be potentially combined

with public directories to identify individuals (e.g.,

zip code, age), and sensitive attributes are personal

attributes of private nature (e.g., disease) (Fung et al.,

2010).

The l-diversity technique (Machanavajjhala et al.,

2007), which enhances k-anonymity (LeFevre et al.,

2006), is one of the major anonymization techniques.

This technique removes explicit identiﬁers and gen-

eralizes QIDs to ensure that the adversaries cannot

specify each individual’s sensitive values with con-

ﬁdence greater than 1/l. For example, Table 1 shows

the original patient database that a hospital wants to

publish. Consider that Name is the explicit identiﬁer,

Sex, Age, Address, and Job are the QIDs, and Dis-

ease is the sensitive attribute. Suppose that an adver-

sary knows that Becky is included in the database and

knows Becky’s QIDs. From the database, the adver-

sary can identify Becky’s sensitive value as Sty with

100% conﬁdence, even if we remove all the Names

from Table 1.

Table 2 shows one result of the l-diversity tech-

nique where l is set to 2. Even if the adversary

knows that Becky is included in Table 2 and knows

Becky’s QIDs, he cannot know whether Fever or Sty

is Becky’s disease. That is, he cannot identify Becky’s

sensitive value with greater than 50% (=1/2) conﬁ-

dence. Table 3 is another result of the l-diversity tech-

nique. If Tables 2 and 3 are both published, the adver-

sary can identify Becky’s disease as Sty with 100%

conﬁdence, so the data holder usually anonymizes Ta-

ble 1 once and publishes the anonymized table (i.e.,

either Table 2 or Table 3 only) to the all data users.

Therefore, the data holder should anonymize the

database based on one-size-ﬁts-all measures. Sup-

pose that two data users, A and B, receive Table 2, and

suppose that data user A wants to know the difference

between male and female, while data user B wants to

350

Sei Y. and Ohsuga A..

Randomized Addition of Sensitive Attributes for l-diversity.

DOI: 10.5220/0005058203500360

In Proceedings of the 11th International Conference on Security and Cryptography (SECRYPT-2014), pages 350-360

ISBN: 978-989-758-045-1

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Table 1: Patient table.

Name

Sex Age Address Job Disease

Alex M 41 13000 Artist Fever

Becky

F 41 17025 Artist Sty

Carl

M 50 13021 Writer Cancer

Diana

F 51 14053 Nurse HIV

Edward

M 68 15000 Writer Chill

Flora

F 69 16022 Nurse HIV

Greg

M 72 13001 Artist Cut

Hanna

F 77 17001 Artist Cancer

Table 2: 2-diversity by existing method.

Sex

Age Address Job Disease

* 41 13*-17* Artist Fever

41 13*-17* Artist Sty

50-51 13*-14* * Cancer

50-51 13*-14* * HIV

68-69 15*-16* * Chill

68-69 15*-16* * HIV

72-77 13*-17* Artist Cut

72-77 13*-17* Artist Cancer

Table 3: 2-diversity by existing method (2).

Sex

Age Address Job Disease

M 41-72 13* Artist Fever

41-77 17* Artist Sty

50-68 13*-15* Writer Cancer

51-69 14*-16* Nurse HIV

50-68 13*-15* Writer Chill

51-69 14*-16* Nurse HIV

41-72 13* Artist Cut

41-77 17* Artist Cancer

Table 4: 2-diversity by proposed method.

Sex

Age Address Job Disease

M 41 13000 Artist {Fever, Flu}

41 17025 Artist {Fever, Sty}

50 13021 Writer {Cold, Cancer}

51 14053 Nurse {HIV, Pus}

68 15000 Writer {Chill, Cut}

69 16022 Nurse {Cold, HIV}

72 13001 Artist {Cut, Fever}

77 17001 Artist {Cancer, Flu}

know the differences among age. Although data user

B may know the differences among age from Table

2, data user A cannot know any differences between

sexes because the values of sex are all generalized.

On the other hand, suppose that data users A and B

receive Table 3 rather than Table 2. In this case, data

user A can analyze Table 3, but data user B cannot an-

alyze it effectively because ages are too generalized.

In this paper, we deal with this speciﬁc problem.

Our proposed method keeps QIDs unchanged and

adds l − 1 dummy values into sensitive attributes in

order to satisfy l-diversity. The resulting anonymized

database is shown in Table 4. Because all QIDs

are not changed, all data users can analyze the

anonymized database in theirpreferred ways. We pro-

pose a protocol that estimates the original data dis-

tribution of sensitive values according to each data

user’s preferred QIDs from the anonymized database,

as well as an anonymization protocol for the data

holder. If the number of records in the table is large,

data users can analyze it with a high degree of accu-

racy by the proposed method.

It is worth noting that we need not protect QIDs

when we use k-anonymity or l-diversity as pri-

vacy models. Although QIDs of databases may be

anonymized as a result of anonymization, most of ma-

jor privacy models such as k-anonymity, l-diversity

and t-closeness offer no guarantee of anonymizing

QIDs because their aims are not to protect QIDs but

to protect sensitive attributes.

Moreover, one major limitation of existing tech-

niques for l-diversity is the eligibility requirement,

which requires that “at most |T|/l records of T carry

the same sensitive value” where |T| represents the

number of records of an original database T (Xiao

et al., 2010). Our proposed method does not require

the eligibility requirement, therefore our method can

be applied to any databases when l is less than the

number of distinct sensitive values.

The rest of this paper is organized as follows. Sec-

tion 2 presents the models of applications and attacks.

Section 3 deﬁnes privacy and utility as used in this

paper. Section 4 discusses the related methods. Sec-

tion 5 presents the design of our algorithm. The re-

sults of a mathematical analysis and our simulations

are presented in Section 6, and Section 7 concludes

the paper.

2 ASSUMPTIONS

2.1 Application Model

A data holder has a database that contains personal in-

formation. Such information consists of explicit iden-

tiﬁers, QIDs, and sensitive attributes as described in

Section 1. The database can contain non-sensitive at-

tributes that are not explicit identiﬁers, QIDs, or sen-

sitive attributes. We can publish these non-sensitive

attributes without them being anonymized if they are

important for data analysis.

We assume that the data holder wants to publish

the database for collaborating with data users. Be-

RandomizedAdditionofSensitiveAttributesforl-diversity

351

Table 5: l-diversity does not protect any values of QIDs

when the original database satisﬁes l-diversity (here, l=2).

Sex

Age Address Job Disease

M 31 13000 Artist Fever

31 13000 Artist Sty

52 13021 Writer Cancer

52 13021 Writer HIV

cause of privacy concerns, the data holder wants to

anonymize the database in order to ensure that any

adversaries cannot identify each individual’s sensitive

attribute.

In this paper,we assume that the data user wants to

know what kinds of people tend to have certain sensi-

tive values. As a result, we might ﬁnd that older men

are more likely to get lung cancer than young women

if the anonymized table contains age and sex as QIDs

(or non-sensitive attributes) and disease as a sensitive

attribute.

We assume that we must protect sensitive at-

tributes, but we do not need to protect QIDs. This

assumption is widely accepted in research studies re-

lated to k-anonymity, l-diversity, and so on. If there

are l records that have the same values of QIDs,

these records are not anonymized in any methods

for l-diversity. For example, if a data holder want

to anonymize Table 5, the values of QIDs are not

anonymized at all when l=2.

If we want to protect some of the QIDs, then we

can treat them as sensitive attributes.

2.2 Attack Model

We assume that the data holder is an honest entity, but

the data users may be malicious entities. Moreover,

we assume that the data users may collude with each

other to identify each individual’s sensitive attributes.

3 NOTATIONS AND

MEASUREMENTS

3.1 Notations

Let T and T

∗

denote the original and anonymized

database respectively. Let N denote the number of

records of T or T

∗

and let r

and r

∗

denote the i-th

record of T and T

∗

, respectively.

S represents the domain of possible sensitive at-

tributes that can appear in the database and s

means

the i-th value of S (i = 0, .. .,|S| −1). Let Q be the set

of QIDs that a data user wants to analyze and let Q(i)

be the i-th QID of Q (i = 0,...,|Q| − 1).

Let Q(i) denote the domain of possible values that

can appear in Q(i) and let q(i)

denote the j-th value

of Q(i).

For example, S is {Fever, Sty, ...} in Table 1. Q

is {Sex}, Q(0) is Sex, Q(0) is {M, F} and q(0)

M and q(0)

is F if a data user wants to know the

difference between the sexes (male and female). In

this case, the data user might ﬁnd that people who are

men have the highest risk of getting cancer.

For another example, if the data user wants to an-

alyze sensitive attributes based on the combination of

sexes and categories of age ([0-9], [10-19], ..., [90-

99]), Q is {Sex, Age}, Q(0) is {M, F} and Q(1) is

{[0-9], .. ., [90-99]}. In this case, the data user might

ﬁnd that men ages between 30-39 have the highest

risk of getting cancer.

The value of a sensitive attribute of a record r

is represented by E(r). For example, E(r

) returns

Fever in Table 1. In a similar way, E(r

∗

) returns

{Fever, Flu} in Table 4.

Let C denote all the combinations of the elements

of Q(0),...,Q(|Q| − 1):

C = Q(0) × Q(1) × ...× Q(|Q| − 1). (1)

Let c

denote the i-th element of C (i = 0,...,|C| −

1). For example, if the data user wants to analyze

sensitive attributes based on sexes and categories of

age ([0-9], .. ., [90-99]), c

is (M, [0-9]), c

is (M,

[10-19]), . . ., c

|C|−1

is (F, [90-99]).

3.2 Measurement of Privacy

Many existing papers use l-diversity as a privacymea-

surement (Machanavajjhala et al., 2007; Xiao et al.,

2010; Cheong, 2012). Althoughthere are several vari-

ations of l-diversity, we use a simple interpretation.

Deﬁnition 1 (QID group): We denote a set of

records that has same values of all QIDs as a QID

group.

For example, the ﬁrst and second records in Table

2 are a QID group because their values of QIDs are

the same (*, 41, 13*-17*, Artist).

Deﬁnition 2 (l-diversity): The anonymized ta-

ble T

∗

satisﬁes l-diversity if the relative frequency of

each of the sensitive values does not exceed 1/l for

each QID group of T

∗

This deﬁnition of l-diversity is also widely used,

such as in the cases of (Kenig and Tassa, 2011; Xiao

and Tao, 2006; Nergiz et al., 2013).

For example, all QID groups of Tables 2 and 3

have 2 records and 2 distinct sensitive values. There-

fore, these tables satisfy 2-diversity. Then, see Ta-

ble 4. Each QID group has only one record, but each

record has 2 distinct sensitive values. We consider

Table 4 also satisﬁes 2-diversity in this paper.

SECRYPT2014-InternationalConferenceonSecurityandCryptography

352

3.3 Measurement of Utility

A data user estimates the distribution of sensitive at-

tributes for each c

(i = 0,...,|C| − 1) as described in

Section 2. Let V

i, j

denote the number of records that

are categorized to c

according to their QIDs and have

a sensitive attribute s

in the original table. Also, let

i, j

denote the estimated number of values of V

i, j

for

the data user.

Let N

denote the number of records that are cat-

egorized to c

according to their QIDs. We can use

the Mean Squared Errors (MSE) between

i, j

and

i, j

to quantify the utility for category c

(i) =

|S|

|S|−1

∑

j=0

(

i, j

−

i, j

)

(2)

This utility measurement is widely used in many

studies (Xie et al., 2011; Huang and Du, 2008; Groat

et al., 2012).

4 RELATED WORK

Many algorithms for k-anonymity have been pro-

posed. The basic deﬁnition of k-anonymity is as fol-

lows:

If one record in the table has some value QID,

at least k − 1 other records also have the value QID

(Fung et al., 2010).

Due to the fact that ﬁnding an optimal k-

anonymity is NP-hard (Meyerson and Williams,

2004), many existing algorithms for k-anonymity

search for better anonymization through heuristic ap-

proaches.

Dataﬂy (Sweeney, 2002) is a bottom-up greedy

approach and it generalizes QIDs until all combi-

nations of QIDs appear at least k times. Mondrian

(LeFevre et al., 2006), which is an efﬁcient top-down

greedy approach, is widely used in many other studies

as a base method.

Although k-anonymity can protect individual

identities, there are times when it cannot protect indi-

viduals’ sensitive attributes. For example, the fourth

and sixth rows in Table 3 are an equivalence class be-

cause their QIDs are all the same (F, 51-69, 14*-16*,

Nurse). Suppose that an adversary knows Diana’s

QIDs and she is included in Table 3. The adversary

cannot identify which row of the fourth and sixth rows

is Diana’s, but the adversary can identify Diana’s sen-

sitive attribute is HIV because the sensitive attributes

of both the rows are HIV.

l-diversity (Machanavajjhala et al., 2007) ensures

that the probability of identifying an individual’s sen-

sitive attribute is less than or equal to 1/l. There

are many research studies related to l-diversity (e.g.,

(Kabir et al., 2010; Cheong, 2012)).

Although many algorithms for k-anonymity or l-

diversity other than the research studies previously

mentioned havebeen proposed, most of all algorithms

keep sensitive attributes unchanged and generalize

QIDs. Changing QIDs leads to the problems men-

tioned in Section 1.

If all data users analyze the anonymized table in

the same way or if we believethat all data users do not

collude with each other for identifying an individual’s

sensitive attributes, then we can use workload-aware

anonymization techniques (LeFevre et al., 2008).

However, in the situation described in Section 2, we

need other anonymization techniques that do not rely

on those assumptions.

Xiao et al. (Xiao and Tao, 2006) and Sun et al.

(Sun et al., 2009) proposed an algorithm based the

Anatomy technique. Their algorithms publish two ta-

bles; one is a non-sensitive table that contains QIDs

without change and the other is a sensitive table.

By joining of the two tables, data users can analyze

them. Because the resulting table have many dummy

records (e.g., several times of the number of original

records), such anonymized data may become useless

for data analysis.

TP (Xiao et al., 2010) is the algorithm with

a non-trivial bound on information loss. The au-

thors compared TP with other techniques for l-

diversity (single-dimensional generalization method

TDS (Fung and Yu, 2005) and multi-dimensionalgen-

eralization method Hilbert (Ghinita, 2007)) and they

proved that TP outperforms these single- or multi-

dimensional methods.

We compare Anatomy and TP with our proposed

method in Section 6.

To the best of our knowledge, all existing tech-

niques for l-diversity requires the eligibility require-

ment as described in Section 1. Therefore, if

more than |T|/l records of original database T have

the same sensitive value, we cannot anonymize the

database.

The privacy measure of t-closeness (Li et al.,

2007) is stronger than l-diversity. It requires that the

distribution of sensitive values in each set of records

that have the same QID values should be close to the

distribution of the whole database.

Differential privacy (Dwork, 2006; Domingo-

Ferrer, 2013) makes user data anonymous by adding

noise to a dataset so that an attacker cannot deter-

mine whether or not a particular point of user data

is included. Although differential privacy represents

one of the strongest privacy models (Nikolov et al.,

2013), the data holder cannot publish the anonymized

RandomizedAdditionofSensitiveAttributesforl-diversity

353

database. The data holder should manage the original

database and respond to each data user’s query every

time. Therefore, the data holder’s cost is high, data

users cannot analyze the anonymized database freely,

and it is vulnerable to malicious intrusion (Clifton and

Anandan, 2013). In this paper, we assume that we

need anonymization techniques other than differential

privacy because of these limitations.

5 PROPOSAL

Our proposed method consists of two steps; random-

ization for the data holder and reconstruction for the

data user.

For the data holder, we generate and add l − 1 val-

ues randomly for a sensitive attribute of each record

in the original database.

For the data user, the user ﬁrst determines which

QIDs should be analyzed in terms of the relationships

between the QIDs and the sensitive attributes. Then,

the number of records is estimated in each combina-

tion of the QIDs and the sensitive attribute. For exam-

ple, suppose that the domain of the sensitive attribute

contains HIV, Fever, and Cancer. Also, suppose that

the data user wants to analyze the difference between

male and female. In this case, we should estimate

each number of records in which a sensitive attribute

is HIV, Fever, and Cancer, and where Sex is Male and

Female, respectively.

5.1 Anonymization Protocol

We generate and add distinct l − 1 values randomly

for a sensitive attribute of each record in the original

database for the data holder. Algorithm 1 shows the

anonymization protocol.

The function rand(S) returns an element of S ran-

domly. The data holder executes Algorithm 1 for each

record.

THEOREM 5.1. Algorithm 1 can always generate

l-diversity Table T

∗

from an original Table T if |S| is

larger than or equal to l.

Proof. After conducting Algorithm 1 for Table T,

the algorithm generates an anonymized Table T

∗

which each record has l distinct sensitive values. Sup-

pose that a QID group has δ (δ = 1,...,N) records of

∗

. The possible maximum number of occurrences of

each sensitive value within the δ records is δ because

each sensitive value does not appear more than once

in a record. On the other hand, the total number of

sensitive values in the QID group is δ × l if we use

Algorithm 1: Anonymization protocol for record r.

Input: Domain of a sensitive attribute S, Privacy

level l

Output: Set of anonymized sensitive values for

record r

1: Creates empty set R

2: R ⇐ R∪ {E(r)}

/*Generates and adds dummy sensitive at-

tributes*/

3: while |R| < l do

4: R ⇐ R ∪ {rand(S)}

5: end while

6: return R

Algorithm 1 because each record has l sensitive val-

ues, thus, the possible maximum relative frequency

of each of the sensitive values in the QID group is

1/l.

5.2 Estimation Protocol

The data user who received the anonymized table es-

timates the distribution of sensitive attributes for each

(i = 0,... , |C| − 1). We use V

i, j

to represent the ac-

tual number of records, which are categorized to c

and have a sensitive attribute s

. Let

i, j

denote the

estimated V

i, j

First, the data user counts how many records that

are categorized c

occur in the anonymized database,

that is,

N−1

∑

h=0

G(r

∗

), where G(r

∗

,c) =

(

1 (r

∗′

s QIDs are categorized to c)

0 (otherwise)

(3)

In a similar way, the data user counts how many

records that are categorized to c

and have a set of

sensitive attributes that contain s

in the anonymized

database:

i, j

N−1

∑

h=0

H(r

∗

),where H(r

∗

,c, s) =

(

1 (r

∗′

s QIDs are categorized to c and s ∈ E(r

∗

))

0 (otherwise)

(4)

If a record r has s

as its original sensitive at-

tribute, then s

is included in E(r

∗

) with probability

1 and s

s.t., j 6= j

is included in E(r

∗

) with proba-

bility

l − 1

|S| − 1

. (5)

SECRYPT2014-InternationalConferenceonSecurityandCryptography

354

Let the

i, j

be the maximum likelihood estimate

of W

i, j

. Then, we have

i, j

= N

−

∑

6= j

(1− P

)

i, j

. (6)

The data user constructs a linear system of equa-

tions in |S| variables of

i, j

( j = 0,... , |S| − 1) from

Equation 6 for each c

. Here, we set the value of W

i, j

calculated from T

∗

by Equation 4 to

i, j

in Equa-

tion 6. By solving the linear system of equations,

the data user gets

i, j

for each i = 0,...,|C| − 1 and

j = 0,...,|S| − 1. Each value of

i, j

is an unbiased

maximum likelihood estimate of V

i, j

We show this estimation protocol in Algorithm 2.

A data user executes this algorithm for each c

(i =

0,...,|C| − 1) based on his own C.

Algorithm 2: Server protocol for c

Input: Anonymized database T

∗

, Domain of a sensi-

tive attribute S

Output: Distribution of sensitive attributes in c

1: Creates Array W,

/*Calculates N

2: N

⇐ result calculated by Eq. 3

/*Calculates each W

i, j

3: for j = 0, ...,|S| − 1 do

4: W

i, j

⇐ result calculated by Eq. 4

5: end for

/*Calculates P

6: P

⇐ result calculated by Eq. 5

/*Creates and solves the linear system of equa-

tions*/

7: Creates Eq. 6 for all j = 0,... , |S| − 1

8: for j = 0, ...,|S| − 1 do

i, j

⇐ each result calculated by the linear sys-

tem of equations

10: end for

11: return

i, j

( j = 0,...,|S| − 1)

5.3 Expectation of MSE

The data user should understand the expectation of the

MSE between the estimated distribution and the origi-

nal distribution of sensitive attributes in each category

(i = 0, . . . ,|C| − 1) for valuable data analysis. We

propose the calculation method even if the data user

does not have the original database.

From Equation 2, the expected value of MSE for

a category c

, i.e., E[σ

], is calculated by

E[σ

] =

|S|

|S|−1

∑

j=0

(

i, j

−

i, j

)

(7)

Because

i, j

is the unbiased estimator of V

i, j

, we

have E[

i, j

] = V

i, j

. Therefore, E[(V

i, j

−

i, j

)

] repre-

sents the variance of

i, j

. LetVar(

i, j

) be the variance

i, j

. The expected value of σ

is calculated by

E[σ

] =

|S|

|S|−1

∑

j=0

Var(

i, j

). (8)

The simultaneous equations created by Equation

6 for all j are the same as the following equation by

deformation.

M ·

where M =







0 1− P

··· 1− P

1− P

0 ··· 1− P

1−P

··· 0







V = (

i,0

i,1

,...,

i,|S|−1

W = (N

−W

i,0

−W

i,1

,...,N

−W

i,|S|−1

(9)

Therefore, we have

= M

−1

. (10)

In general, if X and Y are random variables and a

and b are constant, then we have

Var(aX + bY) = a

Var(X) + b

Var(Y) + 2abCov(X,Y ),

(11)

where Cov(X,Y) represents the covariance be-

tween X and Y. Let M

−1

, j

denote the j

-th row and

-th column element of the inverse matrix of M.

Because we have

i, j

∑

−1

, j

− W

i, j

) from

Equation 10, we get from Equation 11,

Var(

i, j

) =

∑

−1

, j

)

Var(N

−W

i, j

)

∑

, j

( j

6= j

)

−1

, j

−1

, j

Cov(N

−W

i, j

−W

i, j

(12)

Let P

denote the probability that an arbitrary E(r

∗

)

contains s

, let P

, j

denote the probability that it con-

tains s

and s

, let P

denote the probability that

it contains s

but does not contain s

, and let P

, j

denote the probability that it contains s

but does not

RandomizedAdditionofSensitiveAttributesforl-diversity

355

contain s

. We get

Var(N

−W

i, j

) = E[W

i, j

] − E[W

i, j

]

∑

h=0

(1− P

)

−h

· h

− (NP

)

= N

(1− P

Cov(N

−W

i, j

−W

i, j

) =

E[W

i, j

] − E[W

i, j

]E[W

i, j

]

∑

h=0

−h

∑

t=0

−h−t

∑

u=0

, j

· (1− P

, j

− P

, j

)

−h−t−u

−h

−h−t

· (h+ t)(h+ u)

− N

= N

, j

+ (P

, j

+ P

)(P

, j

+ P

, j

)(N

− 1))

− N

(13)

If we assume that each P

is the same and it is

independent of each other, we get

= P

|S|

, j

|S|

l − 1

|S| − 1

= P

, j

|S|

· (1−

l − 1

|S| − 1

)

(14)

We get from Equations 8, 12, 13 and 14

E[σ

] =

|S|

(1−

|S|

)

∑

−1

, j

)

−

l(|S|− l)

(|S| − 1)|S|

∑

, j

6= j

−1

, j

−1

, j

(15)

The inverse matrix of M is represented by

−1

|S|−l







2−|S| 1 . . .

1 2−|S| ...







. (16)

Therefore, we get

∑

−1

, j

)



|S|−l





(2−|S|)

+ |S| −1



(17)

and

∑

, j

6= j

−1

, j

−1

, j



|S|−l



· {2(2−|S|)(|S| − 1) + (|S| − 1)(|S| − 2)}

(18)

As a result, from equations 15, 17, and 18, we get

the expected MSE between the original and estimated

distributions of sensitive attributes in c

;

E[σ

l(|S| − 1)

(|S| − l)|S|

. (19)

5.4 Analysis

5.4.1 Prerequisite for Anonymization

Our proposed method does not require the eligibility

requirement because our method does not generalize

any QIDs nor insert dummy records. Let |S| denote

the number of kinds of sensitive values. If l ≤ |S|, we

can anonymize any databases for l-diversity because

we can add l− 1 sensitive values to each user’s sensi-

tive attribute. Note that setting l > |S| is meaningless

in terms of the deﬁnition of l-diversity.

5.4.2 Cost Analysis

The time complexity of the anonymization protocol at

the data holder is O(l). This protocol is computation-

ally simple as we know from Algorithm 1.

Solving the linear system of equations with |S| di-

mensions is the most expensive step in the estima-

tion protocol. We can solve a linear system of equa-

tions by solving a |S| × |S| matrix equation. The time

complexity of solving a |S| × |S| matrix equation is

O(g|S|

), where g is a parameter of the number of re-

cursive iterations if we use the Gauss-Seidel method.

Since we solved the linear system of equations for

each category c, the resulting time complexity is rep-

resented by O(g|S|

|C|).

6 EVALUATION

6.1 Mathematical Analysis and

Simulations of Synthetic Data

Figure 1 shows the results of mathematical analysis.

We assume that we have N records categorized to c

that are one of the combinations of elements of QIDs

that a data user wants to analyze. In this subsection,

we calculate MSE by Equation 2 between the distri-

bution of original sensitive attributes and those of es-

timated sensitive attributes in the category c.

In the ﬁrst evaluation, we set the number of

records N to 1,000 and l to 5. Figure 1(a) represents

the results of the MSE. We know from Fig. 1(a) that

the MSE tends to be decreasing with an increase of

|S|.

SECRYPT2014-InternationalConferenceonSecurityandCryptography

356

0.0002

0.0004

0.0006

0.0008

0.001

10 20 30 40 50 60 70 80 90 100

MSE

|S|

(a) MSE vs. |S|

0.002

0.004

0.006

0.008

0.01

2 4 6 8 10 12 14 16 18

MSE

|S|=20 |S|=100

(b) MSE vs. l

0.000001

0.00001

0.0001

0.001

0.01

0.1

1 10 100 1000 10000100000

MSE

Figure 1: Results of mathematical analysis.

Then, we calculate the MSE by changing the pri-

vacy level l. The results are shown in Fig. 1(b). We

changed l from 2 to 18 based on the settings of many

existing studies (Yao et al., 2012; Nergiz et al., 2013;

Hu et al., 2010). In this evaluation, we set the num-

ber of records N to 1,000 and the number of possible

sensitive attributes |S| to 20 and 100. From the ﬁg-

ure, we know that the MSE of our proposed method

increases by increasing l. If we set l to near |S|,

the MSE increases exponentially, although the MSE

keeps a small value if l is much less than |S|. Never-

theless, we show that our proposed method can result

in a small MSE even if we set l to very near |S| (i.e.,

|S| − 1) by simulations of a real data set in the next

subsection.

Next, we conducted an experiment to analyze the

effects of changing the number of records N. Figure

1(c) shows the results of this. In this evaluation, we

set l to 5 and |S| to 20. We know that the MSE reduced

relative to N.

Finally, we conducted simulations by setting pa-

rameters (|S|, l, and N) to the same values in the math-

ematical analysis described above. The underlying

distributions of sensitive attributes were set to Uni-

form distribution, Gaussian distribution, and Poisson

distribution. In all simulations, the results are almost

the same as that of the mathematical analysis.

6.2 Real Data Set Evaluation Results

We evaluated the MSE by using real data sets, which

are OCC, SAL and Adult. OCC and SAL are obtained

from (Minnesota Population Center, ). Many existing

studies such as (Xiao et al., 2010) and (Xiao and Tao,

2006) use this data sets. OCC has an attribute Occu-

pation as a sensitive attribute and has attributes Age,

Gender, Marital Status, Race, Birth Place, Education,

Work Class as QIDs. SAL has an attribute Income

as a sensitive attribute and has the same QIDs as in

OCC. The number of distinct values of each attribute

is shown in Table 6.

We create 7 sets of database, OCC-1, OCC-2, ...,

OCC-7 from OCC. Each database OCC-d has the ﬁrst

d QIDs in Table 6 and a sensitive attribute Occupa-

tion. For example, OCC-3 has Age, Gender, Mari-

tal Status as QIDs and Occupation as a sensitive at-

tribute. Similarly, we also create 7 sets of database

SAL-d (d = 1,...,7) from SAL.

The Adult data set consists of 15 attributes (e.g.,

Age, Sex, Race, Relationship shown in Table 7) and

has 45,222 records after the records with unknown

values are eliminated. Adult data set (UCI Machine

Learning Repository, ) used in many studies on pri-

vacy (Machanavajjhala et al., 2007; Sun et al., 2009).

Wet create 15 sets of database, Adult(1), Adult(2), . ..,

Adult(15)from Adult. Each database Adult(d)has the

d’th attribute in Table 7 as a sensitive attribute and has

other attributes as QIDs. For example, Adult-2 has

Work Class as a sensitive attribute and has Age, Final

Weight, Education, ... as QIDs.

Following (Xiao and Tao, 2006), we consider

a query involves qd random QIDs A

,...,A

, and

the sensitive attribute, where qd represents the query

dimensionality parameter. For example, when the

database is SAL-4 and qd=3, {A

, A

} is a ran-

dom 3 sized subset of {Age, Gender, Marital Status,

Race}. We assume that data users want to know the

number of persons whose attribute A

(i = 1,...,qd) is

the speciﬁc value. Therefore, we create random val-

ues for each A

as the speciﬁc values in each query.

Let b

denote the size of the random values for at-

tribute A

. Following (Xiao and Tao, 2006), the value

of b

is calculated by the expected query selectivity s:

= ⌈|A

| · s

1/(qd+1)

⌉ (20)

where |A

| represents the number of distinct values of

in the original database.

We compared our method with Anatomy and TP,

which were described in Section 4. All experiments

were conducted on an Intel Xeon CPU E5-2687W v2

@ 3.40GHz 3.40GHz personal computer with 12 GB

of RAM.

RandomizedAdditionofSensitiveAttributesforl-diversity

357

0.00E+00

5.00E-05

1.00E-04

1 2

MSE

Proposal TP Anatomy

(a) OCC-3

0.00E+00

2.00E-05

4.00E-05

1 2

MSE

Proposal TP Anatomy

(b) SAL-3

0.00E+00

2.00E-05

4.00E-05

6.00E-05

8.00E-05

1 2 3 4

MSE

Proposal TP Anatomy

0.00E+00

1.00E-05

2.00E-05

3.00E-05

1 2 3 4

MSE

Proposal TP Anatomy

(d) SAL-5

0.00E+00

1.00E-04

2.00E-04

3.00E-04

4.00E-04

1 2 3 4 5 6

MSE

Proposal TP Anatomy

(e) OCC-7

0.00E+00

5.00E-05

1.00E-04

1.50E-04

2.00E-04

1 2 3 4 5 6

MSE

Proposal TP Anatomy

(f) SAL-7

Figure 2: MSE vs. query dimensionality qd.

0.00E+00

1.00E-05

2.00E-05

3.00E-05

4.00E-05

5 10 15

MSE

Proposal TP Anatomy

(a) OCC-5

0.00E+00

2.00E-05

4.00E-05

6.00E-05

5 10 15

MSE

Proposal TP Anatomy

(b) SAL-5

Figure 3: MSE vs. l.

We set l to 10, s to 0.07, and qd to 3 as a default

value. In each simulation, we generated 1,000random

queries based on qd and Equation 20, and calculate

MSEs based on Equation 2.

In the ﬁrst simulation with OCC and SAL, we con-

ducted an analysis to determine how the number of

QIDs and qd affect the MSE. We changed the number

of QIDs d from 3 to 7 and changed qd from 1 to d −1.

Figure 2 shows the results. Figures 2(a), 2(c) and 2(e)

are the results of OCC, and Figs. 2(b), 2(d) and 2(f)

are the results of SAL. When the number of QIDs d

is 3, the MSEs of our proposed method and TP are

almost the same. On the other hand, in the results of

other settings, the MSEs of our proposed method is

much fewer than other methods.

In the next simulation, we changed l from 5 to 15.

Figure 3 shows the results. The value of l is large,

the MSEs are also large. However, we know from the

Figs. 3(a) and 3(b) that our proposed method has the

smallest MSE among the three methods.

Then, we changed s from 0.4 to 1.0 and calculated

the MSEs in each setting. The results are shown in

Fig 4. The MSEs decreases as s grows higher. This is

0.00E+00

2.00E-05

4.00E-05

6.00E-05

8.00E-05

1.00E-04

0.04 0.09

MSE

Proposal TP Anatomy

(a) OCC-5

0.00E+00

1.00E-05

2.00E-05

3.00E-05

4.00E-05

5.00E-05

0.04 0.09

MSE

Proposal TP Anatomy

(b) SAL-5

Figure 4: MSE vs. selectivity s.

0.2

0.4

0.6

0.8

Rao of successful

anonymizaon

(a) Ratio of successful

anonymization

0.002

0.004

0.006

0.008

0.01

Average of MSE

(b) Average of MSE

Figure 5: Result overview of Adult dataset.

because the number of target records of the databases

becomes large when s is large.

In the simulations with Adult(1), Adult(2), ...,

Adult(15), we changed l from 2 to 10 in each sim-

ulation. TP and Anatomy could not anonymize

the database in several simulations because of the

eligibility requirement. The ratio of successful

anonymization is shown in Fig. 5(a). Although

our proposed method (and all existing methods for l-

diversity) cannot anonymize if the number of distinct

sensitive attributes is less than l, our proposed method

does not require the eligibility requirement.

Figure 5(b) shows the average MSEs of all simu-

lations of Adult(1), ..., Adult(15). The ﬁgure helped

us to determine that our method could realize fewer

MSE.

In the following evaluation, we used an attribute

of “relationship” as the sensitive attribute. S is

{Wife, Hasband, Own-child, Unmarried, Not-in-

family, Other-relative}.

We used an attribute of Sex as the QID, that the

data user wants to analyze. That is, Q is {Sex} and

Q(0) is {Male, Female}. Figure 6(a) shows the orig-

inal distributions of S in Male and Female, respec-

tively.

Figures 6(b), 6(c), and 6(d) present the estimated

results of Proposal, TP, and Anatomy, respectively.

We know from the ﬁgures that our proposed method

can estimate the true distribution of S for Male and

Female with the highest precision.

Finally, we measured the execution time of

anonymization and estimation for the Adult dataset.

In the proposed method, it took approximately 4 sec-

SECRYPT2014-InternationalConferenceonSecurityandCryptography

358

Table 6: Number of distinct values of OCC and SAL.

Age

Gender Marital Status Race Birth Place Education Work Class Occupation Income

80 2 6 9 124 24 9 50 50

Table 7: Number of distinct values of Adult data set.

Age

Work Class Final Weight Education Education-num Marital Status Occupation Relationship

74 7 26741 16 16 7 14 6

Race Gender Capital Gain Capital Loss Hours Per Week Country Salary Class

5 2 121 97 96 41 2

0.2

0.4

0.6

0.8

Ra o

Male Female

(a) Original database

0.2

0.4

0.6

0.8

Ra o

Male Female

(b) Estimated from the

anonymized database of

Proposal

0.2

0.4

0.6

0.8

Ra o

Male Female

anonymized database of TP

0.2

0.4

0.6

0.8

Ra o

Male Female

(d) Estimated from the

anonymized database of

Anatomy

Figure 6: Relationship vs. sex in l = 2.

onds. From this result, we know that our proposed

method is a very efﬁcient technique in terms of exe-

cution time.

7 CONCLUSION

An anonymization technique of l-diversity can

achieve a privacy-preserving model where a data

holder can share their data with other data users. Ex-

isting techniques generalize QIDs from the database

to ensure that adversaries cannot specify each individ-

ual’s sensitive attribute. However, it is possible that

several QIDs that a data user focuses on are all gener-

alized and the anonymized database has no value for

the data user.

In this paper, we propose a new technique for l-

diversity, which keeps QIDs unchanged and random-

izes sensitive attributes of each individual so that data

users can analyze it based on QIDs that they focus on.

We also proposed an estimation protocol of the orig-

inal distribution of sensitive attributes for data users

according to their preferred QIDs, and, we proposed

the calculation method for the expected MSE between

the original and estimated distribution of sensitive

attributes. Moreover, our method does not require

the eligibility requirement, therefore, our method can

be applied to any databases when l is less than the

number of distinct sensitive values. By mathemati-

cal analysis and simulations, we prove that our pro-

posed method can result in a better tradeoff between

privacy and utility of anonymized databases than ex-

isting studies.

Future work will include the evaluation of other

relevant data sets. We also plan to extend our ap-

proach to other privacy measures such as t-closeness.

ACKNOWLEDGEMENTS

This research was subsidized by JSPS 24300005,

26330081, 26870201. The authors would also like

to express their deepest appreciation to the entire staff

of Professor Honiden’s laboratory at the University of

Tokyo and Professor Fukazawa’s lab at Waseda Uni-

versity, both of whom provided helpful comments and

suggestions.

REFERENCES

Cheong, C. H. (2012). Non-Centralized Distinct L-

Diversity. International Journal of Database Man-

agement Systems, 4(2):1–21.

Clifton, C. and Anandan, B. (2013). Challenges and Op-

portunities for Security with Differential Privacy. In

Information Systems Security, pages 1–13. Springer.

Domingo-Ferrer, J. (2013). On the Connection between t-

Closeness and Differential Privacy for Data Releases.

RandomizedAdditionofSensitiveAttributesforl-diversity

359

In Proc. International Conference on Security and

Cryptography (SECRYPT), pages 478–481.

Dwork, C. (2006). Differential Privacy. In Automata, Lan-

guages and Programming, volume 4052 of Lecture

Notes in Computer Science, pages 1–12. Springer.

Fung, B. and Yu, P. (2005). Top-Down Specialization for

Information and Privacy Preservation. In Proc. IEEE

ICDE, pages 205–216.

Fung, B. C. M., Wang, K., Chen, R., and Yu, P. S.

(2010). Privacy-preserving data publishing: A sur-

vey of recent developments. ACM Computing Sur-

veys, 42(4):1–53.

Ghinita, G. (2007). Fast Data Anonymization with Low

Information Loss. In Proc. VLDB, pages 758–769.

Groat, M. M., Edwards, B., Horey, J., He, W., and Forrest,

S. (2012). Enhancing privacy in participatory sens-

ing applications with multidimensional data. In Proc.

IEEE PerCom, pages 144–152.

Hu, H., Xu, J., On, S. T., Du, J., and Ng, J. K.-Y. (2010).

Privacy-aware location data publishing. ACM Trans.

Database Systems, 35(3):1–42.

Huang, Z. and Du, W. (2008). OptRR: Optimizing Random-

ized Response Schemes for Privacy-Preserving Data

Mining. In Proc. IEEE ICDE, pages 705–714.

Kabir, M., Wang, H., Bertino, E., and Chi, Y. (2010). Sys-

tematic clustering method for l-diversity model. In

ADC, volume 103, pages 93–102.

Kenig, B. and Tassa, T. (2011). A practical approximation

algorithm for optimal k-anonymity. Data Mining and

Knowledge Discovery, 25(1):134–168.

LeFevre, K., DeWitt, D., and Ramakrishnan, R. (2006).

Mondrian Multidimensional K-Anonymity. In Proc.

IEEE ICDE, pages 25–25.

LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2008).

Workload-aware anonymization techniques for large-

scale datasets. ACM Trans. Database Systems,

33(3):1–47.

Li, N., Li, T., and Venkatasubramanian, S. (2007).

t-closeness: Privacy beyond k-anonymity and l-

diversity. In Proc. IEEE ICDE, pages 106–115.

Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkita-

subramaniam, M. (2007). l-diversity: Privacy beyond

k-anonymity. ACM TKDD, 1(1):3–es.

Meyerson, A. and Williams, R. (2004). On the complexity

of optimal K-anonymity. In Proc. ACM PODS, pages

223–228.

Minnesota Population Center. Ipums,

https://www.ipums.org/.

Nergiz, A. E., Clifton, C., and Malluhi, Q. M. (2013). Up-

dating outsourced anatomized private databases. In

Proc. EDBT, page 179. ACM.

Nikolov, A., Talwar, K., and Zhang, L. (2013). The geome-

try of differential privacy: the sparse and approximate

cases. In Proc. ACM STOC, pages 351–360.

Samarati, P. (2001). Protecting respondents identities in mi-

crodata release. IEEE Trans. Knowledge and Data En-

gineering, 13(6):1010–1027.

Sun, X., Wang, H., Li, J., and Ross, D. (2009). Achieving P-

Sensitive K-Anonymity via Anatomy. In Proc. IEEE

International Conference on e-Business Engineering

(ICEBE), pages 199–205.

Sweeney, L. (2002). Achieving k-anonymity privacy pro-

tection using generalization and suppression. In-

ternational Journal of Uncertainty, Fuzziness and

Knowledge-Based Systems, 10(05):571–588.

UCI Machine Learning Repository. Adult Data Set,

http://archive.ics.uci.edu/ml/datasets/Adult.

Wu, S., Wang, X., Wang, S., Zhang, Z., and Tung,

A. K. (2013). K-Anonymity for Crowdsourcing

Database. IEEE Trans. Knowledge and Data Engi-

neering, PP(99).

Xiao, X. and Tao, Y. (2006). Anatomy: simple and effective

privacy preservation. In Proc. VLDB, pages 139–150.

Xiao, X., Yi, K., and Tao, Y. (2010). The hardness and

approximation algorithms for l-diversity. In Proc.

EDBT, pages 135–146.

Xie, H., Kulik, L., and Tanin, E. (2011). Privacy-aware col-

lection of aggregate spatial data. Data & Knowledge

Engineering, 70(6):576–595.

Yao, L., Wu, G., Wang, J., Xia, F., Lin, C., and Wang, G.

(2012). A Clustering K-Anonymity Scheme for Loca-

tion Privacy Preservation. IEICE Trans. Information

and Systems, E95-D(1):134–142.

SECRYPT2014-InternationalConferenceonSecurityandCryptography

360