Managing Distributed Queries under Personalized Anonymity Constraints

Axel Michel¹,², Benjamin Nguyen¹,² and Philippe Pucheral²
¹SDS Team at LIFO, INSA-CVL, Boulevard Lahitolle, Bourges, France
²Petrus team, Inria Saclay & UVSQ, Versailles, France
Keywords:
Data Privacy and Security, Big Data, Distributed Query Processing, Secure Hardware.
Abstract:
The benefit of performing Big Data computations over individuals' microdata is manifold, in the medical, energy or transportation fields to cite only a few, and this interest is growing with the emergence of smart disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, explaining the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. This paper proposes a novel approach that pushes personalized privacy guarantees into the processing of database queries, so that individuals can disclose different amounts of information (i.e. data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure, based on secure hardware, enforcing these personalized privacy guarantees all along the query execution process. A performance analysis conducted on a real platform shows the effectiveness of the approach.
1 INTRODUCTION
In many scientific fields, ranging from medicine to so-
ciology, computing statistics on (often personal) pri-
vate and sensitive information is central to the disci-
pline’s methodology. With the advent of the Web, and
the massive databases that compose it, statistics and
machine learning have become “data science”: their
goal is to turn large volumes of information linked
to a specific individual, called microdata, into knowl-
edge. Big Data computation over microdata is of obvious use to the community: medical data is used to improve the knowledge of diseases and find cures; energy consumption is monitored in smart grids to optimize energy production and resource management. In these applications, real knowledge emerges from the analysis of aggregated microdata, not from the microdata itself¹.

¹We thus do not consider applications such as targeted advertising, which seek to characterize users at the individual level.
Smart disclosure initiatives, pushed by legislators (e.g. the EU General Data Protection Regulation (European Union, 2016)) and industry-led consortiums (e.g. Blue Button and Green Button in the US², Midata³ in the UK, MesInfos⁴ in France), hold the promise of a deluge of microdata of great interest for analysts. Indeed, smart disclosure enables individuals to retrieve their personal data from the companies or administrations that collected them. Current regulations carefully restrict the uses of this data to protect individuals' privacy. However, once the data is anonymized, its processing is far less restricted. This is good news, since in most cases these operations (i.e. global database queries) can provide results of tunable quality when run on anonymized data.

²https://www.healthit.gov/patients-families/your-health-data; https://www.healthit.gov/patients-families/
³https://www.gov.uk/government/news/the-midata-vision-of-consumer-empowerment
⁴http://mesinfos.fing.org/
Unfortunately, the way microdata is anonymized and processed today is far from satisfactory. Let us consider how a national statistical study is managed, e.g. computing the average salary per geographic region. Such a study is usually divided into 3 phases: (1) the statistical institute (assumed to be a trusted third party) broadcasts a query to collect raw microdata, along with anonymity guarantees (i.e., a privacy parameter like k in the k-anonymity or ε in the differential privacy sanitization models), to all users; (2) each user consenting to participate transmits her microdata to the institute; (3) the institute computes the aggregate query, while respecting the announced anonymity constraint.
This approach has two important drawbacks:
1. The anonymity guarantee is defined by the querier (i.e. the statistical institute), and applies uniformly to all participants. If the querier decides to provide little privacy protection (e.g. a small k in the k-anonymity model), it is likely that many users will not want to participate in the query. On the contrary, if the querier decides to provide a high level of privacy protection, many users will be willing to participate, but the quality of the results will drop. Indeed, higher privacy protection is always obtained to the detriment of the quality, and thus the utility, of the sanitized data.
2. The querier is assumed to be trusted. Although this could be a realistic assumption in the case of a national statistics institute, it means that it is impossible to outsource the computation of the query. Moreover, microdata centralization exacerbates the risk of privacy leakage due to piracy (the recent hacks of Yahoo and Apple are emblematic of the weakness of cyber defenses⁵), scrutiny and opaque business practices. This erodes individuals' trust in central servers, thereby reducing the proportion of citizens consenting to participate in such studies, some of which are unfortunately of great societal interest.
The objective of this paper is to tackle these two issues by reestablishing user empowerment, a principle called for by all recent legislation protecting the management of personal data (European Union, 2016). Roughly speaking, user empowerment means that the individual must keep control of her data and of its disclosure in any situation. More precisely, this paper makes the following contributions:
• proposing a query paradigm incorporating personalized privacy guarantees, so that each user can trade her participation in the query for a privacy protection matching her personal perception of the risk,
• providing a secure decentralized computing framework guaranteeing that the individual keeps her data in her hands and that the query issuer never gets cleartext raw microdata and sees only a sanitized aggregated query result matching all personalized privacy guarantees,
• conducting a performance evaluation on a real dataset demonstrating the effectiveness and scalability of the approach.

⁵Yahoo 'state' hackers stole data from 500 million users - BBC News. www.bbc.co.uk/news/world-us-canada-37447016
The rest of the paper is organized as follows. Section 2 presents related works and background material, allowing us to precisely state the problem addressed. Section 3 details the core of our contribution. Section 4 shows that the overhead incurred by our algorithm, compared to a traditional query processing technique, remains largely tractable. Finally, Section 5 concludes.
2 STATE OF THE ART AND
PROBLEM STATEMENT
2.1 Related Works on
Privacy-preserving Data Publishing
Anonymization has been a hot topic in data publication since the 1960's for all statistical institutions wanting to publish aggregate data. The objective of most of these data publishing techniques is to provide security against an attacker who mounts deanonymization attacks, linking some sensitive information (such as a salary or a medical diagnosis) to a specific individual. Particular attention was drawn to the problem by Sweeney, with the introduction of the k-anonymity model (Sweeney, 2002) that we consider in this paper. k-anonymity is a partition-based approach to anonymization, meaning that the original dataset, composed of individuals' microdata, is partitioned, through generalization or suppression of values, into groups of records sharing similar quasi-identifier values.
The partition-based approach splits the attributes of the dataset into two categories: a quasi-identifier and some sensitive data. A quasi-identifier (denoted QID) is a set of attributes for which some records may exhibit a unique combination of values in the dataset, and which may consequently identify the corresponding individuals (e.g., ZipCode, BirthDate, Gender). The sensitive part (denoted SD) encompasses the attribute(s) whose association with individuals must be made ambiguous (e.g., Disease).
Partition-based approaches essentially apply a
controlled degradation to the association between in-
dividuals (represented in the dataset by their quasi-
identifier(s)) and their sensitive attribute(s). The ini-
tial dataset is deterministically partitioned into groups
of records (classes), where quasi-identifier and sen-
sitive values satisfy a chosen partition-based privacy
model. The original k-Anonymity model (Sweeney,
DATA 2017 - 6th International Conference on Data Science, Technology and Applications
108
Power
provider
Super-
market
Car insurer
A
P
h
o
t
o
s
M
y
F
i
l
e
s
S
y
n
c
.
Fixed Trusted Cell
Heat
sensor
Power
meter
F E
D
H
G
I
Employer
Hospital
School
C
SSI
Trusted Data Server
MobileTrusted Cell
Figure 1: Trusted Cells reference architecture.
2002) requires each class to contain at least k in-
distinguishable records, thus each sensitive data will
be associated with at least k records. Many other
models have been introduced since 2006, such as
`-Diversity (Machanavajjhala et al., 2006) or t-
Closeness (Li et al., 2010). Each model further con-
strain the distribution of sensitive data within each
class, tackling different adversarial assumptions. For
example, the `-Diversity principle requires that the set
of sensitive data associated to each equivalence class
be linked to ` different sensitive values. t-closeness
requires each class to have a similar distribution of
To illustrate this, Table 1 shows a 3-anonymous and 2-diverse version of a dataset. This means that, for each tuple, at least two others have the same quasi-identifier (i.e. 3-anonymity), and for each group of tuples with the same quasi-identifier, there are at least two distinct sensitive values (i.e. 2-diversity). It is important to note that the higher k and ℓ, the better the privacy protection, but the lower the precision (or quality) of the query.
Table 1: A 3-anonymous and 2-diverse table.

    Quasi-identifier     Sensitive
    ZIP     Age          Condition
    112**   > 25         Cancer
    112**   > 25         Cancer
    112**   > 25         Heart Disease
    1125*   *            Heart Disease
    1125*   *            Viral Infection
    1125*   *            Cancer
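To make both definitions concrete, the following short sketch (ours, written in Python for illustration, not part of any cited system) checks k-anonymity and ℓ-diversity over the table above:

from collections import defaultdict

def check_k_l(rows, qid_cols, sd_col, k, l):
    # Partition rows into equivalence classes on the quasi-identifier,
    # then check class sizes (k-anonymity) and the number of distinct
    # sensitive values per class (l-diversity).
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[c] for c in qid_cols)].append(row[sd_col])
    return all(len(sd) >= k and len(set(sd)) >= l
               for sd in classes.values())

table = [
    {"zip": "112**", "age": "> 25", "condition": "Cancer"},
    {"zip": "112**", "age": "> 25", "condition": "Cancer"},
    {"zip": "112**", "age": "> 25", "condition": "Heart Disease"},
    {"zip": "1125*", "age": "*",    "condition": "Heart Disease"},
    {"zip": "1125*", "age": "*",    "condition": "Viral Infection"},
    {"zip": "1125*", "age": "*",    "condition": "Cancer"},
]
assert check_k_l(table, ("zip", "age"), "condition", k=3, l=2)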
A different concept is differential privacy, introduced by Dwork (Dwork, 2006). Differential privacy is better adapted to interactive query answering. Its advantage is to provide formal guarantees regardless of the knowledge of the adversary. However, differential privacy limits the type of computation which can be made on the data. Moreover, fixing the privacy parameter ε is a cumbersome and unintuitive task, out of reach of ordinary individuals.
Another approach is to make an agreement between the user and the querier. The concept of sticky policies, presented by Trabelsi, Neven, Raggett et al. (Trabelsi et al., 2011), consists in attaching to the user's data a policy stating authorizations (i.e. what the querier can do) and obligations (i.e. what the querier must do).
2.2 Reference Computing Architecture
Concurrently with smart disclosure initiatives, the
Personal Information Management System (PIMS)
paradigm has been conceptualized (Abiteboul et al.,
2015), and emerges in the commercial sphere (e.g.
Cozy Cloud, OwnCloud, SeaFile). PIMS holds the
promise of a Privacy-by-Design storage and comput-
ing platform where each individual can gather her
complete digital environment in one place and share
it with applications and other users under her con-
trol. The Trusted Cells architecture presented in (An-
ciaux et al., 2013), and pictured in Figure 1, precisely
answers the PIMS requirements by preventing data
leaks during computations on personal data. Hence,
we consider Trusted Cells as a reference computing
architecture in this paper.
Trusted Cells is by nature a decentralized architecture, managing computations on microdata through the collaboration of two parties. The first party is a (potentially large) set of personal Trusted Data Servers (TDSs) allowing each individual to manage her data with tangible elements of trust. Indeed, TDSs incorporate tamper-resistant hardware (e.g. smartcard, secure chip, secure USB token) securing the data and code against attackers and against the users' own misuse. Despite the diversity of existing tamper-resistant devices, a TDS can be abstracted by (1) a Trusted Execution Environment and (2) a (potentially untrusted but cryptographically protected) mass storage area where the personal data resides. The important assumption is that the TDS code is executed by the secure device hosting it and thus cannot be tampered with, even by the TDS holder herself.
By construction, secure hardware exhibits limited storage and computing resources, and TDSs inherit these restrictions. Moreover, they are not necessarily always connected, since their owners can disconnect them at will. A second party, hereafter called the Supporting Server Infrastructure (SSI), is thus required to manage the communications between TDSs, run the distributed query protocol and store the intermediate results produced by this protocol. Because the SSI is implemented on regular server(s), e.g. in the Cloud, it exhibits the same low level of trustworthiness. The resulting computing architecture is said to be asymmetric in the sense that it is composed of a very large number of low-power, weakly connected but highly secure TDSs and of a powerful, highly available but untrusted SSI.
2.3 Reference Query Processing
Protocol
By avoiding delegating the storage of personal data to untrusted cloud providers, Trusted Cells is key to achieving user empowerment. Each individual keeps her data in her hands and can control its disclosure. However, the decentralized nature of the Trusted Cells architecture must not hinder global computations and queries, impeding the development of services of great interest for the community. SQL/AA (SQL Asymmetric Architecture) is a protocol to execute standard SQL queries on the Trusted Cells architecture (To et al., 2014; To et al., 2016). It has been precisely designed to tackle this issue, that is, executing global queries on a set of TDSs without recentralizing microdata and without leaking any information.
The protocol, illustrated in Figure 2, works as follows. Once an SQL query is issued by a querier (e.g. a statistical institute), it is computed in three phases. First, the collection phase: the querier broadcasts the query to all TDSs; each TDS decides whether to participate in the computation (sending dummy tuples otherwise, to hide its denial of participation), evaluates the WHERE clause and returns its own encrypted data to the SSI. Second, the aggregation phase: the SSI forms partitions of encrypted tuples and sends them back to TDSs; each TDS participating in this phase decrypts its input partition, removes dummy tuples and computes the aggregation function (e.g. AVG, COUNT). Finally, the filtering phase: TDSs produce the final result by evaluating the HAVING clause and send this result to the querier. Note that the TDSs participating in each phase can be different. Indeed, TDSs contributing to the collection phase act as data producers, while TDSs participating in the aggregation and filtering phases act as trusted computing nodes. The tamper resistance of TDSs is key in this protocol, since a given TDS belonging to individual i₁ is likely to decrypt and aggregate tuples issued by the TDSs of other individuals i₂, …, iₙ. Finally, note that the aggregation phase is recursive and runs until all tuples belonging to a same group have actually been aggregated. We refer the interested reader to (To et al., 2014; To et al., 2016) for a more detailed presentation of the SQL/AA protocol.
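To fix ideas, the following deliberately naive rendering of the three phases (our own sketch in Python: encryption, dummy tuples, partitioning and recursion are all omitted) shows the data flow for a query such as SELECT city, AVG(salary) ... GROUP BY city:

def collection(tds_tuples, predicate):
    # Each TDS evaluates the WHERE clause locally; in the real protocol
    # non-participants send dummy tuples instead of dropping out.
    return [t for t in tds_tuples if predicate(t)]

def aggregation(collected):
    # Performed by TDSs over partitions of encrypted tuples; reduced
    # here to a single pass building per-group (sum, count) states.
    groups = {}
    for t in collected:
        s, c = groups.get(t["city"], (0, 0))
        groups[t["city"]] = (s + t["salary"], c + 1)
    return groups

def filtering(groups, having):
    # TDSs evaluate the HAVING clause and finalize AVG from partials.
    return {g: s / c for g, (s, c) in groups.items() if having(s, c)}

tuples = [{"city": "Bourges", "salary": 1500}] * 3
print(filtering(aggregation(collection(tuples, lambda t: True)),
                having=lambda s, c: c >= 3))   # {'Bourges': 1500.0}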
2.4 Problem Statement
In order to protect the privacy of users, queries must respect a certain degree of anonymity. Our primary objective is to push personalized privacy guarantees into the processing of regular statistical queries, so that individuals can disclose different amounts of information (i.e., data at different levels of accuracy) depending on their own perception of the risk. To the best of our knowledge, no existing work has addressed this issue. For the sake of simplicity, we consider SQL as the reference language to express statistical/aggregate queries because of its widespread usage. Similarly, we consider personalized privacy guarantees derived from the k-anonymity and ℓ-diversity models because (1) they are the most used in practice, (2) they are recommended by the European Union (European Union, 2014) and (3) they can be easily understood by individuals⁶. The next step in our research agenda is to extend our approach to other query languages and privacy guarantees, but this ambitious goal exceeds the scope and expectations of this paper.
Hence, the problem addressed in this paper is to propose a (SQL) query paradigm incorporating personalized (k-anonymity and ℓ-diversity) privacy guarantees and enforcing these individual guarantees all along the query processing, without any possible leakage.

⁶The EU Article 29 Working Group mentions these characteristics as strong incentives to make these models effectively used in practice or tested by several European countries (e.g., the Dutch and French statistical institutes).
Figure 2: SQL/AA protocol. Message sequence between the SSI and the TDSs: in the collection phase, a TDS connects, receives Q, decrypts it, checks access control, computes the WHERE clause, encrypts its tuple t and sends it; in the aggregation phase, the SSI builds partitions Pᵢ of encrypted tuples and each participating TDS decrypts its partition, removes dummies, computes the aggregations and returns an encrypted partial aggregate partition; in the filtering phase, partial aggregates are partitioned again, decrypted, dummies removed, the HAVING clause computed, and the finalized results resᵢ are sent to the querier.
3 PERSONALIZED ANONYMITY
GUARANTEES IN SQL
3.1 Modeling Anonymisation using SQL
We make the assumption that each individual owns
a local database hosted in her personal TDS and that
these local databases conform to a common schema
which can be easily queried in SQL. For exam-
ple, power meter data (resp. GPS traces, healthcare records, etc.) can be stored in one (or several) table(s) whose schema is defined by the national distribution company (resp. an insurance company consortium, the Ministry of Health, etc.). Based on this assumption, the querier (i.e., the statistical institute) can issue regular SQL queries, as shown in Figure 3.
SELECT <Aggregate function(s)>
FROM <Table(s)>
WHERE <condition(s)>
GROUP BY <grouping attribute(s)>
HAVING <grouping condition(s)>
Figure 3: Regular SQL query form.
For the sake of simplicity, we do not consider joins between data stored in different TDSs; internal joins, which can be executed locally by each TDS, are supported. We refer to (To et al., 2014; To et al., 2016) for a deeper discussion on this aspect, which is not central to our work in this paper.
Anonymity Guarantees: these are defined by the querier, and correspond to the k and ℓ values that will be achieved by the end of the process, for each group produced. They correspond to the commitment of the querier towards any query participant. Different k and ℓ values can be associated to different granularities of grouping. In the example pictured in Figure 4, the querier commits to provide k ≥ 5 and ℓ ≥ 3 at a (City, Street) grouping granularity and k ≥ 10 and ℓ ≥ 3 at a (City) grouping granularity.
Anonymity Constraints: these are defined by the users, and correspond to the values they are willing to accept in order to participate in the query. Back to the example of Figure 4, Alice's privacy policy stipulates a minimal anonymization of k ≥ 5 and ℓ ≥ 3 when attribute Salary is queried.
According to the anonymity guarantees and constraints, the query computing protocol is as follows. The querier broadcasts to all potential participants the query to be computed, along with metadata encoding the associated anonymity guarantees. The TDS of each participant compares these guarantees with the individual's anonymity constraints. This principle shares some similarities with P3P⁷, with the matching between anonymity guarantees and constraints securely performed by the TDS. If the guarantees exceed the individual's constraints, the TDS participates in the query by providing real data at the finest grouping granularity. Otherwise, if the TDS finds a grouping granularity whose anonymity guarantees match her constraints, it participates, but provides a version of the data degraded to that coarser granularity (looking at Figure 4, answering the group by city, street clause is not acceptable for Bob, but answering just with city is).

⁷https://www.w3.org/P3P/
Figure 4: Example of collection phase with anonymity constraints. The querier issues Q1 = SELECT city, street, AVG(salary) ... GROUP BY city, street, with metadata promising (anonymity 5, diversity 3) at the (city, street) granularity and (anonymity 10, diversity 3) at the (city) granularity. Given their policies on aggregations over salary, Alice (anonymity 5, diversity 2) answers with (salary, city, street) and D_FLAG=0, Bob (anonymity 6, diversity 3) answers with (salary, city) and D_FLAG=0, and Charlie (anonymity 5, diversity 4) answers with a dummy tuple and D_FLAG=1.
Finally, if no match can be found, the TDS produces fake data (called dummy tuples in the protocol) to hide its denial of participation. Fake data is required to prevent the querier from inferring information about the individual's privacy policy or about her membership to the WHERE clause of the query.
Figure 4 illustrates this behavior. By comparing the querier's anonymity guarantees with their respective constraints, the TDSs of Alice, Bob and Charlie respectively participate with fine granularity values (Alice), with coarse granularity values (Bob), and with dummy tuples (Charlie).
The ODRL working group⁸ is looking at issues similar to the expression of privacy policies. However, how users can express their privacy policies in a standard way is outside the scope of this paper.

⁸https://www.w3.org/community/odrl/
3.2 The kᵢSQL/AA Protocol
We now describe our new protocol, which we call kᵢSQL/AA to stress that it takes into account the many different k values of the i different individuals. kᵢSQL/AA is an extension of the SQL/AA protocol (To et al., 2014; To et al., 2016) in which the enforcement of the anonymity guarantees has been pushed into the collection, aggregation and filtering phases.
Collection Phase: After the TDSs download the query, they compare the anonymity guarantees announced by the querier with their own anonymity constraints. As discussed in Section 3.1, TDSs send real data at the finest grouping granularity compliant with their anonymity constraints, or send a dummy tuple if no anonymity constraint can be satisfied.
Aggregation Phase: To ensure that the anonymization guarantees can be verified at the filtering phase, the clauses COUNT(*) and COUNT(DISTINCT A) are computed in addition to the aggregation asked by the querier. COUNT(*) will be used to check that the k-anonymity guarantee is met, while COUNT(DISTINCT A) will be used to check the ℓ-diversity guarantee on the attribute (or group of attributes) A on which the aggregate function applies (e.g. salary in our example)⁹. If tuples with varying grouping granularities enter this phase, they are aggregated separately, i.e. one group per grouping granularity.

⁹Since COUNT(DISTINCT A) is a holistic function, it can be computed during the aggregation phase either naively, by appending each distinct value to a list, or by using a cardinality estimation algorithm such as HyperLogLog (Flajolet et al., 2007).
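To make the partial computation concrete, here is a small sketch (our own illustration, not the embedded implementation) of the per-group state a TDS could maintain for AVG(salary), together with the two checking counts:

class GroupState:
    # Partial aggregate for AVG(salary), plus the COUNT(*) and
    # COUNT(DISTINCT salary) needed for the k and l checks later.
    def __init__(self):
        self.sum = 0.0
        self.count = 0          # COUNT(*): k-anonymity check
        self.distinct = set()   # COUNT(DISTINCT salary): l-diversity check

    def add(self, salary):
        self.sum += salary
        self.count += 1
        self.distinct.add(salary)

    def merge(self, other):
        # Combining the partial states of two partitions is lossless.
        self.sum += other.sum
        self.count += other.count
        self.distinct |= other.distinct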
Filtering Phase: Besides the HAVING predicates formulated by the querier, the HAVING clause is used to check the anonymity guarantees. Typically, k-anonymity boils down to checking COUNT(*) ≥ k, while ℓ-diversity is checked by COUNT(DISTINCT A) ≥ ℓ. If these guarantees are not met for some tuples, they are not immediately discarded. Instead, the protocol tries to merge them with a group of coarser granularity encompassing them. Let us consider the example of Table 2(a). The tuple (Bourges, Bv. Lahitolle, 1600) is merged with the tuple (Bourges, ******, 1400) to form the tuple (Bourges, ******, 1442.86), the count-weighted mean being (3 × 1600 + 11 × 1400) / 14 = 1442.86.
Table 2: Filtering phase.

(a) Example of a post-aggregation phase result:

    city        street         AVG(salary)  COUNT(*)  COUNT(DISTINCT salary)
    Le Chesnay  Dom. Voluceau  1500         6         4
    Le Chesnay  ******         1700         9         6
    Bourges     Bv. Lahitolle  1600         3         3
    Bourges     ******         1400         11        7

(b) Privacy guarantees of the query:

    Attributes     k    ℓ
    city, street   5    3
    city           10   3

(c) Data sent to the querier:

    city        street         AVG(salary)
    Le Chesnay  Dom. Voluceau  1500
    Bourges     ******         1442.86
Merges stop when all guarantees are met. If, despite merges, the guarantees cannot be met, the corresponding tuples are removed from the result. Hence, the querier receives every piece of data which satisfies the guarantees, and only those, as shown in Table 2(c).
How to Generalize: To reach the same k and ℓ values on the groups, the grouping attributes can be generalized in different orders, impacting the quality of the result for the querier. For instance, if the GroupBy clause involves two attributes Address and Age, would it be better to generalize the tuples on Address (e.g. replacing <City,Street> by <City>) or on Age (replacing exact values by intervals)? It would be valuable for the querier to give "hints" to the TDSs on how to generalize the data, by indicating which attributes to generalize and what privacy guarantees will be enforced after each generalization. In the following example, we consider the UCI Adult dataset (Lichman, 2013); we define a GroupBy query GB on attributes Age, Workclass, Education, Marital_status, Occupation, Race, Gender, Native_Country, and we compute the average fnlwgt operation OP=AVG(fnlwgt). MD represents the metadata attached to the query. Each metadata entry indicates which k and ℓ can be guaranteed after a given generalization operation. Depending on the attribute type, generalizing an attribute may correspond to climbing up a generalization hierarchy (for categorical attributes such as Workclass or Race) or to replacing a value by an interval of greater width (for numeric values such as Age or Education). The del operation means that the attribute is simply removed. The ordering of the metadata in MD translates the querier's requirements.
GB= Age, Workclass, Education, Marital_status,
    Occupation, Race, Gender, Native_Country;
OP= AVG(fnlwgt);
MD= : k=5 l=3,
    age->20: k=6 l=3,
    workclass->up: k=8 l=4,
    education->5: k=9 l=4,
    marital_status->up: k=9 l=4,
    occupation->up: k=10 l=4,
    race->up: k=11 l=5,
    gender->del: k=14 l=6,
    native_country->del: k=15 l=7,
    age->40: k=17 l=8;
Figure 5: kᵢSQL/AA query example.
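To illustrate how a TDS exploits this metadata, the sketch below (illustrative names, not the paper's code) encodes the ladder of Figure 5 and returns the finest level whose promised guarantees satisfy a user's (k, ℓ) constraint:

LADDER = [  # cumulative generalization steps with the promised (k, l)
    ("none",                 5, 3),
    ("age->20",              6, 3),
    ("workclass->up",        8, 4),
    ("education->5",         9, 4),
    ("marital_status->up",   9, 4),
    ("occupation->up",      10, 4),
    ("race->up",            11, 5),
    ("gender->del",         14, 6),
    ("native_country->del", 15, 7),
    ("age->40",             17, 8),
]

def finest_acceptable(user_k, user_l):
    # First ladder index whose guarantees meet the user's constraint,
    # or None: the TDS then answers with a dummy tuple.
    for i, (_, k, l) in enumerate(LADDER):
        if k >= user_k and l >= user_l:
            return i
    return None

print(finest_acceptable(5, 2))  # 0: send data at the finest granularity
print(finest_acceptable(6, 3))  # 1: generalize age into 20-year bins first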
4 EXPERIMENTAL EVALUATION
We have implemented the kᵢSQL/AA protocol on open-source hardware equivalent to that used in (To et al., 2014; To et al., 2016). The goal of this experimental section is to show that there is very little overhead when taking personalized anonymity constraints into account, compared to the performance measured on the original implementation of To et al. Our implementation is tested using the classical Adult dataset of UCI-ML (Lichman, 2013).
4.1 Implementation of the kᵢSQL/AA Protocol
The implementation of kᵢSQL/AA builds upon the secure aggregation protocol of SQL/AA (To et al., 2014; To et al., 2016), recalled in Section 2.3. The secure aggregation is based on non-deterministic encryption, such as AES-CBC, to encrypt tuples. Each TDS encrypts its data with the same secret key but with a different initialization vector, so that two tuples with the same value have two different encrypted values. This protection ensures that the SSI cannot infer personal information from the distribution of identical encrypted values. To make sure that aggregations are entirely computed, the SSI uses a divide-and-conquer partitioning to make the TDSs compute partial aggregations on partitions of data. Then the SSI merges partial aggregations to compute the final aggregation. The aggregation phase is illustrated in Figure 6.
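For illustration, here is a minimal sketch of this non-deterministic encryption, using the pyca/cryptography package (an assumption on our part; the paper does not name a library): same key, fresh random IV per tuple, so equal plaintexts yield different ciphertexts.

import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(32)  # stands in for the shared TDS secret key

def encrypt_tuple(plaintext: bytes) -> bytes:
    iv = os.urandom(16)                     # fresh IV: non-determinism
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(KEY), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()  # ship IV with data

# Two encryptions of the same tuple differ, so the SSI sees no repeats:
assert encrypt_tuple(b"1500|Bourges") != encrypt_tuple(b"1500|Bourges")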
Figure 6: Aggregation phase with four partitions. Partial aggregates Agg(P1) and Agg(P2) are merged into Agg(P1,2), Agg(P3) and Agg(P4) into Agg(P3,4), and these two into the final aggregation.
In kᵢSQL/AA, the aggregation phase has been kept unchanged from the SQL/AA system, since our contribution lies in the collection phase and the filtering phase. The algorithm implementing the collection phase of kᵢSQL/AA is given below (see Algorithm 1). As in most works related to data anonymization, we make the simplifying assumption that each individual is represented by a single tuple. Hence, the algorithm always returns a single tuple. This tuple is a dummy if the privacy constraints or the WHERE clause cannot be satisfied.
Algorithm 1: Collection Phase.
procedure COLLECTION_PHASE(Query Q)
    t ← getTuple(Q)
    p ← getConstraints(Q)
    g ← getGuarantees(Q)
    i ← 0
    if verifyWhere(t, Q) then
        while gᵢ < p do
            if not canGeneralize(t) then
                t ← makeDummy(Q)
                break
            else
                t ← nextGeneralization(t)
                i ← i + 1
            end if
        end while
    else
        t ← makeDummy(Q)
    end if
    return Encrypt(t)
end procedure
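For readers who prefer an executable form, here is a direct transliteration of Algorithm 1 in Python (our sketch: the helper names are assumptions, None stands for a dummy tuple, and encryption is omitted):

def collection_phase(t, where_ok, guarantees, constraint, generalize):
    # guarantees[i] is the (k, l) pair promised at generalization
    # level i; constraint is the user's own (k, l) requirement.
    if not where_ok(t):
        return None
    uk, ul = constraint
    for level, (k, l) in enumerate(guarantees):
        if k >= uk and l >= ul:
            return t              # finest acceptable granularity reached
        t = generalize(t, level)  # degrade t by one more step
        if t is None:
            return None           # no further generalization possible
    return None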
The filtering phase algorithm is given by Algorithm 2.
Algorithm 2: Filtering Phase.
procedure FILTERING_PHASE(Query Q, TupleSet T)
    sortByGeneralizationLevel(T)
    g ← getGuarantees(Q)
    for i from 0 to MaxGeneralizationLevel(Q) do
        for t ∈ Tᵢ do
            t ← decrypt(t)
            if verifyHaving(t, Q) then
                result.addTuple(t)
            else if canGeneralize(t) then
                t ← nextGeneralization(t)
                Tᵢ₊₁.addTuple(t)
            end if
        end for
    end for
    return result
end procedure
First, the algorithm sorts the tuples produced by the aggregation phase by generalization level, yielding one set of tuples per generalization level. The function verifyHaving checks whether COUNT(*) and COUNT(DISTINCT) match the anonymization guarantees expected at the current generalization level. If so, the tuple is added to the result. Otherwise, it is further generalized and merged into the set of the next generalization level. At the end, every tuple which cannot reach the adequate privacy constraints despite achieving maximum generalization is excluded from the result.
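Reusing the GroupState sketch from the aggregation phase above, the merge-up strategy can be outlined as follows (again an illustration under assumed helper names): groups that fail their (k, ℓ) check at level i are generalized and folded into the covering group at level i + 1, exactly as (Bourges, Bv. Lahitolle) folds into (Bourges, ******) in Table 2.

def filtering_phase(levels, guarantees, coarsen):
    # levels[i]: {group key: GroupState} at generalization level i;
    # guarantees[i]: the (k, l) pair to enforce at that level;
    # coarsen(key, i): the level-(i+1) key covering key.
    result = {}
    for i, groups in enumerate(levels):
        k, l = guarantees[i]
        for key, st in groups.items():
            if st.count >= k and len(st.distinct) >= l:
                result[key] = st.sum / st.count    # finalize AVG
            elif i + 1 < len(levels):              # merge one level up
                parent = coarsen(key, i)
                levels[i + 1].setdefault(parent, GroupState()).merge(st)
            # else: drop the group, its guarantees are unreachable
    return result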
4.2 Experiments Platform
Figure 7: TDS characteristics: an MCU (data management) with USB and Bluetooth interfaces and a fingerprint reader, connected to a smartcard (secrets) and an SD card (data).
The performance of the kᵢSQL/AA protocol presented above has been measured on the tamper-resistant open-source hardware platform shown in Figure 7. The TDS hardware platform consists of a 32-bit ARM Cortex-M4 microcontroller with a maximum frequency of 168MHz, 1MB of internal NOR flash memory and 196KB of RAM, itself connected to a µSD card containing all the personal data in an encrypted form, and to a secure element (open smartcard) containing cryptographic secrets and algorithms. The TDS can communicate either through USB (our case in this study) or Bluetooth. Finally, the TDS embeds a relational DBMS engine, named PlugDB¹⁰, running in the microcontroller. PlugDB is capable of executing SQL statements over the local personal data stored in the TDS. In our context, it is mainly used to implement the WHERE clause during the Collection phase of kᵢSQL/AA.

¹⁰https://project.inria.fr/plugdb/en/
Our performance tests use the Adult dataset from the UCI machine learning repository (Lichman, 2013). This dataset is an extraction from the American census bureau database. We modified the dataset in the same way as (Iyengar, 2002; Bayardo and Agrawal, 2005). We kept eight attributes on which to perform the GroupBy clause, namely age, workclass, education, marital status, occupation, race, native country and gender. Since our work is based on GROUP BY queries, we also kept the fnlwgt (i.e. final weight) attribute to perform an AVG on it. The final weight is a computed attribute giving similar values to people with similar demographic characteristics. We also removed each tuple with a missing value. In the end, we kept 30,162 tuples. Attributes age and education are treated as numeric values and the others as categorical values. Since TDSs have limited resources, categorical values are represented by bit vectors. For instance, the categorical attribute workclass is represented by an 8-bit value and its generalization process is performed by taking the upper node in the generalization tree given in Figure 8. The native country attribute is the largest and requires 49 bits to be represented.
Figure 8: Generalization tree of the workclass attribute. Leaves: Private (0x80), Self-emp-not-inc (0x40), Self-emp-inc (0x20), Federal-gov (0x10), Local-gov (0x08), State-gov (0x04), Without-pay (0x02), Never-worked (0x01). Intermediate nodes: Self-emp (0x60), Gov-emp (0x1c), Without-pay (0x03); Emp (0xfc) covers Private, Self-emp and Gov-emp; the root * (0xff) covers everything.
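Under this encoding, one generalization step is a constant-time lookup. The sketch below (ours; the on-device firmware is not published with the paper) encodes the Figure 8 tree, where an inner node's code is the OR of its children's bits:

LEAVES = {
    "Private": 0x80, "Self-emp-not-inc": 0x40, "Self-emp-inc": 0x20,
    "Federal-gov": 0x10, "Local-gov": 0x08, "State-gov": 0x04,
    "Without-pay": 0x02, "Never-worked": 0x01,
}
PARENT = {                                # one step up the tree
    0x80: 0xfc,                           # Private        -> Emp
    0x40: 0x60, 0x20: 0x60,               # Self-emp-*     -> Self-emp
    0x10: 0x1c, 0x08: 0x1c, 0x04: 0x1c,   # *-gov          -> Gov-emp
    0x02: 0x03, 0x01: 0x03,               # unpaid leaves  -> Without-pay
    0x60: 0xfc, 0x1c: 0xfc,               # mid levels     -> Emp
    0xfc: 0xff, 0x03: 0xff,               # top levels     -> * (any)
}

def generalize_workclass(code: int) -> int:
    return PARENT.get(code, 0xff)

assert generalize_workclass(LEAVES["State-gov"]) == 0x1c  # Gov-emp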
4.3 Performance Measurements
kᵢSQL/AA being an extension of the SQL/AA protocol, this section focuses on the evaluation of the overhead incurred by the introduction of anonymization guarantees in the query protocol. It boils down to a direct comparison between kᵢSQL/AA and the genuine SQL/AA. To make the performance evaluation more complete, we first recall from (To et al., 2016) the comparison between SQL/AA itself and other state-of-the-art methods to securely compute aggregate SQL queries. This comparison is pictured in Figure 9.
The Paillier curve shows the performance of computing the aggregation on a secure centralized server using homomorphic encryption, presented in (Ge and Zdonik, 2007). The DES curve also uses a centralized server, with a DES encryption scheme (data are decrypted at computation time). Finally, the SC curves correspond to the SQL/AA computation with various numbers of groups G (i.e. defined by the GroupBy clause). This figure shows the strength of the massively parallel computation of TDSs when G is small, and its limits when G becomes very large. We compare next the overhead introduced by our contribution to SQL/AA.
Figure 9: Performance measurements of SQL/AA and state of the art: execution time as a function of the number of tuples Nt (in millions) for Plaintext, Paillier, DES, and SC with G = 1, 100, 1000 and 10000.
Categorical vs. Numeric Values. We ran a query with one hundred generalization levels, first using categorical, then numerical values. Execution time was exactly the same, demonstrating that generalizing categorical values costs no more than generalizing numerical ones.
Collection and Filtering Phases. Figures 10 and 11 show the overhead introduced by our approach, respectively on the collection and filtering phases. The time corresponds to processing every possible generalization of the query presented in Figure 5, which generates the maximal overhead (i.e., worst case) for our approach. The SQL/AA bar corresponds to the execution cost of the SQL/AA protocol inside the TDS, the data transfer bar corresponds to the total time spent sending the query and retrieving the tuple (approximately 200 bytes at an experimentally measured data transfer rate of 7.9Mbits/sec), the TDS platform bar corresponds to the internal cost of communicating between the infrastructure and the TDS (data transfer excluded), and the Privacy bar corresponds to the overhead introduced by the kᵢSQL/AA approach. All times are indicated in milliseconds. Values are averaged over the whole dataset (i.e. 30K tuples).
Collection Phase Analysis. The overhead of the collection phase resides in deciding how to generalize the tuple in order to comply with the local privacy requirements and the global query privacy constraints. Figure 10 shows that our protocol introduces a 0.25ms overhead for a total average execution time of 3.5ms, i.e. under 10%, which we consider a very reasonable cost.
Figure 10: Collection Phase Execution Time Breakdown (data transfer, SQL/AA, Privacy, TDS platform).
Filtering Phase Analysis. Figure 11 shows the breakdown of the filtering phase execution time. The filtering phase takes place once the TDSs have computed all the aggregations and generalizations. The limited resources of the TDSs are circumvented by the SQL/AA system with the help of the (distributed) aggregation phase. Since every group is represented by one tuple, the TDS which computes the filtering phase receives a reduced number of tuples (G of them). To et al. have shown that the SQL/AA protocol converges if it is possible for a given TDS to compute G groups during the aggregation phase. As this is the number of tuples that will be processed during the filtering phase, we know that if G is under the threshold allowing its computation via the distributed aggregation phase, then it will be possible to compute the filtering phase with our improved protocol. Once again, measurements show that the overhead introduced by kᵢSQL/AA is only 4% of the overall cost of this phase: 0.42ms out of a total of 9.8ms.
Figure 11: Filtering Phase Execution Time Breakdown (data transfer, SQL/AA, Privacy, TDS platform).
5 CONCLUSION
In this paper, we presented a novel approach to define and enforce personalized anonymity constraints on SQL GROUP BY queries. To the best of our knowledge, this is the first approach targeting this issue. To this end, we extended the SQL/AA protocol and implemented our solution on secure hardware tokens (TDSs). Our experiments show that our approach is clearly usable, with an overhead of a few percent on the total execution time compared with the genuine SQL/AA protocol.
Our current work involves investigating the quality of the anonymization produced in the presence of different anonymization constraints for each individual. We firmly believe that introducing personalized anonymity constraints in database queries, and providing a secure decentralized query processing framework to execute them, gives substance to the user empowerment principle called for today by all legislation regulating the use of personal data.
REFERENCES
Abiteboul, S., André, B., and Kaplan, D. (2015). Managing
your digital life. Commun. ACM, 58(5):32–35.
Anciaux, N., Bonnet, P., Bouganim, L., Nguyen, B., Popa,
I. S., and Pucheral, P. (2013). Trusted cells: A sea
change for personal data services. In CIDR 2013,
Sixth Biennial Conference on Innovative Data Systems
Research, Asilomar, CA, USA, January 6-9, 2013, On-
line Proceedings.
Bayardo, R. J. and Agrawal, R. (2005). Data privacy
through optimal k-anonymization. In Proceedings
of the 21st International Conference on Data Engi-
neering, ICDE ’05, pages 217–228, Washington, DC,
USA. IEEE Computer Society.
Dwork, C. (2006). Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer Berlin / Heidelberg.
European Union (2014). ARTICLE 29 Data Protection
Working Party: Opinion 05/2014 on Anonymisation
Techniques.
European Union (2016). Regulation (EU) 2016/679 of the
European Parliament and of the Council of 27 April
2016 on the protection of natural persons with re-
gard to the processing of personal data and on the
free movement of such data, and repealing Directive
95/46/EC (General Data Protection Regulation). Offi-
cial Journal of the European Union, L119/59.
Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. (2007). Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AOFA'07).
Ge, T. and Zdonik, S. (2007). Answering aggregation
queries in a secure system model. In Proceedings
of the 33rd International Conference on Very Large
Data Bases, VLDB ’07, pages 519–530. VLDB En-
dowment.
Iyengar, V. S. (2002). Transforming data to satisfy pri-
vacy constraints. In Proceedings of the Eighth
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD ’02, pages
279–288, New York, NY, USA. ACM.
Li, N., Li, T., and Venkatasubramanian, S. (2010). Closeness: A new privacy measure for data publishing. IEEE Trans. Knowl. Data Eng., 22(7):943–956.
Lichman, M. (2013). UCI machine learning repository.
Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkita-
subramaniam, M. (2006). l-diversity: Privacy beyond
k-anonymity. In Proceedings of the 22nd Interna-
tional Conference on Data Engineering, ICDE 2006,
3-8 April 2006, Atlanta, GA, USA, page 24.
Sweeney, L. (2002). k-anonymity: A model for protecting
privacy. International Journal of Uncertainty, Fuzzi-
ness and Knowledge-Based Systems, 10(5):557–570.
To, Q., Nguyen, B., and Pucheral, P. (2014). SQL/AA: ex-
ecuting SQL on an asymmetric architecture. PVLDB,
7(13):1625–1628.
To, Q.-C., Nguyen, B., and Pucheral, P. (2016). Private and scalable execution of SQL aggregates on a secure decentralized architecture. ACM Trans. Database Syst., 41(3):16:1–16:43.
Trabelsi, S., Neven, G., Raggett, D., Ardagna, C., et al. (2011). Report on design and implementation. Technical report, PrimeLife Deliverable.