Anonylitics: From a Small Data to a Big Data Anonymization System for
Analytical Projects
Alexandra Pomares-Quimbaya 1,3, Alejandro Sierra-Múnera 1,3, Jaime Mendoza-Mendoza 1,3,
Julián Malaver-Moreno 1,3, Hernán Carvajal 2,3 and Victor Moncayo 2,3
1 Pontificia Universidad Javeriana, Bogotá, Colombia
2 Pontificia Universidad Javeriana, Cali, Colombia
3 Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Bogotá, Colombia
{hernan.carvajal, vhmoncayo}@javerianacali.edu.co
Keywords: Anonymization, Analytics, Data Mining, Data Science, Big Data, K-anonymity, Data Privacy, Information
Disclosure.
Abstract:
When a company requires analytical capabilities over data that might include sensitive information, it is
important to use a solution that protects those sensitive portions while maintaining the data's usefulness. An
analysis of existing anonymization approaches found that some of them only allow disclosing aggregated
information about large groups or require knowing in advance the type of analysis to be performed, which is
not viable in Big Data projects; others have low scalability, which makes them unfeasible for large data sets.
Another group of works is only presented theoretically, without any evidence of evaluations or tests in real
environments. To fill this gap, this paper presents Anonylitics, an implementation of the k-anonymity principle
for small and Big Data settings, intended for contexts where it is necessary to disclose small or large data sets
for applying supervised or unsupervised techniques. Anonylitics improves available implementations of
k-anonymity by using a hybrid approach during the creation of the anonymized blocks, maintaining the data
types of the original attributes, and guaranteeing scalability when used with large data sets. Considering the
diverse infrastructure and data volumes managed by current companies, Anonylitics was implemented in two
versions: a centralized version, for companies that have small data sets, or large data sets but good vertical
infrastructure capabilities, and a Big Data version, for companies with large data sets and horizontal
infrastructure capabilities. Evaluation on different data sets with diverse protection requirements demonstrates
that our solution maintains the utility of the data, guarantees its privacy and has good time-complexity performance.
1 INTRODUCTION
Confidentiality comprises a set of rules and mecha-
nisms that ensure that information is not made avail-
able or disclosed to unauthorized individuals, entities,
or processes. Although information released for analytical purposes should be as detailed as possible to
produce accurate models, data utility is in conflict
with the confidentiality guarantee (Lee, 2015). As for
utility, it typically refers to two features of the pro-
cessed data: the ease of use of such data for data min-
ing and other analysis purposes; and the correctness
of conclusions drawn from the processed data by such
analysis (Clifton and Tassa, 2013). To protect the con-
fidentiality and the privacy of the entities (e.g. clients,
patients, citizens) to which information refers, data
holders often remove or encrypt explicit identifiers
such as ids, names, and phone numbers. However,
this de-identification process, although necessary, is
not enough to guarantee the entities' anonymity; the
main risk lies in disclosing pseudo-identifiers (e.g. gender, age, number of children),
which individually are not risky, but which, when combined and linked with
available data sources, can easily be used by an attacker
to re-identify an entity (Ciriani et al., 2007).
Many principles and algorithms have been proposed to protect released data from the improper
disclosure of confidential information. Principles like k-
anonymity (Samarati and Sweeney, 1998), (Sweeney,
2002) and ε-differential privacy (Dwork et al., 2006)
have been widely used to produce anonymized data
from original data sets. According to the k-anonymity
principle, a data set can be published if every record is
indistinguishable from at least k-1 other records (k ≥ 2),
within the data set with respect to the combination
of pseudo-identifiers (or Quasi-Identifiers). It means
that for every combination of pseudo-identifiers there
must be at least k records in the disclosed data
sets. This guarantees that the entities cannot be re-
identified by linking an external data source with one
specific record. Alternatively, the ε-differential privacy principle states that privacy can be preserved by
calibrating the standard deviation of the noise according to the sensitivity of a function f, where the
function f describes the query in which the user is interested; for instance, given a database D, a user may
be interested in the total number of 1's in the database
(Dwork et al., 2006), (Dwork and Roth, 2014). The
goal of this principle is to enable the user to learn
properties of the population as a whole while protecting the privacy of each individual (Dwork and Naor, 2010).
In the context of Big Data projects, we analyzed existing implementations of both privacy principles and found
drawbacks in each of them. On the one hand, differential privacy
is intended to disclose aggregated information, which
is not always the case in Big Data projects, where
the complete data set must be available. In addition, its core requires noise addition, which seems
to run into considerable resistance among specialists
(Clifton and Tassa, 2013). Furthermore, in practice it
requires an interactive implementation that demands
prior knowledge of the analysis to be performed by
the user in order to calibrate the level of noise to the
global sensitivity of the query and to the targeted differential privacy parameter (Clifton and Tassa, 2013).
On the other hand, we found two main drawbacks in
the k-anonymity (and its variants) implementations:
the first is their low scalability for Big Data sets,
and the second is the amount of information loss
when some of the classical generalization approaches
are applied to the data set. Although both principles present problems when applied to Big Data,
we chose k-anonymity as the principle to guarantee data privacy because it works well in
non-interactive environments and its implementations
can be enhanced to achieve scalability and to reduce
information loss using non-perturbative methods.
This paper presents Anonylitics, an implementation of the k-anonymity principle for Big Data settings. Anonylitics is intended for contexts where it is
necessary to publish small or large data sets for applying supervised or unsupervised analytical techniques that require preserving the utility of the data.
Anonylitics improves classical implementations of k-anonymity, such as Mondrian and Datafly, by using a hybrid
approach during the creation of the blocks, which are required
to be of size at least k, and by replacing values with a
class representative. Throughout the design and implementation of Anonylitics to achieve k-anonymity in
Big Data settings, we had to deal with several aspects to transform the version of Anonylitics that
anonymizes small data sets (i.e. small enough to be
processed in a single machine) into a version that
can anonymize Big Data sets. These lessons learned
are an important contribution of this paper.
The paper is structured as follows. Section 2
presents a summary of the main findings on current anonymization approaches in Big Data. Then,
Section 3 explains the strategy of Anonylitics to produce data sets compliant with the k-anonymity principle,
including its transformation from a centralized to a
Big Data version. Section 4 includes the evaluation of
our proposal, demonstrating its good results on utility
and privacy preservation; this section also analyzes
the performance results derived from different test setups. The final part of the paper presents the lessons
learned in Section 5 and the conclusions and future work in Section 6.
2 RELATED WORK
When the confidential information is related to individuals, the right to privacy emerges. Privacy is
the right of every individual to control or influence what information related to them can be collected and stored, by whom, and what information
can be disclosed (Luisa Pfeiffer, 2008). This information is represented by attributes that can be classified into: (i) Identifiers: fields that uniquely identify
a person (name, ids). (ii) Pseudo-identifiers: fields
that do not directly identify a person, but combined with other
pseudo-identifiers can identify an individual (age, sex,
race). (iii) Sensitive data: fields whose values are the subject of conclusions drawn in a subsequent analysis (medical condition, salary) (Lambert, 1993).
Preserving privacy in Big Data projects is one of
the major concerns because such projects require data to go outside the boundaries of the companies and involve
pseudo-identifiers and sensitive data that need to be
analyzed. With the aim of identifying solutions to
ensure privacy in Big Data contexts, we analyzed related work on data anonymization in these environments.
The systematic review of the literature by B. Pur-
cell (Purcell, 2013) presents four anonymization ap-
proaches for large volumes of data: conceptual, ar-
chitectures, tools and algorithms. In the conceptual
approach, anonymization is analyzed from a theoret-
ical perspective where the objective is to evaluate the
balance between information loss and data utility af-
ter anonymization. The architecture approach refers to the logical orchestration that allows building anonymization systems, identifying two types of architectures:
general-purpose and application-specific. The tools approach
refers to the practical application of principles, operations, algorithms and anonymization methodologies available to the general public, whether open
source or proprietary; it includes a commercial anonymization tool developed by a Japanese
research group (Morisawa and Matsune, 2016). Finally, the algorithms approach covers heuristic-based methodologies
that seek a balance between computational complexity and information privacy.
Regarding the classical algorithms developed for
anonymization, such as Datafly (Sweeney, 1998), Incognito (LeFevre et al., 2005) and Mondrian (LeFevre
et al., 2006), whose objective is to fulfill the principle of k-anonymity, we found that Mondrian is the
most viable to be enhanced for Big Data problems.
The Mondrian algorithm anonymizes tabular
information through recursive partitioning of the data
set while applying the principle of k-anonymity. From the
columns or dimensions that make up the data set, the
user must define those that are considered pseudo-identifiers. Recursively, the algorithm calculates the
normalized range of each of the selected dimensions,
which means that partitioning can only be done on
numerical dimensions. Once the ranges are calculated, the algorithm selects the dimension with the
widest range to partition. Partitioning is done by calculating the median of the distribution of the selected dimension, computed using the frequencies, and then
creating two partitions, the first with values greater
than or equal to the median and the second with values smaller than the median. On these two new
partitions the algorithm is executed again, until a
partition is smaller than twice the size of k. When this
occurs, the algorithm can no longer partition
the data, since the principle of k-anonymity would not be met. Once the partitions are
created, the generalization process is carried out, in
which, for every partition, a class representative (the value that replaces each pseudo-identifier) is selected.
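To make this recursive scheme concrete, the following minimal Python sketch partitions a list of record dictionaries on numerical quasi-identifiers, cutting at the median of the dimension with the widest range; the function and its handling of attributes are illustrative and do not reproduce the original Mondrian implementation (for brevity, ranges are not normalized by the global attribute domain).

def mondrian_partition(records, quasi_ids, k):
    """Recursively split a list of record dicts on the numerical
    quasi-identifier with the widest range, cutting at the median,
    until no cut leaves both halves with at least k records."""
    def value_range(dim):
        vals = [r[dim] for r in records]
        return max(vals) - min(vals)

    # Try dimensions from widest to narrowest range.
    for dim in sorted(quasi_ids, key=value_range, reverse=True):
        vals = sorted(r[dim] for r in records)
        median = vals[len(vals) // 2]
        left = [r for r in records if r[dim] < median]
        right = [r for r in records if r[dim] >= median]
        if len(left) >= k and len(right) >= k:          # allowable cut
            return (mondrian_partition(left, quasi_ids, k)
                    + mondrian_partition(right, quasi_ids, k))
    return [records]    # no further cut is possible: this block is final

Each final block is then generalized, for instance by replacing every quasi-identifier value with a representative of the block.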
Concerning the scalable implementation of these
classical algorithms, we found different proposals. However, they are more focused on large data
sets that can still be processed in a single machine (e.g.
thousands of records). In the centralized approach,
LeFevre, DeWitt and Ramakrishnan propose two
Top-Down Specialization (TDS) greedy algorithms based on LeFevre's Mondrian algorithm
(LeFevre and DeWitt, 2007): Rothko-T (Rothko-Tree) and Rothko-S (Rothko-Sampling). Rothko-T
is a variation of the Mondrian algorithm that gives
special consideration to data sets larger than
the size of the memory. Rothko-S is a sampling-based variation
of Rothko-T aimed at obtaining greater performance.
In the distributed approach, Zhang, Yang, Liu and
Chen present two works (Zhang et al., 2014b), (Zhang
et al., 2014a): MapReduce Top-Down Specialization (MRTDS) and MapReduce Bottom-Up Generalization (MRBUG). MRTDS anonymizes data partitions to generate an intermediate anonymization level:
it first initializes the data utility metrics
and then, with an iterative algorithm, selects and applies the best
specialization according to the level of anonymization, updating the anonymization level.
MRBUG initializes a data utility metric and makes
multiple generalizations in each iteration round, to improve the degree of parallelization and efficiency, until the anonymized data set complies with the principle of k-anonymity. In addition, Zhang et al.
present the Single-Objective Proximity-Aware Clustering algorithm (SPAC) (Zhang et al., 2015), a technique adapted to a file-based distributed system
such as HDFS that can therefore be applied to large volumes of data. On the other hand,
Sowmya and Nagaratna introduce the Parallelizing
k-anonymity Algorithm (Pk-a) (Sowmya and Nagaratna, 2016), where the implementation of the principle of k-anonymity is redefined using distributed
programming, MapReduce and the Hadoop ecosystem. Zhang et al. also propose a scalable local-recoding approach based on Locality Sensitive
Hashing (LSH) (Zhang et al., 2016), which uses distance metrics for categorical data (Jaccard
distance) and numerical data (Euclidean distance) to
measure the similarity between records, as well as MRMondrian (MapReduce jobs on
Mondrian) (Zhang et al., 2016), whose objective
is to partition the data iteratively into smaller subsets until they all fit into the memory of each processing node. The work presented in (El Ouazzani and
El Bakkali, 2018) proposes a k-anonymity approach
that requires no prior value of the threshold k; however, its
complexity is not calculated. Eyupoglu et al. propose
an algorithm based on chaos and perturbation techniques (Eyupoglu et al., 2018) that preserves privacy
and utility in large data sets independently of the type of data
and can be applied to both numerical and categorical attributes, but it has not been used in real
applications so far. Finally, Sopaoglu and Abul developed
a scalable top-down anonymization algorithm
for Apache Spark (Sopaoglu and Abul, 2017). Basically, they reimplemented the TDS (top-down
specialization) algorithm for the Apache Spark platform;
their experimental evaluation suggests that this TDS implementation is highly scalable.
The review showed that some of the developments in Big Data do not delve into the details
of the algorithm and/or code; this is understandable
when dealing with proprietary software, but it
does not allow measuring the confidence in the results (Morisawa and Matsune, 2016). Besides, most
of the described algorithms were tested with data sets
that do not correspond to large volumes of information, which does not comply with the Volume characteristic of Big Data projects (Zhang et al., 2015),
(Sowmya and Nagaratna, 2016). Finally, some of the
algorithms mentioned above are presented theoretically, and there is no evidence of evaluations or tests
showing that they can be used in real environments (Sopaoglu
and Abul, 2017).
3 ANONYLITICS
Taking into account the need to anonymize information while preserving its utility for analytical purposes, we developed an algorithm capable of anonymizing numerical and categorical data while maintaining the principle
of k-anonymity in small and Big Data environments. This algorithm, called Anonylitics,
is based on the Mondrian algorithm (see Section 2),
but incorporates a series of improvements aimed at
preserving the usefulness of the anonymized information and guaranteeing its scalability. Two versions
were developed: the first one processes low-volume data sets, and the second one provides the scalability required for Big Data problems.
3.1 Small Data Version: Centralized
Implementation
The centralized version of Anonylitics guarantees
anonymization based on the k-anonymity principle while keeping the usefulness of the data for analytical purposes. An important aspect of our proposal
is that it does not change the data type of the original
pseudo-identifiers, as most existing approaches
do, and it aims to preserve the distribution of numerical
values. The main phases of Anonylitics are presented
in Figure 1. As can be seen, our proposal includes
a series of improvements that guarantee the
privacy and utility of the input data. The main improvements are:
Figure 1: Anonylitics Centralized Implementation.
1. Partitioning. The partition is made by calculating the
median of the cut dimension whenever it is possible to partition on it (step A). When the algorithm
has checked all the dimensions and cannot find
one that can be divided by the median, it
starts over from the dimension with the
highest entropy and creates the partition
using the first k elements that comply with the
anonymity parameter k (step D). This hybrid partitioning approach reduces the information loss, because it reduces the size of the blocks while still
guaranteeing the k-anonymity principle.
2. Selection of the cutting dimension. Our algorithm
uses entropy (step B) to find the dimension on
which the data set is recursively partitioned. In
each recursion, the entropy of each pseudo-identifier is calculated and the partition
is made on the one with the highest entropy value.
When the partition cannot be made on the selected dimension, the algorithm selects the pseudo-identifier with the second highest entropy and tries
to partition the data using that dimension. This is
repeated until all dimensions have been evaluated; if partitioning cannot be achieved for any of them, the k-first
strategy of step D is applied. Using entropy allows us
to partition not only on numerical dimensions,
but also on categorical ones.
3. Categorical data hierarchy. Our proposal allows
partitioning that takes into account numerical and categorical data. Unlike related works, which concatenate the different values in the partition and
establish them as class representatives, a hierarchy
is associated with each categorical attribute. This
hierarchy is a tree structure where the
leaves are the original categorical values and the
internal nodes group together different levels of the
categorical data, as shown in Figure 2. When carrying out the generalization process (step F), a
class representative is not calculated; instead, the categorical value is replaced by the one that precedes it in the hierarchy until the principle of k-anonymity is fulfilled. This aspect preserves the
utility of the anonymized categorical values. A minimal sketch of the entropy calculation and of this hierarchy-based generalization is shown after Figure 2.
Figure 2: Cities Hierarchies Implementation Example.
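The following minimal Python sketch illustrates the entropy-based selection of the cut dimension (step B) and the hierarchy-based generalization of categorical values (step F); the city hierarchy and all names are illustrative toy values, not the ones used in Anonylitics.

import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of attribute values (step B)."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_cut_dimension(records, quasi_ids):
    """Choose the quasi-identifier with the highest entropy in this block."""
    return max(quasi_ids, key=lambda d: entropy([r[d] for r in records]))

# Toy hierarchy: leaf city -> parent region -> country (cf. Figure 2).
CITY_PARENT = {"Bogota": "Andean Region", "Cali": "Pacific Region",
               "Andean Region": "Colombia", "Pacific Region": "Colombia"}

def generalize_city(value, levels=1):
    """Replace a categorical value by its ancestor `levels` steps up in the
    hierarchy (step F), instead of concatenating the values of the class."""
    for _ in range(levels):
        value = CITY_PARENT.get(value, value)   # stop at the root
    return value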
3.2 From a Centralized to a Distributed
Implementation: Challenges
As stated by Katsogridakis et al., Spark, and in general MapReduce-based systems, are not suited for recursive algorithms, mainly due to the distributed nature of the code and the data (Katsogridakis et al.,
2017). Data resides in RDDs (Resilient Distributed
Datasets) (Zaharia et al., 2012) or dataframes (RDDs
with named columns) distributed across the cluster,
while the code and the metadata reside on the driver
node. Therefore, recursive algorithms need to either
be rewritten as iterative algorithms or the job scheduler needs to be changed, as proposed in (Katsogridakis et al., 2017). In our case we aim to write an algorithm that can be executed on a standard Big Data
environment and thus we opted for switching to an iterative algorithm.
This decision brought additional challenges because iterative algorithms can lead to high space complexity. In our case, after each iteration, the partitions defined by the previous step cannot be represented in the original data set because dataframes and
RDDs in Spark are immutable. The consequence is
having twice the data after each iteration and forcing the algorithm to constantly create checkpoints or
cache intermediate results. In Spark
jobs, transformations are not executed until an action is called; this lazy execution, if
not managed correctly, could result in big RDDs or
dataframes being created when an action is executed.
As we converted the algorithm into an iterative algorithm, after each iteration the data has to be split into different logical partitions. We could either create one
new dataframe containing all the data and use an attribute value to store the logical partition, or split
the dataframe into multiple dataframes, each one containing a partition. The second approach has the inconvenience that the new dataframes are built with the
help of filters over the original dataframe, which requires
iterating over the data multiple times, once for each
new partition.
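As a rough illustration of the two options, assuming a PySpark dataframe df, a numerical pseudo-identifier column age and an already computed cut_point (all names are hypothetical):

from pyspark.sql import functions as F

# Option 1: a single dataframe; the logical partition of each record is
# stored in an extra column, so each iteration traverses the data once.
marked = df.withColumn(
    "pid",
    F.when(F.col("age") < cut_point, F.lit("left")).otherwise(F.lit("right")))

# Option 2: split into separate dataframes with filters; every filter
# implies another pass over the original dataframe.
left = df.filter(F.col("age") < cut_point)
right = df.filter(F.col("age") >= cut_point)

As described in Section 3.3, Anonylitics follows the first option, adding a partition-identifier column in each iteration.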
Another challenge related to the algorithm involves sorting inside a partition or calculating row
numbers within a partition. For instance, finding
the first k elements of a partition requires ordering
all the rows within the partition by an ordering attribute. In Spark, with window functions, such an operation requires the whole partition to be loaded in one
worker's memory and the rows to be sorted locally; but with
a large data set the partitions can also be
very large, particularly in the first iterations, where
a partition could potentially be half the entire data set
and therefore might not fit in one
worker's memory. One example of such an operation is
computing the median of a partition, which requires
sorting the partition rows and finding the middle point
of the data set. In the last phase of the algorithm a
class representative is chosen. One of the original approaches in the centralized algorithm is choosing the
representative from a truncated distribution, sampling
from the partition given the global distribution. Such a
process, which is easily implemented in a centralized
Python process, has issues in a distributed Spark environment. It could be achieved by computing global
distributions, collecting all the data from a partition
and then choosing the representative for each class inside the driver node;
but this has the drawback of processing the complete data set in one node, which compromises scalability.
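One way to side-step the per-partition sorting problem is to rely on Spark's approximate percentile aggregate instead of a window-based sort; a hedged sketch, assuming a dataframe df with a logical partition column pid and a numerical column age:

from pyspark.sql import functions as F

# Approximate median of the cut dimension per logical partition, computed
# with an aggregate function instead of sorting each partition locally
# inside a single worker with a window function.
medians = (df.groupBy("pid")
             .agg(F.expr("percentile_approx(age, 0.5)").alias("median_age")))

This trades exactness for scalability, which is acceptable here because the cut point only needs to split a partition roughly in half.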
3.3 Big Data Version: Distributed
Implementation
All the challenges presented in Section 3.2 were
solved by adapting the centralized anonymization version to distributed environments that manage large data
sets stored in a Big Data environment. In particular,
the environment on which this solution was developed
was a Hadoop cluster with distributed HDFS storage
and an Apache Spark processing component, in which
the algorithm was written. This technology restricts
the use of recursion in the anonymization process; therefore, we developed a series of changes with
respect to the centralized version, which are illustrated
in Figure 3.
Figure 3: Anonylitics Distributed Implementation.
1. Anonymization strategy. Since the use of recursion over Spark is neither recommended nor efficient,
it was necessary to implement a proposal that performs the anonymization process iteratively (F).
The algorithm uses Spark's data structures,
RDDs and dataframes, on which all operations are
performed.
2. Identification of partitions. In each iteration of the
process a new dataframe is created with a new column, which indicates the partition to which each
record belongs (A). It is always necessary to create new dataframes since the data stored in
RDDs is immutable.
3. Entropy calculation. The entropy is calculated for
each pseudo-identifier in each partition by creating
auxiliary dataframes, which allow grouping and
aggregation operations on the data (B).
4. Partitioning. The partition process is done by
selecting the pseudo-identifier with the highest entropy (B, C) and dividing it into 2 partitions using the
mid-point (D).
5. Median calculation. Calculating the value of the median is a computationally expensive task for Spark. In contrast to the centralized version of the
algorithm, the data set is not physically divided into partitions; instead, in each iteration the records are marked
with an identifier corresponding to the partition to
which they belong (E). A simplified PySpark sketch of this iterative loop is shown after the list.
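The following simplified PySpark sketch shows the shape of the iterative loop described above; it cuts every splittable partition at an approximate median of a single quasi-identifier and omits the entropy-based dimension selection, the categorical hierarchies and the k-first fallback, so it illustrates the structure rather than the actual Anonylitics code.

from pyspark.sql import functions as F

def iterative_anonymize(df, quasi_ids, k, max_iter=30):
    """Iteratively split logical partitions, marked in column "pid" (A)."""
    df = df.withColumn("pid", F.lit("root"))
    for _ in range(max_iter):
        sizes = df.groupBy("pid").count().collect()
        splittable = [row["pid"] for row in sizes if row["count"] >= 2 * k]
        if not splittable:
            break
        dim = quasi_ids[0]   # Anonylitics instead picks the highest-entropy
                             # pseudo-identifier per partition (B, C)
        cuts = (df.filter(F.col("pid").isin(splittable))
                  .groupBy("pid")
                  .agg(F.expr("percentile_approx({}, 0.5)".format(dim))
                        .alias("cut")))
        df = (df.join(cuts, on="pid", how="left")
                .withColumn("pid",
                            F.when(F.col("cut").isNull(), F.col("pid"))
                             .when(F.col(dim) < F.col("cut"),
                                   F.concat(F.col("pid"), F.lit("L")))
                             .otherwise(F.concat(F.col("pid"), F.lit("R"))))
                .drop("cut"))
        # NOTE: a complete version must reject cuts that leave a side with
        # fewer than k rows and fall back to the k-first strategy (step D).
        df = df.cache()   # cache (or checkpoint) intermediate results to
                          # avoid recomputation across iterations
    return df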
The architecture of Anonylitics is based on a client-server model, as shown in Figure 4. It consists of three
main components that allow Anonylitics to be used
both in its centralized and in its distributed version.
Figure 4: Anonylitics Architecture.
The web client contains the front end of Anonylitics, where users parameterize and execute
the anonymization processes. The Anonylitics server
is where the anonymization processes are orchestrated; in the centralized version, the anonymization itself
is executed on this same server, which also stores the
anonymized data. The server additionally stores the
metadata of the anonymization projects. Finally, the
Hadoop cluster executes the distributed
anonymization component and provides the distributed
file system and the Spark processing component.
4 ANONYLITICS EVALUATION
In order to compare the Big Data and the small data implementations, we ran two sets of tests. One set evaluates the quality of the anonymization process and a
second set evaluates the performance
by measuring the running time under different configurations.
This section is divided into three parts: the experimental setup description, the evaluation of the
anonymization process, and the performance evaluation.
4.1 Experimental Setup
The centralized version presented in Section 3.1
was implemented in Python 3.5 with a MySQL 5.7
database. For the Big Data experiments, a
Hadoop Distributed File System (HDFS) was used;
the code was implemented in Python 3.5, using
Apache Spark 2.2.
All the experiments were conducted on a virtualized cluster of 9 machines. Each machine had four 2.30 GHz Intel Xeon processors, 32 GB
of memory, 1 TB IDE disks, and a gigabit Ethernet
link.
Tests were run on synthetic data sets of multiple sizes, all of them composed of one categorical and five
numerical columns. The categorical attribute contains a random sample of cities, and the numerical
quasi-identifiers were generated following a normal
distribution.
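A data set with this shape can be generated in a few lines of Python; the city names, distribution parameters and file name below are illustrative assumptions, not necessarily those used in our experiments.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000

# One categorical quasi-identifier (cities) and five numerical ones
# drawn from normal distributions.
cities = ["Bogota", "Cali", "Medellin", "Barranquilla", "Cartagena"]
data = {"city": rng.choice(cities, size=n)}
for i in range(1, 6):
    data["num_{}".format(i)] = rng.normal(loc=50, scale=10, size=n)

pd.DataFrame(data).to_csv("synthetic_quasi_identifiers.csv", index=False)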
4.1.1 Anonymization Evaluation Setup
Multiple tests were performed varying the k value and
using the same data set and cluster configuration. The
values of k ranged from 2^4 to 2^9 and the data table contained 1 million records.
4.1.2 Performance Evaluation Setup
In order to measure the performance of the Big Data
approach, three different setup variations were tested:
changing the k-anonymity constant, changing the
data set size and changing the cluster configuration:
1. K-anonymity: the k-anonymity parameter was
changed in powers of 2 from 2 to 512 and
the data set size was fixed at 1,000,000 rows.
2. Data size: the k-anonymity parameter was kept constant at the worst-performance case K=2 (highest
number of iterations) and the data set size was increased from 10,000 to 10,000,000 records.
Table 1: Performance evaluation parameters setup.
Experiment                  Parameter settings                                 Dataset size
1. K-anonymity              K = 2, 4, 8, 16, 32, 64, 128, 256 and 512          10^6
2. Data size                K = 2                                              10^4, 5x10^4, 10^5, 5x10^5, 10^6, 5x10^6, 10^7
3. Cluster configuration    K = 2; Executors = 8; Cores per executor = 1-4     10^6
3. Cluster Configuration: there are mainly three pa-
rameters related to the configuration of the clus-
ter, the number of YARN containers or executors
to allocate for this application, the number of pro-
cessor cores to allocate on each executor, and the
maximum heap or memory size to allocate to each
executor. In this experiment, we used one execu-
tor for each container node (8 in total), and the
number of cores were modified for each Spark ex-
ecutor increasing it from 1 to 4. The 1.000.000
observations data set and a value of K equal to 2
were maintained constant.
Table 1 summarizes the parameter settings and data set sizes used in the three experiments.
4.2 Anonymization Evaluation
The goal of this section is to evaluate whether the Big
Data approach affects the quality of the anonymization process. To do so, the Information Loss metric
(IL from now on), introduced in (Byun et al., 2007),
was used to evaluate the quality of the anonymized
table. This metric estimates the data utility by quantifying the degree to which data values have been generalized.
Let E be the set of clusters produced by an anonymization algorithm, and for each record $x$ in each equivalent class (or cluster) $e$, let $x_1, \ldots, x_D$ denote the values of the attributes of $x$. Then the amount of IL introduced by $e$ is defined as:

IL(e) = |e| \cdot \left( \sum_{i=1}^{D} \frac{\max\{x_i : x \in e\} - \min\{x_i : x \in e\}}{|\{x_i : x \in e\}|} \right)    (1)

and the total information loss of the anonymized table as:

Total\ IL(E) = \sum_{e \in E} IL(e)    (2)
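For numerical quasi-identifiers, Equations (1) and (2) translate directly into code. The following small sketch, interpreting |{x_i : x ∈ e}| as the number of distinct values of attribute i within the cluster and assuming each cluster is a pandas DataFrame of quasi-identifier columns, is illustrative rather than the evaluation script we used.

import pandas as pd

def information_loss(cluster: pd.DataFrame) -> float:
    """IL(e) from Equation (1): cluster size times the sum, over the
    attributes, of (max - min) divided by the number of distinct values."""
    per_attribute = [
        (cluster[col].max() - cluster[col].min()) / cluster[col].nunique()
        for col in cluster.columns
    ]
    return len(cluster) * sum(per_attribute)

def total_information_loss(clusters) -> float:
    """Total IL(E) from Equation (2): sum of IL(e) over all clusters."""
    return sum(information_loss(e) for e in clusters)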
Figure 5 shows the information loss obtained using different k values. As expected, the information loss score tends to increase with k, since
the attributes are generalized further, which reduces
the utility of the data set. This result confirms that the
Big Data approach does not affect the quality of the
anonymization process.
Figure 5: Information loss evaluation.
4.3 Performance Evaluation
Figures 6, 7 and 8 present the time performance of
Anonylitics as a function of k-anonymity, data set size and
cluster configuration, respectively. As expected, a visible relationship between run time and the k-anonymity parameter
was found: higher values of K directly influence
the total number of iterations. We obtained an execution
time reduction of 46%, from 69 min (k=2, 20 iterations) to 37 min (k=512, 12 iterations).
In the second experiment (Figure 7), we present
the time-complexity performance in terms of data set
size. As the number of rows in the data increases, so
does the time, but the results show that the final behavior is better than quadratic O(n^2), exponential O(2^n)
and linear O(n) growth. When the data size grows 10
times, from 10,000 to 100,000 rows, the run time grows
2.01 times; but when the number of observations increases from 10,000 to 10,000,000 (1,000 times bigger) the run time is only 14.7 times higher. This
shows that the Anonylitics algorithm has a good time-complexity performance, providing an acceptable behavior with big data sets.
Finally, our third experiment shows the scalability
capabilities of distributed computation in Anonylitics
(Figure 8). When the number of cores in each executor is
increased to its maximum value (4 cores per node
in our setup), the final time is reduced by 59%,
from 86 to 35 min. The performance can be improved further
not only by increasing the number of cores per
executor, but also by adding more nodes to the cluster and
optimizing the execution configuration.
Figure 6: K-Anonymity performance evaluation.
Figure 7: Dataset size performance evaluation.
Figure 8: Cluster configuration performance evaluation.
4.4 Comparison with Related Works
A qualitative comparison was made with the Top-Down Specialization algorithm (TDS) (Fung et al.,
2005) and its Big Data implementations: Two-Phase
TDS (Zhang et al., 2014b) for Hadoop (MapReduce)
and TDS for Apache Spark (Sopaoglu and Abul,
2017).
The main difference between the two algorithms lies in
the partitioning process: TDS approaches generalize
the best attribute in each iteration over the
entire column until k-anonymity is satisfied. On the
other hand, our implementation first builds iteratively
a group of partitions composed of at least k elements
and later applies what we defined as local and
global generalization processes. Local means that we
can generalize each partition individually and, for
that reason, we can have observations at different
hierarchy levels.
The TDS proposals and our algorithm support both
types of input data, numerical and categorical;
however, TDS transforms the continuous pseudo-identifiers into discrete values by grouping them into
numerical ranges. Changing the output data type in this way could
impact the later use of analytical and data mining
techniques, especially when we want to evaluate the
information loss and the utility of the anonymized
data. Further transformations will be necessary if we
want to compare initial and final data distributions,
or to measure the information change with techniques
like the Kullback-Leibler divergence.
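For instance, the divergence between the original and the anonymized distribution of a numerical attribute can be approximated by binning both samples over a shared histogram; a hedged sketch using SciPy, in which the bin count and the smoothing constant are arbitrary choices:

import numpy as np
from scipy.stats import entropy

def kl_divergence(original, anonymized, bins=50):
    """Approximate KL divergence between the value distributions of an
    attribute before and after anonymization, via a shared histogram."""
    edges = np.histogram_bin_edges(np.concatenate([original, anonymized]), bins)
    p, _ = np.histogram(original, bins=edges, density=True)
    q, _ = np.histogram(anonymized, bins=edges, density=True)
    eps = 1e-12                      # avoid division by zero in empty bins
    return entropy(p + eps, q + eps)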
On the contrary, our algorithm is capable of maintaining the data type, keeping all the variables together in the partitioning process, but carrying out
a different generalization approach depending on
the variable type. Categorical variables are
anonymized based on a hierarchical approach, so a
domain hierarchy or tree is needed. Numerical values in each partition are instead replaced by a class
representative, such as the average, the median or a random
element within the group.
Both anonymization techniques use information
theory (entropy) to select the best pseudo-identifier; nevertheless, TDS is much more computationally intensive because it calculates the information loss, or the entropy difference, before and after the
variable is generalized. Our approach, on the other
hand, only requires the initial entropy among all the
pseudo-identifiers.
5 LESSONS LEARNED: FROM
SMALL TO BIG DATA
ANONYMIZATION APPROACH
Converting an algorithm that was designed to run on
a single node into a scalable algorithm introduced
some challenges to our anonymization system. This
demonstrates that not all centralized approaches are
easily implemented in a distributed environment. Distributed implementations of originally centralized algorithms may involve an overhead that has to be managed when deciding which implementation to use.
Such a decision depends on the data volume and on the
vertical or horizontal scalability capabilities of the infrastructure. Companies must profile and understand
their data and determine whether a single machine
can process all the data, considering its growth over
time. If not, horizontal scalability is needed and a
distributed algorithm is the right solution.
Additionally, regarding anonymization approaches, we proposed a way of taking into account
both numerical and categorical variables when
deciding how to partition a data set. This is valuable
because real-world data sets have many types of variables, which must be considered in an anonymization
process to produce useful, but at the same time
confidential, data sets. The nature of the data is
important in our system because the utility of the
anonymized data is essential; therefore, we preserve
the data types and domains throughout the whole process.
From the point of view of the infrastructure, Big
Data software like Hadoop and Spark are complex
platforms that can be parameterized in many different ways depending on the characteristics of the data
to be processed and the physical infrastructure available. Building software with such technologies does
not only involve writing and testing the
code, but also understanding their architecture and adjusting it according to the algorithms and data.
Finally, it is important to remark that when an organization requires analytic capabilities that take advantage of vast amounts of data, and such data might
include sensitive information, it is vital to understand
all the variables contained in the data set and to protect
the information before making it available to external
analysts.
Likewise, data analysis processes require useful
information, representing real patterns and real populations, in order to provide good insights to decision makers. Using an anonymization system
that protects the data while maintaining its utility is essential, especially when it comes to information about
clients, patients, citizens, etc., or information about
other companies, which needs to be analyzed externally, or even internally by people who do not have
the right to see the original data.
6 CONCLUSIONS AND FUTURE
WORK
We proposed an implementation of the k-anonymity
principle called Anonylitics with two versions, a cen-
tralized and a distributed version, designed for small
and Big Data scenarios. Our algorithm not only works
for numerical data but also is designed to consider
categorical data as pseudo-identifiers. Related works
described different anonymization approaches that fo-
cus on specific algorithms or specific platforms. Un-
fortunately, some of them have no implementation or
have not been tested with large datasets.
Converting the originally centralized algorithm into
a distributed algorithm using the Apache Spark platform brought to light some challenges related to the
recursive nature of the algorithm and the specific data
transformation capabilities of Spark. Although an alternative is to make changes to the platform, as Katsogridakis et al. did in (Katsogridakis et al., 2017),
this is not possible in most companies. Instead, we proposed, implemented and tested solutions
to these challenges, analyzing the final performance
and data utility. Our implementation allows companies with huge data sets to anonymize them and
perform analysis tasks on the anonymized data
while protecting confidentiality.
As future work, it is important to test our algorithm
in a more controlled environment with dedicated machines instead of virtual machines, to avoid sharing
resources between virtual machines, which may impact
the overall execution performance. Those tests should
also vary the cluster size and the size of the files used,
in order to gain a better understanding of the scalability and overall performance of the proposed approach. Additionally, this will provide more concrete results that will allow us to compare the
different approaches in more depth.
Further future work includes extending the algorithm to streaming data (velocity) and to unstructured data anonymization, which are also
important aspects of Big Data strategies.
ACKNOWLEDGEMENTS
This research was carried out by the Center of Ex-
cellence and Appropriation in Big Data and Data
Analytics (CAOBA). It was funded partially by the
Ministry of Information Technologies and Telecom-
munications of the Republic of Colombia (MinTIC)
through the Colombian Administrative Department
of Science, Technology and Innovation (COLCIEN-
CIAS) contract No. FP44842-anex46-2015.
REFERENCES
Byun, J.-W., Kamra, A., Bertino, E., and Li, N. (2007). Ef-
ficient k-anonymization using clustering techniques.
In International Conference on Database Systems for
Advanced Applications, pages 188–200. Springer.
Ciriani, V., di Vimercati, S. D. C., Foresti, S., and Samarati,
P. (2007). Microdata protection. In Yu, T. and Jajodia,
S., editors, Secure Data Management in Decentralized
Systems, volume 33 of Advances in Information Secu-
rity, pages 291–321. Springer.
Clifton, C. and Tassa, T. (2013). On syntactic anonymity
and differential privacy. Trans. Data Privacy,
6(2):161–183.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. D.
(2006). Calibrating noise to sensitivity in private data
analysis. In Theory of Cryptography, Third Theory of
Cryptography Conference, TCC, pages 265–284.
Dwork, C. and Naor, M. (2010). On the difficulties of dis-
closure prevention in statistical databases or the case
for differential privacy. Journal of Privacy and Confi-
dentiality, 2:93–107.
Dwork, C. and Roth, A. (2014). The algorithmic foun-
dations of differential privacy. Found. Trends Theor.
Comput. Sci., 9(3–4):211–407.
El Ouazzani, Z. and El Bakkali, H. (2018). A new tech-
nique ensuring privacy in big data: K-anonymity with-
out prior value of the threshold k. Procedia Computer
Science, 127:52–59.
Eyupoglu, C., Aydin, M. A., Zaim, A. H., and Sertbas, A.
(2018). An efficient big data anonymization algorithm
based on chaos and perturbation techniques. Entropy,
20(5):373.
Fung, B. C., Wang, K., and Yu, P. S. (2005). Top-down spe-
cialization for information and privacy preservation.
In Data Engineering, 2005. ICDE 2005. Proceed-
ings. 21st International Conference on, pages 205–
216. IEEE.
Katsogridakis, P., Papagiannaki, S., and Pratikakis, P.
(2017). Execution of recursive queries in apache
spark. In European Conference on Parallel Process-
ing, pages 289–302. Springer.
Lambert, D. (1993). Measures of disclosure risk and
harm. Journal of Official Statistics, 9:313–313.
Lee, C. (2015). Security in telecommunications and information technology. Technical report, ITU-T
Telecommunication Standardization Bureau (TSB).
LeFevre, K. and DeWitt, D. (2007). Scalable anonymiza-
tion algorithms for large data sets. Age, 40:40.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2005).
Incognito: Efficient full-domain k-anonymity. In Pro-
ceedings of the 2005 ACM SIGMOD international
conference on Management of data, pages 49–60.
ACM.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006).
Mondrian multidimensional k-anonymity. In Data
Engineering, 2006. ICDE’06. Proceedings of the 22nd
International Conference on, pages 25–25. IEEE.
Luisa Pfeiffer, M. (2008). The right to privacy: protecting the sensitive data. Revista Colombiana de
Bioética, 3(1):11–36.
Morisawa, Y. and Matsune, S. (2016). Nestgate—realizing
personal data protection with k-anonymization tech-
nology. FUJITSU Sci. Tech. J, 52(3):37–42.
Purcell, B. (2013). The emergence of "big data" technology
and analytics. Journal of Technology Research, 4:1.
Samarati, P. and Sweeney, L. (1998). Protecting privacy
when disclosing information: k-anonymity and its
enforcement through generalization and suppression.
Technical report.
Sopaoglu, U. and Abul, O. (2017). A top-down k-
anonymization implementation for apache spark. In
Big Data (Big Data), 2017 IEEE International Con-
ference on, pages 4513–4521. IEEE.
Sowmya, Y. and Nagaratna, M. (2016). Parallelizing k-
anonymity algorithm for privacy preserving knowl-
edge discovery from big data. International Journal
of Applied Engineering Research, 11(2):1314–1321.
Sweeney, L. (1998). Datafly: A system for providing
anonymity in medical data. In Database Security XI,
pages 356–381. Springer.
Sweeney, L. (2002). k-anonymity: A model for protecting
privacy. International Journal of Uncertainty, Fuzzi-
ness and Knowledge-Based Systems, 10(5):557–570.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J.,
McCauley, M., Franklin, M. J., Shenker, S., and Sto-
ica, I. (2012). Resilient distributed datasets: A fault-
tolerant abstraction for in-memory cluster computing.
In Proceedings of the 9th USENIX conference on Net-
worked Systems Design and Implementation, pages 2–
2. USENIX Association.
Zhang, X., Dou, W., Pei, J., Nepal, S., Yang, C., Liu, C.,
and Chen, J. (2015). Proximity-aware local-recoding
anonymization with mapreduce for scalable big data
privacy preservation in cloud. IEEE transactions on
computers, 64(8):2293–2307.
Zhang, X., Leckie, C., Dou, W., Chen, J., Kotagiri, R., and
Salcic, Z. (2016). Scalable local-recoding anonymiza-
tion using locality sensitive hashing for big data pri-
vacy preservation. In Proceedings of the 25th ACM In-
ternational on Conference on Information and Knowl-
edge Management, pages 1793–1802. ACM.
Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., and
Chen, J. (2014a). A hybrid approach for scalable sub-
tree anonymization over big data using mapreduce on
cloud. Journal of Computer and System Sciences,
80(5):1008–1020.
Zhang, X., Yang, L. T., Liu, C., and Chen, J. (2014b). A
scalable two-phase top-down specialization approach
for data anonymization using mapreduce on cloud.
IEEE Transactions on Parallel and Distributed Sys-
tems, 25(2):363–373.