Authors:
Anne V. D. M. Kayem; Nikolai J. Podlesny and Christoph Meinel
Affiliation:
Hasso-Plattner-Institute, University of Potsdam, Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany
Keyword(s):
Privacy, Privacy Enhancing Technologies, Pseudonymisation, Data Transformation, Anonymisation, Compartmentation.
Abstract:
Data privacy legislation and the growing number of security violation incidents reported in the media have played a key role in consumer awareness of data protection. Furthermore, the digital trail left by activities such as online purchases, websites browsed, and clicked advertisements yields behavioural information that is useful for various data analytics operations. Analysing such information in a privacy-preserving way is useful both in satisfying service level agreements and in complying with privacy regulations. Pseudonymisation and anonymisation have been widely advocated as a means of generating privacy-preserving datasets. However, each approach has drawbacks when composing privacy-preserving datasets from multiple distributed data sources. The issue is made worse when the owners of the datasets co-exist in an untrusted environment. This paper presents a novel method of generating privacy-preserving datasets composed of distributed data in an untrusted scenario. We achieve this by combining cryptographically secure pseudonymisation with data obfuscation and sanitisation. The pseudonymisation and compartmentation are outsourced to a central but fully oblivious entity that can blindly compose datasets based on distributed sources. Controlled non-transitive join operations are used to ensure that the published datasets do not violate the contributing parties’ privacy properties. As a further step, the service provider employs obfuscation and sanitisation to identify and break functional dependencies between attribute values that hold the risk of inferential disclosures. Our empirical model shows that the overhead due to cryptographic pseudonymisation is negligible and that the method can be deployed on large datasets in a scalable manner. Furthermore, we are able to minimise information loss, even in large datasets, without impacting privacy negatively.
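To make the idea of blindly composing datasets from pseudonymised sources more concrete, the following is a minimal sketch and not the authors' protocol. It assumes keyed pseudonyms built with HMAC-SHA256 and a key shared only among the data owners; the function names `pseudonymise` and `blind_join`, the key, and the sample records are all hypothetical. The joining entity sees only pseudonyms and attribute values, never the raw identifiers.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Map an identifier to a keyed pseudonym (HMAC-SHA256, hex-encoded)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def blind_join(left: dict, right: dict) -> list:
    """Join two pseudonymised tables on pseudonyms only.

    The caller never needs the key or the raw identifiers; it only
    matches pseudonyms that occur in both tables.
    """
    common = sorted(left.keys() & right.keys())
    return [{**left[p], **right[p]} for p in common]


# Assumption: the data owners agree on this key out of band and never
# reveal it to the entity that performs the join.
shared_key = b"example-key-known-only-to-data-owners"

# Hypothetical data owner A: purchase records keyed by pseudonym.
purchases = {
    pseudonymise(uid, shared_key): {"basket_value": value}
    for uid, value in [("alice@example.org", 42), ("bob@example.org", 17)]
}

# Hypothetical data owner B: advertisement clicks keyed by pseudonym.
clicks = {
    pseudonymise(uid, shared_key): {"ads_clicked": count}
    for uid, count in [("alice@example.org", 3), ("carol@example.org", 8)]
}

# The oblivious entity composes a dataset from the two sources.
joined = blind_join(purchases, clicks)
print(joined)
```

Only the record present in both sources survives the join, and the output carries no identifier at all. The sketch does not cover the non-transitive join control or the breaking of functional dependencies described in the abstract; those steps would need additional machinery.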