Differentially Private Graph Publishing and Randomized Response for
Collaborative Filtering
Julián Salas 1,2 (https://orcid.org/0000-0003-1787-0654) and Vicenç Torra 3,4 (https://orcid.org/0000-0002-0368-8037)
1 Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya, Barcelona, Spain
2 Center for Cybersecurity Research of Catalonia, Spain
3 Hamilton Institute, Maynooth University, Ireland
4 University of Skövde, Sweden
Keywords:
Noise-graph Addition, Randomized Response, Edge Differential Privacy, Collaborative Filtering.
Abstract:
Several methods for providing edge and node-differential privacy for graphs have been devised. However,
most of them publish graph statistics, not the edge-set of the randomized graph. We present a method for graph
randomization that provides randomized response and allows for publishing differentially private graphs. We
show that this method can be applied to sanitize data to train collaborative filtering algorithms for recom-
mender systems. Our results afford plausible deniability to users in relation to their interests, with a controlled
probability predefined by the user or the data controller. We show, in an experiment with Facebook Likes data
and psychodemographic profiles, that the accuracy of the profiling algorithms is preserved even when they are
trained with differentially private data. Finally, we define privacy metrics to compare our method for different
parameters of ε with a k-anonymization method on the MovieLens dataset for movie recommendations.
1 INTRODUCTION
Collaborative filtering algorithms are based on users’
interests to provide recommendations to others. It is
necessary that the users reveal their interests to be able
to build accurate systems, but they may prefer to ex-
clude those that are embarrassing or that reveal their
private preferences. The privacy of users and person-
alization of the algorithms must be balanced.
Statistical techniques have long been used to preserve the privacy of respondents to surveys in the social sciences, while preserving the utility of the responses. Randomized response was devised in (Warner, 1965) to collect statistical information about embarrassing or illegal behavior; it provides privacy by offering plausible deniability to users in relation to their answers.
It was recently proved in (Dwork and Roth, 2014) that randomized response is differentially private. Optimal mechanisms for differential privacy by randomized response were calculated in (Wang et al., 2016; Holohan et al., 2017). Randomized response was used for protecting privacy for recommendations
in (Polat and Du, 2006); however, it was not related to differential privacy in that setting.
In this paper we present a method for providing
differential privacy in graph publishing with an ap-
plication to collaborative filtering. It is based on the
noise-graph addition technique from (Torra and Salas,
2019) to provide randomized response. By using ran-
domized response, privacy may be enhanced from the
moment of data collection until publication.
This paper is organized as follows. Section 2
presents all the theoretical results for differentially
private graph publication. In Section 3 we present re-
lated work on privacy for collaborative filtering and
present metrics for measuring the privacy provided
when adding noise-graphs. In Section 4 we present
two experiments. The first one shows that precise user
profiles may still be obtained from data with differen-
tial privacy guarantees. The second compares the privacy provided, for different parameter values and different methods, when collecting data for making recommendations. In the last section we present
our conclusions and future work.
2 DIFFERENTIALLY PRIVATE
GRAPH PUBLICATION
In this section we present the noise-graph mechanism
for differentially private graph publication. We pro-
vide the definition of the noise-graph addition tech-
nique and show that it may be applied to obtain ran-
domized response and differential privacy. Finally,
we adapt it to weighted graphs for an application to
user-ratings data for recommender systems.
2.1 Background and Basic Definitions
When the original graph and the noise-graph have
the same sets of nodes we can simplify the defini-
tion of noise-graph addition from (Torra and Salas,
2019). For this, we use the symmetric difference $E_1 \triangle E_2 := (E_1 \setminus E_2) \cup (E_2 \setminus E_1) = \{e \mid e \in E_1 \wedge e \notin E_2\} \cup \{e \mid e \notin E_1 \wedge e \in E_2\}$.
Definition 1. Let $G_1(V, E_1)$ and $G_2(V, E_2)$ be two graphs with the same nodes $V$. The addition of $G_1$ and $G_2$ is the graph $G = (V, E)$ where $E$ is defined as $E = \{e \mid e \in E_1 \triangle E_2\}$. We will denote $G$ as
$$G = G_1 \oplus G_2.$$
Definition 2 (Noise-graph Addition). Let $G$ be a graph and let $\mathcal{G}$ be a family of random graphs. Then, the noise-graph addition $\tilde{G}$ following the distribution $\mathcal{G}$ is obtained by drawing a graph $g$ from $\mathcal{G}$ and adding it to $G$, that is:
$$\tilde{G} = G \oplus g.$$
The most general random graph models in the literature are the Erdős-Rényi and the Gilbert models; it was proved in (Aiello et al., 2001) that they are asymptotically equivalent. In the Gilbert model, denoted by $G(n, p)$, there are $n$ nodes and each edge is chosen with probability $p$. In contrast, the Erdős-Rényi model, denoted by $G(n, e)$, represents a uniform probability over all graphs with $n$ nodes and $e$ edges. We use the Gilbert model $G(n, p)$ throughout this paper.
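As a concrete illustration, the following sketch (our own, not part of the paper's implementation; it assumes the graph is given as a dense 0/1 integer adjacency matrix and uses only numpy) samples a Gilbert noise-graph $g \in G(n, p)$ and performs the noise-graph addition $G \oplus g$ as an entry-wise XOR of adjacency matrices.

```python
import numpy as np

def noise_graph_addition(adj, p, seed=None):
    """Return the adjacency matrix of G (+) g with g ~ G(n, p) (Definition 2).

    adj : (n, n) symmetric integer 0/1 adjacency matrix of the original graph G.
    p   : probability of each possible edge in the Gilbert noise-graph g.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    # Sample each of the n(n-1)/2 possible edges independently with probability p
    # and symmetrize to obtain an undirected noise-graph.
    upper = np.triu(rng.random((n, n)) < p, k=1)
    g = (upper | upper.T).astype(adj.dtype)
    # Noise-graph addition keeps the symmetric difference of the edge sets,
    # which is an entry-wise XOR of the two adjacency matrices.
    return adj ^ g
```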
For bipartite graphs, in which there are two sets of independent nodes $U$ and $V$ such that $|U| = n$ and $|V| = m$, the Gilbert model is denoted as $G(n, m, p)$. Each of the $n \times m$ possible edges in $G(n, m, p)$ has probability $p$. We denote the set of bipartite random graphs with $n + m$ nodes as $\mathcal{G}_{n,m}$. Another possible way of randomizing graphs is through the application of a randomized response mechanism for a binary attribute (the existence or absence of an edge).
Definition 3 (Design Matrix for Randomized Response). A randomised response mechanism for a binary attribute as defined in (Warner, 1965) is uniquely determined by its design matrix:
$$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \qquad (1)$$
where the entry $p_{jk} = P(X_i = j \mid x_i = k)$, and $X_i$ is the random output for the original variable $x_i$.
Therefore, $p_{00}$ denotes the probability that the original value is 0 and the randomized value is 0; $p_{01}$ denotes the probability that the original value is 1 and the published value is 0; and so on.
For any graph $G$, if we take $g_1 \in G(n, p_1) \cap G$ and $g_2 \in G(n, p_2) \setminus G$, with $p_1 = 1 - p_{11}$ and $p_2 = 1 - p_{00}$, then the randomization mechanism $A = G \oplus g_1 \oplus g_2$ is equivalent to the design matrix from Eq. (1). In (Wang et al., 2016) it was proved that if $\max\left\{\frac{p_{00}}{p_{01}}, \frac{p_{11}}{p_{10}}\right\} \leq e^{\varepsilon}$, then the randomized response mechanism following the design matrix $P$ from Equation (1) is differentially private.
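A small sketch of this relation (our own helper names; it only encodes the design matrix of Eq. (1) and the bound from (Wang et al., 2016)) is the following:

```python
import numpy as np

def design_matrix(p00, p11):
    """Design matrix P of Eq. (1), with p_jk = P(output = j | original = k):
    the first column corresponds to true value 0, the second to true value 1."""
    return np.array([[p00, 1.0 - p11],
                     [1.0 - p00, p11]])

def rr_epsilon(p00, p11):
    """Smallest eps such that max{p00/p01, p11/p10} <= exp(eps), i.e. the
    differential-privacy level of the randomized response mechanism
    according to (Wang et al., 2016)."""
    p01, p10 = 1.0 - p11, 1.0 - p00
    return float(np.log(max(p00 / p01, p11 / p10)))
```

For instance, keeping each value unchanged with probability $p_{00} = p_{11} = 0.9$ gives $\varepsilon = \ln 9 \approx 2.2$.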
Definition 4 (Differential Privacy for Graphs). It is said that a randomized function $A$ is $\varepsilon$-differentially private if for all neighboring graphs $G$ and $G'$ (i.e., differing on at most one element) and all $S \subseteq Range(A)$, it holds that:
$$\Pr[A(G) \in S] \leq \exp(\varepsilon) \times \Pr[A(G') \in S].$$
Differential privacy may be defined considering that neighboring graphs differ either in one node and all its incident edges, or in only one edge; these are the definitions of node- and edge-differential privacy, respectively. We first consider edge-differential privacy (Hay et al., 2009), as it may be directly related to randomized response. For node-differential privacy we will use the definition of node adjacency from (Blocki et al., 2013).
Definition 5 (Node Adjacency). We say that graphs $G$ and $G'$ are neighbors if there exists a node $v_i$ such that $G - v_i = G' - v_i$, where $G - v$ denotes the result of removing every edge in $E(G)$ that touches $v$. In other words, the graph $G$ may be obtained from $G'$ by replacing one node and all its incident edges.
2.2 Noise-graph Mechanism
Using definitions from Section 2.1 we are now ready
to define a differentially private mechanism for graph
publishing.
Definition 6 (Noise-graph Mechanism). We define $A_{n,p}$ to be the randomization mechanism that, for a graph $G$ with $n$ nodes and a probability $0 < p < 1/2$, outputs $A_{n,p}(G) = E(G \oplus g)$ with $g \in G(n, p)$.
Note that the noise-graph mechanism may be defined for bipartite graphs as follows.
Definition 7 (Bipartite Noise-graph Mechanism). For a bipartite graph $G = (U, V)$ such that $|U| = n$ and $|V| = m$, we define $A_{n,m,p}$ to be the randomization mechanism that, for a given probability $0 < p < 1/2$, outputs $A_{n,m,p}(G) = E(G \oplus g)$ with $g \in G(n, m, p)$.
We are going to show that the noise-graph mechanism is edge-differentially private, but all the proofs hold also for the bipartite noise-graph mechanism.
Theorem 1. The noise-graph mechanism $A_{n,p}$ is $\ln\left(\frac{1-p}{p}\right)$-edge-differentially private.

Proof. By the definition of $G(n, p)$, all the possible edges in $g$ have probability $p$. Hence, by Definition 1, if the edge $uv \in E(G)$, it will remain in $E(G \oplus g)$ with probability $1 - p$. Similarly, if $uv \notin E(G)$, then $uv \notin E(G \oplus g)$ with probability $1 - p$. That is, $\Pr[uv \in E(G \oplus g) \mid uv \in E(G)] = 1 - p$ and $\Pr[uv \notin E(G \oplus g) \mid uv \notin E(G)] = 1 - p$.

If the edge $uv \notin E(G)$, it will be in $E(G \oplus g)$ with probability $p$, and analogously, if $uv \in E(G)$, with probability $p$ it will not be in $E(G \oplus g)$. That is, $\Pr[uv \in E(G \oplus g) \mid uv \notin E(G)] = p$ and $\Pr[uv \notin E(G \oplus g) \mid uv \in E(G)] = p$.

Now, assume that $G$ and $G'$ differ in the edge $uv$, and recall that $A_{n,p}(G) = G \oplus g$. In either case,
$$\frac{\Pr[uv \in E(A_{n,p}(G)) \mid uv \in E(G)]}{\Pr[uv \in E(A_{n,p}(G')) \mid uv \notin E(G')]} = \frac{1-p}{p} \leq \exp(\varepsilon)$$
or
$$\frac{\Pr[uv \notin E(A_{n,p}(G)) \mid uv \notin E(G)]}{\Pr[uv \notin E(A_{n,p}(G')) \mid uv \in E(G')]} = \frac{1-p}{p} \leq \exp(\varepsilon),$$
which implies that $A_{n,p}$ is $\varepsilon$-edge-differentially private for $\varepsilon = \ln\left(\frac{1-p}{p}\right)$.
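In practice one usually starts from a target $\varepsilon$ and derives the noise probability by inverting $\varepsilon = \ln\left(\frac{1-p}{p}\right)$, i.e., $p = 1/(1 + e^{\varepsilon})$. A minimal sketch (our own helper, reusing the hypothetical noise_graph_addition function from the earlier sketch):

```python
import numpy as np

def flip_probability(epsilon):
    """Noise-edge probability p that makes A_{n,p} eps-edge-differentially
    private by Theorem 1, obtained by inverting eps = ln((1 - p) / p)."""
    return 1.0 / (1.0 + np.exp(epsilon))

# Example: a target of eps = 2.19 gives p close to 0.1 (cf. Table 1), and the
# mechanism is then applied as noise_graph_addition(adj, flip_probability(2.19)).
```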
The mechanism $A_{n,p}$ also provides node-differential privacy for graphs with $n$ nodes. In the following theorem we denote by $G$ the graph but also its edge set. We denote the complement of the size $X$ of an edge set as $\overline{X} = \frac{n(n-1)}{2} - X$, and the complement of $G \triangle S$ as the edges that do not belong to the edge set $G \triangle S$, which are also $\frac{n(n-1)}{2} - |G \triangle S|$ in number.

Theorem 2. The randomized mechanism $A_{n,p}$ is $n \times \ln\left(\frac{1-p}{p}\right)$-node-differentially private.
Proof. For any subset $S$ of all the possible edges on $n$ nodes, we denote $X = |G \triangle S|$. It holds that $\Pr[A_{n,p}(G) = S] = p^{X}(1-p)^{\overline{X}}$, since all the edges of $G$ that are not in $S$, and similarly the edges that are in $S$ but not in $G$, must belong to the noise-graph $g$, each with probability $p$; while the edges that are in $G \cap S$ or in neither set, i.e., in the complement of $G \triangle S$, do not belong to the noise-graph $g$, each with probability $1 - p$.

Equivalently, for a neighboring graph $G'$, we denote $X' = |G' \triangle S|$. Then $\Pr[A_{n,p}(G') = S] = p^{X'}(1-p)^{\overline{X'}}$. Therefore, we can calculate
$$\frac{\Pr[A_{n,p}(G) = S]}{\Pr[A_{n,p}(G') = S]} = \frac{p^{X}(1-p)^{\overline{X}}}{p^{X'}(1-p)^{\overline{X'}}} = p^{X - X'}(1-p)^{\overline{X} - \overline{X'}} \leq e^{\varepsilon}.$$
Considering that $\overline{X} - \overline{X'} = X' - X$, we obtain that the mechanism $A_{n,p}$ will be differentially private if $\ln\left(\left(\frac{p}{1-p}\right)^{X - X'}\right) = (X - X')\ln\left(\frac{p}{1-p}\right) \leq \varepsilon$. Since $G$ and $G'$ are neighboring graphs, we can bound $|X - X'|$ by $n$; hence $A_{n,p}$ is $n \times \ln\left(\frac{1-p}{p}\right)$-node-differentially private.
2.3 Noise-graph Addition for Weighted
Graphs
In this section we generalize noise-graph addition to weighted graphs; this will have applications for data in which the relations are weighted, such as the users' ratings in recommender systems.

We consider a weighted graph to be a graph $G$ with node set $V(G)$, edge set $E(G)$ and a function $\omega : E(G) \to [0,1]$ that assigns to each edge $e \in E(G)$ a weight $\omega(e) \in [0,1]$. That is, we are considering graphs in which the edges have weights between 0 and 1. In some cases this weight can represent the probability that $u$ and $v$ are connected, or the strength of their relation. For the case of relations weighted by ordinal numbers (e.g., $t \in \{0, 1, \ldots, m\}$), the weights may be transformed to numbers in $[0,1]$ easily by dividing them by the maximum number, i.e., $\frac{t}{m}$ in the example.
From a mathematical point of view, $\oplus$ can be understood as an exclusive or of the existence of the edges. So, $0 \oplus 0 = 0$, $1 \oplus 0 = 1$, $0 \oplus 1 = 1$, $1 \oplus 1 = 0$, where 1 and 0 mean existence or not of the edge.

Understanding $\oplus$ from this perspective, we can exploit fuzzy set theory to define $\oplus$. Fuzzy set theory provides operations that generalize conjunction (functions called t-norms and denoted $T$), disjunction (functions called t-conorms and denoted $S$) and complement (through functions called negations and denoted $N$). These operations are defined on the interval $[0,1]$ or $[0,1]^2$ instead of being defined on $\{0,1\}$ or $\{0,1\}^2$ as required by classical set theory and classical logic. Then, using standard logic properties, we know that:
$$x \oplus y = (x \vee y) \wedge \neg(x \wedge y).$$
Using this logical equivalence, we can define $x \oplus y$ when $x \in [0,1]$ and $y \in [0,1]$ in terms of t-norms $T$, t-conorms $S$ and negations $N$. That is,
$$x \oplus y = T(S(x, y), N(T(x, y))).$$
The functions t-norms, t-conorms, and negations
are established according to a set of axioms that gen-
eralize the properties of conjunction (and intersec-
tion), disjunction (and union), and complement. Be-
cause of that there are families of functions for each
of them. For the sake of simplicity, we will use $T(x, y) = \min(x, y)$, $S(x, y) = \max(x, y)$ and $N(x) = 1 - x$. For details and alternatives see, e.g., (Klir and Yuan, 1995). Using these functions, we define $\oplus$ as follows:
$$x \oplus y = \min(\max(x, y), 1 - \min(x, y)).$$
Definition 8 (Weighted Noise-graph Addition). Let $G_1(V, E_1, \omega_1)$ and $G_2(V, E_2, \omega_2)$ be two graphs with the same nodes $V$. We define the addition of $G_1$ and $G_2$ as the graph $G = G_1 \oplus G_2 = (V, E, \omega)$ with $\omega$ defined as follows:
$$\omega(e) = \omega_1(e), \quad \text{for } e \in E_1 \setminus E_2,$$
$$\omega(e) = \omega_2(e), \quad \text{for } e \in E_2 \setminus E_1,$$
$$\omega(e) = \min(\max(\omega_1(e), \omega_2(e)), 1 - \min(\omega_1(e), \omega_2(e))), \quad \text{for } e \in E_1 \cap E_2.$$
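The following sketch (our own, with the weighted graphs represented as Python dictionaries mapping edges to weights in [0, 1]) implements the fuzzy exclusive-or above and Definition 8:

```python
def fuzzy_xor(x, y):
    """x (+) y = min(max(x, y), 1 - min(x, y)), using the t-norm min,
    the t-conorm max and the negation N(x) = 1 - x."""
    return min(max(x, y), 1.0 - min(x, y))

def weighted_noise_graph_addition(w1, w2):
    """Weighted noise-graph addition (Definition 8) for two weighted graphs
    on the same nodes, given as dicts mapping an edge (u, v) to its weight."""
    w = {}
    for e in set(w1) | set(w2):
        if e in w1 and e in w2:
            w[e] = fuzzy_xor(w1[e], w2[e])  # e in E1 and E2
        elif e in w1:
            w[e] = w1[e]                    # e in E1 only
        else:
            w[e] = w2[e]                    # e in E2 only
    return w
```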
3 PRIVACY FOR
RECOMMENDER SYSTEMS
In this section we apply the noise-graph mechanism to bipartite graphs. All the proofs from Section 2.2 carry over by assuming that $G$ is bipartite and replacing $A_{n,p}(G)$ with $A_{n,m,p}(G)$, defining $E(G \oplus g)$ with $g \in G(n, m, p)$.
We can represent the user-item graph generated from the data for recommendations as a bipartite graph of $U$ users and $V$ items, with an edge $e = uv$ if the user $u \in U$ has liked the item $v \in V$. For numerical ratings, we may represent them by a weighted graph in which the weight of an edge $\omega(e)$ represents the rating normalized to $[0,1]$. A recommendation may be formulated as a link prediction problem in such graphs, or by representing the graphs with their adjacency matrix, with $U$ as rows and $V$ as columns, and using matrix factorization models (Koren et al., 2009). Similar methods have been successfully used for predicting private traits such as sexual, political or religious preferences from Likes in (Kosinski et al., 2013; Kosinski et al., 2016).
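For instance, such a user-item graph of binary likes can be stored as a sparse biadjacency matrix, which is also the input expected by the SVD and matrix factorization models mentioned above. A minimal sketch (our own; it assumes users and items have already been mapped to integer indices):

```python
import numpy as np
from scipy.sparse import coo_matrix

def user_item_matrix(pairs, n_users, n_items):
    """Biadjacency matrix of the bipartite user-item graph: rows are users,
    columns are items, and an entry is 1 if the user liked the item.
    `pairs` is an iterable of (user_index, item_index) tuples."""
    rows, cols = zip(*pairs)
    data = np.ones(len(rows), dtype=np.int8)
    return coo_matrix((data, (rows, cols)), shape=(n_users, n_items)).tocsr()
```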
3.1 Related Work
There are diverse ways to protect users' privacy, such as protecting them from precise inference of private attributes, protecting them from reidentification, as in (Salas, 2019) with k-anonymity, or affording them plausible deniability, as in randomized response mechanisms.
Randomized response was used in (Wang et al., 2016) for differentially private data collection; it was compared to the Laplace mechanism, and an empirical evaluation including graph statistics was provided. It was also used in (Polat and Du, 2006) to protect privacy for recommendations with binary ratings; however, the authors did not consider that differential privacy may be obtained.
Randomized response and differential privacy
were used in (Liu et al., 2017) to provide privacy dur-
ing the data collection and data publication stages.
They added random uniform noise to the user-ratings
and provided differential privacy for the item-item co-
variance matrix, which was defined in (Mironov and
McSherry, 2009) after the Netflix prize contest de-
anonymization (Narayanan and Shmatikov, 2008).
In (Salas, 2019) the method for differential privacy from (Mironov and McSherry, 2009) was compared with a method for k-anonymization for recommendations. The concept of k-anonymity was originally defined for tables (Samarati, 2001; Sweeney, 2002), where the attributes are classified into the disjoint classes of Identifiers (IDs), Quasi-Identifiers (QIs), Sensitive Attributes (SAs) and Non-sensitive attributes.
Some variants and extensions have been proposed for k-anonymity, such as $\ell$-diversity (Machanavajjhala et al., 2007) and the t-closeness approach (Li et al., 2007), as well as for $\varepsilon$-differential privacy (Desfontaines and Pejó, 2019), including $(\varepsilon, \delta)$-differential privacy (Dwork et al., 2006). Some of their differences and interactions are discussed in (Salas and Domingo-Ferrer, 2018).
3.2 Privacy Metrics for Adding
Noise-graphs
We devise metrics to measure the privacy provided by a sanitization method with different parameters, which will also be useful for comparing different methods, such as $\varepsilon$-differential privacy, k-anonymity or any of their variants. First, we provide the definitions of k-anonymity from (Samarati, 2001), as given in (Torra, 2017), and of the Risk and Imprecision measures from (Salas, 2019).
Definition 9. A dataset is k-anonymous if each record is indistinguishable from at least $k-1$ other records within the dataset, when considering the values of its QIs. That is, if we denote the set of QI values of a record $j$ as $Q_j$, then for each record $j$ there are at least $k-1$ other records $\{j_1, \ldots, j_{k-1}\}$ such that $Q_{j_i} = Q_j$ for all $i \in \{1, \ldots, k-1\}$.
Definition 10. For each user $u$, the Sensitive Attribute Risk ($SA_R$) of a sanitization is defined as the proportion of her observed records $R_u$ that are part of her
true records $r_u$, that is:
$$SA_R(u) = \frac{|r_u|}{|R_u|} \qquad (2)$$
Note that Definition 10 considers that the published data for each user $u$ (observed records $R_u$) contains the true data ($r_u$). This equals the Jaccard similarity $J(R_u, r_u) = \frac{|R_u \cap r_u|}{|R_u \cup r_u|} = \frac{|r_u|}{|R_u|}$ when $r_u \subseteq R_u$.

However, if we consider the Jaccard distance instead, we obtain $d_J(R_u, r_u) = 1 - J(R_u, r_u) = \frac{|R_u \triangle r_u|}{|R_u \cup r_u|}$.
In our graph representation, user $u$'s true records $r_u$ equal $N_G(u)$ and $R_u$ equals $N_{\tilde{G}}(u)$, where $\tilde{G} = A_{n,p}(G) = G \oplus g$. Hence,
$$1 - SA_R(u) = d_J(R_u, r_u) = d_J(N_{\tilde{G}}(u), N_G(u)) = \frac{|N_{\tilde{G}}(u) \triangle N_G(u)|}{|N_{\tilde{G}}(u) \cup N_G(u)|} = \frac{|N_{G \oplus g}(u) \triangle N_G(u)|}{|N_{G \oplus g}(u) \cup N_G(u)|} = \frac{|(N_G(u) \oplus N_g(u)) \triangle N_G(u)|}{|(N_G(u) \oplus N_g(u)) \cup N_G(u)|} = \frac{|N_g(u)|}{|N_g(u) \cup N_G(u)|}.$$
This distance may equal 1 if $N_{\tilde{G}}(u) \cap N_G(u) = \emptyset$; in that case it will be 1 for sets that are quite different, and hence it will not measure how many changes have been made to the original set. Therefore, we propose to use instead $d(N_{\tilde{G}}(u), N_G(u)) = \frac{|N_g(u)|}{|N_g(u)| + |N_G(u)|}$, which considers the number of modified items in proportion to the number of real items liked by user $u$. Consequently, we define the Sensitive Attribute Risk for any graph as $1 - d(N_{\tilde{G}}(u), N_G(u))$ in Equation (3), which is consistent with Definition 10.
Definition 11. Let $G$ be a graph and $\tilde{G}$ a protected version of the graph $G$ with the same number of nodes as $G$. Let $g = \tilde{G} \oplus G$ be the edge difference between $\tilde{G}$ and $G$ (i.e., the noise added to $G$). We define the Sensitive Attribute Risk for a node $u \in V(G)$ as:
$$SA_R(u) = \frac{|N_G(u)|}{|N_g(u)| + |N_G(u)|} \qquad (3)$$
In the case of weighted graphs, Definition 11 is still relevant, since it measures the risk by the number of movies. However, the weights can be taken into account by replacing $|N_G(u)|$ with its weighted version, which we denote $\omega(N_G(u))$ and define as $\omega(N_G(u)) = \sum_{v \in N_G(u)} \omega(uv)$. Therefore, we obtain the following definition, which generalizes the previous one, considering that for an unweighted graph the weight function $\omega(uv) = 1$ for all $uv \in E(G)$ can be used.
Definition 12. Let $G$ be a weighted graph, $\omega$ its weight function, and $\tilde{G}$ a protected version of the graph $G$ with the same number of nodes as $G$. Let $g = \tilde{G} \oplus G$ be the edge difference between $\tilde{G}$ and $G$ (i.e., the noise added to $G$). We define the Weighted Sensitive Attribute Risk for a node $u \in V(G)$ as:
$$SA_R(u) = \frac{\omega(N_G(u))}{\omega(N_g(u)) + \omega(N_G(u))} \qquad (4)$$
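A short sketch of how these per-user risks can be computed (our own helper; the neighborhoods are passed as Python sets and the optional weight lookup stands in for $\omega$):

```python
def sensitive_attribute_risk(NG_u, Ng_u, weight=None):
    """SA_R(u) from Definitions 11 and 12 for a single node u.

    NG_u  : set of items adjacent to u in the original graph G.
    Ng_u  : set of items adjacent to u in the noise g = G~ (+) G.
    weight: optional dict mapping an item to its edge weight for u;
            if None, every edge counts as 1 (the unweighted case)."""
    if weight is None:
        w_G, w_g = len(NG_u), len(Ng_u)
    else:
        w_G = sum(weight[v] for v in NG_u)
        w_g = sum(weight[v] for v in Ng_u)
    return w_G / (w_g + w_G)
```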
Definition 13. Let $G$ be a weighted graph, $\omega$ its weight function, and $\tilde{G}$ a protected version of the graph $G$. The Average Sensitive Attribute Imprecision ($SA_I$) is defined as follows:
$$SA_I = \frac{1}{|E(G \oplus \tilde{G})|} \sum_{e \in E(G \oplus \tilde{G})} |\omega_{\tilde{G}}(e) - \omega_G(e)| \qquad (5)$$
Note that $SA_I$ is a well-known measure of information loss, namely the mean absolute error. However, when all the attributes are at the same time QIs and SAs, it measures both the disclosure risk and the information loss.
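A corresponding sketch for Eq. (5) (again our own helper; the caller supplies the edge set $E(G \oplus \tilde{G})$ explicitly, and missing edges are treated as weight 0):

```python
def average_sa_imprecision(edges, w_orig, w_san):
    """Average Sensitive Attribute Imprecision (Eq. 5): the mean absolute
    weight difference |w_G~(e) - w_G(e)| over the given edge set E(G (+) G~).
    w_orig and w_san are dicts mapping an edge (u, v) to its weight."""
    return sum(abs(w_san.get(e, 0.0) - w_orig.get(e, 0.0)) for e in edges) / len(edges)
```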
After defining these metrics, we perform the util-
ity/privacy evaluation in the following section.
4 EXPERIMENTS
In this section we apply our algorithms to data that
was used to predict users’ psychodemographics, as
well as for training collaborative filtering algorithms.
4.1 Facebook Likes Dataset for
Psychodemographic Inference
In this section we will protect the Facebook Likes data from (Kosinski et al., 2016). The dataset consists of Facebook users (110,728), Facebook Likes (1,580,284) and their user-like pairs (10,612,326). This data was used to predict the psychodemographics of the same users: their gender, age, political views, and their scores on openness, conscientiousness, extroversion, agreeableness, and neuroticism, the latter measured by a 100-item International Personality Item Pool questionnaire for the five-factor model of personality (Goldberg et al., 2006).
Since this dataset is sparse, there are a large number of users and items that appear few times; such data points are of little significance for building a model to perform inference. We follow the same approach of trimming the User-Like Matrix to obtain the same dataset for inference as in (Kosinski et al., 2016). They chose thresholds of a minimum of 50 likes per user and 150 users per like, obtaining a dataset with n = 19,724 users, m = 8,523 likes and q = 3,817,840 user-like pairs.
With this data we generate a bipartite graph $G$ of users $U$ and likes $L$, with $|U| = n$, $|L| = m$ and $|E(G)| = q$. Hence, for adding random noise to $G$ we sample bipartite random graphs $g \in G(n, m, p)$. We recall that $A_{n,m,p}(G)$ is $\varepsilon$-edge-differentially private for $\varepsilon = \ln\left(\frac{1-p}{p}\right)$, as we pointed out at the beginning of Section 3. In Table 1 we present the corresponding values and the number of user-likes in the
randomized graph $A_{n,m,p}(G) = G \oplus g$ of our experiments. Note that for each $p$ we obtain noise-graphs $g$ with $p \times n \times m = p \times$ 168,107,652 edges on average.
Table 1: Values of p, ε and number of user-likes in the ran-
domized graph.
p ε |E(g)| |E(G ⊕ g)|
0.005 5.29 840,162 4,619,770
0.05 2.94 8,408,449 11,844,981
0.1 2.19 16,824,538 19,878,770
0.2 1.38 33,657,261 35,949,261
0.3 0.84 50,482,636 52,007,684
0.4 0.40 67,302,556 68,070,070
For inference, we performed Singular Value Decomposition (SVD), cf. (Leskovec et al., 2014). Then, we used logistic regression to predict the variable values (such as gender or political views) from the user SVD scores in the training subset. We use 10-fold cross-validation, and AUC and the Pearson correlation coefficient to measure the accuracy of the predictions, following (Kosinski et al., 2016).
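A rough sketch of this pipeline for a binary trait (our own, using scipy and scikit-learn; the number of SVD dimensions k is an assumption, since the exact value used in the experiments is not stated here):

```python
from scipy.sparse.linalg import svds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def prediction_auc(M, y, k=50):
    """Reduce the (users x likes) sparse matrix M with a k-dimensional SVD,
    then predict the binary trait y (e.g. gender) from the user scores
    with logistic regression, reporting the mean AUC over 10-fold CV."""
    U, s, Vt = svds(M.astype(float), k=k)   # truncated SVD of the sparse matrix
    user_scores = U * s                      # users in the k-dimensional latent space
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, user_scores, y, cv=10, scoring="roc_auc").mean()
```

The same function can be run on the original matrix and on each randomized version in order to compare the resulting accuracies, as in Figure 1.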
After performing inference on the data represented by $G$, we compare with the results for each of the differentially private graphs $A_{n,m,p}(G) = G_\varepsilon$, to measure the accuracy lost when providing privacy guarantees to the data subjects; see Figure 1. Recall that the variables are gender, age, political views, and the users' scores on openness, conscientiousness, extroversion, agreeableness, and neuroticism.
Figure 1: Accuracy for predictions depending on the ε. (x-axis: ε values {5.3, 3.0, 2.2, 1.4, 0.8, 0.4}; y-axis: prediction accuracy; one curve per variable: age, agr, con, ext, gender, neu, ope, political.)
We see in Figure 1 that for probabilities $p \leq 0.1$, or equivalently $\varepsilon \geq 2.2$, the data utility is barely reduced. The predictions on demographics remain quite precise, as we show in Table 2, where the imprecision is measured (in %) as the relative error of the accuracy obtained on the sanitized data with $\varepsilon$-differential privacy with respect to the original accuracy obtained without sanitizing the dataset. The values of the accuracy on the data without sanitization are gender = 0.936, age = 0.616, political = 0.884, openness = 0.450, conscientiousness = 0.264, extroversion = 0.309, agreeableness = 0.213 and neuroticism = 0.304.
Table 2: Imprecision of predictions depending on ε (mea-
sured in %).
att/ε 5.29 2.94 2.19 1.38 0.84 0.40
gen 0.5 4 7 17 29.2 41.5
age 0.4 6.4 8.5 25.8 58.4 74.7
pol 0.5 3.7 9.5 17.7 35.9 42.3
ope 0.3 9.3 15.7 28.3 57.3 82.2
con 0.9 3.2 8.7 17.4 55.8 86.5
ext 0.8 3.9 10.7 33 74.5 100.5
agr 0.9 5.7 19.1 46 76.8 84.2
neu 0.7 2.5 9.4 33.4 72.9 82.2
4.2 MovieLens Dataset for
Collaborative Filtering
In this section we protect the MovieLens-100K dataset (Harper and Konstan, 2015), which is commonly used as a benchmark dataset for collaborative filtering. The MovieLens-100K dataset contains 100,000 ratings (between 1 and 5), with timestamps, that 943 users gave to 1,682 movies.
With this data we generate a weighted bipartite graph $G$ of users $U$ and movies $V$, with the weight function $\omega : E(G) \to [0,1]$ such that $\omega(uv)$ is the (normalized) rating that user $u$ assigned to movie $v$. For adding noise to such a graph we apply the techniques from Section 2.3, and we generate noise-graphs as in Theorems 1 and 2. Additionally, we define the weight function $\omega'$ for the noise-graph $g$ so that it assigns edge weights drawn from the empirical distribution of the edge weights assigned by $\omega$ to $G$; the rationale behind this is that assigning a constant weight $\omega' = c$ to all edges in $g$ may facilitate identifying the noise-edges that have been added to $G$. The distribution of the edge weights and the weighted sum function may play a key role in proving that the weighted noise-graph addition is differentially private; we leave such a proof for future work.
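A minimal sketch of this weighting step (our own; it assumes the original normalized ratings are available as an array and simply resamples them for the noise edges):

```python
import numpy as np

def sample_noise_weights(original_weights, n_noise_edges, seed=None):
    """Draw weights for the noise edges from the empirical distribution of the
    original (normalized) ratings, so that noise edges cannot be singled out
    by a constant weight value."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.asarray(original_weights), size=n_noise_edges, replace=True)
```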
We perform a comparison with an algorithm providing k-anonymity, namely the HAKR algorithm from (Salas, 2019). We measure the privacy provided by both algorithms, to show that even when using two a priori different models for privacy protection, we may still be able to compare their privacy guarantees. In Table 3 we show the values of $p$ and the number of ratings (edges) in the noise-graph and the randomized graph. Note that we consider only the ratings in the train set as the edges of the graph $G$ to be protected.
Table 3: Values of p and number of ratings in the random-
ized graph.
p |E(g)| |E(G ⊕ g)|
0.005 7,728 87,354
0.05 77,907 153,864
0.1 155,663 227,698
0.2 312,148 376,192
0.3 466,739 522,564
0.4 621,566 669,706
In Table 4 we consider the noise added to $G$ to obtain the sanitized graph $\tilde{G}$ as in Definition 11, where $g = \tilde{G} \oplus G$. Note that in this case $g$ is not a random graph, and the graph that is k-anonymized is the graph corresponding to the train set of 80K ratings.
Table 4: Values of k and number of ratings added in the
sanitized graph.
k |E(g)| |E(G̃)|
2 74,444 154,444
3 132,236 212,236
4 178,469 258,469
5 214,966 294,966
6 249,806 329,806
In Figures 2 and 3 we plot the cumulative distributions of the values of the Sensitive Attribute Risk for all the users. In Figure 2 we see that the mean values of $SA_R$ in the randomized data are 0.84, 0.42, 0.28, 0.18, 0.13 and 0.10, respectively, for $p = 0.005, 0.05, 0.1, 0.2, 0.3, 0.4$; while in Figure 3 the mean values of $SA_R$ in the k-anonymous data are 0.50, 0.35, 0.28, 0.24, 0.21, respectively, for $k = 2, 3, 4, 5, 6$. The privacy provided depends not only on the method, but also on the parameter. For example, for $p = 0.1$ and $k = 4$ the average $SA_R$ is the same.
Table 5: Information loss/disclosure risk measured as average $SA_I$.

p      SA_I
0.005  0.3025
0.05   1.732
0.1    2.351
0.2    2.855
0.3    3.080
0.4    3.197

k  SA_I
2  1.713
3  2.209
4  2.443
5  2.571
6  2.675
Finally, in Table 5 we compare the information loss (as mean $SA_I$) between a priori different sanitization methods such as adding noise and k-anonymization. The positive aspect of randomization is that $p$ is a continuous value, while $k$ is discrete, so randomization can be better tuned for more specific privacy and utility values.
Figure 2: Sensitive attribute risk cumulative distribution grouped by p. (x-axis: Sensitive Attribute Risk; y-axis: Cumulative Distribution.)
Figure 3: Sensitive attribute risk cumulative distribution grouped by k. (x-axis: Sensitive Attribute Risk; y-axis: Cumulative Distribution.)
5 CONCLUSIONS
In this paper, we presented a method for differentially private graph publishing based on noise-graph addition. Then, we showed that it may be applied to obtain randomized response and differential privacy for collaborative filtering. Finally, we provided a measure of the privacy of a sanitization method that allows comparison between algorithms with different a priori guarantees, such as ε-differential privacy and k-anonymity. We tested our algorithms on public Facebook Likes data to prove that the accuracy of profiling algorithms is well preserved even when they are trained with differentially private data. The experiment on the MovieLens dataset shows an application for sanitizing weighted graphs for recommendations. We can conclude from our results that it is possible to provide strong privacy guarantees to users, while still obtaining accurate recommendations and predictions. We leave as future work to show that our method may provide ε-differential privacy for weighted graphs.
ACKNOWLEDGEMENTS
This work was partially supported by the Swedish Research Council (Vetenskapsrådet) project DRIAT (VR 2016-03346), the Spanish Government under grant RTI2018-095094-B-C22 "CONSENT", and the UOC postdoctoral fellowship program.
REFERENCES
Aiello, W., Chung, F., and Lu, L. (2001). A random graph
model for power law graphs. Experimental Mathemat-
ics, 10(1):53–66.
Blocki, J., Blum, A., Datta, A., and Sheffet, O. (2013). Dif-
ferentially private data analysis of social networks via
restricted sensitivity. In Proceedings of the 4th Con-
ference on Innovations in Theoretical Computer Sci-
ence, ITCS ’13, pages 87–96.
Desfontaines, D. and Pejó, B. (2019). SoK: Differential privacies.
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and
Naor, M. (2006). Our data, ourselves: Privacy via
distributed noise generation. In Vaudenay, S., editor,
Advances in Cryptology - EUROCRYPT 2006, pages
486–503.
Dwork, C. and Roth, A. (2014). The algorithmic foun-
dations of differential privacy. Found. Trends Theor.
Comput. Sci., 9(3–4):211–407.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan,
R., Ashton, M. C., Cloninger, C. R., and Gough,
H. G. (2006). The international personality item pool
and the future of public-domain personality measures.
Journal of Research in Personality, 40(1):84–96.
Proceedings of the 2005 Meeting of the Association
of Research in Personality.
Harper, F. M. and Konstan, J. A. (2015). The movielens
datasets: History and context. ACM Trans. Interact.
Intell. Syst., 5(4):19:1–19:19.
Hay, M., Li, C., Miklau, G., and Jensen, D. (2009). Ac-
curate estimation of the degree distribution of private
networks. In 2009 Ninth IEEE International Confer-
ence on Data Mining, pages 169–178.
Holohan, N., Leith, D. J., and Mason, O. (2017). Optimal
differentially private mechanisms for randomised re-
sponse. IEEE Transactions on Information Forensics
and Security, 12(11):2726–2735.
Klir, G. J. and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic:
Theory and Applications. Prentice-Hall, Inc., USA.
Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factor-
ization techniques for recommender systems. Com-
puter, 42(8):30–37.
Kosinski, M., Stillwell, D., and Graepel, T. (2013). Pri-
vate traits and attributes are predictable from digital
records of human behavior. Proceedings of the Na-
tional Academy of Sciences, 110(15):5802–5805.
Kosinski, M., Wang, Y., Lakkaraju, H., and Leskovec,
J. (2016). Mining big data to extract patterns and
predict real-life outcomes. Psychological Methods,
21(4):493–506.
Leskovec, J., Rajaraman, A., and Ullman, J. D. (2014). Min-
ing of Massive Datasets. Cambridge University Press,
2 edition.
Li, N., Li, T., and Venkatasubramanian, S. (2007).
t-closeness: Privacy beyond k-anonymity and l-
diversity. In 2007 IEEE 23rd International Confer-
ence on Data Engineering, pages 106–115.
Liu, X., Liu, A., Zhang, X., Li, Z., Liu, G., Zhao, L., and
Zhou, X. (2017). When differential privacy meets ran-
domized perturbation: A hybrid approach for privacy-
preserving recommender system. In Database Sys-
tems for Advanced Applications, pages 576–591.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasub-
ramaniam, M. (2007). L-diversity: Privacy beyond
k-anonymity. ACM Trans. Knowl. Discov. Data, 1(1).
Mironov, I. and McSherry, F. (2009). Differentially pri-
vate recommender systems: Building privacy into the
netflix prize contenders. In Proceedings of the 15th
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining (KDD), pages 627–
636.
Narayanan, A. and Shmatikov, V. (2008). Robust de-
anonymization of large sparse datasets. In 2008 IEEE
Symposium on Security and Privacy (sp 2008), pages
111–125.
Polat, H. and Du, W. (2006). Achieving private recom-
mendations using randomized response techniques. In
Advances in Knowledge Discovery and Data Mining,
pages 637–646.
Salas, J. (2019). Sanitizing and measuring privacy of large
sparse datasets for recommender systems. Journal of
Ambient Intelligence and Humanized Computing.
Salas, J. and Domingo-Ferrer, J. (2018). Some basics
on privacy techniques, anonymization and their big
data challenges. Mathematics in Computer Science,
12(3):263–274.
Samarati, P. (2001). Protecting respondents identities in mi-
crodata release. IEEE Transactions on Knowledge and
Data Engineering, 13(6):1010–1027.
Sweeney, L. (2002). k-anonymity: A model for protecting
privacy. International Journal of Uncertainty, Fuzzi-
ness and Knowledge-Based Systems, 10(05):557–570.
Torra, V. (2017). Data privacy: Foundations, new develop-
ments and the big data challenge. Springer.
Torra, V. and Salas, J. (2019). Graph perturbation as
noise graph addition: A new perspective for graph
anonymization. In Data Privacy Management, Cryp-
tocurrencies and Blockchain Technology, pages 121–
137.
Wang, Y., Wu, X., and Hu, D. (2016). Using randomized
response for differential privacy preserving data col-
lection. In EDBT/ICDT2016WS.
Warner, S. L. (1965). Randomized response: A survey tech-
nique for eliminating evasive answer bias. Journal of
the American Statistical Association, 60(309):63–69.