Preserving Privacy in High-Dimensional Data Publishing
Narges Alipourjeddi and Ali Miri
Department of Computer Science, Toronto Metropolitan University, Toronto, Canada
Keywords:
High-Dimensional Data, Privacy Preservation, Persistent Homology, Differential Privacy, Data Publishing,
Topological Data Analysis.
Abstract:
As the era of big data unfolds, high-dimensional datasets with complex structures have become increasingly
prevalent in various fields, including healthcare, finance, and the social sciences. Extracting valuable insights
from such data is essential for scientific discovery and decision-making. However, publishing these
datasets raises serious privacy concerns, as they often contain sensitive and personally identifiable information. In
this paper, we introduce a novel approach that addresses the delicate balance between data privacy and the
exploration of the underlying structure of high-dimensional data. We leverage the power of persistent homology,
a topological data analysis method, to unveil hidden patterns and capture the persistent topological features
of the data, allowing us to study its shape and structure across different scales. By adding noise to the low-dimensional
embedding and producing a private persistence diagram, differential privacy offers a rigorous and
well-established framework that ensures the privacy of individuals in the dataset is protected. We then synthetically
generate high-dimensional data from these differentially private persistence diagrams, ensuring
privacy in the publication of the synthesized dataset. We conduct extensive experiments on three real-world
datasets, and the experimental results demonstrate that our mechanism significantly better preserves the structure
of the published data while satisfying differential privacy.
1 INTRODUCTION
In our data-driven era, high-dimensional datasets have
become ubiquitous, permeating fields as diverse as
healthcare, finance, and social sciences. The informa-
tion encapsulated within these data sets holds the key
to crucial scientific discoveries, informed decision-
making, and innovation. However, sharing this data
is not without its challenges, and among the most sig-
nificant is the need to navigate the delicate balance
between data publishing and data privacy.
The advent of big data has brought forth a press-
ing concern: how can we unlock the valuable insights
hidden within high-dimensional datasets, while safe-
guarding the sensitive and personally identifiable in-
formation they contain? This question is central to
our research as we delve into the intersection of data
privacy and data publishing. Our approach, built upon
the powerful foundations of persistent homology and
differential privacy, seeks to address this fundamental
question.
Privacy-Preserving Data Publishing (PPDP) has
gained significant attention in recent years as a
promising approach for information sharing while
preserving data privacy. There exist standard methods
such as k-anonymity (Mahanan et al., 2021), l-diversity
(Binjubeir et al., 2019) and t-closeness (Binjubeir
et al., 2019) that data collectors (sometimes
also referred to as curators) can apply to protect and
anonymize datasets. However, these methods can still
leak information when analysis involves additional
datasets or auxiliary information from other sources.
One also needs to be able to formally measure information
leakage and privacy protection. A commonly
used framework for preserving and measuring privacy is Differential Privacy
(DP) (Dwork et al., 2014). DP provides
privacy guarantees using an information-theoretic
approach: the main idea is that
what can be learned from the published data is (approximately)
the same, whether or not any particular
individual was included in the input database. This
model rests on a mathematical foundation, with a formal definition
and rigorous proofs, while assuming
that an attacker has maximum background
knowledge.
Nonetheless, ensuring differential privacy in the
publication of high-dimensional data remains
a formidable challenge, primarily due to the “Curse
of High-Dimensionality”. This phenomenon signi-
fies that as the dimensionality of the data grows, the
complexity and computational cost of handling and
analysing multidimensional data experience exponen-
tial growth.
One promising way to address high dimensionality
is to disassemble the dataset into a group of lower-dimensional
datasets. One of the traditional approaches
for disassembling the dataset into a group of lower-dimensional
datasets was presented by Zhang et al. (Zhang
et al., 2017), who used a Bayesian network to deal
with high dimensionality. They assumed that correlations
between attributes exist and that, if these correlations
can be modelled, the model can be used to generate
a set of marginal datasets that simulate the distribution
of the original dataset. The disadvantage of
this solution is that it consumes too much of the privacy
budget during network construction and, hence,
makes the approximation of the distribution inaccurate.
In this work, we employ a non-linear dimen-
sionality reduction method grounded in the mani-
fold hypothesis, which posits that real-world data sets
may reside on a non-linear, low-dimensional man-
ifold embedded within a high-dimensional ambient
vector space. In many real-world datasets, the char-
acteristics of this underlying manifold are initially un-
known. Manifold learning attempts to extract this hidden manifold
by mapping the data into a lower-dimensional space.
One prominent tool in this area is Topological Data
Analysis (TDA), which analyses both geometric
and topological information within datasets.
TDA represents an innovative field of data anal-
ysis that was developed to capture the underlying
topological structures within data. Over the past few
decades, TDA has undergone extensive research and
exploration. This approach has proven invaluable
in handling complex, high-dimensional datasets that
challenge the capabilities of traditional data analysis
methods.
Persistent homology is a powerful tool for dimen-
sionality reduction from the field of TDA. In high-
dimensional data analysis, the manifold hypothesis
suggests that many datasets naturally lie on or near
lower-dimensional manifolds. These manifolds repre-
sent the underlying structure of the data, even though
the data is observed in a higher-dimensional space.
Persistent homology detects topological features that
represent the various components of the data, including
the lower-dimensional manifolds. These features
can include connected components (0-dimensional
manifolds), loops (1-dimensional manifolds), voids
(2-dimensional manifolds), and so on.
In this paper, we present a novel approach that obtains
the topological features of our datasets and privately captures
how long these features persist.
This makes it possible to generate and publish high-dimensional
data privately. Specifically, we make the
following contributions:
1) We use the persistent homology technique to analyse
the topological structure underlying our datasets and to create
their persistence diagrams.
2) We apply differential privacy measures to
the persistence diagram to obtain private features.
3) We generate a synthetic dataset based on the private
persistence diagram.
We commence with a preliminaries section, lay-
ing the groundwork with essential background infor-
mation and the introduction of the notations we will
use (Section 2). In Section 3, we delve into an exami-
nation of the related work in the field. Our framework
is presented comprehensively in Section 4, while Sec-
tion 5 showcases its practical capabilities. The paper
concludes with a summary and insights in Section 6.
2 PRELIMINARIES
In this section we review some of the standard con-
cepts from topology, algebraic topology and differen-
tial privacy. We want to use these methods to synthe-
size private high-dimensional datasets.
2.1 Differential Privacy Fundamentals
The protection of individuals’ privacy in the con-
text of data publishing and analysis has become a
paramount concern with the increasing availability of
large and sensitive datasets. Differential privacy of-
fers a rigorous and effective approach to address this
concern by ensuring that individual privacy is main-
tained while allowing for meaningful data publishing.
This section introduces the core concepts and termi-
nology related to differential privacy. Formally, dif-
ferential privacy is defined as follows:
Definition 2.1 (ε-differential privacy). A randomized
mechanism M gives ε-differential privacy if, for every set
of outputs S and for any pair of neighbouring datasets D
and D′, M satisfies

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S].
In other words, the probability of obtaining a spe-
cific outcome from the mechanism M is only slightly
influenced by the inclusion or exclusion of any indi-
vidual’s data.
Two fundamental components of differential pri-
vacy are the sensitivity of a function and the privacy
parameter ε. The sensitivity of a function f quanti-
fies how much the function’s output can change when
a single data point is added or removed from the
dataset. The parameter ε refers to the privacy budget,
which controls the level of privacy guarantee achieved
by mechanism M. A smaller ε represents a stronger
privacy level. For a strong privacy guarantee, the
privacy budget should be small, ideally in the range
between zero and one.
To achieve differential privacy, various privacy
mechanisms introduce controlled randomness into
data analysis. Common mechanisms include the
Laplace mechanism and the exponential mechanism.
The Laplace mechanism (Dwork et al., 2016)
perturbs the output of a function with
Laplace-distributed noise to achieve differential privacy.
We write Lap(b) for noise sampled from a
Laplace distribution with scale b.

Definition 2.2. For a function f : D → R over
a dataset D, the mechanism M provides ε-differential
privacy if

M(D) = f(D) + Lap(Δf / ε).
In Definition 2.2, the parameter Δf refers to the
global sensitivity, which determines how much perturbation
is required for a particular query in a mechanism.
It is defined as the largest difference
between the outputs of a query q on any pair of
neighbouring datasets, i.e.,

Δq = max ||q(D) − q(D′)||_1

where ||·||_1 is the L_1 norm.
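As a minimal illustration (our own sketch, not code from the paper), the following Python snippet applies the Laplace mechanism to a counting query, whose global sensitivity is 1 since adding or removing one record changes the count by at most 1:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Perturb true_value with noise drawn from Lap(sensitivity / epsilon)."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release how many records have age > 40.
ages = np.array([34, 29, 41, 58, 23, 47])
true_count = int(np.sum(ages > 40))
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)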
The Exponential Mechanism (McSherry and Tal-
war, 2007) is employed when you need to select an
item from a set or make a decision based on data, and
you want to ensure that the process is differentially
private. This means that the probability of selecting
one item over another should be adjusted to protect
privacy while preserving the utility of the selection.
Definition 2.3. Let q(D, φ) be a score function that
measures the quality of an output φ on dataset D, and
let Δq denote the sensitivity of q. The exponential mechanism
M satisfies ε-differential privacy if it returns φ with probability
proportional to

exp( ε · q(D, φ) / (2Δq) ).
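The exponential mechanism can be sketched in a few lines of Python; the candidate set and score function below are illustrative placeholders of our own, and the sampling probabilities follow Definition 2.3.

import numpy as np

def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng=None):
    """Sample a candidate with probability proportional to exp(eps*score/(2*sens))."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    # Shift by the maximum score for numerical stability; this does not change
    # the normalized probabilities.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

# Toy example: privately pick the most frequent category (a count has sensitivity 1).
counts = {"A": 120, "B": 95, "C": 30}
winner = exponential_mechanism(list(counts), list(counts.values()),
                               sensitivity=1.0, epsilon=1.0)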
2.2 Persistent Homology
As per the manifold distribution hypothesis (Goodfel-
low et al., 2016), real-world high-dimensional data is
often situated on a lower-dimensional manifold hid-
den within the broader high-dimensional space. This
underlying manifold is believed to possess a highly
intricate non-linear structure, making its explicit defi-
nition challenging. Nonetheless, it is possible to scru-
tinize and analyze this manifold by considering its
topological properties.
Topological Data Analysis (TDA) serves as a
framework that integrates techniques from algebraic
topology and statistical learning, providing a quanti-
tative foundation for understanding these topological
properties. Among the array of tools hailing from al-
gebraic topology used in TDA, persistent homology
stands out as a pivotal method. To compute the persistent
homology of a space, it is necessary to first
express the space as a simplicial complex. Figure 1
shows an example of simplices and of a simplicial
complex; a simplicial complex K is essentially a collection
of simplices glued together along shared faces. Simplicial homology
employs matrix reduction algorithms to assign to K
a family of groups, the homology groups. The d-th
homology group H_d(K) of K contains d-dimensional
topological features, such as connected components
(d = 0), cycles/tunnels (d = 1), and voids (d = 2).
Homology groups are typically summarised by their
ranks, thereby obtaining a simple invariant “signature”
of a manifold. For example, a circle in R^2 has
one feature with d = 1 (a cycle) and one feature with
d = 0 (a connected component).
Figure 1: Example of a simplicial complex built from different simplices. A 0-simplex is a vertex point, a 1-simplex is a pair of vertex points bounding a line segment, a 2-simplex is a set of three vertex points spanning a triangle, and a 3-simplex is a tetrahedron.
In practical scenarios, the underlying manifold M
is often unknown, and we work with a point
cloud X := {x_1, ..., x_n} ⊂ R^d together with a metric
X × X → R such as the Euclidean distance. Persistent
homology adapts simplicial homology to this context.
Instead of attempting to approximate M through
a single simplicial complex, which can be unstable
due to the discrete nature of X, persistent homology
monitors changes in homology groups across various
scales of the metric. A distance function on the underlying
space corresponds to a filtration of the simplicial
complex. One common way of doing this
is the Vietoris-Rips construction. The Vietoris-Rips
complex of parameter d is the simplicial complex
whose simplices are the finite sets of points of diameter at most
d; that is, the Vietoris-Rips complex of X at scale d contains
all simplices of X whose elements x_0, x_1, ... satisfy
dist(x_i, x_j) ≤ d for all i, j.
If we consider all distances d, each homology class
appears at a particular value of d and disappears at another
value of d. We represent the persistence of such a feature
as a pair, for example (d_1, d_2), and visualize this
pair as a bar from d_1 to d_2. A collection of bars is
a barcode. We can represent persistent homology
with a barcode or a persistence diagram. A barcode represents
each persistent generator with a horizontal line
beginning at the first filtration level where it appears
and ending at the filtration level where it disappears.
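To make the construction concrete, the following sketch computes a Vietoris-Rips persistence diagram on a toy point cloud with the GUDHI library; the paper does not name its TDA implementation, so the choice of library, point cloud and parameters here is ours.

import numpy as np
import gudhi

# Points sampled near a unit circle: we expect one prominent 1-dimensional
# feature (the loop) in addition to the 0-dimensional connected components.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += 0.05 * rng.normal(size=points.shape)

# Vietoris-Rips filtration up to scale 2.0, with simplices up to dimension 2.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# Each entry is (dimension, (birth, death)); an infinite death marks a
# feature that never disappears within the chosen scale range.
diagram = simplex_tree.persistence()
loops = [pair for dim, pair in diagram if dim == 1]
print(len(loops), max(loops, key=lambda p: p[1] - p[0]))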
Figure 2: A three-step filtration of the Vietoris-Rips complex on a set of 10 points with increasing radius 0 < d_1 < d_2 < d_3; panels (a), (b) and (c) correspond to d_1, d_2 and d_3, respectively.
Figure 2 shows the Vietoris-Rips complex
of a point cloud X at the different scales d_1, d_2 and
d_3. As the distance threshold increases, the connectivity
changes. The creation and destruction of d-dimensional
topological features is recorded in the d-th
persistence diagram, which is shown in Figure 3.
Figure 3: The persistence diagram corresponding to the filtration in Figure 2. Blue points represent persistent homology groups of dimension 0, and the orange ones of dimension 1.
A persistence diagram P = {(b_i, d_i)} is essentially
a multiset of birth-death pairs b_i and d_i, which satisfy
b_i ≤ d_i. There are numerous ways to “vectorize”
a persistence diagram into an element of some vector
space. One of the most popular is to represent
each birth-death pair (b, d) by the Dirac measure
δ_(b,d) at (b, d), and to represent the whole diagram
P by the point measure Σ_{i=1}^{m} δ_(b_i, d_i), which is a measure
on the set T := {(x, y) : 0 ≤ x ≤ y} (Owada,
2022). By realizing a persistence diagram as a mea-
sure, it is possible to define the distance between two
persistence diagrams by means of a distance between
measures. One of the most popular choices is the
L_∞ Wasserstein distance between the measures, which
is called the bottleneck distance. Specifically, let P,
P′ be two persistence diagrams. Then the bottleneck
distance between P and P′ is defined as

W_∞(P, P′) := inf_{η: P→P′} sup_{t ∈ P} ||t − η(t)||_∞

where η ranges over bijections between P and P′.
A small perturbation of the input filtration leads to
a small perturbation of its persistence diagram in the
bottleneck distance; this stability property (Chazal
et al., 2016) of the bottleneck distance is key for our
work. In this paper, our objective is to generate a
differentially private persistence diagram for our dataset
and subsequently generate synthetic data based on its insights.
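The bottleneck distance is available in the same library, which lets us check the stability property numerically; the perturbation size below and the helper function are only illustrative.

import numpy as np
import gudhi

def rips_diagram(points, dim=1, max_edge=2.0):
    """Persistence pairs of the Vietoris-Rips filtration of `points` in dimension `dim`."""
    st = gudhi.RipsComplex(points=points, max_edge_length=max_edge).create_simplex_tree(max_dimension=dim + 1)
    st.persistence()
    return st.persistence_intervals_in_dimension(dim)

rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, size=80)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X_noisy = X + 0.01 * rng.normal(size=X.shape)       # small perturbation of the input

# By stability, a small perturbation of the points yields a small bottleneck distance.
d_B = gudhi.bottleneck_distance(rips_diagram(X), rips_diagram(X_noisy))
print(f"bottleneck distance: {d_B:.4f}")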
3 RELATED WORK
The field of publishing high-dimensional data has
garnered significant attention from researchers seek-
ing effective methods to balance the disclosure of
information with the imperative to preserve privacy.
Researchers have investigated the application of dif-
ferential privacy mechanisms for publishing high-
dimensional data. Dimensionality reduction is a piv-
otal step in managing high-dimensional datasets. A
powerful approach of dimensionality reduction is the
Bayesian network model proposed in (Zhang et al.,
2017), in which Zhang developed a differentially pri-
vate scheme PrivBayes for publishing high dimen-
sional data. PrivBayes first constructs a Bayesian
network to approximate the distribution of the orig-
inal dataset. It adds noise into each marginal of the
Bayesian network to guarantee differential privacy. It
constructs an approximate distribution of the original
dataset, and samples the tuples from the approximate
distribution to construct a synthetic dataset. The DP2-Pub
algorithm (Jiang et al., 2023) is another Bayesian-network-based
method; it proposes an invariant post-randomization (PRAM)
method that is applied to each attribute cluster. Another approach involves
analyzing attribute correlations and utilizing a dependency
graph to generate synthetic data that aligns with
the joint distribution (Chen et al., 2015). These solutions
share a drawback: they consume too much of the privacy
budget during network construction.
Computational topology and persistent homology
(PH) have started gaining traction in several areas of
data analysis. In (Alipourjeddi and Miri, 2023), PH is
employed to assess synthetic datasets, enhancing ac-
curacy. The integration of PH into graph analysis, as
indicated in (Alipourjeddi and Miri, 2022) contributes
to more precise synthetic datasets. Further, the fu-
sion of PH with neural networks (Moor et al., 2020)
facilitates dimensionality reduction. Additionally, in-
corporating differential privacy with PH enables dif-
ferentially private Topological Data Analysis (Kang
et al., 2023). These studies collectively concentrate
on the manifold hypothesis and preserving topologi-
cal structures of the input space.
In light of the above analysis, we present a
novel method that generates synthetic data from a differentially
private persistence diagram. To the best of
our knowledge, our work is the first attempt to publish
high-dimensional datasets privately using a topological
approach.
4 METHODOLOGY
The well-known Manifold Hypothesis (Cao et al.,
2020) states that high-dimensional data, such as
census data, are concentrated on a low-dimensional
manifold in a Euclidean space embedded in the high-dimensional
background space. Based on this hypothesis,
we focus on the following problem in this paper:
we have a high-dimensional dataset with r attributes,
and our strategy is to publish and release the
dataset to the public while satisfying differential privacy.
We use persistent homology to preserve
the homological structure of our dataset accurately.
First, we propose how to add differential privacy
to the persistence diagram of our dataset. In this
step, we need to consider how sensitive the method is
to outliers. Following the differential privacy principle,
the specific data of any one individual should not
have a significant effect on the outcome of the analysis
(Avella-Medina, 2021). We examine the sensitivity of the bottleneck
distance of persistence diagrams, which is the
most widely used representation of persistent homology.
Because the degree of outlier-robustness affects
the sensitivity of the bottleneck distance,
we use the L_1-DTM in order to achieve minimal sensitivity
(Kang et al., 2023). We then apply the exponential
mechanism, whose utility function is defined in terms
of the bottleneck distance of L_1-DTM persistence diagrams,
to produce differentially private persistence diagrams.
Second, we generate the synthetic dataset from the
private persistence diagram. In this step, we randomly
choose an initial hole from our persistence diagram
or persistence barcode and sample the attributes.
We terminate this process when all attributes have
been sampled.
4.1 Differentially Private Persistence
Diagram Construction
In the realm of differential privacy algorithms, it is
commonplace to quantify the extent to which the
value of a statistic changes when altering a single
point within a given dataset. This maximal potential
change in the statistic is commonly referred to as the
sensitivity of the statistic. It is necessary that the sen-
sitivity goes to 0 as the size of the data grows.
In our work, we use a persistence diagram con-
structed from a dataset D as a statistic that provides
an estimation of the homological structure underlying
the data. To measure distances between persistence
diagrams, we employ the bottleneck distance, defin-
ing a metric on the space of these diagrams. Conse-
quently, when applying a differential privacy mecha-
nism to persistence diagrams, our initial step involves
estimating the sensitivity of persistence diagrams in
terms of the bottleneck distance. Specifically, we
need to analyse the maximum potential magnitude
of the bottleneck distance over pairs (D, D′) of adjacent
datasets. The sensitivity of
the persistence diagrams of Vietoris-Rips complexes
does not converge to 0 even as the size of the data grows
to infinity (Kang et al., 2023). A weighted Vietoris-Rips
filtration can be useful to highlight topological
features against outliers and noise. In this regard,
Chazal et al. proposed the notion of distance to a
measure (DTM) to obtain outlier-robust persistence diagrams
(Chazal et al., 2017; Anai et al., 2020).
Definition 4.1. Given a probability measure P, for
0 < m < 1, the distance-to-measure (DTM) at resolution
m is defined by

δ(x) = δ_{P,m}(x) = √( (1/m) ∫_0^m (G_x^{-1}(u))^2 du )

where G_x(t) = P(||X − x|| ≤ t).
This definition is the L_2 type of DTM, for which the sensitivity
is bounded by O(n^{-1/2}). We focus on the L_1 type of
DTM, which yields the fastest decrease rate for the sensitivity,
bounded by O(n^{-1}) (Kang et al., 2023).
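A minimal numpy sketch of the empirical DTM at resolution m is given below: the mass parameter m translates into the k = ⌈mn⌉ nearest neighbours of a query point, averaged linearly for the L_1 type and in root mean square for the L_2 type. This is our own illustration of the definition, not the implementation of (Kang et al., 2023).

import numpy as np

def empirical_dtm(X, queries, m=0.05, p=1):
    """Empirical distance-to-measure of each query point w.r.t. the point cloud X.

    p=1 gives the L1-type DTM (mean of the k nearest-neighbour distances),
    p=2 gives the L2-type DTM (root mean square of those distances).
    """
    n = len(X)
    k = max(1, int(np.ceil(m * n)))                        # mass m -> k nearest neighbours
    dists = np.linalg.norm(queries[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, :k]                    # k smallest distances per query
    return knn.mean(axis=1) if p == 1 else np.sqrt((knn ** 2).mean(axis=1))

X = np.random.default_rng(2).normal(size=(500, 3))
dtm_values = empirical_dtm(X, X, m=0.05, p=1)              # L1-DTM evaluated at the data points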
To generate a differentially private persistence diagram,
we employ the exponential mechanism with a utility
function based on the bottleneck distance,

u_D(P_0, ..., P_l) = Σ_{q=0}^{l} u_D^q(P_q), where u_D^q(P) = −d_B(P, P_q(D)).

More specifically, we use the negative bottleneck distance
between private and non-private persistence diagrams
as the utility function.
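Putting the pieces together, the selection step could be sketched as follows; how the candidate diagrams are generated and the exact sensitivity constant of the L_1-DTM bottleneck distance are taken as given (they are derived in (Kang et al., 2023)), so both appear only as parameters of this hypothetical helper.

import numpy as np
import gudhi

def private_persistence_diagram(nonprivate_diag, candidate_diags,
                                sensitivity, epsilon, rng=None):
    """Pick one candidate diagram via the exponential mechanism, with utility
    equal to the negative bottleneck distance to the non-private diagram."""
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.array([-gudhi.bottleneck_distance(nonprivate_diag, cand)
                          for cand in candidate_diags])
    weights = np.exp(epsilon * (utilities - utilities.max()) / (2.0 * sensitivity))
    return candidate_diags[rng.choice(len(candidate_diags), p=weights / weights.sum())]

The candidate set could, for example, be a grid of perturbed copies of the non-private diagram in the birth-death plane; that design choice is independent of the mechanism itself.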
4.2 Synthetic Data by Differential
Privacy Persistence Diagram
Our approach focuses on leveraging persistence diagrams
to generate synthetic data that preserves the
essential topological features of the original high-dimensional
dataset. Analysing the differentially private
persistence diagrams provides valuable insights
into the homological characteristics inherent in the
data while preserving privacy.
In the first step, we interpret the
persistence points corresponding to connected components.
A clear trend emerges, showing the birth
and death of connected components across different
scales. Such persistence indicates the robustness of
specific structural elements in the original dataset. Focusing
on loops and voids, we identify regions of sustained
persistence, signifying the presence of consistent
topological patterns. Peaks and valleys in the
diagrams provide valuable insights into the lifetimes
of these features, aiding in the understanding of their
relevance and stability. This analysis helps us formulate
synthesis rules for generating synthetic data.
For our datasets, we prioritize components with long
persistence (we define a persistence threshold; see the
sketch below) and simulate the birth and death events of
topological features. We need to ensure that the distribution
of synthetic points is aligned with the topological
structure. These results allow us to recreate
the topological patterns in a low-dimensional space.
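As a small illustration of the thresholding step (the threshold values themselves are dataset-specific and are reported in Section 5), long-lived features could be filtered out of a diagram as follows:

import numpy as np

def long_lived_features(diagram, threshold):
    """Keep birth-death pairs whose lifetime (death - birth) exceeds `threshold`.

    `diagram` is an (m, 2) array of (birth, death) pairs; features with an
    infinite death have infinite lifetime and are therefore always kept.
    """
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = diagram[:, 1] - diagram[:, 0]
    return diagram[lifetimes > threshold]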
Second, we transform the synthetic points from the
low-dimensional space to match the dimensionality
of the original dataset. To generate the synthetic data, we
apply the topological autoencoders method (TopoAE)
(Moor et al., 2020), which evaluates the topological
loss in terms of the distance matrix restricted to the
index pairs of each persistence pairing, A^X[π^X].
Hence, L_t = L_{X→Z} + L_{Z→X} (Moor et al., 2020), where

L_{X→Z} = (1/2) ||A^X[π^X] − A^Z[π^X]||^2

and

L_{Z→X} = (1/2) ||A^Z[π^Z] − A^X[π^Z]||^2.
The key idea for both terms is to align and preserve
topologically relevant distances from both spaces.
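Written directly from the two formulas above, the loss terms can be sketched as follows; A_X and A_Z are the pairwise distance matrices of a mini-batch in input and latent space, and pi_X, pi_Z are the edge index pairs selected by the persistence pairing of each space (computing those pairings is part of the TopoAE method and is omitted in this sketch).

import numpy as np

def topo_loss_term(A_from, A_to, pairs):
    """0.5 * squared difference of the distances selected by one persistence pairing.

    `pairs` is an (m, 2) integer array of edge indices chosen in the `A_from` space.
    """
    i, j = pairs[:, 0], pairs[:, 1]
    return 0.5 * np.sum((A_from[i, j] - A_to[i, j]) ** 2)

def topological_loss(A_X, A_Z, pi_X, pi_Z):
    """L_t = L_{X->Z} + L_{Z->X}: align topologically relevant distances both ways."""
    return topo_loss_term(A_X, A_Z, pi_X) + topo_loss_term(A_Z, A_X, pi_Z)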
5 EXPERIMENTAL EVALUATION
In this part, we carry out extensive experiments
to demonstrate the performance of our mechanism
and compare it with two benchmark approaches,
PrivBayes (Zhang et al., 2017) and DP2-Pub (Jiang
et al., 2023). Note that our comparative study fo-
cuses on PrivBayes and DP2-Pub because these meth-
ods share a common approach of decomposing high-
dimensional data into a set of low-dimensional repre-
sentations. The evaluation is based on three real
high-dimensional datasets: the Adult dataset (Asuncion
and Newman, 2007), the Poker-Hand dataset (Asuncion
and Newman, 2007) and the Cleveland dataset
(Asuncion and Newman, 2007).
The Adult dataset contains personal information, such
as gender, salary, and education level, for 45222
records extracted from the 1994 US Census, where
each record has 15 attributes. Each record of the Poker-Hand
dataset is an example of a hand consisting of
five playing cards drawn from a standard deck of 52.
Each card is described using two attributes (suit and
rank), for a total of 11 predictive attributes; one
Class attribute describes the “Poker Hand”.
The Cleveland Heart Disease dataset records the presence of heart
disease in patients and contains 14 attributes.
Initially, we assess the persistent homology of our
datasets. Table 1 reports the count of homology classes
in various dimensions for these datasets, as read off their
persistence diagrams. Notably, our analysis focuses on
a subset of both the Adult and Poker-Hand datasets.
Table 1: Persistent barcodes in different dimensions.

Datasets      H_0    H_1   H_2
Adult         1000   27    0
Poker-Hand    800    863   564
Cleveland     303    199   34
To obtain the private persistence diagrams for each
dataset, we set the resolution of the L_1-DTM to m =
0.05 and the privacy budget to ε = 1, and the figures
show the 1-dimensional topological features. Figure 4,
Figure 5 and Figure 6 compare the L_1-DTM persistence
diagrams of the Adult, Poker-Hand and Cleveland
datasets with their differentially private diagrams, respectively.
After obtaining the differentially private persistence
diagram, we generate the synthetic data. We use
the Monte Carlo (MC) method to align the distributions.
Figure 4: Persistence diagrams (PDs) of the original and private Adult dataset: (a) original PD, (b) private PD.
Figure 5: Persistence diagrams (PDs) of the original and private Poker-Hand dataset: (a) original PD, (b) private PD.
Figure 6: Persistence diagrams (PDs) of the original and private Cleveland dataset: (a) original PD, (b) private PD.
Figure 7: Multiple SVM classifiers on Adult dataset.
The persistence threshold for the H_0 components varies
for each dataset: we set the threshold for Adult to
T_Adult = 50000, for Poker-Hand to T_PK = 5, and for
Cleveland to T_Cleveland = 25. We also use our differentially
private persistence diagram in the TopoAE algorithm
to generate the synthetic dataset.
For the second task, we evaluate the performance
of PrivBayes, DP2-Pub, our method DPTopo, and a
Non-Private baseline (no DP is considered) for SVM classification.
Figure 8: Multiple SVM classifiers on Cleveland dataset.
Figure 9: Multiple SVM classifiers on Poker-Hand dataset.
Figures 7, 8 and 9 show the
misclassification rates on the three datasets at different levels
of privacy protection, i.e., privacy budgets. The misclassification
rate on the original dataset (denoted
Non-Private) represents the best performance we can
achieve; a lower misclassification rate is a better result.
We observe that our method outperforms
the others. Moreover, a lower privacy
budget typically means stronger privacy protection
but may lead to a higher misclassification rate.
Another notable observation is that the misclassification
rate of the SVM decreases as the privacy
budget increases. This finding is in line with the theoretical
expectation that as the privacy budget expands,
privacy protection weakens, leading to an increase in
the availability of data and a consequent reduction in
misclassification rates.
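For completeness, a minimal scikit-learn sketch of the SVM comparison is shown below; the label column, preprocessing, and the DataFrames real_df and synth_df (the latter standing for DPTopo output at a given ε) are illustrative assumptions rather than the exact experimental setup.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def misclassification_rate(train_df, test_df, label="income"):
    """Train an SVM on one dataset and report its error rate on the other."""
    clf = SVC(kernel="rbf")
    clf.fit(train_df.drop(columns=[label]), train_df[label])
    return 1.0 - clf.score(test_df.drop(columns=[label]), test_df[label])

# real_df and synth_df are assumed to be numeric DataFrames with identical columns.
# train, test = train_test_split(real_df, test_size=0.2, random_state=0)
# baseline = misclassification_rate(train, test)       # Non-Private reference
# dp_topo  = misclassification_rate(synth_df, test)    # trained on the synthetic data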
6 CONCLUSIONS
In this paper, we presented a novel approach to
generating private synthetic data leveraging insights
from persistent homology. Our methodology success-
fully replicated the essential topological features ob-
served in the high-dimensional original dataset. By
applying the weighted Vietoris-Rips complex algo-
rithm, we computed persistent homology and ex-
tracted meaningful diagrams. We produced differ-
ential private persistence diagrams by applying ex-
ponential mechanism. We used a negative bottle-
neck distance between private and non-private per-
sistence diagram as a utility function. we used L
1
-
DTM to achieve minimal sensitivity. For generating
synthetic data based on differentially private persis-
tence diagram, we kept similar birth and death events
for persistent points with the same distribution. We
transformed the low-dimensional space with the synthetic
points to the high-dimensional space via the topological
autoencoders method. Our research highlights the efficacy
of persistent-homology-inspired synthesis in
producing differentially private synthetic data with significant
topological structure. As the field of Topological
Data Analysis (TDA) progresses, exploring alternative
representations for computing the persistence diagram,
such as the persistence landscape, becomes crucial.
Adopting an alternative privacy framework such as
zero-concentrated Differential Privacy has also been shown
to yield lower errors in the privacy mechanism.
ACKNOWLEDGEMENTS
This work was supported in parts by funds from the
Natural Sciences and Engineering Research Coun-
cil of Canada (NSERC) Discovery and the Canada
First Research Excellence Fund (CFREF) Bridging
Divides programs.
REFERENCES
Alipourjeddi, N. and Miri, A. (2022). Publishing private
high-dimensional datasets: A topological approach.
In the 2022 International Wireless Communications
and Mobile Computing (IWCMC), pages 1142–1147.
IEEE.
Alipourjeddi, N. and Miri, A. (2023). Evaluating gener-
ative adversarial networks: A topological approach.
In the 2023 International Conference on Computing,
Networking and Communications (ICNC), pages 202–
206. IEEE.
Anai, H., Chazal, F., Glisse, M., Ike, Y., Inakoshi, H., Tinar-
rage, R., and Umeda, Y. (2020). Dtm-based filtrations.
In Topological Data Analysis: The Abel Symposium
2018, pages 33–66. Springer.
Asuncion, A. and Newman, D. (2007). UCI machine learn-
ing repository.
Avella-Medina, M. (2021). Privacy-preserving parametric
inference: a case for robust statistics. Journal of the
American Statistical Association, 116(534):969–983.
Binjubeir, M., Ahmed, A. A., Ismail, M. A. B., Sadiq, A. S.,
and Khan, M. K. (2019). Comprehensive survey on
big data privacy protection. IEEE Access, 8:20067–
20079.
Cao, W., Yan, Z., He, Z., and He, Z. (2020). A comprehen-
sive survey on geometric deep learning. IEEE Access,
8:35929–35949.
Chazal, F., De Silva, V., Glisse, M., and Oudot, S. (2016).
The structure and stability of persistence modules.
Springer.
Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A.,
and Wasserman, L. (2017). Robust topo-
logical inference: Distance to a measure and kernel
distance. The Journal of Machine Learning Research,
18(1):5845–5884.
Chen, R., Xiao, Q., Zhang, Y., and Xu, J. (2015). Dif-
ferentially private high-dimensional data publication
via sampling-based inference. In the proceedings of
the 21th ACM SIGKDD international conference on
knowledge discovery and data mining, pages 129–
138. ACM.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2016).
Calibrating noise to sensitivity in private data analysis.
Journal of Privacy and Confidentiality, 7(3):17–51.
Dwork, C., Roth, A., et al. (2014). The algorithmic founda-
tions of differential privacy. Foundations and Trends®
in Theoretical Computer Science, 9(3–4):211–407.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
learning. MIT press.
Jiang, H., Yu, H., Cheng, X., Pei, J., Pless, R., and Yu,
J. (2023). DP2-Pub: Differentially private high-
dimensional data publication with invariant post ran-
domization. IEEE Transactions on Knowledge and
Data Engineering.
Kang, T., Kim, S., Sohn, J., and Awan, J. (2023). Differen-
tially private topological data analysis. arXiv preprint
arXiv:2305.03609.
Mahanan, W., Chaovalitwongse, W. A., and Natwichai, J.
(2021). Data privacy preservation algorithm with k-
anonymity. World Wide Web, 24:1551–1561.
McSherry, F. and Talwar, K. (2007). Mechanism design
via differential privacy. In the proceeding of 48th An-
nual IEEE Symposium on Foundations of Computer
Science (FOCS’07), pages 94–103. IEEE.
Moor, M., Horn, M., Rieck, B., and Borgwardt, K. (2020).
Topological autoencoders. In the proceeding of In-
ternational conference on machine learning, pages
7045–7054. PMLR.
Owada, T. (2022). Convergence of persistence diagram in
the sparse regime. The Annals of Applied Probability,
32(6):4706–4736.
Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D.,
and Xiao, X. (2017). PrivBayes: Private data re-
lease via bayesian networks. ACM Transactions on
Database Systems (TODS), 42(4):1–41.