Privacy-preserving Data Retrieval using Anonymous Query Authentication in Data Cloud Services

Mohanad Dawoud and D. Turgay Altilar

Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey

Keywords:

Privacy, Data Retrieval, Anonymous Authentication, Cloud Computing, Homomorphic Encryption.

Abstract:

Cloud computing has recently become an essential part of most IT strategies. However, security and privacy remain the two main concerns that limit the widespread use of cloud services, since the data is stored in unknown locations and retrieving the data (or part of it) may disclose sensitive data to unauthorized parties. Many techniques have been proposed to handle this problem, which is known as Privacy-Preserving Data Retrieval (PPDR). These techniques attempt to minimize the sensitive data that needs to be revealed. However, revealing any data to an unauthorized party violates security and privacy and may also decrease the efficiency of the data retrieval. In this paper, a set of requirements is defined that a PPDR system must satisfy to achieve a high level of security and privacy. Moreover, a technique that uses anonymous query authentication and multi-server settings is proposed. The technique provides efficient ranking-based data retrieval by using weighted Term Frequency-Inverse Document Frequency (TF-IDF) vectors. It also satisfies all of the defined security requirements, which no technique reported in the literature satisfies completely.

1 INTRODUCTION

Information Technology (IT) systems are now used in most technology fields. Consequently, the amount of data that needs to be stored, processed, and transferred through public, private, or hybrid network systems is increasing rapidly. Cloud computing services have recently become the preferred solution for dealing with this growth in data size. Many cloud systems offer different services with high potential; however, security and privacy are still the main concerns in such systems. The nature of the cloud requires the data to be transferred to unknown locations for storage or processing. Data stored in cloud systems can be secured with traditional symmetric or asymmetric encryption algorithms. However, any data mining or retrieval process needs the data, or part of it, to be revealed to the cloud system, which may conflict with conventional security and privacy concepts. Accordingly, many techniques have been proposed to enable the cloud to search the data without revealing it, or while revealing as little as security and privacy rules allow. Figure 1 shows the basic model of such a system. The system consists of three main parts: the data owner, the cloud server (cloud), and the client (user). The data owner has a large number of documents that need to be indexed, searched, and partially retrieved by the user, but lacks the processing and storage capabilities to do so. The cloud has the processing and storage capabilities needed to serve the system, but it is assumed to be “honest-but-curious”: it follows the designated protocol honestly, but is curious to infer useful information by analysing the data flow while running the protocol. The user needs to retrieve documents related to a queried document (or keywords). The data owner creates indexes for the documents and stores the indexes as well as the documents in the cloud in encrypted form. The data owner also creates trapdoors and sends them to the user. The user combines these trapdoors with the queried document to create a query and sends the query to the cloud, which in turn replies with the related documents.

[Figure 1: The simplest model of a privacy-preserving data retrieval system. Parties: data owner, cloud server (cloud), and client (user). Data outsourcing: the data owner sends trapdoors to the user and the encrypted index and documents to the cloud. Query communications: the user sends a query to the cloud, which returns the related documents.]

Dawoud, M. and Altilar, D. Privacy-preserving Data Retrieval using Anonymous Query Authentication in Data Cloud Services. In Proceedings of the 6th International Conference on Cloud Computing and Services Science (CLOSER 2016) - Volume 2, pages 171-180. ISBN: 978-989-758-182-3.

Suppose that features(γ) and index(γ) are the features and index of a document γ, respectively. Also, query(θ, τ) is the query generated from the document θ and the trapdoor τ. To keep a high level of security and privacy of the data as well as the queries, the following requirements need to be satisfied by any proposed protocol:

1. No Index Pattern: For any two documents α and β where features(α) = features(β), index(α) ≠ index(β). Otherwise, the cloud can relate these documents to each other.

2. No Query Pattern: For any two documents δ and θ where features(δ) = features(θ), query(δ, τ) ≠ query(θ, τ). Otherwise, an unauthorized party can relate the users who are sending similar queries.

3. No Documents Pattern: For any unauthorized party, it is infeasible to know the retrieved documents or the rank of the documents for any query query(δ, τ). Otherwise, this unauthorized party can relate the retrieved documents to each other, or relate the query query(δ, τ) to the retrieved documents and dissociate it from the others.

4. No Index Frequency: For any document α, there is no frequency pattern in the index index(α) that can be used to infer any information about the data in that document.

5. No Query Frequency: For any document δ and trapdoor τ, there is no frequency pattern in the query query(δ, τ) that can be used to infer any information about the data in that document.

6. No Replay Attack: A valid query cannot be reused later by any unauthorized party for any purpose.

7. Query Privacy: Neither the cloud nor any unauthorized party is allowed to know, or is able to infer, anything about the contents of the user's queries. Moreover, only authorized users can make queries.

8. Index Privacy: For any unauthorized party, it is infeasible to know or to infer anything about the contents of the index. Additionally, even if the cloud can add fake indexes (even random ones), it cannot acquire any useful information.

9. Documents Privacy: For any unauthorized party, it is infeasible to know or to infer anything about the contents of the encrypted documents.

These 9 security requirements, together with high efficiency of data retrieval, are called the 9+1 requirements in the rest of this paper.

One of the first techniques for privacy-preserving search on encrypted data was proposed by Song et al. (Song et al., 2000). Later, various techniques were proposed, such as (Boneh et al., 2004; Liu et al., 2009; Li et al., 2011; ChinnaSamy and Sujatha, 2012; Kuzu et al., 2012; Tseng et al., 2012). These techniques are examples of keyword-based search techniques. However, this kind of search misses many similarity details, which decreases the efficiency of the data retrieval. The techniques in (Li et al., 2010; Chuah and Hu, 2011; Wang et al., 2012b) provide fuzzy keyword search. They share the weaknesses of the keyword-based search techniques, but are distinguished by their ability to tolerate a number of spelling mistakes in the queries. Other techniques, such as (Wang et al., 2010; Orencik and Savas, 2014; Wang et al., 2012a; Chen et al., 2014; Sun et al., 2013; Cao et al., 2011), provide ranked results. However, in order to rank the results, these techniques expose some key data to unauthorized parties in the system. In a previous work (Dawoud and Altilar, 2014), we used homomorphic encryption of normalized Term Frequency-Inverse Document Frequency (TF-IDF) values (Rajaraman and Ullman, 2011) to provide a multi-keyword ranked search. The technique improved on the technique of Gopal and Singh (Gopal and Singh, 2012) by hiding any frequency pattern in the index while keeping the efficiency of data retrieval high. Both techniques use the cosine similarity of TF-IDF vectors. The previously proposed technique satisfies the 1st, 4th, 5th, and 7th to 9th requirements along with the efficiency requirement, whereas it fails to satisfy the 2nd, 3rd, and 6th requirements.

[Figure 2: Homomorphic property of homomorphic encryption.]

Homomorphic encryption is a form of encryption that allows computations to be applied to encrypted data. The result is the ciphertext of the value obtained by applying the same (or another) operation to the unencrypted values. Figure 2 shows the homomorphic property of homomorphic encryption. Suppose that x and y are two plaintexts, and that a and b are the ciphertexts resulting from encrypting x and y, respectively, using the same homomorphic key K. Let z be the result of applying one operation to x and y, and c the result of applying the corresponding operation to a and b. Each of the two operations can be addition, subtraction, multiplication, or division. The homomorphic property ensures that decrypting c using the key yields z. There are two categories of homomorphic encryption algorithms: partially and fully homomorphic algorithms. Partially homomorphic algorithms support only multiplication or only addition, while fully homomorphic algorithms support both multiplication and addition (Xiang et al., 2012).
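As a minimal illustration of the homomorphic property, the sketch below uses textbook (unpadded) RSA, which is multiplicatively homomorphic and therefore an example of a partially homomorphic scheme. This choice is an assumption made purely for illustration: the scheme, the toy parameters, and the names `encrypt` and `decrypt` are not taken from the proposed technique, and unpadded RSA with such small primes is not secure.

```python
# Toy textbook RSA (unpadded, insecure): multiplying two ciphertexts
# yields a ciphertext of the product of the two plaintexts.
P, Q = 61, 53          # small demonstration primes
N = P * Q              # modulus (3233)
E = 17                 # public exponent
D = 2753               # private exponent, (E * D) % ((P - 1) * (Q - 1)) == 1


def encrypt(x: int) -> int:
    """Encrypt a plaintext integer 0 <= x < N."""
    return pow(x, E, N)


def decrypt(c: int) -> int:
    """Decrypt a ciphertext integer."""
    return pow(c, D, N)


x, y = 7, 3
a, b = encrypt(x), encrypt(y)   # ciphertexts of x and y
z = x * y                       # operation on the plaintexts
c = (a * b) % N                 # corresponding operation on the ciphertexts
assert decrypt(c) == z          # decrypting c yields z
```

The property holds here only while the product of the plaintexts stays below the modulus N.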

The rest of this paper is organized as follows: Section 2 discusses the problem statement and the contribution of this paper. Section 3 explains the proposed technique. Section 4 analyses the proposed technique against the security requirements defined in this section. Section 5 concludes the paper and discusses future work.

2 PROBLEM STATEMENT

To the best of our knowledge, no technique in the current state of the art satisfies all of the 9+1 requirements efficiently. Failing to satisfy any one of the 9 security requirements poses a threat to the privacy of the data, which is non-negotiable in most current applications. On the other hand, high efficiency of data retrieval is needed to achieve the main aim of the data retrieval system. Satisfying some of the security requirements at the cost of retrieval efficiency decreases the reliability of the system.

In a previous work (Dawoud and Altilar, 2014), we proposed a two-round technique that uses the model shown in Figure 1. It uses the cosine similarity between the TF-IDF vectors to find a ranked similarity vector of the documents according to a query. This retrieval system was shown to be more efficient than binary keyword-based search systems (Dawoud and Altilar, 2014; Salton and Buckley, 1988a). Although the technique satisfies requirements 1, 4-5, and 7-9 efficiently, it is unable to hide the query and document patterns. It is also vulnerable to replay attacks, as shown below:

1. Query Pattern: The normalization and encryp-

tion of the values of a Term Frequency (TF) vector

are used to hide any frequency pattern in that vec-

tor. However, any two queries with equal TF vec-

tors generate the same normalized values but with

different distributions. This is because the nor-

malized values are distributed randomly in place

of the original ones. Therefore, this pattern can be

detected by checking if any two or more queries

have the same values even in different distribu-

tions.

2. Documents Pattern: The documents are re-

quested in a clear format from the cloud in the

second round. Therefore, the cloud can relate

the documents requested by a user in the second

round to the query sent by the same user in the ﬁrst

round. It can also relate the requested documents

to each other.

3. Replay Attacks: Any valid query can be reused

by any party. Although unauthorized parties can-

not compromise the contents of the query, sim-

ilarity vector, and retrieved documents, they are

still able to use these valid queries in Denial-of-

Service (DoS) attacks.

Therefore, our contribution in this paper is a technique that satisfies all of the 9+1 requirements. The technique builds on the achievements of the technique proposed in (Dawoud and Altilar, 2014) and extends it to overcome its deficiencies, yielding a complete ranked multi-keyword secure data retrieval system over a cloud system.

3 THE TECHNIQUE

Besides the data owner, the cloud (which is called the searching server in the proposed technique), and the user, which are similar to the ones shown in Figure 1, the model of the proposed technique includes an authentication server, a ranking server, a private server, and L document servers. The searching server, authentication server, ranking server, private server, and document servers are assumed to be “honest-but-curious” and not to collaborate with each other, which is consistent with previous works. The same assumptions regarding the data owner, cloud, and user reported in Section 1 for the model in Figure 1 are used here. The required authorizations and communication security between the system parties are assumed to be appropriately handled. Although hiding the communication paths may be essential in such systems, it is considered out of the scope of this paper.

Figure 3 shows the model of the proposed tech-

nique. It can be divided into six processes: data

outsourcing, query generation, query authentication,

similarity vector calculation, similarity vector rank-

ing, and documents retrieval. The rest of this Section

discusses the implementation of these processes. The

security of the proposed technique will be discussed

in Section 4.

3.1 Data Outsourcing

The data owner generates the TF-IDF table of the set of documents D (denoted TFIDF in the rest of this paper). The searchable index S is generated by normalizing the TFIDF values and encrypting them with a homomorphic encryption algorithm and a homomorphic key, as described in (Dawoud and Altilar, 2014). Moreover, the documents (D) and their IDs (ID) are encrypted separately with a symmetric or an asymmetric encryption algorithm and a separate document encryption key to generate E[D] and E[ID], respectively. Thereafter, S is sent to the searching server, the homomorphic key and the document encryption key are sent to the user, the homomorphic key is sent to the ranking server, E[ID] is sent to the private server, and E[D] and E[ID] are sent to the document servers, as shown in Figure 3.
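The first step of data outsourcing, building the TF-IDF table, can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the common weighting tf × log(N/df) and whitespace tokenization, and the function name `tfidf_table` is hypothetical. The normalization and homomorphic encryption steps of (Dawoud and Altilar, 2014) are not reproduced here.

```python
import math
from collections import Counter


def tfidf_table(documents):
    """Return (vocabulary, table) where table[n][m] is the TF-IDF weight
    of vocabulary term m in document n, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]  # assumed tokenizer
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    table = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts in this document
        table.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, table


vocabulary, table = tfidf_table(["cloud privacy cloud", "privacy query"])
```

Each row of `table` plays the role of one document's plaintext index vector before normalization and encryption.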

3.2 Query Generation

The user calculates the TF vector of the query document (QTF). The values of QTF are normalized as described in (Dawoud and Altilar, 2014) to generate Γ(QTF). An extra step before encrypting these normalized values is to multiply each of them by ρ, a random number greater than zero that is generated anew for every query, to generate the scaled vector Γ_ρ(QTF). Therefore, if Γ(QTF) = [f_1, f_2, . . . , f_M], then Γ_ρ(QTF) = [(f_1 × ρ), (f_2 × ρ), . . . , (f_M × ρ)]. Multiplication by ρ is used to hide any query pattern, as will be shown in Section 4. Finally, the values of Γ_ρ(QTF) are encrypted with the same homomorphic encryption algorithm used by the data owner and the homomorphic key to generate the query vector (Q).

The user sends the query, which consists of Q, the number of documents to be retrieved (r), and the authentication key (k) (to be discussed in Sub-Section 3.3), to the searching server, as shown in Step (1) in Figure 3.

3.3 Query Authentication

Using anonymous query authentication in the proposed technique can provide many properties, such as preventing replay attacks, allowing only authorized users to create valid queries, service pricing and billing, and privilege granting. However, only the added security properties are discussed in this paper; the others are considered out of its scope.

In a previous stage of this work, an anonymous authentication technique for limited-resource units was proposed. The system consists of three parties: authentication servers, readers (or foreign servers), and users. The authentication servers hold the data needed for authentication. The readers are responsible for transferring data between the users and the servers. The users are limited-resource units. Each user has a key K composed of M unique sub-keys grouped into I groups. A combination of sub-keys is a set composed of exactly one sub-key from each group of a key K; the summation of any combination must be unique in the whole system. In each authentication process, the user randomly selects m different combinations of sub-keys without duplication. The summations of these m combinations are calculated and sent to the server as an authentication key. The server uses key data to reverse these summations and find the exact combinations used to generate them. The server then searches the database for the user who has all the sub-keys used in the combinations. If such a user is found, the server identifies that user; otherwise, the request is considered fake and is ignored.

Suppose that W is the number of users, and G = {g_1, g_2, . . . , g_I} is the set of the sizes of the groups. To find sub-keys that satisfy the above assumptions, the server performs the following steps:

1. Find the maximum value of the interval of the sub-keys [Min, Max] as follows:

$$Max = \left( \prod_{i=1}^{I} \left( (W \ast g_i) + 1 \right) \right) - 1 \tag{1}$$

2. Use a recursive algorithm to divide each interval into non-overlapping sub-intervals. For example, suppose that M = 9, W = 3, I = 3, and G = {3, 3, 3}; then Max = (10 ∗ 10 ∗ 10) − 1 = 999, and the complete interval [Min, 999] is divided into non-overlapping sub-intervals as follows:

{100, 200, 300, 400, 500, 600, 700, 800, 900}
{10, 20, 30, 40, 50, 60, 70, 80, 90}
{1, 2, 3, 4, 5, 6, 7, 8, 9}

Note that the Min value is 111 and that the summation of any combination composed of one value from each sub-interval gives a unique value.

3. Encrypt the values of the sub-intervals individually using homomorphic encryption and a homomorphic key, to restrict the reversibility of the summations to the parties that hold this homomorphic key.

4. Assign sub-keys to the users randomly, without duplication, from each sub-interval to the related group.

[Figure 3: The architecture of the proposed technique. Parties: data owner, client (user), authentication server, searching server, ranking server, private server, and data servers 1 to L. Data outsourcing: the data owner distributes the keys, E[ID], and E[D]. Query communications: Steps (1) to (8), from sending the query (Q, r, k) to returning the encrypted documents to the user.]
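The interval construction in the steps above can be sketched as follows. This is a minimal sketch that generalizes the worked example (M = 9, W = 3, I = 3, G = {3, 3, 3}) by treating each group as one digit of a mixed-radix number; this generalization is an assumption, since the text only states that a recursive algorithm is used. The function names are hypothetical, and the homomorphic encryption of the sub-keys (Step 3) is omitted.

```python
from math import prod


def sub_key_intervals(num_users, group_sizes):
    """Return (Max, sub-intervals) for W users and group sizes G.

    Group i gets W * g_i candidate values, each a non-zero multiple of
    the place value of that group, so sums of one value per group are
    unique (like digits of a mixed-radix number).
    """
    bases = [num_users * g + 1 for g in group_sizes]
    max_value = prod(bases) - 1                  # Equation 1
    intervals = []
    place = prod(bases)
    for base in bases:
        place //= base                           # place value of this group
        intervals.append([digit * place for digit in range(1, base)])
    return max_value, intervals


def reverse_summation(total, num_users, group_sizes):
    """Recover the combination of sub-keys that produced a summation."""
    bases = [num_users * g + 1 for g in group_sizes]
    place = prod(bases)
    combination = []
    for base in bases:
        place //= base
        digit, total = divmod(total, place)
        combination.append(digit * place)
    return combination


max_value, intervals = sub_key_intervals(3, [3, 3, 3])
minimum = sum(interval[0] for interval in intervals)   # smallest possible sum
```

With the example parameters, `max_value` is 999, `minimum` is 111, and `reverse_summation(321, 3, [3, 3, 3])` recovers the combination [300, 20, 1].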

The sub-keys are used to generate a different authentication key, k, for each query. Therefore, only authorized users can generate valid authentication keys that are identifiable and acceptable by the authentication server. Moreover, the keys change in each authentication process to prevent any unauthorized party (including the readers) from identifying or tracking a user. Although the technique was originally designed to authenticate users with limited resources, it is well suited to the proposed technique because of the following properties:

1. The technique was shown secure against key

forgery and key exposure attacks.

2. The technique was shown secure against replay

attacks.

3. Generation of an authentication key in the user is

very simple (addition of integers).

4. Identiﬁcation of a user in the server is done by a

simple search in the user list without revealing the

identity of the user to the reader. In other tech-

niques, the identity of the user is revealed to the

reader, or, the server makes a brute-force search

to identify a user in each authentication process.

5. Only the authentication server is able to identify

the user.

6. No need for key synchronization between the user

and the server.

7. After key deployment, the user does not need any

data to start an authentication process.

8. The technique is suitable for users with different

capabilities starting from simple sensors up to su-

percomputers.

In the proposed technique, the user and the authentication server play the same roles as above, while the searching server plays the role of the reader. The authentication server generates the user's key K and sends it to the user. In each query, the user generates an authentication key k (which is different for each query) and sends it to the searching server as part of the query. The searching server forwards only k to the authentication server. The authentication server checks the validity of k. If k is valid, the authentication server sends Accept Msg to the searching server; otherwise, it sends Reject Msg, as shown in Steps (2) and (3), respectively, in Figure 3. The searching server proceeds only if the message received from the authentication server is Accept Msg; otherwise, the query is ignored.
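The exchange between the user and the authentication server can be sketched as follows. This is a minimal sketch under several assumptions: sub-keys are kept in plaintext (the homomorphic encryption of the sub-keys is omitted), the place values come from the three-group example above, a reused authentication key is detected with a simple set of already accepted keys, and all names (`make_auth_key`, `AuthServer`, `PLACE_VALUES`) are hypothetical.

```python
import random
from itertools import product

PLACE_VALUES = [100, 10, 1]   # one place value per sub-key group (example values)


def make_auth_key(user_key, m, rng):
    """User side: sum m distinct combinations, one sub-key per group."""
    combinations = list(product(*user_key))
    return [sum(combo) for combo in rng.sample(combinations, m)]


class AuthServer:
    def __init__(self):
        self.users = {}          # user id -> list of sub-key sets, one per group
        self.used_keys = set()   # authentication keys that were already accepted

    def register(self, user_id, user_key):
        self.users[user_id] = [set(group) for group in user_key]

    def _owns(self, groups, total):
        """Reverse one summation and check each sub-key belongs to the user."""
        for place, group in zip(PLACE_VALUES, groups):
            digit, total = divmod(total, place)
            if digit * place not in group:
                return False
        return total == 0

    def check(self, auth_key):
        """Return True (accept) or False (reject); the identity is not returned."""
        token = tuple(sorted(auth_key))
        if not auth_key or token in self.used_keys:
            return False                         # empty or replayed key
        for groups in self.users.values():
            if all(self._owns(groups, s) for s in auth_key):
                self.used_keys.add(token)
                return True
        return False


server = AuthServer()
key_a = [[100, 200, 300], [10, 20, 30], [1, 2, 3]]
key_b = [[400, 500, 600], [40, 50, 60], [4, 5, 6]]
server.register("user-A", key_a)
server.register("user-B", key_b)

k = make_auth_key(key_a, m=3, rng=random.Random(0))
first_attempt = server.check(k)    # valid key from an authorized user
second_attempt = server.check(k)   # the same key sent again
forged_attempt = server.check([114])  # mixes sub-keys of two users
```

The searching server only ever sees `k` and the accept or reject answer, which mirrors the reader role described above.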

3.4 Similarity Vector Calculation

In order to find the similarity vector between the query and the documents, the searching server uses the cosine similarity measure, which calculates the cosine of the angle between two vectors (Salton and Buckley, 1988b). Suppose that cs_n is the cosine similarity between the query Q = [q_1, q_2, . . . , q_M] and the index of the nth document S_n = [s_{n,1}, s_{n,2}, . . . , s_{n,M}]; then cs_n is given by Equation 2.

$$cs_n = \frac{\sum_{m=1}^{M} (q_m \times s_{n,m})}{\sqrt{\sum_{m=1}^{M} (q_m)^2}\ \sqrt{\sum_{m=1}^{M} (s_{n,m})^2}} \tag{2}$$

The cosine similarity vector between Q and S (CS) is the set of all similarity values, as shown in Equation 3.

$$CS = [\, cs_n \mid 1 \le n \le N \,] \tag{3}$$

The multiplication of the normalized QTF values by ρ in query generation (Sub-Section 3.2) has no effect on the final similarity value. Equation 2 can be rewritten as Equation 4 to include this multiplication.

$$cs_n = \frac{\sum_{m=1}^{M} (\rho \times q_m \times s_{n,m})}{\sqrt{\sum_{m=1}^{M} (\rho \times q_m)^2}\ \sqrt{\sum_{m=1}^{M} (s_{n,m})^2}} \tag{4}$$

Since ρ > 0, Equation 4 simplifies to Equation 2 in a single step, as shown in Equation 5.

$$cs_n = \frac{\rho \times \sum_{m=1}^{M} (q_m \times s_{n,m})}{\rho \times \sqrt{\sum_{m=1}^{M} (q_m)^2}\ \sqrt{\sum_{m=1}^{M} (s_{n,m})^2}} \tag{5}$$
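The invariance of the cosine similarity under multiplication by ρ can be checked numerically. The sketch below works on plaintext values only, which is an assumption made for illustration (in the technique the values are homomorphically encrypted); the function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(q, s):
    """Cosine similarity between a query vector q and an index vector s."""
    dot = sum(qm * sm for qm, sm in zip(q, s))
    norm_q = math.sqrt(sum(qm * qm for qm in q))
    norm_s = math.sqrt(sum(sm * sm for sm in s))
    return dot / (norm_q * norm_s)


def scale_query(q, rho):
    """Multiply every normalized query value by the per-query factor rho > 0."""
    if rho <= 0:
        raise ValueError("rho must be greater than zero")
    return [qm * rho for qm in q]


query = [0.2, 0.5, 0.1]      # sample normalized query values
index = [0.3, 0.4, 0.9]      # sample index values of one document
plain = cosine_similarity(query, index)
scaled = cosine_similarity(scale_query(query, 7.25), index)
assert math.isclose(plain, scaled)
```

Two queries built from the same document with different values of ρ therefore carry different values while producing the same ranking.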

The searching server creates a table of N × 3 elements, where N is the number of documents. The first column (col_1) consists of the numbers between 1 and N distributed randomly over the rows. The second column (col_2) consists of the encrypted IDs (E[ID]). The third column (col_3) consists of the cosine similarity values (CS) in the same order as E[ID]. The searching server orders the table [col_1, col_3] according to col_1. Finally, it sends the table [col_1, col_3] and r to the ranking server, while the table [col_1, col_2] is sent to the private server, as shown in Steps (4a) and (4b) in Figure 3.

3.5 Similarity Vector Ranking

The ranking server uses the homomorphic key to decrypt the values of col_3 received from the searching server. The table [col_1, col_3] is then sorted in descending order according to col_3. The highest r rows of the ordered [col_1, col_3] table (where r is the number of requested documents in the query) are stored in a new table called Rank(r). The ranking server sends Rank(r) to the private server, as shown in Step (5) in Figure 3. The private server matches the values of the col_1 column of Rank(r) to the values of the col_1 column of the [col_1, col_2] table received from the searching server and retrieves the encrypted IDs of the documents from column col_2. The matched encrypted IDs (ε) are the encrypted IDs of the documents selected to be retrieved for the query Q.
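The split of information between the searching, ranking, and private servers can be sketched as follows. This is a minimal sketch: the similarity values are plaintext numbers and `decrypt` defaults to the identity function, standing in for homomorphic decryption, and all function names are hypothetical.

```python
import random


def searching_server_split(encrypted_ids, similarities, rng):
    """Build the N x 3 table and split it into the two tables that are sent out."""
    temp_ids = list(range(1, len(encrypted_ids) + 1))
    rng.shuffle(temp_ids)                              # col_1: random temporary IDs
    rows = sorted(zip(temp_ids, encrypted_ids, similarities))  # order by col_1
    for_ranking = [(t, sim) for t, _, sim in rows]     # [col_1, col_3]
    for_private = [(t, eid) for t, eid, _ in rows]     # [col_1, col_2]
    return for_ranking, for_private


def ranking_server(for_ranking, r, decrypt=lambda value: value):
    """Decrypt col_3, sort in descending order, and keep the top r rows: Rank(r)."""
    decrypted = [(t, decrypt(sim)) for t, sim in for_ranking]
    decrypted.sort(key=lambda row: row[1], reverse=True)
    return decrypted[:r]


def private_server(rank_r, for_private):
    """Match temporary IDs in Rank(r) to the encrypted document IDs."""
    lookup = dict(for_private)
    return [lookup[t] for t, _ in rank_r]


encrypted_ids = ["E[id_a]", "E[id_b]", "E[id_c]", "E[id_d]"]
similarities = [0.1, 0.9, 0.4, 0.7]
to_ranking, to_private = searching_server_split(
    encrypted_ids, similarities, random.Random(3))
selected = private_server(ranking_server(to_ranking, r=2), to_private)
```

In this sketch the ranking server never sees an encrypted ID and the private server never needs the homomorphic key, which is the separation the technique relies on.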

3.6 Documents Retrieval

Considering that there are L data servers, assume that the private server has received υ sets of documents (ε_1, ε_2, . . . , ε_υ) to be retrieved for υ different queries (Q_1, Q_2, . . . , Q_υ), respectively. The private server randomly selects κ × Σ_{i=1}^{υ} |ε_i| documents from E[ID], where |ε_i| is the number of documents in ε_i. These random documents, together with the υ sets of documents, are inserted randomly into a queue. For each document in the queue, the private server selects a data server randomly to retrieve that document, as shown in Steps (6a), (6b), and (6c) in Figure 3. Once a document is retrieved from a data server, it is forwarded to the user who sent the related query, who decrypts it using the document encryption key, as shown in Steps (7a), (7b), (7c), and (8) in Figure 3; a retrieved document that is not related to any query (one of the random documents) is ignored. The value of κ can be changed according to υ, as will be discussed in Section 4.
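The construction of the retrieval queue can be sketched as follows. This is a minimal sketch under assumptions: the random (dummy) documents are drawn with replacement from E[ID], data servers are identified by the integers 0 to L − 1, and the function name `build_retrieval_queue` is hypothetical.

```python
import random


def build_retrieval_queue(requested_sets, all_encrypted_ids, kappa, num_servers, rng):
    """Mix requested documents with kappa times as many random ones.

    Returns a list of (encrypted_id, query_index_or_None, server_index).
    Entries with query index None are random documents whose replies are ignored.
    """
    real = [(doc, query) for query, docs in enumerate(requested_sets) for doc in docs]
    dummies = [(rng.choice(all_encrypted_ids), None)
               for _ in range(kappa * len(real))]
    queue = real + dummies
    rng.shuffle(queue)                                   # random insertion order
    # Each document is requested from a randomly chosen data server.
    return [(doc, query, rng.randrange(num_servers)) for doc, query in queue]


queue = build_retrieval_queue(
    requested_sets=[["e1", "e2"], ["e3"]],
    all_encrypted_ids=["e1", "e2", "e3", "e4", "e5"],
    kappa=2,
    num_servers=3,
    rng=random.Random(1),
)
```

The queue length is (κ + 1) times the number of requested documents, which is the quantity that appears in the analysis of the documents pattern.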

4 ANALYSIS OF THE

TECHNIQUE

This section discusses how the proposed technique achieves the 9 security requirements listed in Section 1. Note that the proposed technique relies on a previously proposed technique (Dawoud and Altilar, 2014). The tenth requirement (high efficiency of data retrieval) was discussed in Section 2.

1. No Index Pattern: Whatever the values of TFIDF are, the normalization described in (Dawoud and Altilar, 2014) guarantees that all the values of Γ(TFIDF), and therefore all the values of S, are unique. Uniqueness of the S values means that for any two documents α and β where features(α) = features(β), index(α) ≠ index(β), which achieves the first requirement.

2. No Query Pattern: For each set of equal values in QTF, normalization generates a new set of unique values and distributes them randomly in place of the original ones. Therefore, the similarity values of different queries generated from a single document are slightly different. As these values are encrypted, even small differences in these similarity values are not detectable by the searching server. To see this, suppose that h_1, h_2, . . . is the histogram of the QTF values of a document d, meaning that the ith distinct value in QTF appears h_i times. Then the number of different queries that can be generated from the document d is MQ, which is calculated as shown in Equation 6.

$$MQ = \prod_{i} h_i! \quad \text{for } h_i > 1 \tag{6}$$

Table 1 shows the average MQ values for three different datasets available in the literature (Hammouda, 2013; Lang, 1995; Volkan, 2012). It can be seen that the probability of generating two identical queries for the same document is less than 10^{-20000}, which is negligible. However, these queries can still be recognized, since they have the same values in different distributions. For this reason, the multiplication by ρ mentioned in Sub-Section 3.2 is used. Multiplying the values of Γ(QTF) by ρ hides the distribution of these values without affecting the final similarity results, as shown in Sub-Section 3.4. Therefore, the random distribution of the normalized values, together with hiding this distribution, achieves the second requirement. Moreover, using anonymous authentication makes the searching server unable to identify the user, which is another advantage of using it over other authentication techniques.

3. No Documents Pattern: In (Dawoud and Altilar, 2014), the similarity vector is sent to the user, who decrypts it and selects the documents with the highest similarity values. This may cause extra overhead for the user; it also reveals the IDs of the documents related to the query. In the proposed technique, the ranking server is used to decrypt and order the values, which means that it has the homomorphic key. To prevent the ranking server from generating queries, query authentication is used as explained in Sub-Section 3.3. As the ranking server does not have the user's authentication key K, it is unable to generate queries that are acceptable by the authentication server.

The searching server calculates the similarity values of the encrypted indexes without revealing their actual values. However, sending the encrypted similarity vector together with the related document IDs to the ranking server would enable it to reveal the similarity between the query and the exact documents. Therefore, the searching server sends temporary random IDs (col_1), instead of the original IDs, to the ranking server. In this way, the ranking server is simply decrypting and ordering values labelled with random numbers before sending them to the private server. On the other hand, the private server uses the information received from both the searching server and the ranking server to find the related documents.

Referring to Sub-Section 3.6, to retrieve υ sets of documents (ε_1, ε_2, . . . , ε_υ) coming from υ different queries, the private server randomly selects κ × Σ_{i=1}^{υ} |ε_i| documents from E[ID]. This random set of documents, together with the υ sets of documents, is inserted randomly into a queue. For each document in the queue, the private server randomly selects one of the L data servers to retrieve that document. Assume that the queue is static, which means that there is no online insertion of documents into the queue. If max is the maximum of |ε_1|, . . . , |ε_υ|, then, for a data server l, the probability that two requested documents are related to the same query is ≤ P, where:

$$P = \frac{\upsilon \, (max^2 - max)}{L^2 \left[ (\kappa + 1) \sum_{i=1}^{\upsilon} |\varepsilon_i| \right] \left[ (\kappa + 1) \sum_{i=1}^{\upsilon} |\varepsilon_i| - 1 \right]} \tag{7}$$

Increasing L decreases P; however, finding a large number of data servers that are “honest-but-curious” and do not collaborate with each other is not easy. Therefore, for specific values of L and υ, increasing κ decreases P. The private server can dynamically change κ to keep the value of P negligible.

4. No Index Frequency: Normalization of T FIDF

values removes any frequency pattern (Dawoud

and Altilar, 2014).

5. No Query Frequency: Normalization of QT F

Privacy-preserving Data Retrieval using Anonymous Query Authentication in Data Cloud Services

177

Table 1: Average MQ values of different datasets.

Dataset Number of Documents Number of Unique Keywords Average MQ

webdata (Hammouda, 2013) 314 15756 2033 ×10

59270

mini newsgroups (Lang, 1995) 400 16360 1115 ×10

61691

classic (Volkan, 2012) 800 6291 3143 ×10

21158

Table 2: Comparison between different privacy-preserving data retrieval techniques.

Requirments

(Boneh et al., 2004)

(Liu et al., 2009)

(Li et al., 2011)

(ChinnaSamy and Sujatha, 2012)

(Kuzu et al., 2012)

(Tseng et al., 2012)

(Li et al., 2010)

(Wang et al., 2012b)

(Chuah and Hu, 2011)

(Wang et al., 2010)

(Cao et al., 2011)

(Wang et al., 2012a)

(Sun et al., 2013)

(Orencik and Savas, 2014)

(Chen et al., 2014)

(Dawoud and Altilar, 2014)

The proposed Technique

1- No Index Pattern · · · ·

√

· · · ·

√

· · · · ·

√ √

2- No Query Pattern · · · · · · · · ·

√

· · ·

√

· ·

√

3- No Documents Pattern · · · · · · · · · · · · · · · ·

√

4- No Index Frequency · · · ·

√

· · · ·

√ √

· ·

√ √ √ √

5- No Query Frequency · · · ·

√ √

· · ·

√ √

· ·

√ √ √ √

6- No Replay Attack · · · · · · · · · · · · ·

√

· ·

√

7- Query Privacy · ·

√

√ √ √ √ √ √ √ √

√ √ √ √

8- Index Privacy

√ √ √

√ √ √ √ √

√

· ·

√ √ √ √

9- Documents Privacy

√ √ √ √ √ √ √ √ √ √ √ √

√ √ √ √

10- Ranked · · · ·

√

· · · ·

√ √ √ √ √ √ √ √

√

= Achieved · = Not achieved.

values removes any frequency pattern (Dawoud

and Altilar, 2014).

6. No Replay Attack: Each single authentication

key k

is used only once until the authentication

keys are updated. Therefore, if a valid query is

resent, it will be rejected by the authentication

server and ignored by the searching server.

7. Query Privacy: The query values are encrypted using K. Although the ranking server has the key K, it is unable to create a query because it does not have an authentication key. Moreover, it is unable to reveal the query since it does not have the encrypted query. On the other hand, the searching server has the encrypted query but does not have the key K to decrypt it, which achieves the seventh requirement.

8. Index Privacy: The index values are encrypted using K. Although the searching server has the encrypted index, it is unable to disclose its values since it does not have the key K. Although the ranking server has the key K, it is also unable to disclose the index contents since it receives only the similarity vector in a random order. Moreover, if the cloud adds fake indexes, it will be unable to get any important information since the similarity vector is encrypted. Therefore, the eighth requirement is achieved.

9. Documents Privacy: The documents are encrypted using K, which is known only to the data owner and users. Therefore, unauthorized parties are unable to disclose the contents of the documents, which achieves the ninth requirement.
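The single-use behaviour described under the No Replay Attack requirement can be illustrated with a minimal sketch. It only models the bookkeeping of consumed keys. The class name, method names, and the use of random hex tokens as keys are assumptions made for illustration, and the sketch does not attempt to reproduce the anonymity or the cryptographic construction of the proposed authentication scheme.

```python
import secrets


class AuthenticationServer:
    """Toy model of single-use authentication keys (hypothetical names)."""

    def __init__(self, keys):
        # Keys that were issued and have not been consumed yet.
        self._unused = set(keys)

    def authenticate(self, key):
        # A key is accepted only once; a second use is treated as a replay.
        if key in self._unused:
            self._unused.remove(key)
            return True
        return False

    def update_keys(self, keys):
        # Models the periodic update of the authentication keys.
        self._unused = set(keys)


# Assumed key material: random tokens standing in for authentication keys.
keys = [secrets.token_hex(16) for _ in range(3)]
server = AuthenticationServer(keys)

first_attempt = server.authenticate(keys[0])   # fresh key
replayed = server.authenticate(keys[0])        # same key resent
unknown = server.authenticate("not-a-key")     # key that was never issued
```

In this sketch the first attempt is accepted, while the resent key and the unknown key are both rejected, which mirrors the rejection of a replayed query by the authentication server.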

Table 2 compares the proposed technique to the techniques reported in the literature according to the 9 security requirements as well as the ranking property. Among the compared techniques, the proposed technique is the only one that achieves all 9 security requirements. Moreover, it provides a multi-keyword ranked search based on the similarity of the normalized TF-IDF values, which gives better retrieval efficiency compared to the other searching techniques (Dawoud and Altilar, 2014).
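As a plaintext illustration of similarity-based ranking, the following sketch scores documents by the cosine similarity between a query TF vector and per-document TF-IDF vectors. It is a simplified sketch under stated assumptions: all values are unencrypted, the vectors are made-up toy data, and the function names are hypothetical. In the proposed technique the corresponding computation is carried out over encrypted, normalized values and split across servers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector is treated as matching nothing
    return dot / (norm_u * norm_v)


def rank_documents(query_tf, doc_tfidf):
    """Return document ids ordered from most to least similar to the query."""
    scores = {
        doc_id: cosine_similarity(query_tf, vector)
        for doc_id, vector in doc_tfidf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed): three documents over a three-keyword vocabulary.
doc_tfidf = {
    "d1": [0.9, 0.0, 0.1],
    "d2": [0.0, 0.8, 0.2],
    "d3": [0.5, 0.5, 0.0],
}
query_tf = [1.0, 0.0, 0.0]  # the query contains only the first keyword

ranking = rank_documents(query_tf, doc_tfidf)
```

With this toy data the document dominated by the queried keyword ranks first and the document that does not contain it ranks last.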

CLOSER 2016 - 6th International Conference on Cloud Computing and Services Science


5 CONCLUSION

In this paper, 9 security requirements are defined to create a highly secure data retrieval system that utilizes cloud computing systems. These requirements are: no index pattern, no query pattern, no documents pattern, no index frequency, no query frequency, no replay attack, query privacy, index privacy, and documents privacy. None of the techniques reported in the literature satisfies all 9 requirements. Moreover, some of the existing approaches use data mining techniques that may decrease the efficiency of the data retrieval, such as binary features, reduction of keywords, reduction of the features vector, and classes normalization. The proposed technique is shown to satisfy all 9 security requirements along with the efficiency requirement. It utilizes a multi-server setting to separate the leaked information, so that none of the servers is able to infer any information from the data that passes through it. The technique uses anonymous authentication of the queries to prevent any unauthorized party from generating a query and to prevent replay attacks. It also uses the Cosine similarity measure between the TF vector of the query and the TF-IDF vectors of the documents to rank the documents according to their similarity to the query. This similarity measure is shown to be effective and applicable in the proposed technique. Table 2 illustrates the position of the proposed technique with regard to the relevant research available in the literature. The technique can also be adapted to support fuzzy-keyword retrieval, which is a future research topic for our group.

REFERENCES

Boneh, D., Di Crescenzo, G., Ostrovsky, R., and Persiano, G. (2004). Public key encryption with keyword search. In Cachin, C. and Camenisch, J., editors, Advances in Cryptology - EUROCRYPT 2004, volume 3027 of Lecture Notes in Computer Science, pages 506–522. Springer Berlin Heidelberg.

Cao, N., Wang, C., Li, M., Ren, K., and Lou, W. (2011). Privacy-preserving multi-keyword ranked search over encrypted cloud data. In INFOCOM, 2011 Proceedings IEEE, pages 829–837.

Chen, L., Sun, X., Xia, Z., and Liu, Q. (2014). An efficient and privacy-preserving semantic multi-keyword ranked search over encrypted cloud data. International Journal of Security and Its Applications, 8(2):323–332.

ChinnaSamy, R. and Sujatha, S. (2012). An efficient semantic secure keyword based search scheme in cloud storage services. In Recent Trends In Information Technology (ICRTIT), 2012 International Conference on, pages 488–491.

Chuah, M. and Hu, W. (2011). Privacy-aware bedtree based solution for fuzzy multi-keyword search over encrypted data. In Distributed Computing Systems Workshops (ICDCSW), 2011 31st International Conference on, pages 273–281.

Dawoud, M. and Altilar, D. (2014). Privacy-preserving search in data clouds using normalized homomorphic encryption. In Euro-Par 2014: Parallel Processing Workshops, volume 8806 of Lecture Notes in Computer Science, pages 62–72. Springer International Publishing.

Gopal, G. and Singh, M. (2012). Secure similarity based document retrieval system in cloud. In Data Science Engineering (ICDSE), 2012 International Conference on, pages 154–159.

Hammouda, K. (2013). Web mining data - uw-can-dataset. http://pami.uwaterloo.ca/ hammouda/webdata.

Kuzu, M., Islam, M. S., and Kantarcioglu, M. (2012). Efficient similarity search over encrypted data. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 1156–1167, Washington, DC, USA. IEEE Computer Society.

Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339.

Li, J., Wang, Q., Wang, C., Cao, N., Ren, K., and Lou, W. (2010). Fuzzy keyword search over encrypted data in cloud computing. In INFOCOM, 2010 Proceedings IEEE, pages 1–5.

Li, M., Yu, S., Cao, N., and Lou, W. (2011). Authorized private keyword search over encrypted data in cloud computing. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 383–392.

Liu, Q., Wang, G., and Wu, J. (2009). An efficient privacy preserving keyword search scheme in cloud computing. In Computational Science and Engineering, 2009. CSE ’09. International Conference on, volume 2, pages 715–720.

Orencik, C. and Savas, E. (2014). An efficient privacy-preserving multi-keyword search over encrypted cloud data with ranking. Distributed and Parallel Databases, 32(1):119–160.

Rajaraman, A. and Ullman, J. D. (2011). Data mining. In Mining of Massive Datasets, pages 1–17. Cambridge University Press. Cambridge Books Online.

Salton, G. and Buckley, C. (1988a). Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523.

Salton, G. and Buckley, C. (1988b). Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523.

Song, D. X., Wagner, D., and Perrig, A. (2000). Practical techniques for searches on encrypted data. In Security and Privacy, 2000. S&P 2000. Proceedings. 2000 IEEE Symposium on, pages 44–55.


Sun, W., Wang, B., Cao, N., Li, M., Lou, W., Hou, Y. T., and Li, H. (2013). Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, ASIA CCS ’13, pages 71–82, New York, NY, USA. ACM.

Tseng, F.-K., Liu, Y.-H., and Chen, R.-J. (2012). Toward authenticated and complete query results from cloud storages. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on, pages 1204–1209.

Volkan, T. (2012). Data mining research - classic3 and classic4 datasets. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets.

Wang, C., Cao, N., Li, J., Ren, K., and Lou, W. (2010). Secure ranked keyword search over encrypted cloud data. In Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on, pages 253–262.

Wang, C., Cao, N., Ren, K., and Lou, W. (2012a). Enabling secure and efficient ranked keyword search over outsourced cloud data. IEEE Transactions on Parallel and Distributed Systems, 23(8):1467–1479.

Wang, C., Ren, K., Yu, S., and Urs, K. (2012b). Achieving usable and privacy-assured similarity search over outsourced cloud data. In INFOCOM, 2012 Proceedings IEEE, pages 451–459.

Xiang, G., Yu, B., and Zhu, P. (2012). A algorithm of fully homomorphic encryption. In Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on, pages 2030–2033.
