Eﬃcien

t Tree Search in Encrypted Data

R. Brinkman, L. Feng, J. Doumen, P.H. Hartel, and W. Jonker

University of Twente, Enschede, the Netherlands

Abstract. Sometimes there is a need to store sensitive data on an un-

trusted database server. Song, Wagner and Perrig have introduced a way

to search for the existence of a word in an encrypted textual document.

The search speed is linear in the size of the document. It does not scale

well for a large database. We have developed a tree search algorithm

based on the linear search algorithm that is suitable for XML databases.

It is more eﬃcient since it exploits the structure of XML. We have built

prototype implementations for both the linear and the tree search case.

Exp eriments show a major improvement in search time.

1 Introduction

Nowadays the need grows to store data securely on an untrusted system. Think,

for instance, of a remote database server administered by somebody else. If you

want your data to be secret, you have to encrypt it. The problem then arises

how to query the database. The most obvious solution is to download the whole

database locally and then perform the query. This of course is terribly ineﬃcient.

Song, Wagner and Perrig [1] have introduced a protocol to search for a word in

an encrypted text. We will summarise this protocol in section 2.

In this paper we propose a new protocol that is more suitable for handling

large amounts of semi-structured XML data. This new protocol exploits the

XML tree structure. XPath queries can be answered fast and secure.

We have built prototype implementations for both the linear and the tree

search protocol (section 3). We use these prototypes to ﬁnd optimal settings

for the parameters used within the protocols and to show the increase in search

speed by using the tree structure. We did some experiments (section 4) for which

the results can be found in section 5.

2 Search Strategy

Before we describe our tree search strategy (section 2.2) we will give a short

summary of the original linear search strategy of Song, Wagner and Perrig [1].

2.1 Linear Search Strategy for Full Text Documents

Song, Wagner and Perrig [1] describe a protocol to store sensitive data on an

untrusted server. A client (Alice) can store data on the untrusted server (Bob)

Brinkman R., Feng L., Doumen J., H. Hartel P. and Jonker W. (2004).

Efﬁcient Tree Search in Encrypted Data.

In Proceedings of the 2nd International Workshop on Security in Information Systems, pages 126-135

DOI: 10.5220/0002664201260135

 SciTePress

and search in it, without revealing the plain text of either the stored data, the

query or the query result. The protocol consists of three parts: storage, search

and retrieval.

Storage Before Alice can store information on Bob she has to do some calcu-

lations. First of all she has to fragment the whole plain text W into several

ﬁxed sized words W

. Each W

has a ﬁxed length n. She also generates en-

cryption keys k

and k

and a sequence of random numbers S

using a pseudo

random generator. Then she has or calculates the following for each block

plain text block

encryption key

= E

) = hL

, R

i encrypted text block

key for f

= f

) key for F

random number i

= hS

, F

)i tuple used by search

= X

⊕ T

value to be stored (⊕ stands for xor)

where E is an encryption function and f and F are keyed hash functions:

E : key × {0, 1}

→ {0, 1}

f : key × {0, 1}

n−m

→ key

F : key × {0, 1}

n−m

→ { 0, 1}

The encrypted word X

has the same block length as W

(i.e. n). L

has

length n − m and R

has length m. The parameters n and m may be chosen

freely (n > 0, 0 < m ≤

). Section 5.1 gives guidelines for eﬃcient values of

n and m. The value C

can be sent to Bob for storage. Alice may now forget

the values W

, X

, L

, R

, k

, T

and C

, but should still remember k

, k

and S

Search After the encrypted data is stored by Bob in the previous phase Alice

can query Bob. Alice provides Bob with an encrypted version of a plain text

word W

and asks him if and where W

occurs in the original document.

Note that Alice does not have to know the position j. If W

was a block in

the original data then hj, C

i is returned. Alice has or calculates:

encryption key

key for f

plain text block to search for

= E

) = hL

, R

i encrypted block

= f

) key for F

Then Alice sends the value of X

and k

to Bob. Having X

and k

Bob is

able to compute for each C

= C

⊕ X

= hS

, S

IF S

= F

) THEN RETURN hp, C

127

If p = j then S

= F

), otherwise S

is garbage. Note that all locations

with a correct T

value are returned. However there is a small chance that T

satisﬁes T = hS

, F

)i but where S

6= S

. Therefore, Alice should check

each answer whether the correct random value is used or not.

Retrieval Alice can also ask Bob for the cipher text C

at any p osition p. Alice,

knowing k

, k

and the seed for S, can recalculate W

p desired location

= hC

p,l

, C

p,r

i stored block

random value

p,l

= C

p,l

⊕ S

left part of encrypted block

= f

p,l

) key for F

= hS

, F

)i check tuple

= C

⊕ T

encrypted block

= D

) plain text block

where D is the decryption function D : key × {0, 1}

→ {0, 1}

such that

)) = W

This is all Alice needs. She can store, ﬁnd and read the text while Bob cannot

read anything of the plain text. The only information Bob gets from Alice is C

in the store phase and X

and k

in the search phase. Since C

and X

are

both encrypted with a key only known to Alice and k

is only used to hash one

particular random value, Bob does not learn anything of the plain text. The only

information Bob learns from a search query is the location where an encrypted

word is stored.

2.2 Tree Search Strategy for XML Documents

So far, we considered only text ﬁles. Using structured XML data can improve

eﬃciency.

Torsten Grust [2, 3] introduces a way to store XML data in a relational

database such that search queries can be handled eﬃciently. An XML document

is translated into a relational table with a predeﬁned structure. Each record

consists of the name of the tag or attribute and its corresponding value. The

information about the tree structure of the original XML document is captured

in the pre, post and parent ﬁelds. All ﬁelds can be computed in a single pass over

the XML document. The pre and post ﬁelds are sequence numbers that count

the open tags respectively the close tags. The parent value is the pre value of

the parent element (see ﬁgure 1(a)).

The XPath axes like descendant, ascendant, child, etc can be expressed as

simple expressions over the pre, post and parent ﬁelds. For instance:

– v is a child of v

⇐⇒ v.parent = v

.pre

– v is a descendant of v

⇐⇒ v

.pre < v.pre ∧ v

.post > v.post

– v is following v

⇐⇒ v

.pre < v.pre ∧ v

.post < v.post

128

pre

p ost

parent

<a>

<b>

</b>

d=”. . . ”>

<e/>

</c>

</a>

(a) Pre/Post/Parent calculation (b) Visualisation of XPath Axes in a

Pre/Post Plane

Fig. 1. Calculation and Usage of Pre, Post and Parent ﬁelds

Some XPath axes can also be drawn in a pre/post plane (see ﬁgure 1(b)). Each

element can be drawn as a dot in the graph. The solid circle indicates just one of

them. Taking the solid circle as starting element, the quadrants indicate where

its ascendants, descendants and siblings are located.

Not all updates are eﬃcient. Modiﬁcation and deletion are no problem, but

element insertion causes the need to recalculate the pre, post and parent values

for all following elements. The number of recalculations can be reduced by an

initial sequence with a larger step (100, 200, 300, . . . ).

Torsten Grust aims at storing XML data in the clear. To protect the data

cryptographically we combine his strategy with the linear search approach of

Song, Wagner and Perrig (SWP) [1]. Only some slight modiﬁcations to the SWP

approach are necessary:

1. The input ﬁle is not an unstructured text ﬁle but a tree structured XML

document. The division of the data into ﬁxed sized blocks does not seem

natural. Therefore, we use variable block lengths that depend on the lengths

of the tag names, attribute names, attribute values and the text between

tags.

2. The sequence number of a block is no longer appropriate to deﬁne the loca-

tion within a document. We use the pre value instead.

The equations of section 2.1 can be rewritten to the equations below. Note that

all subscripts have changed. For simplicity we only describe the encryption of

tag names. Exactly the same scheme is used for attribute names (preﬁxed with

a @ sign) or the data itself by simply substituting value for tag.

129

Storage

tag

plain text block

encryption key

tag

= E

tag

) = hL

tag

, R

tag

i encrypted text block

key for f

tag

= f

tag

) key for F

pre

random number pre

pre,tag

= h S

pre

, F

tag

pre

)i tuple used by search

pre,tag

= X

tag

⊕ T

pre,tag

value to be stored

Note that the random value S

pre

does not depend on the tag name but on

the location (expressed in the pre ﬁeld) because all elements with the same

tag name should be stored diﬀerently.

Search An XPath query like /tag

//tag

[tag

= ”value”] is encrypted to

/hX

tag

, k

tag

i//hX

tag

, k

tag

i[hX

tag

, k

tag

i = ”hX

value

, k

value

i”] before send-

ing it to the server. The server calculates the result traversing the XPath

query from left to right. Each step consists of two or three sub steps:

– Evaluating the XPath axis /, //, [ and ] using the pre, post and parent

ﬁelds. It is possible to ﬁnd all children (/) or all descendants (//) of

elements found in a previous step by just using the pre, post and parent

ﬁeld. See section 3.2 for an example.

– Filtering out the records that do not satisfy S

= F

tag

) in T

p,tag

⊕ X

tag

= hS

, S

– Eventually ﬁltering out the records with an incorrect value ﬁeld.

Retrieval

key for f

encryption key

pre desired location

pre,tag

= hC

pre,tag,l

, C

pre,tag,r

i stored block

pre

random value

tag ,l

= C

pre,tag,l

⊕ S

pre

left part of encrypted block

tag

= f

tag,l

) key for F

tag

= hS

pre

, F

tag

pre

)i check tuple

tag

= C

pre,tag

⊕ T

tag

encrypted block

tag

= D

tag

) plain text block

3 Implementation

For each search strategy a prototype has been developed. Each prototype consists

of two tools: one for encryption and one for searching. All tools use the standard

crypto packages shipped with JDK 1.4.

3.1 Linear Search Prototype

Section 2.1 introduces three functions: E, f and F . E should be a blo ck cipher

in ECB mode and f and F keyed hash functions. For our prototype we chose

130

DES for all three of them. E is exactly DES in ECB mode. Since DES works on

blocks of 64 bits n should be a multiple of 64 bits.

f and F are keyed hash functions with variable sized hash values. Standard

hash functions like SHA-1 have a ﬁxed sized hash value. It is possible to use the

last (or the ﬁrst) m bits of the hash value, but then m should be less than the

size of the hash value (160 bits for SHA-1). To allow a larger value for m our

prototype uses DES in CBC mode. To hash a data block of length n − m to a

hash value of length m the block is encrypted with the speciﬁed key (56 bits

DES key) but only the last m bits are used as hash value. The only restriction

for m is that n − m ≥ m and thus n ≥ 2m. See Menezes et al [4] for a more

detailed description of the used hash algorithm.

The search algorithm implements the protocol described in [1] as summarised

in section 2.1. The program takes the whole cipher text along with the query as

input and produces the hi, C

i pairs as output.

3.2 Tree Search Prototype

Like the linear prototype the tree search prototype is split into two parts: one

for encryption and one for searching.

The Encrypt tool uses a SAX parser to read the input XML document.

In one pass over the input, the pre, post and parent values can b e calculated.

When an end tag is encountered all the information to encrypt the element is

available. Attributes are handled as tags with a leading @ sign. A new record

hpre, post, parent, C

pre,tag

, C

pre,value

i is inserted into the relational database,

where C

pre,tag

and C

pre,value

are calculated as in section 2.2. In our prototype

we use a MySQL database to store the encrypted document.

In contrast with the linear prototype there are no predeﬁned block sizes n

and m. Instead of using a ﬁxed sized block, n is simply set to the length of the

tag name. m is a predeﬁned fraction of n (for example 0.5).

In order to speed up the search process, indices are added to the MySQL

table for the pre, post and parent ﬁelds.

The XPath expression is evaluated step by step. Preliminary results are

stored in a result table. Each step consists of two or three sub steps:

1. Carry out the path delimiter (/, //, [ or ]). For this step only the pre, post

and parent ﬁelds are needed. For example // (descendants) is translated into

the SQL query:

CREATE TABLE new_result

SELECT data.*

FROM data, previous_result

WHERE data.pre > previous_result.pre AND

data.post < previous_result.post

2. Filter out the records in the preliminary result with the wrong tag/attribute

names. In this step we use the original linear search method.

131

3. When the step consists of an equation expression the previous step is re-

peated but now for the value instead of the name.

4 Experimental Data

The two prototypes give us the opportunity to experiment with the parameters

used in the protocol and, more importantly, compare the linear search approach

with the tree search approach. We are especially interested in the inﬂuence the

approach and the parameters n and m have on the encryption and search speed.

We used the XML benchmark

[5] to generate three sample XML ﬁles of sizes

1 MB, 10 MB and 100 MB. Although the linear approach does not use the

structure of these XML ﬁles the benchmark is used in both cases to compare the

results with the tree search approach.

Also the number of collisions has been measured (see ﬁgure 2(a)). Collisions

are the false hits that occur because of the collisions in the hash function F . F

hashes the random value S

of size n−m to a hash value of length m, where n −

m ≥ m. Therefore collisions are unavoidable (collisions are avoidable when n −

m = m and F is bijective, but bijective functions are not good hash functions).

4.1 Experiments with the Linear Search Prototype

For the linear prototype both n and m may be chosen freely. Tests are carried

out ∀n ∈ {8, 16, 24, 32, 40, 48, 56, 64} where these values are the number of bytes

and not bits. Because we use DES in ECB mode for the encryption function

E, we only use multiples of 8 bytes. m should be less than or equal to

m ∈ {1, 2, . . . ,

} (also in bytes). Measurement results of the 100 MB case are

plotted in ﬁgure 2(b). Tests with data inputs of 1 MB and 10 MB showed that

the numb er of collisions, the search and the encryption times are proportional

to the data size. In our technical report [6] more experimental data is provided.

All tests were carried out on a Pentium IV 2.4 MHz with 512 MB memory.

For the search query a word guaranteed to be in at least one location was

chosen. The search engine does not stop when one occurrence is found; all the

text is scanned for each query.

4.2 Experiments with the Tree Search Prototype

For the tree search prototype the only conﬁgurable parameters are m and the

data size. The block length n depends on the tag names and values. Encryption

tests are carried out on the same XML do cuments as in the linear prototype. In

this case m is relative to n; m ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. The encryption times for

the 1 MB, 10 MB and the 100 MB ﬁles were 21.5, 188 and 1195 s and did not

depend on m.

Search tests were carried out with a ﬁxed m = 0.5 because m does not seem

to have much inﬂuence. Some queries are shown in table 1. Also the number of

http://www.xml-benchmark.org

132

Collissions

100

1000

0000

8 16 24 32 40 48 56 64

m=1

m=2

m=3

(a) Number of Measured Collisions

Linear Search Approach

8 16 24 32 40 48 56 64

Encryption

(b) Encryption and Search Search Times

Fig. 2. Measurement Results of Linear Search Prototype for the 100 MB Case

elements in the result is shown for each query. All three ﬁles have approximately

the same tree depth but have diﬀerent branch factors (average number of sub

children per element).

Table 1. Search Times Calculated for Search Queries with Diﬀerent Depth and Branch

Factor

t (ms)

query

count

1 MB

10 MB

100 MB

1 MB

10 MB

100 MB

1281

1506

1285

/site

1266

1380

1321

/site/regions

1358

1435

1342

/site/regions/asia

1409

1687

2464

/site/regions/asia/item

200

2000

1518

2030

4135

/site/regions/asia/item/description

200

2000

1376

1591

2442

/site/regions/africa/item/description

550

1448

2777

9059

/site/regions/europe/item/description

600

6000

1455

2098

4577

/site/regions/australia/item/description

220

2200

1654

3226

13672

/site/regions/namerica/item/description

100

1000

10000

1336

1817

3028

/site/regions/samerica/item/description

100

1000

1398

2382

18530

//*

21048

206130

2048180

3639

21775

191899

//item

217

2175

21750

5 Analysis of the Results

First we will analyse the results of the individual experiments in the ﬁrst two

subsections. In subsection 5.3 we will compare the linear search approach with

the tree search approach.

5.1 Results from the Linear Search Approach

From the linear search prototype we can conclude the following:

133

– As exp ected the larger the dataset the larger the encryption and search times.

Encryption and search times grow linear in the size of the dataset. Therefore

the protocol does not scale well and can only be used for reasonable small

databases.

– The larger n is the shorter the encryption and search times gets (ﬁgure 2(b)).

This can be explained by looking at the number of blocks. The larger n is

the fewer blocks there are. For each block a ﬁxed number of steps is taken.

Most of these steps do not depend on the length of the blocks. Therefore less

time is needed for the whole database.

– Searching is faster than encryption, because fewer operations have to be

calculated for each block.

– The larger n is the fewer collisions occur (ﬁgure 2(a)). This can also be

explained by the fewer blocks.

– For a ﬁxed value of n the encryption and search times hardly depend on the

value of m.

– Collisions can be avoided by choosing a suﬃciently large value of m. The

largest value for m =

which is also the most optimal one. But also for

m > 2 the number of collisions is negligible.

5.2 Results from the Tree Search Approach

From the tree search prototype we can conclude that:

– The encryption time is linear in the size of the input.

– The search time depends both on the structure of the XML document and

the search query. The search time is of order O(p) where p is the number

of elements to be read. For queries without // this comes down to O(bd)

where b is the branch factor (the average number of sub elements) and d is

the depth in the tree where the answer is found.

5.3 Beneﬁts of using Tree Structure

From the exp eriments with the linear search method we know that the encryption

time depends on the block size. Therefore, to make a fair comparison between

the linear text search and the tree search, we have to take into account the block

size of the tree search method. We analysed the XML documents and found the

data shown in table 2.

Comparison of the encryption sp eed in the tree search case (with an average

block size of around 18) with the linear case, shows that the tree encryption is

slightly faster than in the linear case. The reason for this is that there is no need

to encrypt the close tag.

The major beneﬁt of using the tree structure is the increase in search speed.

Only a small part of the whole tree has to be searched. Because the search

time totally dep ends on the data and the query, a straight comparison between

the linear and the tree case is impossible. However, linear search is of order

O(n) = O(b

), whereas tree searching is of order O(bd).

134

6 Conclusions

We have implemented a prototype for the theory described in [1]. We showed

that the search complexity is linear in the size of the text. We have deﬁned

a new protocol for semi-structured XML data that exploits the tree structure.

Experiments with the implementations of both protocols showed that the encryp-

tion speed remains linear in the size of the input, but that a major improvement

in the search speed can be achieved.

References

1. Dawn Xiaodong Song, David Wagner, and Adrian Perrig. Practical techniques for

searches on encrypted data. In IEEE Symposium on Security and Privacy, pages

44–55, 2000. http://citeseer.nj.nec.com/song00practical.html.

2. Torsten Grust. Accelerating xpath location steps. In Proceedings of the 21st ACM

International Conference on Management of Data (SIGMOD 2002), pages 109–120.

ACM Press, Madison, Wisconsin, USA, June 2002. http://www.informatik.uni-

konstanz.de/∼grust/ﬁles/xpath-accel.pdf.

3. Torsten Grust, Maurice van Keulen, and Jens Teubner. Staircase join: Teach

a relational dbms to watch its (axis) steps. In Proceedings of the 29th Int’l

Conference on Very Large Databases (VLDB 2003), Berlin, Germany, Sep 2003.

http://citeseer.nj.nec.com/593676.html.

4. Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone.

Handbook of Applied Cryptography. CRC Press, October 1996.

http://www.cacr.math.uwaterloo.ca/hac/.

5. A. Schmidt, F. Waas, M. Kersten, D. Florescu, I. Manolescu, M. Carey, and

R. Busse. The xml benchmark project. Technical Report INS-R0103, CWI, April

2001. http://citeseer.ist.psu.edu/schmidt01xml.html.

6. R. Brinkman, L. Feng, S. Etalle, P. H. Hartel, and W. Jonker. Experimenting

with linear search in encrypted data. Technical report TR-CTIT-03-43, Centre for

Telematics and Information Technology, Univ. of Twente, The Netherlands, Sep

2003. http://www.ub.utwente.nl/webdo cs/ctit/1/000000d9.pdf.

Table 2. Block Sizes

data

avg tag

standard

avg text

standard

avg all

standard

size

length

deviation

size

deviation

blo cks

deviation

1 MB

9.8

3.4

28.0

18.2

10 MB

9.8

3.4

28.6

18.4

100 MB

9.8

3.4

28.9

18.6

135