THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE

FOR LARGE STRING DATABASES

Qiang Xue and Sakti Pramanik

Department of Computer Science and Engineering

Michigan State University, East Lansing, MI 48824, USA

Gang Qian

Department of Computer Science

University of Central Oklahoma, Edmond, OK 73034, USA

Qiang Zhu

Department of Computer and Information Science

The University of Michigan, Dearborn, MI 48128, USA

Keywords:

Hybrid Digital tree, indexing, string databases, preﬁx searches, substring searches.

Abstract:

There is an increasing demand for efﬁcient indexing techniques to support queries on large string databases.

In this paper, a hybrid RAM/disk-based index structure, called the Hybrid Digital tree (HD-tree), is proposed.

The HD-tree keeps internal nodes in the RAM to minimize the number of disk I/Os, while maintaining leaf

nodes on the disk to maximize the capability of the tree for indexing large databases. Experimental results

using real data have shown that the HD-tree outperformed the Preﬁx B-tree for preﬁx and substring searches.

In particular, for distinctive random queries in the experiments, the average number of disk I/Os was reduced

by a factor of two to three, while the running time was reduced by an order of magnitude.

1 INTRODUCTION

Electronic text (string) collections have grown dramatically over the last decade, from megabytes of dictionaries, to gigabytes of genome sequences, to terabytes of web documents. Many applications need efficient indexing methods to process complex string queries (e.g., substring searches) on these large string data sets. In the past few decades, various data structures have been proposed for string indexing. They can be divided into two categories: RAM-based and disk-based. The first category includes digital-tree-based (trie-based) structures, such as the Patricia trie (Morrison, 1968), the suffix tree (McCreight, 1976; Weiner, 1973), the suffix array (Manber and Myers, 1990), and the PAT tree (Gonnet et al., 1991). The second category includes extendible hashing (Fagin et al., 1979), inverted files (Baeza-Yates and Ribeiro-Neto, 1999), the Prefix B-tree (Bayer and Unterauer, 1977), and the String B-tree (Ferragina and Grossi, 1999).

RAM-based index structures reside in the main

memory (RAM) where string queries are performed.

Among these RAM-based structures, Patricia tries

and PAT/suffix trees are particularly effective in handling relatively small amounts of string data; however,

as the database size increases, it is no longer feasi-

ble to keep the trie structure in the RAM. Moreover,

because of the unbalanced structure of tries, it is inef-

ﬁcient to store tries on disk, especially when indexes

are dynamically created (Clark and Munro, 1996; Fer-

ragina and Grossi, 1999). Therefore, we argue that

RAM-based index structures are not suitable for in-

dexing large string databases.

On the other hand, disk-based data structures can

be used for indexing large string databases. Among

these disk-based structures, hashing technology is ef-

ﬁcient for exact string matches and inverted ﬁles are

efﬁcient for keyword-based searches; however, they

are unsuitable for substring searches. The Preﬁx B-

tree is capable of indexing large and dynamic string

databases. The String B-tree (Ferragina and Grossi,

1999) uses the Patricia trie inside its internal nodes

to provide the same worst-case performance as the B-

tree (Bayer and McCreight, 1972). Since the String

B-tree stores indexed strings in a separate file, it requires more disk accesses than the Prefix B-tree in the general case. These disk-based indexing techniques require only limited RAM to conduct string queries. To utilize the large amount of available memory, they rely on caching mechanisms that are usually not optimized for the individual data structure.

Xue Q., Pramanik S., Qian G. and Zhu Q. (2005).

THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE FOR LARGE STRING DATABASES.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 115-121

DOI: 10.5220/0002518501150121

SciTePress

In this paper, we propose the Hybrid Digital tree

(HD-tree), a novel hybrid RAM/disk-based index

structure to support efﬁcient queries on very large

string databases. The HD-tree keeps its internal

nodes, which are similar to those in digital trees, in

the RAM to minimize the number of disk I/Os for a

string query. Its leaf nodes, which hold the sufﬁxes of

the indexed strings, are kept on disk to maximize the

capability of the tree for indexing a large database.

It is known that traditional disk-based trees, such as

Preﬁx B-trees, may use the available RAM to keep

their internal nodes (i.e., caching), so that the num-

ber of disk I/Os may be reduced. However, the HD-

tree is different from this approach as follows: First,

an internal node of disk-based trees is a disk block,

which is usually several kilobytes in size, while an in-

ternal node of the HD-tree is a data structure (i.e., a

trie node), which is usually several bytes in size. Sec-

ond, the internal nodes of disk-based trees are stored

on disk and have to be read into the RAM whenever necessary, while all internal nodes of the HD-tree

are kept in the RAM, so that no disk I/Os are required

to access these internal nodes.

The internal nodes of an HD-tree are built on the

preﬁxes of indexed strings and are used to guide the

search to the leaf node(s) containing the query an-

swer(s). Unlike a traditional digital tree, the parent

of a leaf node in the HD-tree allows a set (“range”) of

multiple preﬁxes so that indexed strings with different

preﬁxes may share the same leaf node (disk block) to

improve disk utilization. Moreover, unlike the tradi-

tional concept of range, the above preﬁx “range” of

a node may not be “continuous”, so that strings with

a preﬁx within the traditional range may be stored in

a separate leaf node(s) to allow further improvements

in disk utilization.

We conducted extensive experiments to study the behavior

of the HD-tree under different RAM sizes for various

string queries. It was observed that for a given data-

base size, a small amount of RAM improved the per-

formance of the HD-tree signiﬁcantly; however, when

the RAM size was increased beyond a certain thresh-

old point, the gain in performance became less sig-

niﬁcant. We also conducted experiments to evaluate

the performance of the HD-tree by comparing it with the

Preﬁx B-tree. The experimental results showed that

the HD-tree outperformed the Preﬁx B-tree given the

same amount of RAM.

The rest of this paper is organized as follows: the

structure and algorithms of the HD-tree are described

in Section 2; experimental results using Text RE-

trieval Conference (TREC) collections (Voorhees and

Harman, 1997) are discussed in Section 3; conclu-

sions and future work are presented in Section 4.

2 THE HD-TREE

The HD-tree incorporates and extends some indexing strategies of the digital tree and the B$^+$-tree (Comer, 1979), taking advantage of their strengths in search performance, compression capability, and disk utilization. We first introduce the notation and assumptions used in this paper. A string consists of a series of letters (symbols) chosen from an alphabet $\Sigma$ of size $|\Sigma|$. The letters and strings are assumed to have a lexicographic order. Symbols from $\Sigma$ are denoted by lower-case letters (e.g., $a$, $b$, and $c$), while strings are denoted by lower-case Greek letters (e.g., $\alpha$, $\beta$, and $\gamma$). $\sharp$ is a special auxiliary symbol such that $\sharp \notin \Sigma$ and $\sharp < c$ for any $c \in \Sigma$. Given a string $\alpha = a_1 \ldots a_n$ of length $|\alpha| = n$, we call $a_1 \ldots a_i$ a prefix, $a_j \ldots a_n$ a suffix, and $a_i \ldots a_j$ a substring of $\alpha$, where $1 \le i \le j \le n$. Given a set $\Omega$ of letters, the function $MAX(\Omega)$ yields the greatest element in $\Omega$. The database is considered as a set of records of the form $\Upsilon_i = (\kappa_i, \Lambda_i)$, where $\kappa_i$ is a unique string and $\Lambda_i$ is the descriptive information of $\kappa_i$, such as statistics, an offset, or a pointer to another location where the information can be found. Since the focus of this paper is on string indexing, $\Lambda_i$ is ignored in our discussion (i.e., we do not strictly distinguish between a record and a string). Finally, databases are assumed to be too large for a RAM-based index technique.

2.1 HD-tree Structure

Figure 1: An HD-tree. (Alphabet: {a, b, c, d, e}. Legend: internal node, multi-group leaf node, single-group leaf node, internal pointer, multi-group leaf pointer, single-group leaf pointer.)

The HD-tree is an unbalanced and ordered tree. An internal node $\delta$ of the HD-tree contains a list of pairs $L(\delta) = \{(a_1, P_1), \ldots, (a_m, P_m)\}$, where $P_i$ is a pointer to a child node; $a_i$ ($1 \le i \le m$) is a letter from $\Sigma$, called the label of $P_i$; and $a_1 < \ldots < a_m$, such that the pointers are ordered according to their labels.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION


Leaf nodes, which are implemented as disk blocks,

contain the sufﬁxes of indexed strings. The id-string

of a tree node is the concatenation of the labels along

the path traversing from the root to the node. The

id-string of the root is empty. Note that an HD-tree

node can be uniquely identiﬁed by its id-string. Let

ID(δ) denote the id-string of a tree node δ. In Figure

1, ID(2)=a, ID(9)=bbe, and ID(15)=db. An HD-

tree must satisfy two basic properties that determine

the proper leaf node for the indexed strings.

PROPERTY 1. For each internal node $\delta$ in an HD-tree, $ID(\delta)$ is a common prefix of all strings contained in any leaf node in the sub-tree with $\delta$ as the root.

Property 1 is similar to that of a digital tree; however, the id-string of a leaf node $\delta'$ in an HD-tree represents one or more prefixes (the prefix-set) that strings in $\delta'$ may have. Let $PS(\delta')$ be the prefix-set of $\delta'$. If $|PS(\delta')| = 1$, all strings in $\delta'$ share the same prefix in $PS(\delta')$. We call such a leaf node a Single-Group Leaf (SGL). If $|PS(\delta')| > 1$, leaf node $\delta'$ contains several groups of strings, where the strings in each group share a prefix that is different from the prefix of another group. We call such a leaf node a Multi-Group Leaf (MGL). The reason for using both SGLs and MGLs is to improve the disk utilization. Otherwise, some large groups of strings may hinder the grouping of small groups. Note that, based on Property 1, all prefixes in $PS(\delta')$ differ only in their last letters. An internal node in an HD-tree may have three types of pointers: (1) Internal Pointer (IP) to an internal node, (2) Single-Group Leaf Pointer (SGLP) to an SGL, and (3) Multi-Group Leaf Pointer (MGLP) to an MGL.
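The node and pointer types described above can be sketched as plain in-memory objects. This is a minimal illustration only: the class names `Internal` and `Leaf`, the `PointerKind` enum, and the helper `id_string` are hypothetical names chosen here, and the small tree at the end is an assumed fragment that is merely consistent with the id-strings quoted from Figure 1. In the HD-tree itself, leaves would be disk blocks rather than Python lists.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple, Union


class PointerKind(Enum):
    IP = "internal pointer"
    SGLP = "single-group leaf pointer"
    MGLP = "multi-group leaf pointer"


@dataclass
class Leaf:
    """A leaf node; stands in for a disk block holding suffixes."""
    multi_group: bool                      # True for an MGL, False for an SGL
    suffixes: List[str] = field(default_factory=list)


@dataclass
class Internal:
    """A RAM-resident node: (label, child) pairs kept sorted by label."""
    entries: List[Tuple[str, Union["Internal", Leaf]]] = field(default_factory=list)


def pointer_kind(child):
    """Classify the pointer that leads to `child`."""
    if isinstance(child, Internal):
        return PointerKind.IP
    return PointerKind.MGLP if child.multi_group else PointerKind.SGLP


def id_string(root, target, prefix=""):
    """Concatenate the labels on the path from root to target (None if absent)."""
    if root is target:
        return prefix
    if isinstance(root, Internal):
        for label, child in root.entries:
            found = id_string(child, target, prefix + label)
            if found is not None:
                return found
    return None


# Assumed fragment: the sub-tree below id-string "bbc".
sgl_11 = Leaf(multi_group=False)
mgl_12 = Leaf(multi_group=True)
node_bbc = Internal([("c", sgl_11), ("d", mgl_12)])
root = Internal([("b", Internal([("b", Internal([("c", node_bbc)]))]))])
```

With this fragment, `id_string(root, sgl_11)` yields `bbcc`, which agrees with the prefix-set the text gives for node 11.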

A key range in a traditional index tree, such as the B-tree, is continuous, where no key between the two boundaries of the range can be excluded; however, the prefix-set (i.e., the prefix “range”) in the HD-tree may not be continuous because one or more prefixes between the two boundaries (minimum and maximum prefixes) of the range may be excluded. The prefix-set $PS(\delta')$ for an SGL $\delta'$ contains the unique prefix $ID(\delta')$, i.e., $PS(\delta') = \{ID(\delta')\}$. For example, in Figure 1, node 11 is an SGL where $PS(11) = \{bbcc\}$; that is, all strings in this node have the common prefix $bbcc$. It is the task of the tree-building algorithm to determine which node is an SGL.

Unlike an SGL, whose prefix-set is given directly by its id-string, the prefix-set of an MGL needs to be derived as follows. Let $\delta'$ be an MGL, and $\delta$ be the parent node of $\delta'$ containing the list $L(\delta) = \{(a_1, P_1), \ldots, (a_i, P_i), \ldots, (a_m, P_m)\}$, where $m > 0$ and $P_i$ is the pointer to $\delta'$. Let $\beta = ID(\delta)$. The prefix-set of the MGL $\delta'$ is defined as $PS(\delta') = \{\beta c \mid c \in \Omega_i\}$, where $\Omega_i$ is a set of letters obtained through the following steps:

1) $\Omega' = \{a_j \mid (a_j, P_j) \in L(\delta),\ a_j < a_i,\ P_j \text{ is an MGLP}\}$;

2) if ($\Omega'$ is empty) $b' = \sharp$; else $b' = MAX(\Omega')$;

3) $\Omega = \{a \mid a \in \Sigma,\ b' < a \le a_i\}$;

4) $\Omega'' = \{a_j \mid (a_j, P_j) \in L(\delta),\ b' < a_j < a_i,\ P_j \text{ is an IP or SGLP}\}$;

5) $\Omega_i = \Omega - \Omega''$.

For example, in Figure 1, $PS(9) = \{bbd, bbe\}$ and $PS(12) = \{bbca, bbcb, bbcd\}$.
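The five derivation steps can be followed in a short sketch. The function name `mgl_prefix_set`, the representation of the parent's list as `(label, kind)` pairs, and the use of `#` as a stand-in for the auxiliary symbol are all assumptions made for illustration. The child list used in the usage note is likewise inferred from the prefix-sets quoted in the text, since the figure itself is not reproduced here.

```python
SHARP = "#"  # stands in for the auxiliary symbol, ordered before every letter


def mgl_prefix_set(beta, entries, i, alphabet):
    """Prefix-set of the MGL reached through entries[i].

    beta     -- id-string of the parent node
    entries  -- list of (label, kind) pairs sorted by label, with kind in
                {"IP", "SGLP", "MGLP"}
    i        -- index of the MGLP whose leaf is being described
    alphabet -- the letters of the alphabet in lexicographic order
    """
    a_i, kind = entries[i]
    if kind != "MGLP":
        raise ValueError("entries[i] must be a multi-group leaf pointer")
    rank = {c: r for r, c in enumerate(alphabet, start=1)}
    rank[SHARP] = 0

    # Step 1: labels of MGLPs to the left of a_i.
    omega_prime = [a for a, k in entries if k == "MGLP" and rank[a] < rank[a_i]]
    # Step 2: lower boundary b'.
    b = max(omega_prime, key=rank.get) if omega_prime else SHARP
    # Step 3: all letters in the range (b', a_i].
    omega = [a for a in alphabet if rank[b] < rank[a] <= rank[a_i]]
    # Step 4: letters in that range already claimed by an IP or SGLP.
    excluded = {a for a, k in entries
                if k in ("IP", "SGLP") and rank[b] < rank[a] < rank[a_i]}
    # Step 5: remove them, then prepend the parent's id-string.
    return [beta + c for c in omega if c not in excluded]
```

Assuming the parent of node 12 has id-string `bbc` and the child list `[("c", "SGLP"), ("d", "MGLP")]`, the call `mgl_prefix_set("bbc", [("c", "SGLP"), ("d", "MGLP")], 1, "abcde")` returns `bbca`, `bbcb`, and `bbcd`, which agrees with $PS(12)$ above: the letter `c` is excluded because it belongs to the SGL node 11.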

PROPERTY 2. Each leaf node $\delta'$ in an HD-tree keeps all the indexed strings with a prefix in its prefix-set $PS(\delta')$.

Based on the previous discussion on the preﬁx-set,

Property 2 of the HD-tree guarantees that any string

is placed in one and only one leaf node of an HD-tree.

Although we may logically consider that each string

is kept in a leaf node, the entire string does not have to

be stored in the leaf node physically, since the preﬁx

of a string can be found along the path from the root

to a leaf node. Therefore, only the sufﬁx of a string is

stored in a leaf node.

2.2 Building the HD-Tree

To build an HD-tree, algorithms are needed for insertion, deletion, and update. Due to space limitations, only insertion and its related issues are described in this paper. Interested readers can refer to (Xue et al., 2004) for the detailed algorithms.

2.2.1 Insertion Procedure

The insertion procedure inserts a new string $\kappa$ into a given HD-tree, where $\kappa = k_1 \ldots k_n$, $k_i \in \Sigma$, and $1 \le i \le n$. Note that $\sharp$ is appended at the end of a string to distinguish the string from any id-string in the given HD-tree. Assume the root of an HD-tree is at level 1. Given an internal node $\delta$ at level $l$, $k_l$ is used to determine the next pointer to follow. The insertion procedure first follows internal pointers ($k_l$ must equal the label) down the tree as far as possible. It stops at an internal node $\delta$ which satisfies the following: $ID(\delta) = k_1 \ldots k_i$; and for any internal node $\delta_1$ in the tree, if $ID(\delta_1) = k_1 \ldots k_j$ then $j \le i$. The letter $k_{i+1}$ is then used to find a qualified leaf node (a child of $\delta$) according to Property 2. If no leaf node is qualified, either the right-most MGL is chosen (if available) and its prefix-set is expanded, or a new MGL is created. Finally, the suffix string $k_{i+1} \ldots k_n$ is stored in the selected leaf node $\delta'$. If $\delta'$ overflows after the insertion, the overflow processing is invoked. For example, in Figure 1, to insert a string $bbab$, $\delta$ is the internal node 7, $\delta'$ is the leaf node 8, and $ab$ is stored in node 8. In the same way, string $bbcca$ is stored in node 11 as $ca$, and string $bbcab$ is stored in node 12 as $ab$.
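The descent phase of the insertion can be sketched as follows. The classes `Internal` and `Leaf`, the function `find_leaf`, and the `#` terminator are hypothetical names used only for this illustration; each leaf stub records just the last letters of the prefixes in its prefix-set. The tree fragment is an assumption that matches the three examples in the text (the prefix-set of node 8 in particular is a guess). Expanding an MGL, creating a new one, and overflow handling are deliberately left out.

```python
TERMINATOR = "#"  # stands in for the auxiliary end-of-string symbol


class Internal:
    """RAM-resident node: maps a label to its child."""
    def __init__(self, children):
        self.children = children  # dict: label -> Internal or Leaf


class Leaf:
    """Leaf stub: a name plus the last letters of the prefixes in its prefix-set."""
    def __init__(self, name, letters):
        self.name = name
        self.letters = set(letters)


def find_leaf(root, key):
    """Return (leaf name, suffix to store); the name is None if no leaf qualifies."""
    padded = key + TERMINATOR
    node, i = root, 0
    # Follow internal pointers as far as possible.
    while isinstance(node.children.get(padded[i]), Internal):
        node = node.children[padded[i]]
        i += 1
    # The next letter selects the leaf whose prefix-set covers it.
    for child in node.children.values():
        if isinstance(child, Leaf) and padded[i] in child.letters:
            return child.name, key[i:]
    return None, key[i:]


# Assumed fragment of Figure 1 below id-string "bb" (node 7).
node_10 = Internal({"c": Leaf("node 11", "c"), "d": Leaf("node 12", "abd")})
node_7 = Internal({"b": Leaf("node 8", "ab"), "c": node_10, "e": Leaf("node 9", "de")})
root = Internal({"b": Internal({"b": node_7})})
```

On this fragment, `find_leaf(root, "bbab")` selects node 8 with suffix `ab`, and `find_leaf(root, "bbcca")` selects node 11 with suffix `ca`, as in the example above.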


2.2.2 Overﬂow Processing

In HD-trees, only suffixes of the original strings are stored in a leaf node. These suffixes are called suffix-strings. A suffix-group is a set of suffix-strings whose first letters are the same (see Figure 2). If the overflowing leaf node $\delta'$ (whose parent is $\delta_1$) is an SGL, a new internal node $\delta_2$ is created and the first letter of each suffix-string in $\delta'$ is removed; $\delta_2$ becomes the child of $\delta_1$, and $\delta'$ becomes the child of $\delta_2$ (i.e., the grandchild of $\delta_1$). Consequently, the tree grows by another level. If the overflowing leaf node $\delta'$ is an MGL, it is considered for splitting.
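The SGL case can be illustrated with a small sketch using dictionaries for internal nodes and lists for leaf contents. The function name `push_down` is hypothetical, and the choice to hang the leaf under the greatest first letter it holds after stripping is an assumption made here (the text does not state which label the new pointer receives). The suffix-strings are those listed for node 11 in Figure 2.

```python
def push_down(parent_children, label, suffixes):
    """Grow the tree one level below an overflowing SGL.

    parent_children -- dict mapping labels to children of the leaf's parent
    label           -- label of the pointer to the overflowing SGL
    suffixes        -- suffix-strings stored in the SGL (all start with label)
    Returns the new internal node, a dict mapping a label to the leaf contents.
    """
    if any(not s.startswith(label) for s in suffixes):
        raise ValueError("every suffix-string in an SGL starts with its label")
    # Remove the first letter of each suffix-string.
    stripped = sorted(s[1:] for s in suffixes)
    # Assumption: the leaf hangs under the greatest first letter it now holds.
    leaf_label = max(s[0] for s in stripped if s)
    new_internal = {leaf_label: stripped}
    # The new internal node takes the leaf's place under the parent.
    parent_children[label] = new_internal
    return new_internal


node_11 = ["caa", "cbc", "cdd", "ce", "cabd", "cac", "cbba", "cbcd"]
parent = {"c": node_11, "d": ["ab"]}
new_node = push_down(parent, "c", node_11)
```

After the call, the parent's `c` pointer leads to the new internal node, and the former SGL contents, one letter shorter, sit one level deeper.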

Figure 2: The SGL and the MGL. (A Multi-Group Leaf, node 17, holds the suffix-groups {caa, cb, cbce}, {dca, dcc}, and {eab, edea, eee}. A Single-Group Leaf, node 11, holds a single group: cbba, cbcd, caa, cbc, cdd, ce, cabd, cac.)

2.2.3 Splitting

When an MGL $\delta'$ is split, the suffix-strings in $\delta'$ must be moved one suffix-group at a time. If the MGL $\delta'$ is simply split into two whenever it overflows (SSplit), the disk utilization is shown to be very low. In order to improve the disk utilization, two heuristics are used (HD-Split): (1) if the size of a suffix-group is greater than a threshold $T$ (we use 85% of the disk block size in our experiments), an SGL containing this suffix-group is formed; (2) before an overflowing node is split or after an SGL is moved out of an overflowing leaf node, suffix-groups may be moved to the qualified left or right siblings to avoid creating a new leaf node.
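The first heuristic can be sketched as a grouping step followed by a size test. The function names `suffix_groups` and `promote_large_groups` are hypothetical, and measuring a group's size as its total number of characters is a simplifying assumption; the actual on-disk layout is not described in this paper. The second heuristic (moving groups to siblings) is not shown.

```python
from itertools import groupby


def suffix_groups(suffixes):
    """Group non-empty suffix-strings by their first letter."""
    ordered = sorted(suffixes)
    return {first: list(group)
            for first, group in groupby(ordered, key=lambda s: s[0])}


def promote_large_groups(suffixes, block_size=4096, threshold=0.85):
    """Split an overflowing MGL's contents into SGL candidates and the rest.

    A suffix-group larger than threshold * block_size is promoted to its own
    SGL; size is approximated here by the total number of characters.
    """
    limit = threshold * block_size
    promoted, remaining = {}, {}
    for first, group in suffix_groups(suffixes).items():
        size = sum(len(s) for s in group)
        if size > limit:
            promoted[first] = group
        else:
            remaining[first] = group
    return promoted, remaining
```

For instance, with the node 17 contents from Figure 2 and an artificially small `block_size=10`, the `c` and `e` groups (9 and 10 characters) exceed the limit of 8.5 and would become SGLs, while the `d` group stays behind.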

2.2.4 Linked Disk Blocks

The HD-tree keeps track of the current available

RAM whenever adding or deleting an internal node.

If the RAM is available, the tree grows by creating

internal nodes through the overﬂow processing. Oth-

erwise, the tree stops creating new internal nodes.

Hence, if a leaf node overﬂows after inserting a string,

an extra disk block is linked to the original disk block

to accommodate the overﬂowing data. Consequently,

a search within the leaf node needs to access all linked

disk blocks. Using this approach, the HD-tree works

with any given size of RAM.

2.2.5 Queries

After an HD-tree is created, various queries can be efficiently processed using the tree. Given a database containing strings $\kappa_1, \ldots, \kappa_n$, an $ExactSearch(\alpha)$ retrieves $\kappa_i$ such that $\kappa_i = \alpha$, $1 \le i \le n$; a $PrefixSearch(\alpha)$ retrieves every $\kappa_i$ where $\alpha$ is a prefix of $\kappa_i$; a $SubstringSearch(\alpha)$ retrieves every $\kappa_i$ where $\alpha$ is a substring of $\kappa_i$. Note that in the HD-tree, $ExactSearch(\alpha)$ is equivalent to $PrefixSearch(\alpha\sharp)$, and $SubstringSearch(\alpha)$ is processed by performing $PrefixSearch(\alpha)$ among all suffix strings of $\kappa_1, \ldots, \kappa_n$ (Ferragina and Grossi, 1999).
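The two reductions can be illustrated independently of the HD-tree. The sketch below substitutes a sorted in-memory list for the tree, so it shows only how exact and substring searches reduce to prefix searches, not how the HD-tree answers them. All function names and the `#` terminator (which sorts before letters in ASCII) are assumptions for illustration.

```python
import bisect

TERMINATOR = "#"  # stands in for the auxiliary symbol; sorts before letters


def build_index(strings):
    """Sorted list of terminated keys."""
    return sorted(s + TERMINATOR for s in set(strings))


def prefix_search(index, alpha):
    """All indexed strings that start with alpha."""
    results = []
    for key in index[bisect.bisect_left(index, alpha):]:
        if not key.startswith(alpha):
            break
        results.append(key[:-1])  # drop the terminator
    return results


def exact_search(index, alpha):
    """Exact match expressed as a prefix search on alpha plus the terminator."""
    return prefix_search(index, alpha + TERMINATOR)


def build_suffix_index(strings):
    """Sorted (suffix, source string) pairs for every suffix of every string."""
    return sorted({(s[i:], s) for s in set(strings) for i in range(len(s))})


def substring_search(suffix_index, alpha):
    """Substring match expressed as a prefix search over all suffixes."""
    hits = set()
    start = bisect.bisect_left(suffix_index, (alpha,))
    for suffix, source in suffix_index[start:]:
        if not suffix.startswith(alpha):
            break
        hits.add(source)
    return sorted(hits)
```

A brief usage example: after `idx = build_index(["data", "database", "tree", "trie"])`, the call `prefix_search(idx, "data")` finds both `data` and `database`, whereas `exact_search(idx, "data")` finds only `data`, because the terminator rules out longer strings.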

3 EXPERIMENTAL RESULTS

We conducted extensive experiments to analyze the

behavior of the HD-tree and evaluate its performance.

The string databases were generated from TREC

(Voorhees and Harman, 1997). The HD-tree was im-

plemented using C++. Experiments were conducted

on a PC running Linux OS. The disk block size used

in our experiments was 4096 bytes.

Sample database WSJ1 was generated from the

TREC collection, Wall Street Journal 1991, by ﬁrst

removing tags and breaking the text into segments of

5MB each, then extracting unique preﬁxes of the suf-

ﬁx strings at non-space letters for every segment, and

keeping the ﬁrst 32 letters if the preﬁx string is longer

than 32. WSJ1 can be used for keyword-based document searches (Baeza-Yates and Ribeiro-Neto, 1999)

or substring searches (Gonnet et al., 1991) depending

on the starting boundaries (either words or letters) of

the sufﬁx strings. WSJ1 contained 15 million strings

and each string was associated with a four-byte inte-

ger as the descriptive information. The size of WSJ1

was 252MB.
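The extraction step described above can be approximated by the following sketch. The function name `index_strings` and its parameters are hypothetical, and several details are assumptions, since the text does not specify them: suffixes are allowed to run across word boundaries, whitespace is detected with `str.isspace`, and tag removal and segmentation are assumed to have happened already.

```python
def index_strings(segment, max_len=32, word_starts_only=False):
    """Unique suffix prefixes of a text segment, truncated to max_len letters.

    Suffixes start at every non-space letter, or only at word starts when
    word_starts_only is True (the keyword-based variant).
    """
    found = set()
    for i, ch in enumerate(segment):
        if ch.isspace():
            continue
        if word_starts_only and i > 0 and not segment[i - 1].isspace():
            continue
        found.add(segment[i:i + max_len])
    return sorted(found)
```

For a toy segment such as `"ab ab"` with `max_len=3`, starting at every letter gives four distinct strings, while `word_starts_only=True` keeps only the two that begin at a word boundary.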

Table 1: Split heuristics on disk utilization (databases: samples from WSJ1; table values in %)

DB size (MB)     50      100     150     200     250
SSplit           45.7    44.8    44.6    44.5    44.1
HD-Split         65.1    63.5    63.1    62.7    62.6
Improvement      42.5    41.7    41.5    40.1    42.0

3.1 Split Heuristics

One set of experiments is to show the effectiveness

of the split heuristics for building an HD-tree. Table

1 shows the comparison of the disk utilization (using

one disk block for each leaf node) between the SSplit,

which is a B$^+$-tree-like approach, and the HD-Split

(see Section 2.2.3). Note that the HD-Split adopted

two heuristics to improve the disk utilization. One is

to distinguish the SGL from the MGL, which allows

the preﬁx range to be “non-continuous”. The other is

to move groups to left or right sibling to avoid a split,


which dynamically adjusts the preﬁx-set of an MGL.

Table 1 shows that the HD-Split increases the disk utilization by more than 40% in relative terms (from about 45% to about 63%), which indicates the effectiveness of the grouping mechanism in the HD-Split.
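The improvement row of Table 1 appears to be the relative gain of the HD-Split over the SSplit; this reading is inferred from the numbers rather than stated in the text, and the symbols $U_{\text{HD-Split}}$ and $U_{\text{SSplit}}$ for the two disk utilizations are introduced here only for illustration. For the 50MB sample:

```latex
\[
\frac{U_{\text{HD-Split}} - U_{\text{SSplit}}}{U_{\text{SSplit}}}
  = \frac{65.1 - 45.7}{45.7}
  = \frac{19.4}{45.7}
  \approx 0.425 = 42.5\%
\]
```

In absolute terms, this corresponds to a gain of 19.4 percentage points of disk utilization.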

Figure 3: The relationship between the ANL and the available RAM as the percentage of the database size. (x-axis: RAM size as the percentage of the database size, 0 to 0.4; y-axis: average number of links; curves: DB 25MB, DB 50MB, DB 150MB, DB 250MB; DBs: samples from WSJ1.)

3.2 Query Performance

As described in Section 2.2.4, using linked disk

blocks, the HD-tree is scalable for any RAM size.

Figure 3 shows the relationship between the average

number of links (ANL) and the available RAM size

as the percentage of the database size (RAM/DB).

The ANL is the total number of linked disk blocks

divided by the number of linked leaf nodes. An ANL

value of zero means that each leaf node occupies one

disk block. It is shown that the ANL decreases as the

RAM/DB increases. Note that there exists a threshold in the figure, beyond which the curve becomes flat. The threshold is almost independent of the database size.

Figure 4: The relationship between the number of I/Os and the available RAM when the answer size is fixed. (x-axis: RAM size as the percentage of the database size, 0 to 0.4; y-axis: number of disk I/Os per query; curves: DB 25MB, DB 50MB, DB 150MB, DB 250MB; DBs: samples from WSJ1.)

When the ANL is greater than zero (i.e., linked disk blocks are used), the query performance of the HD-tree is closely related to the ANL. The curves in Figures 4 and 5, which plot the number of I/Os rather than the ANL, are similar to those in Figure 3. The threshold can be explained as follows: because of the logarithmic nature of the tree (i.e., a lower level contains fewer nodes), as the HD-tree grows, adding the same amount of RAM (i.e., adding a certain number of leaf nodes) has less impact on the selectivity of the tree (i.e., the total number of leaf nodes). Therefore, when the available RAM is limited compared to the database size, it is important to allocate enough RAM to reach the threshold point, where the RAM is most effectively utilized.

Figure 5: The relationship between the number of I/Os and the available RAM when the answer size changes. (x-axis: RAM size as the percentage of the database size, 0 to 0.5; y-axis: number of disk I/Os per query; curves: DB 50MB, DB 100MB, DB 150MB, DB 250MB; DBs: samples from WSJ1.)

3.3 Comparisons

In this subsection, we evaluate the performance of the HD-tree by comparing it with that of the Prefix B-tree. The Prefix B-tree is widely adopted by database systems and has been shown to be a practical technique for indexing large string databases. The Prefix B-tree we used was the implementation in the popular Berkeley DB (Sleepycat, 2004), which is an open source database system. As a disk-based index structure, the Prefix B-tree does not by itself require any RAM for its nodes, while the HD-tree requires a certain amount of RAM to keep its internal nodes. For a fair comparison, we gave the Prefix B-tree the same amount of RAM as used by the HD-tree, to serve as a cache. The caching algorithm is based on the popular LRU (least-recently-used) heuristic, which is used by almost all commercial database systems because of its simplicity and effectiveness. The LRU algorithm keeps recently accessed internal nodes in the RAM to reduce the number of disk I/Os.
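For readers unfamiliar with the LRU heuristic, a minimal sketch of a block cache that counts disk reads is given below. It is not the Berkeley DB cache; the class name `LRUCache`, the `read_from_disk` callback, and the capacity measured in blocks are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache of disk blocks that counts cache misses."""

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of cached blocks
        self.blocks = OrderedDict()       # block id -> data, oldest first
        self.disk_reads = 0

    def get(self, block_id, read_from_disk):
        """Return the block, reading it from disk only on a cache miss."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.disk_reads += 1
        data = read_from_disk(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return data
```

With a capacity of two blocks and the access sequence 1, 2, 1, 3, 2, only the third access is a hit: block 2 is evicted when block 3 arrives, so it must be read again, for four disk reads in total. This is the locality sensitivity that the experiments below examine.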

We ﬁrst compared the disk I/Os between the HD-

tree and the Preﬁx B-tree using 1000 queries with dif-

ferent numbers of distinctive queries. This set of ex-

periments was designed to evaluate the effect of the

locality of the query results on the performance of the

HD-tree and the Preﬁx B-tree. The queries are gen-

erated as follows: (1) select a certain number of dis-

tinctive queries to form a query pool; (2) randomly


generate 1000 queries from the query pool. In one

extreme case, the 1000 queries are all the same. As

the number of distinctive queries increases, the degree of locality in the query results decreases. The other

extreme is when all 1000 queries are different.
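The two-step query generation can be sketched as follows. The function name `make_workload`, the fixed random seed, and sampling from the pool with replacement are assumptions; the text does not say how the 1000 queries were drawn from the pool. When the pool is as large as the workload, the sketch uses every pooled query exactly once, so that the "all different" extreme is reproduced.

```python
import random


def make_workload(candidates, num_distinct, total=1000, seed=0):
    """Draw `total` queries from a pool of `num_distinct` distinct queries."""
    rng = random.Random(seed)
    # Step 1: select the distinct queries that form the pool.
    pool = rng.sample(candidates, num_distinct)
    if num_distinct >= total:
        # Extreme case: every query in the workload is different.
        rng.shuffle(pool)
        return pool[:total]
    # Step 2: draw the workload from the pool (with replacement, by assumption).
    return [rng.choice(pool) for _ in range(total)]
```

A smaller pool yields more repeated queries and therefore higher locality, which favors an LRU cache.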

Figure 6: I/O comparison for different query localities; average query length is 6. (x-axis: number of distinctive queries, 1 to 1000; y-axis: total number of disk I/Os, 2500 to 15000; curves: HD-tree, Prefix B-tree.)

As shown in Figure 6, the performance of the Prefix B-tree is better when the number of distinctive queries is small. However, as the number of distinctive queries increases, the performance of the Prefix B-tree deteriorates quickly. The two curves cross between 10 and 20 distinctive queries, where the HD-tree starts to outperform the Prefix B-tree. For 1000 distinctive queries, the HD-tree needs almost three times fewer disk I/Os than the Prefix B-tree. The results show that the performance of the Prefix B-tree using the LRU caching mechanism is very sensitive to the locality of the query results. On the other hand, the HD-tree is quite robust across different queries. We conclude that the advantage of the HD-tree grows as the queries become more diverse. In the following I/O comparisons, we used 1000 random distinctive queries.

Figure 7: I/O comparison for different RAM sizes; average query string length is 8. (x-axis: RAM size (KB), 500 to 2000; y-axis: number of disk I/Os per query; curves: HD-tree, Prefix B-tree.)

In Figures 7 and 8, we compare the performance of the HD-tree and the Prefix B-tree for different RAM sizes. Figure 7 shows that the HD-tree not only reduces the number of I/Os, but also uses the RAM more effectively than the caching mechanism adopted by the Prefix B-tree. For example, as the RAM increases from 250KB to 1.6MB, the HD-tree reduces the number of I/Os by more than 50%, whereas the Prefix B-tree reduces it by less than 20%. For the given database WSJ1 (252MB) and 1.6MB of RAM, the HD-tree reaches its optimal state, where each leaf node occupies only one disk block. In Figure 8, the additional RAM given to the HD-tree serves as a cache, in the same way as for the Prefix B-tree. The HD-tree remains consistently better than the Prefix B-tree when a large amount of RAM is available. In Figure 9, we compare the number of I/Os for different query lengths. The HD-tree performs increasingly better than the Prefix B-tree as the query string length increases. Since the Prefix B-tree uses the same amount of RAM as the HD-tree to cache internal nodes, we conclude that a hybrid RAM/disk-based index structure (e.g., the HD-tree) is better than a disk-based structure combined with caching (e.g., the Prefix B-tree plus LRU caching), especially when queries are more distinctive.

Figure 8: I/O comparison for different RAM sizes; average query string length is 6. (x-axis label: "RAM Size (Mb)"; y-axis: number of disk I/Os per query; curves: HD-tree, Prefix B-tree.)

Figure 9: I/O comparison for different query lengths; y-axis is in logarithmic scale. (x-axis: average query string length, 2 to 9; y-axis: number of disk I/Os per query, 100 to 1000; curves: HD-tree, Prefix B-tree.)

Finally, we compared the HD-tree with the Prefix B-tree in terms of total running time, including both the RAM processing time and the I/O time. The experiments were conducted in the same computing environment (a Linux PC with 512MB RAM and a 1.8GHz Pentium 4 processor). Figure 10 shows the running time of the HD-tree and the Prefix B-tree for 1000 queries with different numbers of distinctive queries. We notice that the actual running time of the HD-tree is comparable to that of the Prefix B-tree even when the 1000 queries are the same. The reason is that with a large amount of RAM available, the operating system provides LRU caching for the HD-tree as well. The HD-tree becomes increasingly faster than the Prefix B-tree as the number of distinctive queries increases. For 1000 distinctive queries, the HD-tree is more than an order of magnitude faster than the Prefix B-tree.

Figure 10: Running time comparison; average query string length is 6. (x-axis: number of distinctive queries, 1 to 1000; y-axis: total running time in seconds; curves: HD-tree, Prefix B-tree.)

4 CONCLUSION

There is an increasing demand for efﬁcient index-

ing techniques to support various types of queries

on large string databases. Most existing string in-

dexing techniques are either RAM-based or disk-

based. RAM-based index structures are not suitable

for string matching queries on large databases when

only a limited amount of RAM is available. Disk-

based structures, on the other hand, can index large

databases but usually do not fully utilize the available

RAM.

The HD-tree is proposed as a novel hybrid RAM/disk-based structure, taking advantage of the strengths of both RAM-based and disk-based structures. The HD-tree not only scales well with the sizes of the RAM and the database, but is also efficient for various types of queries. The experimental results show that the HD-tree outperforms the Prefix B-tree for prefix and substring searches. For random distinctive queries, the number of disk I/Os is reduced by a factor of two to three, while the running time is reduced by an order of magnitude. Therefore, we conclude that a hybrid RAM/disk-based index structure such as the HD-tree is promising for supporting efficient searches in large string databases whose indexes cannot fit entirely in the RAM.

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley Longman Publishing Co. Inc.

Bayer, R. and McCreight, E. M. (1972). Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173–189.

Bayer, R. and Unterauer, K. (1977). Prefix B-trees. ACM Trans. Database Syst., 2(1):11–26.

Clark, D. R. and Munro, J. I. (1996). Efficient suffix trees on secondary storage. In Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 383–391, Atlanta, Georgia, United States. Society for Industrial and Applied Mathematics.

Comer, D. (1979). Ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137.

Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979). Extendible hashing: a fast access method for dynamic files. ACM Trans. Database Syst., 4(3):315–344.

Ferragina, P. and Grossi, R. (1999). The String B-tree: A new data structure for string search in external memory and its applications. J. Assoc. Comput. Mach., 46(2):236–280.

Gonnet, G. H., Baeza-Yates, R. A., and Snider, T. (1991). Lexicographical indices for text: Inverted files vs. PAT trees. Technical Report OED-91-01, University of Waterloo.

Manber, U. and Myers, G. (1990). Suffix arrays: a new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327. Society for Industrial and Applied Mathematics.

McCreight, E. M. (1976). A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272.

Morrison, D. R. (1968). PATRICIA: practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534.

Sleepycat (2004). Berkeley DB. http://www.sleepycat.com/.

Voorhees, E. M. and Harman, D. (1997). Overview of the sixth Text REtrieval Conference (TREC-6). In Proceedings of the Sixth Text REtrieval Conference, pages 1–24. NIST Special Publication.

Weiner, P. (1973). Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, pages 1–11. IEEE.

Xue, Q., Pramanik, S., Qian, G., and Zhu, Q. (2004). The hybrid RAM/disk-based index structure. Technical report, Department of CSE, Michigan State University.
