THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE FOR LARGE STRING DATABASES

Qiang Xue, Sakti Pramanik, Gang Qian, Qiang Zhu

Abstract

There is an increasing demand for efficient indexing techniques to support queries on large string databases. In this paper, a hybrid RAM/disk-based index structure, called the Hybrid Digital tree (HD-tree), is proposed. The HD-tree keeps internal nodes in RAM to minimize the number of disk I/Os, while maintaining leaf nodes on disk to maximize the tree's capacity to index large databases. Experimental results using real data show that the HD-tree outperforms the Prefix B-tree for prefix and substring searches. In particular, for distinctive random queries in the experiments, the average number of disk I/Os was reduced by a factor of two to three, while the running time was reduced by an order of magnitude.
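
The split between RAM-resident internal nodes and disk-resident leaf nodes can be illustrated with a small sketch. The abstract does not specify the HD-tree's node layout or split policy, so the code below is not the authors' structure: it is a plain character trie whose routing nodes are Python dictionaries in memory and whose leaves are small JSON bucket files. The class name `HybridTrieSketch`, the `bucket_capacity` parameter, and the `disk_reads` counter are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


class HybridTrieSketch:
    """Toy hybrid index: routing nodes live in RAM, string buckets in files."""

    def __init__(self, directory, bucket_capacity=2):
        self.directory = directory
        self.bucket_capacity = bucket_capacity  # assumed leaf size limit
        self.next_leaf_id = 0
        self.disk_reads = 0  # bucket reads, a stand-in for disk I/Os
        self.root = self._new_leaf([])

    def _path(self, leaf_id):
        return os.path.join(self.directory, f"leaf_{leaf_id}.json")

    def _write(self, leaf_id, strings):
        with open(self._path(leaf_id), "w", encoding="utf-8") as f:
            json.dump(strings, f)

    def _read(self, leaf_id):
        self.disk_reads += 1
        with open(self._path(leaf_id), "r", encoding="utf-8") as f:
            return json.load(f)

    def _new_leaf(self, strings):
        leaf_id = self.next_leaf_id
        self.next_leaf_id += 1
        self._write(leaf_id, strings)
        return {"leaf": leaf_id}

    def _build(self, strings, depth):
        # Small enough: store the strings in one on-disk bucket.
        if len(strings) <= self.bucket_capacity:
            return self._new_leaf(strings)
        # Otherwise route on the character at `depth` ("" marks end of string).
        groups = {}
        for s in strings:
            groups.setdefault(s[depth:depth + 1], []).append(s)
        return {"children": {k: self._build(g, depth + 1)
                             for k, g in groups.items()}}

    def insert(self, s):
        parent, key, node, depth = None, None, self.root, 0
        while "children" in node:  # descend through RAM only
            parent, key = node, s[depth:depth + 1]
            if key not in parent["children"]:
                parent["children"][key] = self._new_leaf([])
            node, depth = parent["children"][key], depth + 1
        strings = self._read(node["leaf"])
        if s in strings:  # keep buckets duplicate-free
            return
        strings.append(s)
        if len(strings) <= self.bucket_capacity:
            self._write(node["leaf"], strings)
            return
        # Overflow: replace the bucket with an in-memory subtree of new buckets.
        os.remove(self._path(node["leaf"]))
        new_node = self._build(strings, depth)
        if parent is None:
            self.root = new_node
        else:
            parent["children"][key] = new_node

    def prefix_search(self, prefix):
        node, depth = self.root, 0
        while "children" in node and depth < len(prefix):
            node = node["children"].get(prefix[depth])
            if node is None:
                return []
            depth += 1
        results, stack = [], [node]
        while stack:  # read every bucket below the matched node
            cur = stack.pop()
            if "children" in cur:
                stack.extend(cur["children"].values())
            else:
                results.extend(x for x in self._read(cur["leaf"])
                               if x.startswith(prefix))
        return sorted(results)


def demo():
    words = ["car", "card", "care", "cart", "cat",
             "dog", "dot", "do", "carbon"]
    with tempfile.TemporaryDirectory() as directory:
        index = HybridTrieSketch(directory, bucket_capacity=2)
        for w in words:
            index.insert(w)
        index.disk_reads = 0  # count only the reads made by the query
        hits = index.prefix_search("car")
        return hits, index.disk_reads
```

In this sketch a prefix query touches the disk only for the buckets under the matched trie node; all routing happens in memory. That is the general effect the abstract attributes to keeping internal nodes in RAM, though the measured gains reported there apply to the authors' HD-tree, not to this toy.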



Paper Citation


in Harvard Style

Xue Q., Pramanik S., Qian G. and Zhu Q. (2005). THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE FOR LARGE STRING DATABASES. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 972-8865-19-8, pages 115-121. DOI: 10.5220/0002518501150121


in Bibtex Style

@conference{iceis05,
author={Qiang Xue and Sakti Pramanik and Gang Qian and Qiang Zhu},
title={THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE FOR LARGE STRING DATABASES},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2005},
pages={115-121},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002518501150121},
isbn={972-8865-19-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - THE HYBRID DIGITAL TREE: A NEW INDEXING TECHNIQUE FOR LARGE STRING DATABASES
SN - 972-8865-19-8
AU - Xue Q.
AU - Pramanik S.
AU - Qian G.
AU - Zhu Q.
PY - 2005
SP - 115
EP - 121
DO - 10.5220/0002518501150121