Annotated Trees and their Applications to XML Compression

Tomasz Müldner, Jan Krzysztof Miziołek, Tyler Corbin

2014

Abstract

Permutation-based XML-conscious compressors permute the input document to improve the compression ratio and to support efficient operations, such as queries and updates. One such compressor, XSAQCT, uses the properties of the permuted document, called an annotated tree, to support these operations. This paper provides the formal background for the definition of an annotated tree of an XML document D. It also provides an algorithm for creating an annotated tree from an XML document, together with the reverse algorithm, and discusses a measure of compressibility based on annotated trees. The theoretical and algorithmic treatment is followed by experimental results showing the compressibility of annotated trees and by a general analysis of semi-structured data and XML compression.



Paper Citation


in Harvard Style

Müldner T., Miziołek J. and Corbin T. (2014). Annotated Trees and their Applications to XML Compression. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-023-9, pages 27-39. DOI: 10.5220/0004839900270039


in Bibtex Style

@conference{webist14,
author={Tomasz Müldner and Jan Krzysztof Miziołek and Tyler Corbin},
title={Annotated Trees and their Applications to XML Compression},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2014},
pages={27-39},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004839900270039},
isbn={978-989-758-023-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Annotated Trees and their Applications to XML Compression
SN - 978-989-758-023-9
AU - Müldner T.
AU - Miziołek J.
AU - Corbin T.
PY - 2014
SP - 27
EP - 39
DO - 10.5220/0004839900270039