A LOSSLESS COMPRESSION ALGORITHM FOR DNA SEQUENCES

Taysir H. A. Soliman, Tarek F. Gharib, Alshaimaa Abo-Alian, Mohammed Alsharkawy

2008

Abstract

Homology search is the foundation of both genomics and proteomics research. However, the growing volume of DNA sequence data requires efficient computational algorithms for sequence comparison and analysis. Standard compression algorithms cannot compress DNA sequences effectively because they do not consider the special characteristics of DNA: sequences contain several approximate repeats, and complementary palindromes are frequent. Recently, new algorithms have been proposed to compress DNA sequences, often by detecting long approximate repeats. The current work proposes a Lossless Compression Algorithm (LCA) that provides a new encoding method. On nine different datasets, LCA achieves a better compression ratio than the existing DNA-oriented compression algorithms GenCompress and DNACompress.
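The DNA-specific properties named in the abstract can be illustrated with a minimal Python sketch. It is not the LCA encoding, which the abstract does not describe. It only shows what a complementary palindrome is and what the plain 2-bits-per-base baseline looks like for the four-letter alphabet. All function names here are hypothetical and chosen for illustration.

```python
# Illustrative sketch only; not the paper's LCA method.
# Assumes sequences contain only the uppercase bases A, C, G, T.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_complementary_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def pack_2bit(seq: str) -> bytes:
    """Pack bases at 2 bits each, four bases per byte.

    The last byte is zero-padded on the right; a real format would
    also need to store the sequence length to undo the padding.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    print(is_complementary_palindrome("GAATTC"))
    print(len(pack_2bit("ACGTACGTAC")))
```

A general-purpose byte-oriented compressor does not look for reverse-complement matches, and on a four-symbol alphabet it often fails to beat this 2-bit baseline, which is why DNA-oriented compressors search for approximate repeats and complementary palindromes explicitly.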



Paper Citation


in Harvard Style

Soliman, T. H. A., Gharib, T. F., Abo-Alian, A. and Alsharkawy, M. (2008). A LOSSLESS COMPRESSION ALGORITHM FOR DNA SEQUENCES. In Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-989-8111-37-1, pages 435-441. DOI: 10.5220/0001683504350441


in BibTeX Style

@conference{iceis08,
author={Taysir H. A. Soliman and Tarek F. Gharib and Alshaimaa Abo-Alian and Mohammed Alsharkawy},
title={A LOSSLESS COMPRESSION ALGORITHM FOR DNA SEQUENCES},
booktitle={Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2008},
pages={435-441},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001683504350441},
isbn={978-989-8111-37-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - A LOSSLESS COMPRESSION ALGORITHM FOR DNA SEQUENCES
SN - 978-989-8111-37-1
AU - Soliman, T. H. A.
AU - Gharib, T. F.
AU - Abo-Alian, A.
AU - Alsharkawy, M.
PY - 2008
SP - 435
EP - 441
DO - 10.5220/0001683504350441