Two Novel Techniques for Space Compaction on Biological Sequences

George Volis, Christos Makris, Andreas Kanavos

2016

Abstract

The number and size of genomic databases have grown rapidly in recent years. Consequently, the number of Internet-accessible databases has also been growing rapidly. There is therefore a need for satisfactory methods for managing this growing body of information, and a lot of effort has been put in this direction. Contributing to this effort, this paper presents two algorithms that reduce the amount of space needed for storing genomic information. Our first algorithm is based on the classic n-grams/2L technique for indexing a DNA sequence, and it converts the inverted index of this classic algorithm to a more compressed format. Researchers have revealed the existence of repeated and palindromic patterns in the DNA of living organisms. This observation is the main motivation of the technique, which proposes an alternative data structure for handling these sequences. Our experimental results show that our algorithm achieves a more space-efficient index than the n-grams/2L algorithm and can be adopted by any algorithm that is based on n-grams/2L. The second algorithm is based on the n-grams technique. Treating the four symbols of the DNA alphabet as the vertices of a square, it represents a DNA sequence as a relation between the vertices, sides, and diagonals of that square. The experimental results show that this second idea achieves even better compression of our index structure.
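
To make the starting point concrete, the following is a minimal sketch of a plain (single-level) n-gram inverted index over a DNA string. It is not the n-grams/2L structure and not the compressed format proposed in the paper; the names `build_ngram_index` and `reverse_complement` are hypothetical and introduced only for illustration. The reverse-complement helper merely shows the kind of palindromic redundancy the abstract mentions; how the paper actually exploits it is not specified here, so no such merging is attempted.

```python
from collections import defaultdict


def build_ngram_index(sequence: str, n: int) -> dict[str, list[int]]:
    """Map every n-gram of `sequence` to the ascending list of offsets where it starts."""
    index = defaultdict(list)
    for offset in range(len(sequence) - n + 1):
        index[sequence[offset:offset + n]].append(offset)
    return dict(index)


def reverse_complement(ngram: str) -> str:
    """Return the reverse complement of a DNA n-gram (A<->T, C<->G, then reversed)."""
    return ngram.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Illustrative input only; not data from the paper.
index = build_ngram_index("ACGTACGT", 3)
for ngram, offsets in sorted(index.items()):
    print(ngram, offsets, "reverse complement:", reverse_complement(ngram))
```

In this toy input the n-grams `ACG` and `CGT` are reverse complements of each other, as are `GTA` and `TAC`, so a plain index stores posting lists for pairs of n-grams that are closely related.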



Paper Citation


in Harvard Style

Volis G., Makris C. and Kanavos A. (2016). Two Novel Techniques for Space Compaction on Biological Sequences. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-186-1, pages 105-112. DOI: 10.5220/0005801101050112


in Bibtex Style

@conference{webist16,
author={George Volis and Christos Makris and Andreas Kanavos},
title={Two Novel Techniques for Space Compaction on Biological Sequences},
booktitle={Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2016},
pages={105-112},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005801101050112},
isbn={978-989-758-186-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Two Novel Techniques for Space Compaction on Biological Sequences
SN - 978-989-758-186-1
AU - Volis G.
AU - Makris C.
AU - Kanavos A.
PY - 2016
SP - 105
EP - 112
DO - 10.5220/0005801101050112