ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS

Michaël Vyverman, Joachim De Schrijver, Wim Van Criekinge, Peter Dawyndt, Veerle Fack

2011

Abstract

With the rise of high-throughput sequencing, new programs have been developed to align huge amounts of short-read data to reference genomes. Recent developments in sequencing technology allow longer reads, but the mappers designed for short reads are not suited to reads of several hundred base pairs. We propose an algorithm for mapping longer reads, which is based on chaining maximal exact matches and uses heuristics and the Needleman-Wunsch algorithm to bridge the gaps between them. To compute maximal exact matches we use a specialized index structure called an enhanced suffix array. The proposed algorithm is very accurate and can handle long reads with mutations and long insertions and deletions.
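As an illustration of the gap-bridging step mentioned in the abstract, the following is a minimal sketch of Needleman-Wunsch global alignment applied to the two short sequences lying between consecutive exact matches. It is not the authors' implementation: the function name `needleman_wunsch` and the scoring parameters (`match`, `mismatch`, `gap`, with a linear gap penalty) are assumptions chosen for illustration, and the chaining of maximal exact matches and the enhanced suffix array are not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, calling `needleman_wunsch("ACGT", "AGT")` aligns the read fragment `ACGT` against the reference fragment `AGT` with a single gap. Because the dynamic programming table is quadratic in the fragment lengths, this kind of alignment is only practical when the gaps between chained matches are short.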

References

  1. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2:53-86.
  2. Bray, N., Dubchak, I., and Pachter, L. (2003). AVID: a global alignment program. Genome Research, 13:97-102.
  3. Friedenson, B. (2007). The BRCA1/2 pathway prevents hematologic cancers in addition to breast and ovarian cancers. BMC Cancer, 7:152.
  4. Gusfield, D. (1997). Algorithms on strings, trees, and sequences. Cambridge University Press, 32 Avenue of the Americas, New York, NY 10013-2473, USA, 11th edition.
  5. Hoffmann, S., Otto, C., Kurtz, S., Sharma, C., Khaitovich, P., Vogel, J., Stadler, P., and Hackermüller, J. (2009). Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology, 5(9):e1000502.
  6. Kärkkäinen, J. and Sanders, P. (2003). Simple linear work suffix array construction. In Proceedings of the 30th International Conference on Automata, Languages and Programming, volume 2719 of Lecture Notes in Computer Science, pages 943-955. Springer-Verlag.
  7. Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K. (2001). Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM 01), volume 2089 of Lecture Notes in Computer Science, pages 181-192. Springer-Verlag.
  8. Khan, Z., Bloom, J., Kruglyak, L., and Singh, M. (2009). A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics, 25(13):1609-1616.
  9. Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. (2004). Versatile and open software for comparing large genomes. Genome Biology, 5:R12.
  10. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25.
  11. Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754-1760.
  12. Li, H. and Durbin, R. (2010). Fast and accurate long read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589-595.
  13. Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics, 24:713-714.
  14. Maaß, M. (2007). Computing suffix links for suffix trees and arrays. Information Processing Letters, 101:250-254.
  15. Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453.
  16. Weese, D., Emde, A.-K., Rausch, T., Döring, A., and Reinert, K. (2009). RazerS - fast read mapping with sensitivity control. Genome Research, 19:1646-1654.

Paper Citation


in Harvard Style

Vyverman M., De Schrijver J., Van Criekinge W., Dawyndt P. and Fack V. (2011). ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011), ISBN 978-989-8425-36-2, pages 102-107. DOI: 10.5220/0003126201020107


in Bibtex Style

@conference{bioinformatics11,
author={Michaël Vyverman and Joachim De Schrijver and Wim Van Criekinge and Peter Dawyndt and Veerle Fack},
title={ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)},
year={2011},
pages={102-107},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003126201020107},
isbn={978-989-8425-36-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)
TI - ACCURATE LONG READ MAPPING USING ENHANCED SUFFIX ARRAYS
SN - 978-989-8425-36-2
AU - Vyverman M.
AU - De Schrijver J.
AU - Van Criekinge W.
AU - Dawyndt P.
AU - Fack V.
PY - 2011
SP - 102
EP - 107
DO - 10.5220/0003126201020107