De-Novo Assembly of Short Reads in Minimal Overlap Model

Shashank Sharma, Ankit Singhal

2015

Abstract

Next Generation Sequencing (NGS) technologies produce millions of short reads that provide high coverage of the genome at a much lower cost than Sanger sequencing based technologies. The advent of NGS technologies has led to various developments in assembly techniques. Our focus is on adapting overlap graph based algorithms to work with millions of NGS reads. Because NGS reads cover the genome so densely, we show that it is feasible to perform assembly while working with small overlaps. This strategy gives a significant advantage in computation time and memory over existing approaches. Our method finds alternate paths in an overlap graph to construct an assembly. We compare the performance of our tool, MOBS, with some widely used assemblers on ideal datasets (error-free reads distributed uniformly over the genome) for which finished genomes are available. In most of these comparisons, MOBS performs better than the other assemblers with respect to assembly quality, running time and genome coverage.
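To make the notion of an overlap graph with a minimum overlap threshold concrete, the following is a minimal sketch and not the MOBS implementation. It assumes error-free reads on a single strand and uses a naive all-pairs comparison; the function names `overlap_length` and `build_overlap_graph`, the parameter `min_overlap`, and the toy reads are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no such overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the maximal overlap.
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """Naive all-pairs overlap graph (illustrative only, quadratic in reads).

    graph[i][j] = k means a k-base suffix of reads[i] matches a prefix of
    reads[j], with k >= min_overlap.
    """
    graph = defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_overlap)
                if length > 0:
                    graph[i][j] = length
    return graph


# Toy reads, invented for this example.
reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads, min_overlap=3)
```

With these toy reads and a threshold of 3, the graph contains the edges 0 -> 1 and 1 -> 2, each with an overlap of 3 bases. Raising `min_overlap` to 4 removes both edges, which shows how the threshold controls how many overlaps are kept. A practical assembler would replace the quadratic loop with an indexed suffix-prefix search.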



Paper Citation


in Harvard Style

Sharma S. and Singhal A. (2015). De-Novo Assembly of Short Reads in Minimal Overlap Model. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 44-54. DOI: 10.5220/0005214100440054


in Bibtex Style

@conference{bioinformatics15,
author={Shashank Sharma and Ankit Singhal},
title={De-Novo Assembly of Short Reads in Minimal Overlap Model},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={44-54},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005214100440054},
isbn={978-989-758-070-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - De-Novo Assembly of Short Reads in Minimal Overlap Model
SN - 978-989-758-070-3
AU - Sharma S.
AU - Singhal A.
PY - 2015
SP - 44
EP - 54
DO - 10.5220/0005214100440054