SKraken: Fast and Sensitive Classification of Short Metagenomic Reads based on Filtering Uninformative k-mers

Davide Marchiori, Matteo Comin

2017

Abstract

The study of microbial communities is an emerging field that is revolutionizing many disciplines, from ecology to medicine. The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species in the sample and their relative abundances. Many tools have been developed in recent years; however, their precision and speed are not always adequate for these very large datasets. In this work we present SKraken, an efficient approach to accurately classify metagenomic reads against a set of reference genomes, e.g. the NCBI/RefSeq database. SKraken is based on k-mer statistics combined with the taxonomic tree. Given a set of target genomes, SKraken detects the most representative k-mers for each species and filters out uninformative k-mers. The classification performance on several synthetic and real metagenomic datasets shows that SKraken in most cases achieves better precision and recall than Kraken. In particular, at species-level classification, the estimation of the abundance ratios improves by 6% and the precision by 8%. This behavior is also confirmed on a real stool metagenomic sample, where SKraken detects species with high precision. Because it efficiently filters out uninformative k-mers, SKraken requires less RAM and is faster than Kraken, one of the fastest tools.

Availability: https://bitbucket.org/marchiori_dev/skraken

Corresponding Author: comin@dei.unipd.it
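The idea in the abstract of combining k-mer statistics with the taxonomic tree and then filtering out uninformative k-mers can be illustrated with a small sketch. The abstract does not state the filtering criterion, so the score used here is an assumption made purely for illustration: each k-mer is mapped to the lowest common ancestor (LCA) of the species that contain it, and it is kept only if the fraction of species under that ancestor that actually contain it reaches a threshold. The toy taxonomy, the function names and the `min_score` parameter are all hypothetical and are not taken from the SKraken code.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: child -> parent (the root maps to None).
PARENT = {
    "root": None,
    "genusA": "root",
    "genusB": "root",
    "sp1": "genusA",
    "sp2": "genusA",
    "sp3": "genusB",
}
SPECIES = {"sp1", "sp2", "sp3"}


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def path_to_root(node):
    """List the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(nodes):
    """Lowest common ancestor of a non-empty collection of taxa."""
    nodes = list(nodes)
    common = set(path_to_root(nodes[0]))
    for n in nodes[1:]:
        common &= set(path_to_root(n))
    # The first shared node on the way up from nodes[0] is the deepest one.
    for n in path_to_root(nodes[0]):
        if n in common:
            return n


def species_under(node):
    """All species whose lineage passes through `node`."""
    return {s for s in SPECIES if node in path_to_root(s)}


def build_filtered_index(genomes, k, min_score):
    """Map each retained k-mer to the LCA of the species containing it.

    Assumed score: (species containing the k-mer) divided by
    (species under its LCA). K-mers scoring below min_score are dropped.
    """
    owners = defaultdict(set)
    for sp, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(sp)

    index = {}
    for km, sps in owners.items():
        anc = lca(sps)
        score = len(sps) / len(species_under(anc))
        if score >= min_score:
            index[km] = anc
    return index


if __name__ == "__main__":
    toy_genomes = {"sp1": "ACGTT", "sp2": "ACGGA", "sp3": "CCGTT"}
    for km, taxon in sorted(build_filtered_index(toy_genomes, 3, 1.0).items()):
        print(km, taxon)
```

In this toy example, a k-mer shared by `sp1` and `sp3` but absent from `sp2` has its LCA at the root while covering only two of the three species below it, so it is discarded when `min_score` is 1.0. Dropping such k-mers also shrinks the index, which is consistent with the lower memory use the abstract reports, although the real tool may use a different criterion.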



Paper Citation


in Harvard Style

Marchiori D. and Comin M. (2017). SKraken: Fast and Sensitive Classification of Short Metagenomic Reads based on Filtering Uninformative k-mers. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017) ISBN 978-989-758-214-1, pages 59-67. DOI: 10.5220/0006150500590067


in Bibtex Style

@conference{bioinformatics17,
author={Davide Marchiori and Matteo Comin},
title={SKraken: Fast and Sensitive Classification of Short Metagenomic Reads based on Filtering Uninformative k-mers},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)},
year={2017},
pages={59-67},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006150500590067},
isbn={978-989-758-214-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)
TI - SKraken: Fast and Sensitive Classification of Short Metagenomic Reads based on Filtering Uninformative k-mers
SN - 978-989-758-214-1
AU - Marchiori D.
AU - Comin M.
PY - 2017
SP - 59
EP - 67
DO - 10.5220/0006150500590067