SINGULAR VALUE DECOMPOSITION (SVD) AND BLAST - Quite Different Methods Achieving Similar Results

Bráulio Roberto Gonçalves Marinho Couto, Macelo Matos Santoro, Marcos Augusto dos Santos

2011

Abstract

The dominant methods for finding relevant patterns in protein sequences are based on character-by-character matching, performed by software known as BLAST. In this paper, sequences are recoded as a p-peptide frequency matrix that is reduced by singular value decomposition (SVD). The objective is to evaluate the association between the statistics used by BLAST and the similarity metrics used by SVD (Euclidean distance and cosine). We chose BLAST as the standard because this string-matching program is widely used for searching nucleotide and protein databases. Three datasets were used: mitochondrial-gene sequences, non-identical PDB sequences and a Swiss-Prot protein collection. We built scatter plots and calculated the Spearman correlation (ρ) between the metrics produced by BLAST and by SVD. Euclidean distance was negatively correlated with bit score (ρ < -0.6) and positively correlated with E value (ρ > +0.7). Cosine was negatively correlated with E value (ρ < -0.7) and positively correlated with bit score (ρ > +0.8). We also ran agreement tests between SVD and BLAST in classifying protein families. For the mitochondrial gene database, we achieved a kappa coefficient of 1.0. For the Swiss-Prot sample, agreement was higher than 80%. The strong correlation between SVD and BLAST results suggests that SVD may serve as a core technique within a broader algorithm.
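
The pipeline summarized in the abstract (p-peptide counts, SVD reduction, then cosine and Euclidean distance between sequences) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the choice of p = 2, the rank k = 2, the toy sequences, and all function names are assumptions made only for illustration.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_frequency_matrix(sequences, p=2):
    """Count p-peptides: rows are the 20**p possible peptides, columns are sequences."""
    index = {"".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=p))}
    matrix = np.zeros((len(index), len(sequences)))
    for col, seq in enumerate(sequences):
        for start in range(len(seq) - p + 1):
            row = index.get(seq[start:start + p])
            if row is not None:  # skip windows containing non-standard residues
                matrix[row, col] += 1
    return matrix


def reduce_by_svd(matrix, k):
    """Return one k-dimensional vector per sequence (one row per sequence)."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Coordinates of each column in the rank-k space: S_k * V_k^T, transposed.
    return (s[:k, None] * vt[:k]).T


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


# Hypothetical toy sequences, not taken from the paper's datasets.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQ",  # reference sequence
    "MKTAYIAKQRQISFVKSHFSRL",  # near-copy of the reference
    "GGSGGSGGSPPWWPPWWPPWWC",  # shares no dipeptide with the reference
]
vectors = reduce_by_svd(peptide_frequency_matrix(sequences, p=2), k=2)

print("cosine(ref, near-copy):", cosine(vectors[0], vectors[1]))
print("cosine(ref, unrelated):", cosine(vectors[0], vectors[2]))
print("euclid(ref, near-copy):", euclidean(vectors[0], vectors[1]))
print("euclid(ref, unrelated):", euclidean(vectors[0], vectors[2]))
```

In this toy setting, the near-copy should sit close to the reference (high cosine, small Euclidean distance) while the unrelated sequence should not, which mirrors the direction of the correlations with BLAST bit score and E value reported above.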


Paper Citation


in Harvard Style

Roberto Gonçalves Marinho Couto B., Matos Santoro M. and Augusto dos Santos M. (2011). SINGULAR VALUE DECOMPOSITION (SVD) AND BLAST - Quite Different Methods Achieving Similar Results. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011) ISBN 978-989-8425-36-2, pages 189-195. DOI: 10.5220/0003162301890195


in Bibtex Style

@conference{bioinformatics11,
author={Bráulio Roberto Gonçalves Marinho Couto and Macelo Matos Santoro and Marcos Augusto dos Santos},
title={SINGULAR VALUE DECOMPOSITION (SVD) AND BLAST - Quite Different Methods Achieving Similar Results},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)},
year={2011},
pages={189-195},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003162301890195},
isbn={978-989-8425-36-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)
TI - SINGULAR VALUE DECOMPOSITION (SVD) AND BLAST - Quite Different Methods Achieving Similar Results
SN - 978-989-8425-36-2
AU - Roberto Gonçalves Marinho Couto B.
AU - Matos Santoro M.
AU - Augusto dos Santos M.
PY - 2011
SP - 189
EP - 195
DO - 10.5220/0003162301890195