Anna Gambin, Sławomir Lasota, Michał Startek, Maciej Sykulski, Laurent Noé, Gregory Kucherov


The seeding technique has become central to the theory of sequence alignment, and several efficient tools apply seeds to DNA homology search. Recently, the concept of subset seeds was proposed for similarity search in protein sequences. We experimentally evaluate the applicability of subset seeds to protein homology search and advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds specifically designed for a given protein family. We develop a representation of seeds by deterministic finite automata (DFAs) and build it into the NCBI-BLAST software. The extended tool, named SeedBLAST, is compared with the original NCBI-BLAST on the GPCR protein family. Our results demonstrate a clear superiority of SeedBLAST in terms of efficiency, especially for twilight zone hits. SeedBLAST is open-source software and is freely available. Supplementary material and a user manual are also provided.
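To make the notion of a subset seed concrete, the following is a minimal sketch, not the SeedBLAST implementation. It assumes that each seed letter corresponds to a partition of the amino acids (one level of a hierarchical tree) and that two residues match under a letter when they fall in the same block. The residue groupings, the letter symbols `#`, `@`, `_`, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues only


def make_letter(groups: Sequence[str]) -> Dict[str, int]:
    """Build a seed letter as a partition: residue -> block index.

    Residues not listed in any group are placed in singleton blocks.
    """
    block: Dict[str, int] = {}
    for idx, group in enumerate(groups):
        for residue in group:
            block[residue] = idx
    next_idx = len(groups)
    for residue in AMINO_ACIDS:
        if residue not in block:
            block[residue] = next_idx
            next_idx += 1
    return block


# Three nested partitions, i.e. three levels of a (hypothetical) tree:
# '#' requires identical residues, '@' accepts residues from the same
# illustrative group, '_' accepts any pair of residues.
LETTERS: Dict[str, Dict[str, int]] = {
    "#": make_letter([]),
    "@": make_letter(["ILMV", "FWY", "KRH", "DE", "ST", "NQ"]),
    "_": make_letter([AMINO_ACIDS]),
}


def seed_hits(seed: str, query: str, subject: str) -> List[Tuple[int, int]]:
    """Return all (i, j) such that the seed hits query[i:] and subject[j:].

    A hit means that at every seed position the aligned residue pair
    lies in the same block of that position's partition. This is a
    brute-force scan for clarity; it is not an automaton-based search.
    """
    k = len(seed)
    hits = []
    for i in range(len(query) - k + 1):
        for j in range(len(subject) - k + 1):
            if all(
                LETTERS[c][query[i + t]] == LETTERS[c][subject[j + t]]
                for t, c in enumerate(seed)
            ):
                hits.append((i, j))
    return hits
```

For instance, with the seed `#@_`, the words `KLA` and `KVG` produce a hit at offset (0, 0) because K matches K exactly, L and V share a group, and the last position accepts any pair; replacing the leading K with R in the second word removes the hit. Sequences containing non-standard residue codes are not handled by this sketch.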



Paper Citation

in Harvard Style

Gambin A., Lasota S., Startek M., Sykulski M., Noé L. and Kucherov G. (2011). SUBSET SEED EXTENSION TO PROTEIN BLAST. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011) ISBN 978-989-8425-36-2, pages 149-158. DOI: 10.5220/0003147601490158

in Bibtex Style

author={Anna Gambin and Sławomir Lasota and Michał Startek and Maciej Sykulski and Laurent Noé and Gregory Kucherov},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)},

in EndNote Style

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2011)
SN - 978-989-8425-36-2
AU - Gambin A.
AU - Lasota S.
AU - Startek M.
AU - Sykulski M.
AU - Noé L.
AU - Kucherov G.
PY - 2011
SP - 149
EP - 158
DO - 10.5220/0003147601490158