SUBSET SEED EXTENSION TO PROTEIN BLAST

Anna Gambin

1,2

, Sławomir Lasota

, Michał Startek

, Maciej Sykulski

Institute of Informatics, University of Warsaw, 2 Banacha, 02-097 Warsaw, Poland

Mossakowski Medical Research Centre Polish Academy of Sciences, 5 Pawinskiego, 02-106 Warsaw, Poland

Laurent No

, Gregory Kucherov

3,4

LIFL/CNRS/INRIA, B

at. M3, Campus Scientiﬁque, 59665 Villeneuve d’Ascq C

edex, France

J. -V. Poncelet Lab, Bolshoy Vlasyevsky Pereulok 11, 119002 Moscow, Russia

Keywords:

Sequence alignment, Protein BLAST, Subset seed, DFA, Genetic algorithm.

Abstract:

The seeding technique became central in the theory of sequence alignment and there are several efﬁcient tools

applying seeds to DNA homology search. Recently, a concept of subset seeds has been proposed for similarity

search in protein sequences.

We experimentally evaluate the applicability of subset seeds to protein homology search. We advocate the use

of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method computes, by an

evolutionary algorithm, seeds that are speciﬁcally designed for a given protein family. The representation of

seeds by deterministic ﬁnite automata (DFAs) is developed and built into the NCBI-BLAST software. This

extended tool, named SeedBLAST, is compared to the original NCBI-BLAST and PSI-BLAST on several

protein families. Our results demonstrate a superiority of SeedBLAST in terms of efﬁciency, especially in the

case of twilight zone hits.

SeedBLAST is an open source software freely available http://bioputer.mimuw.edu.pl/papers/sblast. Supple-

mentary material and user manual are also provided.

1 INTRODUCTION

Motivation. Since the time complexity of the opti-

mal alignment problem is quadratic (e.g., the Smith-

Waterman algorithm (Smith and Waterman, 1981)),

thus too large for everyday tasks, most of sequence

aligning is done using heuristics, typically with the

ubiquitous BLAST software (Altschul et al., 1990;

Altschul et al., 1997). It runs in three phases, and the

ﬁrst of which ﬁnds short initial alignments, so called

hot spots. Theory of seeds may be applied here, trying

to answer the question: which short aligned segments

are relevant for the indication of a true global align-

ment?

In the case of nucleotide sequences, spaced seeds

have been intensively investigated and have success-

ful applications: an improvement of BLASTN (Bre-

jova et al., 2004), sensitive alignment tools like Pat-

ternHunter (Ma et al., 2002; Li et al., 2004) and

Yass (Noe and Kucherov, 2005), automaton based

theory for modeling and analyzing seeds (Kucherov

et al., 2006; Buhler et al., 2005). The idea of using

multiple seeds is also widely recognized (Li et al.,

2004; Brejov

a et al., 2005; Sun and Buhler, 2004;

Kucherov et al., 2005). In this paper we attempt to

achieve similar results for protein homology search.

However, techniques based on spaced seeds seem not

to apply directly to protein sequences, as the alphabet

size is bigger (20 amino acids) and strict letter identity

is much less relevant.

Related Research. The ﬁrst phase of BLAST is a

search for hot spots, i.e., short initial alignments of

a query and a subject. Quite different methods are

applied to deﬁne a hot spot for DNA and protein se-

quences. In the case of DNA, a hot spot is a short

sequence of identically matching nucleotides. Ap-

plication of seeds enables the consideration of non-

identical matchings as well, and thus ﬁnding out pre-

viously overlooked good initial alignments. The re-

ported increase of the efﬁciency of homology search

is therefore quite expectable (Kisman et al., 2005;

Shiryev et al., 2007).

In the case of protein sequences, a hot spot is

149

Gambin A., Lasota S., Startek M., Sykulski M., Noé L. and Kucherov G..

SUBSET SEED EXTENSION TO PROTEIN BLAST.

DOI: 10.5220/0003147601490158

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 149-158

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

deﬁned through a cumulative contribution of amino

acid matches, not necessarily identical. A short se-

quence of such matches is considered a hot spot if

their additive contribution (score) exceeds a prede-

ﬁned threshold. It is thus not clear whether seed-

based approaches may measure up with the cumu-

lative scores in expressibility and effectiveness. A

ﬁrst attempt to compare the two approaches has been

done in (Roytberg et al., 2009), with the conclusion

that subset seeds (Kucherov et al., 2006) may of-

fer an attractive alternative to the “cumulative” ap-

proach of BLAST (cf. also discussion and references

therein concerning expressibility of different classes

of seeds). It is also argued that the algorithmic cost

may thus be reduced, as application of seeds allows

the use of a direct indexing scheme based on hash-

ing. In addition, a comparison of subset seed tech-

nique with the vector seed approach (Brown, 2004)

has been reported.

A fundamental notion in seed theory is an align-

ment alphabet, whose letters correspond to matching

two residues. In the case of nucleotide sequences, the

alignment alphabet has 6 (or 12, if directional) let-

ters. In the case of amino acid sequences, however,

the alignment alphabet has at least 200 letters, which

makes exploration of even medium length sequences

costly and difﬁcult. A way of approaching the prob-

lem is to reduce the alignment alphabet, exploit-

ing similarities among various amino acids (Royt-

berg et al., 2009). By applying the subset seed the

complexity of alignment description may be reduced,

while maintaining the biological information content.

The idea of subset seeds (Kucherov et al., 2006), can

be viewed as an intermediate concept between ordi-

nary spaced seeds and vector seeds. In this approach

different types of matches (or mismatches) are dis-

tinguished, as a seed letter corresponds to a subset

of matches. In the case of protein sequences, for

instance, it might be beneﬁcial to distinguish muta-

tions inside some predeﬁned amino acid groups (like

aliphatic, aromatic, tiny, etc. (Livingstone and Barton,

1993)) from mutations between these groups.

Deterministic ﬁnite automata (DFAs) ﬁnd many

applications related to homology search and seed

theory. Early BLAST implementations investigated

two hashing methods (Altschul et al., 1990) to efﬁ-

ciently manipulate the dictionary of hot spots. One

of these was a dictionary organized as a DFA; this

method was preferred until it has been abandoned

in 1997 in NCBI-BLAST (Neuwald, 1998). DFAs

are also known to be useful in constructing a per-

fect hashing function, in computing the value of a

hashing function, or retrieving a value from a dictio-

nary (Hopcroft and Ullman, 1979; Aho and Corasick,

1975). In (Kucherov et al., 2006) a special type of

DFA called probability transducer was used for com-

putation of seed sensitivity.

Recently the concept of hash seed for protein ho-

mology search has been proposed in (Li et al., 2009).

It also applies amino acid grouping (in this case based

on BLOSUM matrix) to designing good seeds that al-

low for a direct hashing scheme. The subset seeds, as

considered in this paper, can be regarded as another

(potentially more powerful) type of hash seeds.

Our Contribution. The overall aim of this paper is

to experimentally conﬁrm the value of applying seed-

based hot spot search, using the approach of (Royt-

berg et al., 2009). It appears to be especially inter-

esting in cases when the search is restricted to a par-

ticular protein family, as this opens the possibility of

designing specialized seeds in order to increase the ef-

ﬁciency of the homology search. In short, as our tech-

nical contribution we propose a method of computing

a well-performing multiple space seed, and present an

implementation of a new seed-based hot spot search

routine. Furthermore, we advocate the use of deter-

ministic ﬁnite automata (DFAs) as a seed representa-

tion. Finally, we experimentally conﬁrm a supremacy

of this new approach over the original NCBI-BLAST

hot spot search.

We investigate, and search for, reduced alignment

alphabets, called seed alphabets, that can be derived

from hierarchical trees of amino acids. Such trees

were designed, e.g., in (Li et al., 2003; Murphy et al.,

2000); for our purposes we compute, (by amino acids

clustering), a speciﬁc tree for a given protein family.

An advantage of using hierarchical trees is that the

alphabets are always transitive (i.e., each letter corre-

sponds to a transitive set of matching pairs) and thus

enable application of the direct hashing scheme.

We search for a well-performing alphabet and a

multiple subset seed over it with the use of an evolu-

tionary algorithm. The ﬁtness evaluation is based on

computing the seed sensitivity and selectivity in the

way suggested in (Kucherov et al., 2006).

The multiple seed, represented as a DFA, is then

used in the hot spot search of BLAST. We have im-

plemented an extension to the NCBI-BLAST soft-

ware, called SeedBLAST, that accepts a multiple sub-

set seed as its input parameter. The extension is

written in C++, relies on the template mechanism,

and is prone to compiler optimizations (most func-

tions can be inlined). An important advantage of

our implementation is that being developed within the

NCBI-BLAST framework, it inherits all stable and

tested features of this implementation.

The ﬁrst test results can be perceived as promis-

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

150

ing: although our multiple seed selection method is

rather simplistic, our tool returns more interesting hits

than the standard BLAST with comparable settings.

Some returned hits tend to be long although having

only medium E-value, the type of hits known to be

dimmed and not reported by BLAST. This kind of hits

is termed twilight zone after (Rost, 1999).

Furthermore, this methodology can be useful for

searching for particular type of alignments. Given a

set of alignments, one can construct a speciﬁc seed

automaton and perform database search for this cer-

tain type of alignments. Following this idea we in-

vestigated the ability to align known structurally ho-

mologous domains of the Rhodopsin family of G-

protein coupled receptors (GPCRs). The outcome of

our experiment showed a signiﬁcant difference be-

tween NCBI-BLAST and SeedBLAST, in favor to the

latter: our method yielded much longer alignments

covering up to 70% of the entire domain, even for pro-

teins sharing low sequence identity (20-30%).

2 SUBSET SEED DESIGN

General Approach. Given a protein family, we as-

sume that a small representative subset of this family

has already been aligned well (for example manually

by experts), and is available as a training set. The

algorithm designing subset seed attempt to extract in-

formation about the structure of the family from this

set, and use it to produce alignments for the entire

family. In the ﬁrst phase a hierarchical tree is con-

structed that represents similarities of amino acids.

Then, a seed alphabet is designed, along with a set

of seeds. This is a learning phase, and runs indepen-

dently of our BLAST enhancement. Next, the seed

alphabet along with the corresponding set of seeds is

used by the SeedBLAST algorithm to ﬁnd hot spots.

Afterwards, the computation of SeedBLAST follows

the standard BLAST scheme.

Hierarchical Tree of Amino Acids. Let Σ =

{A,C,D,...} be the amino acid alphabet (|Σ| = 20).

A valid hierarchical tree of amino acids is a binary

tree whose leaves are labeled bijectively by elements

of Σ, and whose every internal (non-leaf) node has

two children. An example of such a tree is shown in

Figure 1. The speciﬁc tree used e. g. for Rhodopsin

family is presented in supplementary material. Such

a tree constitutes a parameter in our approach; we as-

sume that it corresponds to some biologically signif-

icant hierarchical clustering of amino acid residues,

c.f. (Murphy et al., 2000; Li et al., 2003).

Any non-leaf node v of T is represented by a set

of (labels of) leaves in the subtree rooted in v. This

set is denoted by Σ

. In particular, the root is labeled

by the whole set Σ. There are precisely |Σ| − 1 = 19

non-leaf nodes.

Our basic intuition is as follows. Think of a leaf

labeled by A ∈ Σ as a representation of the exact

match A—A. Then a node v represents all matches

A—B for A,B ∈ Σ

The tree is obtained from the training set of align-

ments in the following way: ﬁrst, for each pair of

amino acids the number of times they have been

aligned one with another is counted, and then, using

those counts, the amino acids are hierarchically clus-

tered through neighbor-joining method.

Seed Alphabets and Seeds. From now on we as-

sume a ﬁxed hierarchical tree T . The tree nodes are

partially ordered by a natural ordering induced by the

tree structure (we call it tree ordering). This coin-

cides with the inclusion ordering of the labeling sets:

≤ v

⇐⇒Σ

⊆ Σ

. We assume here for techni-

cal convenience that the leaves are labeled by single-

tons {A} instead of single amino-acids A ∈ Σ. Below

we consider sets of nodes of T, ordered by inclusion

as well. Certain sets of nodes will be seed letters (po-

tential elements of a seed alphabet).

A seed letter is deﬁned as any subset α of nodes

such that:

(i) (maximality) α contains all leaves and

(ii) (downward closedness) whenever v ∈ α and

< v then v

∈ α.

Hence, a single seed letter α is deﬁned as a lower

set of a maximal antichain wrt. the tree ordering. This

antichain contains the maximal elements of α wrt. the

tree ordering and may be visualized by a horizontal

cut through the tree T. Seed letters are naturally or-

dered by inclusion. The smallest one is the ”exact

match” seed letter #, containing only the leaves. The

largest one is the ”don’t care” seed letter , containing

all the nodes of T. One particular seed letter, denoted

by @, is obtained by removing from the root node.

We place an additional restriction on alphabets that

we use, that they must contain both # and .

The maximal elements of a seed letter α wrt. the

tree ordering form a partition of Σ. Thus α repre-

sents naturally an equivalence relation on Σ: A and B

are related iff they belong jointly to some node of α;

i.e., iff there exists some v ∈ α such that A ∈ Σ

and

B ∈ Σ

. We feel free to write (A, B) ∈ α in this case.

The induced equivalence is identity relation in case of

# and full relation in case of . The inclusion order-

ing of seed letters coincides with the inclusion of the

induced equivalences.

SUBSET SEED EXTENSION TO PROTEIN BLAST

151

CFYWMLIVGPATSNHQEDRK

CFYWMLIV

CFYW

FYW

F Y

MLIV

M L

GPATSNHQEDRK

GPATS

G PATS

ATS

T S

NHQEDRK

NHQED

QED

Figure 1: The hierarchical tree of amino acids proposed by (Li et al., 2003).

Certain families of seed letters will be allowed as seed

alphabets. Essentially, we forbid two letters α

,α

that are incomparable by inclusion. A seed alphabet

is a family A of seed letters totally ordered by inclu-

sion: for each α

,α

∈ A, either α

⊆ α

or α

⊆ α

Alphabets with this property are called hierarchical

in (Roytberg et al., 2009). We used this assumption as

it leads to a nice mathematical formalization, namely

the family of seed alphabets forms a constrained inde-

pendence system (Korte and Hausmann, 1978; Cheng

and Xu, 1995). We show that even with this restric-

tion very efﬁcient seeds can be obtained. Thus, in this

paper we will not consider non-hierarchical alphabets.

Note that, again, the seed alphabets may be naturally

ordered by inclusion as well.

We deﬁne a seed over a seed alphabet A as a ﬁnite

word over A. A multiple seed is a pair consisting of a

seed alphabet and a set of seeds over that alphabet.

We say that a seed s = s

...s

aligns two amino

acid sequences a = a

...a

, b = b

...b

, if and

only if for all i ∈ {1, 2, ..., n}, (a

) ∈ s

Foreground sensitivity (or just sensitivity) of a

multiple seed M, denoted by sens

(M), is the num-

ber of positions in the training set of alignments

matched by at least one of the seeds from M, divided

by the total number of positions. Foreground sensitiv-

ity is computed directly from the training set.

Background sensitivity of a seed corresponds to

the probability of matching two aligned random se-

quences. We assume that the background model for

amino acid sequences is given as a Markov chain.

For our experiments, the Markov chain models of

orders 1,2 and 3 were learned from the TrEMBL

database (Boeckmann et al., 2003) using GenRGenS

Java tool (Ponty et al., 2006). The background sensi-

tivity of a seed was computed with the use of Marko-

vian probability transducer as described in (Kucherov

et al., 2006). Background sensitivity of a multiple

seed M, denoted by sens

(M), is estimated from

above by the sum of background sensitivities of each

of the individual seeds in M (the estimation is sharp

only if seed occurrences are independent).

Evolutionary Approach. Optimizing multiple

seeds is recognized as a highly non-trivial task (Yang

et al., 2004; Buhler et al., 2005; Ma and Yao,

2008). In the case of hierarchical subset seeds the

combinatorial structure of seed alphabets suggests

hardness of the optimization problem (see (Roytberg

et al., 2009) for details). Therefore we decided to use

an efﬁcient heuristic algorithm.

In the proposed approach seed alphabets and seeds

are simultaneously chosen through an application of a

genetic algorithm. The genetic algorithms are used to

solve various optimization problems (Mitchell, 1996).

They work by ﬁrst generating a random multiset (ini-

tial population) of potential solutions, evaluating the

function being optimized (ﬁtness function) for each

one of them, culling a percentage of them with low

values of such function, cloning and slightly altering

(mutating) the rest at random – and repeating this pro-

cess until a satisfactory solution is obtained.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

152

In our case, the potential solutions are pairs: a

seed alphabet, and a set of seeds. A mutation applies

thus either to the alphabet, or to one of the seeds. Mu-

tating the alphabet is one of the following: deleting a

randomly chosen letter (except for the top # and bot-

tom one), altering a letter (by adding a tree node to

it, or removing a tree node – but only if that would

not violate the constraint that the alphabet must be

hierarchical), or adding a random (non-conﬂicting)

letter. While modifying the letter one has to respect

its deﬁnition, i.e. the (i) maximality and (ii) down-

ward closedness conditions. Mutating the set of seeds

means either deleting one of the seeds, adding a ran-

dom seed, or replacing a random letter in a random

seed by one of its neighbors in the alphabet. Algo-

rithm 1 explains the details.

A multiple seed may contain individual seeds of

different lengths in general. However to simplify and

speed-up the computations we have decided to ﬁx the

length; all individual seeds computed by the evolu-

tionary algorithms have the same length W = 5.

Algorithm 1: Genetic Algorithm.

Input: Protein family F

Output: A multiple seed for family F

begin

Population ←a multiset of 100 randomly

chosen multiple seeds (the initial

population);

while Not Run Out Of Time do

foreach multiple seed M ∈ Population

f ←fitness

(M);

Randomly, based on f, choose one

of the following:

• Population ←Population \ {M};

- with increasing probability for

low values of f

• Population ←Population \ {M}∪

{Mutate(M)};

• Population ←Population∪

{Mutate(M)}; - with increasing

probability for high values of f

end

return the member of Population that

maximizes fitness

end

The most important aspect of every optimization

algorithm, a genetic algorithm being no exception, is

the ﬁtness function chosen. Usually, what we want to

obtain is a seed that has as low background sensitiv-

ity as possible, while at the same time having as high

foreground sensitivity as possible. So, the ﬁrst idea

might be to choose the following function:

fitness

(M) =

sens

(M)

sens

(M)

This, however, yields unsatisfactory results – the evo-

lution just results in a smallest multiple seed possible,

with minuscule foreground and background sensitiv-

ity.

The ﬁtness function has to reﬂect the trade-off

between foreground sensitivity and background sen-

sitivity. It should be noted that both of these play

similar role to NCBI-BLAST ’-f’ parameter (i.e. the

threshold for the cumulative score of three hit posi-

tions). The ’-f’ parameter allows one to adjust the

length of computation, and the quality of results. With

SeedBLAST it has been split in two – the sens

(M)

part is responsible for the quality of results, while

sens

(M) is responsible for the length of computa-

tion. Keeping that in mind, we can select a ﬁtness

function that can match our needs – using it, we can

in effect specify ’give me the best results you can

achieve within a given time-frame’ – or, the opposite

– ’give me results at least this good, and I don’t care

how long it takes to compute them’. Or everything

in-between.

An example of ﬁtness function that adheres to the

ﬁrst approach might be as follows:

fitness

(M) =



0 if sens

(M) > c

sens

(M) otherwise

The second approach is fulﬁlled by the following ﬁt-

ness function:

fitness

(M) =



sens

(M) if sens

(M) < c

sens

(M)

sens

(M)

otherwise

Another appropriate ﬁtness functions and their

evolution paths are presented in supplementary ma-

terial.

For further tests, described in the rest of the paper,

we have chosen the function fitness

, with c = 0.15;

except for the performance evaluation, where we pre-

fer to use fitness

(in order to make the fair compar-

ison with NCBI-BLAST ).

This decision was taken through trial and error -

there is no guarantee that this is the optimal choice.

The multiple seed that was computed and used for

further experiments exhibits foreground sensitivity

equal to 0.179906, and background sensitivity equal

to 0.01047971. The underlying alphabet is presented

in supplementary material; the whole multiple seed,

consisting of 3686 individual seeds, is not subject to

a concise presentation.

SUBSET SEED EXTENSION TO PROTEIN BLAST

153

3 SeedBLAST: EXTENSTION OF

NCBI-BLAST

Given a query, the goal of the ﬁrst phase of the

BLAST algorithm is to index all subwords of length

W (chosen as a parameter). Not only exact subwords

are indexed but also their predeﬁned neighborhoods,

with respect to a metric determined by the cumula-

tive score according to the BLOSUM matrix. With

each query, the occurrences of the neighborhoods are

stored in a dictionary-type data structure; current ver-

sion of NCBI-BLAST uses a hash table.

The size of neighborhood is crucial as it must be

stored in a dictionary. BLAST uses a threshold on the

BLOSUM score of an alignment of a segment pair.

The threshold represents the trade-off between sensi-

tivity and time and memory efﬁciency since it has a

direct impact on the number of analyzed hits. The

default threshold was adjusted experimentally by the

BLAST developers and currently equals 11 in protein

NCBI-BLAST.

We seek to describe the neighborhood using our

selected multiple seed. In principle, the method may

be applied to any multiple seed, possibly containing

words of different lengths. However, in the case study

described in the following section, all the seeds have

the same length W = 5. Moreover, all individual seeds

are constructed over the same alphabet. This assump-

tion greatly simpliﬁes the seed design and allows to

construct a single automaton for looking for all hot

spots simultaneously.

3.1 Hot Spot Search using DFA

A trie, or a preﬁx tree, is a dictionary with a tree-

structured transition graph, in which the start node

is the root and all the leaves are ﬁnal nodes (Liang,

1983). Tries are especially convenient when the keys

are short strings: the tree edges are labeled by letters,

and retrieving a value assigned to a given key w is

done by following the w-labeled path in the tree, thus

very efﬁcient.

It is assumed that labels of edges outgoing from a

node are all different. A trie may be thus seen as an

acyclic DFA recognizing a ﬁnite language (the lan-

guage contains labels of all the paths going from the

root to a leaf). Upon acceptance, the automaton in ad-

dition returns the value assigned to a word read (being

a key). In our case, the value will be typically a set of

positions in a query.

In our algorithm, to be described below, we con-

struct a number of different tries (automata). To op-

timize for memory, on the implementation level we

always conform to the Mealy paradigm of keeping

values attached to transitions, not vertices.

In a preprocessing phase a trie S is constructed to

represent the multiple seed. Its input alphabet is the

seed alphabet A.

Next, we proceed with constructing a trie Q, over

the input alphabet Σ, that keeps all subwords of length

W from a given query. For each such word we store in

Q pointers to all positions in query where it appears.

This will reduce operations in the following phases. It

is worth noting that Q may be used to process jointly

multiple queries. Analogously, NCBI-BLAST also

permits many queries to be stored jointly in its hash

table.

As a consecutive step, a trie N is built to store

neighborhoods. Its alphabet is Σ, and language is

given by

N = Q ∝ S :=

{w | for some q ∈ Q and s ∈ S, s aligns q and w}.

The trie N is constructed by systematically traversing

a product of Q and S. The value assigned to a word w

in N denotes, similarly as in Q, a set of positions in

the query. It is given by the union of values assigned

to q in Q, for all q ranging over

{q ∈ Q | for some s ∈ S, s aligns q and w}.

On the implementation level, the union is represented

by a suitable pointer data structure.

Finally we construct an automaton H over the al-

phabet Σ, whose aim is to ﬁnd hot spots in the sub-

ject sequences. Operation of H is similar to a pattern-

matching automaton. It is built on the basis of the au-

tomaton N, by adding additional edges outgoing from

the ﬁnal (leaf) nodes. To easily explain the construc-

tion, we recall that each node of N is uniquely deter-

mined by the labeling of the path from the root to that

node. Fix a leaf determined by w and a letter a ∈ Σ;

the outgoing a-labeled edge will point to a node de-

termined by the longest sufﬁx of wa that belongs to

N. Clearly, in contrast to all other automata, H may

have cycles.

Having constructed H, next BLAST phases re-

main unchanged. Each subject sequence is traversed

along, starting from the root of H. At each step, the

value assigned to the current node (state) of H informs

whether any hot spots are found at the current position

in a subject. If so, the hot spots are stored for further

processing in the following phases of BLAST.

4 EXPERIMENTS

Datasets. We used a dataset extracted from the

Pfam database, that contains expert-made pro-

tein structural families and their multi-alignments,

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

154

later extended to larger families using proﬁle-

HMMs (Bateman et al., 2002; Finn et al., 2008).

A protein family, exhibiting low identity per-

centage, has been selected from Pfam (namely

PF00001). This family contains, amongst other G-

protein-coupled receptors (GPCRs), members of the

opsin family, which have been considered to be typ-

ical members of the rhodopsin superfamily. They

share several motifs, mainly the seven transmembrane

helices (7tm 1 domain). This domain will be the main

focus of our experiment.

The rhodopsin-like GPCRs themselves represent

a widespread protein family that includes hormone,

neurotransmitter and light receptors, all of which

transduce extracellular signals through interaction

with guanine nucleotide-binding (G) proteins. Al-

though their activating ligands vary widely in struc-

ture and character, the receptors are believed to adopt

a common structural framework comprising 7 trans-

membrane helices.

The expert-made multi alignment of 7tm 1 do-

mains from 64 of the family members was down-

loaded (the whole family contains 16975 proteins),

and used as a training set to obtain a multiple seed.

The latter was subsequently used by the SeedBLAST

algorithm to compute pair-wise alignments of the

7tm 1 domain of all the family members. The results

were compared with those obtained by the standard

BLAST algorithm.

For a fair comparison, it had to be ensured that

both algorithms actually run with the same back-

ground sensitivity. Thus, the ’-f’ parameter of

NCBI-BLAST was adjusted in the course of the ex-

periment to obtain similar background sensitivity to

that of the multiple seed used by SeedBLAST. The

table shows typical values of the ’- f’ parameter to-

gether with the corresponding values of background

sensitivity:

f parameter of

NCBI-BLAST

SeedBLAST back-

ground sensitivity

11 (default) 0.002195

10 0.005342

9 0.00816

8 0.012276

7 0.018163

The background sensitivity of the seed used by

SeedBLAST was 0.01047971; this corresponds to

8 or 9 as the value of the ’-f’ parameter in the

NCBI-BLAST invocations.

As the alignment concerned only the domain frag-

ment of each protein, and the domain is already

known to be the same in each protein, every alignment

found should be considered biologically signiﬁcant.

Figure 2: Symmetric difference between outputs of BLAST

and SeedBLAST.

Comparing Efﬁciency. Figure 2 shows the

symmetric difference between hits found by

NCBI-BLAST, and those found by SeedBLAST (that

is, alignments found by one of the algorithms but not

the other).

We observe that alignments found by SeedBLAST

are in general longer than those found by BLAST, and

thus provide better coverage of the domain. Espe-

cially many alignments that have not been found by

BLAST lie in the so-called twilight zone (Rost, 1999)

– namely long alignments with low identity percent-

age, and thus low E-value, that nevertheless are bi-

ologically signiﬁcant. The reason why SeedBLAST

constructs longer alignments is that it detects much

more biologically signiﬁcant hot-spots. A supremacy

of SeedBLAST becomes more apparent in view of

Figure 3. To obtain this diagram, pairs of domains

for which BLAST and SeedBLAST found different

alignments were chosen from the set of all align-

ments, and the coverage of domains by these align-

ments was computed. We conclude that SeedBLAST

is much more efﬁcient in providing biologically sig-

niﬁcant alignments than the standard BLAST algo-

rithm.

Figure 3: Domain coverage by alignments found by BLAST

and SeedBLAST.

The reason for SeedBLAST’s improved efﬁciency

SUBSET SEED EXTENSION TO PROTEIN BLAST

155

is the inclusion of subset seeds speciﬁcally tuned

for the domain under consideration. When one

aims at aligning different family of proteins appro-

priate multiple seed should be designed and used in

SeedBLAST. We conclude that SeedBLAST appears

much more efﬁcient than BLAST in recognizing pro-

tein domains, as it is both more effective in covering

the entire domain as well as much less likely to cover

anything beyond the sought-for domain.

Table 1: f: -f Parameter of NCBI-BLAST c: correspond-

ing background sensitivity (parameter c) used in seed de-

sign, sens

(M): actual background sensitivity of the mul-

tiple seed SeedBLAST, NCBI-BLAST: running time of

NCBI-BLAST,SeedBLAST: running time of SeedBLAST.

f c sens

(M) NCBI-

BLAST

(sec.)

Seed-

BLAST

(sec.)

15 0.000306 0.00030564 1.20 0.45

11 0.002195 0.00211789 3.32 1.40

8 0.012276 0.01156471 12.17 4.74

5 0.026406 0.02622019 23.84 12.40

Comparing Running Time. In addition,

SeedBLAST and NCBI-BLAST were compared

with respect to their running time (preprocessing,

i. e. seed design phase is not included). For a

fair comparison, again, we had to ensure that both

algorithms actually run with the same background

sensitivity.

In case of SeedBLAST, we had to be able to con-

trol the background sensitivity of the multiple seed

used. This led us to choose the ﬁtness function:

fitness

(M) =



0 if sens

(M) > c

sens

(M) otherwise

(cf. Section 2) that seems to suit best to this purpose:

the parameter c corresponds directly to the desired

background sensitivity of the multiple seed.

In case of NCBI-BLAST, its background sensitiv-

ity can be adjusted by the ’-f’ parameter. For the test,

we picked several different values of the ’-f’ param-

eter, and then calculated the background sensitivities

induced by these values. These background sensitivi-

ties were taken as the value of the c parameter in the

above ﬁtness function, exploited in the computation

of multiple seeds used by SeedBLAST.

The results are summarized in the Table 1. In fact,

because of the unpredictable nature of multiple seed

evolution, we can’t control the background sensitivity

of a multiple seed exactly. This gives SeedBLAST a

slight advantage over the other, represented by the dif-

ference between the c parameter (equal to the actual

background sensitivity NCBI-BLAST runs with), and

the background sensitivity of an obtained multiple

seed. Still, even accounting for this slight difference,

the results show that SeedBLAST algorithm is over

two times faster than NCBI-BLAST on average. As

the SeedBLAST is an extension of NCBI-BLAST we

conclude, that the speed-up is achieved due to faster

hot-spot identiﬁcation stage. We argue that the mul-

tiple seed approach enables to detect biologically sig-

niﬁcant hot-spots, e.g. those corresponding to func-

tional residues (Oliveira et al., 1993).

It is worth mentioning here that the performance

of SeedBLAST (being the extension of standard

NCBI-BLAST implementation) is comparable with

the performance of subset seed based tools that use

parallel implementation or specialized hardware (Pe-

terlongo et al., 2008; Nguyen and Lavenier, 2008).

Comparison with PSI-BLAST. One can see the

close similarity between the seed approach and po-

sition speciﬁc scoring matrices used to improve ho-

mology search. Therefore we decided to compare the

selectivity and sensitivity of SeedBLAST to the efﬁ-

ciency of popular PSI-BLAST algorithm. The experi-

ment was performed on two protein families (Surface

antigen - PF01617 and Globin - PF00042). The per-

formace of both algorithms on Globin family was al-

most identical (data not shown). On the other hand

on Antigen family SeedBLAST achieved much better

selectivity while keeping the same level of sensitivity

(c. f. Fig.4).

5 FURTHER RESEARCH

Further experiments using in-sample, out-of-sample

tests and larger families are needed. At this stage

the performance of SeedBLAST is comparable with

BLAST running with low threshold for hot spots. We

hope that this result can be improved if more ad-

vanced methods for construction of multiple seeds are

used, and longer seeds (i.e., of length > 5) are in-

cluded.

The remaining goal is to develop a method of seed

construction that would keep sensitivity high and im-

prove selectivity. That would eventually reduce the

running time and memory requirements for greater

seed lengths.

Protein homology search requires unavoidably

storing a large dictionary in memory. Hence it seems

to be worth pursuing ideas from (Kahveci and Singh,

2001), where a novel method of compression, based

on wavelets, was proposed for dictionaries of words.

Another possible improvement could be integration

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

156

Figure 4: Histogram of found alignments grouped by logarithm of their E-value. We can observe that within the Antigen

family SeedBLAST ﬁnds all of the alignments that PSI-BLAST does (except for a small fraction of some non-signiﬁcant

ones with E-values of 1 and more). Histogram on the right side shows results of alignment of proteins which we know

to be unrelated to Antigens with Antigens. We can see that SeedBLAST ﬁnds less non-homology-related alignments than

PSI-BLAST does.

of the cache-conscious hashing DFA to improve ef-

ﬁciency of page-swapping, as described in (Cameron

et al., 2006). However, we would like to recall here

that our overall goal of investigating the seed-based

hot spot search was to reduce the need for large infor-

mation storage by choosing only those hits that seem

important.

ACKNOWLEDGEMENTS

Parts of this work have been done during visits to

LIFL of AG and SL supported by the ECO-NET and

Polonium programs of the French Ministry of For-

eign Affairs. The research project was also funded

by the Polish Ministry of Science and Higher Educa-

tion grants no. N N301 065236 and N N206 356036

and European Social Fund (UDA-POKL.04.01.01-

00-072/09-00).

REFERENCES

Aho, A. and Corasick, M. (1975). Efﬁcient string matching:

an aid to bibliographic search. Communications of the

ACM, 18(6):333–340.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and

Lipman, D. J. (1990). Basic local alignment search

tool. J. Mol. Biol., 215(3):403–410.

Altschul, S. F., Madden, T. L., Schffer, A. A., Zhang, J.,

Zhang, Z., Miller, W., and Lipman, D. J. (1997).

Gapped BLAST and PSI-BLAST: a new generation

of protein database search programs. Nucl. Acids Res.,

25(17):3389–3402. PMID: 9254694.

Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller,

L., Eddy, S., Grifﬁths-Jones, S., Howe, K., Marshall,

M., and Sonnhammer, E. (2002). The Pfam Protein

Families Database. Nucl. Acids Res., 30(1):276–280.

Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.,

Estreicher, A., Gasteiger, E., Martin, M., Michoud,

K., O’Donovan, C., Phan, I., et al. (2003). The

SWISS-PROT protein knowledgebase and its supple-

ment TrEMBL in 2003. Nucl. Acids Res., 31(1):365–

370.

Brejova, B., Brown, D. G., and Vinar, T. (2004). Op-

timal spaced seeds for homologous coding regions.

Journal of Bioinformatics and Computational Biol-

ogy, 1(4):595–610.

Brejov

a, B., Brown, D. G., and Vinar, T. (2005). Vector

seeds: An extension to spaced seeds. J. Comput. Syst.

Sci., 70(3):364–380.

Brown, D. G. (2004). Multiple vector seeds for protein

alignment. In WABI, pages 170–181.

Buhler, J., Keich, U., and Sun, Y. (2005). Designing seeds

for similarity search in genomic DNA. J. Comput.

Syst. Sci., 70(3):342–363.

Cameron, M., Williams, H., and Cannane, A. (2006). A de-

terministic ﬁnite automaton for faster protein hit de-

tection in BLAST. J. Comput. Biol., 13(4):965–78.

Cheng, S. and Xu, Y.-F. (1995). Constrained independence

system and triangulations of planar point sets. In Com-

puting and Combinatorics, pages 41–50.

Finn, R. D., Tate, J., Mistry, J., Coggill, P. C., Sammut,

S. J., Hotz, H., Ceric, G., Forslund, K., Eddy, S. R.,

Sonnhammer, E. L. L., and Bateman, A. (2008). The

pfam protein families database. Nucl. Acids Res.,

36(suppl 1):D281–288.

Hopcroft, J. and Ullman, J. (1979). Introduction to au-

tomata theory, languages and computation. Mas-

sachusetts.

Kahveci, T. and Singh, A. (2001). An efﬁcient index struc-

ture for string databases. Proceedings of the 27th

VLDB, pages 352–360.

SUBSET SEED EXTENSION TO PROTEIN BLAST

157

Kisman, D., Li, M., Ma, B., and Wang, L. (2005). tPat-

ternHunter: gapped, fast and sensitive translated ho-

mology search. Bioinformatics (Oxford, England),

21(4):542–544. PMID: 15374861.

Korte, B. and Hausmann, D. (1978). An analysis of the

greedy heuristic for independence systems. Ann. Dis-

crete Math., 2:65–74.

Kucherov, G., Noe, L., and Roytberg, M. (2005). Multiseed

lossless ﬁltration. IEEE/ACM Trans. Comput. Biol.

Bioinformatics, 2(1):51–61.

Kucherov, G., No

e, L., and Roytberg, M. (2006). A unify-

ing framework for seed sensitivity and its application

to subset seeds. Journal of Bioinformatics and Com-

putational Biology, 4(2):553–570.

Li, M., Ma, B., Kisman, D., and Tromp, J. (2004). Pattern-

Hunter II: Highly sensitive and fast homology search.

Journal of Bioinformatics and Computational Biol-

ogy, 2(3):417–439.

Li, T., Fan, K., and Wang, J. Wang, W. (2003). Reduction

of protein sequence complexity by residue grouping.

Protein Engineering, 16(5):323–330.

Li, W., Ma, B., and Zhang, K. (2009). Amino acid classiﬁ-

cation and hash seeds for homology search. In BICoB,

pages 44–51.

Liang, F. M. (1983). Word hy-phen-a-tion by com-put-er.

Technical report, Departament of Computer Science,

Stanford University.

Livingstone, C. D. and Barton, G. J. (1993). Protein se-

quence alignments: a strategy for the hierarchical an

alysis of residue conservation. Computer Applications

in the Biosciences: CABIOS, 9(6):745–756. PMID:

8143162.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster

and more sensitive homology search. Bioinformatics

(Oxford, England), 18(3):440–445. PMID: 11934743.

Ma, B. and Yao, H. (2008). Seed optimization is no easier

than optimal golomb ruler design. In APBC, pages

133–144.

Mitchell, M. (1996). An Introduction to Genetic Algo-

rithms. MIT Press.

Murphy, L., Wallqvist, A., and Levy, R. (2000). Simpli-

ﬁed amino acid alphabets for protein fold recognition

and implications for folding. Protein Engineering,

13:149–152.

Neuwald, A. (1998). A probable solution to sequence-

analysis problems. Trends in Biochemical Sciences,

23(9):365–365.

Nguyen, V. H. and Lavenier, D. (2008). Speeding up subset

seed algorithm for intensive protein sequence compar-

ison. In RIVF, pages 57–63.

Noe, L. and Kucherov, G. (2005). YASS: enhancing the

sensitivity of DNA similarity search. Nucl. Acids Res.,

33(suppl 2):W540–543.

Oliveira, L., Paiva, A. C. M., and Vriend, G. (1993). A com-

mon motif in g-protein-coupled seven transmembrane

helix r eceptors. Journal of Computer-Aided Molecu-

lar Design, 7(6):649–658.

Peterlongo, P., No, L., Lavenier, D., illes Georges, G.,

Jacques, J., Kucherov, G., and Giraud, M. (2008). Pro-

tein similarity search with subset seeds on a dedicated

reco nﬁgurable hardware. In Parallel Processing and

Applied Mathematics, pages 1240–1248. Springer.

Ponty, Y., Termier, M., and Denise, A. (2006). GenRGenS:

software for generating random genomic sequences

and structures.

Rost, B. (1999). Twilight zone of protein sequence align-

ments. Protein Engineering Design and Selection,

12(2):85–94.

Roytberg, M., Gambin, A., No

e, L., Lasota, S., Furletova,

E., Szczurek, E., and Kucherov, G. (2009). On sub-

set seeds for protein alignment. IEEE/ACM Transac-

tions on Computational Biology and Bioinformatics,

6(3):483–494.

Shiryev, A. S., Papadopoulos, J. S., S chaffer, A. A., and

Agarwala, R. (2007). Improved BLAST searches us-

ing longer words for protein seedin g. Bioinformatics,

23(21):2949–2951.

Smith, T. and Waterman, M. (1981). Identiﬁcation of Com-

mon Molecular Subsequences. J. Mol. Biol., 147:195–

197.

Sun, Y. and Buhler, J. (2004). Designing multiple simulta-

neous seeds for DNA similarity search. In RECOMB,

pages 76–84.

Yang, I.-H., Wang, S.-H., Chen, Y.-H., Huang, P.-H., Ye, L.,

Huang, X., and Chao, K.-M. (2004). Efﬁcient meth-

ods for generating optimal single and multiple spaced

seeds. In BIBE ’04: Proceedings of the 4th IEEE Sym-

posium on Bioinformatics and Bioengineering, page

411, Washington, DC, USA. IEEE Computer Society.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

158