EXPERIMENTAL RESULTS ON MULTIPLE PATTERN MATCHING ALGORITHMS FOR BIOLOGICAL SEQUENCES

Charalampos S. Kouzinopoulos, Panagiotis D. Michailidis and Konstantinos G. Margaritis

Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia

156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece

Keywords:

Algorithms, Multiple pattern matching, Multiple keyword matching, String searching, Biological sequence databases.

Abstract:

With the remarkable increase in the number of DNA and protein sequences, it is very important to study the performance of multiple pattern matching algorithms when querying sequence patterns in biological sequence databases. In this paper, we present a performance study of the running time of well known multiple pattern matching algorithms on widely used biological sequence databases containing the building blocks of nucleotides (in the case of nucleic acid sequence databases) and amino acids (in the case of protein sequence databases).

1 INTRODUCTION

Multiple pattern matching is the computationally intensive kernel of many applications including information retrieval, intrusion detection systems, web filtering, virus scanners and spam filters. In recent years, however, a strong interest has emerged in string matching as a powerful tool for locating nucleotide or amino acid sequence patterns in biological sequence databases. For example, when proteomics data is used for genome annotation in a process called proteogenomic mapping (Jaffe et al., 2004), a set of peptide identifications obtained using mass spectrometry is matched against a target genome translated in all six reading frames.

The multiple pattern matching problem can be defined as follows: Given a sequence database (or text) $T = t_1 t_2 \ldots t_n$ of length $n$ and a finite set of $r$ patterns $P = \{p^1, p^2, \ldots, p^r\}$, where each $p^i$ is a string $p^i = p^i_1 p^i_2 \ldots p^i_m$ of length $m$ over a finite character set $\Sigma$, the task is to find all occurrences of any of the patterns in the sequence database.

The naive solution to this problem is to perform r separate searches with a sequential algorithm (Navarro and Raffinot, 2002). While frequently used in the past, this technique is not efficient when a large pattern set is involved. The aim of multiple pattern matching algorithms is to scan the input string T in a single pass to locate the occurrences of all patterns. These algorithms are based on single-pattern matching algorithms, with some of their functions generalized to process multiple patterns simultaneously during the preprocessing phase, generally with the use of trie structures and hashing.
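The naive approach described above can be sketched in a few lines of C. This is only an illustrative baseline, not code from the study: the function names count_single and count_naive are hypothetical, and a brute-force scan stands in for whichever sequential algorithm would be used in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the occurrences of one pattern of length m in text[0..n-1]
   with a brute-force scan (a stand-in for any sequential algorithm). */
static size_t count_single(const char *text, size_t n,
                           const char *pat, size_t m)
{
    size_t hits = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++)
        if (memcmp(text + i, pat, m) == 0)
            hits++;
    return hits;
}

/* Naive multiple pattern matching: r independent passes over the text,
   one per pattern. Returns the total number of occurrences. */
size_t count_naive(const char *text, const char *const *patterns, size_t r)
{
    assert(text != NULL && patterns != NULL);
    size_t n = strlen(text);
    size_t total = 0;
    for (size_t k = 0; k < r; k++)
        total += count_single(text, n, patterns[k], strlen(patterns[k]));
    return total;
}
```

The text is traversed once per pattern, so the work grows with r. Dedicated multiple pattern matching algorithms avoid exactly this cost by scanning the text once.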

Multiple pattern matching is widely used in computational biology for a variety of pattern matching tasks. For example, Brudno and Morgenstern use a simplified version of the Aho-Corasick algorithm to identify anchor points in their CHAOS algorithm for fast alignment of large genomic sequences (Brudno and Morgenstern, 2002; Brudno et al., 2004). Hyyro et al. demonstrate that Aho-Corasick outperforms other algorithms for locating unique oligonucleotides in the yeast genome (Hyyro et al., 2005). The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge. Buhler et al. use Aho-Corasick to design simultaneous seeds for DNA similarity search (Buhler et al., 2005).
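For readers unfamiliar with the Aho-Corasick algorithm mentioned above, a minimal textbook-style sketch follows. It is not the simplified variant used in CHAOS or the code of any cited tool. It assumes a DNA alphabet (A, C, G, T), non-empty patterns and a fixed state capacity; the names Automaton, ac_build, ac_count and MAX_NODES are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define SIGMA 4          /* DNA alphabet: A, C, G, T */
#define MAX_NODES 1024   /* illustrative capacity, not from the paper */

typedef struct {
    int next[MAX_NODES][SIGMA]; /* goto function, completed to a DFA */
    int fail[MAX_NODES];        /* failure links */
    int hits[MAX_NODES];        /* patterns ending at this state */
    int count;                  /* number of states in use */
} Automaton;

static int code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

static void clear_state(Automaton *a, int s)
{
    for (int c = 0; c < SIGMA; c++)
        a->next[s][c] = -1;
    a->fail[s] = 0;
    a->hits[s] = 0;
}

/* Build the trie of all patterns, then add failure links breadth-first. */
void ac_build(Automaton *a, const char *const *patterns, size_t r)
{
    int queue[MAX_NODES], head = 0, tail = 0;
    a->count = 1;
    clear_state(a, 0);
    for (size_t k = 0; k < r; k++) {
        int s = 0;
        assert(patterns[k][0] != '\0');
        for (const char *p = patterns[k]; *p; p++) {
            int c = code(*p);
            assert(c >= 0);
            if (a->next[s][c] < 0) {
                assert(a->count < MAX_NODES);
                clear_state(a, a->count);
                a->next[s][c] = a->count++;
            }
            s = a->next[s][c];
        }
        a->hits[s]++;
    }
    for (int c = 0; c < SIGMA; c++) {
        int t = a->next[0][c];
        if (t < 0)
            a->next[0][c] = 0;      /* missing root edges loop back */
        else
            queue[tail++] = t;      /* depth-1 states fail to the root */
    }
    while (head < tail) {
        int s = queue[head++];
        a->hits[s] += a->hits[a->fail[s]]; /* inherit suffix matches */
        for (int c = 0; c < SIGMA; c++) {
            int t = a->next[s][c];
            if (t < 0) {
                a->next[s][c] = a->next[a->fail[s]][c];
            } else {
                a->fail[t] = a->next[a->fail[s]][c];
                queue[tail++] = t;
            }
        }
    }
}

/* Scan the text once and count all occurrences of all patterns. */
long ac_count(const Automaton *a, const char *text)
{
    long total = 0;
    int s = 0;
    for (; *text; text++) {
        int c = code(*text);
        if (c < 0) { s = 0; continue; } /* unknown symbol: restart */
        s = a->next[s][c];
        total += a->hits[s];
    }
    return total;
}
```

The automaton is built once during preprocessing. The scan then performs one table lookup per text character regardless of how many patterns are in the set.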

This paper presents experiments on the running time of the well known Commentz-Walter, Wu-Manber, Set Backward Oracle Matching (SBOM) and Salmela-Tarhio-Kytöjoki (HG, SOG and BG) multiple pattern matching algorithms for biological sequences. A detailed analysis of the multiple pattern matching algorithms presented in this paper, additional experiments on different types of data, as well as a study of the preprocessing time and the memory requirements of the algorithms can be found in (Kouzinopoulos and Margaritis, 2010).


Kouzinopoulos C. S., Michailidis P. D. and Margaritis K. G.
EXPERIMENTAL RESULTS ON MULTIPLE PATTERN MATCHING ALGORITHMS FOR BIOLOGICAL SEQUENCES.
DOI: 10.5220/0003133202740277
In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 274-277
ISBN: 978-989-8425-36-2
© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The aim of this experimental study is to identify a suitable and preferably fast multiple pattern matching algorithm for several problem parameters such as a given biological database, the size of the pattern set, and the length of the patterns.

2 EXPERIMENTAL METHODOLOGY

The experiments were executed locally on an Intel Core 2 Duo CPU with a 3.00 GHz clock speed, 2 GB of memory, 64 KB of L1 cache and 6 MB of L2 cache. The Ubuntu Linux operating system was used and only the typical background processes ran during the experiments. To decrease random variation, the reported times are averages of 100 runs. All algorithms were implemented in the ANSI C programming language and compiled with the GCC 4.4.3 compiler using the "-O2" and "-funroll-loops" optimization flags.

To compare the pattern matching algorithms, the practical running time was used as a measure. Practical running time is the total time in seconds an algorithm needs to find all occurrences of the patterns in an input string, including any preprocessing time. It was measured using the MPI_Wtime function of the Message Passing Interface, since it has a better resolution than the standard clock() function.
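The measurement procedure (wall-clock timing averaged over repeated runs) can be sketched as follows. To keep the example free of an MPI dependency, it swaps in the POSIX clock_gettime call with CLOCK_MONOTONIC instead of the MPI_Wtime function used in the study. The names now_seconds, average_seconds and demo_job are hypothetical, and demo_job is only a placeholder workload.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict C */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds from a monotonic high-resolution clock.
   With MPI, a call to MPI_Wtime() would take the place of this helper. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average wall-clock time of `runs` calls to job(arg), in seconds.
   The job should include preprocessing as well as searching. */
double average_seconds(void (*job)(void *), void *arg, int runs)
{
    assert(job != NULL && runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        double start = now_seconds();
        job(arg);
        total += now_seconds() - start;
    }
    return total / runs;
}

/* Placeholder workload for illustration only: sums 0 .. *arg - 1. */
void demo_job(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long limit = *(unsigned long *)arg;
    for (unsigned long i = 0; i < limit; i++)
        sink += i;
}
```

A call such as average_seconds(demo_job, &limit, 100) mirrors the 100-run averaging described above. In a real benchmark the job would wrap the preprocessing and search phases of one algorithm.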

The data set was similar to the ones used in (Sheik et al., 2005) and (Kalsi et al., 2008). It consisted of the SWISS-PROT amino acid sequence database with a size of n = 182,116,687 characters and an alphabet of size 20, the FASTA Amino Acid (FAA) sequence of the A. thaliana genome with a size of n = 11,273,437 characters and an alphabet of size 20, and the FASTA Nucleic Acid (FNA) sequence of the A. thaliana genome with a size of n = 118,100,062 characters and an alphabet of size 4.
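The two data set parameters quoted above, the length n and the alphabet size, can be checked for any sequence with a small helper. The sketch below is illustrative only; the name alphabet_size is hypothetical, and it assumes the sequence has already been stripped of FASTA header lines (those starting with '>') and of newline characters.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct byte values occurring in seq[0..n-1]. */
size_t alphabet_size(const char *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    unsigned char seen[256] = {0};
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)seq[i];
        if (!seen[c]) {
            seen[c] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For a pure nucleotide sequence over A, C, G and T the function returns 4, matching the alphabet size reported for the FNA data.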

3 EXPERIMENTAL RESULTS

In this section, the performance of the algorithms is evaluated according to their running time for different biological databases.

Figures 1 to 3 present the running time of the algorithms, including preprocessing, for the SWISS-PROT amino acid sequence database and for the FASTA amino acid and nucleic acid databases of the A. thaliana genome respectively, for pattern lengths of m = 8 and m = 32 and for 100 to 100,000 patterns. As the figures show, the size of the pattern set, the length of the patterns and the size of the alphabet affect the performance of the algorithms in different ways.

In the case of the SWISS-PROT database and for a pattern length of m = 8, the SOG and BG algorithms had the best performance when up to 10,000 patterns were used, while the SBOM algorithm was faster for more than 10,000 patterns. When a pattern length of m = 32 was used, the SOG and BG algorithms had the fastest running time for up to 20,000 patterns, while SBOM was faster for more than 20,000 patterns. The HG and Wu-Manber algorithms had an average performance for both m = 8 and m = 32, while Commentz-Walter was consistently the slowest algorithm in terms of running time.

For the FASTA amino acid database and for a pattern length of m = 8, the SOG and BG algorithms were faster when up to 10,000 patterns were used, while for more than 10,000 patterns the Wu-Manber algorithm had the best performance, followed by SBOM. When a pattern length of m = 32 was used, the SOG and BG algorithms had the fastest running time for up to 30,000 patterns. For bigger pattern sets, Wu-Manber was the fastest algorithm. Commentz-Walter was the algorithm with the worst performance when m = 8 was used, while for m = 32 the Commentz-Walter and SBOM algorithms had the worst performance.

In the case of the FASTA nucleic acid database, SBOM was consistently the fastest algorithm for a pattern length of m = 8. When up to 2,000 patterns were used, Commentz-Walter was the slowest algorithm, while for more than 2,000 patterns HG was the algorithm with the worst performance. For a pattern length of m = 32, Wu-Manber was the fastest algorithm, especially when more than 20,000 patterns were used.
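The per-database observations above can be summarized as a small lookup. The sketch below only encodes the fastest algorithm named in the preceding three paragraphs for each measured configuration. The enum and function names are hypothetical, the crossover points are the approximate ones quoted in the text, and any configuration outside the measured range (m other than 8 or 32, fewer than 100 or more than 100,000 patterns) is reported as not measured.

```c
#include <assert.h>

typedef enum { SWISS_PROT, FASTA_AMINO, FASTA_NUCLEIC } Database;

typedef enum {
    ALG_SOG_BG,        /* SOG and BG, reported together in the text */
    ALG_SBOM,
    ALG_WU_MANBER,
    ALG_NOT_MEASURED   /* outside the experiments described here */
} Algorithm;

/* Fastest algorithm reported in the text for a database, a pattern
   length m and a pattern-set size r. Thresholds are approximate. */
Algorithm fastest_reported(Database db, int m, long r)
{
    assert(m > 0 && r > 0);
    if ((m != 8 && m != 32) || r < 100 || r > 100000)
        return ALG_NOT_MEASURED;
    switch (db) {
    case SWISS_PROT:
        return r <= (m == 8 ? 10000L : 20000L) ? ALG_SOG_BG : ALG_SBOM;
    case FASTA_AMINO:
        return r <= (m == 8 ? 10000L : 30000L) ? ALG_SOG_BG
                                               : ALG_WU_MANBER;
    case FASTA_NUCLEIC:
        return m == 8 ? ALG_SBOM : ALG_WU_MANBER;
    }
    return ALG_NOT_MEASURED;
}
```

Such a table is only a reading aid for these three data sets. It should not be taken as a general selection rule for other databases or hardware.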

Specific performance comments on the algorithms follow. Commentz-Walter was the algorithm with the fastest running time on the FASTA nucleic acid database with a pattern length of m = 32, especially when more than 10,000 to 50,000 patterns were used. It had the worst performance on the SWISS-PROT and FASTA amino acid databases, and thus its use is not recommended in general on large alphabets such as those of amino acid databases. Wu-Manber was the fastest algorithm on the FASTA amino acid database when more than 10,000 patterns were used and, together with the Commentz-Walter algorithm, on the FASTA nucleic acid database for a pattern length of m = 32. On the SWISS-PROT database and for a pattern length of m = 8, the algorithm had a good performance with a running time close to that of the other algorithms.

[Plot not reproduced. Panels: m = 8 and m = 32. X-axis: number of patterns (100 to 100000). Y-axis: running time (sec).]
Figure 1: Running time of the algorithms for the SWISS-PROT Amino Acid database.

[Plot not reproduced. Panels: m = 8 and m = 32. X-axis: number of patterns (100 to 100000). Y-axis: running time (sec), ticks from 0.2 to 0.8.]
Figure 2: Running time of the algorithms for the FASTA Amino Acid database.

The SBOM algorithm outperformed the rest of the algorithms on the SWISS-PROT database for pattern sets of more than 10,000 to 20,000 patterns. It also had the best performance on the FASTA nucleic acid text when a pattern length of m = 8 was used. On the FASTA amino acid database it had an average performance for m = 8 and the worst performance compared to the rest of the algorithms for m = 32. Among the Salmela-Tarhio-Kytöjoki algorithms, SOG and BG had the best performance in practice on the SWISS-PROT database, together with the SBOM algorithm, and on the FASTA amino acid database for fewer than 5,000 to 20,000 patterns. On the FASTA nucleic acid database the Salmela-Tarhio-Kytöjoki algorithms had the slowest running time compared to the rest of the algorithms, and so their use is not recommended on small alphabets such as those of DNA-type databases.

4 CONCLUSIONS

In this paper, experimental results of the well known Commentz-Walter, Wu-Manber, SBOM and Salmela-Tarhio-Kytöjoki algorithms were presented. The algorithms were compared in terms of running time for the SWISS-PROT amino acid sequence database and the FASTA Amino Acid (FAA) and FASTA Nucleic Acid (FNA) sequences of the A. thaliana genome, for sets of between 100 and 100,000 patterns with a length of m = 8 and m = 32. The experimental study showed that no algorithm is the best for all values of the problem parameters. Instead, different algorithms are preferable for different databases. Commentz-Walter had the best performance on the FASTA nucleic acid database for more than 10,000 patterns. Wu-Manber was the fastest algorithm on the FASTA amino acid database for more than 10,000 to 50,000 patterns. The Wu-Manber algorithm also outperformed the rest of the algorithms on the FASTA nucleic acid database for a pattern length of m = 32 and for more than 20,000 patterns. The SBOM algorithm had the best performance on the SWISS-PROT database for m = 8 and for more than 10,000 patterns, and on the FASTA nucleic acid database, especially when a pattern length of m = 8 was used. Among the Salmela-Tarhio-Kytöjoki algorithms, HG did not perform well on any biological database, while the SOG and BG algorithms had the fastest running time on the SWISS-PROT database when up to 10,000 to 20,000 patterns were used, and on the FASTA amino acid database when up to 10,000 to 30,000 patterns were used.

[Plot not reproduced. Panels: m = 8 and m = 32. X-axis: number of patterns (100 to 100000). Y-axis: running time (sec).]
Figure 3: Running time of the algorithms for the FASTA Nucleic Acid database.

Since biological databases and pattern sets are usually inherently parallel in nature, the work presented in this paper could be extended with a focus on the speedup of the existing algorithms when processed in parallel on traditional parallel architectures such as cluster environments and multicore systems, as well as on modern parallel systems such as GPU architectures.
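One common way to expose this parallelism, shown here purely as an assumption and not as a method from the paper, is to partition the text into ranges of start positions and search each range independently. A match that starts near the end of a range may extend up to m - 1 characters into the next one, so each worker must be allowed to read that overlap. The sketch runs the chunks in a sequential loop for a single pattern with a brute-force scan; the names count_in_range and count_chunked are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count occurrences of pat (length m) whose start position lies in
   [begin, end). The comparison may read up to m - 1 characters past
   `end`, which is the overlap between neighbouring chunks. */
static size_t count_in_range(const char *text, size_t n,
                             const char *pat, size_t m,
                             size_t begin, size_t end)
{
    size_t hits = 0;
    for (size_t i = begin; i < end && i + m <= n; i++)
        if (memcmp(text + i, pat, m) == 0)
            hits++;
    return hits;
}

/* Split the start positions into `chunks` ranges and sum the counts.
   Each loop iteration is independent of the others, so it could be
   handed to a separate thread, process or GPU block. */
size_t count_chunked(const char *text, const char *pat, size_t chunks)
{
    assert(text != NULL && pat != NULL && chunks > 0);
    size_t n = strlen(text), m = strlen(pat);
    if (m == 0 || m > n)
        return 0;
    size_t step = (n + chunks - 1) / chunks;   /* ceil(n / chunks) */
    size_t total = 0;
    for (size_t c = 0; c < chunks; c++) {
        size_t begin = c * step;
        if (begin >= n)
            break;
        size_t end = begin + step < n ? begin + step : n;
        total += count_in_range(text, n, pat, m, begin, end);
    }
    return total;
}
```

Because every occurrence is attributed to exactly one range by its start position, the total is the same for any number of chunks. The pattern set could be partitioned among workers in a similar way.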

REFERENCES

Brudno, M. and Morgenstern, B. (2002). Fast and sensitive alignment of large genomic sequences. In IEEE Computer Society Bioinformatics Conference, volume 1, pages 138–147.

Brudno, M., Steinkamp, R., and Morgenstern, B. (2004). The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences. Nucleic Acids Research, 32:41–44.

Buhler, J., Keich, U., and Sun, Y. (2005). Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, 70(3):342–363.

Hyyro, H., Juhola, M., and Vihinen, M. (2005). On exact string matching of unique oligonucleotides. Computers in Biology and Medicine, 35(2):173–181.

Jaffe, J., Berg, H., and Church, G. (2004). Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics, 4(1):59–77.

Kalsi, P., Peltola, H., and Tarhio, T. (2008). Comparison of exact string matching algorithms for biological sequences. Communications in Computer and Information Science, pages 417–426.

Kouzinopoulos, C. S. and Margaritis, K. G. (2010). Algorithms for multiple keyword matching: Survey and experimental results. Technical report.

Michael, M., Dieterich, C., and Vingron, M. (2005). SITEBLAST: rapid and sensitive local alignment of genomic sequences employing motif anchors. Bioinformatics, 21(9):2093–2094.

Navarro, G. and Raffinot, M. (2002). Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press.

Sheik, S., Aggarwal, S. K., Poddar, A., Sathiyabhama, B., Balakrishna, N., and Sekar, K. (2005). Analysis of string-searching algorithms on biological sequence databases. Current Science, 89(2):368–374.
