Backward Pattern Matching on Elastic Degenerate Strings
Petr Proch
´
azka, Ond
ˇ
rej Cvacho, Lubo
ˇ
s Kr
ˇ
c
´
al and Jan Holub
Dep. of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague,
Thakurova 2700/9, Prague 6, Czech Republic
Keywords:
Elastic Degenerate Strings, Genomic Sequences, Degenerate Pattern Matching, Pan-Genomics.
Abstract:
Recently, the concept of Elastic Degenerate Strings (EDS) was introduced as a way of representing a sequenced
population of the same species. Several on-line Elastic Degenerate String Matching (EDSM) algorithms were
presented so far. Some of them provide practical implementation. We propose a new on-line EDSM algorithm
BNDM-EDS. Our algorithm combines two traditional algorithms BNDM and the Shift-And that were adapted
to the specifics needed by Elastic Degenerate Strings. BNDM-EDS is running in O(Nmd
m
w
e) worst-case time.
This implies O(Nm) time for small patterns, where m is the length of the searched pattern, N is the size of
EDS, and w is the size of the computer word. The algorithm uses O(N + n) space, where n is the length of
EDS. BNDM-EDS requires a simple preprocessing step with time and space O(m). Experimental results on real
genomic data show superiority of BNDM-EDS over state-of-the-art algorithms.
1 INTRODUCTION
Decreasing costs of DNA sequencing enabled a
huge development of projects focused on sequenc-
ing whole populations of individuals of the same
species. The projects cataloguing human genetic
variations include the 1000 Genomes Projects (Con-
sortium, The 1000 Genomes Project, 2011) and the
UK10K project (Consortium, The UK10K, 2015).
Sequenced populations pose a challenge to fast pat-
tern matching algorithms on a set of similar strings.
A common approach to represent the sequenced
population is to store genetic variations of individuals
of the population with respect to the chosen ‘reference
sequence’. This approach is very space efficient since
single human individuals differ only in 0.1% bases on
average. In return, the pattern matching algorithms
performed on this kind of data are confronted by a
few obstacles given by the structure of the data.
During the last two decades, many different for-
mats representing the sequenced population on dif-
ferent level of detail were developed. A consensus
sequence drawn from an entire population in multiple
sequence alignment (MSA) is one of the most concise
forms that the sequenced population can take. The
consensus sequence can be expressed as a degener-
ate string over a degenerate alphabet. IUPAC alpha-
bet (on Biochemical Nomenclature (CBN), 1971) is
used for genomic data. Another more recent format
is EDS (Elastic Degenerate String) (Iliopoulos et al.,
2017). EDS enables to express the variants of dif-
ferent lengths within one degenerate position in the
sequence which provides more accuracy since inser-
tions and deletions can be expressed precisely. Vari-
ation graph (Marschall, 2018; Church et al., 2015;
Dilthey et al., 2015) data structure provides similar
level of accuracy as EDS. It is a graph data struc-
ture where the backbone of the graph represents the
reference genome and other alternatives (variants)
across the population are stored as additional edges.
The 1000 Genomes Projects (Consortium, The 1000
Genomes Project, 2011) stores the sequenced popu-
lations in vcf format (Danecek et al., 2011). The
variant-call-format (vcf) provides even more infor-
mation since the stored variations are addressed to
single individuals of the population.
The idea of Pan-genome poses another challenge
for efficient storage of the sequenced population be-
cause the Pan-genome represents a collection of ge-
nomic sequences that are analyzed together or used
as a reference (Consortium, 2016).
A lot of work has been done on off-line search-
ing in the sequenced populations, i.e. many in-
dex data structures solving this problem were pro-
posed (Proch
´
azka and Holub, 2014; Maciuca et al.,
2016; Na et al., 2016; Navarro and Pereira, 2016;
Sir
´
en, 2016). However, the authors of (Grossi et al.,
2017) clearly state the following scenarios where on-
line version is more appropriate: (i) efficient on-line
solutions can be used in combination with partial
50
Procházka, P., Cvacho, O., Kr
ˇ
cál, L. and Holub, J.
Backward Pattern Matching on Elastic Degenerate Strings.
DOI: 10.5220/0010243600500059
In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021) - Volume 3: BIOINFORMATICS, pages 50-59
ISBN: 978-989-758-490-9
Copyright
c
2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
(block-oriented) indexes as practical trade-offs; (ii)
efficient on-line solutions for exact pattern matching
can be applied for fast average-case approximate pat-
tern matching; (iii) on-line solutions can be useful
when one wants to search for a set of patterns in elas-
tic degenerate texts.
Elastic Degenerate Strings (EDS) provide an
alternative to represent consensus/pan-genomic se-
quences. Implementation called ElDeS (the first on-
line search algoritm on EDS) was provided later satis-
fying the presented upper bound O(N +αγnm), where
α is the maximum number of strings in any elastic-
degenerate symbol of the text and γ is the maximum
number of elastic-degenerate symbols spanned by any
occurrence of the pattern in the text. Constants α and
γ represent the limits of the algorithm. ElDeS com-
bines KMP algorithm (Knuth et al., 1977) and search-
ing in preprocessed suffix trees for potential occur-
rences spanning the degenerate positions in the text.
The suffix trees overhead and their use in recursive
extension of the potential occurrences is the possible
reason of high search time.
Later, a practical implementation of a pattern
matching algorithm in Elastic Degenerate Strings was
given in (Grossi et al., 2017). The proposed EDSM
algorithm works in O(nm
2
+ N) time. Border ta-
bles (Crochemore et al., 2007) are used to reveal pat-
tern prefixes and a suffix tree S T
P
of the pattern to
reveal the pattern factors. Potential occurrences are
stored in lists and extended over segment borders in
left-to-right fashion. KMP is used in elements longer
than the pattern length m. The authors also present a
bit-vector version of the algorithm where the potential
occurrences are stored within a bit register. The bit-
version of the algorithm EDSM-BV works in O(Nd
m
w
e)
time, where w is a size of the computer word (basi-
cally, size of the computer registers, usually w = 64
bits for standard computers). The practical imple-
mentation of the bit-version dominates especially for
longer patterns where the original version of the al-
gorithm suffers from the O(m
2
) factor. Pissis and
Retha (Pissis and Retha, 2018) presented their on-line
algorithm for a set of patterns of total length M. Their
algorithm achieves O(Nd
M
w
e) time with preprocess-
ing time and space O(M).
Recently, Cisłak et al. (Cisłak et al., 2018) pre-
sented SOPanG algorithm solving Elastic Degenerate
String matching problem in the same upper bound
O(Nd
m
w
e), however, achieving an order of magni-
tude better time than (Grossi et al., 2017) in prac-
tical experiments. The algorithm performs standard
Shift-Or pattern matching on every element of a
segment. Resulting search registers (after processing
an element) are merged across all elements in each
EDS segment. The register then stores all prefixes of
the potential occurrences and the algorithm starts to
process next EDS segment.
Aoyama et al. (Aoyama et al., 2018) proposed an-
other EDSM on-line algorithm based on efficient sum
set computation using Fast Fourier Transform (Coo-
ley and Tukey, 1965). The proposed algorithm
improved the time complexity to O(nm
mlog m +
N). Bernardini at al. (Bernardini et al., 2019) de-
signed non-combinatorial O(nm
1.381
+ N)-time algo-
rithm combining three basic ingredients: a string peri-
odicity argument, Fast Fourier Transform and fast ma-
trix multiplication. In addition, a variant of EDSM al-
gorithm allowing errors was presented in (Bernardini
et al., 2017). Unfortunately, we could not compare
BNDM-EDS with the last three algorithms mentioned in
this paragraph due to missing public implementation.
1.1 Our Contribution
We propose a novel on-line algorithm BNDM-EDS solv-
ing EDSM in O(Nmd
m
w
e) worst-case time. To the
knowledge of the authors, BNDM-EDS is the first back-
ward pattern matching algorithm optimized for EDS.
The algorithm exploits standard BNDM algorithm for
single elements of EDS text. Shift-And (D
¨
om
¨
olki,
1964) algorithm is used for crossing the borders be-
tween two consecutive EDS segments to reveal the
occurrences starting in one EDS segment and contin-
uing in the following EDS segment. BNDM-EDS is de-
signed to be efficient on real-world data. For small
patterns (m w), an alphabet size σ, and variability
v, the average time complexity is O(
N(1v)log
σ
m
m
) +
O(Nvm(1 v)) + O(Nvβδ), where β = O(1) repre-
sents the average number of elements within a degen-
erate EDS segment and δ < m represents the average
length of an element within a degenerate EDS seg-
ment. The assumptions about β and δ are made based
on their values in real datasets. Moreover, BNDM-EDS
beats all its competitors in experiments performed on
real genomic data.
1.2 Roadmap
The rest of the paper is organized as follows. We de-
fine basic notions in Section 2. In Section 3, we pro-
vide exact definition and description of BNDM-EDS al-
gorithm. Section 4 summarizes experimental results
performed on synthetic and real genomic data. We
give the conclusion and some ideas for future work in
Section 5.
Backward Pattern Matching on Elastic Degenerate Strings
51
2 BASIC NOTIONS
Let x = x
1
x
2
..x
n
be a string composed of single sym-
bols x
i
of a finite ordered alphabet Σ. The length
of the string x is n = |x|. The size of the alpha-
bet Σ is σ = |Σ| = O(1). The start position i and
the length j define a factor (or substring) denoted
by x
i, j
= x
i
..x
i+ j1
. Factor with i = 0 is called pre-
fix and a factor with i + j 1 = n is called suffix of the
string x. The empty string (of length 0) is denoted by
ε.
Definition 1. Elastic Degenerate String (Grossi
et al., 2017)
An Elastic Degenerate String (EDS)
˜
X =
˜
X[0]
˜
X[1]...
˜
X[n 1] of length n on an alphabet
Σ is a finite sequence of n degenerate symbols. Every
degenerate symbol
˜
X[i], for all 0 i < n, is a non-
empty set of strings
˜
X[i][ j], with 0 j < |
˜
X[i]|, where
every
˜
X[i][ j] is a deterministic string on an alphabet
Σ. The degenerate symbols are also called segments.
The strings contained in the set corresponding to a
segment are also called elements.
N is the total size of an EDS and is defined as
N =
n1
i=0
|
˜
X[i]|−1
j=0
|
˜
X[i][ j]|
Example 1. Elastic Degenerate String.
Suppose the following multiple sequence align-
ment (MSA):
G C A A C G G G T A - - A C T
| | | | | | | | | | | | | | |
G C A A C G G G T A T A A C T
| | | × | × | | | | | | | | |
G C A C C T G G - - - - A C T
The resulting Elastic Degenerate String (EDS)
˜
T
has the following properties:
˜
T =
{
GCA
}
A
C
{
C
}
G
T
{
GG
}
TA
TATA
ε
{
ACT
}
N =
6
i=0
|
˜
T [i]|−1
j=0
|
˜
T [i][ j]|= 3+2+1 +2+2+6 +3 = 19
The length of
˜
T corresponds to the number of seg-
ments which is n = 7. The size of
˜
T corresponds to
the sum of the length of all elements which is N = 19.
A solid string Y is a finite sequence of solid sym-
bols. It means that every position i of the solid string
Y is represented just by one symbol of alphabet Σ.
A solid string Y matches (Grossi et al., 2017) Elas-
tic Degenerate String
˜
X[0][1]...[n 1] of length n > 1
(denoted by Y
˜
X), if and only if string Y can be
decomposed into factors y
0
...y
n1
, y
i
Σ
, such that:
1. there exists a string s
˜
X[0] such that a suffix of s
is y
0
6= ε;
2. if n > 2, there exists s
˜
X[i], for all 1 i n 2,
such that s = y
i
;
3. there exists a string s
˜
X[n 1] such that a prefix
of s is y
n1
6= ε.
A string Y has an occurrence in an Elastic Degen-
erate String
˜
T ending at position j if there exists a
position i < j such that Y
˜
T [i]...T [ j], or if there ex-
ists an element s
˜
T [ j] such that Y occurs in s. The
string matching in Elastic Degenerate Strings is for-
mally defined as Problem 1.
Problem 1. Elastic Degenerate String Matching
(EDSM) (Grossi et al., 2017) Given a string P of
length m and an Elastic Degenerate String
˜
T of length
n and size N m, find all positions j in
˜
T where at
least one occurrence of P ends.
Example 2. Elastic Degenerate String Matching.
Consider pattern P = AAC of length m = 3. Next,
consider the following Elastic Degenerate String
˜
T of
length n = 7 and size N = 19.
˜
T =
{
GCA
}
A
C
{
C
}
G
T
{
GG
}
TA
TATA
ε
{
ACT
}
The occurrences are marked with red color. The
first occurrence of P starts at position 0 and it ends
at position 2. In terms of the first occurrence the pat-
tern P can be (according to the previous description)
decomposed into the following three factors: p
0
= A,
p
1
= A and p
2
= C. The second occurrence of P starts
at position 5 and it ends at position 6. The second oc-
currence can be decomposed into the two following
factors: p
0
= A and p
1
= AC. Notice that the pat-
tern factor p
0
is a suffix of two elements
˜
T [5][0] and
˜
T [5][1]. Despite of that, we account only one occur-
rence since by EDSM definition we accept only dis-
tinct positions in EDS .
3 ALGORITHM
The idea behind BNDM-EDS is simple: to exploit
shifting capabilities of backward pattern matching
algorithms, specifically “good prefix shift” heuris-
tic (Melichar et al., 2005). We chose BNDM (Navarro
and Raffinot, 1998) algorithm because of its sim-
plicity and its bit-parallel approach. We adapted
BNDM for EDS matching problem in the follow-
ing way. The elements
˜
T [i][ j] of each segment
˜
T [i]
are searched independently using a standard variant
of BNDM. Intermediate results (i.e. the discovered
pattern prefixes occurring in the last m 1 bases of
each element
˜
T [i][ j]) must be merged at the end of
BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms
52
each segment
˜
T [i]. The registers storing the inter-
mediate results are transformed to be suitable for the
Shift-And algorithm that is used to process the bor-
ders of the two following segments
˜
T [i] and
˜
T [i + 1].
The Shift-And algorithm initialized with the inter-
mediate results from the previous segment processes
up to the first m bases of each element from the fol-
lowing segment
˜
T [i + 1]. The point is to discover all
potential occurrences spanning two consecutive seg-
ments. The Shift-And algorithm was chosen since
BNDM (generally any backward algorithm) could be
forced to process O(
m1
k=1
|
˜
T [i k]|) different strings
when crossing the borders of the segments right-to-
left for pathological EDS (i.e. EDS containing many
consecutive segments with one-base-long elements).
Algorithm 1 describes the idea of BNDM-EDS in
more detail. The algorithm assumes pattern length
smaller than computer word size (m w) for simplic-
ity. The PREPROCESS function performs preprocess-
ing of the pattern P of length m. The first while cycle
traverses the pattern from right to left and stores the
corresponding byte values (with active bits set to 1) to
byte vector B of length σ that is necessary for BNDM
part of the algorithm. Byte vector R (of length m)
stores active prefixes corresponding to BNDM last
position. The second while cycle traverses the pattern
from left to right and stores the corresponding byte
values to byte vector S that is used for the Shift-And
part of the algorithm. The active bits correspond to
the positions of the indexed character in the pattern P
in both cases (vector B and vector S).
The SEARCH function iterates over individual
EDS segments and their elements (see line 15 and
line 17). At the beginning of each element e,
Shift-And algorithm is performed for the first up to
m bases (see lines 18 23). Elements shorter than m
bases continue to line 45 where intermediate results
(found pattern prefixes) are merged to the register R
2
.
BNDM algorithm is performed on elements longer
than m bases starting from the position m. The search
register D is initialized with all bits set to 1 and it
stores the active bits corresponding to the discovered
pattern factors. The condition at line 31 defines that
either pattern prefix (subcondition at line 32) or an oc-
currence (else statement at line 34) was found. Every
discovered pattern prefix is also transformed to the
corresponding bit position and stored to D
2
register
(see line 33) that is later used for the Shift-And algo-
rithm performed at the end of each element e. BNDM
part of the algorithm is finished when j + m 1 ex-
ceeds the element boundary (see the while condition
at line 26). At last, the Shift-And algorithm starts at
the position j + m last and the search register is ini-
tialized with the active prefixes discovered in the last
Algorithm 1: BNDM-EDS preprocessing and searching phase.
1: function PREPROCESS(P, m)
2: B[1..σ] 0; S[1..σ] 0; R [1..m] 0;
3: i m; F 1;
4: while i > 0 do Building BNDM mask vector
5: B[P
i,1
] B[P
i,1
] | F;
6: R [i] F;
7: F F 1; i i 1;
8: i 1; j 1;
9: while i m do Building Shift-And mask vector
10: S[P
i,1
] S [P
i,1
] | j;
11: i i + 1;
12: j j 1;
13: function SEARCH(
˜
T , P, m)
14: R
1
0; R
2
0; D
2
0; count 0; F 10
m1
15: for each segment
˜
T [i] of
˜
T do
16: R
1
R
2
; R
2
0;
17: for each element e
˜
T [i] do
18: j 1; D R
1
;
19: while j < m & j < |e| do Shift-And part for the first
up to m bases
20: D ((D 1) | 1) & S[e
j,1
];
21: if D & F then
22: count count + 1;
23: j j +1;
24: if j < m then go to elementEnd;
25: j 1;
26: while j + m 1 < |e| do BNDM part
27: D
2
0; last m; i m 1; D 0;
28: while i 0 & D 6= 0 do
29: D D & B [e
j+i,1
];
30: i i 1;
31: if (D & F) 6= 0 then
32: if i 0 then
33: last i + 1; D
2
D
2
| R [last];
34: else
35: count count + 1;
36: D D 1;
37: j j +last;
38: j j +m last; D D
2
;
39: while j < |e| & D 6= 0 do Shift-And part for the last
up to m bases
40: D ((D 1) | 1) & S[e
j,1
];
41: if D & F then
42: count count + 1;
43: j j +1;
44: elementEnd:
45: R
2
R
2
| D;
46: return count;
iteration of BNDM (see line 38).
Example 3. BNDM-EDS searching. Figure 1
demonstrates individual steps of BNDM-EDS algo-
rithm. Consider EDS fragment
˜
T and pattern P =
AATAAATA. Each step depicts: (i) the pattern (green
color), (ii) the processed element of
˜
T (blue under-
line), (iii) search register D together with active bits
Backward Pattern Matching on Elastic Degenerate Strings
53
˜
T = ...CAATAAATAA
ATA
A
TA...
j + m 1
last = 7
last = 4
1
0
1
1
1
0
1
1
0
0
0
0
0
0
0
0
A
T
A
A
A
T
A
A
B
A
C
1
0
1
1
1
0
1
1
0
0
0
0
0
0
0
0
G
0
1
0
0
0
1
0
0
T
0
1
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
D :
A
T
A
A
A
T
A
A
P
rev
lsb
msb
lsb
msb
(a) Step 1: BNDM.
˜
T = ...CAATAAATAA
ATA
A
TA...
j + m 1
last = 7, D
2
= h0000 0001i
last = 4, D
2
= h0000 1001i
1
0
1
1
1
0
1
1
0
0
0
0
0
0
0
0
A
T
A
A
A
T
A
A
B
A
C
0
0
0
0
0
0
0
0
G
0
1
0
0
0
1
0
0
T
1
0
1
1
1
0
1
1
0
1
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
1
0
0
0
A
T
A
A
A
T
A
A
P
rev
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
D :
lsb
msb
lsb
msb
(b) Step 2: BNDM.
˜
T = ...CAATAAATAA
ATA
A
TA...
j
1
1
0
1
1
1
0
1
0
0
0
0
0
0
0
0
A
A
T
A
A
A
T
A
S
A
C
0
0
0
0
0
0
0
0
G
0
0
1
0
0
0
1
0
T
1
1
0
0
1
0
0
0
A
A
T
A
A
A
T
A
P
D :
lsb
msb
msb
lsb
(c) Step 3: Shift-And (end of element).
˜
T = ...CAATAAATAA
ATA
A
TA...
j
1
1
0
1
1
1
0
1
0
0
0
0
0
0
0
0
A
A
T
A
A
A
T
A
S
A
C
0
0
0
0
0
0
0
0
G
0
0
1
0
0
0
1
0
T
1
1
0
0
1
0
0
0
A
A
T
A
A
A
T
A
P
1
1
0
0
0
1
0
0
0
0
1
0
0
0
1
0
1
0
0
1
0
0
0
1
D :
lsb
msb
lsb
msb
(d) Step 4: Shift-And (beginning of element).
˜
T = ...CAATAAATAA
ATA
A
TA...
j
1
1
0
1
1
1
0
1
0
0
0
0
0
0
0
0
A
A
T
A
A
A
T
A
S
A
C
0
0
0
0
0
0
0
0
G
0
0
1
0
0
0
1
0
T
1
1
0
0
1
0
0
0
A
A
T
A
A
A
T
A
P
1
1
0
0
0
1
0
0
D :
lsb
msb
lsb
msb
(e) Step 5: Shift-And (beginning of element).
˜
T = ...CAATAAATAA
ATA
A
TA...
j
1
1
0
1
1
1
0
1
0
0
0
0
0
0
0
0
A
A
T
A
A
A
T
A
S
A
C
0
0
0
0
0
0
0
0
G
0
0
1
0
0
0
1
0
T
A
A
T
A
A
A
T
A
P
1
1
0
1
0
1
0
1
0
0
1
0
0
0
1
0
1
0
0
1
0
0
0
1
D :
lsb
msb
lsb
msb
(f) Step 6: Shift-And (beginning of element).
Figure 1: BNDM-EDS: Searching P = AATAAATA in EDS.
(red color), and (iv) preprocessed byte vectors B and
S. Search register D is depicted for each step. In case
of BNDM, the direction is right to left. In case of the
Shift-And, the direction is left to right. The least
significant bit (lsb) is on the top and the most signif-
icant bit (msb) is at the bottom. For the sake of clar-
ity, the example starts inside of the first segment with
BNDM part of the algorithm (line 25 of Algorithm 1).
In the first step, BNDM discovers the longest prefix
AATA which implies the shift by last = 4 bases. In the
second step, BNDM reports an occurrence (see the
diagonal of red bits in the search register D). There
are two other potential occurrences since two pattern
prefixes A and AATA are discovered. These prefixes
are also stored in register D
2
= h0000 1001i. The
value of D
2
register (shifted by one to the left) be-
comes the initial value of search register D in the
next step. In the third step, the Shift-And algo-
rithm processes the end of the element. In the fourth
step, the Shift-And processes the element ATA and
discovers the second occurrence. The initial value
of search register D is inherited from the previous
segment. In the fifth step, element A is processed
by the Shift-And. Register D values from step 4
and 5 are merged and used as the initial value in the
next step. The beginning of the last depicted element
(TA...) is processed by the Shift-And in the last
step and the last occurrence is reported.
BNDM-EDS time and space complexity can be de-
duced from Algorithm 1. The preprocessing step
consists of two while cycles iterating over bases of
the searched pattern P. This implies worst-case time
O(m) for preprocessing phase. The needed space
comprises of vectors B, S and R . For patterns smaller
that the size of the computer word (m w), this im-
plies O(σ + m) space. This space can further be re-
duced to O(m) for small alphabets where σ m.
BNDM-EDS search function consists of three steps
performed on all EDS elements. For each element, the
first up to m bases and the last up to m bases are pro-
cessed by the Shift-And. Assuming (based on values
of real datasets) constant number of elements within
one EDS segment, this implies processing up to 2 ×
BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms
54
n ×m = O(nm) bases for small patterns (m w) and
O(nmd
m
w
e) for large patterns. These two Shift-And
steps are represented as while cycles at lines 19 and
39 of Algorithm 1. The third step is BNDM algorithm
that needs to process all EDS elements (all bases).
This implies processing O(Nm) bases for small pat-
terns (m w) and O(Nmd
m
w
e) for large patterns in the
worst case. This step is represented as a while cycle
at line 26 of Algorithm 1. Finally, the total worst-case
time is O(Nm + nm) = O(Nm) for small patterns and
O(Nmd
m
w
e+nmd
m
w
e) = O(Nmd
m
w
e) for large patterns.
We need to define a few parameters to derive the
average time complexity for BNDM-EDS. Let v [0, 1]
be the variability of a genomic sequence of length N
bases. It implies that every d
1
v
e-th base is degener-
ate. Furthermore, let β = O(1) represents the average
number of elements within a degenerate EDS segment
and let δ < m represents the average length of an el-
ement within a degenerate EDS segment. The num-
ber of EDS segments can be defined by the number
of bases (which approximately equals the EDS size
N) and the variability v: n 2Nv. The probability of
solid segments of length k bases is defined as: p
k
=
v(1 v)
k
. Thus, we can state the number of k-base-
long solid segments as:
n
2
p
k
= Nvp
k
= Nv
2
(1 v)
k
.
BNDM part of the algorithm processes only solid seg-
ments (since δ < m) that are longer than m bases.
Shift-And part of the algorithm processes: (i) all
solid segments, (ii) all degenerate segments and their
elements.
BNDM average time complexity is claimed to
be the same as the BDM average time complex-
ity (Navarro and Raffinot, 1998). Crochemore
and Rytter proved BDM average time complex-
ity O(
nlog
σ
m
m
) in (Crochemore and Rytter, 1994).
BNDM part of the algorithm processes Nv
2
(1 v)
k
solid segments for k [m + 1, ). This implies
Nv
2
k=m+1
k(1 v)
k
processed bases.
k=m+1
k(1
v)
k
can by upper-bounded by
k=1
k(1 v)
k
=
1v
v
2
which means that BNDM part needs to process N(1
v) bases. Finally, BNDM part of the algorithm works
in O(
N(1v)log
σ
m
m
) time.
Shift-And part of the algorithm also processes
Nv
2
(1 v)
k
solid segments for k [1, ). In ev-
ery solid segment, O(m) bases are processed which
is Nv
2
m
k=1
(1 v)
k
= Nvm(1 v) processed bases
overall. Next, Shift-And processes Nv degener-
ate segments which is Nvβδ processed bases over-
all. Finally, Shift-And part of the algorithm
works in O(Nvm(1 v)) + O(Nvβδ) time. We con-
clude that BNDM-EDS average time complexity is
O(
N(1v)log
σ
m
m
) + O(Nvm(1 v)) + O(Nvβδ).
4 EXPERIMENTS
We present experimental results comparing BNDM-EDS
with other baseline algorithms performed on EDS
data. Particularly, we compare our algorithm with
ElDeS (Iliopoulos et al., 2017), EDSM-BV (Grossi
et al., 2017) and SOPanG (Cisłak et al., 2018).
All tested algorithms are optimized for EDS data
and were implemented in C/C++ programming lan-
guage. BNDM-EDS implementation is available at
https://github.com/stringology-prague/eds search.
We carried out our tests on Intel
R
Core
TM
i7-
4740 3.40 GHz, 24 GB RAM, Linux Gentoo kernel
4.19.57. We used compiler gcc version 8.3.0 with
compiler optimization -O3. The tested patterns were
chosen randomly from the input text and their length
m was ranging from 8 to 32. All experiments were
run in loop 100 times and we report the mean running
time in seconds. All reported times represent mea-
sured user time + sys time. The reported times al-
ways include any necessary pattern preprocessing and
exclude the time necessary to read the data in mem-
ory.
We performed our tests on two different data sets.
We include synthetic EDS data from (Grossi et al.,
2017) to achieve better comparability. The synthetic
data set includes 5 files with n ranging from 100000
to 1 600000. The percentage of degenerate positions
is 10%. The number of elements within one segment
is random and is bounded by 10. The length of each
element at the degenerate position is random and it
is bounded by 10 bases. Table 1 illustrates the re-
sults achieved by the algorithms performed on syn-
thetic data set. The files of the data set are character-
ized by the following parameters: EDS length n, EDS
size N and variability v. The measured times prove the
dominance of BNDM-EDS and SOPanG that is given by
their simplicity. BNDM-EDS achieved slightly better re-
sults than SOPanG and this difference is growing (up to
15%) with growing pattern length. This performance
gain is caused by BNDM shifting across the searched
text at increased rates for longer patterns. The mea-
sured results provide two observations: (i) all com-
pared algorithms depend on the size of EDS by factor
O(N); (ii) ElDeS and EDSM-BV require large portion
of time for their preprocessing phase. The second ob-
servation is analyzed in more detail in Figure 2. We
measured ratio of preprocessing time and total time
of single algorithms. The results are given in % of the
total time. Figure 2 shows that ElDeS and EDSM-BV
need significantly larger portion of the preprocessing
time in comparison to BNDM-EDS and SOPanG. More-
over, the measured ratio of preprocessing time is not
decreasing with the growing file size for ElDeS.
Backward Pattern Matching on Elastic Degenerate Strings
55
Table 1: Search time in seconds of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for synthetic data set.
File File params Pattern length BNDM-EDS SOPanG EDSM-BV ELDeS
100000
n = 100 000
N = 361 546
v = 0.095410
m = 8 0.00118260 0.00113000 0.02016180 1.80522000
m = 16 0.00100830 0.00121400 0.02511590 1.67540000
m = 24 0.00099149 0.00113000 0.02863710 1.70129000
m = 32 0.00098746 0.00113000 0.02898240 1.76652000
200000
n = 200 000
N = 725 343
v = 0.095325
m = 8 0.00237368 0.00225000 0.03969590 3.75465000
m = 16 0.00202329 0.00224500 0.04990100 3.94397000
m = 24 0.00198232 0.00225400 0.05714240 3.60240000
m = 32 0.00197887 0.00225400 0.05706520 3.81179000
400000
n = 400 000
N = 1 443 453
v = 0.095015
m = 8 0.00471225 0.00465500 0.07850180 8.18302000
m = 16 0.00402697 0.00457400 0.09871310 7.81520000
m = 24 0.00394175 0.00460800 0.11369500 7.90823000
m = 32 0.00393171 0.00461000 0.11334600 7.92150000
800000
n = 800 003
N = 2 887 787
v = 0.095186
m = 8 0.00944679 0.01000000 0.15683100 16.57840000
m = 16 0.00804698 0.00915000 0.19761300 16.06050000
m = 24 0.00788773 0.00916900 0.22609000 15.96310000
m = 32 0.00786342 0.00915300 0.22549400 16.76730000
1600000
n = 1 600 000
N = 5 769 390
v = 0.095174
m = 8 0.01884609 0.01831300 0.31360500 33.98690000
m = 16 0.01611825 0.01824100 0.39423000 35.46980000
m = 24 0.01581450 0.01827100 0.45437200 33.81830000
m = 32 0.01576465 0.01827700 0.45616800 33.92970000
0.0001
0.001
0.01
0.1
1
10
100000 200000 400000 800000 1600000
% preprocessing time (log10 scale)
Files
BNDM-EDS
SOPanG EDSM-BV ELDeS
Figure 2: Preprocessing time / Total time ratio for syn-
thetic data set.
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.008
0.009
0.01
0.011
0.01 0.03 0.05 0.10 0.15 0.20 0.25 0.30 0.40 0.50
Time [s]
Probability of degenerate segment
BNDM-EDS
SOPanG
Figure 3: BNDM-EDS and SOPanG comparison on files
with different variation.
The other data set is composed of real-world
human genomic data obtained from 1000 Genomes
Projects (Consortium, The 1000 Genomes Project,
2011) (Phase 3 of the project). We downloaded 24
vcf files representing 24 human chromosomes (same
population of 2 504 individuals for all chromosomes)
and transformed them into EDS. The percentage of
degenerate positions is approximately 3% on average
and the average length of one element (within a de-
generate position) is 2 bases. We selected representa-
tive chromosomes/files in terms of their size and vari-
abilty and we show the measured times achieved on
these files in Table 2. Furthermore, Figure 4 provides
the general comparison of individual algorithms for
all tested chromosomes/files. Due to large difference
of measured times, the vertical axis is in log
10
scale.
We can observe stronger BNDM-EDS dominance. This
is given by lower percentage of degenerate positions
which implies that Shift-And part of the algorithm
is not applied so often. This reduces the overhead of
swapping between two algorithms and enables better
utilization of BNDM shifting capability (“good prefix
shift” heuristic). Particularly, this can be seen in case
of chromosome Y where the variability among indi-
viduals is minimal and the time difference between
BNDM-EDS and SopANG increased. BNDM-EDS achieves
43% time improvement in comparison to SopANG for
chromosome Y and pattern length m = 8. However,
for longer patterns m = 32, BNDM-EDS achieves 81%
time improvement. Generally, the time achieved for
single chromosomes is given especially by their size
that corresponds to the N element in the time com-
BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms
56
Table 2: Search time in seconds of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for real data set.
File File params Pattern length BNDM-EDS SOPanG EDSM-BV ELDeS
Chrom1
n = 248 458 108
N = 255 872 195
v = 0.025814
m = 8 0.44913196 0.53431100 7.35913000 1695.14000000
m = 16 0.37187386 0.53086700 8.80502000 1590.15000000
m = 24 0.35176826 0.53060000 9.35136000 1590.61000000
m = 32 0.35678421 0.53099200 9.68736000 1688.14000000
Chrom7
n = 158 984 673
N = 164 380 710
v = 0.029508
m = 8 0.31989103 0.34962400 5.32563000 1070.06000000
m = 16 0.30178564 0.34705100 6.40293000 1077.33000000
m = 24 0.24784241 0.34708700 6.84947000 1035.08000000
m = 32 0.25140232 0.34661600 7.09477000 1068.89000000
Chrom22
n = 50 713 670
N = 52 004 687
v = 0.021605
m = 8 0.07712649 0.10663000 1.30624000 304.21900000
m = 16 0.06338260 0.10615300 1.47754000 298.53900000
m = 24 0.05970330 0.10614700 1.62679000 277.77800000
m = 32 0.05970250 0.10603900 1.63156000 315.12800000
ChromX
n = 155 619 941
N = 157 232 908
v = 0.009292
m = 8 0.22626265 0.30025500 1.96430000 963.23200000
m = 16 0.16053621 0.29899700 2.35361000 903.40400000
m = 24 0.14193202 0.29896900 2.48494000 928.85500000
m = 32 0.14133996 0.29900700 2.55960000 943.67600000
ChromY
n = 26 622 695
N = 26 629 512
v = 0.000245
m = 8 0.02775240 0.04873900 0.08626860 134.60100000
m = 16 0.01600782 0.04860800 0.10365000 142.23600000
m = 24 0.01156795 0.04859900 0.10476000 136.01800000
m = 32 0.00933492 0.04861900 0.10436000 134.60100000
(a) m = 8.
0.01
0.1
1
10
100
1000
10000
Chrom1
Chrom2
Chrom3
Chrom4
Chrom5
Chrom6
Chrom7
Chrom8
Chrom9
Chrom10
Chrom11
Chrom12
Chrom13
Chrom14
Chrom15
Chrom16
Chrom17
Chrom18
Chrom19
Chrom20
Chrom21
Chrom22
ChromX
ChromY
Time [s] (log10 scale)
Files
BNDM-EDS
SopANG EDSM-BV ELDeS
(b) m = 16.
0.01
0.1
1
10
100
1000
10000
Chrom1
Chrom2
Chrom3
Chrom4
Chrom5
Chrom6
Chrom7
Chrom8
Chrom9
Chrom10
Chrom11
Chrom12
Chrom13
Chrom14
Chrom15
Chrom16
Chrom17
Chrom18
Chrom19
Chrom20
Chrom21
Chrom22
ChromX
ChromY
Time [s] (log10 scale)
Files
BNDM-EDS
SopANG EDSM-BV ELDeS
(c) m = 24.
0.01
0.1
1
10
100
1000
10000
Chrom1
Chrom2
Chrom3
Chrom4
Chrom5
Chrom6
Chrom7
Chrom8
Chrom9
Chrom10
Chrom11
Chrom12
Chrom13
Chrom14
Chrom15
Chrom16
Chrom17
Chrom18
Chrom19
Chrom20
Chrom21
Chrom22
ChromX
ChromY
Time [s] (log10 scale)
Files
BNDM-EDS
SopANG EDSM-BV ELDeS
(d) m = 32.
0.001
0.01
0.1
1
10
100
1000
10000
Chrom1
Chrom2
Chrom3
Chrom4
Chrom5
Chrom6
Chrom7
Chrom8
Chrom9
Chrom10
Chrom11
Chrom12
Chrom13
Chrom14
Chrom15
Chrom16
Chrom17
Chrom18
Chrom19
Chrom20
Chrom21
Chrom22
ChromX
ChromY
Time [s] (log10 scale)
Files
BNDM-EDS
SopANG EDSM IKP
Figure 4: Search time of BNDM-EDS, ElDeS, EDSM-BV and SOPanG for real data set.
Backward Pattern Matching on Elastic Degenerate Strings
57
plexity of single algorithms.
Since BNDM-EDS gain in comparison to SopANG
clearly depends on variability in the processed EDS,
we included another experiment comparing both al-
gorithms on synthetic data with different probability
of degenerate EDS segment. Figure 3 depicts the
achieved search time in seconds. The probability of
degenerate EDS segment in single files starts from
0.01 and continues up to 0.5. The size of the files
is 1 600000 bases and we searched for a randomly
chosen pattern of length m = 16. The results show
that starting from 0.05 probability of degenerate seg-
ment, the length of processed elements (of the solid
segments) is shorter than the pattern length m = 16.
This implies that BNDM part of BNDM-EDS is not uti-
lized for most of the processed elements.
5 CONCLUSION AND FUTURE
WORK
We proposed BNDM-EDS algorithm which is the first
backward pattern matching algorithm for Elastic De-
generate Strings. Its worst-case time is bounded
by O(Nmd
m
w
e). Moreover, for small patterns (m
w), BNDM-EDS achieves average time complexity
O(
N(1v)log
σ
m
m
) + O(Nvm(1 v)) + O(Nvβδ) which
is optimal. Our goal was to design a simple (easy-to-
implement) algorithm achieving fast search times on
real-world data. This expectation was confirmed by
experiments on real-world data set. BNDM-EDS proves
its superiority especially on data with lower degener-
ate rate and for longer patterns.
We plan to extend BNDM-EDS for protein alpha-
bet in future work. Furthermore, we want to focus
on processing more general variants of Elastic De-
generate Strings, such as recursive Elastic Degenerate
Strings or colored Elastic Degenerate Strings (allow-
ing to map elements to individuals of the sequenced
population).
ACKNOWLEDGEMENTS
The research was partially supported by the Czech
Science Foundation as project No. 19-20759S.
REFERENCES
Aoyama, K., Nakashima, Y., I, T., Inenaga, S., Bannai, H.,
and Takeda, M. (2018). Faster Online Elastic Degen-
erate String Matching. In Navarro, G., Sankoff, D.,
and Zhu, B., editors, Annual Symposium on Combi-
natorial Pattern Matching (CPM 2018), volume 105
of Leibniz International Proceedings in Informatics
(LIPIcs), pages 9:1–9:10, Dagstuhl, Germany. Schloss
Dagstuhl–Leibniz-Zentrum fuer Informatik.
Bernardini, G., Gawrychowski, P., Pisanti, N., Pissis, S. P.,
and Rosone, G. (2019). Even faster elastic-degenerate
string matching via fast matrix multiplication. In 46th
International Colloquium on Automata, Languages,
and Programming, ICALP 2019, July 9-12, 2019, Pa-
tras, Greece., pages 21:1–21:15.
Bernardini, G., Pisanti, N., Pissis, S. P., and Rosone, G.
(2017). Pattern matching on elastic-degenerate text
with errors. In Fici, G., Sciortino, M., and Ven-
turini, R., editors, String Processing and Information
Retrieval, pages 74–90, Cham. Springer International
Publishing.
Church, D. M. et al. (2015). Extending reference assembly
models. Genome Biology, 16(13).
Cisłak, A., Grabowski, S., and Holub, J. (2018). Sopang:
online text searching over a pan-genome. Bioinfor-
matics, page bty506.
Consortium, T. C. P.-G. (2016). Computational pan-
genomics: status, promises and challenges. Briefings
in Bioinformatics, 19(1):118–135.
Consortium, The 1000 Genomes Project (2011). A map
of human genome variation from population-scale se-
quencing. Nature, 473:544 EP –. Corrigendum.
Consortium, The UK10K (2015). The uk10k project identi-
fies rare variants in health and disease. Nature, 526:82
EP –.
Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the
machine calculation of complex fourier series. Math-
ematics of Computation, 19(90):297–301.
Crochemore, M., Hancart, C., and Lecroq, T. (2007). Algo-
rithms on Strings. Cambridge University Press, New
York, NY, USA.
Crochemore, M. and Rytter, W. (1994). Text algorithms.
Oxford University Press.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks,
E., DePristo, M. A., Handsaker, R. E., Lunter, G.,
Marth, G. T., Sherry, S. T., McVean, G., and Durbin,
R. (2011). The variant call format and vcftools. Bioin-
formatics, 27(15):2156–2158.
Dilthey, A. et al. (2015). Improved genome inference in
the mhc using a population reference graph. Nature
Genetics, 47:682.
D
¨
om
¨
olki, B. (1964). An algorithm for syntactical analy-
sis. Computational Linguistics, 3:29–46. Hungarian
Academy of Science, Budapest.
Grossi, R., Iliopoulos, C. S., Liu, C., Pisanti, N., Pissis,
S. P., Retha, A., Rosone, G., Vayani, F., and Versari, L.
(2017). On-line pattern matching on similar texts. In
28th Symposium on Combinatorial Pattern Matching,
CPM 2017, July 4-6, 2017, Warsaw, Poland, pages
9:1–9:14.
Iliopoulos, C. S., Kundu, R., and Pissis, S. P. (2017). Ef-
ficient pattern matching in elastic-degenerate texts.
In Drewes, F., Mart
´
ın-Vide, C., and Truthe, B., ed-
itors, Language and Automata Theory and Applica-
BIOINFORMATICS 2021 - 12th International Conference on Bioinformatics Models, Methods and Algorithms
58
tions, pages 131–142, Cham. Springer International
Publishing.
Knuth, D. E., Morris, J. H., and Pratt, V. R. (1977). Fast
Pattern Matching in Strings. SIAM Journal on Com-
puting, 6(2):323–350.
Maciuca, S. et al. (2016). A natural encoding of genetic
variation in a burrows-wheeler transform to enable
mapping and genome inference. In Algorithms in
Bioinformatics, pages 222–233, Cham. Springer In-
ternational Publishing.
Marschall, T. (2018). Computational pan-genomics: status,
promises and challenges. Briefings in Bioinformatics,
19(1):118–135.
Melichar, B., Holub, J., and Polcar, J. (2005). Text Search-
ing Algorithms.
Na, J. C., Kim, H., Park, H., Lecroq, T., L
´
eonard, M.,
Mouchard, L., and Park, K. (2016). Fm-index of
alignment: A compressed index for similar strings.
Theoretical Computer Science, 638:159 170. Pattern
Matching, Text Data Structures and Compression.
Navarro, G. and Pereira, A. O. n. (2016). Faster compressed
suffix trees for repetitive collections. J. Exp. Algorith-
mics, 21:1.8:1–1.8:38.
Navarro, G. and Raffinot, M. (1998). A bit-parallel ap-
proach to suffix automata: Fast extended string match-
ing. In Proceedings of the 9th Annual Symposium
on Combinatorial Pattern Matching, CPM ’98, pages
14–33, London, UK, UK. Springer-Verlag.
on Biochemical Nomenclature (CBN), I.-I. C. (1971). Ab-
breviations and symbols for nucleic acids, polynu-
cleotides and their constituents. Journal of Molecular
Biology, 55(3):299 – 310.
Pissis, S. P. and Retha, A. (2018). Dictionary Match-
ing in Elastic-Degenerate Texts with Applications in
Searching VCF Files On-line. In D’Angelo, G., ed-
itor, 17th International Symposium on Experimental
Algorithms (SEA 2018), volume 103 of Leibniz Inter-
national Proceedings in Informatics (LIPIcs), pages
16:1–16:14, Dagstuhl, Germany. Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik.
Proch
´
azka, P. and Holub, J. (2014). Compressing similar
biological sequences using fm-index. In Data Com-
pression Conference, DCC 2014, Snowbird, UT, USA,
26-28 March, 2014, pages 312–321.
Sir
´
en, J. (2016). Indexing variation graphs. CoRR,
abs/1604.06605.
Backward Pattern Matching on Elastic Degenerate Strings
59