Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings

Morihiro Hayashida¹ and Hitoshi Koyano²

¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan

² Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8397, Japan

Keywords:

Median String, Center String, Integer Linear Programming.

Abstract:

We address the problems of finding median and center strings for a probability distribution on a set of strings under the Levenshtein distance, which are known to be NP-hard in a special case. These problems have many applications in various research fields, for instance, finding functional motifs in protein amino acid sequences and recognizing shapes and characters in image processing. In this paper, we propose novel integer linear programming-based methods for finding median and center strings for a probability distribution on a set of strings under the Levenshtein distance. Furthermore, we restrict several variables to a region near the diagonal in the formulation, and thereby also propose integer linear programming-based methods for finding approximate median and center strings. To evaluate the proposed methods, we perform several computational experiments and show that the restricted formulation reduces the execution time.

1 INTRODUCTION

Taking an average is a fundamental statistical method for understanding a data set. In this paper, we focus on a set of strings. For instance, nucleotide sequences of DNAs and RNAs are represented by strings, as are protein amino acid sequences. The number of such sequences has rapidly increased, and analytical methods are required. In evolutionary studies of organisms, one aim is to find the genetic information, that is, the nucleotide sequences, of common ancestors. In protein science, it is essential to find functional motifs in protein amino acid sequences. In image recognition, there are also several applications, such as post-processing of optical character recognition (OCR) results (Bunke et al., 2002) and shape recognition (Chen et al., 1998). Furthermore, averages of strings can be applied to classification and clustering of strings and biological sequences (Martínez-Hinarejos et al., 2003).

Several definitions have been proposed for representing an average of strings because the average is not uniquely determined. One is a median string, which is defined as a string that minimizes the sum of distances to the strings included in a set (Kohonen, 1985). Another is a center string, which is defined as a string that minimizes the maximum of distances to the strings (Gusfield, 1997). As distances between two strings, several distances such as the Levenshtein distance (Levenshtein, 1965), the Hamming distance (Hamming, 1950), and the Jaro-Winkler distance (Winkler, 1990) have been proposed, where the Jaro-Winkler distance is known not to obey the triangle inequality. The Levenshtein distance between two given strings s and t allows three types of edit operations, insertion, deletion, and substitution, and can be calculated in polynomial time O(|s||t|) using dynamic programming, where |s| denotes the length of s. The Hamming distance has also been used for closest string and related problems (Gramm, 2003; Gramm et al., 2003). Data reduction techniques that reduce the candidates of a center string under the Hamming distance were developed (Hufsky et al., 2011). However, the authors mentioned that their parameterized methods would not be applicable for finding center strings under the Levenshtein distance. A genetic algorithm for finding closest strings under the rank distance was developed (Dinu and Ionescu, 2012), where the rank distance has been applied in biology, natural language processing, and authorship attribution.

The problems of finding the median and center strings for a finite set of strings under the Levenshtein distance have been proved to be NP-complete for an unbounded alphabet (de la Higuera and Casacuberta, 2000), and even for a binary alphabet (Nicolas and Rivals, 2003; Nicolas and Rivals, 2005). It has

Hayashida, M. and Koyano, H.

Integer Linear Programming Approach to Median and Center Strings for a Probability Distribution on a Set of Strings.

DOI: 10.5220/0005666400350041

In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - Volume 3: BIOINFORMATICS, pages 35-41

ISBN: 978-989-758-170-0

Copyright © 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

been proved that a related problem, CSCE, is also NP-complete when the penalty matrix is a metric (Sim and Park, 2003). An exact algorithm for finding the median string under the Levenshtein distance using dynamic programming was proposed (Kruskal, 1983), which requires an N-dimensional array and $O(n^N)$ time and space for a set of N strings with length n; for example, it requires $10^{10} \cdot 4$ bytes = 40 GB of memory for n = N = 10. For computing approximate median strings in practical time, several methods have been proposed. If the given strings are all quite similar, the optimal dynamic programming path should be close to the main diagonal. Hence, a method that restricts candidate paths to a region near the diagonal was proposed (Lopresti and Zhou, 1997). A greedy algorithm starts from an empty string and selects a letter that minimizes the exact consensus error (Casacuberta and de Antoni, 1997). An online algorithm takes the current approximate median string and a new string, and calculates a weighted mean of these strings (Jiang et al., 2003). In a stochastic approach, a conditional probability from one string to another was defined, and an approximate median string was obtained by an expectation maximization technique (Olivares-Rodríguez and Oncina, 2008). An iterative algorithm applies the edit operation with the highest score to the current string until no better solution is found (Abreu and Rico-Juan, 2014). These methods output approximate median strings, and there are only a few methods that output optimal median strings. As far as we know, methods for finding optimal center strings have not been developed. In this paper, we therefore propose an approach using integer linear programming for finding optimal median and center strings, because efficient solvers for integer linear programming problems have been developed. In addition, we introduce a probability distribution on a set of strings (Koyano and Kishino, 2010), and propose methods for finding median and center strings of such a probability distribution under the Levenshtein distance. Furthermore, we propose integer linear programming formulations restricted to a region near the diagonal for finding approximate median and center strings. We perform several computational experiments and verify the efficiency of our methods.

2 METHODS

We use the Levenshtein distance because it is a fundamental and widely used edit distance. In this section, we briefly review the computation of the Levenshtein distance and the definitions of median and center strings, and propose integer linear programming formulations for exact and approximate median and center strings. Let $\mathcal{A} = \{a_1, \ldots, a_z\}$ be an alphabet composed of $z$ letters, for instance, $\mathcal{A} = \{\mathrm{A}, \mathrm{T}, \mathrm{G}, \mathrm{C}\}$ for DNA nucleotide sequences. We define $\mathcal{A}^*$ to be the set of all strings on $\mathcal{A}$ with varying lengths, and for a string $s \in \mathcal{A}^*$, $|s|$ denotes the length of $s$.

2.1 Levenshtein Distance

The Levenshtein distance $d(s,t)$ between two strings $s$ and $t$ is defined as the minimum cost of sequences of edit operations transforming $s = s_1 \cdots s_n$ into $t = t_1 \cdots t_m$, and can be calculated by the following dynamic programming (Wagner and Fischer, 1974).

\begin{align}
D[0,0] &= 0, \tag{1} \\
D[i,j] &= \min \begin{cases}
D[i-1,j-1] + \gamma(s_i \to t_j) \\
D[i-1,j] + \gamma(s_i \to \varepsilon) \\
D[i,j-1] + \gamma(\varepsilon \to t_j)
\end{cases} \tag{2}
\end{align}

where $\varepsilon$ denotes an empty letter, and $\gamma(s_i \to t_j)$, $\gamma(s_i \to \varepsilon)$, and $\gamma(\varepsilon \to t_j)$ denote the costs of substitution, deletion, and insertion, respectively. Then, $D[n,m]$ is the Levenshtein distance $d(s,t)$.
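As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch of the dynamic programming. The function name `levenshtein` and the default unit costs are assumptions made here for illustration; the first row and column are filled with pure insertions and deletions, which is what the recurrence reduces to when only one predecessor cell exists.

```python
def levenshtein(s, t, c_sub=1, c_del=1, c_ins=1):
    """Levenshtein distance between s and t following Eqs. (1)-(2)."""
    n, m = len(s), len(t)
    # D[i][j] = minimum cost of transforming s[:i] into t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + c_del      # only deletions are possible
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + c_ins      # only insertions are possible
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substituting a letter by itself costs nothing.
            sub = 0 if s[i - 1] == t[j - 1] else c_sub
            D[i][j] = min(D[i - 1][j - 1] + sub,   # substitution / match
                          D[i - 1][j] + c_del,     # deletion of s_i
                          D[i][j - 1] + c_ins)     # insertion of t_j
    return D[n][m]
```

The table has (n+1)(m+1) cells and each is filled in constant time, which gives the O(|s||t|) running time.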

2.2 Median and Center Strings

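Given $N$ strings $s^{(k)}$ with length $n_k$ ($k = 1, \ldots, N$) on $\mathcal{A}^*$, the median string is defined by

\begin{equation}
\operatorname*{argmin}_{t \in \mathcal{A}^*} \sum_{k=1}^{N} d(t, s^{(k)}). \tag{3}
\end{equation}

Similarly, the center string is defined by

\begin{equation}
\operatorname*{argmin}_{t \in \mathcal{A}^*} \max_{k \in \{1,\ldots,N\}} d(t, s^{(k)}). \tag{4}
\end{equation}

For a given probability distribution $p(s)$ on $\mathcal{A}^*$, we define median and center strings by

\begin{equation}
\operatorname*{argmin}_{t \in \mathcal{A}^*} \sum_{s \in \mathcal{A}^*} p(s)\, d(t,s), \tag{5}
\end{equation}

\begin{equation}
\operatorname*{argmin}_{t \in \mathcal{A}^*} \max_{s \in \mathcal{A}^*} p(s)\, d(t,s), \tag{6}
\end{equation}

respectively. If $p(s^{(k)}) = \frac{1}{N}$ for $k = 1, \ldots, N$ and $p(s) = 0$ for all $s \notin \{s^{(k)}\}$, Eqs. (3) and (4) are equivalent to Eqs. (5) and (6), respectively.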
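To make Eqs. (5) and (6) concrete, the sketch below evaluates both objectives by exhaustive search. It is not the method of this paper: it enumerates every candidate string up to a length bound `max_len`, which is an assumption introduced here to keep the search finite, and it is feasible only for tiny instances. The names `weighted_median_and_center` and `dist` are hypothetical, and unit edit costs are assumed.

```python
from itertools import product


def levenshtein(s, t):
    """Unit-cost Levenshtein distance, keeping one DP row at a time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def weighted_median_and_center(dist, alphabet, max_len):
    """Brute-force Eqs. (5) and (6) over all strings of length <= max_len.

    dist maps each string s with p(s) > 0 to its probability p(s).
    Returns ((median_value, median_string), (center_value, center_string)).
    """
    best_med = best_cen = None
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            t = "".join(letters)
            weighted = [p * levenshtein(t, s) for s, p in dist.items()]
            med, cen = sum(weighted), max(weighted)   # Eq. (5), Eq. (6)
            if best_med is None or med < best_med[0]:
                best_med = (med, t)
            if best_cen is None or cen < best_cen[0]:
                best_cen = (cen, t)
    return best_med, best_cen
```

For example, with `dist = {"AA": 0.8, "TT": 0.2}` over the alphabet `"AT"`, the heavily weighted string "AA" is returned as both the median and the center, each with objective value 0.4.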

2.3 Integer Linear Programming Formulation

Since it is known that problems of ﬁnding median and

center strings under the Levenshtein distance are NP-

hard (Nicolas and Rivals, 2003; Nicolas and Rivals,

2005), we make use of integer linear programming

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms


Figure 1: Illustration of the variables appearing in our integer linear programming formulation. (Left) Variables $x_{kij}$, $y_{kij}$, and $z_{kij}$ represent a path in the dynamic programming for calculating the Levenshtein distance; a variable is on the path if its value is equal to 1. (Right) Variable $l$ represents the length of string $t$. For all $j > l$, $y_{k n_k j}$ is forced to be 1.

for which efficient solvers have been developed. We can find a median string $t$ by integer linear programming if the Levenshtein distance $d(t, s^{(k)})$ between $t$ and $s^{(k)}$ can be calculated by linear formulas. It is, however, difficult to directly represent the array $D[i,j]$ of the dynamic programming in integer linear programming because it involves the selection of the minimum value in Eq. (2).

Suppose that a probability distribution $p(s)$ on $\mathcal{A}^*$ is given, where the number $N$ of strings $s$ satisfying $p(s) > 0$ is finite, that is, $p(s^{(k)}) > 0$ ($k = 1, \ldots, N$). We use the integers $1, \ldots, |\mathcal{A}|$ instead of the letters in $\mathcal{A}$ because variables in linear programming take numerical values. $s_i^{(k)}$ ($i = 1, \ldots, n_k$) is given as a constant value in $1, \ldots, |\mathcal{A}|$, and represents the $i$-th letter in $s^{(k)}$. $t_j$ ($j = 1, \ldots, m$) is a variable taking a value in $1, \ldots, |\mathcal{A}|$. Then, we propose the following integer linear programming formulation, called ILPMed, for finding the median string for $p(s)$ under the Levenshtein distance with costs $C_{sub}$, $C_{del}$, and $C_{ins}$ of substitution, deletion, and insertion, respectively (see Fig. 1).

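\begin{align*}
\min \quad & \sum_{k=1}^{N} p(s^{(k)}) \Bigl\{ \sum_{i=1}^{n_k} C_{del}\, x_{k,i,0} + \sum_{j=1}^{m} C_{ins}\, y_{k,0,j} + \sum_{i=1}^{n_k} \sum_{j=1}^{m} \bigl( C_{del}\, x_{kij} + C_{ins}\, y_{kij} + C_{sub}\, h_{kij} \bigr) \Bigr\} - C_{ins}(m - l)
\end{align*}

subject to, for all $k = 1, \ldots, N$,

\begin{align}
1 &= x_{k,1,0} + y_{k,0,1} + z_{k,1,1}, \tag{a1} \\
x_{k,i,0} &= x_{k,i+1,0} + y_{k,i,1} + z_{k,i+1,1} \quad \text{for all } i < n_k, \tag{a2} \\
x_{k,n_k,0} &= y_{k,n_k,1}, \tag{a3} \\
y_{k,0,j} &= x_{k,1,j} + y_{k,0,j+1} + z_{k,1,j+1} \quad \text{for all } j < m, \tag{a4} \\
y_{k,0,m} &= x_{k,1,m}, \tag{a5} \\
x_{kij} + y_{kij} + z_{kij} &= x_{k,i+1,j} + y_{k,i,j+1} + z_{k,i+1,j+1} \quad \text{for all } i < n_k,\ j < m, \tag{a6} \\
x_{k n_k j} + y_{k n_k j} + z_{k n_k j} &= y_{k,n_k,j+1} \quad \text{for all } j < m, \tag{a7} \\
x_{kim} + y_{kim} + z_{kim} &= x_{k,i+1,m} \quad \text{for all } i < n_k, \tag{a8} \\
x_{k n_k m} + y_{k n_k m} + z_{k n_k m} &= 1, \tag{a9} \\
y_{k n_k j} &\geq \frac{1}{m}(j - l) \quad \text{for all } j, \tag{b}
\end{align}

and, for all $k, i, j$,

\begin{align}
s_i^{(k)} - t_j &\leq |\mathcal{A}|\, g_{kij}, \tag{c1} \\
t_j - s_i^{(k)} &\leq |\mathcal{A}|\, g_{kij}, \tag{c2} \\
h_{kij} &\geq z_{kij} + g_{kij} - 1, \tag{d1} \\
h_{kij} &\leq \frac{1}{2}(z_{kij} + g_{kij}), \tag{d2}
\end{align}

\begin{align*}
x_{kij}, y_{kij}, z_{kij}, g_{kij}, h_{kij} \in \{0, 1\}, \qquad t_j \in \{1, \ldots, |\mathcal{A}|\}, \qquad 0 \leq l \leq m,
\end{align*}

where $m$ is a sufficiently large constant integer, that is, the sum of $n_k$, and $l$ is the variable representing the length of the median string.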
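One way to read constraints (a1)-(a9) is as flow conservation on the dynamic programming grid: one unit of flow leaves the cell (0, 0), every intermediate cell passes on exactly what it receives, and one unit arrives at the cell $(n_k, m)$. The sketch below checks this reading for a single string. It assumes that $x_{kij}$, $y_{kij}$, and $z_{kij}$ mark a step that arrives at cell (i, j) from above, from the left, and diagonally, respectively; the helper names `encode_path` and `flow_is_conserved` are hypothetical, and the index k is dropped.

```python
from collections import defaultdict


def encode_path(path):
    """Turn a lattice path [(0, 0), ..., (n, m)] into x/y/z indicator maps."""
    x, y, z = defaultdict(int), defaultdict(int), defaultdict(int)
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 0):
            x[i1, j1] = 1          # deletion of s_i
        elif step == (0, 1):
            y[i1, j1] = 1          # insertion of t_j
        elif step == (1, 1):
            z[i1, j1] = 1          # substitution of s_i with t_j
        else:
            raise ValueError("not a valid edit step: %r" % (step,))
    return x, y, z


def flow_is_conserved(x, y, z, n, m):
    """Check the node balance that (a1)-(a9) impose on every grid cell."""
    for i in range(n + 1):
        for j in range(m + 1):
            inflow = x[i, j] + y[i, j] + z[i, j]
            outflow = x[i + 1, j] + y[i, j + 1] + z[i + 1, j + 1]
            if (i, j) == (0, 0):
                inflow = 1         # source cell, as in (a1)
            if (i, j) == (n, m):
                outflow = 1        # sink cell, as in (a9)
            if inflow != outflow:
                return False
    return True
```

Variables that fall outside the grid are simply treated as 0 here, which reproduces the boundary cases (a2)-(a5), (a7), and (a8) without writing them out separately.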

In the formulation, variable $x_{kij}$ takes 1 if $s_i^{(k)}$ is deleted, otherwise 0. $y_{kij}$ takes 1 if $t_j$ is inserted, otherwise 0. $z_{kij}$ takes 1 if $s_i^{(k)}$ is substituted with $t_j$, otherwise 0. There must be exactly one path from the upper left to the lower right for each string $s^{(k)}$. If one of $x_{kij}$, $y_{kij}$, and $z_{kij}$ is 1, one of $x_{k,i+1,j}$, $y_{k,i,j+1}$, and $z_{k,i+1,j+1}$ must be 1, which is represented by Eq. (a6). Eqs. (a1)-(a9) are constructed in the same way according to the position $(i, j)$. Eq. (b) represents the constraint that the length of the median string $t$ is $l$: $y_{k n_k j}$ is forced to be 1 if $j > l$. It is difficult to represent the Levenshtein distance

\begin{equation*}
d(t, s^{(k)}) = \sum_{i=1}^{n_k} C_{del}\, x_{k,i,0} + \sum_{j=1}^{l} C_{ins}\, y_{k,0,j} + \sum_{i=1}^{n_k} \sum_{j=1}^{l} \bigl( C_{del}\, x_{kij} + C_{ins}\, y_{kij} + C_{sub}\, h_{kij} \bigr)
\end{equation*}

in the formulation because $l$ is also a variable to be found. Hence, we use the constant integer $m$ instead of $l$. Then, the sum includes the extra cost $C_{ins}(m - l)$, because $y_{k n_k j}$ is forced to be 1 for all $j > l$. We subtract this cost so that the objective function represents the sum in Eq. (5). Eqs. (c1)-(c2) force $g_{kij}$ to be 1 if $s_i^{(k)}$ differs from $t_j$; if the two letters are the same, $g_{kij}$ may take 0. Eqs. (d1)-(d2) represent that $h_{kij}$ becomes 1 if and only if both $z_{kij}$ and $g_{kij}$ are 1. It means that the substitution cost from $s_i^{(k)}$ to $t_j$ is $C_{sub}$ if $s_i^{(k)} \neq t_j$, otherwise 0.
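The effect of constraints (c1)-(c2) and (d1)-(d2) can be checked by enumerating the binary values they admit. The sketch below is only such a check, not part of the formulation; the helper names `feasible_g` and `feasible_h` are hypothetical, and letters are encoded as integers 1, ..., |A| as in the text.

```python
def feasible_g(s_i, t_j, alphabet_size):
    """Values of g that satisfy (c1) and (c2) for the letter codes s_i, t_j."""
    return [g for g in (0, 1)
            if s_i - t_j <= alphabet_size * g      # (c1)
            and t_j - s_i <= alphabet_size * g]    # (c2)


def feasible_h(z, g):
    """Values of h that satisfy (d1) and (d2) for binary z and g."""
    return [h for h in (0, 1)
            if h >= z + g - 1                      # (d1)
            and 2 * h <= z + g]                    # (d2), multiplied by 2
```

With different letters, for example `feasible_g(1, 3, 4)`, only g = 1 remains, whereas equal letters leave g free. For every binary pair (z, g), `feasible_h(z, g)` contains exactly one value, the logical AND of z and g, so a substitution step is charged only when the letters differ.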

It is guaranteed that we can find the median string for $p(s)$ under the Levenshtein distance by solving this integer linear programming formulation because the objective function is equivalent to the sum in Eq. (5), $t$ can be any string with length up to $m = \sum_{k=1}^{N} n_k$, and the sum for a string with length more than $m$ is larger than that for the concatenated string of all $s^{(k)}$.

In a similar way to median strings, we propose

the following integer linear programming formula-


tion, called ILPCen, for ﬁnding the center string for

a probability distribution p(s).

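\begin{align*}
\min \quad & d \\
\text{subject to} \quad & \text{for all } k = 1, \ldots, N, \\
& p(s^{(k)}) \Bigl\{ \sum_{i=1}^{n_k} C_{del}\, x_{k,i,0} + \sum_{j=1}^{m} C_{ins}\, y_{k,0,j} + \sum_{i=1}^{n_k} \sum_{j=1}^{m} \bigl( C_{del}\, x_{kij} + C_{ins}\, y_{kij} + C_{sub}\, h_{kij} \bigr) - C_{ins}(m - l) \Bigr\} \leq d, \\
& 1 = x_{k,1,0} + y_{k,0,1} + z_{k,1,1}, \\
& x_{k,i,0} = x_{k,i+1,0} + y_{k,i,1} + z_{k,i+1,1} \quad \text{for all } i < n_k, \\
& x_{k,n_k,0} = y_{k,n_k,1}, \\
& y_{k,0,j} = x_{k,1,j} + y_{k,0,j+1} + z_{k,1,j+1} \quad \text{for all } j < m, \\
& y_{k,0,m} = x_{k,1,m}, \\
& x_{kij} + y_{kij} + z_{kij} = x_{k,i+1,j} + y_{k,i,j+1} + z_{k,i+1,j+1} \quad \text{for all } i < n_k,\ j < m, \\
& x_{k n_k j} + y_{k n_k j} + z_{k n_k j} = y_{k,n_k,j+1} \quad \text{for all } j < m, \\
& x_{kim} + y_{kim} + z_{kim} = x_{k,i+1,m} \quad \text{for all } i < n_k, \\
& x_{k n_k m} + y_{k n_k m} + z_{k n_k m} = 1, \\
& y_{k n_k j} \geq \frac{1}{m}(j - l) \quad \text{for all } j, \\
& \text{for all } k, i, j, \\
& s_i^{(k)} - t_j \leq |\mathcal{A}|\, g_{kij}, \\
& t_j - s_i^{(k)} \leq |\mathcal{A}|\, g_{kij}, \\
& h_{kij} \geq z_{kij} + g_{kij} - 1, \\
& h_{kij} \leq \frac{1}{2}(z_{kij} + g_{kij}), \\
& x_{kij}, y_{kij}, z_{kij}, g_{kij}, h_{kij} \in \{0, 1\}, \\
& t_j \in \{1, \ldots, |\mathcal{A}|\}, \\
& 0 \leq l \leq m, \quad d \geq 0.
\end{align*}

Here, $d$ is a variable that represents the minimum in Eq. (6).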
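The role of $d$ can be seen from the standard epigraph rewriting of a min-max objective. Writing $f_k(t)$ for the weighted distance $p(s^{(k)})\, d(t, s^{(k)})$ (a shorthand introduced here only for illustration), the maximum over $k$ is replaced by one upper-bound constraint per string:

\begin{equation*}
\min_{t} \max_{k \in \{1,\ldots,N\}} f_k(t) \;=\; \min_{t,\, d} \bigl\{\, d \;:\; f_k(t) \leq d \ \text{ for all } k = 1, \ldots, N \,\bigr\}.
\end{equation*}

At an optimal solution at least one of the constraints $f_k(t) \leq d$ holds with equality, since otherwise $d$ could be decreased further.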

If the strings $s^{(k)}$ are similar to each other, we can restrict candidate paths to a region near the diagonal, in many cases without loss of optimality. We introduce a constant positive integer $w$, and propose integer linear programming formulations, called ILPMedDiag and ILPCenDiag, by removing the variables $x_{kij}$, $y_{kij}$, $z_{kij}$, $g_{kij}$, and $h_{kij}$ with $|i - j| > w$ from ILPMed and ILPCen, respectively.
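The effect of such a band can be illustrated on a single pairwise dynamic programming table. The sketch below is not the ILP itself: it computes the unit-cost edit distance using only cells with |i - j| <= w, in the spirit of the restriction above. The function name `banded_levenshtein` is hypothetical.

```python
import math


def banded_levenshtein(s, t, w):
    """Unit-cost edit distance restricted to DP cells with |i - j| <= w.

    Returns math.inf when the cell (len(s), len(t)) lies outside the band.
    """
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        # Only visit columns j inside the band around the diagonal.
        for j in range(max(0, i - w), min(m, i + w) + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0 and j > 0:
                best = min(best, D[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
            if i > 0:
                best = min(best, D[i - 1][j] + 1)   # deletion
            if j > 0:
                best = min(best, D[i][j - 1] + 1)   # insertion
            D[i][j] = best
    return D[n][m]
```

For "ABCD" and "BCDA" the true distance is 2 (delete the leading A, insert a trailing A), and `w = 1` recovers it, whereas `w = 0` allows only substitutions along the diagonal and yields 4. A band that is too narrow can therefore overestimate the distance, which is why the restricted formulations are approximate in general.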

3 COMPUTATIONAL EXPERIMENTS

For the evaluation of our methods, we performed several computational experiments. We used $C_{del} = C_{ins} = C_{sub} = 1$ to calculate the Levenshtein distance, and used an alphabet $\mathcal{A}$ with 4 letters, as in DNA and RNA nucleotide sequences. We randomly generated two types of probability distributions, $p_1(s)$ and $p_2(s)$, on $\mathcal{A}^*$. In $p_1(s)$, $N$ strings $s^{(k)}$ with length $n_k$ were generated as the strings satisfying $p_1(s) > 0$ while varying $N = 2, \ldots, 10$ and $n_k = 2, \ldots, 10$, where $n_k$ was the same for all $k = 1, \ldots, N$. Each $s_i^{(k)}$ was generated as $\min(1 + \lfloor |\alpha| \rfloor, |\mathcal{A}|)$, where $\alpha$ followed the normal distribution with mean 0 and variance 1, and $\lfloor \alpha \rfloor$ is the largest integer not greater than $\alpha$. The probability $p_1(s^{(k)})$ was generated uniquely at random such that $\sum_{k=1}^{N} p_1(s^{(k)}) = 1$ holds. In $p_2(s)$, $N$ strings $s^{(k)}$ were generated from a string $a_1 \cdots a_1$ ($a_1 \in \mathcal{A}$) with length $n$ by applying randomly selected edit operations of substitution, insertion, and deletion three times, where we examined $N = 2, \ldots, 10$ and $n = 5, \ldots, 10$, and the length $n_k$ of $s^{(k)}$ could differ according to $k$. The probability $p_2(s^{(k)})$ was generated uniquely at random such that $\sum_{k=1}^{N} p_2(s^{(k)}) = 1$ holds. For each case of $p_1(s)$, $p_2(s)$, $n_k$, and $N$, we generated a set of $N$ strings $s^{(k)}$ with $p_1(s^{(k)})$ or $p_2(s^{(k)})$ ten times, and took the average of execution times. We used CPLEX (version 12.5) as the integer linear programming solver under a Linux operating system with a Xeon 2.9 GHz processor and 35 GB of memory.
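The generation of a $p_1(s)$ instance can be sketched as follows. The letter rule $\min(1 + \lfloor |\alpha| \rfloor, |\mathcal{A}|)$ is taken from the description above; how the probabilities were drawn is not specified beyond summing to 1, so normalizing independent uniform draws is an assumption made here, as are the function names and the fixed seed.

```python
import math
import random


def random_letter(alphabet_size, rng):
    """Letter code min(1 + floor(|alpha|), |A|) with alpha ~ N(0, 1)."""
    alpha = rng.gauss(0.0, 1.0)
    return min(1 + math.floor(abs(alpha)), alphabet_size)


def generate_p1(num_strings, length, alphabet_size=4, seed=0):
    """Return (strings, probs): N integer-coded strings and their weights."""
    rng = random.Random(seed)
    strings = [tuple(random_letter(alphabet_size, rng) for _ in range(length))
               for _ in range(num_strings)]
    # Assumption: positive uniform weights, normalized to sum to 1.
    weights = [1.0 - rng.random() for _ in range(num_strings)]   # in (0, 1]
    total = sum(weights)
    probs = [w / total for w in weights]
    return strings, probs
```

Because $|\alpha| < 1$ holds with probability of about 0.68 for a standard normal variable, letter 1 dominates the generated strings, so the strings of one instance tend to resemble each other. This sketch does not remove duplicate strings; whether the original experiments did so is not stated.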

Figure 2: Results on the average execution time in seconds on a log scale by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_1(s)$ for N = 2, ..., 10 and $n_k$ = 2, ..., 10. (a) For finding median strings (ILPMed and ILPMedDiag). (b) For finding center strings (ILPCen and ILPCenDiag). Axes in both panels: Length, # strings, Elapsed time (sec).


Table 1: Results on the average and standard deviation (s.d.) of execution time in seconds by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_1(s)$ for n = 10 and N = 2, ..., 10.

 N   ILPMed              ILPMedDiag          ILPCen              ILPCenDiag
     average      s.d.   average      s.d.   average      s.d.   average      s.d.
 2      14.4       6.8       1.7       1.8       6.4       1.9       3.8       2.5
 3      21.5      10.5       7.0       4.4      18.1       7.4       7.0       5.1
 4      23.7      15.3       7.0       4.5      72.1      27.2       5.5       4.0
 5     168.0      69.0      15.7       3.6     464.8     312.8       8.8       4.8
 6     798.6     280.4      21.7       8.3    1015.6     580.6       6.9       3.0
 7    1400.6     773.3      41.6      20.5    2810.3    1218.8      10.7       4.4
 8   11269.8    9074.5      61.0      39.2    5936.2    3522.8      20.4       9.5
 9   22438.1   16417.6      37.7      17.9    5086.0    1358.6      23.9      10.5
10   18467.8   16625.5      67.4      22.8    8120.3    4336.0      41.1      29.5

Fig. 2 shows the average execution time in seconds on a log scale by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_1(s)$ for N = 2, ..., 10 and $n_k$ = 2, ..., 10. Table 1 shows the detailed average and standard deviation of execution time by ILPMed, ILPMedDiag, ILPCen, and ILPCenDiag for n = 10. We can see from these that the average execution times by ILPMed and ILPCen increased rapidly, almost exponentially, with both the number N of strings and the length $n_k$, which is consistent with the NP-hardness of the problems. On the other hand, the average execution times by ILPMedDiag and ILPCenDiag were smaller than those by ILPMed and ILPCen, respectively, because candidate solutions for the problems were restricted to the region near the diagonal. For example, for n = 10 and N = 10, the average execution time was 18467.8 seconds by ILPMed and 67.4 seconds by ILPMedDiag.

Fig. 3 shows the average execution time in seconds on a log scale by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_2(s)$ for N = 2, ..., 10 and n = 5, ..., 10. Table 2 shows the detailed average and standard deviation of execution time by ILPMed, ILPMedDiag, ILPCen, and ILPCenDiag for n = 10. Also for $p_2(s)$, the average execution times by ILPMedDiag and ILPCenDiag were smaller than those by ILPMed and ILPCen, respectively. The slopes of ILPMed and ILPCen along the length for $p_2(s)$ were smaller than those for $p_1(s)$, respectively. In addition, the average execution times for $p_2(s)$ were smaller than those for $p_1(s)$. It is, however, considered that ILPMed and ILPCen might not be efficient enough to be applied to real data for finding optimal median and center strings. Fig. 4 shows the average objective value by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_2(s)$ for N = 2, ..., 10 and n = 5, ..., 10. It is noted that the objective values by ILPMed and ILPCen for $p_1(s)$ were almost the same as those by ILPMedDiag and ILPCenDiag, respectively. In $p_2(s)$, three edit operations were applied to strings, and differences of objective values between ILPMed and ILPMedDiag with w = 2, and between ILPCen and ILPCenDiag with w = 2, occurred. We can obtain optimal strings using ILPMedDiag and ILPCenDiag by increasing the width w of the diagonal region.

Figure 3: Results on the average execution time in seconds on a log scale by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_2(s)$ for N = 2, ..., 10 and n = 5, ..., 10. (a) Median string (ILPMed and ILPMedDiag). (b) Center string (ILPCen and ILPCenDiag). Axes in both panels: Length, # strings, Elapsed time (sec).


Table 2: Results on the average and standard deviation (s.d.) of execution time in seconds by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_2(s)$ for n = 10 and N = 2, ..., 10.

 N   ILPMed              ILPMedDiag          ILPCen              ILPCenDiag
     average      s.d.   average      s.d.   average      s.d.   average      s.d.
 2       4.7       3.2       1.6       2.4       2.7       2.8       2.4       2.8
 3      12.4       7.8       2.1       2.5       7.5       4.5       2.3       2.6
 4      18.9      11.5       4.8       5.3      19.5      13.9       4.9       3.1
 5      23.3      23.2       3.2       2.5      26.6      20.0       2.1       2.4
 6      98.6      76.1       6.4       3.9     157.0     110.7       3.8       2.6
 7     154.3     120.0       7.5       4.0     320.4     257.5       4.9       4.1
 8     293.7     170.0       7.1       6.5     641.3     425.0       4.8       4.4
 9     943.4     747.9      14.1       8.4    1259.2     676.6       3.1       3.3
10    1160.4     276.1       9.4       5.5    1266.7     488.1       3.9       3.6

Figure 4: Results on the average objective value by ILPMed, ILPMedDiag with w = 2, ILPCen, and ILPCenDiag with w = 2 for probability distributions $p_2(s)$ for N = 2, ..., 10 and n = 5, ..., 10. (a) Median string (ILPMed and ILPMedDiag). (b) Center string (ILPCen and ILPCenDiag). Axes in both panels: Length, # strings, Objective.

4 CONCLUSION

We extended the definitions of median and center strings, for which the corresponding problems are known to be NP-hard, to those over a probability distribution $p(s)$ on a set of strings $\mathcal{A}^*$. We proposed novel integer linear programming-based methods, ILPMed and ILPCen, for finding median and center strings for $p(s)$ on $\mathcal{A}^*$, and ILPMedDiag and ILPCenDiag for finding approximate median and center strings for $p(s)$ on $\mathcal{A}^*$ by restricting several variables of ILPMed and ILPCen to a region near the diagonal. We performed computational experiments, and confirmed that the execution times by ILPMedDiag and ILPCenDiag were smaller than those by ILPMed and ILPCen, respectively. ILPMed and ILPCen, however, might not be efficient enough to be applied to real data for finding optimal median and center strings. A likely reason is that the number of candidate paths from the upper left to the lower right in ILPMed and ILPCen is enormous and the paths must be selected by the solver, although the Levenshtein distance between two strings can be calculated in polynomial time. On the other hand, ILPMedDiag and ILPCenDiag are considered to be useful if the given strings are similar to each other because the number of such candidate paths in ILPMedDiag and ILPCenDiag is small. As future work, we need to analyze the computational time and space complexities of our proposed methods. Furthermore, we would like to improve our methods by introducing other types of restrictions on the variables than those in ILPMedDiag and ILPCenDiag. In addition, we will consider decomposition of strings and linear programming relaxation, and utilize approximate solutions obtained by ILPMedDiag and ILPCenDiag in order to find optimal solutions by ILPMed and ILPCen.

ACKNOWLEDGEMENTS

This work was partially supported by Grants-in-Aid

#24500361, and #26610037 from MEXT, Japan.


REFERENCES

Abreu, J. and Rico-Juan, J. (2014). A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 36:74–80.

Bunke, H., Jiang, X., Abegglen, K., and Kandel, A. (2002). On the weighted mean of a pair of strings. Pattern Analysis and Applications, 5:23–30.

Casacuberta, F. and de Antoni, M. (1997). A greedy algorithm for computing approximate median strings. pages 193–198.

Chen, S., Tung, S., Fang, C., Cherng, S., and Jain, A. (1998). Extended attributed string matching for shape recognition. Computer Vision and Image Understanding, 70:36–50.

de la Higuera, C. and Casacuberta, F. (2000). Topology of strings: Median string is NP-complete. Theoretical Computer Science, 230:39–48.

Dinu, L. and Ionescu, R. (2012). An efficient rank based approach for closest string and closest substring. PLoS ONE, 7(6):e37576.

Gramm, J. (2003). Fixed-parameter algorithms for the consensus analysis of genomic data. PhD thesis, Universität Tübingen.

Gramm, J., Niedermeier, R., and Rossmanith, P. (2003). Fixed-parameter algorithms for closest string and related problems. Algorithmica, 37:25–42.

Gusfield, D. (1997). Algorithms on strings, trees and sequences. Cambridge University Press.

Hamming, R. (1950). Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160.

Hufsky, F., Kuchenbecker, L., Jahn, K., Stoye, J., and Böcker, S. (2011). Swiftly computing center strings. BMC Bioinformatics, 12:106.

Jiang, X., Abegglen, K., Bunke, H., and Csirik, J. (2003). Dynamic computation of generalised median strings. Pattern Analysis and Applications, 6:185–193.

Kohonen, T. (1985). Median strings. Pattern Recognition Letters, 3:309–313.

Koyano, H. and Kishino, H. (2010). Quantifying biodiversity and asymptotics for a sequence of random strings. Physical Review E, 81(6):061912.

Kruskal, J. (1983). An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25(2):201–237.

Levenshtein, V. (1965). Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848.

Lopresti, D. and Zhou, J. (1997). Using consensus sequence voting to correct OCR errors. Computer Vision and Image Understanding, 67(1):39–47.

Martínez-Hinarejos, C., Juan, A., and Casacuberta, F. (2003). Median strings for k-nearest neighbour classification. Pattern Recognition Letters, 24:173–181.

Nicolas, F. and Rivals, E. (2003). Complexities of the centre and median string problems. Lecture Notes in Computer Science, 2676:315–327.

Nicolas, F. and Rivals, E. (2005). Hardness results for the center and median string problems under the weighted and unweighted edit distances. Journal of Discrete Algorithms, 3:390–415.

Olivares-Rodríguez, C. and Oncina, J. (2008). A Stochastic Approach to Median String Computation, pages 431–440. Springer, Berlin.

Sim, J. S. and Park, K. (2003). The consensus string problem for a metric is NP-complete. Journal of Discrete Algorithms, 1:111–117.

Wagner, R. and Fischer, M. (1974). The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

Winkler, W. (1990). String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. pages 354–359.
