MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC

NETWORK EVOLUTION

Esa Pitk

anen, Juho Rousu

Department of Computer Science, University of Helsinki, P.O. Box 68, FI-00014 Helsingin yliopisto, Finland

Mikko Arvas

VTT Technical Research Centre of Finland, P.O. Box 1000, FI-02044 VTT, Espoo, Finland

Keywords:

Algorithms, Bioinformatics, Evolution, Fungi, Metabolism, Parsimony, Phylogeny, Systems Biology.

Abstract:

We present a method for inferring the structure of ancestral metabolic networks directly from the networks of

observed species and their phylogenetic tree. Our method aims to minimize the number of mutations on the

phylogenetic tree, whilst keeping the ancestral networks structurally feasible, i.e., free of reaction gaps.

To this end, we present a parsimony-based method that generates metabolic network phylogenies where the an-

cestral nodes are required to represent gapless metabolic networks, networks where all reactions are reachable

from external substrates. In particular, we introduce the gapless minimum mutation problem: ﬁnding phylo-

genies of gapless metabolic networks when the topology of the phylogenetic tree is given, but the content of

ancestral nodes is unknown.

The gapless minimum mutation problem is shown to be computationally hard to solve even approximatively.

We then propose an efﬁcient dynamic programming based heuristic that combines knowledge on both the

metabolic network topology and phylogeny of species. Speciﬁcally, the reconstruction of each ancestral

network is guided by the heuristic to minimize the total phylogeny cost. We experiment by reconstructing

phylogenies generated under a simple random model and derived from KEGG for a number of fungal species.

1 INTRODUCTION

Modelling of metabolism is essential in a variety of

applications of biotechnology and medicine including

bioprocess development (Raman and Chandra, 2009),

study of metabolic diseases (Sigurdsson et al., 2009)

and drug target identiﬁcation (Jamshidi and Palsson,

2007). Global characteristics of cellular metabolism

by metabolic networks have been studied intensively

by a variety of computational approaches, includ-

ing metabolic reconstruction (see (Pitk

anen et al.,

2010) for a recent survey), metabolic ﬂux analy-

sis (Palsson, 2006), 13C isotopic tracing (Rantanen

et al., 2008) and structural analysis of metabolic net-

works (Lacroix et al., 2008).

The structure of the metabolic network is a major

contributor to the phenotypes that an organism man-

ifests. Metabolic networks have been shown to be

scale free (i.e., the networks contain hub metabolites

of high connectivity) and modular (Kreimer et al.,

2008). The structure is known to constrain the phe-

notypes the network can realize (Palsson, 2006), thus

the structure is also likely to be conserved in evolu-

tion (Wagner, 2009).

Recently, the increasing number of fully se-

quenced genomes has enabled comparative genomics

analysis of metabolic network evolution (for review

see (Caetano-Anoll

es et al., 2009)). Many compu-

tational approaches have concentrated on deriving

rigorous measures of biological network and path-

way similarity (Sharan and Ideker, 2006), thus en-

abling construction of phylogenies of networks with

distance-based methods. Particularly methods for

metabolic network and pathway comparison have

been developed (Dandekar et al., 1999; Tohsato et al.,

2000; Clemente et al., 2007; Mano et al., 2010).

Distance-based methods do not immediately yield

predictions on the contents of ancestral networks,

however. Knowledge on ancestral networks is impor-

tant as it may shed light on the evolutionary mecha-

nisms that have generated the observed networks. An

approach complementary to distance-based methods

Pitkänen E., Rousu J. and Arvas M..

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION.

DOI: 10.5220/0003132200280038

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 28-38

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

often used to give insight on ancestral node contents

in phylogenetic trees is maximum parsimony, where

one tries to ﬁnd ancestral objects which minimize the

total number of evolutionary changes required to ex-

plain the observed data. The maximum parsimony

principle has been utilized in many domains, for in-

stance in the analysis of sequence data (Fitch, 1971;

Sankoff, 1975; Clemente et al., 2009; Tuller et al.,

2010) as well as gene regulatory networks (Bourque

and Sankoff, 2004).

A direct application of maximum parsimony

methods to biological network data generally results

in structurally infeasible ancestral networks. For in-

stance, consider the two metabolic networks Y and Z

with a common immediate ancestor X shown in Fig-

ure 1. A parsimonious scenario (top right) includes an

ancestral network where a metabolite is required by

the network but cannot be produced, suggesting that

the pathway m → c cannot operate and the network is

infeasible.

Graph evolution models taking into account de-

pendencies imposed by the network structure have

been proposed. For instance, Mithani et al. gave

a Markov process for simulating metabolic network

evolution under a neighbor dependency model where

appearance of a reaction depends on the fraction

of neighboring reactions already present in the net-

work (Mithani et al., 2009; Mithani et al., 2010).

However, they reported results only for relatively

small metabolic networks.

In this paper, we introduce a computational

method for reconstructing ancestral metabolic net-

works in a given phylogenetic tree. The method com-

bines the maximum parsimony principle with the re-

quirement that the resulting networks are plausible

in terms of network connectivity, thus contributing

towards bridging the gap between the structural and

phylogenetic analysis of metabolic networks. Specif-

ically, our method builds phylogenies of metabolic

networks where the ancestral nodes of the phylogeny

adhere to structural network constraints: The net-

works are required to be free of reaction gaps, that is,

reactions whose substrates cannot be produced from

external metabolites. The choice of external metabo-

lites reﬂects the estimated metabolic environment:

the organisms are assumed to have them available in

abundance and possess the necessary transports. To

this purpose, computational methods have been de-

veloped to identify a set of minimal nutrients, given

metabolic network structure (Handorf et al., 2008;

Borenstein et al., 2008).

In section 2, we formulate the gapless minimum

mutation problem where the topology of the phyloge-

netic tree is taken as input and the problem is to infer

the structure of the ancestral networks so that the to-

tal phylogeny cost is minimized. We show the prob-

lem to be computationally hard to even approximate

and go on to propose an efﬁcient heuristic algorithm,

which solves the problem well in practise. In section

3, we experiment with the algorithm in two scenarios:

First, we analyze randomly generated and perturbed

data. Second, we study gapless phylogeny reconstruc-

tion for a collection of fungal species. Section 4 ends

the paper with conclusions.

2 METHODS

We are interested in metabolic networks that are func-

tional in the sense that the network is able to produce

substrates of all its reactions from some given set of

source metabolites. Such networks are termed gap-

less, with a precise deﬁnition given below.

A metabolic network can be described as a binary

string N ∈ {0,1}

, where each N

= 1 states that the

reaction r

, drawn from a collection of reactions R ,

is in the network. We use the shorthand r

∈ N when

= 1. Further we assume a set of metabolites M is

consumed and produced by the m reactions. The set

of substrate and product metabolites of a reaction r

are given by S(r

) ⊂ M and P(r

) ⊂ M , respectively.

To see how a string N encodes metabolic net-

work connectivity, note that N induces a directed bi-

partite graph G(N) = (V, E), with a node v

∈ V for

each reaction r ∈ N. Additionally each metabolite

b ∈ S(r

) ∪ P(r

) for every r

∈ N contributes a node

∈ V . Edges (v

) ∈ E and (v

) ∈ E are added

whenever reaction r produces or consumes metabo-

lite b, respectively. Figure 2 shows this graph rep-

resentation implicitly encoded by a string N for ﬁve

reactions.

We ﬁrst deﬁne gaplessness in terms of reactions

that are reachable from a set of source metabolites

S (Pitk

anen et al., 2005).

Deﬁnition 2.1. Let N be a metabolic network and S ⊆

M be a set of source metabolites.

• A reaction r ∈ N is reachable from S in N if all its

substrates S(r) are reachable from S in N.

• A metabolite b ∈ M is reachable from S in N if

either b ∈ S or b ∈ P(r) for some reaction r ∈ N

that is reachable from S in N.

A gapless metabolic network N under S is a metabolic

network where all reactions are reachable from S in

We often omit an explicit mention of the source

set S if it is clear in the context, saying only that a

metabolic network is gapless. If a reaction r ∈ N is

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION

Y Z

X,Y,Z

X,Y

X,Z

X,Y,Z

Figure 1: Left: a small example phylogeny for three metabolic networks X,Y,Z. Right: two parsimonious scenarios involving

metabolic pathways a → m, b → m and m → c. Metabolites and pathways are shown as circles and arrows, respectively.

Pathways are labeled with organisms which have the pathway. In upper scenario, only pathway m → c is assigned to ancestor

X, thus leaving metabolite m without a producing pathway. In lower scenario, both pathways from Y and Z are assigned to X.

not reachable from S, we say that r is a reaction gap

(under S). In addition, we say that a metabolic net-

work is gapped, if it contains at least one reaction gap.

Figure 2 illustrates these concepts.

2.1 Gapless Minimum Mutation

Problem

We next introduce the computational problem of ﬁnd-

ing a gapless phylogeny when the tree topology and

input taxa are given.

Problem 2.2. Gapless Minimum Mutation problem

(GMM). Given a reaction collection R , m = |R |, a

rooted binary tree T = (V, E), labeling L(u) ∈ {0,1}

specifying a metabolic network for each leaf node and

source metabolites S, ﬁnd a labeling for each internal

node of T such that

(1) c =

∑

(u,v)∈E

d(L(u),L(v)) is minimized and

(2) L(u) is a gapless metabolic network under S for

each internal node u ∈ V ,

where d is Hamming distance.

The equivalent problem deﬁned for binary strings

without the gapless constraint (2), Minimum Muta-

tion problem, can be solved in polynomial time with

the Fitch algorithm because each character position

can be solved independently of each other (Fitch,

1971; Gusﬁeld, 1997). However, in contrast to the

Minimum Mutation problem, the character positions

in GMM are not necessarily independent of each

other: setting a certain L

(u) ∈ {0, 1} may impose

constraints on other positions j 6= i due to the gapless

constraint. Note that the taxa contained in the leaves

may or may not correspond to gapless metabolic net-

works.

Theorem 2.3. Deciding whether a solution with cost

c ≤ k to Gapless Minimum Mutation problem exists

given k ∈ N is NP-complete.

Proof. Given a solution to GMM, we can both com-

pute the cost c and check that each network L(u) is

gapless in polynomial time (Pitk

anen et al., 2005),

hence the problem is in NP.

To show that the problem is NP-hard, we reduce

the well-known NP-complete Minimum Set Cover

problem (Garey and Johnson, 1979) to Gapless Min-

imum Mutation problem. Let X be a ﬁnite set and C

be a collection of subsets of the set X. In the Mini-

mum Set Cover problem, we ask for the smallest sub-

set C

⊆ C such that every element of X belongs to

at least one member of C

. To create an instance of

GMM, we ﬁrst set up a reaction collection R with two

groups of reactions, one group for items in X, another

for sets in C. Speciﬁcally, let R to contain a reaction

for each x

∈ X with S(r

) = {b

} and P(r

) = {c

and a reaction q

for each c

∈ C with S(q

) = {a

}

and P(q

) = {b

∈ c

}. In addition, let R contain a

reaction x with S(x) = {c

∈ X} and P(x) = {m}.

To set up a phylogenetic tree, let T = (V, E)

be a binary rooted tree with V = {v

}, E =

{(v

),(v

)} and root v

. Finally, let input

metabolic networks at leaves be L(v

) = L(v

) =

∈ X} ∪ {x} and the set of source metabolites

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

Figure 2: Example metabolic network with reactions N = {r

,. .. ,r

} and metabolites {m

,. .. ,m

}. When S = {m

}

(double circles), reactions r

and r

(dotted rectangles) are reaction gaps in N. However, the network N under S

} would be gapless, because then reaction r

would be reachable. On the other hand, N

= {r

} is gap-

less under S = {m

be S = {a

∈ C}.

A minimal solution to GMM will contain at node

a network with r

and x, and a minimal number of

reactions q

to make L(v

) gapless. Existence of x

will ensure that L(v

) 6=

0 when |C| = |X|. To trans-

form a solution to GMM back to a solution to Min-

imum Set Cover, let set C

contain c

for each q

as-

signed to node v

. The reduction can be done in poly-

nomial time. In reduction, each reaction q

is assigned

to v

if and only if c

appears in the optimal solution,

assuming without loss of generality a unique set cover

solution. If there is no solution to the set cover in-

stance, also GMM is unsolvable as it is not possible to

ﬁx L(v

) to be gapless. Figure 3 shows the reduction

from an example set cover instance X = {x

,.. .,x

}

and c

= {x

},c

= {x

},c

= {x

Since the problem is both in NP and NP-hard, the

claim follows.

Theorem 2.4. Gapless Minimum Mutation cannot be

efﬁciently α-approximated unless P = NP.

Proof. We show next that the reduction described

above is actually an approximation-ratio preserving

reduction and we can thus exploit the inapproxima-

bility of the set cover problem. Set cover problem has

been shown to be hard to approximate within a log-

arithmic factor unless P = NP (Raz and Safra, 1997;

Alon et al., 2006).

Assume that there is an α-approximation algo-

rithm for GMM for some α > 1. We can thus obtain

a solution of cost ≤ αOPT where OPT is an opti-

mal solution cost of GMM. Given a set cover instance

with optimal size k, we obtain a GMM instance of

optimal cost 2k in polynomial time with the above re-

duction. Solving the instance approximatively we get

a solution of cost at most 2αk. This yields an approx-

imative solution

k ≤ αk to the set cover problem, thus

contradicting the assumption.

Often we have gapped metabolic networks as taxa

to begin with. For instance, initial networks may be a

result of function assignment by annotation transfer,

where the resulting structure of the draft metabolic

network is not of concern. However, these networks

should also be functional and therefore we can at-

tempt to ﬁx them gapless while ﬁnding ancestral

nodes. To do this, we can extend the tree T such

that for each leaf u we add an edge (u, u

) and as-

sign L(u

) ← L(u). Solving GMM in the modiﬁed tree

thus ﬁnds a solution where nodes u

retain the origi-

nal input networks but gaps in internal nodes u are

ﬁxed. We provide an example of such situation in ex-

periments, where we utilize gapped networks derived

from a metabolic database.

2.2 Local Adjustment Algorithm

To overcome the computational complexity, we next

propose an algorithm that solves the Gapless Mini-

mum Mutation problem in two phases. In the ﬁrst

phase, the assignments corresponding to the mini-

mum mutation cost are computed with the Fitch al-

gorithm (Fitch, 1971; Gusﬁeld, 1997). In the second

phase, the tree is traversed top-down and a gapless

metabolic network is assigned at each node by ﬁll-

ing the gaps remaining after the Fitch pass. The al-

gorithm relies on estimates on how much ﬁlling each

gap would increase the total phylogeny cost, and at-

tempts to choose gap-ﬁlling reactions which increase

the cost as little as possible. The cost increase esti-

mates are computed by the algorithm for each ances-

tral network.

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION

Figure 3: Example reduction of a Minimum Set Cover instance with X = {x

,. .. ,x

} and c

= {x

},c

= {x

}

and c

= {x

} to Gapless Minimum Mutation problem. Left: reactions in R . Right: tree T .

Algorithm 1: GaplessMinimumMutation: Local adjust-

ment algorithm for gapless minimum mutation.

1: Input: tree T , taxa L

2: F ← Fitch(T, L)

3: for all internal nodes v in top-down order do

4: if v is root then

5: for all i ∈ {1,... , m} do

6: L

(v) ← 0 if 0 ∈ F

(v); otherwise L

(v) ←

7: else

8: for all i ∈ {1,... , m} do

9: if F

(v) = {0,1} then

10: L

(v) ← L

(pa(v))

11: else

12: L

(v) ← 0 if 0 ∈ F

(v); otherwise

(v) ← 1

13: (D,M) ← MinDist(v,S)

14: L(v) ← L(v) ∪MinFill(v,S,D,M)

In Algorithm 1 at line 2, we ﬁrst compute equal-

ity sets F

(v) ⊆ {0,1} for each internal node v and

position i with the Fitch algorithm. In a binary tree,

equality set F

(v) is deﬁned for each position i as

(v) =











(v)} iff v is leaf

(x) ∩ F

(y) iff v isnot leaf and F

(x) ∩ F

(y) 6=

(x) ∪ F

(y) iff v isnot leaf and F

(x) ∩ F

(y) =

where x and y are the children of an internal node

v (Gusﬁeld, 1997).

The tree is traversed in a top-down pass and an

initial labeling L(v) is decided according to the Fitch

top-down phase (Algorithm 1, lines 3–12). Each ini-

tial labeling is then adjusted so that it satisﬁes the gap-

lessness constraint.

For each internal node v, the algorithm calls the

subroutine MinFill, which returns a gapless network

containing L(v) and gapﬁll reactions (Algorithm 2).

Particularly, if L(v) is already gapless, it is returned as

such. MinFill attempts to satisfy the gaplessness cri-

terion by backtracking from each reaction not reach-

able from sources S in network L(v) and iteratively

adding reactions to the ﬁll set Γ. Subprocedure termi-

nates when either all reactions have been reached or

the algorithm notices that all gaps cannot be ﬁlled.

A heuristic is used to guide the backtracking phase

by considering reaction assignments in the parent and

child nodes. In particular, the algorithm attempts to

choose reactions to the ﬁll set Γ such that the par-

simony cost increase is minimized. To do this, we

ﬁrst compute distances d

that provide an estimate

for each reaction how much the parsimony cost will

increase compared to the optimal Minimum Mutation

solution, if the reaction is added to the network. For-

mally, the distance d

is a lower bound to the increase

in parsimony cost that is the result of adding reaction

and other reactions required to make r reachable to

(v,r

) = max

m∈S(r

)

min

q∈Pr(m)

(v,q) + d

(v,r

)

where S(r

) are the substrates of reaction r

and Pr(m)

are the reactions producing m. To introduce some

notation, let G

(w,v) = {L

(w)} iff pa(v) = w and

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

(w,v) = F

(w) iff w ∈ ch(v), where the parent and

children of v are denoted by pa(v) and ch(v), re-

spectively. Cost function d

(v,r

) speciﬁes the in-

crease in parsimony cost when adding reaction r

only and is deﬁned as d

(v,r

) = δ(v, pa(v),r

) +

∑

c∈ch(v)

δ(v,c,r

) with

δ(v,w,r

) =











1 ifG

(w,v) = {0} and L

(v) = 0

−1 if G

(w,v) = {1} and L

(v) = 0

0 ifG

(w,v) = {0, 1} and L

(v) = 0

0 ifL

(v) = 1

Note that at root v, we set δ(v, pa(v)) = 0.

Parsimony cost increase d

at each reaction is al-

ways non-negative. To see this, we can enumerate the

values for d

at an internal node given G

at parent and

children (Table 1). Symmetric cases are omitted from

the table, where a boldface 0 signiﬁes that this combi-

nation would yield L

(v) = 1 in the bottom-up phase

of the algorithm and thus the reaction would already

be in the network.

Subroutine MinDist is called from Algorithm 1 to

compute distances d

for all r with dynamic program-

ming (Algorithm 3). If for some required substrate

there are only producers r

that have d

(v,r

) = ∞,

the algorithm fails as the required substrate cannot

be reached from S. In such cases, source set S needs

to be expanded or reaction collection R revised. To

avoid loops, ε > 0 is added to distances d

, ensur-

ing that distances strictly increase when traversing the

network.

Algorithm 2: MinFill.

1: Input: v,S,D,M

2: Γ ←

0; Q ← {r ∈ v |D(r) = ∞}

3: while |Q| > 0 do

4: r ← pop(Q)

5: for all o ∈ S(r) do

6: if o /∈ M then

7: q ← argmin

q∈Pr(o)

D(q)

8: if D(q) = ∞ then

9: return “Impossible to ﬁnd gapﬁlling

set”

10: if q /∈ Γ and q /∈ v then

11: push(Q,q)

12: Γ ← Γ ∪{r}

13: Return Γ

Theorem 2.5. Local adjustment algorithm returns

gapless metabolic networks L(u) for each internal

node u or reports failure in O(nmlog m) time, where

n is the number of species and m = |R | is the number

of reactions.

Proof. As each reaction is inserted at most once to

the heap, Algorithm 3 takes O(m log m) time to com-

pute distances d

assuming that both heap operations

Algorithm 3: MinDist: compute distances d

(v,S).

1: Input: v,S

2: M ← S; Q ← min-heap

3: for all r ∈ {r ∈ R |S(r) ⊆ S} do

4: insert(Q,r,0)

5: for all r ∈ R do

6: D(r) ← 0 if r ∈ Q; D(r) ← ∞ otherwise

7: while |Q| > 0 do

8: r ← extract-min(Q)

9: for all o ∈ P(r) do

10: if o /∈ M then

11: M ← M ∪ {o}

12: for all q ∈ {q ∈ R | D(q) = ∞∧S(q) ⊆ M}

13: D(q) ← D(r) + δ

(v,q) + ε

14: insert(Q, q, D(q))

15: return (D,M)

insert and extract-min take O(logm) time. Fur-

ther we assume that |P(r)| is bound by a constant,

which is reasonable as typically in enzymatic reac-

tions |P(r)| = 1,.. .,3. In Algorithm 2 each reaction

is inserted at most once to the queue Q, hence the

subroutine takes O(m) time. Algorithm 1 makes a

single call to subroutine Fitch which takes O(nm)

time. Loop at line 11 is executed also O(nm) times.

The time complexity of the algorithm is dominated

by the O(n) calls to Algorithm 2, resulting in total

O(nmlogm) time. Figure 4 illustrates the operation

of the algorithm.

Note that it is possible to have an instance of

GMM where the algorithm fails to ﬁnd a gapless so-

lution although such solution exists. As an example,

consider two networks with a single reaction L(v

) =

L(v

) = {r}, where S(r) = {m

},P(r) = {m

} and

S =

0, and the tree of Figure 3. Then, the optimal

solution is L(v

) =

0. However, as the algorithm does

not attempt to remove reactions from the initial Fitch

solution L(v

) = {r}, the network remains gapless.

Such cases are avoided by carefully selecting the re-

action collection and source metabolites. In practice,

one must ensure that the source metabolite set is large

enough for each reaction to be reachable in network

L(u) = R .

3 EXPERIMENTS

We experimented with two datasets. First, we

generated random phylogenies consisting of gap-

less metabolic networks under a simple model of

metabolic network evolution. Second, we derived

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION

Table 1: Values of heuristic d

for different parent and children assignments.

Parent Left child Right child d

0 0 0 3 1 0 0 1

0 0 1 1 1 0 1 0

0 0 0,1 2 1 0 0,1 0

0 1 1 0 1 1 1 0

0 1 0,1 0 1 1 0,1 0

0 0,1 0,1 1 1 0,1 0,1 0

0,1

3 3

Figure 4: Operation of Algorithm 1 to solve the Gapless Minimum Mutation problem. Left: A part of a phylogenetic tree.

Right: Subnetworks of nodes u,v,x, y. State at node v after a call to MinDist: example values of d

and reaction assignments

shown at bottom right corner and on top of each reaction of v, respectively. Metabolite m

assumed to be a source. Dashed

edges indicate parts of networks not shown. Distances for r

: d

,v) = d

,u) + d

,x) +d

,y) = 1 − 1 + 1 = 1 and

thus d

,v) = max(min(d

,v), d

,v)), 0) + d

,v) = max(min(2,4), 0) + 1 = 2 + 1 = 3.

metabolic networks for 16 fungal species from the

KEGG database (Kanehisa et al., 2008). We then

solved the Gapless Minimum Mutation problem for

both datasets.

3.1 Random Phylogenies

To generate a phylogenetic tree, we ﬁrst started with

a gapless network containing 300 random reactions

from KEGG and then simulated evolution by ran-

domly adding or removing one reaction at time. Only

additions and deletions, or reaction ﬂips, that pre-

served gaplessness were allowed. Probability of both

the addition and deletion was 0.5. The reaction to be

added or deleted was chosen uniformly from the set of

reactions whose addition or deletion preserved gap-

lessness. After each ﬂip, a speciation event occured

at a ﬁxed probability 0.005 resulting in a new branch

in the phylogenetic tree. The process was terminated

after a tree of 30 nodes was generated resulting in net-

works of 249 ± 22 reactions at each node

The generated taxa were reconstructed by the

Fitch algorithm and Algorithm 1. Prior to recon-

struction, the input taxa were randomly perturbed to

simulate effects of annotation errors. Speciﬁcally,

each reaction present in the taxa was deleted with

the probability p

= 0,0.01, 0.02, 0.05,0.1. Table 2

We use the ± notation to indicate standard deviations.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

Table 2: Reconstructing random phylogenies when errors were introduced to data by deleting reactions from taxa with a

ﬁxed probability p

. Columns FitchError and GaplessError give the reconstruction error measured as the average Hamming

distance between reconstructed network and generating taxa at an internal node. Columns Gaps and Fills show the average

number of gapped reactions and gapﬁll reactions added by our algorithm. Results are averages over 25 repeats. Standard

deviations given in parentheses.

FitchError GaplessError Gaps Fills

0.0 52.3 (4.12) 50.3 (3.80) 0.18 (0.11) 0.15 (0.09)

0.005 53.3 (3.49) 51.2 (3.47) 0.35 (0.22) 0.24 (0.14)

0.01 54.8 (3.93) 52.4 (4.09) 0.55 (0.21) 0.35 (0.13)

0.02 57.3 (4.84) 54.9 (4.60) 0.94 (0.35) 0.59 (0.20)

0.05 63.3 (4.06) 60.3 (4.04) 1.82 (0.60) 1.11 (0.27)

0.1 76.5 (4.14) 73.2 (4.00) 3.65 (1.00) 2.37 (0.53)

Table 3: Species in fungal dataset. Columns Reactions, Gaps and Fills give the number of all reactions and gapped reactions

in the initial networks, and reactions added by the algorithm to ﬁll gaps, respectively.

Abbr Species Reactions Gaps Fills

ago Ashbya gossypii 272 64 (23.5%) 37 (+13.6%)

afm Aspergillus fumigatus 468 80 (17.1%) 50 (+10.7%)

ani Aspergillus nidulans 382 74 (19.4%) 42 (+11.0%)

aor Aspergillus oryzae 396 70 (17.7%) 42 (+10.6%)

cgr Candida glabrata 270 70 (25.9%) 35 (+13.0%)

cne Cryptococcus neoformans 336 68 (20.2%) 35 (+10.4%)

dha Debaryomyces hansenii 326 66 (20.2%) 34 (+10.4%)

fgr Fusarium graminearum 474 86 (18.1%) 43 (+9.1%)

kla Kluyveromyces lactis 292 68 (23.3%) 36 (+12.3%)

mgr Magnaporthe grisea 390 74 (19.0%) 41 (+10.5%)

ncr Neurospora crassa 416 74 (17.8%) 38 (+9.1%)

dpch Phanerochaete chrysosporium 410 90 (22.0%) 44 (+10.7%)

sce Saccharomyces cerevisiae 332 80 (24.1%) 36 (+10.8%)

spo Schizosaccharomyces pombe 278 68 (24.5%) 35 (+12.6%)

uma Ustilago maydis 296 76 (25.7%) 47 (+15.9%)

yli Yarrowia lipolytica 362 92 (25.4%) 43 (+11.9%)

shows the reconstruction error measured as the av-

erage Hamming distance between node labels in re-

construction and generating taxa. Further, the aver-

age number of gapped and gapﬁll reactions at each

node are shown. For instance, with random deletion

probability p

= 0.1, the reconstruction errors mea-

sured, 76.5 and 73.2 for Fitch’s and our algorithm, re-

spectively, were statistically different from each other

(paired t(48) = 2.87, p = 0.006). Moreover, the simu-

lated trees contained on the average 3.65 gaps in each

network. Our algorithm added 2.37 gap-ﬁlling reac-

tions to each network on the average. Regardless of

parameter p

, gapless reconstruction of each instance

took about 27 seconds on a standard desktop com-

puter running a Python implementation our algorithm.

3.2 Fungal Phylogenies

To experiment with a more real-world scenario, we

constructed metabolic networks for 16 fungal species

corresponding to 17 carbohydrate metabolism path-

ways (Kanehisa et al., 2008) from KEGG gene-

reaction links. As shown in Table 3, the process re-

sulted in a high number of gapped reactions in these

initial networks largely due to incomplete annotations

and stoichiometry in KEGG enzymes and reactions.

As input, we provided our algorithm with the

initial networks and a phylogenetic tree of the

species ((Fitzpatrick et al., 2006), Figure 5). Because

we had gapped networks to begin with, we added a

new internal node for each species and an edge to the

node as described earlier.

Many microorganisms and free living fungi in par-

ticular can synthesize all their cellular components

from inorganic sources, given a source of energy and

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION

afm

ani

uma

dpch

cne

spo

aor

dha

ncr

mgr

yli

sce

cgr

fgr

ago

kla

N22

N11

N13

N25

N14

N15

N17

N18

N20

N21

Figure 5: A phylogeny tree for the 16 fungal species rooted at node N4. Shaded nodes indicate the two species and their

immediate ancestor N22 highlighted in Figure 6.

C00137

myo-Inositol

R01184 (d=1)

EC 1.3.99.1

C00191

D-Glucuronate

R01481 (d=2)

EC 1.1.1.19

R01483 (d=2)

EC 3.1.1.19

C00800

L-Gulonate

R02933 (d=2)

EC 3.1.1.17

EC 3.1.1.25

C02670

Glucurone

R02957 (d=2)

EC 1.2.1.3

C01040

L-Gulono-1,4-lactone

C00818

D-Glucarate

Figure 6: A subnetwork of the metabolic network reconstructed at node N22 of the phylogenetic tree shown in Figure 5.

Reactions and metabolites are drawn as rectangles and ellipses, respectively. Rounded rectangles show the three reactions

added by the algorithm to ﬁll the gapped reactions R02933 and R02957. Distances d

given in parentheses as d.

carbon such as glucose (Deacon, 2006). However,

many fungi require one or more vitamins such as thi-

amine or biotin. To model the fungal metabolic envi-

ronment, a ﬁxed set of 1518 source metabolites were

used containing fungal energy and carbon sources, co-

factors such as ATP and NAD and metabolites needed

to account for the large number of otherwise isolated

subgraphs in the KEGG universal metabolic network.

Table 3 lists the number of reactions added to ﬁll

gaps in the internal nodes corresponding to species.

On the average, 39.4 ± 3.9 reactions were used to ﬁll

72.5 ± 7.0 gaps divided into 7.2 ± 0.9 graph compo-

nents at each node. The optimal minimum mutation

cost was 1082; our algorithm achieved gapless mini-

mum mutation cost of 1789.

To give an example of how the GMM result can

provide insight into metabolic network evolution and

aid reconstruction curation efforts, Figure 6 shows

ﬁve reactions from the internal node N22 that is the

parent of species nodes mgr and ncr (Figure 5). Two

reactions (KEGG ids R02933 and R02957) remained

gapped after the ﬁrst pass. Three reactions were pre-

dicted by the algorithm to ﬁll these particular gaps

(R01184, R01481, R01483). All ﬁlling reactions

were used in parent but were absent from children,

thus addition of each reaction increased the parsi-

mony cost by one.

Even though KEGG reaction-gene links were

missing for reactions R01184 and R01481, algo-

rithm predictions were supported by homologous

genes (Arvas et al., 2007) found for both reactions

in M. grisea. Further, a homologous gene was found

also in N. crassa for reaction R01184. No gene was

found to support the predicted existence of reaction

R01483, warranting further study. For the two gapped

reactions, homologues were found in both organisms,

supporting KEGG data.

4 CONCLUSIONS

In this paper, we introduced a maximum parsi-

mony algorithm for reconstructing gapless ancestral

metabolic networks for a given phylogenetic tree.

Furthermore, the method can be used to suggest gap-

less variants of draft metabolic networks of observed

species given as input. The algorithm minimizes the

number of mutations in the phylogenetic tree while

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms

maintaining the gaplessness of the ancestral networks.

Thus, the algorithm can be used both to elucidate net-

work structures in ancestral nodes and to ﬁll in gaps

in draft metabolic networks in a evolutionarily plausi-

ble manner.

We argue that such approach, where the recon-

struction networks are required to be gapless, im-

proves prediction performance over the unconstrained

case. This work extends the method of (Pitk

anen

et al., 2008), where gapless reconstructions of in-

dividual metabolic networks were inferred from se-

quence data, to take into account the phylogenetic

context of the reconstructed network.

The proposed algorithm was found to perform

well in practice despite the computational complexity

of the underlying problem. In experiments with ran-

dom data, the algorithm was able to recover the orig-

inal data from perturbed input more accurately than

the baseline method that did not enforce gaplessness

in reconstructed networks. Moreover, the algorithm

yielded explanations to the question why a given re-

action (enzyme) appears in an ancestral network by

suggesting the required reactions that render the reac-

tion gapless. This is especially important when we at-

tempt to uncover the evolutionary history leading into

observed networks.

While we experimented only with a simple ran-

dom model of evolution, the framework introduced

here lends itself to more realistic models.

Exploring this direction is left as future work,

though we note the importance of incorporat-

ing sequence data with the joint metabolic net-

work/phylogenetic tree topology driven approach pre-

sented here. To this end, dealing with inaccuracies

and omissions in the underlying metabolic reaction

databases presents an additional challenge.

ACKNOWLEDGEMENTS

We would like to thank Pasi Rastas and Esko Ukko-

nen for insightful discussions. This work was ﬁnan-

cially supported by Academy of Finland grant 118653

(ALGODAN), in part by the IST Programme of the

European Community, under the PASCAL2 Network

of Excellence, ICT-216886-PASCAL2, and by the

Academy of Finland postdoctoral researcher’s fellow-

ship 127715 (as part of the Finnish Centre of Ex-

cellence in White Biotechnology - Green Chemistry,

Project No. 118573). This publication only reﬂects

the authors’ views.

REFERENCES

Alon, N., Moshkovitz, D., and Safra, S. (2006). Algorith-

mic construction of sets for k-restrictions. ACM Trans.

Algorithms, 2(2):153–177.

Arvas, M., Kivioja, T., Mitchell, A., Saloheimo, M., Ussery,

D., Penttil

a, M., and Oliver, S. (2007). Comparison of

protein coding gene contents of the fungal phyla Pez-

izomycotina and Saccharomycotina. BMC Genomics,

8(1):325.

Borenstein, E., Kupiec, M., Feldman, M. W., and Rup-

pin, E. (2008). Large-scale reconstruction and phy-

logenetic analysis of metabolic environments. PNAS,

105(38):14482–14487.

Bourque, G. and Sankoff, D. (2004). Improving gene net-

work inference by comparing expression time-series

across species, developmental stages or tissues. J

Bioinform Comput Biol, 2(4):765–783.

Caetano-Anoll

es, G., Yafremava, L., Gee, H., Caetano-

Anoll

es, D., Kim, H., and Mittenthal, J. (2009). The

origin and evolution of modern metabolism. The In-

ternational Journal of Biochemistry & Cell Biology,

41(2):285–297.

Clemente, J., Satou, K., and Valiente, G. (2007). Phyloge-

netic reconstruction from non-genomic data. Bioin-

formatics, 23(2):e110.

Clemente, J. C., Ikeo, K., Valiente, G., and Gojobori, T.

(2009). Optimized ancestral state reconstruction using

sankoff parsimony. BMC Bioinformatics, 10(51).

Dandekar, T., Schuster, S., Snel, B., Huynen, M., and Bork,

P. (1999). Pathway alignment: application to the com-

parative analysis of glycolytic enzymes. Biochem J.,

343(Pt 1):115–124.

Deacon, J. (2006). Fungal biology. Wiley-Blackwell.

Fitch, W. M. (1971). Toward deﬁning the course of evo-

lution: minimum change for a speciﬁc tree topology.

Syst. Zool., 20:406–416.

Fitzpatrick, D., Logue, M., Stajich, J., and Butler, G.

(2006). A fungal phylogeny based on 42 complete

genomes derived from supertree and combined gene

analysis. BMC Evolutionary Biology, 6(1):99.

Garey, M. R. and Johnson, D. S. (1979). Computers

and Intractability: A Guide to the Theory of NP-

Completeness. W. H. Freeman.

Gusﬁeld, D. (1997). Algorithms on Strings, Trees, and Se-

quences. Cambridge University Press.

Handorf, T., Christian, N., Ebenh

oh, O., and Kahn, D.

(2008). An environmental perspective on metabolism.

Journal of Theoretical Biology, 252(3):530–537.

Jamshidi, N. and Palsson, B. O. (2007). Investigating the

metabolic capabilities of Mycobacterium tuberculosis

H37Rv using the in silico strain iNJ661 and proposing

alternative drug targets. BMC Systems Biology, 1(26).

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa,

M., Itoh, M., Katayama, T., Kawashima, S., Okuda,

S., Tokimatsu, T., and Yamanishi, Y. (2008). Kegg for

linking genomes to life and the environment. Nucleic

Acids Res., 36:D480–D484.

MINIMUM MUTATION ALGORITHM FOR GAPLESS METABOLIC NETWORK EVOLUTION

Kreimer, A., Borenstein, E., Gophna, U., and Ruppin,

E. (2008). The evolution of modularity in bacte-

rial metabolic networks. Proceedings of the National

Academy of Sciences, 105(19):6976.

Lacroix, V., Cottret, L., Thebault, P., and Sagot, M.-F.

(2008). An introduction to metabolic networks and

their structural analysis. IEEE Transactions on Com-

putational Biology and Bioinformatics, 5(4):594–617.

Mano, A., Tuller, T., Bj, O., and Pinter, R. Y. (2010). Com-

parative classiﬁcation of species and the study of path-

way evolution based on the alignment of metabolic

pathways. BMC Bioinformatics, 11(Suppl 1):S38.

Mithani, A., Preston, G., and Hein, J. (2010). A bayesian

approach to the evolution of metabolic networks on a

phylogeny. PLoS Computational Biology, 6(8).

Mithani, A., Preston, G. M., and Hein, J. (2009). A stochas-

tic model for the evolution of metabolic networks with

neighbor dependence. Bioinformatics, 25(12):1528–

1535.

Palsson, B. (2006). Systems biology: properties of recon-

structed networks. Cambridge University Press Cam-

bridge.

Pitk

anen, E., Rantanen, A., Rousu, J., and Ukkonen, E.

(2005). Finding feasible pathways in metabolic net-

works. In Advances in Informatics: 10th Panhellenic

Conference on Informatics (PCI 2005). Lecture Notes

in Computer Science 3746, pages 123–133.

Pitk

anen, E., Rantanen, A., Rousu, J., and Ukkonen, E.

(2008). A computational method for reconstructing

gapless metabolic networks. In Proceedings of the

2nd International Conference on Bioinformatics Re-

search and Development (BIRD’08), volume 13 of

Communications in Computer and Information Sci-

ence. Springer.

Pitk

anen, E., Rousu, J., and Ukkonen, E. (2010). Computa-

tional methods for metabolic reconstruction. Current

Opinion in Biotechnology, 21(1):70–77.

Raman, K. and Chandra, N. (2009). Flux balance analy-

sis of biological systems: applications and challenges.

Brieﬁngs in Bioinformatics, 10(4):435–449.

Rantanen, A., Rousu, J., Jouhten, P., Zamboni, N., Maa-

heimo, H., and Ukkonen, E. (2008). An analytic and

systematic framework for estimating metabolic ﬂux

ratios from 13 C tracer experiments. BMC Bioinfor-

matics, 9(1):266.

Raz, R. and Safra, S. (1997). A sub-constant error-

probability low-degree test, and a sub-constant error-

probability PCP characterization of NP. In Proceed-

ings of the Twenty-Ninth Annual ACM Symposium on

Theory of Computing, pages 475–484.

Sankoff, D. (1975). Minimal mutation trees of sequences.

SIAM J. Appl, 28:35–42.

Sharan, R. and Ideker, T. (2006). Modeling cellular machin-

ery through biological network comparison. Nature

Biotechnology, 24:427–433.

Sigurdsson, M. I., Jamshidi, N., Jonsson, J. J., and Palsson,

B. O. (2009). Genome-scale network analysis of im-

printed human metabolic genes. Epigenetics, 4(1):43–

46.

Tohsato, Y., Matsuda, H., and Hashimoto, A. (2000). A

multiple alignment algorithm for metabolic pathway

analysis using enzyme hierarchy. In Proceedings

of the Eighth International Conference on Intelligent

Systems for Molecular Biology, pages 376–383.

Tuller, T., Birin, H., Gophna, U., Kupiec, M., and Ruppin,

E. (2010). Reconstructing ancestral gene content by

coevolution. Genome Res., 20(1):122–132.

Wagner, A. (2009). Evolutionary constraints permeate

large metabolic networks. BMC Evolutionary Biol-

ogy, 9(1):231.

BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms