Computing the Variations of Edit Distance
for Rooted Labeled Caterpillars
Manami Hagihara, Takuya Yoshino and Kouichi Hirata
Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
Keywords:
Edit Distance, Rooted Labeled Caterpillar, Rooted Labeled Unordered Tree, Top-down Distance, Bottom-up
Distance, LCA-preserving Distance.
Abstract:
In this paper, we focus on the top-down distance, the LCA-preserving distance and the bottom-up distance for rooted labeled caterpillars (caterpillars, for short) as variations of the edit distance. Here, the top-down distance is the edit distance in which deletions and insertions are allowed only at leaves, the LCA-preserving distance is the one in which they are allowed only at leaves or vertices with one child, and the bottom-up distance is the one in which they are allowed only at the root. We show that the top-down and the bottom-up distances for caterpillars can be computed in $O(n)$ time and the LCA-preserving distance for caterpillars in $O(n^2)$ time. Furthermore, we give experimental results of computing these variations for caterpillars in real data.
1 INTRODUCTION
Comparing tree-structured data such as HTML and XML data for web mining or RNA and glycan data for bioinformatics is one of the important tasks for data mining. The most famous distance measure (Deza and Deza, 2016) between rooted labeled unordered trees (trees, for short) is the edit distance $\tau_{\mathrm{TAI}}$ (Tai, 1979). The edit distance is formulated as the minimum cost of edit operations, consisting of a substitution, a deletion and an insertion, applied to transform one tree into another.
It is known that the edit distance is always a metric and coincides with the minimum cost of Tai mappings (Tai, 1979). Unfortunately, the problem of computing the edit distance between trees is MAX SNP-hard (Zhang and Jiang, 1994), even if the trees are binary or the maximum height of the trees is at most 3 (Akutsu et al., 2013; Hirata et al., 2011).
Whereas the edit distance is the standard measure for comparing trees, it is too general for several applications. Therefore, more structurally sensitive variations of the edit distance, such as the top-down (or degree-1) distance $\tau_{\mathrm{TOP}}$ (Chawathe, 1999; Selkow, 1977), the LCA-preserving (or degree-2) distance $\tau_{\mathrm{LCA}}$ (Zhang et al., 1996) and the bottom-up distance $\tau_{\mathrm{BOT}}$ (Valiente, 2001), are required for these applications. Such distances are formulated as the minimum cost of variations of the Tai mapping, namely a top-down mapping (Chawathe, 1999; Selkow, 1977), an LCA-preserving mapping (Zhang et al., 1996) and a bottom-up mapping (Kuboyama, 2007; Valiente, 2001), respectively.
Operationally, the top-down distance is the edit distance in which deletions and insertions are allowed only at leaves, the LCA-preserving distance is the one in which they are allowed only at leaves or vertices with one child, and the bottom-up distance is the one in which they are allowed only at the root. Yoshino and Hirata (2017) have summarized and characterized the other variations of the Tai mapping as a Tai mapping hierarchy.
For trees, we can compute the top-down and the LCA-preserving distances in $O(n^2 d)$ time (Yamamoto et al., 2014; Zhang et al., 1996), where $n$ is the maximum number of vertices and $d$ is the minimum degree of the two trees. On the other hand, the problem of computing the bottom-up distance is MAX SNP-hard (Yamamoto et al., 2014).
A caterpillar (cf. (Gallian, 2007)) is a tree that is transformed into a rooted path by removing all of its leaves. Whereas caterpillars are very restricted and simple, some real datasets contain many caterpillars (cf. (Muraka et al., 2018; Ukita et al., 2021)). Recently, Muraka et al. (2018) proposed an algorithm to compute the edit distance between caterpillars in $O(n^2 \lambda)$ time under the unit cost function, where $\lambda$ is the maximum number of leaves in the caterpillars¹.
¹ This time complexity is different from the result in (Muraka et al., 2018), because that result contains some errors. See (Ukita et al., 2021) for more detail.
Hagihara, M., Yoshino, T. and Hirata, K.
Computing the Variations of Edit Distance for Rooted Labaled Caterpillars.
DOI: 10.5220/0010826100003122
In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2022), pages 272-279
ISBN: 978-989-758-549-4; ISSN: 2184-4313
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
Hence, in this paper, we focus on the top-down, the LCA-preserving and the bottom-up distances for caterpillars as variations of the edit distance. We design algorithms to compute them and show that the top-down distance and the bottom-up distance for caterpillars can be computed in $O(n)$ time and the LCA-preserving distance for caterpillars in $O(n^2)$ time; see Table 1.
Table 1: The time complexity of computing $\tau_{\mathrm{TAI}}$, $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$ for trees and caterpillars. Here, $n$ is the maximum number of vertices, $d$ is the minimum degree and $\lambda$ is the maximum number of leaves in two trees or caterpillars.
| distance | tree | caterpillar |
|---|---|---|
| $\tau_{\mathrm{TAI}}$ | MAX SNP-hard (Zhang and Jiang, 1994) | $O(n^2 \lambda)$ (Muraka et al., 2018) |
| $\tau_{\mathrm{TOP}}$ | $O(n^2 d)$ (Yamamoto et al., 2014) | $O(n)$ (Theorem 4) |
| $\tau_{\mathrm{LCA}}$ | $O(n^2 d)$ (Yamamoto et al., 2014) | $O(n^2)$ (Theorem 5) |
| $\tau_{\mathrm{BOT}}$ | MAX SNP-hard (Yamamoto et al., 2014) | $O(n)$ (Theorem 6) |
Also, we give experimental results of computing
these variations for caterpillars in real data. In partic-
ular, we compare the running time of the algorithms
in this paper with the previous algorithms of comput-
ing the top-down and the LCA-preserving distances
for trees (Yamamoto et al., 2014).
2 PRELIMINARIES
A tree is a connected graph without cycles. For a tree $T = (V, E)$, we denote $V$ and $E$ by $V(T)$ and $E(T)$. We sometimes denote $v \in V(T)$ by $v \in T$. A rooted tree is a tree with one vertex $r$ chosen as its root, which we denote by $r(T)$.
For each vertex $v$ in a rooted tree with the root $r$, let $UP_r(v)$ be the unique path from $v$ to $r$. The parent of $v$ $(\ne r)$, which we denote by $par(v)$, is its adjacent vertex on $UP_r(v)$, and the ancestors of $v$ $(\ne r)$ are the vertices on $UP_r(v) \setminus \{v\}$. We say that $u$ is a child of $v$ if $v$ is the parent of $u$, and $u$ is a descendant of $v$ if $v$ is an ancestor of $u$. We denote the set of all children of $v$ by $ch(v)$. Two vertices with the same parent are called siblings. A leaf is a vertex having no children. We denote the set of all leaves in a tree $T$ by $lv(T)$.
We denote $u < v$ if $v$ is an ancestor of $u$, and we denote $u \le v$ if either $u < v$ or $u = v$. Also we say that $w$ is the least common ancestor of $u$ and $v$, denoted by $u \sqcup v$, if $u \le w$, $v \le w$ and there exists no $w'$ such that $u \le w'$, $v \le w'$ and $w' < w$. A complete subtree of $T$ at $v$, denoted by $T[v]$, is a rooted tree $T' = (V', E')$ such that $r(T') = v$, $V' = \{u \in V \mid u \le v\}$ and $E' = \{(u, w) \in E \mid u, w \in V'\}$.
The height $h(v)$ of $v$ is defined as $|UP_r(v)| - 1$, and the height $h(T)$ of $T$ is the maximum height over every vertex $v \in T$. The degree $d(v)$ of $v$ is the number of the children of $v \in T$, and the degree $d(T)$ of $T$ is the maximum degree over every vertex in $T$.
We say that a rooted tree is ordered if a left-to-right order among siblings is given, and unordered otherwise. Also we say that a tree is labeled over $\Sigma$ if each vertex is assigned a symbol from a fixed finite alphabet $\Sigma$, where we denote the label of a vertex $v$ by $l(v)$, and sometimes identify $v$ with $l(v)$. In this paper, we simply call a rooted labeled unordered tree over $\Sigma$ a tree.
As the restricted form of trees, we introduce a
rooted labeled caterpillar (caterpillar, for short).
Definition 1. We say that a tree is a caterpillar (cf. (Gallian, 2007)) if it is transformed into a rooted path by removing all of its leaves. For a caterpillar $C$, we call the remaining rooted path the backbone of $C$ and denote it by $bb(C)$.
It is obvious that $r(C) = r(bb(C))$ and $V(C) = bb(C) \cup lv(C)$ for a caterpillar $C$, that is, every vertex in a caterpillar is either a leaf or an element of the backbone.
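As an illustration of Definition 1, the following minimal sketch extracts the backbone of a rooted tree, or reports that the tree is not a caterpillar. The representation (a dictionary `children` mapping each vertex to the list of its children) and the function name `backbone` are assumptions made for this example only; a single-vertex tree is treated here as a caterpillar with an empty backbone.

```python
def backbone(children, root):
    """Return the backbone [v_1, ..., v_n] from the root downward,
    or None if the tree is not a caterpillar.

    children: dict mapping a vertex to the list of its children
              (leaves may be absent from the dict or map to []).
    """
    path = []
    v = root
    while True:
        kids = children.get(v, [])
        if not kids:
            # Only reachable when the root itself is a leaf.
            return path
        path.append(v)
        # Non-leaf children of v; a caterpillar has at most one of them.
        internal = [u for u in kids if children.get(u)]
        if len(internal) > 1:
            return None
        if not internal:
            return path
        v = internal[0]


# A caterpillar: a has children b, c; b has the leaf d.
cat = {"a": ["b", "c"], "b": ["d"]}
# Not a caterpillar: removing the leaves x, y leaves r with two children.
not_cat = {"r": ["a", "b"], "a": ["x"], "b": ["y"]}
```

Every vertex is visited a constant number of times, so the check runs in time linear in the number of vertices.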
Next, we introduce a tree edit distance and a Tai
mapping.
Definition 2 (Edit operations (Tai, 1979)). The edit operations of a tree $T$ are defined as follows, see Figure 1.
1. Substitution: Change the label of the node $v$ in $T$.
2. Deletion: Delete a node $v$ in $T$ with parent $v'$, making the children of $v$ become the children of $v'$. The children are inserted in the place of $v$ as a subsequence in the left-to-right order of the children of $v'$. In particular, if $v$ is the root in $T$, then the result of applying the deletion is a forest consisting of the children of the root.
3. Insertion: The complement of deletion. Insert a node $v$ as a child of $v'$ in $T$, making $v$ the parent of a consecutive subsequence (for unordered trees, a subset) of the children of $v'$.
Let $\varepsilon \notin \Sigma$ denote a special blank symbol and define $\Sigma_\varepsilon = \Sigma \cup \{\varepsilon\}$. Then, we represent each edit operation by $(l_1 \mapsto l_2)$, where $(l_1, l_2) \in (\Sigma_\varepsilon \times \Sigma_\varepsilon \setminus \{(\varepsilon, \varepsilon)\})$. The operation is a substitution if $l_1 \ne \varepsilon$ and $l_2 \ne \varepsilon$, a deletion if $l_2 = \varepsilon$, and an insertion if $l_1 = \varepsilon$. For nodes $v$ and $w$, we also denote $(l(v) \mapsto l(w))$ by $(v \mapsto w)$. We define a cost function $\gamma : (\Sigma_\varepsilon \times \Sigma_\varepsilon \setminus \{(\varepsilon, \varepsilon)\}) \to \mathbb{R}^+$ on pairs of labels. We often constrain a cost function $\gamma$ to be a metric, that is, $\gamma(l_1, l_2) \ge 0$, $\gamma(l_1, l_2) = 0$ iff $l_1 = l_2$, $\gamma(l_1, l_2) = \gamma(l_2, l_1)$ and $\gamma(l_1, l_3) \le \gamma(l_1, l_2) + \gamma(l_2, l_3)$. In particular, we call the cost function such that $\gamma(l_1, l_2) = 1$ if $l_1 \ne l_2$ a unit cost function.
Figure 1: Edit operations for trees: substitution $(v \mapsto w)$, deletion $(v \mapsto \varepsilon)$ and insertion $(\varepsilon \mapsto v)$.
Definition 3 (Edit distance (Tai, 1979)). For a cost function $\gamma$, the cost of an edit operation $e = l_1 \mapsto l_2$ is given by $\gamma(e) = \gamma(l_1, l_2)$. The cost of a sequence $E = e_1, \ldots, e_k$ of edit operations is given by $\gamma(E) = \sum_{i=1}^{k} \gamma(e_i)$. Then, an edit distance $\tau_{\mathrm{TAI}}(T_1, T_2)$ between trees $T_1$ and $T_2$ is defined as follows:
\[
\tau_{\mathrm{TAI}}(T_1, T_2) = \min\left\{ \gamma(E) \;\middle|\; E \text{ is a sequence of edit operations transforming } T_1 \text{ to } T_2 \right\}.
\]
Definition 4 (Tai mapping (Tai, 1979)). Let $T_1$ and $T_2$ be trees. We say that a triple $(M, T_1, T_2)$ is a Tai mapping (a mapping, for short) from $T_1$ to $T_2$ if $M \subseteq V(T_1) \times V(T_2)$ and every pair $(v_1, w_1)$ and $(v_2, w_2)$ in $M$ satisfies the following conditions.
1. $v_1 = v_2$ iff $w_1 = w_2$ (one-to-one condition).
2. $v_1 \le v_2$ iff $w_1 \le w_2$ (ancestor condition).
We will use $M$ instead of $(M, T_1, T_2)$ when there is no confusion, and denote it by $M \in \mathcal{M}_{\mathrm{TAI}}(T_1, T_2)$.
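Let $M$ be a mapping from $T_1$ to $T_2$. Let $I_M$ and $J_M$ be the sets of nodes in $T_1$ and $T_2$ but not in $M$, that is, $I_M = \{v \in T_1 \mid (v, w) \notin M\}$ and $J_M = \{w \in T_2 \mid (v, w) \notin M\}$. Then, the cost $\gamma(M)$ of $M$ is given as follows.
\[
\gamma(M) = \sum_{(v,w) \in M} \gamma(v, w) + \sum_{v \in I_M} \gamma(v, \varepsilon) + \sum_{w \in J_M} \gamma(\varepsilon, w).
\]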
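The cost $\gamma(M)$ can be evaluated directly from its definition. The sketch below is only an illustration under assumptions of our own: vertices are dictionary keys, `labels1` and `labels2` map vertices to labels, `None` stands for the blank symbol $\varepsilon$, and `unit_cost` is the unit cost function. It does not check the one-to-one and ancestor conditions of Definition 4.

```python
def unit_cost(a, b):
    """Unit cost function: 0 for equal labels, 1 otherwise (None plays epsilon)."""
    return 0 if a == b else 1


def mapping_cost(labels1, labels2, M, gamma=unit_cost):
    """Cost of a mapping M, given as a set of (v, w) pairs.

    labels1, labels2: dicts from vertices of T1 / T2 to their labels.
    """
    mapped1 = {v for v, _ in M}   # vertices of T1 touched by M
    mapped2 = {w for _, w in M}   # vertices of T2 touched by M
    cost = sum(gamma(labels1[v], labels2[w]) for v, w in M)                  # substitutions
    cost += sum(gamma(labels1[v], None) for v in labels1 if v not in mapped1)  # deletions (I_M)
    cost += sum(gamma(None, labels2[w]) for w in labels2 if w not in mapped2)  # insertions (J_M)
    return cost


labels1 = {1: "a", 2: "b"}
labels2 = {1: "a", 2: "c", 3: "d"}
M = {(1, 1), (2, 2)}
```

For this small example, the pair $(2, 2)$ contributes a substitution of cost 1 and vertex 3 of the second tree contributes an insertion of cost 1.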
Trees $T_1$ and $T_2$ are isomorphic without labels, denoted by $T_1 \equiv_l T_2$, if there exists a mapping $M \in \mathcal{M}_{\mathrm{TAI}}(T_1, T_2)$ such that $I_M = J_M = \emptyset$, and isomorphic, denoted by $T_1 \equiv T_2$, if there exists a mapping $M \in \mathcal{M}_{\mathrm{TAI}}(T_1, T_2)$ such that $I_M = J_M = \emptyset$ and $\gamma(M) = 0$.
Theorem 1. (Tai, 1979) It holds that:
\[
\tau_{\mathrm{TAI}}(T_1, T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{TAI}}(T_1, T_2)\}.
\]
Furthermore, we introduce the variations of the
Tai mapping and the edit distance, which are main
topics in this paper.
Definition 5. Let $T$ and $S$ be trees and suppose that $M \in \mathcal{M}_{\mathrm{TAI}}(T, S)$. We define $M^-$ as $M \setminus \{(r(T), r(S))\}$.
1. We say that $M$ is a top-down mapping (Chawathe, 1999; Selkow, 1977), which we denote by $M \in \mathcal{M}_{\mathrm{TOP}}(T, S)$, if $(par(v), par(w)) \in M$ for every $(v, w) \in M^-$.
2. We say that $M$ is an LCA-preserving mapping (or degree-2 mapping) (Zhang et al., 1996), which we denote by $M \in \mathcal{M}_{\mathrm{LCA}}(T, S)$, if $(v \sqcup v', w \sqcup w') \in M$ for every $(v, w), (v', w') \in M$.
3. We say that $M$ is a bottom-up mapping (Valiente, 2001), which we denote by $M \in \mathcal{M}_{\mathrm{BOT}}(T, S)$, if the following condition holds for every $(v, w) \in M$.
\[
\forall v' \in T[v]\ \exists w' \in S[w]\ \big((v', w') \in M\big) \;\wedge\; \forall w' \in S[w]\ \exists v' \in T[v]\ \big((v', w') \in M\big).
\]
Furthermore, for $A \in \{\mathrm{TOP}, \mathrm{LCA}, \mathrm{BOT}\}$, we define the distance $\tau_A(T, S)$ between $T$ and $S$ as the minimum cost of all the mappings in $\mathcal{M}_A(T, S)$, that is:
\[
\tau_A(T, S) = \min\{\gamma(M) \mid M \in \mathcal{M}_A(T, S)\}.
\]
Here, we call $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$ a top-down distance, an LCA-preserving distance and a bottom-up distance, respectively.
For the time complexity of the variations of the edit distance in Definition 5, the following theorem holds; see also Table 1 in Section 1.
Theorem 2. Let $T$ and $S$ be trees, where $n = \max\{|T|, |S|\}$ and $d = \min\{d(T), d(S)\}$.
1. The problem of computing $\tau_{\mathrm{TAI}}(T, S)$ is MAX SNP-hard (Zhang and Jiang, 1994). This statement also holds even if both $T$ and $S$ are binary trees or the maximum height of the trees is at most 3 (Akutsu et al., 2013; Hirata et al., 2011).
2. We can compute $\tau_{\mathrm{TOP}}(T, S)$ and $\tau_{\mathrm{LCA}}(T, S)$ in $O(n^2 d)$ time (Yamamoto et al., 2014).
3. The problem of computing $\tau_{\mathrm{BOT}}(T, S)$ is MAX SNP-hard. This statement also holds even if both $T$ and $S$ are binary trees (Yamamoto et al., 2014).
The following theorem is known for caterpillars.
Theorem 3. (Muraka et al., 2018) Let $C$ and $D$ be caterpillars, where $n = \max\{|C|, |D|\}$ and $\lambda = \min\{|lv(C)|, |lv(D)|\}$. Then, we can compute $\tau_{\mathrm{TAI}}(C, D)$ in $O(n^2 \lambda)$ time under the unit cost function.
3 COMPUTING THE VARIATIONS FOR CATERPILLARS
Let $C$ and $D$ be caterpillars. We regard the backbone $bb(C)$ as a sequence $[v_1, \ldots, v_n]$, where $v_1 = r(C)$ and $(v_i, v_{i+1}) \in E(C)$, and the backbone $bb(D)$ as a sequence $[w_1, \ldots, w_m]$, where $w_1 = r(D)$ and $(w_j, w_{j+1}) \in E(D)$.
Let $L_i$ ($1 \le i \le n$) denote the set of leaves in $ch(v_i)$, that is, $L_i = ch(v_i) \setminus \{v_{i+1}\}$ for $1 \le i \le n-1$ and $L_n = ch(v_n)$. Also let $K_j$ ($1 \le j \le m$) denote the set of leaves in $ch(w_j)$, that is, $K_j = ch(w_j) \setminus \{w_{j+1}\}$ for $1 \le j \le m-1$ and $K_m = ch(w_m)$.
Recall that $C[v]$ denotes the (complete) subcaterpillar of $C$ rooted at $v$. Also $C(v)$ denotes the forest obtained by deleting the root $v$ in $C[v]$. For a caterpillar $C$ and a subcaterpillar $C'$ of $C$, we denote the caterpillar obtained by deleting $C'$ from $C$ by $C \setminus C'$.
When designing the algorithms to compute the variations of the edit distance for caterpillars, we use a multiset of labels on an alphabet $\Sigma$. A multiset on $\Sigma$ is a mapping $S : \Sigma \to \mathbb{N}$. For a multiset $S$ on $\Sigma$, we say that $a \in \Sigma$ is an element of $S$ if $S(a) > 0$ and denote it by $a \in S$ (as for a standard set). The cardinality of $S$, denoted by $|S|$, is defined as $\sum_{a \in \Sigma} S(a)$.
Let $S_1$ and $S_2$ be multisets on $\Sigma$. Then, we define the intersection $S_1 \cap S_2$, the union $S_1 \cup S_2$ and the difference $S_1 \setminus S_2$ as multisets satisfying $(S_1 \cap S_2)(a) = \min\{S_1(a), S_2(a)\}$, $(S_1 \cup S_2)(a) = \max\{S_1(a), S_2(a)\}$ and $(S_1 \setminus S_2)(a) = \max\{S_1(a) - S_2(a), 0\}$ for every $a \in \Sigma$. Note that $S_1 \setminus S_2 = S_1 \setminus (S_1 \cap S_2)$ and $|S_1 \setminus S_2| = |S_1 \setminus (S_1 \cap S_2)| = |S_1| - |S_1 \cap S_2|$.
We can compute the edit distance $\mu$ between multisets $S$ and $S'$ in $O(|S| + |S'|)$ time, since $\mu(S, S') = \max\{|S \setminus S'|, |S' \setminus S|\}$ under the unit cost function (cf. (Ukita et al., 2021)).
Let $S$ be a set of vertices. Then, we denote the multiset of labels on $\Sigma$ occurring in $S$ by $\widetilde{S}$. Also, we denote $\sum_{v \in S} \gamma(v, \varepsilon)$ by $del(S)$ and $\sum_{w \in S} \gamma(\varepsilon, w)$ by $ins(S)$.
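The multiset operations and the distance $\mu$ above map directly onto Python's `collections.Counter`, whose `&`, `|` and `-` operators take the element-wise minimum, the element-wise maximum and the truncated difference. The following is a minimal sketch under the unit cost function; the function name `mu` and the use of label strings as input are choices made for this example.

```python
from collections import Counter


def mu(s, t):
    """Edit distance between two multisets of labels under the unit cost
    function: max{|S \\ S'|, |S' \\ S|}."""
    s, t = Counter(s), Counter(t)
    # Counter subtraction keeps only positive counts, i.e. max{S(a) - S'(a), 0}.
    return max(sum((s - t).values()), sum((t - s).values()))


S1, S2 = Counter("aab"), Counter("abc")
inter = S1 & S2                      # element-wise minimum
union = S1 | S2                      # element-wise maximum
diff_size = sum((S1 - S2).values())  # |S1 \ S2|
```

Here `mu("aab", "abc")` is 1: one surplus `a` on the left can be substituted by the surplus `c` on the right. Building the two counters and subtracting them takes time linear in $|S| + |S'|$, in line with the bound stated above.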
3.1 Top-down Distance
First, we consider the equations to compute the top-down distance $\tau_{\mathrm{TOP}}(C, D)$ between caterpillars $C$ and $D$, illustrated in Figure 2.
Theorem 4. Let $C$ and $D$ be caterpillars, where $n = \max\{|C|, |D|\}$. Then, we can compute $\tau_{\mathrm{TOP}}(C, D)$ in $O(n)$ time under the unit cost function.
Proof. First, we show that the equations in Figure 2 are correct. Suppose that $M \in \mathcal{M}_{\mathrm{TOP}}(C, D)$. Note that $bb(C) = [v_1, \ldots, v_n]$ and $bb(D) = [w_1, \ldots, w_m]$.
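\[
\tau_{\mathrm{TOP}}(C, D) = \sum_{i=1}^{\min\{n,m\}-2} \left( \gamma(v_i, w_i) + \mu(\widetilde{L_i}, \widetilde{K_i}) \right) + d_{n,m}(C, D),
\]
\[
d_{n,m}(C, D) =
\begin{cases}
\gamma(v_{m-1}, w_{m-1}) + \min\left\{
\begin{array}{l}
\gamma(v_m, w_m) + \mu(\widetilde{ch(v_m)}, \widetilde{K_m}) + del(C(v_{m+1})),\\
\mu(\widetilde{ch(v_{m-1})}, \widetilde{ch(w_{m-1})}) + del(C(v_m)) + ins(K_m)
\end{array}
\right\} & \text{if } n > m,\\[2ex]
\gamma(v_{n-1}, w_{n-1}) + \min\left\{
\begin{array}{l}
\gamma(v_n, w_n) + \mu(\widetilde{L_n}, \widetilde{K_n}),\\
\mu(\widetilde{ch(v_{n-1})}, \widetilde{ch(w_{n-1})}) + del(L_n) + ins(K_n)
\end{array}
\right\} & \text{if } n = m,\\[2ex]
\gamma(v_{n-1}, w_{n-1}) + \min\left\{
\begin{array}{l}
\gamma(v_n, w_n) + \mu(\widetilde{L_n}, \widetilde{ch(w_n)}) + ins(D(w_{n+1})),\\
\mu(\widetilde{ch(v_{n-1})}, \widetilde{ch(w_{n-1})}) + ins(D(w_n)) + del(L_n)
\end{array}
\right\} & \text{if } n < m.
\end{cases}
\]
Figure 2: The equations of computing $\tau_{\mathrm{TOP}}$.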
It is obvious that $(v_1, w_1) \in M$. Also, if $(v_i, w) \in M$ (resp., $(v, w_j) \in M$), then it holds that $w = w_i$ (resp., $v = v_j$). Hence, there exists an index $h$ such that $1 \le h \le \min\{n, m\}$ and $M$ contains the pairs $(v_1, w_1), \ldots, (v_h, w_h)$. Furthermore, if $M$ has the minimum cost, then $M$ contains as many pairs $(v_h, w_h)$ as possible, so such an $h$ is $\min\{n, m\}$ and such an $M$ yields $\tau_{\mathrm{TOP}}(C, D)$.
For every $i$ ($1 \le i \le h-2$), we can compute the cost of the correspondences in $M$ between the leaves in $L_i$ and the leaves in $K_i$ as $\mu(\widetilde{L_i}, \widetilde{K_i})$, where $\widetilde{L_i} \cap \widetilde{K_i}$ implicitly represents such correspondences. Then, it holds that
\[
\tau_{\mathrm{TOP}}(C \setminus C[v_{h-1}], D \setminus D[w_{h-1}]) = \sum_{i=1}^{h-2} \left( \gamma(v_i, w_i) + \mu(\widetilde{L_i}, \widetilde{K_i}) \right),
\]
which is computed by the formula of $\tau_{\mathrm{TOP}}(C, D)$ in Figure 2 except for $d_{n,m}(C, D)$.
Consider the case that $i = h - 1$, that is, consider the formula $d_{n,m}(C, D)$.
If $n = m$, then we can compute $\tau_{\mathrm{TOP}}(C[v_{n-1}], D[w_{n-1}])$ as the sum of $\gamma(v_{n-1}, w_{n-1})$ and the minimum of $\gamma(v_n, w_n) + \mu(\widetilde{L_n}, \widetilde{K_n})$ (if $v_n$ corresponds to $w_n$) and $\mu(\widetilde{ch(v_{n-1})}, \widetilde{ch(w_{n-1})}) + del(L_n) + ins(K_n)$ (otherwise), which is realized as the second formula in $d_{n,m}(C, D)$ in Figure 2.
Suppose that $n > m$. Then, we can compute $\tau_{\mathrm{TOP}}(C[v_{m-1}], D[w_{m-1}])$ as the sum of $\gamma(v_{m-1}, w_{m-1})$ and the minimum of the upper and the lower formulas in the first formula in $d_{n,m}(C, D)$ in Figure 2.
If $v_m$ corresponds to $w_m$, then its cost is $\gamma(v_m, w_m)$ and the leaves in $K_m$ may correspond not only to the leaves in $L_m$ but also to $v_{m+1}$, that is, to $ch(v_m)$. The cost of such correspondences is computed as $\mu(\widetilde{ch(v_m)}, \widetilde{K_m})$. Furthermore, the remaining vertices in $C(v_{m+1})$ are deleted, which is realized as $del(C(v_{m+1}))$. Hence, the upper formula is correct.
Otherwise, that is, if $v_m$ does not correspond to $w_m$, then the vertices in $ch(v_{m-1})$ correspond to the vertices in $ch(w_{m-1})$. The cost of such correspondences is computed as $\mu(\widetilde{ch(v_{m-1})}, \widetilde{ch(w_{m-1})})$. Furthermore, the remaining vertices in $C(v_m)$ are deleted and the remaining vertices in $K_m$ are inserted, which is realized as $del(C(v_m)) + ins(K_m)$. Hence, the lower formula is correct. Therefore, the first formula is correct.
Similarly, for the case that $n < m$, the third formula in $d_{n,m}(C, D)$ in Figure 2 is also correct.
Since the equations visit every vertex in $C$ and $D$ at most once while traversing the pairs $(v_i, w_i)$, and the processing for $(v_i, w_i)$ runs in $O(1)$ time, the total running time is $O(|C| + |D|) = O(n)$.
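To make the linear-time traversal concrete, the following sketch evaluates only the summation part of Figure 2, $\sum_{i=1}^{k} (\gamma(v_i, w_i) + \mu(\widetilde{L_i}, \widetilde{K_i}))$, under the unit cost function; the boundary term $d_{n,m}(C, D)$ is deliberately left out. The representation of a caterpillar as a list of `(backbone_label, leaf_labels)` pairs from the root downward and the names `mu` and `prefix_cost` are assumptions of this example.

```python
from collections import Counter


def mu(s, t):
    """Multiset edit distance under the unit cost function."""
    s, t = Counter(s), Counter(t)
    return max(sum((s - t).values()), sum((t - s).values()))


def prefix_cost(C, D, k):
    """Sum over the first k backbone levels of
    gamma(v_i, w_i) + mu(L_i~, K_i~), with unit cost gamma.

    C, D: lists of (backbone_label, leaf_labels) from the root downward.
    """
    total = 0
    for (a, L), (b, K) in zip(C[:k], D[:k]):
        total += (0 if a == b else 1) + mu(L, K)
    return total


C = [("a", ["x", "y"]), ("b", ["z"]), ("c", ["u"])]
D = [("a", ["x"]), ("d", ["z", "z"]), ("c", ["u"])]
k = min(len(C), len(D)) - 2   # upper limit of the sum in Figure 2
```

Each level is processed once and each leaf label is counted once, so the loop is linear in the number of vertices it touches.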
3.2 LCA-preserving Distance
Next, we consider the recurrences for computing the LCA-preserving distance $\tau_{\mathrm{LCA}}(C, D)$ between caterpillars $C$ and $D$, illustrated in Figure 3. Here, we regard $\widetilde{C[v_{i+1}]}$ and $\widetilde{D[w_{j+1}]}$ in the recurrences as the multisets of labels occurring in all the vertices in $C[v_{i+1}]$ and $D[w_{j+1}]$.
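\[
\tau_{\mathrm{LCA}}(C[v_i], D[w_j]) = \min\left\{
\begin{array}{l}
\gamma(v_i, w_j) + \mu(\widetilde{L_i}, \widetilde{K_j}) + \tau_{\mathrm{LCA}}(C[v_{i+1}], D[w_{j+1}]),\\
\gamma(v_i, \varepsilon) + del(L_i) + \tau_{\mathrm{LCA}}(C[v_{i+1}], D[w_j]),\\
\gamma(\varepsilon, w_j) + ins(K_j) + \tau_{\mathrm{LCA}}(C[v_i], D[w_{j+1}])
\end{array}
\right\} \quad \text{if } 1 \le i < n \text{ and } 1 \le j < m,
\]
\[
\tau_{\mathrm{LCA}}(C[v_n], D[w_j]) = \min\left\{
\begin{array}{l}
\gamma(\varepsilon, w_j) + ins(K_j) + \tau_{\mathrm{LCA}}(C[v_n], D[w_{j+1}]),\\
\gamma(v_n, w_j) + \min_{v \in L_n} \left\{ \mu(\widetilde{L_n} \setminus \widetilde{\{v\}}, \widetilde{K_j}) + \mu(\widetilde{\{v\}}, \widetilde{D[w_{j+1}]}) \right\},\\
\gamma(v_n, w_j) + \mu(\widetilde{L_n}, \widetilde{K_j}) + ins(D[w_{j+1}])
\end{array}
\right\} \quad \text{if } 1 \le j < m,
\]
\[
\tau_{\mathrm{LCA}}(C[v_i], D[w_m]) = \min\left\{
\begin{array}{l}
\gamma(v_i, \varepsilon) + del(L_i) + \tau_{\mathrm{LCA}}(C[v_{i+1}], D[w_m]),\\
\gamma(v_i, w_m) + \min_{w \in K_m} \left\{ \mu(\widetilde{L_i}, \widetilde{K_m} \setminus \widetilde{\{w\}}) + \mu(\widetilde{C[v_{i+1}]}, \widetilde{\{w\}}) \right\},\\
\gamma(v_i, w_m) + \mu(\widetilde{L_i}, \widetilde{K_m}) + del(C[v_{i+1}])
\end{array}
\right\} \quad \text{if } 1 \le i < n,
\]
\[
\tau_{\mathrm{LCA}}(C[v_n], D[w_m]) = \gamma(v_n, w_m) + \mu(\widetilde{L_n}, \widetilde{K_m}).
\]
Figure 3: The recurrences of computing $\tau_{\mathrm{LCA}}$.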
We start with the following simple lemma.
Lemma 1. Let $C$ be a caterpillar. For distinct vertices $v, w \in C$, it holds that $v \sqcup w \in bb(C)$.
Proof. If $v \in bb(C)$ and $w \in bb(C)$, then it holds that $v \sqcup w = v$ or $w$, which implies that $v \sqcup w \in bb(C)$. If $v \in bb(C)$ and $w \in lv(C)$, then it holds that $v \sqcup w = v \sqcup par(w)$, which implies that $v \sqcup w \in bb(C)$. By the same reason, it holds that $v \sqcup w \in bb(C)$ if $v \in lv(C)$ and $w \in bb(C)$. If $v \in lv(C)$ and $w \in lv(C)$, then it holds that $v \sqcup w = par(v) \sqcup par(w)$, which implies that $v \sqcup w \in bb(C)$.
Theorem 5. Let $C$ and $D$ be caterpillars, where $n = \max\{|C|, |D|\}$. Then, we can compute $\tau_{\mathrm{LCA}}(C, D)$ in $O(n^2)$ time under the unit cost function.
Proof. For $M \in \mathcal{M}_{\mathrm{LCA}}(C, D)$ and for $1 \le i < n$ and $1 \le j < m$, the first recurrence in Figure 3 covers three cases: (1) if $(v_i, w_j) \in M$, then the pairs $(v, w) \in L_i \times K_j$ whose labels lie in $\widetilde{L_i} \cap \widetilde{K_j}$ are added to $M$ and next it computes $\tau_{\mathrm{LCA}}(C[v_{i+1}], D[w_{j+1}])$; (2) if $v_i$ is deleted, then all the leaves in $L_i$ are deleted and next it computes $\tau_{\mathrm{LCA}}(C[v_{i+1}], D[w_j])$; or (3) if $w_j$ is inserted, then all the leaves in $K_j$ are inserted and next it computes $\tau_{\mathrm{LCA}}(C[v_i], D[w_{j+1}])$. Then, the pairs added to $M$ are obtained only from case (1), and they consist of some $(v, w) \in L_i \times K_j$ and $(v_i, w_j)$. Since $par(v) = v_i$ and $par(w) = w_j$, and by Lemma 1, $M$ is LCA-preserving.
By the same reason, the mapping obtained from the last recurrence in Figure 3 is LCA-preserving.
Consider the second recurrence in Figure 3, that is, the case of $\tau_{\mathrm{LCA}}(C[v_n], D[w_j])$ for $1 \le j < m$ and $M \in \mathcal{M}_{\mathrm{LCA}}(C, D)$. The first formula means to insert $w_j$ and $K_j$ and next to compute $\tau_{\mathrm{LCA}}(C[v_n], D[w_{j+1}])$; no pairs are added to $M$.
The second and third formulas mean to add $(v_n, w_j)$ to $M$. Then, the second formula means that some $v \in L_n$ corresponds to some vertex $w \in D[w_{j+1}]$ (so $(v, w)$ is added to $M$ and the remaining vertices in $D[w_{j+1}]$ are inserted), and then $L_n \setminus \{v\}$ corresponds to $K_j$ as far as possible (and the corresponding pairs are added to $M$). On the other hand, the third formula means that no $v \in L_n$ corresponds to a vertex in $D[w_{j+1}]$. In this case, $L_n$ corresponds to $K_j$ as far as possible (and the corresponding pairs are added to $M$) and the vertices in $D[w_{j+1}]$ are inserted.
For both formulas, by Lemma 1, it holds that $(v_1 \sqcup v_2, w_1 \sqcup w_2) = (v_n, w_{j+1}) \in M$ for distinct pairs $(v_1, w_1), (v_2, w_2) \in M \cap (C[v_n] \times D[w_{j+1}])$. Then, $M$ is LCA-preserving.
By the same reason, the mapping obtained from the third recurrence in Figure 3 is LCA-preserving. Hence, the mapping obtained from the recurrences in Figure 3 is LCA-preserving.
By traversing $C$ and $D$ once in $O(n)$ time, we can obtain the information of $v_i$, $w_i$, $L_i$ and $K_i$. Then, in computing $\tau_{\mathrm{LCA}}(C[v_i], D[w_j])$ for a fixed $i$ and $j$, the running time is $O(1)$. Since the recurrences compute $\tau_{\mathrm{LCA}}(C[v_i], D[w_j])$ for $1 \le i \le n$ and $1 \le j \le m$, the total running time is $O(n^2)$.
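The recurrences in Figure 3 can be transcribed as a memoized recursion. The sketch below is an illustration only, under the unit cost function and with assumptions of our own: a caterpillar is a list of `(backbone_label, leaf_labels)` pairs from the root downward, indices are 0-based, and the names `mu`, `gamma` and `lca_distance` are hypothetical. It recomputes multiset distances in every cell and therefore does not attain the $O(n^2)$ bound of Theorem 5; it is meant to show the case structure, not an optimized implementation.

```python
from collections import Counter
from functools import lru_cache


def mu(s, t):
    """Multiset edit distance under the unit cost function."""
    s, t = Counter(s), Counter(t)
    return max(sum((s - t).values()), sum((t - s).values()))


def gamma(a, b):
    """Unit substitution cost."""
    return 0 if a == b else 1


def subtree_labels(X):
    """out[i] = labels of all vertices in the subcaterpillar rooted at X[i]."""
    out, acc = [None] * len(X) + [[]], []
    for i in range(len(X) - 1, -1, -1):
        label, leaves = X[i]
        acc = [label] + list(leaves) + acc
        out[i] = acc
    return out


def lca_distance(C, D):
    n, m = len(C), len(D)
    subC, subD = subtree_labels(C), subtree_labels(D)
    inf = float("inf")

    @lru_cache(maxsize=None)
    def tau(i, j):
        (a, L), (b, K) = C[i], D[j]
        L, K = list(L), list(K)
        if i == n - 1 and j == m - 1:          # last recurrence
            return gamma(a, b) + mu(L, K)
        if i == n - 1:                          # second recurrence
            rest = subD[j + 1]
            one = min((mu(L[:k] + L[k + 1:], K) + mu([L[k]], rest)
                       for k in range(len(L))), default=inf)
            return min(1 + len(K) + tau(i, j + 1),
                       gamma(a, b) + one,
                       gamma(a, b) + mu(L, K) + len(rest))
        if j == m - 1:                          # third recurrence
            rest = subC[i + 1]
            one = min((mu(L, K[:k] + K[k + 1:]) + mu(rest, [K[k]])
                       for k in range(len(K))), default=inf)
            return min(1 + len(L) + tau(i + 1, j),
                       gamma(a, b) + one,
                       gamma(a, b) + mu(L, K) + len(rest))
        return min(gamma(a, b) + mu(L, K) + tau(i + 1, j + 1),   # first recurrence
                   1 + len(L) + tau(i + 1, j),
                   1 + len(K) + tau(i, j + 1))

    return tau(0, 0)
```

For example, with `C = [("a", ["b"])]` and `D = [("a", ["c"]), ("d", ["b"])]`, the second recurrence maps the leaf `b` of `C` into the subcaterpillar below the root of `D` and inserts `c` and `d`, giving distance 2.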
3.3 Bottom-up Distance
When considering the algorithm for computing the bottom-up distance $\tau_{\mathrm{BOT}}(C, D)$ between caterpillars $C$ and $D$, we deal with the reversal of backbones, that is, $bb(C) = [v_1, \ldots, v_n]$ and $bb(D) = [w_1, \ldots, w_m]$, where $r(C) = v_n$, $(v_i, v_{i+1}) \in E(C)$, $r(D) = w_m$ and $(w_j, w_{j+1}) \in E(D)$. Then, we design the algorithm BOTCAT in Algorithm 1.
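procedure BOTCAT($C$, $D$)
    /* $C$, $D$: caterpillars */
    /* $bb(C) = [v_1, \ldots, v_n]$, $r(C) = v_n$ */
    /* $bb(D) = [w_1, \ldots, w_m]$, $r(D) = w_m$ */
1   for $h = 1$ to $\min\{n, m\}$ do
2       if $|L_h| \ne |K_h|$ then break;
3   $h$--; /* $|L_i| = |K_i|$ for $1 \le i \le h$ */
4   $A \leftarrow bb(C)$; $B \leftarrow bb(D)$; $L \leftarrow lv(C)$; $K \leftarrow lv(D)$;
5   $d \leftarrow \mu(\widetilde{L}, \widetilde{K}) + del(A) + ins(B)$;
6   if $h > 0$ then
7       $d_0 \leftarrow d$; $L^0 \leftarrow L$; $K^0 \leftarrow K$;
8       for $i = 1$ to $h$ do
9           $L^i \leftarrow L^{i-1} \setminus L_i$; $K^i \leftarrow K^{i-1} \setminus K_i$;
10          $d_i \leftarrow d_{i-1} - \gamma(v_i, \varepsilon) - \gamma(\varepsilon, w_i) + \gamma(v_i, w_i) - \mu(\widetilde{L^{i-1}}, \widetilde{K^{i-1}}) + \mu(\widetilde{L^i}, \widetilde{K^i}) + \mu(\widetilde{L_i}, \widetilde{K_i})$;
11          $d \leftarrow \min\{d, d_i\}$;
12  return $d$;
Algorithm 1: BOTCAT. Here, $L^i$ and $K^i$ (with superscripts) denote the leaves remaining after step $i$, as opposed to the leaf sets $L_i$ and $K_i$ of the backbone vertices $v_i$ and $w_i$.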
Theorem 6. Let $C$ and $D$ be caterpillars, where $n = \max\{|C|, |D|\}$. Then, we can compute $\tau_{\mathrm{BOT}}(C, D)$ in $O(n)$ time under the unit cost function.
Proof. Since $C$ and $D$ are caterpillars, if $|L_i| = |K_i|$ for $1 \le i \le h$ but $|L_{h+1}| \ne |K_{h+1}|$, then it holds that $C[v_h] \equiv_l D[w_h]$ but $C[v_{h+1}] \not\equiv_l D[w_{h+1}]$. The algorithm BOTCAT first finds such an $h$ in lines 1, 2 and 3. In this case, we can obtain a bottom-up mapping $M \in \mathcal{M}_{\mathrm{BOT}}(C, D)$ between $C[v_h]$ and $D[w_h]$ by (1) adding $(v_i, w_i)$ to $M$, (2) adding $(v, w)$ to $M$ for $v \in L_i$, $w \in K_i$ and $l(v) = l(w)$, and (3) adding $(v, w)$ to $M$ for the remaining $v \in L_i$ and $w \in K_i$, for $1 \le i \le h$. We can compute the cost concerned with (2) and (3) above as $\mu(\widetilde{L_i}, \widetilde{K_i})$. Note that the remaining vertices in $C$ are deleted and those in $D$ are inserted.
After obtaining the above $h$, the algorithm BOTCAT computes, as $d$ in line 5, the bottom-up distance whose bottom-up mapping $M \in \mathcal{M}_{\mathrm{BOT}}(C, D)$ contains no pair in $bb(C) \times bb(D)$. Then, in the for-loop in lines 8 to 11, the algorithm BOTCAT updates $d$ as the minimum of the current $d$ and the newly obtained $d_i$ such that $(v_i, w_i) \in M$. Here, $d_i$ is the distance in which $v_i \in bb(C)$ corresponds to $w_i \in bb(D)$, obtained by adding $\gamma(v_i, w_i)$ instead of $\gamma(v_i, \varepsilon) + \gamma(\varepsilon, w_i)$, and in which $L_i$ corresponds to $K_i$, obtained by adding $\mu(\widetilde{L_i}, \widetilde{K_i})$ instead of $\mu(\widetilde{L^{i-1}}, \widetilde{K^{i-1}})$, where $L^i$ and $K^i$ denote the leaves remaining after step $i$ (line 9). This is realized at line 10, that is, by using the following formula.
\[
d_i \leftarrow d_{i-1} - \gamma(v_i, \varepsilon) - \gamma(\varepsilon, w_i) + \gamma(v_i, w_i) - \mu(\widetilde{L^{i-1}}, \widetilde{K^{i-1}}) + \mu(\widetilde{L^i}, \widetilde{K^i}) + \mu(\widetilde{L_i}, \widetilde{K_i}).
\]
In other words, for $1 \le i \le h$, the bottom-up mapping $M \in \mathcal{M}_{\mathrm{BOT}}(C, D)$ is updated by adding $(v_i, w_i)$ and the correspondence between $L_i$ and $K_i$ to $M$ for every $i$, after removing the correspondences between $L_1 \cup \cdots \cup L_i$ and $K_1 \cup \cdots \cup K_i$ in $M$. Hence, the algorithm BOTCAT is correct.
By traversing $C$ and $D$ once in $O(n)$ time, we can obtain the information of $bb(C)$, $bb(D)$, $lv(C)$ and $lv(D)$ (and so $v_i$, $w_i$, $L_i$ and $K_i$). Then, each of lines 2, 4, 5, 7 and 9 to 11 runs in $O(1)$ time. Hence, the total running time of the algorithm BOTCAT is $O(n)$.
4 EXPERIMENTAL RESULTS
In this section, we give the experimental results of computing $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$. The computer environment is as follows: the OS is Ubuntu 14.04.6, the CPU is an Intel Xeon E5-1650 v3 (3.50GHz) and the RAM is 15GB.
We deal with caterpillars for N-glycans from KEGG², the largest 5,154 caterpillars (0.1%) in dblp³ (referred to as dblp$_{0.1\%}$), SwissProt and non-isomorphic caterpillars in TPC-H (referred to as TPC-H) from UW XML Repository⁴. Also we deal with caterpillars obtained by deleting the root in Auction (referred to as Auction) and non-isomorphic caterpillars obtained by deleting the root in Nasa (referred to as Nasa), Protein (referred to as Protein) and University (referred to as University) from UW XML Repository. Table 2 illustrates the information of such caterpillars. Here, #, $n$, $d$, $h$, $\lambda$ and $\beta$ are the number of caterpillars, the average number of vertices, the average degree, the average height, the average number of leaves and the average number of labels.
Then, we use all the pairs of caterpillars in Table 2, of which the number is $\# \times (\# - 1) / 2$. Table 3 illustrates the number (#pairs) of all the pairs of caterpillars in Table 2.
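As a quick check of this count, the number of unordered pairs among # caterpillars is the binomial coefficient "# choose 2". The short sketch below recomputes it from the # column of Table 2; the dictionary keys are plain-text versions of the dataset names used in the tables.

```python
from math import comb

# Number of caterpillars per dataset (column # of Table 2).
counts = {
    "N-glycans": 514, "dblp0.1%": 5154, "SwissProt": 6804, "TPC-H": 8,
    "Auction": 259, "Nasa": 33, "Protein": 5150, "University": 26,
}

# Number of unordered pairs: # * (# - 1) / 2.
pairs = {name: comb(k, 2) for name, k in counts.items()}
```

The resulting values agree with the #pairs column of Table 3, for instance 131,841 pairs for N-glycans and 23,143,806 pairs for SwissProt.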
Table 4 illustrates the running time to compute $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$, in comparison with that of $\tau_{\mathrm{TAI}}$ by the algorithm in (Muraka et al., 2018).
² Kyoto Encyclopedia of Genes and Genomes, http://www.kegg.jp/
³ http://dblp.uni-trier.de/
⁴ http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html
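Table 2: The information of caterpillars.
| data | # | $n$ | $d$ | $h$ | $\lambda$ | $\beta$ |
|---|---|---|---|---|---|---|
| N-glycans | 514 | 6.40 | 1.84 | 4.22 | 2.18 | 4.50 |
| dblp$_{0.1\%}$ | 5,154 | 41.74 | 40.73 | 1.01 | 40.73 | 10.61 |
| SwissProt | 6,804 | 35.10 | 24.96 | 2.00 | 33.10 | 16.79 |
| TPC-H | 8 | 8.63 | 7.63 | 1.00 | 7.63 | 8.63 |
| Auction | 259 | 4.29 | 3.00 | 0.71 | 3.57 | 4.29 |
| Nasa | 33 | 7.27 | 5.15 | 1.64 | 5.64 | 3.18 |
| Protein | 5,150 | 4.97 | 3.63 | 1.16 | 3.81 | 4.57 |
| University | 26 | 1.35 | 0.35 | 0.19 | 1.15 | 1.35 |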
Table 3: The number (#pairs) of all the pairs in caterpillars in Table 2.
| data | #pairs |
|---|---|
| N-glycans | 131,841 |
| dblp$_{0.1\%}$ | 13,279,281 |
| SwissProt | 23,143,806 |
| TPC-H | 28 |
| Auction | 33,411 |
| Nasa | 528 |
| Protein | 13,258,675 |
| University | 325 |
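Table 4: The running time (sec.) to compute $\tau_{\mathrm{TAI}}$, $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$.
| data | $\tau_{\mathrm{TAI}}$ | $\tau_{\mathrm{TOP}}$ | $\tau_{\mathrm{LCA}}$ | $\tau_{\mathrm{BOT}}$ |
|---|---|---|---|---|
| N-glycans | 753.33 | 1.23 | 2,804.82 | 2.57 |
| dblp$_{0.1\%}$ | 7,525.28 | 343.70 | 1,505.05 | 737.96 |
| SwissProt | 82,031.10 | 1,594.42 | 9,819.62 | 2,138.54 |
| TPC-H | $5.78 \times 10^{-3}$ | $0.64 \times 10^{-3}$ | $1.77 \times 10^{-3}$ | $1.43 \times 10^{-3}$ |
| Auction | 4.55 | 0.23 | 0.87 | 0.94 |
| Nasa | $20.93 \times 10^{-2}$ | $0.34 \times 10^{-2}$ | $4.91 \times 10^{-2}$ | $0.57 \times 10^{-2}$ |
| Protein | 2,055.77 | 118.20 | 433.22 | 327.66 |
| University | $14.22 \times 10^{-3}$ | $0.40 \times 10^{-3}$ | $2.84 \times 10^{-3}$ | $6.58 \times 10^{-3}$ |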
Table 4 shows that, whereas the time complexity of computing $\tau_{\mathrm{TOP}}$ is the same as that of computing $\tau_{\mathrm{BOT}}$, the running time of computing $\tau_{\mathrm{TOP}}$ is smaller than that of computing $\tau_{\mathrm{BOT}}$. On the other hand, the running time of computing $\tau_{\mathrm{LCA}}$ is smaller than that of computing $\tau_{\mathrm{TAI}}$ except for N-glycans. The reason is that the depth of the caterpillars in N-glycans is much larger than that of the other caterpillars.
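Table 5 illustrates the number (#cases) of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$, $\tau_{\mathrm{TAI}} < \tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{TAI}} < \tau_{\mathrm{BOT}}$ with their ratios (%) in all the pairs (#pairs), where "max." is the maximum difference from $\tau_{\mathrm{TAI}}$. Since it always holds that $\tau_{\mathrm{LCA}} \le \tau_{\mathrm{TOP}}$, we omit the cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{LCA}}$ in Table 5 when the number of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$ is 0.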
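Table 5 shows that, for caterpillars in dblp$_{0.1\%}$, TPC-H, Auction and University, $\tau_{\mathrm{TOP}}$ is an alternative and much faster distance to $\tau_{\mathrm{TAI}}$. Also, whereas $\tau_{\mathrm{LCA}}$ is an improved distance of $\tau_{\mathrm{TOP}}$ for caterpillars in N-glycans, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{TOP}}$ are not changed for the other caterpillars. Furthermore, $\tau_{\mathrm{BOT}}$ is insufficient to approximate $\tau_{\mathrm{TAI}}$, since the number of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{BOT}}$ is much larger than the number of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$.
Table 5: The number (#cases) of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$, $\tau_{\mathrm{TAI}} < \tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{TAI}} < \tau_{\mathrm{BOT}}$ with their ratios (%) in all the pairs (#pairs) with the maximum difference (max.).
$\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$
| data | #pairs | #cases | % | max. |
|---|---|---|---|---|
| N-glycans | 131,841 | 64,467 | 48.90 | 10 |
| dblp$_{0.1\%}$ | 13,279,281 | 0 | 0.00 | 0 |
| SwissProt | 23,143,806 | 5,933,179 | 25.64 | 30 |
| TPC-H | 28 | 0 | 0 | 0 |
| Auction | 33,411 | 0 | 0 | 0 |
| Nasa | 528 | 104 | 19.70 | 9 |
| Protein | 13,258,675 | 697,697 | 5.26 | 50 |
| University | 325 | 0 | 0 | 0 |
$\tau_{\mathrm{TAI}} < \tau_{\mathrm{LCA}}$
| data | #pairs | #cases | % | max. |
|---|---|---|---|---|
| N-glycans | 131,841 | 5,490 | 4.16 | 2 |
| SwissProt | 23,143,806 | 5,933,179 | 25.64 | 29 |
| Nasa | 528 | 56 | 10.61 | 1 |
| Protein | 13,258,675 | 348,119 | 2.63 | 10 |
$\tau_{\mathrm{TAI}} < \tau_{\mathrm{BOT}}$
| data | #pairs | #cases | % | max. |
|---|---|---|---|---|
| N-glycans | 131,841 | 117,657 | 89.24 | 16 |
| dblp$_{0.1\%}$ | 13,279,281 | 12,667,501 | 95.39 | 4 |
| SwissProt | 23,143,806 | 23,019,607 | 99.46 | 4 |
| TPC-H | 28 | 27 | 96.43 | 2 |
| Auction | 33,411 | 4,107 | 12.29 | 1 |
| Nasa | 528 | 403 | 76.33 | 4 |
| Protein | 13,258,675 | 8,828,524 | 66.59 | 5 |
| University | 325 | 5 | 1.54 | 1 |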
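On the other hand, by focusing on the maximum difference, for caterpillars in SwissProt and Protein, the maximum difference $\tau_{\mathrm{BOT}} - \tau_{\mathrm{TAI}}$ is much smaller than that of $\tau_{\mathrm{TOP}} - \tau_{\mathrm{TAI}}$ and $\tau_{\mathrm{LCA}} - \tau_{\mathrm{TAI}}$. Then, for these caterpillars, whereas the number of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{BOT}}$ is larger than the number of cases that $\tau_{\mathrm{TAI}} < \tau_{\mathrm{TOP}}$ and $\tau_{\mathrm{TAI}} < \tau_{\mathrm{LCA}}$, $\tau_{\mathrm{BOT}}$ is more appropriate to characterize the forms of caterpillars than $\tau_{\mathrm{TOP}}$ and $\tau_{\mathrm{LCA}}$.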
In order to improve the results in Table 5, Table 6 summarizes the cases for $\min\{\tau_{\mathrm{TOP}}, \tau_{\mathrm{BOT}}\}$.
Compared with Table 5, Table 6 shows that the usage of $\min\{\tau_{\mathrm{TOP}}, \tau_{\mathrm{BOT}}\}$ succeeds in decreasing the maximum difference while slightly decreasing the ratio. Hence, $\min\{\tau_{\mathrm{TOP}}, \tau_{\mathrm{BOT}}\}$ provides a fast approximation to $\tau_{\mathrm{TAI}}$ for caterpillars.
Table 6: The number (#cases) of cases that $\tau_{\mathrm{TAI}} < \min\{\tau_{\mathrm{TOP}}, \tau_{\mathrm{BOT}}\}$ with their ratios (%) in all the pairs (#pairs) with the maximum difference (max.).
| data | #pairs | #cases | % | max. |
|---|---|---|---|---|
| N-glycans | 131,841 | 59,921 | 45.45 | 9 |
| dblp$_{0.1\%}$ | 13,279,281 | 0 | 0.00 | 0 |
| SwissProt | 23,143,806 | 5,933,179 | 25.64 | 2 |
| TPC-H | 28 | 0 | 0 | 0 |
| Auction | 33,411 | 0 | 0 | 0 |
| Nasa | 528 | 94 | 17.80 | 1 |
| Protein | 13,258,675 | 637,773 | 4.81 | 2 |
| University | 325 | 0 | 0 | 0 |
Finally, we compare the algorithms in this paper for caterpillars with the algorithms designed by (Yamamoto et al., 2014) for standard trees. Table 7 illustrates the running time of computing $\tau_{\mathrm{TOP}}$ and $\tau_{\mathrm{LCA}}$ by using such algorithms, which we refer to as $\tau^{T}_{\mathrm{TOP}}$ and $\tau^{T}_{\mathrm{LCA}}$. Here, "–" denotes a time-out over 10,000 seconds.
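Table 7: The running time (sec.) of computing $\tau_{\mathrm{TOP}}$ and $\tau_{\mathrm{LCA}}$ by using the algorithms in this paper and the algorithms $\tau^{T}_{\mathrm{TOP}}$ and $\tau^{T}_{\mathrm{LCA}}$ in (Yamamoto et al., 2014).
| data | $\tau_{\mathrm{TOP}}$ | $\tau_{\mathrm{LCA}}$ | $\tau^{T}_{\mathrm{TOP}}$ | $\tau^{T}_{\mathrm{LCA}}$ |
|---|---|---|---|---|
| N-glycans | 1.23 | 2,804.82 | 11.77 | 25.64 |
| dblp$_{0.1\%}$ | 343.70 | 1,505.05 | – | – |
| SwissProt | 1,594.42 | 9,819.62 | – | – |
| TPC-H | $0.64 \times 10^{-3}$ | $1.77 \times 10^{-3}$ | $3.77 \times 10^{-3}$ | $7.45 \times 10^{-3}$ |
| Auction | 0.23 | 0.87 | 1.20 | 2.12 |
| Nasa | $0.34 \times 10^{-2}$ | $4.91 \times 10^{-2}$ | $5.64 \times 10^{-2}$ | $10.68 \times 10^{-2}$ |
| Protein | 118.20 | 433.22 | 628.79 | 1,156.32 |
| University | $0.40 \times 10^{-3}$ | $2.84 \times 10^{-3}$ | $2.93 \times 10^{-3}$ | $2.19 \times 10^{-3}$ |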
Table 7 shows that the algorithm for computing $\tau_{\mathrm{TOP}}$ in this paper is much faster than $\tau^{T}_{\mathrm{TOP}}$. Also, except for N-glycans and University, the algorithm for computing $\tau_{\mathrm{LCA}}$ in this paper is faster than $\tau^{T}_{\mathrm{LCA}}$.
5 CONCLUSION
In this paper, we have designed algorithms for computing $\tau_{\mathrm{TOP}}$ and $\tau_{\mathrm{BOT}}$ for caterpillars in $O(n)$ time and $\tau_{\mathrm{LCA}}$ in $O(n^2)$ time. Also, we have given experimental results of computing $\tau_{\mathrm{TOP}}$, $\tau_{\mathrm{LCA}}$ and $\tau_{\mathrm{BOT}}$ for caterpillars in real data. The usage of $\min\{\tau_{\mathrm{TOP}}, \tau_{\mathrm{BOT}}\}$ has provided a fast approximation to $\tau_{\mathrm{TAI}}$ for caterpillars. Also, in almost all cases the algorithms in this paper have been faster than the previous algorithms for trees (Yamamoto et al., 2014).
Since the algorithm for computing $\tau_{\mathrm{LCA}}$ for caterpillars is slow for N-glycans, it is a future work to improve the implementation, in particular, to apply it to a larger number of caterpillars such as all glycans in KEGG and CSLOGS⁵. Also it is a future work to investigate the other variations of the edit distance for caterpillars presented in (Yoshino and Hirata, 2017).
REFERENCES
Akutsu, T., Fukagawa, D., Halldórsson, M. M., Takasu, A., and Tanaka, K. (2013). Approximation and parameterized algorithms for common subtrees and edit distance between unordered trees. Theoret. Comput. Sci., 470:10–22.
Chawathe, S. S. (1999). Comparing hierarchical data in external memory. In Proc. VLDB'99, pages 90–101.
Deza, M. M. and Deza, E. (2016). Encyclopedia of distances (4th ed.). Springer.
Gallian, J. A. (2007). A dynamic survey of graph labeling. Electron. J. Combin., 14:DS6.
Hirata, K., Yamamoto, Y., and Kuboyama, T. (2011). Improved MAX SNP-hard results for finding an edit distance between unordered trees. In Proc. CPM'11 (LNCS 6661), pages 402–415.
Kuboyama, T. (2007). Matching and learning in trees. Ph.D thesis, University of Tokyo.
Muraka, K., Yoshino, T., and Hirata, K. (2018). Computing edit distance between rooted labeled caterpillars. In Proc. FedCSIS'18, pages 245–252.
Selkow, S. M. (1977). The tree-to-tree editing problem. Inform. Process. Lett., 6:184–186.
Tai, K.-C. (1979). The tree-to-tree correction problem. J. ACM, 26:422–433.
Ukita, Y., Yoshino, T., and Hirata, K. (2021). Caterpillar alignment distance for rooted labeled caterpillars: Distance based on alignments required to be caterpillars. In Recent advance in computational optimization, pages 111–134.
Valiente, G. (2001). An efficient bottom-up distance between trees. In Proc. SPIRE'01, pages 212–219.
Yamamoto, Y., Hirata, K., and Kuboyama, T. (2014). Tractable and intractable variations of unordered tree edit distance. Internat. J. Found. Comput. Sci., 25:307–329.
Yoshino, T. and Hirata, K. (2017). Tai mapping hierarchy for rooted labeled trees through common subforest. Theory of Comput. Sys., 60:769–787.
Zhang, K. and Jiang, T. (1994). Some MAX SNP-hard results concerning unordered labeled trees. Inform. Process. Lett., 49:249–254.
Zhang, K., Wang, J., and Shasha, D. (1996). On the editing distance between undirected acyclic graphs. Internat. J. Found. Comput. Sci., 7:43–58.
⁵ http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/Software