Heavy Caterpillar Distances for Rooted Labeled Unordered Trees
Nozomi Abe, Takuya Yoshino and Kouichi Hirata
Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
Keywords:
Heavy Caterpillar, Heavy Caterpillar Distance, Rooted Labeled Unordered Tree, Tree Edit Distance, Variations of Tree Edit Distance.
Abstract:
In this paper, we introduce two heavy caterpillar distances between rooted labeled unordered trees (trees, for
short) based on the edit distance between the heavy caterpillars obtained from the heavy paths in trees. Then,
we show that the heavy caterpillar distances provide the upper bound of the edit distance for trees, can be
computed in quadratic time under the unit cost function and are incomparable with other variations of the edit
distance.
1 INTRODUCTION
Comparing tree-structured data such as HTML and XML data for web mining or RNA and glycan data for bioinformatics is one of the important tasks for data mining. The most famous distance measure (Deza and Deza, 2016) between rooted labeled unordered trees (trees, for short) is the edit distance $\tau_{\mathrm{TAI}}$ (Tai, 1979). The edit distance is formulated as the minimum cost of edit operations, consisting of a substitution, a deletion and an insertion, applied to transform a tree to another tree. It is known that the edit distance is always a metric and coincides with the minimum cost of Tai mappings (Tai, 1979). Unfortunately, the problem of computing the edit distance between trees is MAX SNP-hard (Zhang and Jiang, 1994). This statement also holds even if trees are binary or the maximum height of trees is at most 3 (Akutsu et al., 2013; Hirata et al., 2011).
Many variations of the edit distance have been developed as more structurally sensitive distances, each defined as the minimum cost of a variation of the Tai mapping (Jiang et al., 1995; Kan et al., 2014; Kuboyama, 2007; Lu et al., 2001; Wang and Zhang, 2001; Yamamoto et al., 2014; Yoshino and Hirata, 2017; Zhang, 1996). In particular, the alignment distance $\tau_{\mathrm{ALN}}$ (Jiang et al., 1995) and the segmental distance $\tau_{\mathrm{SG}}$ (Kan et al., 2014) are the most general variations of $\tau_{\mathrm{TAI}}$, where $\tau_{\mathrm{ALN}}$ is incomparable with $\tau_{\mathrm{SG}}$, and the isolated-subtree distance $\tau_{\mathrm{ILST}}$ (Wang and Zhang, 2001) (or constrained distance (Zhang, 1996)) is the most general tractable variation of $\tau_{\mathrm{TAI}}$ (Yoshino and Hirata, 2017).
A caterpillar (cf. (Gallian, 2007)) is a tree trans-
formed to a path after removing all the leaves in it.
Recently, Muraka et al. (Muraka et al., 2018) have
shown that the problem of computing the edit distance
between caterpillars is tractable and the structural re-
striction of caterpillars provides the limitation of the
tractability for computing the edit distance. Also Mu-
raka et al. (Muraka et al., 2019) have developed the
method to fast approximate the edit distance between
caterpillars.
Hence, in this paper, we introduce new distances
for trees by using the edit distance between the em-
bedded caterpillars. Then, we focus on the heavy
path (Sleator and Tarjan, 1983), which is a famous
embedded path in a tree obtained by selecting vertices
whose number of descendants is largest from the root.
In particular, Demaine et al. (Demaine et al., 2009)
have adopted the heavy path to analyze the time com-
plexity of computing the edit distance for rooted la-
beled ordered trees.
In this paper, first we formulate a heavy caterpillar in a tree as the caterpillar whose backbone is the heavy path in the tree and whose set of leaves consists of all the vertices adjacent to the heavy path in the tree. Then, we introduce the following two heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ between trees.
The heavy caterpillar distance $\tau_{\mathrm{HC}}$ is formulated as the sum of the edit distance between heavy caterpillars and the cost of deleting and inserting the remaining vertices not contained in the heavy caterpillars. On the other hand, the heavy caterpillar distance $\tau^{c}_{\mathrm{HC}}$ is formulated as the sum of the edit distance between heavy caterpillars and the cost of the Tai mappings obtained recursively: after selecting the vertices (leaves in the heavy caterpillars) that bridge the Tai mapping between the heavy caterpillars, we compute the edit distance (the Tai mapping) between the heavy caterpillars of the complete subtrees rooted at the selected vertices, and repeat.

Abe, N., Yoshino, T. and Hirata, K.
Heavy Caterpillar Distances for Rooted Labeled Unordered Trees.
DOI: 10.5220/0009095801980204
In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2020), pages 198-204
ISBN: 978-989-758-397-1; ISSN: 2184-4313
Copyright © 2022 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved
Then, in this paper, we show that the heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ provide upper bounds of $\tau_{\mathrm{TAI}}$, that is, $\tau_{\mathrm{TAI}} \le \tau^{c}_{\mathrm{HC}} \le \tau_{\mathrm{HC}}$. For the maximum height $h$ and the maximum number $\lambda$ of leaves in two given trees $T_1$ and $T_2$, we can compute $\tau_{\mathrm{HC}}(T_1,T_2)$ in $O(h^2\lambda^3)$ time under the general cost function and in $O(h^2\lambda)$ time under the unit cost function, and $\tau^{c}_{\mathrm{HC}}(T_1,T_2)$ in $O(h^2\lambda^4)$ time under the general cost function and in $O(h^2\lambda^2)$ time under the unit cost function. Furthermore, we show that $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are incomparable with $\tau_{\mathrm{ILST}}$, $\tau_{\mathrm{ALN}}$ and $\tau_{\mathrm{SG}}$. Hence, the heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ provide other tractable variations of the edit distance $\tau_{\mathrm{TAI}}$ that are incomparable with the isolated-subtree distance $\tau_{\mathrm{ILST}}$.
2 PRELIMINARIES
A tree $T$ is a connected graph $(V,E)$ without cycles, where $V$ is the set of vertices and $E$ is the set of edges. We denote $V$ and $E$ by $V(T)$ and $E(T)$. The size of $T$ is $|V|$ and denoted by $|T|$. We sometimes denote $v \in V(T)$ by $v \in T$. We denote an empty tree $(\emptyset,\emptyset)$ by $\emptyset$. A rooted tree is a tree with one node $r$ chosen as its root. We denote the root of a rooted tree $T$ by $r(T)$.
Let $T$ be a rooted tree such that $r = r(T)$ and $u, v, w \in T$. We denote the unique path from $r$ to $v$, that is, the tree $(V',E')$ such that $V' = \{v_1,\ldots,v_k\}$, $v_1 = r$, $v_k = v$ and $(v_i,v_{i+1}) \in E'$ for every $i$ $(1 \le i \le k-1)$, by $UP_r(v)$.
The parent of $v$ $(\ne r)$, which we denote by $par(v)$, is its adjacent node on $UP_r(v)$, and the ancestors of $v$ $(\ne r)$ are the nodes on $UP_r(v) \setminus \{v\}$. We denote the set of all ancestors of $v$ by $anc(v)$. We say that $u$ is a child of $v$ if $v$ is the parent of $u$, and $u$ is a descendant of $v$ if $v$ is an ancestor of $u$. We denote the set of children of $v$ by $ch(v)$, and we write $u < v$ if $v$ is an ancestor of $u$. We call a node with no children a leaf and denote the set of all the leaves in $T$ by $lv(T)$.
A rooted path $P$ is a rooted tree $(\{v_1,\ldots,v_n\}, \{(v_i,v_{i+1}) \mid 1 \le i \le n-1\})$ such that $r(P) = v_1$. We call the node $v_n$ (the leaf of $P$) an endpoint of $P$ and denote it by $e(P)$.
The degree of $v$, denoted by $d(v)$, is the number of children of $v$, and the degree of $T$, denoted by $d(T)$, is $\max\{d(v) \mid v \in T\}$. The height of $v$, denoted by $h(v)$, is $\max\{|UP_v(w)| \mid w \in lv(T[v])\}$, and the height of $T$, denoted by $h(T)$, is $\max\{h(v) \mid v \in T\}$.
We use the ancestor orders $<$ and $\le$, that is, $u < v$ if $v$ is an ancestor of $u$, and $u \le v$ if $u < v$ or $u = v$. We say that $w$ is the least common ancestor of $u$ and $v$, denoted by $u \sqcup v$, if $u \le w$, $v \le w$ and there exists no node $w' \in T$ such that $w' < w$, $u \le w'$ and $v \le w'$. Let $T$ be a rooted tree $(V,E)$ and $v$ a node in $T$. A complete subtree of $T$ at $v$, denoted by $T[v]$, is a rooted tree $T' = (V',E')$ such that $r(T') = v$, $V' = \{u \in V \mid u \le v\}$ and $E' = \{(u,w) \in E \mid u, w \in V'\}$.
We say that $u$ is to the left of $v$ in $T$ if $pre(u) \le pre(v)$ for the preorder number $pre$ in $T$ and $post(u) \le post(v)$ for the postorder number $post$ in $T$. We say that a rooted tree is ordered if a left-to-right order among siblings is given; unordered otherwise. We say that a rooted tree is labeled if each node is assigned a symbol from a fixed finite alphabet $\Sigma$. For a node $v$, we denote the label of $v$ by $l(v)$, and sometimes identify $v$ with $l(v)$. In this paper, we simply call a rooted labeled unordered tree a tree.
Furthermore, we call a set of trees a forest. In
particular, we denote the forest obtained by deleting
v in T[v] by T(v).
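The notions above can be made concrete with a minimal Python sketch. It assumes a hypothetical representation that is not part of the paper: a `parent` dictionary (the root maps to `None`) and a `children` dictionary of lists. The function names `anc`, `lca` and `complete_subtree` are illustrative only.

```python
def anc(parent, v):
    """Proper ancestors of v, nearest first (the set anc(v))."""
    out = []
    while parent.get(v) is not None:
        v = parent[v]
        out.append(v)
    return out


def lca(parent, u, v):
    """Least common ancestor of u and v: the lowest w with u <= w and v <= w."""
    above_u = {u, *anc(parent, u)}
    # Walk upward from v; the first node that is also above u is the lowest one.
    for w in [v, *anc(parent, v)]:
        if w in above_u:
            return w
    raise ValueError("nodes are not in the same rooted tree")


def complete_subtree(children, v):
    """Vertex set of the complete subtree T[v]: v and all its descendants."""
    nodes = [v]
    for x in nodes:  # the list grows while we sweep it, giving a breadth-first order
        nodes.extend(children.get(x, []))
    return set(nodes)


# Hypothetical example tree: r has children a, b; a has children c, d.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
children = {"r": ["a", "b"], "a": ["c", "d"]}
```

With this example, `lca(parent, "c", "d")` is `"a"` and the forest $T(a)$ consists of the single-node trees on `c` and `d`.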
Definition 1 (Caterpillar (cf., (Gallian, 2007))). We say that a tree is a caterpillar if it is transformed to a rooted path after removing all the leaves in it. For a caterpillar $C$, we call the remaining rooted path a backbone of $C$ and denote it by $bb(C)$.
It is obvious that $r(C) = r(bb(C))$ and $V(C) = bb(C) \cup lv(C)$ for a caterpillar $C$, that is, every node in a caterpillar is either a leaf or an element of the backbone.
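Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `children` dictionary that maps every node to the list of its children; the function name is illustrative. It uses the fact that the non-leaf nodes of a rooted tree are closed under taking ancestors, so they form a rooted path exactly when each of them has at most one non-leaf child. A single-node tree is treated as a caterpillar by convention here.

```python
def is_caterpillar(children):
    """Return True if removing all leaves leaves a rooted path (Definition 1)."""
    # Nodes that survive leaf removal are exactly the nodes with at least one child.
    internal = {v for v, kids in children.items() if kids}
    # The survivors form a path iff no survivor has two surviving children.
    return all(
        sum(1 for c in children[v] if c in internal) <= 1
        for v in internal
    )


star = {0: [1, 2, 3], 1: [], 2: [], 3: []}             # backbone: 0
path = {0: [1], 1: [2], 2: []}                          # backbone: 0 - 1
bushy = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}       # 0 has two internal children
```

Here `star` and `path` are caterpillars, while `bushy` is not, because removing its leaves leaves a root with two children.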
Next, we introduce a tree edit distance and a Tai
mapping.
Definition 2 (Edit operations (Tai, 1979)). The edit operations of a tree $T$ are defined as follows, see Figure 1.
1. Substitution: Change the label of the node $v$ in $T$.
2. Deletion: Delete a node $v$ in $T$ with parent $v'$, making the children of $v$ become the children of $v'$. The children are inserted in the place of $v$ as a subset of the children of $v'$. In particular, if $v$ is the root in $T$, then the result of applying the deletion is a forest consisting of the children of the root.
3. Insertion: The complement of deletion. Insert a node $v$ as a child of $v'$ in $T$, making $v$ the parent of a subset of the children of $v'$.
Let $\varepsilon \notin \Sigma$ denote a special blank symbol and define $\Sigma_\varepsilon = \Sigma \cup \{\varepsilon\}$. Then, we represent each edit operation by $(l_1 \mapsto l_2)$, where $(l_1,l_2) \in (\Sigma_\varepsilon \times \Sigma_\varepsilon \setminus \{(\varepsilon,\varepsilon)\})$. The operation is a substitution if $l_1 \ne \varepsilon$ and $l_2 \ne \varepsilon$, a deletion if $l_2 = \varepsilon$, and an insertion if $l_1 = \varepsilon$. For nodes $v$ and $w$, we also denote $(l(v) \mapsto l(w))$ by $(v \mapsto w)$. We define a cost function $\gamma : (\Sigma_\varepsilon \times \Sigma_\varepsilon \setminus \{(\varepsilon,\varepsilon)\}) \to \mathbb{R}^{+}$ on pairs of labels. We often constrain a cost function $\gamma$ to be a metric, that is, $\gamma(l_1,l_2) \ge 0$, $\gamma(l_1,l_2) = 0$ iff $l_1 = l_2$, $\gamma(l_1,l_2) = \gamma(l_2,l_1)$ and $\gamma(l_1,l_3) \le \gamma(l_1,l_2) + \gamma(l_2,l_3)$. In particular, we call the cost function such that $\gamma(l_1,l_2) = 1$ if $l_1 \ne l_2$ (and $0$ otherwise) a unit cost function.
Figure 1: Edit operations for trees (panels: Substitution $(v \mapsto w)$, Deletion $(v \mapsto \varepsilon)$, Insertion $(\varepsilon \mapsto v)$).
Definition 3 (Edit distance (Tai, 1979)). For a cost function $\gamma$, the cost of an edit operation $e = l_1 \mapsto l_2$ is given by $\gamma(e) = \gamma(l_1,l_2)$. The cost of a sequence $E = e_1,\ldots,e_k$ of edit operations is given by $\gamma(E) = \sum_{i=1}^{k} \gamma(e_i)$. Then, an edit distance $\tau_{\mathrm{TAI}}(T_1,T_2)$ between trees $T_1$ and $T_2$ is defined as follows:
$$\tau_{\mathrm{TAI}}(T_1,T_2) = \min\{\gamma(E) \mid E \text{ is a sequence of edit operations transforming } T_1 \text{ to } T_2\}.$$
Definition 4 (Tai mapping (Tai, 1979)). Let $T_1$ and $T_2$ be trees. We say that a triple $(M,T_1,T_2)$ is a Tai mapping (a mapping, for short) from $T_1$ to $T_2$ if $M \subseteq V(T_1) \times V(T_2)$ and every pair $(v_1,w_1)$ and $(v_2,w_2)$ in $M$ satisfies the following conditions.
1. $v_1 = v_2$ iff $w_1 = w_2$ (one-to-one condition).
2. $v_1 \le v_2$ iff $w_1 \le w_2$ (ancestor condition).
We will use $M$ instead of $(M,T_1,T_2)$ when there is no confusion, and denote it by $M \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$.
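The two conditions of Definition 4 can be tested directly on a candidate set of pairs. The sketch below assumes hypothetical `parent` dictionaries (root maps to `None`) for the two trees; it is a quadratic brute-force check for illustration, not an algorithm from the paper.

```python
def above_or_equal(parent, v):
    """Set of nodes u with v <= u, i.e. v itself together with its ancestors."""
    out = {v}
    while parent.get(v) is not None:
        v = parent[v]
        out.add(v)
    return out


def is_tai_mapping(M, parent1, parent2):
    """Check the one-to-one and ancestor conditions for every two pairs in M."""
    for v1, w1 in M:
        for v2, w2 in M:
            # One-to-one condition: v1 = v2 iff w1 = w2.
            if (v1 == v2) != (w1 == w2):
                return False
            # Ancestor condition: v1 <= v2 iff w1 <= w2.
            if (v2 in above_or_equal(parent1, v1)) != (w2 in above_or_equal(parent2, w1)):
                return False
    return True


# T1: root r with two children a, b.  T2: a path r - x - y.
parent1 = {"r": None, "a": "r", "b": "r"}
parent2 = {"r": None, "x": "r", "y": "x"}
```

For these example trees, `{("r", "r"), ("a", "x")}` is a Tai mapping, whereas adding `("b", "y")` violates the ancestor condition because `a` and `b` are siblings but `x` is an ancestor of `y`.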
Let $M$ be a mapping from $T_1$ to $T_2$. Let $I_M$ and $J_M$ be the sets of nodes in $T_1$ and $T_2$ but not in $M$, that is, $I_M = \{v \in T_1 \mid (v,w) \notin M\}$ and $J_M = \{w \in T_2 \mid (v,w) \notin M\}$. Then, the cost $\gamma(M)$ of $M$ is given as follows.
$$\gamma(M) = \sum_{(v,w) \in M} \gamma(v,w) + \sum_{v \in I_M} \gamma(v,\varepsilon) + \sum_{w \in J_M} \gamma(\varepsilon,w).$$
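As an illustration of the cost formula, the following sketch evaluates $\gamma(M)$ for a given set of pairs. The label dictionaries, the use of `None` for the blank symbol $\varepsilon$, and the function names are assumptions made for this example; the default cost is the unit cost function.

```python
EPS = None  # stands for the blank symbol ε (assumption: no node is labeled None)


def unit_cost(a, b):
    """Unit cost function: 1 if the labels differ, 0 otherwise."""
    return 0 if a == b else 1


def mapping_cost(M, label1, label2, gamma=unit_cost):
    """Cost of a mapping M: substitutions plus deletions (I_M) plus insertions (J_M)."""
    mapped1 = {v for v, _ in M}
    mapped2 = {w for _, w in M}
    substitutions = sum(gamma(label1[v], label2[w]) for v, w in M)
    deletions = sum(gamma(label1[v], EPS) for v in label1 if v not in mapped1)
    insertions = sum(gamma(EPS, label2[w]) for w in label2 if w not in mapped2)
    return substitutions + deletions + insertions


label1 = {1: "a", 2: "b", 3: "c"}   # nodes of T1 with their labels
label2 = {10: "a", 20: "x"}         # nodes of T2 with their labels
M = {(1, 10), (2, 20)}
```

For this example, the pair `(2, 20)` costs 1 as a substitution of `b` by `x`, node `3` is deleted at cost 1, and nothing is inserted, so `mapping_cost(M, label1, label2)` evaluates to 2. The function does not check that `M` is a Tai mapping.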
Theorem 1 ((Tai, 1979)). $\tau_{\mathrm{TAI}}(T_1,T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)\}$.
Furthermore, we introduce the variations of Tai mappings. Whereas the alignment distance (Jiang et al., 1995) was first defined by using an alignment tree between two trees as the common supertree, it is known that the alignment distance coincides with the minimum cost of less-constrained mappings (Kuboyama, 2007). Hence, in this paper, we regard the less-constrained mapping as an alignable mapping and formulate the alignment distance as the minimum cost of alignable mappings.
Definition 5 (Variations of Tai mapping). Let $T_1$ and $T_2$ be trees and $M \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$.
1. We say that $M$ is an alignable mapping (Kuboyama, 2007) (or a less-constrained mapping (Lu et al., 2001)), denoted by $M \in \mathcal{M}_{\mathrm{ALN}}(T_1,T_2)$, if $M$ satisfies the following condition:
$$\forall (v_1,w_1), (v_2,w_2), (v_3,w_3) \in M \; \bigl( v_1 \sqcup v_2 < v_1 \sqcup v_3 \Longrightarrow w_2 \sqcup w_3 = w_1 \sqcup w_3 \bigr).$$
Also we define an alignment distance $\tau_{\mathrm{ALN}}(T_1,T_2)$ (Jiang et al., 1995) as the minimum cost of all the alignable mappings, that is:
$$\tau_{\mathrm{ALN}}(T_1,T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{ALN}}(T_1,T_2)\}.$$
2. We say that $M$ is an isolated-subtree mapping (Wang and Zhang, 2001) (or a constrained mapping (Zhang, 1996)), denoted by $M \in \mathcal{M}_{\mathrm{ILST}}(T_1,T_2)$, if $M$ satisfies the following condition:
$$\forall (v_1,w_1), (v_2,w_2), (v_3,w_3) \in M \; \bigl( v_3 < v_1 \sqcup v_2 \Longleftrightarrow w_3 < w_1 \sqcup w_2 \bigr).$$
Also we define an isolated-subtree distance $\tau_{\mathrm{ILST}}(T_1,T_2)$ as the minimum cost of all the isolated-subtree mappings, that is:
$$\tau_{\mathrm{ILST}}(T_1,T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{ILST}}(T_1,T_2)\}.$$
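3. We say that $M$ is a segmental mapping (Kan et al., 2014), denoted by $M \in \mathcal{M}_{\mathrm{SG}}(T_1,T_2)$, if $M$ satisfies the following condition:
$$\forall (v,w) \in M \; \Bigl( \exists (v',w') \in M \; \bigl( v' \in anc(v) \wedge w' \in anc(w) \bigr) \Longrightarrow (par(v),par(w)) \in M \Bigr).$$
Also we define a segmental distance $\tau_{\mathrm{SG}}(T_1,T_2)$ as the minimum cost of all the segmental mappings, that is:
$$\tau_{\mathrm{SG}}(T_1,T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{SG}}(T_1,T_2)\}.$$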
Furthermore, for distances $\tau_A$ and $\tau_B$, we say that $\tau_A$ is incomparable with $\tau_B$ if there exist trees $T_1$, $T_2$, $T_3$ and $T_4$ such that $\tau_A(T_1,T_2) < \tau_B(T_1,T_2)$ and $\tau_B(T_3,T_4) < \tau_A(T_3,T_4)$.
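Theorem 2 ((Kuboyama, 2007; Yoshino and Hirata, 2017)). Let $T_1$ and $T_2$ be trees. Then, it holds that $\mathcal{M}_{\mathrm{ILST}}(T_1,T_2) \subseteq \mathcal{M}_{\mathrm{ALN}}(T_1,T_2) \subseteq \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$ and $\mathcal{M}_{\mathrm{SG}}(T_1,T_2) \subseteq \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$. On the other hand, $\mathcal{M}_{\mathrm{ILST}}(T_1,T_2)$ or $\mathcal{M}_{\mathrm{ALN}}(T_1,T_2)$ is incomparable with $\mathcal{M}_{\mathrm{SG}}(T_1,T_2)$ with respect to set inclusion.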
Theorem 2 implies $\tau_{\mathrm{TAI}}(T_1,T_2) \le \tau_{\mathrm{ALN}}(T_1,T_2) \le \tau_{\mathrm{ILST}}(T_1,T_2)$ and $\tau_{\mathrm{TAI}}(T_1,T_2) \le \tau_{\mathrm{SG}}(T_1,T_2)$ for all trees $T_1$ and $T_2$. On the other hand, $\tau_{\mathrm{ILST}}$ or $\tau_{\mathrm{ALN}}$ is incomparable with $\tau_{\mathrm{SG}}$. Furthermore, the following theorem is known for the problem of computing $\tau_{\mathrm{TAI}}$ and its variations.
Theorem 3. Let $T_1$ and $T_2$ be trees such that $n = \max\{|T_1|,|T_2|\}$ and $d = \min\{d(T_1),d(T_2)\}$.
1. The problem of computing $\tau_{\mathrm{TAI}}(T_1,T_2)$ is MAX SNP-hard (Zhang and Jiang, 1994). This statement holds even if both $T_1$ and $T_2$ are binary, the maximum height of $T_1$ and $T_2$ is at most 3 or the cost function is the unit cost function (Akutsu et al., 2013; Hirata et al., 2011).
2. The problem of computing $\tau_{\mathrm{ALN}}(T_1,T_2)$ is MAX SNP-hard. On the other hand, if the degrees of $T_1$ and $T_2$ are bounded by some constants, then we can compute $\tau_{\mathrm{ALN}}(T_1,T_2)$ in polynomial time with respect to $n$ (Jiang et al., 1995).
3. We can compute $\tau_{\mathrm{ILST}}(T_1,T_2)$ in $O(n^2 d)$ time (cf., (Yamamoto et al., 2014)).
4. The problem of computing $\tau_{\mathrm{SG}}(T_1,T_2)$ is MAX SNP-hard. This statement holds even if both $T_1$ and $T_2$ are binary or the cost function is the unit cost function (Yamamoto et al., 2014).
In contrast to Theorem 3, Muraka et al. (Muraka
et al., 2018) have recently shown the following theo-
rem of the edit distance for caterpillars.
Theorem 4 ((Muraka et al., 2018)). Let $C_1$ and $C_2$ be caterpillars, $h = \max\{h(C_1),h(C_2)\}$ and $\lambda = \max\{|lv(C_1)|,|lv(C_2)|\}$. Then, we can compute $\tau_{\mathrm{TAI}}(C_1,C_2)$ in $O(h^2\lambda^3)$ time under the general cost function and in $O(h^2\lambda)$ time under the unit cost function.
3 HEAVY CATERPILLAR DISTANCES
In this section, we introduce the heavy caterpillar in
a tree, based on the heavy path (Sleator and Tarjan,
1983). Then, we formulate another variation of the
edit distance as heavy caterpillar distances based on
the edit distance for heavy caterpillars.
Definition 6 (Heavy path (Sleator and Tarjan, 1983)). Let $T$ be a tree. For $v \in T$ and $w \in ch(v)$, $w$ is a heavy child of $v$ if $|T[w]|$ is maximum among the children of $v$, and we denote it by $hv(v)$. A heavy path of $T$ is the rooted path $(\{v_1,\ldots,v_n\}, \{(v_i,v_{i+1}) \mid 1 \le i \le n-1\})$ such that $v_1 = r(T)$, $v_{i+1} = hv(v_i)$ $(1 \le i \le n-1)$ and $v_n \in lv(T)$.
If there exist two or more heavy children of $v$, then we may arbitrarily name one of them the heavy child of $v$. Then, based on the heavy path in a tree, we introduce the heavy caterpillar in a tree as follows.
Definition 7 (Heavy caterpillar). Let $T$ be a tree and $P$ the heavy path of $T$. Then, we define the heavy caterpillar $hc(T) = (V,E)$ of $T$ as follows.
$$V = V(P) \cup \{w \in ch(v) \mid v \in V(P)\},$$
$$E = E(P) \cup \{(v,w) \mid v \in V(P), w \in ch(v)\}.$$
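Definitions 6 and 7 translate into a short procedure: compute subtree sizes, follow the largest child from the root, and collect the children of the path nodes. The sketch below assumes a hypothetical `children` dictionary (leaves may be omitted) and breaks ties between heavy children by taking the first child of maximum size, which is one of the arbitrary choices the text allows.

```python
def subtree_sizes(children, root):
    """Map each node v to |T[v]| using an iterative post-order traversal."""
    size = {}
    stack = [(root, False)]
    while stack:
        v, children_done = stack.pop()
        if children_done:
            size[v] = 1 + sum(size[c] for c in children.get(v, []))
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in children.get(v, []))
    return size


def heavy_path(children, root):
    """Vertices v_1, ..., v_n of the heavy path, from the root down to a leaf."""
    size = subtree_sizes(children, root)
    path = [root]
    while children.get(path[-1]):
        # max() returns the first child of maximum size, so ties go to the first one.
        path.append(max(children[path[-1]], key=lambda c: size[c]))
    return path


def heavy_caterpillar(children, root):
    """Vertex set V and edge set E of hc(T) as in Definition 7."""
    V, E = set(), set()
    for v in heavy_path(children, root):
        V.add(v)
        for w in children.get(v, []):
            V.add(w)
            E.add((v, w))
    return V, E


# Hypothetical example: node 6 hangs below the light child 2 and is not in hc(T).
children = {0: [1, 2], 1: [3, 4], 2: [6], 3: [5]}
```

In this example the heavy path is `0, 1, 3, 5`, and the heavy caterpillar contains every node except `6`, which lies below the light child `2`.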
We denote the minimum cost Tai mapping between $C_1 = hc(T_1)$ and $C_2 = hc(T_2)$ by $M_{hc}(C_1,C_2)$. Then, the algorithm HVYCATMAP in Algorithm 1 returns a Tai mapping based on the heavy caterpillars $C_1$ and $C_2$. We define the heavy caterpillar mapping between $T_1$ and $T_2$ as the mapping obtained from the algorithm HVYCATMAP$(T_1,T_2)$ and denote it by $M_{hc}(T_1,T_2)$.
1 procedure HVYCATMAP($T_1, T_2$)   /* $T_1, T_2$: trees */
2   $C_1 \leftarrow hc(T_1)$; $C_2 \leftarrow hc(T_2)$; $L_1 \leftarrow lv(C_1)$; $L_2 \leftarrow lv(C_2)$; $M \leftarrow M_{hc}(C_1,C_2)$;
3   $L \leftarrow \{(v,w) \in M \mid v \in L_1, w \in L_2, T_1(v) \ne \emptyset, T_2(w) \ne \emptyset\}$;
4   foreach $(v,w) \in L$ do
5     $M_1 \leftarrow$ HVYCATMAP($T_1[v], T_2[w]$); $M \leftarrow M \cup M_1$;
6   return $M$;
Algorithm 1: HVYCATMAP.
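The recursive structure of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under stated assumptions: trees are hypothetical `children` dictionaries, the heavy caterpillar is stored as a list of `(path node, off-path children)` pairs, and the minimum cost mapping $M_{hc}(C_1,C_2)$ is supplied by a callback. The callback used in the example, `positional_mapping`, is a naive stand-in that pairs nodes by position; it yields a valid Tai mapping between the caterpillars but is not minimum cost and is not the algorithm of Muraka et al.

```python
def heavy_caterpillar(children, root):
    """List of (heavy-path node, its off-path children), from the root downward."""
    size = {}

    def fill(v):
        size[v] = 1 + sum(fill(c) for c in children.get(v, []))
        return size[v]

    fill(root)
    cat, v = [], root
    while v is not None:
        kids = children.get(v, [])
        heavy = max(kids, key=size.get) if kids else None  # ties: first child
        cat.append((v, [c for c in kids if c != heavy]))
        v = heavy
    return cat


def positional_mapping(C1, C2):
    """Naive stand-in for M_hc(C1, C2): pair path nodes by depth, leaves by order."""
    M = set()
    for (v, side1), (w, side2) in zip(C1, C2):
        M.add((v, w))
        M.update(zip(side1, side2))
    return M


def hvy_cat_map(children1, root1, children2, root2, caterpillar_mapping):
    """Skeleton of HVYCATMAP: map the heavy caterpillars, then recurse on bridges."""
    C1 = heavy_caterpillar(children1, root1)
    C2 = heavy_caterpillar(children2, root2)
    M = set(caterpillar_mapping(C1, C2))
    # Caterpillar leaves that can still have descendants are the off-path children.
    side1 = {c for _, side in C1 for c in side}
    side2 = {c for _, side in C2 for c in side}
    bridges = [(v, w) for v, w in M
               if v in side1 and w in side2
               and children1.get(v) and children2.get(w)]
    for v, w in bridges:
        M |= hvy_cat_map(children1, v, children2, w, caterpillar_mapping)
    return M


children1 = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
children2 = {"R": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
result = hvy_cat_map(children1, "r", children2, "R", positional_mapping)
```

In the example, the pair `("b", "B")` bridges the two caterpillars because both nodes have descendants outside the heavy caterpillars, so the recursive call adds `("e", "E")`. Replacing the callback with an exact caterpillar edit distance solver would give the mapping the paper denotes by $M_{hc}(T_1,T_2)$.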
Definition 8 (Heavy caterpillar distances). Let $T_i$ be a tree, $C_i = hc(T_i)$ and $D_i = T_i \setminus C_i$ $(i = 1,2)$. Then, we define the heavy caterpillar distances $\tau_{\mathrm{HC}}(T_1,T_2)$ and $\tau^{c}_{\mathrm{HC}}(T_1,T_2)$ as follows.
$$\tau_{\mathrm{HC}}(T_1,T_2) = \tau_{\mathrm{TAI}}(C_1,C_2) + \sum_{v \in D_1} \gamma(v,\varepsilon) + \sum_{w \in D_2} \gamma(\varepsilon,w),$$
$$\tau^{c}_{\mathrm{HC}}(T_1,T_2) = \gamma(M_{hc}(T_1,T_2)).$$
Theorem 5. For trees $T_1$ and $T_2$, it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) \le \tau^{c}_{\mathrm{HC}}(T_1,T_2) \le \tau_{\mathrm{HC}}(T_1,T_2)$.
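Proof. For $C_i = hc(T_i)$ and $M' = M_{hc}(T_1,T_2) \setminus M_{hc}(C_1,C_2)$, since $\tau_{\mathrm{TAI}}(C_1,C_2) = \gamma(M_{hc}(C_1,C_2))$, it holds that $\tau^{c}_{\mathrm{HC}}(T_1,T_2) = \tau_{\mathrm{TAI}}(C_1,C_2) + \gamma(M')$. If $M' = \emptyset$, then it holds that $\gamma(M') = \sum_{v \in D_1} \gamma(v,\varepsilon) + \sum_{w \in D_2} \gamma(\varepsilon,w)$, which implies that $\tau^{c}_{\mathrm{HC}}(T_1,T_2) \le \tau_{\mathrm{HC}}(T_1,T_2)$.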
In order to show that $\tau_{\mathrm{TAI}}(T_1,T_2) \le \tau^{c}_{\mathrm{HC}}(T_1,T_2)$, it is sufficient to show that the heavy caterpillar mapping $M_{hc}(T_1,T_2)$ is a Tai mapping. If it is true, then it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) \le \gamma(M_{hc}(T_1,T_2))$.
Let $L^{*} = \{(v_1,w_1),\ldots,(v_k,w_k)\}$ be the union of all the sets $L$ selected at line 3 of HVYCATMAP in Algorithm 1 over all recursive calls, and let $v_0 = r(T_1)$ and $w_0 = r(T_2)$. Also let $M_i$ be the minimum cost mapping between the heavy caterpillars computed at line 2 in the call HVYCATMAP$(T_1[v_i],T_2[w_i])$ $(0 \le i \le k)$ and $M = M_0 \cup M_1 \cup \cdots \cup M_k$, where $M_0 = M_{hc}(C_1,C_2) \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$. Note that $M_i \in \mathcal{M}_{\mathrm{TAI}}(T_1[v_i],T_2[w_i])$, so $M_i \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$.
Since the $M_i$ are mutually distinct for every $i$ and $M_i \in \mathcal{M}_{\mathrm{TAI}}(T_1[v_i],T_2[w_i])$, $M$ satisfies the one-to-one condition. By the construction of $L$, $(M \setminus M_i) \cup \{(v_i,w_i)\}$ satisfies the ancestor condition for every $(v_i,w_i) \in L^{*}$, which implies that $M$ satisfies the ancestor condition. Hence, it holds that $M \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$. Since $M = M_{hc}(T_1,T_2)$, it holds that $M_{hc}(T_1,T_2) \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)$.
Theorem 6. Let $T_1$ and $T_2$ be trees, where $h = \max\{h(T_1),h(T_2)\}$ and $\lambda = \max\{|lv(T_1)|,|lv(T_2)|\}$. Then, we can compute $\tau_{\mathrm{HC}}(T_1,T_2)$ in $O(h^2\lambda^3)$ time under the general cost function and in $O(h^2\lambda)$ time under the unit cost function. Also we can compute $\tau^{c}_{\mathrm{HC}}(T_1,T_2)$ in $O(h^2\lambda^4)$ time under the general cost function and in $O(h^2\lambda^2)$ time under the unit cost function.
Proof. Let $C_i = hc(T_i)$ $(i = 1,2)$. First, we can obtain $C_i$ in $O(|T_i|) = O(h\lambda)$ time (Sleator and Tarjan, 1983). Since the dominant step in computing $\tau_{\mathrm{HC}}(T_1,T_2)$ is computing $\tau_{\mathrm{TAI}}(C_1,C_2)$, the time complexity of computing $\tau_{\mathrm{HC}}$ follows from Theorem 4.
Next, consider the number of recursive calls in HVYCATMAP in Algorithm 1. For $L^{*}$ in the proof of Theorem 5, we denote $L^{*}_1 = \{v \in V(T_1) \mid (v,w) \in L^{*}\}$ and $L^{*}_2 = \{w \in V(T_2) \mid (v,w) \in L^{*}\}$. Then, for every leaf $u \in lv(T_1) \setminus lv(C_1)$ (resp., $u \in lv(T_2) \setminus lv(C_2)$), there exists exactly one $v \in L^{*}_1$ (resp., $w \in L^{*}_2$) such that $T_1[v]$ (resp., $T_2[w]$), passed to the call HVYCATMAP$(T_1[v],T_2[w])$ at line 5 in Algorithm 1, contains $u$. This statement implies that $|L^{*}| \le \lambda$. Hence, the number of recursive calls is at most $\lambda$, and each call is dominated by one computation of a minimum cost mapping between heavy caterpillars (Theorem 4), so the statement for computing $\tau^{c}_{\mathrm{HC}}$ holds.
In the remainder of this section, we assume that the cost function is the unit cost function. Then, we compare $\tau^{c}_{\mathrm{HC}}$ with the edit distance $\tau_{\mathrm{TAI}}$ and its other variations $\tau_{\mathrm{ALN}}$, $\tau_{\mathrm{ILST}}$ and $\tau_{\mathrm{SG}}$.
Lemma 1. There exist trees $T_1$ and $T_2$ such that $|T_1| = |T_2| = O(n)$, $\tau_{\mathrm{TAI}}(T_1,T_2) = O(1)$ but $\tau^{c}_{\mathrm{HC}}(T_1,T_2) = \Omega(n)$.
Proof. Consider $T_1$ and $T_2$ illustrated in Figure 2. It is obvious that $|T_1| = |T_2| = 2n+1$. Also it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) = 2$ because $M_1$ in Figure 2 is the minimum cost mapping for $\tau_{\mathrm{TAI}}$. Note that $\tau_{\mathrm{ILST}}(T_1,T_2) = \tau_{\mathrm{ALN}}(T_1,T_2) = \tau_{\mathrm{SG}}(T_1,T_2) = 2$.
On the other hand, by the definition of $\tau^{c}_{\mathrm{HC}}$, we first construct the mapping with cost 0 between $hc(T_1)$ and $hc(T_2)$, that is, the second child of the root in $T_1$ (labeled by a) corresponds to the third child of the root in $T_2$ (labeled by a) and the third child of the root in $T_1$ (labeled by b) corresponds to the second child of the root in $T_2$ (labeled by b). Then, $M_2$ in Figure 2 is the minimum cost mapping for $\tau^{c}_{\mathrm{HC}}$. Hence, it holds that $\tau^{c}_{\mathrm{HC}}(T_1,T_2) = 2n - 4$.
Figure 2: Trees $T_1$ and $T_2$ in Lemma 1 and the minimum cost mappings $M_1$ for $\tau_{\mathrm{TAI}}$ and $M_2$ for $\tau^{c}_{\mathrm{HC}}$.
Lemma 2. There exist trees $T_1$ and $T_2$ such that $|T_1| = |T_2| = O(n)$, $\tau_{\mathrm{TAI}}(T_1,T_2) = \tau^{c}_{\mathrm{HC}}(T_1,T_2) = O(1)$ but $\tau_{\mathrm{ILST}}(T_1,T_2) = \Omega(n)$.
Proof. Consider $T_1$ and $T_2$ illustrated in Figure 3. It is obvious that $|T_1| = |T_2| = 2n+1$. Since $T_1$ and $T_2$ are caterpillars, it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) = \tau^{c}_{\mathrm{HC}}(T_1,T_2) = \tau_{\mathrm{HC}}(T_1,T_2) = 1$. Note that $\tau_{\mathrm{ALN}}(T_1,T_2) = 1$ and $\tau_{\mathrm{SG}}(T_1,T_2) = 3$.
On the other hand, the minimum cost isolated-subtree mapping maps $r_1 = r(T_1)$ to $r_2 = r(T_2)$ and $n+1$ children of $r_1$ to $n+1$ children of $r_2$, so the number of the remaining (non-mapped) vertices is $n - 1 + n = 2n - 1$. Hence, it holds that $\tau_{\mathrm{ILST}}(T_1,T_2) = 2n - 1$.
Figure 3: Trees $T_1$ and $T_2$ in Lemma 2.
Lemma 3. There exist trees $T_1$ and $T_2$ such that $|T_1| = |T_2| = O(n)$, $\tau_{\mathrm{TAI}}(T_1,T_2) = \tau^{c}_{\mathrm{HC}}(T_1,T_2) = O(1)$ but $\tau_{\mathrm{ALN}}(T_1,T_2) = \Omega(n)$.
Proof. Consider trees $T_1$ and $T_2$ in Figure 4. It is obvious that $|T_1| = |T_2| = 2n + 2$. Since $T_1$ is transformed to $T_2$ by inserting a vertex labeled by d in $T_2$ after deleting a vertex labeled by d in $T_1$, it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) = 2$. Since $T_1$ and $T_2$ are caterpillars, it also holds that $\tau_{\mathrm{HC}}(T_1,T_2) = \tau^{c}_{\mathrm{HC}}(T_1,T_2) = 2$. Note that $\tau_{\mathrm{SG}}(T_1,T_2) = 4$.
On the other hand, the minimum cost alignable mapping maps a vertex labeled by b (resp., c) in $T_1$ to a vertex labeled by c (resp., b) in $T_2$ injectively. Then, it holds that $\tau_{\mathrm{ALN}}(T_1,T_2) = 2n$. Also it holds that $\tau_{\mathrm{ILST}}(T_1,T_2) = 2n$.
Figure 4: Trees $T_1$ and $T_2$ in Lemma 3.
Lemma 4. There exist trees $T_1$ and $T_2$ such that $|T_1| = |T_2| = O(n)$, $\tau_{\mathrm{TAI}}(T_1,T_2) = \tau^{c}_{\mathrm{HC}}(T_1,T_2) = O(1)$ but $\tau_{\mathrm{SG}}(T_1,T_2) = \Omega(n)$.
Proof. Consider $T_1$ and $T_2$ illustrated in Figure 5 (cf., (Kan et al., 2014)) and let $C_i = hc(T_i)$ $(i = 1,2)$. It is obvious that $|T_1| = 4n$ and $|T_2| = 4n - 2$. Also it holds that $\tau_{\mathrm{TAI}}(T_1,T_2) = 2$.
For $C_1$ and $C_2$ in Figure 5, it holds that $\tau_{\mathrm{HC}}(T_1,T_2) = \tau_{\mathrm{TAI}}(C_1,C_2) + 2n = 2n + 2$. Since the minimum cost mapping for $\tau_{\mathrm{TAI}}(C_1,C_2)$ maps the rightmost vertex $v$ in $C_1$ to the rightmost vertex $w$ in $C_2$, the heavy caterpillar mapping $M_{hc}(T_1,T_2)$ maps the children of $v$ in $T_1$ to the children of $w$ in $T_2$ injectively. Hence, it holds that $\tau^{c}_{\mathrm{HC}}(T_1,T_2) = \tau_{\mathrm{TAI}}(C_1,C_2) = 2$. Note that $\tau_{\mathrm{ILST}}(T_1,T_2) = \tau_{\mathrm{ALN}}(T_1,T_2) = 2$.
On the other hand, since the minimum cost segmental mapping maps to the path with $n - 1$ vertices and its $n$ children and the vertex and its $n$ children in $T_2$, the number of remaining (i.e., non-mapped) vertices is $n + 1$ in $T_1$ and $n - 1$ in $T_2$, so it holds that $\tau_{\mathrm{SG}}(T_1,T_2) = 2n$.
Figure 5: Trees $T_1$, $T_2$, $C_1$ and $C_2$ in Lemma 4.
Lemmas 2, 3 and 4 imply the following theorem.
Theorem 7. The distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are incomparable with the distances $\tau_{\mathrm{ALN}}$, $\tau_{\mathrm{ILST}}$ and $\tau_{\mathrm{SG}}$.
By combining Theorems 6 and 7, we can conclude that the heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are tractable variations of the edit distance $\tau_{\mathrm{TAI}}$ incomparable with the isolated-subtree distance $\tau_{\mathrm{ILST}}$.
4 CONCLUSION
In this paper, we have introduced the heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ and shown that they provide upper bounds of the edit distance $\tau_{\mathrm{TAI}}$, that they are tractable (in particular, computable in quadratic time under the unit cost function), and that they are incomparable with the other variations of $\tau_{\mathrm{TAI}}$ presented by (Yoshino and Hirata, 2017). Since $\tau_{\mathrm{ILST}}$ is the most general tractable variation of $\tau_{\mathrm{TAI}}$ (Yoshino and Hirata, 2017), $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are other tractable variations of $\tau_{\mathrm{TAI}}$ incomparable with $\tau_{\mathrm{ILST}}$.
Concerning Lemma 1, it may be possible to avoid this problem when computing the edit distance (the Tai mapping) between heavy caterpillars by considering the occurrences of labels in the descendants. It is future work to determine whether or not we can design a new method that avoids this problem.
The heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are defined operationally by $M_{hc}(C_1,C_2)$ and $M_{hc}(T_1,T_2)$, whereas other variations of $\tau_{\mathrm{TAI}}$ are based on declarative definitions of the Tai mapping. It is therefore future work to give a declarative definition of $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$.
In general, we cannot determine the heavy path, and hence the heavy caterpillar, uniquely. It is therefore future work to design a method that selects the heavy path and the heavy caterpillar uniquely in a way appropriate to $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$.
Finally, once the heavy caterpillar distances $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ are determined uniquely, it is important future work to give experimental results that compare $\tau_{\mathrm{HC}}$ and $\tau^{c}_{\mathrm{HC}}$ with the isolated-subtree distance $\tau_{\mathrm{ILST}}$ on real data.
ACKNOWLEDGMENTS
This work is partially supported by Grant-in-Aid for Scientific Research 17H00762, 16H02870 and 16H01743 from the Ministry of Education, Culture, Sports, Science and Technology, Japan. The authors would like to thank the anonymous referees of ICPRAM'20 for valuable comments to revise the submitted version of this paper.
REFERENCES
Akutsu, T., Fukagawa, D., Halldórsson, M. M., Takasu, A., and Tanaka, K. (2013). Approximation and parameterized algorithms for common subtrees and edit distance between unordered trees. Theoret. Comput. Sci., 470:10–22.
Demaine, E. D., Mozes, S., Rossman, B., and Weimann, O. (2009). An optimal decomposition algorithm for tree edit distance. ACM Trans. Algo., 6.
Deza, M. M. and Deza, E. (2016). Encyclopedia of distances (4th ed.). Springer.
Gallian, J. A. (2007). A dynamic survey of graph labeling. Electron. J. Combin., 14:DS6.
Hirata, K., Yamamoto, Y., and Kuboyama, T. (2011). Improved MAX SNP-hard results for finding an edit distance between unordered trees. In Proc. CPM'11 (LNCS 6661), pages 402–415.
Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees: an alternative to tree edit. Theoret. Comput. Sci., 143:137–148.
Kan, T., Higuchi, S., and Hirata, K. (2014). Segmental mapping and distance for rooted ordered labeled trees. Fundam. Inform., 132:1–23.
Kuboyama, T. (2007). Matching and learning in trees. Ph.D. thesis, University of Tokyo.
Lu, C. L., Su, Z.-Y., and Yang, C. Y. (2001). A new measure of edit distance between labeled trees. In Proc. COCOON'01 (LNCS 2108), pages 338–348.
Muraka, K., Yoshino, T., and Hirata, K. (2018). Computing edit distance between rooted labeled caterpillars. In Proc. FedCSIS'18, pages 245–252.
Muraka, K., Yoshino, T., and Hirata, K. (2019). Vertical and horizontal distances to approximate edit distance for rooted labeled caterpillars. In Proc. ICPRAM'19, pages 590–597.
Sleator, D. D. and Tarjan, R. E. (1983). A data structure for dynamic trees. J. Comput. Sys. Sci., 26:362–391.
Tai, K.-C. (1979). The tree-to-tree correction problem. J. ACM, 26:422–433.
Wang, J. T. L. and Zhang, K. (2001). Finding similar consensus between trees: An algorithm and a distance hierarchy. Pattern Recog., 34:127–137.
Yamamoto, Y., Hirata, K., and Kuboyama, T. (2014). Tractable and intractable variations of unordered tree edit distance. Internat. J. Found. Comput. Sci., 25:307–329.
Yoshino, T. and Hirata, K. (2017). Tai mapping hierarchy for rooted labeled trees through common subforest. Theory of Comput. Sys., 60:769–787.
Zhang, K. (1996). A constrained edit distance between unordered labeled trees. Algorithmica, 15:205–222.
Zhang, K. and Jiang, T. (1994). Some MAX SNP-hard results concerning unordered labeled trees. Inform. Process. Lett., 49:249–254.