Accurate and Fast Computation of Approximate Graph Edit Distance
based on Graph Relabeling
Sousuke Takami and Akihiro Inokuchi
School of Science and Technology, Kwansei Gakuin University, 2-1 Gakuen, Sanda, Hyogo, Japan
Keywords:
Graph Edit Distance, Graph Relabeling, Graph Classification.
Abstract:
The graph edit distance, a well-known metric for determining the similarity between two graphs, is commonly
used for analyzing large sets of structured data, such as those used in chemoinformatics, document analysis,
and malware detection. As computing the exact graph edit distance is computationally expensive, and may
be intractable for large-scale datasets, various approximation techniques have been developed. In this paper,
we present a method based on graph relabeling that is both faster and more accurate than the conventional
approach. We use unfolded subtrees to denote the potential relabeling of local structures around a given vertex.
These subtree representations are concatenated as a vector, and the distance between different vectors is used
to characterize the distance between the corresponding graphs. This avoids the need for multiple calculations
of the exact graph edit distance between local structures. Simulation experiments on two real-world chemical
datasets are reported. Compared with the conventional technique, the proposed method gives a more accurate
approximation of the graph edit distance and is significantly faster on both datasets. This suggests the proposed
method could be applicable in the analysis of larger and more complex graph-like datasets.
1 INTRODUCTION
Graphs are one of the most natural means of repre-
senting structured data. For instance, a chemical com-
pound can be represented as a graph in which each
vertex corresponds to an atom, each edge corresponds
to a bond between two atoms, and the label of each
vertex corresponds to the atom type. With recent im-
provements in system throughput, the need to analyze
large numbers of graphs has arisen, and the topic of
graph mining has received considerable interest be-
cause the knowledge present in structured data can
be applied to various real-world datasets. For example, in chemoinformatics, certain properties of chemical compounds (e.g., mutagenicity or toxicity) can
be identified by analyzing their structural informa-
tion, and in bioinformatics, the prediction of protein–
protein interactions is beneficial for drug discovery.
When analyzing datasets of graphs, one of the
most critical measures is the dissimilarity (or simi-
larity) among the graphs. A representative measure
of the dissimilarity is the graph edit distance. The graph edit distance $d(g_1, g_2)$ between graphs $g_1$ and $g_2$ is defined as the minimum length of a sequence of edit operations needed to transform $g_1$ into $g_2$, where an edit operation is the insertion or deletion of a vertex or an edge, or the substitution of a vertex label. The method based on the A* algorithm is a well-known technique for computing the exact graph edit distance (Hart et al., 1968). However, this method cannot be applied to large graphs, because the problem of obtaining the exact graph edit distance between two graphs is known to be NP-complete.
To overcome this difficulty, a method that uses the minimum matching problem of a complete bipartite graph $(V_1, V_2, E, w)$ has been proposed to compute the approximate graph edit distance between the graphs (Riesen, 2015). In this bipartite graph, $V_1$ and $V_2$ correspond to the vertices of $g_1$ and $g_2$, respectively, and $w$ is a function that assigns to each edge of the bipartite graph the dissimilarity between the local structures around its two end vertices. Given the complete bipartite graph, the matching problem, which is solvable in $O(b^3)$ time with $b = \max\{|V_1|, |V_2|\}$, returns a mapping from the vertices in $g_1$ to the vertices in $g_2$. Computing the approximate graph edit distance using this method is computationally simpler than computing the exact graph edit distance. In this method, the choice of local structures is important for obtaining as accurate an approximate distance as possible. The local structure around each vertex $v$ is represented by star structures (Zeng et al., 2009), random walks
Takami, S. and Inokuchi, A.
Accurate and Fast Computation of Approximate Graph Edit Distance based on Graph Relabeling.
DOI: 10.5220/0006540000170026
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 17-26
ISBN: 978-989-758-276-9
Copyright © 2018 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
from $v$ (Gaüzère et al., 2014), or subgraphs induced by vertices reachable within $h$ steps from $v$, called limited-size subgraphs (Carletti et al., 2015). If complex and/or large structures such as limited-size subgraphs are used as local structures, the computation time required to assign values to edges in the bipartite graph can be significant. In contrast, if simple structures such as stars or walks are used, the accuracy of the approximated graph edit distance decreases. A recently proposed method (Carletti et al., 2015) measures the dissimilarity between local structures around $v_1$ in $g_1$ and $v_2$ in $g_2$ with small limited-size subgraphs. The dissimilarity in this approach is the exact graph edit distance. Thus, although this method gives a relatively accurate approximate graph edit distance between $g_1$ and $g_2$, it requires considerable computation time to measure multiple exact graph edit distances for the small subgraphs.
In this paper, we tackle the problem of measuring
the approximate graph edit distance between graphs
and propose an accurate and fast method. In the pro-
posed method, each of the complex local structures is
represented by a vector, which enables this method to
be applied to complex and large local structures and
ensures fast computation. The proposed method requires $O(b^2 h(|\Sigma| + b))$ time to compute the approximate graph edit distance, where $\Sigma$ is the set of vertex labels in the graphs. In addition, the method requires $O(|\Sigma| b + b^2)$ memory.
The remainder of this paper is organized as follows. Section 2 formalizes the problem of measuring the graph edit distance. Section 3 explains the existing method based on the minimum matching problem of a complete bipartite graph. In Section 4, we propose an accurate and fast method for measuring the graph edit distance by representing local structures as vectors. In Section 5, we verify the computational efficiency of the proposed method and compare it with the conventional method in terms of accuracy using real-world datasets. Finally, we conclude the paper in the final section.
2 PRELIMINARIES
This paper tackles the problem of computing the ap-
proximate graph edit distance between graphs. First,
we define the terminology used to solve the problem.
An undirected graph is represented as $g = (V, E, \Sigma, \ell)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $\Sigma = \{\sigma_1, \sigma_2, \cdots, \sigma_{|\Sigma|}\}$ is a set of vertex labels, and $\ell : V \to \Sigma$ is a function that assigns a label to each vertex in the graph. Additionally, the set of vertices in graph $g$ is represented as $V(g)$. Although we assume that only the vertices in the graphs have labels, the methods in this paper can be applied to graphs where both the vertices and edges have labels (Hido and Kashima, 2009). The vertices adjacent to vertex $v$ are represented as $N(v) = \{u \mid (v, u) \in E\}$. The average number of adjacent vertices is represented as $d$. A sequence of vertices from $v$ to $u$ is called a path, and its step refers to the number of edges on that path. A path is said to be simple if and only if it does not have repeating vertices. The paths discussed in this paper are not always simple.

Figure 1: Sequence of edit operations for transforming $g_1$ into $g_2$.

The graph edit distance is one of the most repre-
sentative metrics to measure the dissimilarity between
graphs. The graph edit distance $d(g_1, g_2)$ between graphs $g_1$ and $g_2$ is defined as the minimum length of the sequence of edit operations needed to transform $g_1$ into $g_2$, where one edit operation includes the insertion or deletion of a vertex/edge and the substitution of a vertex label. Although the edit distance
was originally proposed for measuring the dissimilar-
ity between two strings, the metric was extended to
graphs by introducing graph edit operations.
Figure 1 shows a certain sequence of edit operations that consists of one deletion of a vertex ($e_2$), one insertion of an edge ($e_4$), one deletion of an edge ($e_1$), and two substitutions of labels ($e_3$, $e_5$). Computing the edit distance between $g_1$ and $g_2$ is equivalent to searching for the minimum length of the sequence of edit operations needed to transform $g_1$ into $g_2$. The method based on the A* algorithm is a well-known technique for computing the exact graph edit distance (Hart et al., 1968). However, this method cannot be applied to large graphs, because the problem of obtaining the exact graph edit distance between two graphs is known to be NP-complete. To address this drawback, various methods for computing the approximate graph edit distance have been proposed.
The graph edit distance is an important measure
for analyzing graphs, and is applicable to a wide
range of practical applications such as fingerprint au-
thentication (Choi and Kim, 2010), malware detec-
tion (Kinable and Kostakis, 2011), chemoinformat-
ics (Kashima et al., 2003), bioinformatics, and doc-
ument analysis (Wang et al., 2014). We explain three
typical fundamental problems that use the analysis
of graphs. First, given a set of graphs, we can ap-
ply methods for clustering the graphs to detect some
latent groups within them (Robles-Kelly and Han-
cock, 2003). Second, given a set of graphs and a
query graph g
q
, we can apply a similarity search to
the graphs (Yan et al., 2005). Third, given a set of
graphs with class labels, we can apply a Support Vec-
tor Machine (SVM) and kernel functions to forecast
the class of a graph whose label is unknown (Kashima
et al., 2003). The graph edit distance is converted to a similarity using the Gaussian kernel, which is defined as
$$k(g_1, g_2) = \exp\left(-\frac{d(g_1, g_2)^2}{2\sigma^2}\right). \tag{1}$$
Most machine learning and data mining algorithms
that are designed to analyze d-dimensional vectors
can be applied to graphs using this kernel.
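
As a minimal sketch of Eq. (1), the following Python snippet converts precomputed distances into kernel values. It assumes the standard Gaussian form with a negative exponent; the function names `gaussian_kernel` and `kernel_matrix` and the default bandwidth `sigma=1.0` are illustrative choices, not taken from the paper.

```python
import math


def gaussian_kernel(distance, sigma=1.0):
    """Gaussian kernel value for a precomputed graph edit distance (Eq. (1))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def kernel_matrix(distances, sigma=1.0):
    """Turn a square matrix of pairwise distances into a kernel matrix."""
    return [[gaussian_kernel(d, sigma) for d in row] for row in distances]
```

A distance of zero maps to a similarity of 1, and larger distances decay toward 0 at a rate controlled by `sigma`.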
3 APPROXIMATE GRAPH EDIT
DISTANCE
This section surveys a framework for computing the
approximate graph edit distance between two graphs
using the linear sum assignment problem (LSAP).
The problem of computing the exact graph edit distance between graphs $g_1 = (V_1, E_1, \Sigma, \ell_1)$ and $g_2 = (V_2, E_2, \Sigma, \ell_2)$ is formalized as follows (Riesen and Bunke, 2009; Riesen, 2015). First, the set of vertices $V_1$ is extended to
$$V_1^+ = V_1 \cup \overbrace{\{\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_{b-a}\}}^{b-a \text{ empty vertices}},$$
where $|V_1| = a$, $|V_2| = b$, and we assume that $|V_1| \le |V_2|$. The graph edit distance computation is eventually performed on graphs $g_1 = (V_1^+, E_1, \Sigma, \ell_1)$ and $g_2 = (V_2, E_2, \Sigma, \ell_2)$. Additionally, a cost matrix for editing $g_1$ to $g_2$ is defined as
$$C = \begin{array}{c|cccc}
 & 1 & 2 & \cdots & b \\ \hline
1 & c_{11} & c_{12} & \cdots & c_{1b} \\
2 & c_{21} & c_{22} & \cdots & c_{2b} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a & c_{a1} & c_{a2} & \cdots & c_{ab} \\
1 & c_{\varepsilon 1} & c_{\varepsilon 2} & \cdots & c_{\varepsilon b} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b-a & c_{\varepsilon 1} & c_{\varepsilon 2} & \cdots & c_{\varepsilon b}
\end{array}, \tag{2}$$
where $c_{ij}$ denotes the cost of replacing the label of vertex $v_i \in V_1$ with the label of vertex $v_j \in V_2$, and $c_{\varepsilon i}$ denotes the cost of inserting a vertex $v_i$ into $g_1$ to edit $g_1$ to $g_2$. Additionally, we denote the cost of editing an edge $(v_i, v_j)$ in $g_1$ into an edge $(v_{i'}, v_{j'})$ in $g_2$ by $c((v_i, v_j) \to (v_{i'}, v_{j'}))$. In this paper, we assume that only vertices in graphs have labels, and so $c((v_i, v_j) \to (v_{i'}, v_{j'}))$ is the cost of inserting or deleting an edge. When $g_1$ is edited to $g_2$, the mapping between the two sets of vertices is denoted by a bijective function $\varphi : V_1^+ \to V_2$. Then, the total cost of editing $g_1$ to $g_2$ via $\varphi$ is
$$\mathrm{dist}(g_1, g_2, \varphi) = \sum_{i=1}^{b} c_{i\varphi(i)} + \sum_{i=1}^{b-1} \sum_{j=i+1}^{b} c\big((v_i, v_j) \to (v_{\varphi(i)}, v_{\varphi(j)})\big). \tag{3}$$
Therefore, the exact edit distance between $g_1$ and $g_2$ is
$$d(g_1, g_2) = \min_{\varphi \in \Phi} \mathrm{dist}(g_1, g_2, \varphi), \tag{4}$$
where $\Phi$ is the set of all possible permutations of the integers $1, 2, \cdots, b$. Equation (4) is a type of quadratic assignment problem that is known to be NP-complete (Koopmans and Beckmann, 1957), although it reduces to the LSAP if Eq. (3) does not contain its second term.
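
To make Eqs. (3) and (4) concrete, the sketch below evaluates the total edit cost for a given mapping and then minimizes it by enumerating all $b!$ permutations. It is only an illustration for very small graphs under assumed unit costs (1 for a label mismatch, a vertex insertion, or an edge insertion/deletion); the representation (label lists, edges as `frozenset` pairs, `None` for an empty vertex $\varepsilon$) and the names `dist` and `exact_ged` are hypothetical.

```python
from itertools import permutations


def dist(labels1, edges1, labels2, edges2, phi):
    """Eq. (3) with unit costs; phi[i] is the vertex of g2 assigned to vertex i of g1.

    Assumes len(labels1) <= len(labels2); g1 is padded with empty vertices (None).
    """
    b = len(labels2)
    padded = list(labels1) + [None] * (b - len(labels1))
    # First term: label substitutions and vertex insertions.
    vertex_cost = sum(1 for i in range(b) if padded[i] != labels2[phi[i]])
    # Second term: an edge present on exactly one side must be inserted or deleted.
    edge_cost = 0
    for i in range(b - 1):
        for j in range(i + 1, b):
            in_g1 = frozenset((i, j)) in edges1
            in_g2 = frozenset((phi[i], phi[j])) in edges2
            edge_cost += in_g1 != in_g2
    return vertex_cost + edge_cost


def exact_ged(labels1, edges1, labels2, edges2):
    """Eq. (4): minimum of Eq. (3) over all b! mappings (tiny graphs only)."""
    b = len(labels2)
    return min(dist(labels1, edges1, labels2, edges2, phi)
               for phi in permutations(range(b)))


# g1: a - b            g2: a - b - c
labels1, edges1 = ["a", "b"], {frozenset((0, 1))}
labels2, edges2 = ["a", "b", "c"], {frozenset((0, 1)), frozenset((1, 2))}
print(exact_ged(labels1, edges1, labels2, edges2))  # one vertex and one edge inserted
```

The factorial number of mappings is what makes this approach infeasible beyond a handful of vertices, which motivates dropping the second term as described next.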
Riesen et al. proposed an efficient algorithmic framework that enables us to obtain the approximate graph edit distance by omitting the second term of Eq. (3). The framework first solves
$$\hat{\varphi} = \arg\min_{\varphi \in \Phi} \sum_{i=1}^{b} c_{i\varphi(i)} \tag{5}$$
and then obtains the approximate edit distance between $g_1$ and $g_2$ by substituting $\hat{\varphi}$ into Eq. (3). Equation (5) implies that each vertex $v_i$ in $g_1$ should, as far as possible, be mapped to a vertex in $g_2$ that has the same label as $v_i$. Solving Eq. (5) is equivalent to the minimum matching problem of a bipartite graph whose vertices are $V_1^+$ and $V_2$, and whose edge weights are $c_{ij}$ in Eq. (2). Therefore, this problem is tractable in $O(b^3)$. However, because the structural information of the two graphs is ignored and only vertex labels are taken into account in the optimization problem shown in Eq. (5), we do not always obtain an adequate mapping from $V_1^+$ to $V_2$. To overcome this difficulty, the cost matrix in Eq. (2) is redefined to take account of the structural information as the matrix $C$ with elements
$$c_{ij} = c(\mathrm{local}(v_i) \to \mathrm{local}(v_j)),$$
Figure 2: Limited-size subgraphs induced by vertices within $h$ steps from vertex $v_1$.
Algorithm 1: Approximate Edit Distance.
Data: graphs $g_1$ and $g_2$, and $h$
Result: approximate graph edit distance $\hat{d}$
1 while $|V(g_1)| < |V(g_2)|$ do
2     $V(g_1) \leftarrow V(g_1) \cup \{\varepsilon\}$;
3 for $(v_i, v_j) \in V(g_1) \times V(g_2)$ do
4     $c_{ij} \leftarrow d(g_h^i, g_h^j)$;
5 $\hat{\varphi} \leftarrow \mathrm{LSAP}(C)$;
6 $\hat{d} \leftarrow \mathrm{dist}(g_1, g_2, \hat{\varphi})$;
7 return $\hat{d}$;

where $\mathrm{local}(v_i)$ is the local structure around a vertex $v_i$, and $c(\mathrm{local}(v_i) \to \mathrm{local}(v_j))$ is the cost of editing $\mathrm{local}(v_i)$ to $\mathrm{local}(v_j)$. Recently, various methods within this framework have been developed by representing local structures as stars (Zeng et al., 2009), walks (Gaüzère et al., 2014), or limited-size subgraphs (Carletti et al., 2015). This enables a more accurate mapping between sets of vertices than that of Eq. (5).
Carletti et al. (2015) introduced $\mathrm{local}(v_i)$, a limited-size subgraph induced by vertices reachable within $h$ steps from vertex $v_i$, as shown in Fig. 2 for $v_1$. The subgraph $\mathrm{local}(v_i)$ is denoted by $g_h^i$, and $c_{ij}$ is the exact edit distance between $g_h^i$ and $g_h^j$, that is, $c_{ij} = d(g_h^i, g_h^j)$. Computing $d(g_h^i, g_h^j)$ is tractable for sufficiently small $h$, although it becomes intractable for large subgraphs, because the problem of computing the graph edit distance is NP-complete. Additionally, because the structural information contained in limited-size subgraphs is greater than that of walks and stars, the graph edit distance based on limited-size subgraphs provides an accurate approximate graph edit distance.
Algorithm 1 shows the pseudo-code for computing an approximate graph edit distance between graphs $g_1$ and $g_2$. In Lines 1–2, the numbers of vertices in $g_1$ and $g_2$ are equalized. For every pair of vertices in $V(g_1) \times V(g_2)$, the exact edit distance between limited-size subgraphs $g_h^i$ and $g_h^j$ is measured and set as the $(i, j)$-th element in the cost matrix $C$. LSAP in Line 5 returns the optimum mapping according to the optimal bipartite graph matching. Finally, in Line 7, Algorithm 1 returns the approximate graph edit distance between graphs $g_1$ and $g_2$.

This algorithm has a drawback in terms of computational efficiency, because it computes the exact edit distance multiple times. To overcome this, we propose a novel method for computing the approximate graph edit distance more efficiently by comparing structural information in $g_h^i$ and $g_h^j$ with a computation time that is proportional to $|V(g_h^i)|$ and $|V(g_h^j)|$.
4 APPROXIMATE GRAPH EDIT
DISTANCE BASED ON
RELABELING GRAPHS
Given a graph $g^{(h)} = (V, E, \Sigma, \ell^{(h)})$, all labels of vertices in $g^{(h)}$ are updated to obtain another graph $g^{(h+1)} = (V, E, \Sigma', \ell^{(h+1)})$. We call this operation “relabeling,” and define it as $\ell^{(h+1)}(v) = r(v, N(v), \ell^{(h)})$. Representative methods based on relabeling include the Weisfeiler–Lehman Subtree Kernel (WLSK) (Shervashidze et al., 2011), Neighborhood Hash Kernel (NHK) (Hido and Kashima, 2009), and Hadamard Code Kernel (HCK) (Kataoka and Inokuchi, 2016). The vertex labels of WLSK are represented as strings and the relabeling of vertex $v$ is defined as a string concatenation of the labels of $N(v)$. In NHK, the vertex labels are represented as fixed-length bit strings and relabeling $v$ is defined in terms of logical operations such as XOR on the labels of $N(v)$. The labels of HCK are based on the Hadamard code, which is used in spread spectrum-based communication technologies, and relabeling $v$ is defined as a summation on the labels of $N(v)$.
Figure 3 shows an example of the framework based on graph relabeling. Let $g^{(0)}$ be the original graph whose vertices have labels a, b, and c. Each of the labels is relabeled to obtain $g^{(1)}$. Although the actual calculation depends on the method of relabeling (e.g., NHK, WLSK, or HCK), the relabeling of $v$ is commonly applied using $v$, $N(v)$, and $\ell^{(0)}(v)$. In the center of Figure 3, $\ell^{(0)}(v_1) = b$ is relabeled into d using adjacent vertices $v_2$ and $v_4$. Therefore, $\ell^{(1)}(v_1) = d$ represents the characteristics of $st(v_1, 1)$, where $st(v, 1)$ is a tree of height 1 whose root and leaves are $v$ and $N(v)$, respectively. The labels of $v_1$ and $v_3$ in $g^{(1)}$ are identical because $st(v_1, 1) = st(v_3, 1)$ in $g^{(0)}$.
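
The relabeling step described above can be sketched in a few lines of Python. This is a WLSK-flavoured illustration only, assuming the new label is derived from the pair (own label, sorted neighbour labels); the graph, the function name `relabel`, and the compressed label names `L0`, `L1`, ... are hypothetical and do not reproduce Figure 3 or the exact procedures of NHK or HCK.

```python
def relabel(adjacency, labels):
    """One relabeling step: the new label of v depends on v's label and N(v)'s labels."""
    signatures = {
        v: (labels[v], tuple(sorted(labels[u] for u in neighbours)))
        for v, neighbours in adjacency.items()
    }
    # Compress each distinct signature into a short new label.
    ids = {}
    new_labels = {}
    for v in sorted(signatures):
        sig = signatures[v]
        if sig not in ids:
            ids[sig] = f"L{len(ids)}"
        new_labels[v] = ids[sig]
    return new_labels


# A 4-cycle v1 - v2 - v3 - v4 - v1 in which v1 and v3 share label b.
adjacency = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
labels0 = {1: "b", 2: "a", 3: "b", 4: "c"}
labels1 = relabel(adjacency, labels0)
# v1 and v3 have the same label and the same neighbour labels,
# so their height-1 trees coincide and they receive the same new label.
print(labels1[1] == labels1[3])
```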
Given a graph g and its vertex v, an unfolded sub-
tree st(v, h) of height h is defined recursively from the
root of the tree:
Figure 3: Example of relabeling ($g^{(0)} \to g^{(1)}$).
Figure 4: Derivation of an unfolded subtree of height 2.
• the root of $st(v, h)$ is the vertex $v$ in $g$,
• each node in $st(v, h)$, except for leaves, has children that are $N(u)$ of the vertex $u$ corresponding to the node, and
• the height of $st(v, h)$ is $h$.
Figure 4 shows an example of an unfolded subtree $st(v_1, 2)$ derived from the graph shown in Figure 2. A vertex in the graph may appear multiple times in the unfolded subtree (such as $v_1$, $v_2$, and $v_3$), because multiple paths reach these vertices from the root; this causes the size of the tree to grow exponentially as $h$ increases.
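
The recursive definition above translates almost directly into code. The following sketch builds $st(v, h)$ as nested tuples for a hypothetical triangle graph (not the graph of Figure 2) and counts its nodes; the helper names `unfolded_subtree`, `size`, and `vertices` are illustrative.

```python
def unfolded_subtree(adjacency, v, h):
    """Return st(v, h) as a nested tuple (vertex, [child subtrees])."""
    if h == 0:
        return (v, [])
    return (v, [unfolded_subtree(adjacency, u, h - 1) for u in adjacency[v]])


def size(tree):
    """Number of nodes in the unfolded subtree."""
    return 1 + sum(size(child) for child in tree[1])


def vertices(tree):
    """List every vertex occurrence in the tree, root first."""
    root, children = tree
    result = [root]
    for child in children:
        result.extend(vertices(child))
    return result


# Triangle graph: every vertex has two neighbours.
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
sizes = [size(unfolded_subtree(triangle, 1, h)) for h in range(4)]
print(sizes)  # node counts for h = 0, 1, 2, 3
```

With every vertex having two neighbours, the node count roughly doubles with each additional level, and vertex 1 already occurs three times in $st(1, 2)$.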
The Label Aggregate Kernel (LAK) (Kataoka and Inokuchi, 2016) is another method based on this framework. Next, we present a concrete definition of the relabeling operation in LAK. In LAK, $\boldsymbol{\ell}_L^{(0)}(v)$ is a vector in $|\Sigma|$-dimensional space. If a vertex in a graph has a label $\sigma_i$ from the set $\Sigma = \{\sigma_1, \sigma_2, \cdots, \sigma_{|\Sigma|}\}$, the $i$-th element in the vector is 1 and the other elements are 0. In LAK, $\boldsymbol{\ell}_L^{(h)}(v)$ is defined as
$$\boldsymbol{\ell}_L^{(h)}(v) = \boldsymbol{\ell}_L^{(h-1)}(v) + \sum_{u \in N(v)} \boldsymbol{\ell}_L^{(h-1)}(u).$$
The $i$-th element in $\boldsymbol{\ell}_L^{(h)}(v)$ equals the frequency of occurrence of $\sigma_i$ in $st(v, h)$. Label $\boldsymbol{\ell}_L^{(h)}(v)$, obtained by iteratively relabeling $h$ times, represents the distribution of labels that are reachable within $h$ steps from $v$. Therefore, $\boldsymbol{\ell}_L^{(h)}(v)$ represents the characteristics of $st(v, h)$. In addition, although the size of $st(v, h)$ increases exponentially as $h$ increases, the size of $\boldsymbol{\ell}_L^{(h)}(v)$ remains $|\Sigma|$. Therefore, retaining $\boldsymbol{\ell}_L^{(h)}(v)$ in memory is tractable, although retaining $st(v, h)$ is intractable.

Figure 5: Example of relabeling in LAK.

We show an example of relabeling in LAK in Figure 5, assuming that $|\Sigma| = 3$ and relabeling is applied only once. Consider the graph $g^{(0)}$, whose vertices have labels (1, 0, 0), (0, 1, 0), and (0, 0, 1). We next relabel the graph to obtain $g^{(1)}$. The label of vertex $v$ in $g^{(1)}$ represents the distribution of labels contained in $st(v, 1)$. For instance, the label of $v_5$ in $g^{(1)}$ is $\boldsymbol{\ell}_L^{(1)}(v_5) = (2, 1, 1)$, which indicates that there are two vertices labeled (1, 0, 0), one vertex labeled (0, 1, 0), and one vertex labeled (0, 0, 1). This distribution is equivalent to that of the labels contained in $st(v_5, 1)$.
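
A single LAK relabeling step can be sketched as follows. The graph here is a hypothetical star chosen so that its centre, named 5 for convenience, also ends up with the label (2, 1, 1); it is not the graph of Figure 5, and the names `one_hot` and `lak_relabel` are illustrative.

```python
def one_hot(label, alphabet):
    """Initial LAK label: 1 at the position of the vertex label, 0 elsewhere."""
    return [1 if label == s else 0 for s in alphabet]


def lak_relabel(adjacency, vectors):
    """One step: new vector of v = old vector of v + sum of old vectors over N(v)."""
    new_vectors = {}
    for v, neighbours in adjacency.items():
        acc = list(vectors[v])
        for u in neighbours:
            acc = [x + y for x, y in zip(acc, vectors[u])]
        new_vectors[v] = acc
    return new_vectors


alphabet = ["a", "b", "c"]
star = {5: [1, 2, 3], 1: [5], 2: [5], 3: [5]}
labels = {5: "a", 1: "a", 2: "b", 3: "c"}

vectors0 = {v: one_hot(labels[v], alphabet) for v in star}
vectors1 = lak_relabel(star, vectors0)
print(vectors1[5])  # label counts over the centre and its three neighbours
```

Each vector keeps a fixed length of $|\Sigma|$ regardless of how many relabeling steps are applied, which is the memory advantage noted above.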
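By relabeling the graphs $h$ times, each vertex $v$ has $h + 1$ $|\Sigma|$-dimensional vectors $\boldsymbol{\ell}_L^{(0)}(v), \boldsymbol{\ell}_L^{(1)}(v), \cdots, \boldsymbol{\ell}_L^{(h)}(v)$. We concatenate these vectors to obtain an $(h + 1) \times |\Sigma|$-dimensional vector as
$$\boldsymbol{\ell}_L(v) = \big(\overbrace{\ell_1^{(0)}, \cdots, \ell_{|\Sigma|}^{(0)}}^{\boldsymbol{\ell}_L^{(0)}(v)}, \overbrace{\ell_1^{(1)}, \cdots, \ell_{|\Sigma|}^{(1)}}^{\boldsymbol{\ell}_L^{(1)}(v)}, \cdots, \overbrace{\ell_1^{(h)}, \cdots, \ell_{|\Sigma|}^{(h)}}^{\boldsymbol{\ell}_L^{(h)}(v)}\big). \tag{6}$$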
Given two graphs $g_1$ and $g_2$, if $g_h^i$ is isomorphic to $g_h^j$, where $v_i \in V(g_1)$ and $v_j \in V(g_2)$, then $\boldsymbol{\ell}_L(v_i)$ is the same as $\boldsymbol{\ell}_L(v_j)$.
Based on this observation, we propose a novel method for computing the difference between local structures around $v_i$ and $v_j$. Given two concatenated vectors $\boldsymbol{\ell}_L(v_i)$ and $\boldsymbol{\ell}_L(v_j)$, the distance between $\boldsymbol{\ell}_L(v_i)$ and $\boldsymbol{\ell}_L(v_j)$ is defined as
$$c_{ij} = \left\| \boldsymbol{\ell}_L(v_i) - \boldsymbol{\ell}_L(v_j) \right\|^2 \tag{7}$$
$$\phantom{c_{ij}} = \sum_{t=0}^{h} \left\| \boldsymbol{\ell}_L^{(t)}(v_i) - \boldsymbol{\ell}_L^{(t)}(v_j) \right\|^2. \tag{8}$$
Because Eq. (7) is the square of the Euclidean distance between $\boldsymbol{\ell}_L(v_i)$ and $\boldsymbol{\ell}_L(v_j)$, it inherits non-negativity, the identity of indiscernibles, and symmetry from the Euclidean metric, whereas the triangle inequality is guaranteed for the Euclidean distance itself but not for its square. However, we need to discuss the identity of indiscernibles in more detail. As shown in Fig. 6, the identity of indiscernibles for the Euclidean distance implies that $\boldsymbol{\ell}_L(v_i) = \boldsymbol{\ell}_L(v_j)$ if and only if $\|\boldsymbol{\ell}_L(v_i) - \boldsymbol{\ell}_L(v_j)\|^2 = 0$. The above framework for relabeling graphs is based on the 1-dimensional Weisfeiler–Lehman algorithm, which checks the isomorphism between two graphs. This algorithm is known to be a valid isomorphism test for almost all graphs (see (Cai et al., 1992) for examples of graphs that cannot be distinguished by the algorithm) (Shervashidze et al., 2011). Therefore, if $g_h^i = g_h^j$, $\|\boldsymbol{\ell}_L(v_i) - \boldsymbol{\ell}_L(v_j)\|^2 = 0$ always holds. In contrast, its converse holds for almost all graphs.

Figure 6: Identity of indiscernibles.

Algorithm 2: Proposed Method.
Data: graphs $g_1$ and $g_2$, and $H$
Result: approximate graph edit distance $\hat{d}$
1 while $|V(g_1)| < |V(g_2)|$ do
2     $V(g_1) \leftarrow V(g_1) \cup \{\varepsilon\}$;
3 $C \leftarrow 0$;
4 for $h \in [0, H]$ do
5     for $(v_i, v_j) \in V(g_1) \times V(g_2)$ do
6         $c_{ij} \leftarrow c_{ij} + \gamma^{2h} \|\boldsymbol{\ell}_L^{(h)}(v_i) - \boldsymbol{\ell}_L^{(h)}(v_j)\|^2$;
7     $g_1 \leftarrow (V(g_1), E(g_1), \mathbb{Z}^{|\Sigma|}, \ell^{(h+1)})$;
8     $g_2 \leftarrow (V(g_2), E(g_2), \mathbb{Z}^{|\Sigma|}, \ell^{(h+1)})$;
9     $\hat{\varphi} \leftarrow \mathrm{LSAP}(C)$;
10    $\hat{d}_h \leftarrow \mathrm{dist}(g_1, g_2, \hat{\varphi})$;
11 return $\min_{h \in [0, H]} \hat{d}_h$;

To control the effect of steps from a central vertex of the local structure, we add to Eq. (6) a coefficient that changes exponentially as $h$ increases:
$$\boldsymbol{\ell}_L(v, \gamma) = \big(\overbrace{\ell_1^{(0)}, \cdots, \ell_{|\Sigma|}^{(0)}}^{\boldsymbol{\ell}_L^{(0)}(v)}, \cdots, \overbrace{\gamma^h \ell_1^{(h)}, \cdots, \gamma^h \ell_{|\Sigma|}^{(h)}}^{\gamma^h \boldsymbol{\ell}_L^{(h)}(v)}\big). \tag{9}$$
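
As a short check of how Eq. (9) relates to the cost update in Line 6 of Algorithm 2, the squared Euclidean distance between two weighted vectors can be expanded block by block, assuming the same $\gamma$ is applied to both vertices:

```latex
\begin{align*}
\left\| \boldsymbol{\ell}_L(v_i,\gamma) - \boldsymbol{\ell}_L(v_j,\gamma) \right\|^2
  &= \sum_{t=0}^{h} \left\| \gamma^{t}\boldsymbol{\ell}_L^{(t)}(v_i)
       - \gamma^{t}\boldsymbol{\ell}_L^{(t)}(v_j) \right\|^2 \\
  &= \sum_{t=0}^{h} \gamma^{2t} \left\| \boldsymbol{\ell}_L^{(t)}(v_i)
       - \boldsymbol{\ell}_L^{(t)}(v_j) \right\|^2 .
\end{align*}
```

With $\gamma = 1$ this reduces to Eq. (8); with $\gamma < 1$ the contribution of labels far from the central vertex is damped, and with $\gamma > 1$ it is amplified.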
Algorithm 2 shows the pseudo-code of the proposed method for computing the approximate graph edit distance between graphs $g_1$ and $g_2$. In Lines 1–2, the numbers of vertices in $g_1$ and $g_2$ are equalized. This is repeated at most $b$ times. For each pair of vertices in $V(g_1) \times V(g_2)$, the squared Euclidean distance between the two vectors $\boldsymbol{\ell}_L^{(h)}(v_i)$ and $\boldsymbol{\ell}_L^{(h)}(v_j)$, weighted by $\gamma^{2h}$, is added to the $(i, j)$-th element of a square matrix $C$ in Line 6. In Lines 7–8, using the set of nonnegative integers $\mathbb{Z}$, $g_1^{(h)}$ and $g_2^{(h)}$ are relabeled to obtain $g_1^{(h+1)}$ and $g_2^{(h+1)}$, respectively. In Line 9, LSAP returns the mapping $\hat{\varphi}$ from $V(g_1)$ to $V(g_2)$ according to the optimal bipartite graph matching. The processes in Lines 5–10 are repeated $H + 1$ times. Finally, the algorithm returns the minimum approximate graph edit distance among the $\hat{d}_h$. This algorithm runs in $O(H(|\Sigma| b^2 + |\Sigma| b d + b^3))$ time, because the computational complexities of Lines 6, 7, 9, and 10 are $O(|\Sigma|)$ per vertex pair, $O(|\Sigma| b d)$, $O(b^3)$, and $O(b^2)$, respectively. Because $d$ is bounded by $b$, the computational complexity of Algorithm 2 becomes $O(b^2 H(|\Sigma| + b))$. If we compute Eq. (7) in a straightforward manner, we require $O(|\Sigma| H)$ memory for each vertex of graphs $g_1$ and $g_2$. However, when we compute Lines 6, 7, and 8 in Algorithm 2, we do not require $\boldsymbol{\ell}_L^{(\tau)}(v)$ for $\tau < h$. Therefore, Algorithm 2 requires $O(|\Sigma| b + bd)$ memory.
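
The steps of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Java implementation: unit edit costs are assumed in Eq. (3), graphs are given as label lists plus sets of `frozenset` edges without self-loops, and the LSAP is solved by exhaustive search (`lsap_bruteforce`), which stands in for an $O(b^3)$ solver and is usable only for tiny graphs. All function names are hypothetical.

```python
from itertools import permutations


def lsap_bruteforce(cost):
    """Exhaustive LSAP solver for tiny inputs (stand-in for an O(b^3) solver)."""
    b = len(cost)
    return min(permutations(range(b)),
               key=lambda phi: sum(cost[i][phi[i]] for i in range(b)))


def edit_cost(labels1, edges1, labels2, edges2, phi):
    """Eq. (3) with unit costs; None marks an empty vertex."""
    b = len(labels2)
    cost = sum(labels1[i] != labels2[phi[i]] for i in range(b))
    for i in range(b - 1):
        for j in range(i + 1, b):
            cost += (frozenset((i, j)) in edges1) != (frozenset((phi[i], phi[j])) in edges2)
    return cost


def lak_step(vectors, edges):
    """One LAK relabeling: add the old vectors of all neighbours to each vertex."""
    new = [list(vec) for vec in vectors]
    for edge in edges:
        u, w = tuple(edge)
        for k in range(len(vectors[u])):
            new[u][k] += vectors[w][k]
            new[w][k] += vectors[u][k]
    return new


def approx_ged(labels1, edges1, labels2, edges2, alphabet, H, gamma=1.0):
    b = len(labels2)
    labels1 = list(labels1) + [None] * (b - len(labels1))        # Lines 1-2
    vec1 = [[int(l == s) for s in alphabet] for l in labels1]
    vec2 = [[int(l == s) for s in alphabet] for l in labels2]
    C = [[0.0] * b for _ in range(b)]                             # Line 3
    best = None
    for h in range(H + 1):                                        # Line 4
        for i in range(b):                                        # Lines 5-6
            for j in range(b):
                sq = sum((x - y) ** 2 for x, y in zip(vec1[i], vec2[j]))
                C[i][j] += gamma ** (2 * h) * sq
        vec1, vec2 = lak_step(vec1, edges1), lak_step(vec2, edges2)  # Lines 7-8
        phi = lsap_bruteforce(C)                                  # Line 9
        d_h = edit_cost(labels1, edges1, labels2, edges2, phi)    # Line 10
        best = d_h if best is None else min(best, d_h)
    return best                                                   # Line 11


# Two drawings of the same path a - a - b with different vertex numbering.
g1 = (["a", "a", "b"], {frozenset((0, 1)), frozenset((1, 2))})
g2 = (["a", "a", "b"], {frozenset((0, 1)), frozenset((0, 2))})
print(approx_ged(*g1, *g2, alphabet=["a", "b"], H=0))  # labels only
print(approx_ged(*g1, *g2, alphabet=["a", "b"], H=1))  # one relabeling step
```

In this toy case the label-only cost matrix cannot tell the two vertices labeled a apart, whereas one relabeling step separates the end vertex from the middle vertex and lets the assignment recover the isomorphism.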
The notable difference between Algorithms 1 and 2 is that the former requires the exact graph edit distance to be computed, which is known to be an NP-complete problem. Although the computation is conducted for small graphs, Algorithm 1 runs the computation $b^2$ times, which entails a significant computational cost. In contrast, the proposed method does not require the exact graph edit distance to be computed, replacing this with computations of the Euclidean distance. The complexity of this operation is independent of the size of the graphs, which enables us to use LSAP multiple times in our proposed method. The final output of the proposed method is selected from $\hat{d}_h$ for $0 \le h \le H$, with our method returning the most accurate graph edit distance.
In the proposed method, we used LAK to characterize the local structure around each vertex. If we use WLSK, HCK, or NHK, the difference between $g_h^i$ and $g_h^j$ is represented as only a binary value, rather than in a quantitative form. The binary value represents whether $g_h^i$ and $g_h^j$ are isomorphic or not. This is why we use LAK in our proposed method.
5 EXPERIMENTAL EVALUATION
The proposed method was implemented in Java. All experiments were conducted on an Intel Xeon E5-2609 2.50 GHz computer with 32 GB memory running Microsoft Windows 7. We used two real-world datasets. The first dataset, MUTAG (Debnath et al., 1991), contains information on 188 chemical compounds and their class labels. The class labels are binary values that indicate the mutagenicity of chemical compounds. Each chemical compound is represented as an undirected graph where each vertex, edge, vertex label, and edge label corresponds to an atom, chemical bond, atom type, and bond type, respectively. Because we assume that only the vertices in the graphs have labels, the chemical graphs are converted using a previous method (Hido and Kashima, 2009), that is, an edge labeled with $\ell$ that is adjacent to vertices $v$ and $u$ in a chemical graph is replaced with a vertex labeled with $\ell$ that is adjacent to $v$ and $u$ with unlabeled edges, as shown in Figure 7. The second dataset, ENZYMES, contains information on 600 proteins and their class labels. Each class label is one of six labels denoting the six EC top-level classes (Schomburg et al., 2004). Table 1 summarizes the datasets.

Figure 7: Conversion of a graph.

Table 1: Summary of evaluation datasets.
                            MUTAG      ENZYMES
Number of graphs |D|        188        600
Maximum graph size          84         126
Average graph size          53.9       32.6
Number of labels |Σ|        12         3
Number of classes           2          6
(class distribution)        (126,63)   (100,100,100,100,100,100)
Avg. degree of vertices     2.1        3.8
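
The edge-to-vertex conversion applied to the chemical graphs can be sketched as below. The representation (integer vertex ids, edges given as `(v, u, label)` triples) and the function name `edge_labels_to_vertices` are assumptions made for illustration; the sketch follows the textual description above rather than any published code.

```python
def edge_labels_to_vertices(vertex_labels, labeled_edges):
    """Replace each labeled edge (v, u, label) by a new vertex carrying that label.

    The new vertex is connected to v and u by unlabeled edges.
    """
    labels = dict(vertex_labels)
    edges = set()
    next_id = max(labels) + 1 if labels else 0
    for v, u, label in labeled_edges:
        w = next_id              # new vertex standing in for the edge
        next_id += 1
        labels[w] = label
        edges.add(frozenset((v, w)))
        edges.add(frozenset((w, u)))
    return labels, edges


# A carbon and an oxygen joined by a double bond.
new_labels, new_edges = edge_labels_to_vertices({0: "C", 1: "O"}, [(0, 1, "double")])
print(new_labels)
```

Each original edge contributes one extra vertex and two unlabeled edges, so the converted graphs are larger than the original molecules.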
5.1 Computational Efficiency
Figures 8 and 9 show the average computation time required to compute the approximate graph edit distance for each pair of graphs in the MUTAG and ENZYMES datasets for various $H$, respectively. These figures do not contain results for the exact graph edit distance, because we could not compute the exact graph edit distance between most of the graphs with more than 30 vertices. The horizontal axes are logarithmic. So that the computation times of both methods for $H = 0$ can be plotted, the horizontal axes refer to $H + 1$ rather than $H$. The result of the proposed method for $H = 0$ is almost the same as that of the conventional method for $H = 0$, because they use the same local structures around vertices. As $H$ increases, the computation time of the conventional method described in Algorithm 1 (Carletti et al., 2015) increases dramatically. For the conventional method, we could compute up to $H = 2$ within a total computation time of 3 hours,
Figure 8: Average computation time for various H (MUTAG). Horizontal axis: H + 1; vertical axis: computation time [msec]; series: Proposed, Conventional.
Figure 9: Average computation time for various H (ENZYMES). Horizontal axis: H + 1; vertical axis: computation time [msec]; series: Proposed, Conventional.
with more than 95% of the computation time in the
conventional method taken up by computing the exact
graph edit distance between local structures. There-
fore, the conventional method can only take account
of small local structures around vertices to compute
the approximate graph edit distance. In contrast, the
computation time of the proposed method is propor-
tional to H, as discussed in the final part of Section 4.
As the proposed method does not require the exact
graph edit distance to be computed, we obtained re-
sults for up to H = 200 within a total computation
time of 3 hours, which indicates that the proposed
method takes account of large local structures around
vertices. Although the proposed method computes
LSAP multiple times, the computational complexity
of LSAP is O(b
3
), which is sufficiently tractable for
the size of the graphs in the experimental datasets.
From these results, we can confirm that the proposed
method is much faster than the conventional method
for computing approximate graph edit distances.
5.2 Accuracy of the Graph Edit
Distance
Figures 10 and 11 show the average approximate
graph edit distance measured by the proposed method
for each pair of graphs in the MUTAG and ENZYMES datasets, respectively. The approximate graph edit distance measured by the proposed method is not less than the exact graph edit distance, because the exact graph edit distance is defined as the “minimum length” of the sequence of edit operations needed to transform $g_1$ into $g_2$ and the approximate graph edit distance measured by the proposed method is computed by Eq. (3). As $H$ increases, the average approximate graph edit distance decreases in both datasets, because large local structures around the vertices are used to map the vertices in one graph to vertices in another graph. In addition, by tuning $\gamma$, we can obtain more accurate edit distances than with $\gamma = 1$. In Figure 12, each point $(\hat{d}_c, \hat{d}_p)$ shows the approximate graph edit distances $\hat{d}_c$ and $\hat{d}_p$ measured by the conventional and proposed methods, respectively, for a pair of graphs in the MUTAG dataset. There are many points above the red line, which suggests that most of the approximate graph edit distances measured by the proposed method are more accurate than those given by the conventional method, because $\hat{d}_c > \hat{d}_p$. For the MUTAG dataset, the average approximate graph edit distances of the conventional and proposed methods are 78.5 and 82.4, respectively. Figure 13 for ENZYMES indicates the same tendency as Figure 12. For the ENZYMES dataset, the average approximate graph edit distances of the conventional and proposed methods are 115.8 and 129.0, respectively. One of the reasons why the proposed method computes accurate approximate graph edit distances is that it very efficiently takes account of large local structures around vertices, which enables the minimum approximate graph edit distance to be selected from among $\hat{d}_0, \hat{d}_1, \cdots, \hat{d}_H$.
In general, there is a trade-off between fast com-
putation and accuracy. However, as shown in the pre-
vious subsection and here, the proposed method is
much faster and more accurate than the conventional
method.
5.3 Application of Graph Edit Distance
To validate the applicability of the graph edit distance, we compared the accuracy of the predictions given by the conventional and proposed methods. The graph classification problem is defined as follows. Given a set of n training examples $D = \{(g_i, y_i)\}$ $(i = 1, 2, \cdots, n)$, where each example is a pair consisting of a labeled graph $g_i$ and the class $y_i \in \{+1, -1\}$ to which it belongs, the objective is to learn a function f that correctly predicts the classes of the test examples. In this experiment, graphs are classified by an
SVM using a graph kernel.

Figure 10: Average approximate graph edit distance for various H and γ (MUTAG). [Plot: approximate graph edit distance against γ, with one curve each for H = 0, 1, 2, 3.]

Figure 11: Average approximate graph edit distance for various H and γ (ENZYMES). [Plot: approximate graph edit distance against γ, with one curve each for H = 0, 1, 2, 3.]

The graph kernel used in this experiment is defined as
$$k(g_1, g_2) = \exp\left(-\frac{\hat{d}(g_1, g_2)^2}{2\sigma^2}\right), \tag{10}$$
where $\hat{d}(g_1, g_2)$ is the approximate graph edit distance between $g_1$ and $g_2$ measured by the conventional and proposed methods (see footnote 1). To learn from the kernel matrices generated by the above graph kernel, we used the LIBSVM package with 10-fold cross-validation (Chang and Lin, 2001).
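As an illustration of how a kernel matrix of this form might be built from pairwise approximate graph edit distances, the following minimal sketch applies the Gaussian form of Eq. (10) elementwise. The function name `gaussian_ged_kernel`, the toy distance values, and the choice σ = 2 are assumptions made for this example; they are not taken from the experiments.

```python
import math


def gaussian_ged_kernel(distances, sigma):
    """Return K with K[i][j] = exp(-d[i][j]^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    scale = 2.0 * sigma * sigma
    return [[math.exp(-(d * d) / scale) for d in row] for row in distances]


# Hypothetical approximate graph edit distances between three graphs.
toy_distances = [
    [0.0, 2.0, 5.0],
    [2.0, 0.0, 3.0],
    [5.0, 3.0, 0.0],
]

# Symmetric matrix with ones on the diagonal (distance 0 gives similarity 1).
kernel = gaussian_ged_kernel(toy_distances, sigma=2.0)
```

A matrix computed this way could then be supplied to an SVM implementation that accepts precomputed kernels, with σ treated as a hyperparameter to be tuned.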
Figures 14 and 15 show the classification accuracy of the conventional and proposed methods, as well as that of WLSK (Shervashidze et al., 2011), for various H using the MUTAG and ENZYMES datasets. The maximum accuracy of the proposed method is greater than that of the conventional method, because the proposed method computes more accurate approximate graph edit distances, as shown in the previous subsection. In addition, the maximum accuracy of the proposed method is comparable with that of WLSK, which is one of the most representative graph kernels.

Footnote 1: Their experimental results are labeled “Proposed” and “Conventional” in Figs. 14 and 15.

Figure 12: Approximate graph edit distance for H = 2 (MUTAG). [Scatter plot; axes: approximate graph edit distance for the conventional method and for the proposed method, each from 0 to 200.]

Figure 13: Approximate graph edit distance for H = 2 (ENZYMES). [Scatter plot; axes: approximate graph edit distance for the conventional method and for the proposed method, each from 0 to 350.]
6 CONCLUSION
As computing the exact graph edit distance is compu-
tationally expensive, and may be intractable for large-
scale datasets, in this paper, we proposed a method
based on graph relabeling that is both faster and more
accurate than the conventional approach. We used un-
folded subtrees to denote the potential relabeling of
local structures around a given vertex. These subtree
representations are concatenated as a vector, and the
distance between different vectors is used to charac-
terize the distance between the corresponding graphs.
This avoids the need for multiple calculations of the
exact graph edit distance between local structures.
Figure 14: Prediction accuracy for various H (MUTAG). [Plot: classification accuracy [%], from 70 to 100, against H, from 0 to 5; curves: Proposed, Conventional, WLSK.]

Figure 15: Prediction accuracy for various H (ENZYMES). [Plot: classification accuracy [%], from 20 to 70, against H, from 0 to 5; curves: Proposed, Conventional, WLSK.]
Simulation experiments on two real-world chemical datasets were reported. Compared with the conventional technique, the proposed method gave a more accurate approximation of the graph edit distance and was significantly faster on both datasets. This suggests that the proposed method could be applicable to the analysis of larger and more complex graph-like datasets.
REFERENCES
Cai, Jin-yi, Fürer, Martin, and Immerman, Neil. 1992. An
Optimal Lower Bound on the Number of Variables for
Graph Identification. Combinatorica, 12(4), 389–410.
Carletti, Vincenzo, Gaüzère, Benoit, Brun, Luc, and Vento,
Mario. 2015. Approximate Graph Edit Distance Computation
Combining Bipartite Matching and Exact
Neighborhood Substructure Distance. In International
Workshop on Graph Based Representations in
Pattern Recognition (GbRPR), 188–197.
Chang, Chih-Chung, and Lin, Chih-Jen. 2001. LIBSVM: A
library for Support Vector Machines. Available online
at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Choi, Yeonjoo, and Kim, Gyeonghwan. 2010. Graph-based
Fingerprint Classification using Orientation Field in
Core Area. IEICE Electronics Express, 7(17), 1303–1309.
Debnath, Asim Kumar, Lopez de Compadre, Rosa L., Deb-
nath, Gargi, Shusterman, Alan J., and Hansch, Cor-
win. 1991. Structure-Activity Relationship of Mu-
tagenic Aromatic and Heteroaromatic Nitro Com-
pounds. Correlation with Molecular Orbital Energies
and Hydrophobicity. Journal of Medicinal Chemistry,
34, 786–797.
Gaüzère, Benoit, Bougleux, Sébastien, Riesen, Kaspar, and
Brun, Luc. 2014. Approximate Graph Edit Distance
Guided by Bipartite Matching of Bags of Walks. In
Proc. of International Workshop on Structural and
Syntactic Pattern Recognition (SSPR), 73–82.
Hart, Peter E., Nilsson, Nils J., and Raphael, Bertram. 1968.
A Formal Basis for the Heuristic Determination of
Minimum Cost Paths. IEEE Transactions
on Systems Science and Cybernetics, 4(2), 100–107.
Hido, Shohei and Kashima, Hisashi. 2009. A Linear-Time
Graph Kernel. In Proc. of International Conference
on Data Mining (ICDM), 179–188.
Kashima, Hisashi, Tsuda, Koji, and Inokuchi, Aki-
hiro. 2003. Marginalized Kernels Between Labeled
Graphs. In Proc. of International Conference on Ma-
chine Learning (ICML), 321–328.
Kataoka, Tetsuya and Inokuchi, Akihiro. 2016. Hadamard
Code Graph Kernels for Classifying Graphs. In Proc.
of International Conference on Pattern Recognition
Applications and Methods (ICPRAM), 24–32.
Kinable, Joris, and Kostakis, Orestis. 2011. Malware Clas-
sification based on Call Graph Clustering. Journal in
Computer Virology, 7(4), 233–245.
Koopmans, Tjalling C. and Beckmann, Martin. 1957. As-
signment Problems and the Location of Economic Ac-
tivities. Econometrica, 25(1), 53–76.
Riesen, Kaspar and Bunke, Horst. 2009. Approximate
Graph Edit Distance Computation by Means of Bipartite
Graph Matching. Image and Vision Computing, 27(7),
950–959.
Riesen, Kaspar. 2015. Structural Pattern Recognition with
Graph Edit Distance: Approximation Algorithms and
Applications. Advances in Computer Vision and Pat-
tern Recognition, Springer.
Robles-Kelly, Antonio, and Hancock, Edwin R. 2003. Edit
Distance From Graph Spectra. In Proc. of Interna-
tional Conference on Computer Vision (ICCV), 234–
241.
Schomburg, Ida, Chang, Antje, Ebeling, Christian, Gremse,
Marion, Heldt, Christian, Huhn, Gregor, and Schom-
burg, Dietmar. 2004. BRENDA, the Enzyme
Database: Updates and Major New Developments.
Nucleic Acids Research, 32D, 431–433.
Shervashidze, Nino, Schweitzer, Pascal, van Leeuwen,
Erik Jan, Mehlhorn, Kurt, and Borgwardt, Karsten M.
2011. Weisfeiler-Lehman Graph Kernels. Journal of
Machine Learning Research (JMLR), 2539–2561.
Wang, Peng, Eglin, Véronique, Garcia, Christophe, Largeron,
Christine, Lladós, Josep, and Fornés, Alicia.
2014. A Coarse-to-Fine Word Spotting Approach for
Historical Handwritten Documents Based on Graph
Embedding and Graph Edit Distance. In Proc. of International
Conference on Pattern Recognition (ICPR),
3074–3079.
Yan, Xifeng, Yu, Philip S., and Han, Jiawei. 2005. Substructure
Similarity Search in Graph Databases. In Proc. of
the ACM SIGMOD International Conference on Management
of Data (SIGMOD), 766–777.
Zeng, Zhiping, Tung, Anthony K. H., Wang, Jianyong,
Feng, Jianhua, and Zhou, Lizhu. 2009. Comparing
Stars: On Approximating Graph Edit Distance.
In Proc. of International Conference on Very Large
Databases (PVLDB), 2(1), 25–36.
APPENDIX
The Label Aggregate Kernel (LAK) was designed to measure the similarity between two graphs $g_1$ and $g_2$ to enable the application of a Support Vector Machine (SVM). The kernel is defined as
$$k(g_1, g_2) = \sum_{t=0}^{h} \sum_{v_i \in V(g_1)} \sum_{v_j \in V(g_2)} \delta\left(\boldsymbol{\ell}^{(t)}_{L}(v_i), \boldsymbol{\ell}^{(t)}_{L}(v_j)\right),$$
where δ is the Kronecker delta. In contrast, in this paper, LAK has been used to measure the approximate graph edit distance corresponding to the dissimilarity between two graphs $g_1$ and $g_2$.
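The triple sum in the kernel definition counts, for every relabeling step t, the vertex pairs whose labels coincide. A minimal sketch of that counting is given below. It assumes the labels at each step have already been computed and are represented as hashable Python values; the function name `label_match_kernel` and the toy labels are hypothetical and serve only to illustrate the sum.

```python
from collections import Counter


def label_match_kernel(labels_g1, labels_g2):
    """Sum, over steps t, the number of vertex pairs (v_i, v_j) with equal labels.

    labels_g1[t] and labels_g2[t] hold the labels of all vertices of g1 and g2
    after t relabeling steps, for t = 0, ..., h.
    """
    if len(labels_g1) != len(labels_g2):
        raise ValueError("both graphs need labels for the same steps")
    total = 0
    for step_g1, step_g2 in zip(labels_g1, labels_g2):
        counts_g1 = Counter(step_g1)
        counts_g2 = Counter(step_g2)
        # A label seen a times in g1 and b times in g2 yields a * b matching pairs.
        total += sum(count * counts_g2[label] for label, count in counts_g1.items())
    return total


# Hypothetical labels for h = 1: atom types at t = 0, relabeled values at t = 1.
g1_labels = [["C", "C", "O"], [("C", 1), ("C", 2), ("O", 1)]]
g2_labels = [["C", "O", "N"], [("C", 2), ("O", 1), ("N", 1)]]
value = label_match_kernel(g1_labels, g2_labels)
```

Counting label frequencies per graph avoids enumerating all vertex pairs explicitly while producing the same total as the Kronecker-delta sum.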
ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods