XML SCHEMA STRUCTURAL EQUIVALENCE
Angela C. Duta, Ken Barker and Reda Alhajj
Department of Computer Science, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, Canada
Keywords:
Schema equivalence, XML structures.
Abstract:
The Xequiv algorithm determines when two XML schemas are equivalent based on their structural organi-
zation. It calculates the percentages of schema inclusion in another schema by considering the cardinality
of each leaf node and its interconnection to other leaf nodes that are part of a sequence or choice structure.
Xequiv is based on the Reduction Algorithm (Duta et al., 2006) that focuses on the leaf nodes and eliminates
intermediate levels in the XML tree.
1 INTRODUCTION
Much work has been done in the XML schema equiv-
alence area ((Do et al., 2003), (Do and Rahm, 2002),
(Lee et al., 2002), (Madhavan et al., 2001), (Nierman
and Jagadish, 2002)) that is applied optimally in only
some situations. We propose an approach that finds
equivalent XML schemas from the same domain (the
same entities and attributes) that have different tree
organizations. The difficulty of comparing and find-
ing matchable schemas arises for two reasons: (1)
there are three data storage units in XML: elements,
attributes, and text content, and (2) the hierarchical
features of the XML structure. XML schema equiva-
lence must be evaluated from three perspectives: (1)
hierarchical structure (structural equivalence), (2) el-
ements and attributes data types (syntactic equiva-
lence), and (3) elements and attributes names (seman-
tic equivalence).
This paper focuses on determining the structural
equivalence of XML schema by using reduced XML
trees generated by the Reduction Algorithm (RA)
(Duta et al., 2006) . In the reduced XML trees the
three data storage units (element, attribute, and text
content) are transformed into a single storage unit: the
element node (also called the node). RA eliminates
intermediate organizational nodes from each XML
schema so that a comparison between them is effi-
cient. The reduced XML schema contains only infor-
mation about leaf nodes: data types, labels, number
of occurrences, and interconnections between them.
Our argument for using reduced XML trees is that
leaf nodes are the important nodes as they store the
data in XML files. Higher level nodes represent a
subjective hierarchical organization that allows an in-
telligible reading of the information stored in leaves.
From this perspective our approach is contrary to the
assumption “elements at higher levels ... are more rel-
evant than subelement deeply nested” (Bertino et al.,
2004) used by some methods (Bertino et al., 2004).
The purpose of this paper is to define a new
method for optimizing the schema structure equiva-
lence process that applies to schema trees of similar
or different organization. A classification of XML
trees from the structural perspective is (1) similar tree
structures that use different data storage units and
(2) different tree structures that use different order,
grouping and/or nesting of subelements within a par-
ent element. All approaches published to date focus
on similar tree structures and do not address schema
equivalence for different tree structures. The nov-
elty of our method is to determine structural match-
ing based on the equivalent leaves content rather than
contexts and vicinities. A leaf content is defined by
(1) data type and (2) number of minimum and max-
imum occurrences. Our approach finds equivalent
XML schemas in all situations detailed above as long
as the minimum information is provided to find a
52
C. Duta A., Barker K. and Alhajj R. (2007).
XML SCHEMA STRUCTURAL EQUIVALENCE.
In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 52-59
DOI: 10.5220/0002357700520059
Copyright
c
SciTePress
match (labels and data types).
Paper Organization Following this two DTD ex-
amples are detailed that have different tree structures
but refer to the same entities: employees, projects,
and tasks. Section 2 briefly discusses several devel-
oped methods for schema equivalence. Our approach
is presented starting with Section 3 that first summa-
rizes the RA and then details Xequiv. An example
for Xequiv is depicted in Section 6. This paper draws
some conclusions in Section 7.
Motivating Examples Figures 1 and 2 illustrate
two simple examples of DTDs that store data about
employees, projects, and tasks for a company. The
element data type definitions have not been included.
No mechanism has yet appeared in the literature to
clearly compare these XML schemas and decide if
they are equivalent. This paper presents the Xequiv
algorithm that structurally compares differently orga-
nized XML trees from the same domain.
2 RELATED WORK
Following Salminen and Tompa’s suggestion (Salmi-
nen and Tompa, 2001) that the canonical forms for
XML recommended by W3C (Boyer, 2001) must be
further researched to solve the XML schema equiva-
lence problem, much work has been done in this area.
The generic schema matching algorithm Cupid (Mad-
havan et al., 2001) focuses on leaf nodes using auto-
matic linguistic matching (elements’ name) and struc-
tural matching (schema structure, path matching, con-
straints, and element data types). The similarity be-
tween two DTDs is evaluated (Lee et al., 2002) from
three perspectives: (1) semantic similarity (similarity
between node labels, constraints and path context (as-
cendants)), (2) immediate descendant similarity, and
(3) leaf context similarity. Constraints such as +, *, ?,
or none are given weights of similarity. This work is
<!ELEMENT company1 (employee+, project*, task+)>
<!ELEMENT employee (eid | sin, name, (pid* | task name)+) >
<!ATTLIST employee address CDATA #REQUIRED>
<!ELEMENT project (pid, description, budget | manager |
location )>
<!ELEMENT task(task name, date)>
keys: project.pid, task.task name
references: employee.pid to project.pid,
employee.task name to task.task name
Figure 1: Repeated employee, project, and task elements.
similar to ours in that it addresses some DTD trans-
formation rules also adopted by us.
A collection of documents with DTD’s from the
same domain is divided into sets of similar DTDs
based on the minimum edit distances (Nierman and
Jagadish, 2002). The edit distance is calculated using
dynamic programming as the minimum cost to trans-
form a tree A into B. This method works for docu-
ments with DTDs having the same tree structure but
it cannot be applied to trees that have a significant dif-
ferent structure even though they refer to the same do-
main. COMA (Do and Rahm, 2002) combines several
simple and hybrid matching algorithms. The simple
algorithms refer to one aspect in DTD: labels, data
types, or user input. Our approach extends the struc-
tural matchers Children and Leaves by combining and
generalizing them to any type of node (repeated, op-
tional, alternative options, key, reference, etc.)
3 THE XEQUIV ALGORITHM
3.1 The Reduction Algorithm (RA)
RA (Figure 3) addresses multiple data storage units
and hierarchical organization in XML. An XML el-
ement stores data in a text unit, attributes, and/or
subelements. Each data unit is represented by a node.
Thus, we easily distinguish between an empty ele-
ment and a text-only element (element types used
accordingly with the W3C standard (Consortium,
2004)) because the first has an element node and sev-
eral attribute subnodes as data units, while the second
uses an element node, a text subnode and several at-
tribute subnodes (for more details refer to (Duta et al.,
2006) ). RA is based on seven rules that convert the
node types of the source structure (element, attribute,
text) into a single node type (element) and eliminates
intermediate tree levels.
<!ELEMENT company2 (employee+)>
<!ELEMENT employee (sin, name, address*, dateOfBirth?,
projects?)>
<!ATTLIST employee eid CDATA #REQUIRED>
<!ELEMENT projects (project+)>
<!ELEMENT project (description?, manager | location, task+)>
<!ATTLIST project pid CDATA #REQUIRED>
<!ELEMENT task (#PCDATA)>
<!ATTLIST task date CDATA #REQUIRED>
keys: employee.eid, project.pid
references: project.manager to employee.eid
Figure 2: Nested structure of employee, project and task
elements.
XML SCHEMA STRUCTURAL EQUIVALENCE
53
1. Part 1: Transform text and attribute nodes
2. Part 2: Create the XRTS tree by eliminating
intermediate nodes.
3. Part 3: Create XRTN tree by transferring
the sequence multiplicator (+, *) or optional
indicator ? to each node in the sequence.
Figure 3: The Reduction Algorithm.
The first two parts of RA eliminate nodes but pre-
serve the initial constraints of the XML schema cre-
ating an XML reduced tree using sequences (XRTS).
Part 3 transfers the outer expression operators (?, +,
*) to inner elements creating an XML reduced tree to
leaf nodes (XRTN). Part 3 generates some informa-
tion loss regarding element occurrences restricted to
occurrences of other elements but allows a fast first
evaluation of schemas similarity.
3.2 The XML Schema Equivalence
Xequiv Algorithm
The purpose of Xequiv is to find Xml schemas that
are similar in terms of leaf content (see Figure 4). To
compare leaf contents the source schemas must be re-
duced using RA so that the intermediate nodes are
eliminated. Xequiv focuses only on the nodes that
store data and compares leaf nodes which are of only
one type: element nodes. A leaf content shows how
much data is stored in the XML data file based on the
node definition.
leaf content = <data type, leaf cardinality
(minOccurences .. maxOccurences)>
For example, the structure tasks(task
name?)
with task
name of type string has the leaf con-
tent task
name
content
=< string,0..1 >. The node
task
name is a perfect match with the node job that
has the leaf content job
content
=< string,0..1 > and
1. Reduce schemas using RA and create XRTS1,
and XRTS2 in Part 2, and XRTN1 and XRTN2 in
Part 3 from the source schemas XSD1 and XSD2,
respectively.
2. Determine the equivalence values
Sim
XRTN1XRTN2
and
Sim
XRTN2XRTN1
. If they are greater than
a predefined threshold, then they are somewhat
equivalent and we proceed to Step 3 to determine
the structural equivalence.
3. Compare XRTS1 and XRTS2 by finding a match for
each structure (sequence,
all
or
choice
). A
node part of a structure or a series of nested
structures in XRTS1 is equivalent to a node in
XRTS2 which is part of similar series of nested
structures.
Figure 4: The Xequiv Algorithm.
only α% similar (see Table 1) with the job name node
where job
name
content
=< string,1..1 >.
We recommend using XML Schema as it allows
a variety of data types and cardinality values. In this
paper we use DTD only to present schemas
1
in a com-
pact way but our algorithm is based on XML Schema
as it has a larger variety of data types and features (e.g.
primary and foreign keys for attributes and elements).
Two nodes are selected for comparison based on
their data types and labels. Equivalent nodes must
have similar data types and matchable labels, which
refer to the same concept either using the same words,
abbreviations, or synonyms. WordNet (Laboratory,
2005) retrieves efficiently the set of synonyms for any
label. We consider that synonyms and abbreviations
are 100% equivalent as we determine structural and
not ontology equivalence (contrary to XClust (Lee
et al., 2002) that determines levels of equivalence
for abbreviations). We use the information provided
by labels to select candidate nodes. This is accom-
plished by the function Θ that determines if two nodes
have a similar label and their data types are from
compatible classes. Consider node N
1
defined by
(type T
1
,label L
1
) from XRT1
2
and node N
2
(T
2
,L
2
)
from XRT2.
Θ(N
1
,N
2
) =
1 , if T
1
T
2
and L
1
L
2
0 , otherwise
(1)
4 NODES SIMILARITY
4.1 Similarity Metric for Simple Nodes
To determine if two schemas are structurally equiva-
lent, Xequiv first evaluates their leaf nodes similarity.
This provides a fast evaluation that separates schemas
into different domains. Using the reduced schemas
obtained at Part 3 of RA, Xequiv identifies for each
node a matching node and determines the measure
of inclusion between them. Consider the structures
str1 : (a) and str2 : (a+). Structure str1 requires node
a to appear exactly one time, while str2 requires node
a to occur several times but at least once in the cor-
responding XML file. We consider that str1 is in-
cluded in structure str2 because a a+, and, thus,
R +. Also, R ? because ? admits two statuses:
present one time (like R - required) or non-present.
The operator ? as allows in addition the node to
occur multiple times. Similarly, inclusion hierarchies
1
As a result we take a few liberties with DTD nomencla-
ture in our examples and we will indicate where they occur.
2
XRT is a general term that refers to the XML reduced
tree, either XRTS or XRTN
ICEIS 2007 - International Conference on Enterprise Information Systems
54
are determined between all operators: R ? and
R + .
The inclusion of a structure str1 into a structure
str2 is based on the inclusion of each node from str1
into a single node in str2. The inclusion ε
xy
of a
node x from str1 into another node y from str2 if
Θ(x,y) = 1 is based on the inclusion of their expres-
sion operators For nodes with the same operator the
inclusion measure ε = 1. If the operator of the x node
is included in the operator of the node y then ε
xy
= 1.
Otherwise, the node x is included in y with a lower
percentage (see Table 1). Values of α, β, γ, δ, ε, ρ, and
σ represent the inclusion percentage, and 0 <= α, β,
γ, δ, ε, ρ, and σ < 1. It is very important how these
values are set as they are directly correlated with the
minimum threshold set for schema equivalence. Note
that Table 1 is asymmetrical as the node a? a and,
thus, ε
a?a
= 1 but ε
a∗→a?
< 1. If a node x from str1
does not have a correspondent in str2 then the inclu-
sion factor is 0.
We define the similarity function for an XML
reduced schema XRTN1 with n1 nodes to another
schema XRTN2 with n2 elements based on the nodes
inclusion. We assume that for each node x from
XRTN1 there exists at most one node y in XRTN2
such that Θ(x,y) = 1.
Sim
XRTN1XRTN2
=
n1
i=1
ε
x
i
y
j
n1
100%,1 j n2
(2)
The similarity Sim is an asymmetrical function.
Sim
XRTN1XRTN2
expresses how much of the struc-
ture of XRTN1 is included in XRTN2. Note that
Sim
XRTN1XRTN2
is different from Sim
XRTN2XRTN1
if (1) there are nodes in one structure that do not have
a match in the other structure, and (2) different oper-
ators are used for nodes with Θ = 1.
4.2 Similarity Metric for Choice Nodes
The Sim metric considers each node and its match.
The values provided by Table 1 work for simple nodes
but not for nodes formed by several alternatives (the
choice nodes or structures). The choice structure is
formed by several mutually exclusive nodes. The
Table 1: Operators inclusion percentages ε
xy
(φ = non-
existent node).
ε
xy
R ? + * φ
R 1 1 1 1 0
? α 1 β 1 0
+ γ δ 1 1 0
* ε ρ σ 1 0
metric SimChoice must evaluate the similarity of two
choice nodes by considering the number of alterna-
tives and also the similarity between a choice node
and a simple node. The equivalence metric must be
no more than 1 (like nodes inclusion) and differenti-
ate between different number of alternatives. We con-
sider each situation below.
4.2.1 Similarity between Two Choice Structures
Consider the node x formed by several alternative
nodes x = (x
1
|..|x
m
) in XRTN1. To evaluate how
similar is node x from XRTN1 to a choice node
y in XRTN2 y = (y
1
|..|y
n
) we assume that alterna-
tives are ordered in both nodes such that Θ(x
1
,y
1
) =
Θ(x
2
,y
2
) = .. = 1.
SimChoice
xy
=
m
i=1
ε
x
i
y
i
m
(3)
If k such that Θ(x
k
,y
k
) = 0, then ε
x
k
y
k
=
0. If n < m, then there are alternatives in x
with no correspondent in y and SimChoice
xy
< 1.
SimChoice
xy
= 1 if each alternative from x has a
correspondent in y with the same or a more general
expression operator.
4.2.2 Similarity between a Choice and Multiple
Simple Nodes
Consider the choice node (x|y) in XRTN1 and the se-
quence (x, y) in XRTN2. (x|y) represents a single node
so it must be similar to one node only from XRTN2.
But as both alternatives x and y from XRTN1 have
a correspondent in XRTN2 the one that maximizes
the similarity function based on cardinality matching
must be chosen.
SimChoice
(x|y)(x,y)
= Max(SimChoice
(x|y)x
,
SimChoice
(x|y)y
) (4)
Thus, SimChoice
(x|y)(x,y)
= Max(
ε
x
2
,
ε
y
2
).
Conversely, SimChoice
(x,y)(x|y)
must be evalu-
ated using the similarity metric for each simple node
SimChoice
x(x|y)
and SimChoice
y(x|y)
.
SimChoice
(x,y)(x|y)
= Max(SimChoice
x(x|y)
,
SimChoice
y(x|y)
) (5)
Thus, SimChoice
(x,y)(x|y)
= Max(ε
x
,ε
y
).
4.2.3 Similarity between One Choice Structure
and Multiple Choice Structures
Consider the alternative structure (x|y|z|t) in XRTN1
and two alternative structures (x|y) and (z|t) in
XRTN2. In XRTN1 there is a single node with four
alternatives and it has two corresponding nodes in
XML SCHEMA STRUCTURAL EQUIVALENCE
55
XRTN2 each with two alternatives. A single node
from XRTN2 must correspond to a node from XRTN1
so the one that maximizes the similarity function must
be chosen.
SimChoice
(x|y|z|t)((x|y),(z|t))
=
Max(SimChoice
(x|y|z|t)(x|y)
,SimChoice
(x|y|z|t)(z|t)
) (6)
Thus, SimChoice
(x|y|z|t)((x|y),(z|t))
=
Max(
ε
x
+ε
y
4
,
ε
z
+ε
t
4
)
In summary, a choice node x(x
1
,...x
m
) from
XRTN1 is equivalent to at most one node in XRTN2.
If the alternatives from x exist in XRTN2 either
as simple nodes y
1
,..y
k
or as alternatives that are
grouped in several choice nodes y
k+1
,..y
n
the simi-
larity measure chooses the node y that is the “most”
equivalent to x.
SimChoice
x(y
1
,..y
n
)
= Max(SimChoice
xy
1
,..,
SimChoice
xy
n
) =
Max
n
i=1
(ε
y
i
)
m
(7)
Thus, the similarity value between a schema XRTN1
with n1 nodes and a schema XRTN2 formed by n2
nodes is:
Sim
XRTN1XRTN2
=
n1
i=1
Sim
x
i
y
j
n1
100%,1 j n2
(8)
The equivalence value Sim for a node x for which
there exists at least a node y in XRTN2 such that
Θ(x,y) = 1 is defined as follows:
Sim
xy
=
ε
xy
, if x and y
are simple nodes
SimChoice
xy
ε
xy,
, if x or y
is a choice node
(9)
The value ε
xy
for choice nodes determines the equiv-
alence between the operators applied to the choice
structures making the difference, for example, be-
tween (x
1
|x
2
) and (y
1
|y
2
)+. Note that ε
xy
for XRTN
schemas is always 1 as there is no outer operator for
choice structures. The choice operator is combined
with alternative nodes’ operators in Part 3 of RA as
described in Section 3.
5 STRUCTURAL SIMILARITY
5.1 Structural Similarity Metric
The similarity Sim is calculated based on the reduced
structures obtain in Part 3 of the RA. However it is
important how nodes are grouped in sequences in the
reduced schema. A more exact way to determine
structural equivalence must consider the cardinality of
each node and its correlation to other nodes as part of
a sequence. Thus, based on reduced schemas obtained
in Part 2 of RA we compute similarity SimStr of each
structure from XRTS1 with a structure from XRTS2
that contains the corresponding nodes. SimStr de-
termines how similar a structure str1 (sequence or
choice) is to a structure str2 considering the nodes’
cardinality, structure cardinality, and number of nodes
in the structure str1.
SimStr
str1str2
=
m1
i=1
SimStr
str1istr2j
m1
ε
str1str2
100%,
if str1 or str2 are sequences
Sim
str1str2
, otherwise
(10)
The value m1 represents the number of inner struc-
tures (sequences, choices, or simple nodes) str1i in
structure str1 such that str1i
str1j = Ø for any 1
i, j m1, i 6= j. If a node from str1 has a correspon-
dent in str2, then: first the equivalence of nodes is
evaluated based on their cardinality, and second it is
multiplied by the equivalence of structures cardinal-
ity. The value ε
str1str2
determines the equivalence
between structures cardinality. A structure, for exam-
ple str1, can be represented by a single node. In this
case, SimStr evaluates the similarity of this node with
a node from str2. The required operator R is implied
whenever there is no other operator for a node or a
structure. If both structures str1 and str2 are simple
nodes the similarity value for them is depicted in Ta-
ble 1. If one node is a choice structure the formula of
SimChoice is used:
SimStr
str1str2
= Sim
str1str2
=
SimChoice
str1str2
ε
str1str2
. Note that in this case
ε
str1str2
can be different than 1 if different operators
are associated with the choice structures.
SimStr values are interpreted as follows. If
SimStr
str1str2
= 100% and SimStr
str2str1
= 80%,
then it means that structure str1 is included in str2.
Str2 has either (1) additional nodes, sequences, or
choices; (2) additional alternatives in its choice nodes;
or (3) more general operators for nodes
3
.
For example, consider the structures defined in
Figure 5. In example (a) structures str1 and str2 are
sequences, with str1 containing two nodes: a+ and b,
and str2 having three nodes a, b, and c. The similarity
value for them is calculated as follows:
SimStr
(a)str1str2
= (SimStr
(a)str11str21
+
SimStr
(a)str12str22
)/2 ε
++
100%
SimStr
(a)str1str2
= (ε
a+a
+ ε
bb
)/2 ε
++
100%
3
The most general operator is *; the operator + is more
general than R but not than ? and *; the optional operator ?
is more general than R
ICEIS 2007 - International Conference on Enterprise Information Systems
56
Using the example values from Table 2 for nodes
inclusion, node a+ from str1 is 50% equivalent with
node a from str2, and nodes b are 100% equivalent.
Thus, SimStr
(a)str1str2
= 75%. This means that 75%
of the structure str1 is included in the structure str2.
To determine the inclusion of structure str2 in str1 we
calculate SimStr
(a)str2str1
.
SimStr
(a)str2str1
= (SimStr
(a)str21str11
+
SimStr
(a)str22str12
+ SimStr
(a)str23str13
)/3 ε
++
Structure str13 does not exists, so
SimStr
(a)str23str13
= 0. Thus, SimStr
(a)str2str1
=
1+1+0
3
1 100% = 66.67%. This means that 66.67%
of structure str2 is found in structure str1.
In example (b) from Figure 5, the structure str1
contains two substructures: a sequence str11 made of
two nodes and str12 made of one node. Similarly, the
structure str2 has a sequence and a node. The simi-
larity value is calculated for one sequence at a time.
SimStr
(b)str1str2
= (SimStr
(b)str11str21
+
SimStr
(b)str12str22
)/2 ε
++
100%
Thus, SimStr
(b)str1str2
= ((ε
a+a
+ ε
bb
)/2
ε
++
+ε
cc
)/2ε
++
100% = 62.50%, states that
62.50% of str1 is included in str2. SimStr
(b)str2str1
is computed similarly and is equal to 100%. Both
structures str1 and str2 have the same number of
nodes and for each node in one structure there is an
equivalent node in the other. The difference in the
similarity values is given by the expression operators
making str1 a more general structure than str2.
In example (c) from Figure 5, both structures are
formed by three nodes but grouped differently in se-
quences. In str2 the nodes a and b are grouped in
a repeatable sequence. In str1 the nodes a+ and
b are not separated by c but it can be considered
that there is a required sequence that groups them in
str1 : ((a+,b), c). This gives the advantage of com-
paring the two sequences containing the nodes a and
b and give a better similarity value between str1 and
str2. Conversely, if str1 is compared to the structure
(a,b,c) the nodes a+ and b must not be grouped sepa-
rately as both structures have only a simple sequence.
SimStr
(c)str1str2
= ((ε
a+a
+ ε
bb
)/2 ε
R+
+
ε
cc
)/2 ε
R→∗
100% = 83.33%
SimStr
(c)str2str1
= ((ε
aa+
+ ε
bb
)/2 ε
+R
+
ε
cc
)/2 ε
∗→R
100% = 30%
Table 2: Example of nodes equivalence (φ = non-existent
node).
ε
xy
R ? + * φ
R 1 1 1 1 0
? 0.5 1 0.4 1 0
+ 0.5 0.2 1 1 0
* 0.4 0.5 0.9 1 0
Str1 is a flat structure compared to str2 which con-
tains other nested structures. Str1 is found in str2 in
terms of 83.33%, while str2 is found in str1 in terms
of 30%. They both contain the same nodes but are
grouped differently. The difference in percentages is
generated by (1) the nested sequence (a,b)+ com-
pared with a+,b, and (2) the * operator.
5.2 Xequiv Applied to Nested and
Non-Nested Structures
Consider the examples from Figure 6. The connec-
tion between employees who work on projects is pre-
served either using references (examples (a) and (c)),
either through a nested structure Figure 6(d). The
example from Figure 6(b) provides some connection
between employees and projects but without check-
ing the foreign key integrity. Are they all equivalent?
There is no mechanism in the literature to clearly
compare them and determine their equivalence. This
section is dedicated to solving this problem.
If examples (a) and (d) from Figure 6 are com-
pared, the equivalence algorithms find the node em-
ployee.pid is an extra node in example (a), thus re-
ducing the equivalence measure of the two struc-
tures. Since a corresponding node to employee.pid
is not necessary in the former structure to link an em-
ployee to a project as this is done by the nested fea-
ture we have two options to remedy this drawback.
The first option is to eliminate nodes which represent
references such as employee.pid. Unfortunately, this
wrongly determines structures (a) and (b) that have
the same nodes to be 100% equivalent even though
(b) is missing an important reference. Another op-
tion is to add fake references in nested structures in
the preparation part of the RA. For example, in the
structure (d) we could add either pid in the employee
node or eid in the project node, thereby generating
two alternative structures that have different equiva-
lence measures to structure (a). As we do not know if
structure (d) is compared to (a) or (c), we must add a
reference node that will determine the same similar-
ity value between (a) and (d) as between (c) and (d).
Thus, we define an additional reduction rule that takes
care of references.
(a) str1 : (a+,b)+ str11 : a+, str12 : b
str2 : (a, b,c)+ str21 : a, str22 : b,str23 : c
(b) str1 : ((a+, b)+,c+)+ str11 : (a+,b)+, str12 : c+
str2 : ((a, b)+,c)+ str21 : (a, b)+,str22 : c
(c) str1 : (a+,b, c) str11 : (a, b),str12 : c
str2 : ((a, b)+,c) str21 : (a, b)+,str22 : c
Figure 5: Determining sequence equivalence for multiple
sequences.
XML SCHEMA STRUCTURAL EQUIVALENCE
57
Rule 8 The link between a structure S2 with the
primary key KEY2 nested inside another structure S1
with the primary key KEY1 is preserved by adding
a choice reference structure formed by primary keys
KEY1||KEY2 inside the nested structure.
Rule 8 is contained in the preparation part of the
RA and is applied at the end of Part 1 when sequences
are still nested. References are included in the in-
ner structure to borrow its operator, thereby to pre-
serve the cardinality of the nested structure. If the
outer structure S1 contains only S2 and no additional
elements but is part of a structure S0 with primary
key KEY0, then the choice structure KEY0||KEY2 is
added inside S2.
Referring to example (d) from Figure 6, the refer-
ence node is formed by (eidREF||pidREF) as eid and
pid are primary keys.
company (employee (eid, project (pid, de-
scription), name)
R8
company (employee (eid,
(eidREF||pidREF, pid, description), name)
We represent the choice structure for references
using a double line || as this is evaluated differently
from a regular alternative construction. Only one ele-
ment of the alternative structure for references is go-
ing to be found (if any) in the other schema. Thus,
contrary to SimChoice previously defined, that must
determine how many alternative options from one
schema are found in the other schema, SimChoiceRef
must evaluate if there is any corresponding reference
(see Equation 11). Thus, only one ε
x
i
y
j
is greater
than zero from the components of the maximum func-
tion, where x
i
and y
j
are reference alternatives from
XRTS1 and XRTS2, respectively.
SimChoiceRe f = Max(ε
x
i
y
j
) ε
REF
100% (11)
A correct equivalence evaluation must also consider
(a) <!ELEMENT company (employee, project)>
<!ELEMENT employee (eid, pid, name)>
<!ELEMENT project (pid, description)>
eid and project.pid primary keys,
employee.pid is keyref to project.pid
(b) <!ELEMENT company (employee, project)>
<!ELEMENT employee (eid, pid, name)>
<!ELEMENT project (pid, description)>
(c) <!ELEMENT company (employee, project)>
<!ELEMENT employee (eid, name)>
<!ELEMENT project (pid, eid, description)>
employee.eid and pid are primary keys,
project.eid is keyref to employee.eid
(d) <!ELEMENT company (employee)>
<!ELEMENT employee (eid, project, name)>
<!ELEMENT project (pid, description)>
eid and project.pid primary keys
Figure 6: Simple possible equivalent schemas.
the existence of primary keys. If XRTS3 is defined as
Employee(eid), with eid primary key and XRTS4 as
Employee(eid), they are not 100% equivalent. Thus,
we revise the similarity formula (Equation 10) to mul-
tiply the node equivalence to the key equivalence for
primary keys. For example if eid is a primary key its
equivalence is the product ε
eid
ε
KEY
, where ε
KEY
= 1
if both nodes are primary keys, and < 1 if only one of
them is a primary key.
The examples from Figure 6, are reduced to the
structures detailed in Figure 7 with the KEY suffix
for primary keys and REF for references. By defining
ε
KEY
= 0.7 and ε
REF
= 0.6 and using the operators
equivalence defined in Table 2, schema (a) is simi-
lar to the the rest of the schemas in the proportions
presented in Figure 8, where x represents a generic
schema and ε
eid
is the short form for ε
(a)eid(x)eid
.
6 EXAMPLE
RA applied to examples from Figures 1 and 2 gener-
ates the following output.
XRTS1: company1((eid | sin, name, address,
(pidREF* | task
nameREF)+) +, (pidKEY, descrip-
tion, budget | manager | location)*, (task
nameKEY,
date)+)
XRTN1: company1(eid+ | sin+, name+, address+,
pidREF* | task
nameREF+, pidKEY*, description*,
budget* | manager* | location*, task
nameKEY+,
date+)
XRTS2: company2(eidKEY, sin, name, address*, da-
teOfBirth?, (pidKEY, description?, manger| location,
eidREF || pidREF, (task, date)+)*)+
XRTN2: company2(eidKEY+, sin+, name+, ad-
dress*, dateOfBirth*, pidKEY*, description*, man-
ager* | location*, eidREF* || pidREF*, task*, date*)
We start by comparing XTRN trees to determine
if they are from the same domain. Consider the opera-
tors equivalence detailed in Table 2. We determine the
node’s similarity from the two schemas using Equa-
tion 2.
Sim
XRTN1XRTN2
= (
Max(ε
eid++
,ε
sin++
)
2
+
ε
name++
+ ε
address+→∗
+ (ε
pid∗→∗
+
ε
task
name+task
)/2 + ε
pid∗→∗
+ ε
description∗→∗
+
(ε
budget∗→φ
+ ε
manager∗→∗
+ ε
location∗→∗
)/3 +
ε
task
name+task
+ ε
date+→∗
)/9 100% = 96.20%
(a)company(eidKEY, pidREF, name, pidKEY, description )
(b)company(eid, pid, name, pid, description)
(c)company(eidKEY, name, pidKEY, eidREF, description)
(d)company (eidKEY, eidREF||pidREF, pid, description, name)
Figure 7: Reduced schemas.
ICEIS 2007 - International Conference on Enterprise Information Systems
58
Sim
XRTN2XRTN1
= (Max(ε
eid++
,ε
sin++
) +
ε
name++
+ε
address∗→+
+ε
dateOfBirth∗→φ
+ε
pid∗→∗
+
ε
description∗→∗
+ (ε
manager∗→∗
+ ε
location∗→∗
)/2 +
Max(ε
eid∗→φ
,ε
pid∗→∗
) + ε
task∗→task
name+
+
ε
date∗→+
))/12 100% = 68.83%
Both values are high enough to suggest that there
are common nodes between XRTN1 and XRTN2.
The next step evaluates the similarity between the
structures of XRTS1 and XRTS2. To optimize the
computation of the structural similarities we use the
references determined in Part 3 and include them ac-
cordingly into sequences.
SimStr
XRTS1XRTS2
= ((
Max(ε
eidRR
,ε
sinRR
)
2
+
ε
nameRR
+ ε
addressR→∗
+ (ε
pid∗→∗
ε
REF
+ 0)/2
ε
++
)/4 ε
++
+(ε
pidRR
ε
KEY
+ε
decriptionR?
+
(ε
budgetRφ
+ ε
managerRR
+ ε
locationRR
)/3)/3
ε
∗→∗
ε
R+
+ (ε
task
nameRtaskR
ε
task nameKEY
+
ε
dateRR
)/2ε
++
ε
R→∗
ε
R+
)/3100% = 66%
SimStr
XRTS2XRTS1
= ((Max(ε
eidRR
ε
KEY
,ε
sinRR
) + ε
nameRR
+ ε
address∗→R
+
ε
dateOfBirth?φ
) ε
++
+ ((ε
pidRR
ε
KEY
+
ε
description?R
+ (ε
managerRR
+ ε
locationRR
)/2 +
ε
pid∗→∗
ε
pidREF
) ε
∗→∗
ε
+R
+ (ε
taskRR
+
ε
dateRR
)/2 ε
++
ε
∗→R
ε
+R
)/5)/6 100% =
46.50%
The nodes similarity values Sim show that both
schemas are from the same domain and refer to the
same set of entities (employee, projects, and task) as
they have many correspondent nodes. However, the
structural similarity values show that they are orga-
nized significantly different. XSD1 is less general
than XSD2 as more of its structure is included in
XSD2 (SimStr
XRTS1XRTS2
> SimStr
XRTS2XRTS1
).
7 CONCLUSION AND FUTURE
WORK
Our approach finds equivalent XML schema struc-
tures by determining if their XML trees are equiv-
alent. Xequiv first determines if schemas are from
the same domain and if there is any similarity be-
tween their nodes regarding labels, data types and
SimStr
(a)(x)
= (ε
eid
ε
eidKEY
+ ε
pid
ε
pidREF
+ ε
name
+
ε
pid
ε
pidKEY
+ ε
description
)/5 100%
SimStr
(a)(b)
= (0.7+ 0.6+ 1+ 0.7+ 1)/5 100% = 80%
SimStr
(a)(c)
= (1+ 0+ 1+ 1+ 1)/5 100% = 80%
SimStr
(a)(d)
= (1+ 1+ 1+ 1+ 1)/5 100% = 100%
Figure 8: Structural similarity of reduced schemas.
operators. Secondly, our algorithm focuses on struc-
tural organization and considers the number of nodes
in structures, operators applied to sequences, nested
or linked structures. The elimination of the non-leaf
nodes using the Reduction Algorithm (Duta et al.,
2006) makes the nodes path unimportant. This has
the advantage of allowing schemas to be equivalent
because they refer to the same entity attributes but not
necessarily because they share a part of the XML tree.
Further research needs to be conducted to asses the
efficiency of Xequiv compared to other existent algo-
rithms in the area.
REFERENCES
Bertino, E., Guerrini, G., and Mesiti, M. (2004). A match-
ing algorithm for measuring the structural similarity
between an XML document and a dtd and its applica-
tions. Journal of Information Systems, 29(1):23–46.
Boyer, J. (2001). Canonical xml version 1.0, w3c
recommendation, white paper. Available at
http://www.w3.org/TR/xml-c14n (Last searched
on November 18, 2006).
Consortium, W. W. W. (2004). XML Schema part 0, 1, and
2.
Do, H. H., Melnik, S., and Rahm, E. (2003). Comparison
of schema matching evaluations. pages 221–237.
Do, H. H. and Rahm, E. (2002). COMA - a system for
flexible combination of schema matching approaches.
In Proceedings of the 28th VLDB Conference, pages
610–621.
Duta, A., Barker, K., and Alhajj, R. (2006). Xml schema
reduction algorithm. In Proceedings of the Tenth East-
European Conference on Advances in Databases and
Information Systems ADBIS’06.
Laboratory, C. S. (2005). Wordnet. Available at
http://www.cogsci.princeton.edu (Last searched on
November 18, 2006).
Lee, M. L., Yang, L. H., Hsu, W., and Yang, X. (2002).
Xclust: clustering xml schemas for effective integra-
tion. In CIKM ’02: Proceedings of the eleventh in-
ternational conference on Information and knowledge
management, pages 292–299, New York, NY, USA.
ACM Press.
Madhavan, J., Bernstein, P. A., and Rahm, E. (2001).
Generic schema matching with cupid. In VLDB ’01:
Proceedings of the 27th International Conference on
Very Large Data Bases, pages 49–58, San Francisco,
CA, USA. Morgan Kaufmann Publishers Inc.
Nierman, A. and Jagadish, H. V. (2002). Evaluating struc-
tural similarity in XML documents. In Proceed-
ings of the Fifth International Workshop on the Web
and Databases (WebDB 2002), Madison, Wisconsin,
USA.
Salminen, A. and Tompa, F. W. (2001). Requirements for
xml document database systems. In DocEng ’01:
Proceedings of the 2001 ACM Symposium on Docu-
ment engineering, pages 85–94, New York, NY, USA.
ACM Press.
XML SCHEMA STRUCTURAL EQUIVALENCE
59