XML SCHEMA STRUCTURAL EQUIVALENCE

Angela C. Duta, Ken Barker and Reda Alhajj

Department of Computer Science, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, Canada

Keywords:

Schema equivalence, XML structures.

Abstract:

The Xequiv algorithm determines when two XML schemas are equivalent based on their structural organi-

zation. It calculates the percentages of schema inclusion in another schema by considering the cardinality

of each leaf node and its interconnection to other leaf nodes that are part of a sequence or choice structure.

Xequiv is based on the Reduction Algorithm (Duta et al., 2006) that focuses on the leaf nodes and eliminates

intermediate levels in the XML tree.

1 INTRODUCTION

Much work has been done in the XML schema equiv-

alence area ((Do et al., 2003), (Do and Rahm, 2002),

(Lee et al., 2002), (Madhavan et al., 2001), (Nierman

and Jagadish, 2002)) that is applied optimally in only

some situations. We propose an approach that ﬁnds

equivalent XML schemas from the same domain (the

same entities and attributes) that have different tree

organizations. The difﬁculty of comparing and ﬁnd-

ing matchable schemas arises for two reasons: (1)

there are three data storage units in XML: elements,

attributes, and text content, and (2) the hierarchical

features of the XML structure. XML schema equiva-

lence must be evaluated from three perspectives: (1)

hierarchical structure (structural equivalence), (2) el-

ements and attributes data types (syntactic equiva-

lence), and (3) elements and attributes names (seman-

tic equivalence).

This paper focuses on determining the structural

equivalence of XML schema by using reduced XML

trees generated by the Reduction Algorithm (RA)

(Duta et al., 2006) . In the reduced XML trees the

three data storage units (element, attribute, and text

content) are transformed into a single storage unit: the

element node (also called the node). RA eliminates

intermediate organizational nodes from each XML

schema so that a comparison between them is efﬁ-

cient. The reduced XML schema contains only infor-

mation about leaf nodes: data types, labels, number

of occurrences, and interconnections between them.

Our argument for using reduced XML trees is that

leaf nodes are the important nodes as they store the

data in XML ﬁles. Higher level nodes represent a

subjective hierarchical organization that allows an in-

telligible reading of the information stored in leaves.

From this perspective our approach is contrary to the

assumption “elements at higher levels ... are more rel-

evant than subelement deeply nested” (Bertino et al.,

2004) used by some methods (Bertino et al., 2004).

The purpose of this paper is to deﬁne a new

method for optimizing the schema structure equiva-

lence process that applies to schema trees of similar

or different organization. A classiﬁcation of XML

trees from the structural perspective is (1) similar tree

structures that use different data storage units and

(2) different tree structures that use different order,

grouping and/or nesting of subelements within a par-

ent element. All approaches published to date focus

on similar tree structures and do not address schema

equivalence for different tree structures. The nov-

elty of our method is to determine structural match-

ing based on the equivalent leaves content rather than

contexts and vicinities. A leaf content is deﬁned by

(1) data type and (2) number of minimum and max-

imum occurrences. Our approach ﬁnds equivalent

XML schemas in all situations detailed above as long

as the minimum information is provided to ﬁnd a

C. Duta A., Barker K. and Alhajj R. (2007).

XML SCHEMA STRUCTURAL EQUIVALENCE.

In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 52-59

DOI: 10.5220/0002357700520059

 SciTePress

match (labels and data types).

Paper Organization Following this two DTD ex-

amples are detailed that have different tree structures

but refer to the same entities: employees, projects,

and tasks. Section 2 brieﬂy discusses several devel-

oped methods for schema equivalence. Our approach

is presented starting with Section 3 that ﬁrst summa-

rizes the RA and then details Xequiv. An example

for Xequiv is depicted in Section 6. This paper draws

some conclusions in Section 7.

Motivating Examples Figures 1 and 2 illustrate

two simple examples of DTDs that store data about

employees, projects, and tasks for a company. The

element data type deﬁnitions have not been included.

No mechanism has yet appeared in the literature to

clearly compare these XML schemas and decide if

they are equivalent. This paper presents the Xequiv

algorithm that structurally compares differently orga-

nized XML trees from the same domain.

2 RELATED WORK

Following Salminen and Tompa’s suggestion (Salmi-

nen and Tompa, 2001) that the canonical forms for

XML recommended by W3C (Boyer, 2001) must be

further researched to solve the XML schema equiva-

lence problem, much work has been done in this area.

The generic schema matching algorithm Cupid (Mad-

havan et al., 2001) focuses on leaf nodes using auto-

matic linguistic matching (elements’ name) and struc-

tural matching (schema structure, path matching, con-

straints, and element data types). The similarity be-

tween two DTDs is evaluated (Lee et al., 2002) from

three perspectives: (1) semantic similarity (similarity

between node labels, constraints and path context (as-

cendants)), (2) immediate descendant similarity, and

(3) leaf context similarity. Constraints such as +, *, ?,

or none are given weights of similarity. This work is

<!ELEMENT company1 (employee+, project*, task+)>

<!ELEMENT employee (eid | sin, name, (pid* | task name)+) >

<!ATTLIST employee address CDATA #REQUIRED>

<!ELEMENT project (pid, description, budget | manager |

location )>

<!ELEMENT task(task name, date)>

keys: project.pid, task.task name

references: employee.pid to project.pid,

employee.task name to task.task name

Figure 1: Repeated employee, project, and task elements.

similar to ours in that it addresses some DTD trans-

formation rules also adopted by us.

A collection of documents with DTD’s from the

same domain is divided into sets of similar DTDs

based on the minimum edit distances (Nierman and

Jagadish, 2002). The edit distance is calculated using

dynamic programming as the minimum cost to trans-

form a tree A into B. This method works for docu-

ments with DTDs having the same tree structure but

it cannot be applied to trees that have a signiﬁcant dif-

ferent structure even though they refer to the same do-

main. COMA (Do and Rahm, 2002) combines several

simple and hybrid matching algorithms. The simple

algorithms refer to one aspect in DTD: labels, data

types, or user input. Our approach extends the struc-

tural matchers Children and Leaves by combining and

generalizing them to any type of node (repeated, op-

tional, alternative options, key, reference, etc.)

3 THE XEQUIV ALGORITHM

3.1 The Reduction Algorithm (RA)

RA (Figure 3) addresses multiple data storage units

and hierarchical organization in XML. An XML el-

ement stores data in a text unit, attributes, and/or

subelements. Each data unit is represented by a node.

Thus, we easily distinguish between an empty ele-

ment and a text-only element (element types used

accordingly with the W3C standard (Consortium,

2004)) because the ﬁrst has an element node and sev-

eral attribute subnodes as data units, while the second

uses an element node, a text subnode and several at-

tribute subnodes (for more details refer to (Duta et al.,

2006) ). RA is based on seven rules that convert the

node types of the source structure (element, attribute,

text) into a single node type (element) and eliminates

intermediate tree levels.

<!ELEMENT company2 (employee+)>

<!ELEMENT employee (sin, name, address*, dateOfBirth?,

projects?)>

<!ATTLIST employee eid CDATA #REQUIRED>

<!ELEMENT projects (project+)>

<!ELEMENT project (description?, manager | location, task+)>

<!ATTLIST project pid CDATA #REQUIRED>

<!ELEMENT task (#PCDATA)>

<!ATTLIST task date CDATA #REQUIRED>

keys: employee.eid, project.pid

references: project.manager to employee.eid

Figure 2: Nested structure of employee, project and task

elements.

XML SCHEMA STRUCTURAL EQUIVALENCE

1. Part 1: Transform text and attribute nodes

2. Part 2: Create the XRTS tree by eliminating

intermediate nodes.

3. Part 3: Create XRTN tree by transferring

the sequence multiplicator (+, *) or optional

indicator ? to each node in the sequence.

Figure 3: The Reduction Algorithm.

The ﬁrst two parts of RA eliminate nodes but pre-

serve the initial constraints of the XML schema cre-

ating an XML reduced tree using sequences (XRTS).

Part 3 transfers the outer expression operators (?, +,

*) to inner elements creating an XML reduced tree to

leaf nodes (XRTN). Part 3 generates some informa-

tion loss regarding element occurrences restricted to

occurrences of other elements but allows a fast ﬁrst

evaluation of schemas similarity.

3.2 The XML Schema Equivalence

Xequiv Algorithm

The purpose of Xequiv is to ﬁnd Xml schemas that

are similar in terms of leaf content (see Figure 4). To

compare leaf contents the source schemas must be re-

duced using RA so that the intermediate nodes are

eliminated. Xequiv focuses only on the nodes that

store data and compares leaf nodes which are of only

one type: element nodes. A leaf content shows how

much data is stored in the XML data ﬁle based on the

node deﬁnition.

leaf content = <data type, leaf cardinality

(minOccurences .. maxOccurences)>

For example, the structure tasks(task

name?)

with task

name of type string has the leaf con-

tent task

name

content

=< string,0..1 >. The node

task

name is a perfect match with the node job that

has the leaf content job

content

=< string,0..1 > and

1. Reduce schemas using RA and create XRTS1,

and XRTS2 in Part 2, and XRTN1 and XRTN2 in

Part 3 from the source schemas XSD1 and XSD2,

respectively.

2. Determine the equivalence values

Sim

XRTN1→XRTN2

and

Sim

XRTN2→XRTN1

. If they are greater than

a predefined threshold, then they are somewhat

equivalent and we proceed to Step 3 to determine

the structural equivalence.

3. Compare XRTS1 and XRTS2 by finding a match for

each structure (sequence,

all

choice

). A

node part of a structure or a series of nested

structures in XRTS1 is equivalent to a node in

XRTS2 which is part of similar series of nested

structures.

Figure 4: The Xequiv Algorithm.

only α% similar (see Table 1) with the job name node

where job

name

content

=< string,1..1 >.

We recommend using XML Schema as it allows

a variety of data types and cardinality values. In this

paper we use DTD only to present schemas

in a com-

pact way but our algorithm is based on XML Schema

as it has a larger variety of data types and features (e.g.

primary and foreign keys for attributes and elements).

Two nodes are selected for comparison based on

their data types and labels. Equivalent nodes must

have similar data types and matchable labels, which

refer to the same concept either using the same words,

abbreviations, or synonyms. WordNet (Laboratory,

2005) retrieves efﬁciently the set of synonyms for any

label. We consider that synonyms and abbreviations

are 100% equivalent as we determine structural and

not ontology equivalence (contrary to XClust (Lee

et al., 2002) that determines levels of equivalence

for abbreviations). We use the information provided

by labels to select candidate nodes. This is accom-

plished by the function Θ that determines if two nodes

have a similar label and their data types are from

compatible classes. Consider node N

deﬁned by

(type T

,label L

) from XRT1

and node N

)

from XRT2.

Θ(N

) =



1 , if T

≡ T

and L

≡ L

0 , otherwise

(1)

4 NODES SIMILARITY

4.1 Similarity Metric for Simple Nodes

To determine if two schemas are structurally equiva-

lent, Xequiv ﬁrst evaluates their leaf nodes similarity.

This provides a fast evaluation that separates schemas

into different domains. Using the reduced schemas

obtained at Part 3 of RA, Xequiv identiﬁes for each

node a matching node and determines the measure

of inclusion between them. Consider the structures

str1 : (a) and str2 : (a+). Structure str1 requires node

a to appear exactly one time, while str2 requires node

a to occur several times but at least once in the cor-

responding XML ﬁle. We consider that str1 is in-

cluded in structure str2 because a ⊂ a+, and, thus,

R ⊂ +. Also, R ⊂? because ? admits two statuses:

present one time (like R - required) or non-present.

The operator ? ⊂ ∗ as ∗ allows in addition the node to

occur multiple times. Similarly, inclusion hierarchies

As a result we take a few liberties with DTD nomencla-

ture in our examples and we will indicate where they occur.

XRT is a general term that refers to the XML reduced

tree, either XRTS or XRTN

ICEIS 2007 - International Conference on Enterprise Information Systems

are determined between all operators: R ⊂? ⊂ ∗ and

R ⊂ + ⊂ ∗.

The inclusion of a structure str1 into a structure

str2 is based on the inclusion of each node from str1

into a single node in str2. The inclusion ε

x→y

of a

node x from str1 into another node y from str2 if

Θ(x,y) = 1 is based on the inclusion of their expres-

sion operators For nodes with the same operator the

inclusion measure ε = 1. If the operator of the x node

is included in the operator of the node y then ε

x→y

= 1.

Otherwise, the node x is included in y with a lower

percentage (see Table 1). Values of α, β, γ, δ, ε, ρ, and

σ represent the inclusion percentage, and 0 <= α, β,

γ, δ, ε, ρ, and σ < 1. It is very important how these

values are set as they are directly correlated with the

minimum threshold set for schema equivalence. Note

that Table 1 is asymmetrical as the node a? ⊂ a∗ and,

thus, ε

a?→a∗

= 1 but ε

a∗→a?

< 1. If a node x from str1

does not have a correspondent in str2 then the inclu-

sion factor is 0.

We deﬁne the similarity function for an XML

reduced schema XRTN1 with n1 nodes to another

schema XRTN2 with n2 elements based on the nodes

inclusion. We assume that for each node x from

XRTN1 there exists at most one node y in XRTN2

such that Θ(x,y) = 1.

Sim

XRTN1→XRTN2

∑

i=1

→y

∗ 100%,1 ≤ j ≤ n2

(2)

The similarity Sim is an asymmetrical function.

Sim

XRTN1→XRTN2

expresses how much of the struc-

ture of XRTN1 is included in XRTN2. Note that

Sim

XRTN1→XRTN2

is different from Sim

XRTN2→XRTN1

if (1) there are nodes in one structure that do not have

a match in the other structure, and (2) different oper-

ators are used for nodes with Θ = 1.

4.2 Similarity Metric for Choice Nodes

The Sim metric considers each node and its match.

The values provided by Table 1 work for simple nodes

but not for nodes formed by several alternatives (the

choice nodes or structures). The choice structure is

formed by several mutually exclusive nodes. The

Table 1: Operators inclusion percentages ε

x→y

(φ = non-

existent node).

x→y

R ? + * φ

R 1 1 1 1 0

? α 1 β 1 0

+ γ δ 1 1 0

* ε ρ σ 1 0

metric SimChoice must evaluate the similarity of two

choice nodes by considering the number of alterna-

tives and also the similarity between a choice node

and a simple node. The equivalence metric must be

no more than 1 (like nodes inclusion) and differenti-

ate between different number of alternatives. We con-

sider each situation below.

4.2.1 Similarity between Two Choice Structures

Consider the node x formed by several alternative

nodes x = (x

|..|x

) in XRTN1. To evaluate how

similar is node x from XRTN1 to a choice node

y in XRTN2 y = (y

|..|y

) we assume that alterna-

tives are ordered in both nodes such that Θ(x

) =

Θ(x

) = .. = 1.

SimChoice

x→y

∑

i=1

→y

(3)

If ∃k such that Θ(x

) = 0, then ε

→y

0. If n < m, then there are alternatives in x

with no correspondent in y and SimChoice

x→y

< 1.

SimChoice

x→y

= 1 if each alternative from x has a

correspondent in y with the same or a more general

expression operator.

4.2.2 Similarity between a Choice and Multiple

Simple Nodes

Consider the choice node (x|y) in XRTN1 and the se-

quence (x, y) in XRTN2. (x|y) represents a single node

so it must be similar to one node only from XRTN2.

But as both alternatives x and y from XRTN1 have

a correspondent in XRTN2 the one that maximizes

the similarity function based on cardinality matching

must be chosen.

SimChoice

(x|y)→(x,y)

= Max(SimChoice

(x|y)→x

SimChoice

(x|y)→y

) (4)

Thus, SimChoice

(x|y)→(x,y)

= Max(

Conversely, SimChoice

(x,y)→(x|y)

must be evalu-

ated using the similarity metric for each simple node

SimChoice

x→(x|y)

and SimChoice

y→(x|y)

SimChoice

(x,y)→(x|y)

= Max(SimChoice

x→(x|y)

SimChoice

y→(x|y)

) (5)

Thus, SimChoice

(x,y)→(x|y)

= Max(ε

,ε

4.2.3 Similarity between One Choice Structure

and Multiple Choice Structures

Consider the alternative structure (x|y|z|t) in XRTN1

and two alternative structures (x|y) and (z|t) in

XRTN2. In XRTN1 there is a single node with four

alternatives and it has two corresponding nodes in

XML SCHEMA STRUCTURAL EQUIVALENCE

XRTN2 each with two alternatives. A single node

from XRTN2 must correspond to a node from XRTN1

so the one that maximizes the similarity function must

be chosen.

SimChoice

(x|y|z|t)→((x|y),(z|t))

Max(SimChoice

(x|y|z|t)→(x|y)

,SimChoice

(x|y|z|t)→(z|t)

) (6)

Thus, SimChoice

(x|y|z|t)→((x|y),(z|t))

Max(

+ε

)

In summary, a choice node x(x

,...x

) from

XRTN1 is equivalent to at most one node in XRTN2.

If the alternatives from x exist in XRTN2 either

as simple nodes y

,..y

or as alternatives that are

grouped in several choice nodes y

k+1

,..y

the simi-

larity measure chooses the node y that is the “most”

equivalent to x.

SimChoice

x→(y

,..y

)

= Max(SimChoice

x→y

,..,

SimChoice

x→y

) =

Max

i=1

(ε

)

(7)

Thus, the similarity value between a schema XRTN1

with n1 nodes and a schema XRTN2 formed by n2

nodes is:

Sim

XRTN1→XRTN2

∑

i=1

Sim

→y

∗100%,1 ≤ j ≤ n2

(8)

The equivalence value Sim for a node x for which

there exists at least a node y in XRTN2 such that

Θ(x,y) = 1 is deﬁned as follows:

Sim

x→y











x→y

, if x and y

are simple nodes

SimChoice

x→y

∗ ε

x→y,

, if x or y

is a choice node

(9)

The value ε

x→y

for choice nodes determines the equiv-

alence between the operators applied to the choice

structures making the difference, for example, be-

tween (x

) and (y

)+. Note that ε

x→y

for XRTN

schemas is always 1 as there is no outer operator for

choice structures. The choice operator is combined

with alternative nodes’ operators in Part 3 of RA as

described in Section 3.

5 STRUCTURAL SIMILARITY

5.1 Structural Similarity Metric

The similarity Sim is calculated based on the reduced

structures obtain in Part 3 of the RA. However it is

important how nodes are grouped in sequences in the

reduced schema. A more exact way to determine

structural equivalence must consider the cardinality of

each node and its correlation to other nodes as part of

a sequence. Thus, based on reduced schemas obtained

in Part 2 of RA we compute similarity SimStr of each

structure from XRTS1 with a structure from XRTS2

that contains the corresponding nodes. SimStr de-

termines how similar a structure str1 (sequence or

choice) is to a structure str2 considering the nodes’

cardinality, structure cardinality, and number of nodes

in the structure str1.

SimStr

str1→str2







∑

i=1

SimStr

str1i→str2j

∗ ε

str1→str2

∗ 100%,

if str1 or str2 are sequences

Sim

str1→str2

, otherwise

(10)

The value m1 represents the number of inner struc-

tures (sequences, choices, or simple nodes) str1i in

structure str1 such that str1i

str1j = Ø for any 1 ≤

i, j ≤ m1, i 6= j. If a node from str1 has a correspon-

dent in str2, then: ﬁrst the equivalence of nodes is

evaluated based on their cardinality, and second it is

multiplied by the equivalence of structures cardinal-

ity. The value ε

str1→str2

determines the equivalence

between structures cardinality. A structure, for exam-

ple str1, can be represented by a single node. In this

case, SimStr evaluates the similarity of this node with

a node from str2. The required operator R is implied

whenever there is no other operator for a node or a

structure. If both structures str1 and str2 are simple

nodes the similarity value for them is depicted in Ta-

ble 1. If one node is a choice structure the formula of

SimChoice is used:

SimStr

str1→str2

= Sim

str1→str2

SimChoice

str1→str2

∗ ε

str1→str2

. Note that in this case

str1→str2

can be different than 1 if different operators

are associated with the choice structures.

SimStr values are interpreted as follows. If

SimStr

str1→str2

= 100% and SimStr

str2→str1

= 80%,

then it means that structure str1 is included in str2.

Str2 has either (1) additional nodes, sequences, or

choices; (2) additional alternatives in its choice nodes;

or (3) more general operators for nodes

For example, consider the structures deﬁned in

Figure 5. In example (a) structures str1 and str2 are

sequences, with str1 containing two nodes: a+ and b,

and str2 having three nodes a, b, and c. The similarity

value for them is calculated as follows:

SimStr

(a)str1→str2

= (SimStr

(a)str11→str21

SimStr

(a)str12→str22

)/2∗ ε

+→+

∗ 100%

SimStr

(a)str1→str2

= (ε

a+→a

+ ε

b→b

)/2 ∗ ε

+→+

∗

100%

The most general operator is *; the operator + is more

general than R but not than ? and *; the optional operator ?

is more general than R

ICEIS 2007 - International Conference on Enterprise Information Systems

Using the example values from Table 2 for nodes

inclusion, node a+ from str1 is 50% equivalent with

node a from str2, and nodes b are 100% equivalent.

Thus, SimStr

(a)str1→str2

= 75%. This means that 75%

of the structure str1 is included in the structure str2.

To determine the inclusion of structure str2 in str1 we

calculate SimStr

(a)str2→str1

SimStr

(a)str2→str1

= (SimStr

(a)str21→str11

SimStr

(a)str22→str12

+ SimStr

(a)str23→str13

)/3∗ ε

+→+

Structure str13 does not exists, so

SimStr

(a)str23→str13

= 0. Thus, SimStr

(a)str2→str1

1+1+0

∗ 1∗ 100% = 66.67%. This means that 66.67%

of structure str2 is found in structure str1.

In example (b) from Figure 5, the structure str1

contains two substructures: a sequence str11 made of

two nodes and str12 made of one node. Similarly, the

structure str2 has a sequence and a node. The simi-

larity value is calculated for one sequence at a time.

SimStr

(b)str1→str2

= (SimStr

(b)str11→str21

SimStr

(b)str12→str22

)/2∗ ε

+→+

∗ 100%

Thus, SimStr

(b)str1→str2

= ((ε

a+→a

+ ε

b→b

)/2 ∗

+→+

+ε

c→c

)/2∗ε

+→+

∗100% = 62.50%, states that

62.50% of str1 is included in str2. SimStr

(b)str2→str1

is computed similarly and is equal to 100%. Both

structures str1 and str2 have the same number of

nodes and for each node in one structure there is an

equivalent node in the other. The difference in the

similarity values is given by the expression operators

making str1 a more general structure than str2.

In example (c) from Figure 5, both structures are

formed by three nodes but grouped differently in se-

quences. In str2 the nodes a and b are grouped in

a repeatable sequence. In str1 the nodes a+ and

b are not separated by c but it can be considered

that there is a required sequence that groups them in

str1 : ((a+,b), c). This gives the advantage of com-

paring the two sequences containing the nodes a and

b and give a better similarity value between str1 and

str2. Conversely, if str1 is compared to the structure

(a,b,c) the nodes a+ and b must not be grouped sepa-

rately as both structures have only a simple sequence.

SimStr

(c)str1→str2

= ((ε

a+→a

+ ε

b→b

)/2∗ ε

R→+

c→c

)/2∗ ε

R→∗

∗ 100% = 83.33%

SimStr

(c)str2→str1

= ((ε

a→a+

+ ε

b→b

)/2∗ ε

+→R

c→c

)/2∗ ε

∗→R

∗ 100% = 30%

Table 2: Example of nodes equivalence (φ = non-existent

node).

x→y

R ? + * φ

R 1 1 1 1 0

? 0.5 1 0.4 1 0

+ 0.5 0.2 1 1 0

* 0.4 0.5 0.9 1 0

Str1 is a ﬂat structure compared to str2 which con-

tains other nested structures. Str1 is found in str2 in

terms of 83.33%, while str2 is found in str1 in terms

of 30%. They both contain the same nodes but are

grouped differently. The difference in percentages is

generated by (1) the nested sequence (a,b)+ com-

pared with a+,b, and (2) the * operator.

5.2 Xequiv Applied to Nested and

Non-Nested Structures

Consider the examples from Figure 6. The connec-

tion between employees who work on projects is pre-

served either using references (examples (a) and (c)),

either through a nested structure Figure 6(d). The

example from Figure 6(b) provides some connection

between employees and projects but without check-

ing the foreign key integrity. Are they all equivalent?

There is no mechanism in the literature to clearly

compare them and determine their equivalence. This

section is dedicated to solving this problem.

If examples (a) and (d) from Figure 6 are com-

pared, the equivalence algorithms ﬁnd the node em-

ployee.pid is an extra node in example (a), thus re-

ducing the equivalence measure of the two struc-

tures. Since a corresponding node to employee.pid

is not necessary in the former structure to link an em-

ployee to a project as this is done by the nested fea-

ture we have two options to remedy this drawback.

The ﬁrst option is to eliminate nodes which represent

references such as employee.pid. Unfortunately, this

wrongly determines structures (a) and (b) that have

the same nodes to be 100% equivalent even though

(b) is missing an important reference. Another op-

tion is to add fake references in nested structures in

the preparation part of the RA. For example, in the

structure (d) we could add either pid in the employee

node or eid in the project node, thereby generating

two alternative structures that have different equiva-

lence measures to structure (a). As we do not know if

structure (d) is compared to (a) or (c), we must add a

reference node that will determine the same similar-

ity value between (a) and (d) as between (c) and (d).

Thus, we deﬁne an additional reduction rule that takes

care of references.

(a) str1 : (a+,b)+ str11 : a+, str12 : b

str2 : (a, b,c)+ str21 : a, str22 : b,str23 : c

(b) str1 : ((a+, b)+,c+)+ str11 : (a+,b)+, str12 : c+

str2 : ((a, b)+,c)+ str21 : (a, b)+,str22 : c

str2 : ((a, b)+,c)∗ str21 : (a, b)+,str22 : c

Figure 5: Determining sequence equivalence for multiple

sequences.

XML SCHEMA STRUCTURAL EQUIVALENCE

Rule 8 The link between a structure S2 with the

primary key KEY2 nested inside another structure S1

with the primary key KEY1 is preserved by adding

a choice reference structure formed by primary keys

KEY1||KEY2 inside the nested structure.

Rule 8 is contained in the preparation part of the

RA and is applied at the end of Part 1 when sequences

are still nested. References are included in the in-

ner structure to borrow its operator, thereby to pre-

serve the cardinality of the nested structure. If the

outer structure S1 contains only S2 and no additional

elements but is part of a structure S0 with primary

key KEY0, then the choice structure KEY0||KEY2 is

added inside S2.

Referring to example (d) from Figure 6, the refer-

ence node is formed by (eidREF||pidREF) as eid and

pid are primary keys.

company (employee (eid, project (pid, de-

scription), name)

⇒ company (employee (eid,

(eidREF||pidREF, pid, description), name)

We represent the choice structure for references

using a double line || as this is evaluated differently

from a regular alternative construction. Only one ele-

ment of the alternative structure for references is go-

ing to be found (if any) in the other schema. Thus,

contrary to SimChoice previously deﬁned, that must

determine how many alternative options from one

schema are found in the other schema, SimChoiceRef

must evaluate if there is any corresponding reference

(see Equation 11). Thus, only one ε

→y

is greater

than zero from the components of the maximum func-

tion, where x

and y

are reference alternatives from

XRTS1 and XRTS2, respectively.

SimChoiceRe f = Max(ε

→y

) ∗ ε

REF

∗ 100% (11)

A correct equivalence evaluation must also consider

(a) <!ELEMENT company (employee, project)>

<!ELEMENT employee (eid, pid, name)>

<!ELEMENT project (pid, description)>

eid and project.pid primary keys,

employee.pid is keyref to project.pid

(b) <!ELEMENT company (employee, project)>

<!ELEMENT employee (eid, pid, name)>

<!ELEMENT project (pid, description)>

<!ELEMENT employee (eid, name)>

<!ELEMENT project (pid, eid, description)>

employee.eid and pid are primary keys,

project.eid is keyref to employee.eid

(d) <!ELEMENT company (employee)>

<!ELEMENT employee (eid, project, name)>

<!ELEMENT project (pid, description)>

eid and project.pid primary keys

Figure 6: Simple possible equivalent schemas.

the existence of primary keys. If XRTS3 is deﬁned as

Employee(eid), with eid primary key and XRTS4 as

Employee(eid), they are not 100% equivalent. Thus,

we revise the similarity formula (Equation 10) to mul-

tiply the node equivalence to the key equivalence for

primary keys. For example if eid is a primary key its

equivalence is the product ε

eid

∗ε

KEY

, where ε

KEY

= 1

if both nodes are primary keys, and < 1 if only one of

them is a primary key.

The examples from Figure 6, are reduced to the

structures detailed in Figure 7 with the KEY sufﬁx

for primary keys and REF for references. By deﬁning

KEY

= 0.7 and ε

REF

= 0.6 and using the operators

equivalence deﬁned in Table 2, schema (a) is simi-

lar to the the rest of the schemas in the proportions

presented in Figure 8, where x represents a generic

schema and ε

eid

is the short form for ε

(a)eid→(x)eid

6 EXAMPLE

RA applied to examples from Figures 1 and 2 gener-

ates the following output.

XRTS1: company1((eid | sin, name, address,

(pidREF* | task

nameREF)+) +, (pidKEY, descrip-

tion, budget | manager | location)*, (task

nameKEY,

date)+)

XRTN1: company1(eid+ | sin+, name+, address+,

pidREF* | task

nameREF+, pidKEY*, description*,

budget* | manager* | location*, task

nameKEY+,

date+)

XRTS2: company2(eidKEY, sin, name, address*, da-

teOfBirth?, (pidKEY, description?, manger| location,

eidREF || pidREF, (task, date)+)*)+

XRTN2: company2(eidKEY+, sin+, name+, ad-

dress*, dateOfBirth*, pidKEY*, description*, man-

ager* | location*, eidREF* || pidREF*, task*, date*)

We start by comparing XTRN trees to determine

if they are from the same domain. Consider the opera-

tors equivalence detailed in Table 2. We determine the

node’s similarity from the two schemas using Equa-

tion 2.

Sim

XRTN1→XRTN2

= (

Max(ε

eid+→+

,ε

sin+→+

)

name+→+

+ ε

address+→∗

+ (ε

pid∗→∗

task

name+→task∗

)/2 + ε

pid∗→∗

+ ε

description∗→∗

(ε

budget∗→φ

+ ε

manager∗→∗

+ ε

location∗→∗

)/3 +

task

name+→task∗

+ ε

date+→∗

)/9∗ 100% = 96.20%

(a)company(eidKEY, pidREF, name, pidKEY, description )

(b)company(eid, pid, name, pid, description)

(c)company(eidKEY, name, pidKEY, eidREF, description)

(d)company (eidKEY, eidREF||pidREF, pid, description, name)

Figure 7: Reduced schemas.

ICEIS 2007 - International Conference on Enterprise Information Systems

Sim

XRTN2→XRTN1

= (Max(ε

eid+→+

,ε

sin+→+

) +

name+→+

+ε

address∗→+

+ε

dateOfBirth∗→φ

+ε

pid∗→∗

description∗→∗

+ (ε

manager∗→∗

+ ε

location∗→∗

)/2 +

Max(ε

eid∗→φ

,ε

pid∗→∗

) + ε

task∗→task

name+

date∗→+

))/12∗ 100% = 68.83%

Both values are high enough to suggest that there

are common nodes between XRTN1 and XRTN2.

The next step evaluates the similarity between the

structures of XRTS1 and XRTS2. To optimize the

computation of the structural similarities we use the

references determined in Part 3 and include them ac-

cordingly into sequences.

SimStr

XRTS1→XRTS2

= ((

Max(ε

eidR→R

,ε

sinR→R

)

nameR→R

+ ε

addressR→∗

+ (ε

pid∗→∗

∗ ε

REF

+ 0)/2 ∗

+→+

)/4∗ ε

+→+

+(ε

pidR→R

∗ε

KEY

+ε

decriptionR→?

(ε

budgetR→φ

+ ε

managerR→R

+ ε

locationR→R

)/3)/3 ∗

∗→∗

∗ ε

R→+

+ (ε

task

nameR→taskR

∗ ε

task nameKEY

dateR→R

)/2∗ε

+→+

∗ ε

R→∗

∗ ε

R→+

)/3∗100% = 66%

SimStr

XRTS2→XRTS1

= ((Max(ε

eidR→R

∗

KEY

,ε

sinR→R

) + ε

nameR→R

+ ε

address∗→R

dateOfBirth?→φ

) ∗ ε

+→+

+ ((ε

pidR→R

∗ ε

KEY

description?→R

+ (ε

managerR→R

+ ε

locationR→R

)/2 +

pid∗→∗

∗ ε

pidREF

) ∗ ε

∗→∗

∗ ε

+→R

+ (ε

taskR→R

dateR→R

)/2 ∗ ε

+→+

∗ ε

∗→R

∗ ε

+→R

)/5)/6 ∗ 100% =

46.50%

The nodes similarity values Sim show that both

schemas are from the same domain and refer to the

same set of entities (employee, projects, and task) as

they have many correspondent nodes. However, the

structural similarity values show that they are orga-

nized signiﬁcantly different. XSD1 is less general

than XSD2 as more of its structure is included in

XSD2 (SimStr

XRTS1→XRTS2

> SimStr

XRTS2→XRTS1

7 CONCLUSION AND FUTURE

WORK

Our approach ﬁnds equivalent XML schema struc-

tures by determining if their XML trees are equiv-

alent. Xequiv ﬁrst determines if schemas are from

the same domain and if there is any similarity be-

tween their nodes regarding labels, data types and

SimStr

(a)→(x)

= (ε

eid

∗ ε

eidKEY

+ ε

pid

∗ ε

pidREF

+ ε

name

pid

∗ ε

pidKEY

+ ε

description

)/5∗ 100%

SimStr

(a)→(b)

= (0.7+ 0.6+ 1+ 0.7+ 1)/5∗ 100% = 80%

SimStr

(a)→(c)

= (1+ 0+ 1+ 1+ 1)/5∗ 100% = 80%

SimStr

(a)→(d)

= (1+ 1+ 1+ 1+ 1)/5∗ 100% = 100%

Figure 8: Structural similarity of reduced schemas.

operators. Secondly, our algorithm focuses on struc-

tural organization and considers the number of nodes

in structures, operators applied to sequences, nested

or linked structures. The elimination of the non-leaf

nodes using the Reduction Algorithm (Duta et al.,

2006) makes the nodes path unimportant. This has

the advantage of allowing schemas to be equivalent

because they refer to the same entity attributes but not

necessarily because they share a part of the XML tree.

Further research needs to be conducted to asses the

efﬁciency of Xequiv compared to other existent algo-

rithms in the area.

REFERENCES

Bertino, E., Guerrini, G., and Mesiti, M. (2004). A match-

ing algorithm for measuring the structural similarity

between an XML document and a dtd and its applica-

tions. Journal of Information Systems, 29(1):23–46.

Boyer, J. (2001). Canonical xml version 1.0, w3c

recommendation, white paper. Available at

http://www.w3.org/TR/xml-c14n (Last searched

on November 18, 2006).

Consortium, W. W. W. (2004). XML Schema part 0, 1, and

Do, H. H., Melnik, S., and Rahm, E. (2003). Comparison

of schema matching evaluations. pages 221–237.

Do, H. H. and Rahm, E. (2002). COMA - a system for

ﬂexible combination of schema matching approaches.

In Proceedings of the 28th VLDB Conference, pages

610–621.

Duta, A., Barker, K., and Alhajj, R. (2006). Xml schema

reduction algorithm. In Proceedings of the Tenth East-

European Conference on Advances in Databases and

Information Systems ADBIS’06.

Laboratory, C. S. (2005). Wordnet. Available at

http://www.cogsci.princeton.edu (Last searched on

November 18, 2006).

Lee, M. L., Yang, L. H., Hsu, W., and Yang, X. (2002).

Xclust: clustering xml schemas for effective integra-

tion. In CIKM ’02: Proceedings of the eleventh in-

ternational conference on Information and knowledge

management, pages 292–299, New York, NY, USA.

ACM Press.

Madhavan, J., Bernstein, P. A., and Rahm, E. (2001).

Generic schema matching with cupid. In VLDB ’01:

Proceedings of the 27th International Conference on

Very Large Data Bases, pages 49–58, San Francisco,

CA, USA. Morgan Kaufmann Publishers Inc.

Nierman, A. and Jagadish, H. V. (2002). Evaluating struc-

tural similarity in XML documents. In Proceed-

ings of the Fifth International Workshop on the Web

and Databases (WebDB 2002), Madison, Wisconsin,

USA.

Salminen, A. and Tompa, F. W. (2001). Requirements for

xml document database systems. In DocEng ’01:

Proceedings of the 2001 ACM Symposium on Docu-

ment engineering, pages 85–94, New York, NY, USA.

ACM Press.

XML SCHEMA STRUCTURAL EQUIVALENCE