
can produce a set of mediated schemas. The above 
procedures correspond to step 2 in Figure 2. 
5  QUERY FORWARDING 
During schema clustering, we produce a 
mediated schema forest over the various data sources. In the 
query phase, users can pose queries using the 
terminology of any of the mediated schemas, 
although most of the time the higher-level schemas are 
used. The query is reformulated and delivered 
downward to the source schemas according to the 
mapping between the mediated schema and its lower 
schemas. In our system this mapping, called the schema 
mapping, consists of two parts. 
The first part is the one-to-many schema name mapping: 
the mediated schema name is mapped downward to 
the schema names of the lower schemas. We give 
each schema an ID that uniquely identifies it. 
The name mappings keep schema names 
along with their IDs, so that each mapping refers 
unambiguously to the intended schemas. Consider the following 
example of three source schemas. 
Example 1: 
(Paper: Paper_No., Title, Presentation, 
Author_info, Nation, Contact, Company) 
(Proceedings: Year, Conference, Title, Author1, 
Company1, Author2, Company2) 
(Publication: Ref_No., Type, Title, Author_list, 
Reference, Volume, Issue_section, Pages, Year, 
Month, Day, Conference_notes) 
In the above example, the schema name and the schema 
attributes are separated by a colon, and attributes are 
separated by commas. After processing by our system, 
we obtain the following mediated schema. 
Example 2: 
(Academic_publication: Title, No., Author, 
Conference, Year) 
The name mapping between the above mediated 
schema and the source schemas is as follows. 
Example 3: 
((4, Academic_publication) → ((1, Paper), (2, Proceedings), (3, Publication)))
This mapping maps schema (4, 
Academic_publication) to three lower schemas (1, 
Paper), (2, Proceedings), and (3, Publication). The 
first element in (4, Academic_publication) is the 
schema ID and the other one is the schema name.  
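As an illustration only, the name mapping of Example 3 could be held in memory as a dictionary keyed by (ID, name) pairs. The sketch below is written in Python for brevity and is not the system's Java implementation. The names name_mapping, lower_schemas and source_schemas are hypothetical, and the recursive descent assumes that a schema with no entry in the mapping is a source schema.

```python
from typing import Dict, List, Tuple

# A schema is identified by an (ID, name) pair, as in Example 3.
Schema = Tuple[int, str]

# Hypothetical in-memory form of the one-to-many name mapping of Example 3.
name_mapping: Dict[Schema, List[Schema]] = {
    (4, "Academic_publication"): [
        (1, "Paper"),
        (2, "Proceedings"),
        (3, "Publication"),
    ],
}


def lower_schemas(schema: Schema) -> List[Schema]:
    """Return the schemas directly below `schema` (empty for a source schema)."""
    return name_mapping.get(schema, [])


def source_schemas(schema: Schema) -> List[Schema]:
    """Follow name mappings downward until source schemas are reached.

    Assumption: a schema without an entry in `name_mapping` is a source schema.
    """
    children = lower_schemas(schema)
    if not children:
        return [schema]
    result: List[Schema] = []
    for child in children:
        result.extend(source_schemas(child))
    return result
```

Keying on the (ID, name) pair rather than on the name alone mirrors the role of the IDs described above: two schemas with the same name remain distinct entries.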
One-to-many attribute mappings are the other 
part of the schema mapping. The attribute mapping 
format is the same as that of the name mapping, but the 
second element within the parentheses is an attribute, 
such as "Title" in (4, Title) of Example 4. A 
schema mapping usually contains several attribute 
mappings, since a mediated schema usually has 
several mediated attributes. Consider the following 
mappings. 
Example 4: 
((4, Title) → ((1, Title), (2, Title), (3, Title)))
((4, No.) → ((1, Paper_No.), (3, Ref_No.)))
((4, Author) → ((1, Author_info), (2, Author1), (2, Author2), (3, Author_list)))
((4, Conference) → ((2, Conference), (3, Conference_notes)))
((4, Year) → ((2, Year), (3, Year)))
Example 4 lists the attribute mappings between the 
source schemas of Example 1 and the mediated schema of Example 2. 
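The attribute mappings of Example 4 can be sketched in the same style. The Python snippet below is a hypothetical representation, not the system's actual data structure. The helper attributes_by_source is an assumed name. It groups the lower-level attributes of one mediated attribute by source schema ID, which makes it easy to see that an attribute such as Author maps to two attributes of schema 2.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# An attribute is identified by (schema ID, attribute name), as in Example 4.
Attribute = Tuple[int, str]

# Hypothetical in-memory form of the attribute mappings of Example 4.
attribute_mapping: Dict[Attribute, List[Attribute]] = {
    (4, "Title"): [(1, "Title"), (2, "Title"), (3, "Title")],
    (4, "No."): [(1, "Paper_No."), (3, "Ref_No.")],
    (4, "Author"): [(1, "Author_info"), (2, "Author1"),
                    (2, "Author2"), (3, "Author_list")],
    (4, "Conference"): [(2, "Conference"), (3, "Conference_notes")],
    (4, "Year"): [(2, "Year"), (3, "Year")],
}


def attributes_by_source(mediated_attr: Attribute) -> Dict[int, List[str]]:
    """Group the lower-level attributes of a mediated attribute by schema ID."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for schema_id, attr_name in attribute_mapping.get(mediated_attr, []):
        grouped[schema_id].append(attr_name)
    return dict(grouped)
```

For instance, attributes_by_source((4, "Year")) contains entries for schemas 2 and 3 only, because Example 4 maps no attribute of schema 1 (Paper) to Year.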
When users pose a query on the mediated schema 
in Example 2, the system reformulates the query 
for the source schemas according to the mappings in 
Example 3 and Example 4. Several queries are 
generated, matching the source schemas in Example 
1. Finally, the databases return the required 
answers to the users. To illustrate the forwarding process, 
consider the following query, posed on the 
mediated schema in Example 2. 
SELECT Title, Author, Year 
FROM Academic_publication 
WHERE Year > 2006 AND Author = 'Strehl' 
According to the mappings in Examples 3 and 4, 
the following two queries are posed on the 
corresponding data sources automatically. 
(1)  SELECT Year, Title, Author1, Author2 
FROM Proceedings 
WHERE Year > 2006 AND (Author1 = 'Strehl' OR Author2 = 'Strehl') 
(2)  SELECT Title, Author_list, Year 
FROM Publication 
WHERE Year > 2006 AND Author_list = 'Strehl' 
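The forwarding step illustrated above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the system's Java implementation. The function reformulate and the constants NAME_MAPPING and ATTRIBUTE_MAPPING are hypothetical names, and only conjunctive WHERE clauses are handled. The sketch assumes that a source schema is skipped when it lacks a mapping for any attribute used in the query, which is consistent with no query being issued on Paper (it has no attribute mapped from Year). It also assumes that a mediated attribute mapped to several attributes of one source becomes a disjunction, as with Author1 and Author2.

```python
from typing import Dict, List, Tuple

# Mappings of Examples 3 and 4 (hypothetical in-memory form).
NAME_MAPPING = {
    (4, "Academic_publication"): [(1, "Paper"), (2, "Proceedings"), (3, "Publication")],
}
ATTRIBUTE_MAPPING = {
    (4, "Title"): [(1, "Title"), (2, "Title"), (3, "Title")],
    (4, "No."): [(1, "Paper_No."), (3, "Ref_No.")],
    (4, "Author"): [(1, "Author_info"), (2, "Author1"),
                    (2, "Author2"), (3, "Author_list")],
    (4, "Conference"): [(2, "Conference"), (3, "Conference_notes")],
    (4, "Year"): [(2, "Year"), (3, "Year")],
}

# A condition is (mediated attribute, operator, SQL literal); conditions are ANDed.
Condition = Tuple[str, str, str]


def source_attrs(mediated_id: int, attr: str, source_id: int) -> List[str]:
    """Attributes of one source schema that a mediated attribute maps to."""
    mapped = ATTRIBUTE_MAPPING.get((mediated_id, attr), [])
    return [name for sid, name in mapped if sid == source_id]


def reformulate(mediated: Tuple[int, str], select: List[str],
                where: List[Condition]) -> Dict[str, str]:
    """Rewrite a query on a mediated schema into one query per usable source."""
    mediated_id, _ = mediated
    queries: Dict[str, str] = {}
    for source_id, source_name in NAME_MAPPING[mediated]:
        needed = list(select) + [attr for attr, _, _ in where]
        # Assumption: skip a source that cannot express every attribute used.
        if any(not source_attrs(mediated_id, a, source_id) for a in needed):
            continue
        columns: List[str] = []
        for attr in select:
            columns.extend(source_attrs(mediated_id, attr, source_id))
        predicates: List[str] = []
        for attr, op, literal in where:
            options = [f"{a} {op} {literal}"
                       for a in source_attrs(mediated_id, attr, source_id)]
            # Several source attributes for one mediated attribute: OR them.
            predicates.append(options[0] if len(options) == 1
                              else "(" + " OR ".join(options) + ")")
        sql = f"SELECT {', '.join(columns)} FROM {source_name}"
        if predicates:
            sql += " WHERE " + " AND ".join(predicates)
        queries[source_name] = sql
    return queries


queries = reformulate(
    (4, "Academic_publication"),
    ["Title", "Author", "Year"],
    [("Year", ">", "2006"), ("Author", "=", "'Strehl'")],
)
for source, sql in queries.items():
    print(source, "->", sql)
```

The sketch emits columns in the order of the mediated SELECT list, so the Proceedings query lists Title before Year, unlike query (1) above. The two forms are otherwise equivalent. Literals are pasted into the SQL text only to keep the example short. Code that accepts user input should use parameterized queries instead.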
6  EXPERIMENTS 
Our algorithms were implemented in Java. We ran 
the experiments on a Windows 7 machine with a 
2.60 GHz Intel(R) i5 processor and 8 GB of memory. The 
goal of our experiments is to demonstrate that (1) our 
schema clustering algorithm is effective in clustering 
the data sources of multiple domains, (2) queries on the 
mediated schemas return answers with good 
accuracy, and (3) the cost of writing query clauses for 
users is reduced without losing query accuracy. 
For the purpose of our query evaluation, we used 
MySQL to store the data. Two string similarity 
measures are used to compute the schema 
similarity since two strings may be semantically 