AN ARCHITECTURE FOR COLLABORATIVE DATA MINING
Francisco Correia, Rui Camacho
LIAAD & DEI, Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
João Correia Lopes
INESC-Porto & Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
Keywords:
Collaborative Data Mining, Web Services.
Abstract:
Collaborative Data Mining (CDM) develops techniques to solve complex data analysis problems that require sets of experts in different domains who may be geographically separated. An important issue in CDM is the sharing of experience among the different experts. In this paper we report on a framework that enables users with different expertise to perform data analysis activities and profit, in a collaborative fashion, from the expertise and results of other researchers. The collaborative process is supported by web services that seek relevant knowledge available among the collaborating web sites.
We have successfully designed and deployed a prototype for collaborative Data Mining in the domains of Molecular Biology and Chemoinformatics.
1 INTRODUCTION
Multi-Relational Data Mining (MRDM) (Dzeroski,
2001) is a very active research field that strives to
construct complex models for data. One flavour of
MRDM is Inductive Logic Programming (ILP) (Mug-
gleton and De Raedt, 1994). ILP systems can con-
struct complex models, represented in a First Order
Logic language, using relevant background knowl-
edge provided by domain experts.
The success of MRDM/ILP applications often depends on the collaboration between domain experts (e.g., in Molecular Biology) and ILP experts. The former provide the background knowledge, whereas the latter know how to encode it and how to use the algorithms to construct the models. A further key point is that Data Mining (and ILP) applications in similar domains may profit from sharing information, which can speed up new DM (ILP) tasks in those domains.
As in the CRISP-DM methodology (CRISP-DM,
2007), where pre-processing the data takes a signifi-
cant percentage of the whole DM process, deploying
the background knowledge is a significant part of an
ILP-based Relational Data Mining process. Sharing
components (predicates) of the background knowl-
edge may provide a considerable speedup in the deployment of an ILP application and reduce the dependency on the ILP expert.
There have been several successful CDM experi-
ences as reported in (Lavrac et al., 2004; Blockeel and
Moyle, 2002; Moyle et al., 2003).
In this paper we report on a Service Oriented Ar-
chitecture (SOA), implemented as SOAP Web Ser-
vices (Papazoglou and Georgakopoulos, 2003), that
provides a framework for Collaborative Data Mining.
The Web services provide a mechanism that makes the sharing of information (data sets, background knowledge predicates and papers) among the participant sites completely transparent to users.
The rest of the paper has the following structure.
In Section 2 we describe the data analysis method we
use and explain its advantages and also the advan-
tages of using a collaborative approach. In Section 3
we describe the web application we have developed.
The use of Web services in the Service Oriented Ar-
chitecture is described in Section 4. A case study in
domains of Molecular Biology and Chemoinformat-
ics is described in Section 5. Finally, we present our conclusions and point out future work in Section 6.
2 ILP-BASED MRDM
In order to understand our application and the usefulness of the proposed SOA, we briefly explain how an ILP system is used in a MRDM task.
An ILP system requires two main ingredients: a set of examples (the data to be analysed) and a set of predicates, called the background knowledge, that encode knowledge the domain expert considers relevant for the analysis process. The result of the data analysis process is a model encoded as a set of rules (clauses). Each of these rules (clauses) uses the predicates in the background knowledge as the basic building blocks of the conditional part of the rule (the literals in the body of a clause).
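As an illustration of these ingredients, the sketch below (in Python, used here purely for illustration) generates the three input files that the Aleph manual prescribes for a run: background knowledge (.b), positive examples (.f) and negative examples (.n). The toy predicates active/1 and has_group/2 are our own invention and do not come from any of the case-study data sets.

```python
# Minimal sketch of the inputs an ILP run needs (file naming follows
# the Aleph manual; the toy predicates here are purely illustrative).
from pathlib import Path

background = """\
% Mode and determination declarations guide Aleph's search.
:- modeh(1, active(+mol)).
:- modeb(*, has_group(+mol, #group)).
:- determination(active/1, has_group/2).

% Background knowledge: functional groups found in each molecule.
has_group(m1, nitro).
has_group(m2, nitro).
has_group(m3, methyl).
"""

positives = "active(m1).\nactive(m2).\n"   # examples the model must cover
negatives = "active(m3).\n"                # examples the model must exclude

for suffix, text in ((".b", background), (".f", positives), (".n", negatives)):
    Path("toy" + suffix).write_text(text)

# Inside a Prolog session with Aleph loaded, one would then run:
#   ?- read_all(toy).
#   ?- induce.
# yielding a clause such as:  active(A) :- has_group(A, nitro).
```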
To solve a problem with ILP we require a domain
expert or team of domain experts to define the prob-
lem and provide the data (examples) and we need an
ILP expert to “frame” the domain problem into an
ILP problem and run the data analysis experiments.
We also need a tight collaboration between domain
and ILP experts in the development of the background
knowledge. Developing the background knowledge is a very time-consuming stage of the whole MRDM process. We speed up background knowledge development in three ways: i) re-using existing predicates available at other web sites; ii) avoiding experiments that previously led to bad results; and iii) starting off with a background knowledge existing at another web site and trying to improve it.
3 A WEB-BASED APPLICATION
FOR MULTI-RELATIONAL
DATA MINING
The application reported in this paper is deployed as
a set of collaborative sites supporting the proposed SOA, shown in Figure 1 (a).
Each site is devoted to the active collaborative work of a group of researchers that may be geographically distributed. At each site, the different kinds of used/produced items of information are classified as private or public. Public information is accessible to groups at other sites via Web services (passive collaborative work). We now describe in detail the architecture of each site and the functionalities provided for its users.
At each web site we have adopted the architecture shown in Figure 1 (b). It uses a standard n-tier model: user interface, business logic, and data access. The implementation details are provided in Section 4.
Figure 1: (a) SOA Architecture for CDM. (b) Implementation Architecture for each Site.

The database underlying each web site stores four types of information: data sets upon which data analysis experiments are performed; libraries of predicates for the background knowledge; papers related to the stored data sets; and relevant information concerning the traces of ongoing and past experiments.
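As one possible concrete reading of this design, the sketch below builds a minimal version of such a database with SQLite; the schema is an assumption made for illustration, not the actual schema of the prototype.

```python
# Minimal sketch of the per-site database (illustrative schema only).
import sqlite3

conn = sqlite3.connect("site.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dataset (
    id INTEGER PRIMARY KEY, name TEXT, content BLOB,
    is_public INTEGER DEFAULT 0          -- private/public classification
);
CREATE TABLE IF NOT EXISTS predicate (
    id INTEGER PRIMARY KEY, name TEXT, category TEXT,
    description TEXT,                    -- English description for domain experts
    source TEXT,                         -- implementation, hidden in the UI
    is_public INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS paper (
    id INTEGER PRIMARY KEY, title TEXT, dataset_id INTEGER,
    is_public INTEGER DEFAULT 0,
    FOREIGN KEY (dataset_id) REFERENCES dataset(id)
);
CREATE TABLE IF NOT EXISTS experiment (
    id INTEGER PRIMARY KEY, dataset_id INTEGER, status TEXT,
    trace TEXT,                          -- trace of the ongoing/past run
    FOREIGN KEY (dataset_id) REFERENCES dataset(id)
);
""")
conn.commit()
```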
The data analysis task is performed by running an
ILP system (Aleph (Srinivasan, 2003)) on the work-
ing data set. The User Interface (UI) allows any user
to run several data analysis tasks. At any time the user may check the status of his tasks through the UI; the tasks are handled by a tasks server that manages a set of machines available on a campus LAN.
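A minimal sketch of such a tasks server is given below; since we do not detail its real implementation here, a local thread pool stands in for the pool of campus machines and the status table is kept in memory.

```python
# Sketch of a tasks server: submit analysis runs and query their status.
# A thread pool stands in for the pool of campus LAN machines.
import subprocess
import uuid
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
status = {}                     # task id -> "running" | "done" | "failed"

def run_task(task_id, command):
    try:
        subprocess.run(command, check=True)
        status[task_id] = "done"
    except subprocess.CalledProcessError:
        status[task_id] = "failed"

def submit(command):
    task_id = uuid.uuid4().hex
    status[task_id] = "running"
    pool.submit(run_task, task_id, command)
    return task_id              # the UI polls status[task_id] later

# Hypothetical usage, assuming a Prolog script that drives an Aleph run:
#   tid = submit(["yap", "-l", "run_aleph.pl"])
```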
Each site includes usage scenarios for different
users: the administrator, the ILP expert and the Do-
main expert.
The implemented administrative tasks, available after authentication by an administrator, include the following: inspecting the details of user actions on the web site by checking the log of actions performed; managing user accounts and access control; setting up the general configuration of the application (e.g., managing the available ILP algorithms); and managing the list of available web sites (supporting the architecture) that the site's Business Logic uses transparently to look for information.
The role of the ILP expert is mainly to encode the required predicates that are part of the background knowledge. He is also able to manage the hierarchy of
categories of such predicate libraries. The ILP expert
may perform data analysis tasks with direct control of
the ILP system. That is, he may directly control the
values of the ILP system parameters.
A domain expert can upload new data sets and papers. He can also manage the metadata: the hierarchical categories of papers, of data sets, and of existing predicates. The domain expert may also perform data analysis tasks on personal data.

Figure 2: Interface to compose the background knowledge for a data set. Libraries of predicates may also be searched externally using web services.
Access to the Web services (Papazoglou and
Georgakopoulos, 2003) is processed at the Web appli-
cation Business Logic level. For each user information request, the business logic layer forwards the request to the set of participant sites and collects the returned information. Web services are not used to change any information stored in the web sites; they are only used to retrieve it.
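The following sketch illustrates this read-only fan-out; query_local and call_remote_service are hypothetical placeholders for the database access and the SOAP client code.

```python
# Sketch of the read-only fan-out performed by the Business Logic:
# local results are merged with whatever each registered site returns.
def search_resources(kind, category, query_local, call_remote_service, sites):
    results = list(query_local(kind, category))           # local database hits
    for site in sites:                                    # registered peer sites
        try:
            remote = call_remote_service(site, kind, category)
        except OSError:                                   # unreachable site: skip it
            continue
        for item in remote:
            item["origin"] = site                         # so the UI can label it
            results.append(item)
    return results                                        # one consistent listing
```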
4 IMPLEMENTATION DETAILS
In Section 3 the functional and non-functional re-
quirements of the proposed SOA architecture were
presented. This section will give implementation de-
tails regarding the User Interface (UI), Business Logic
(BL) and Database.
User Interface
The interface in Figure 2 shows an example of the provided User Interface. In this example a list of data sets is provided, and some of them are external data sets (located on the ILPWS1 machine). The options also differ between local and external data sets. Simplicity and usability are the main design goals of the application.
The user may choose to access external data sets.
If that is the case then the Business Layer will trans-
parently use the registered Web Services to retrieve
related information. The user states his information needs in a special-purpose UI, and the Business Logic layer retrieves local information from the database and calls the appropriate Web services to retrieve information from the other participant sites. All
that information, coming from different sources, will
then be presented to the user in a consistent way.
Business Logic
When a request for information arrives from the XHTML UI in the browser, the BL layer queries its own database and uses the implemented Web services to retrieve information from the other participant sites.
The Web services are used to share/access infor-
mation (data sets, papers, background knowledge)
publicly available at the different web sites that im-
plement the collaborative framework. Communication with the Web services, including the returned results, is implemented through the eXtensible Markup Language (XML) (Moller and Schwartzbach, 2006) and uses the standard Simple Object Access Protocol (SOAP) (Mitra and Lafon, 2007). All possible operations are described in a Web Services Description Language (WSDL) document (Booth and Liu, 2007), where all services provided by the application are listed.
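As a hedged illustration of how a client site may invoke such operations, the sketch below uses the Python zeep library; the WSDL location and the operation name listDataSets are assumptions for the example, not the actual names in our WSDL.

```python
# Sketch of a SOAP client call using the zeep library.
from zeep import Client

# Hypothetical WSDL location of a participant site.
client = Client("http://ilpws1.example.org/services?wsdl")

# Hypothetical operation name; zeep exposes WSDL operations
# as methods on client.service.
datasets = client.service.listDataSets()
for ds in datasets:
    print(ds)
```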
For data sets, papers and background knowledge
each site provides services to list all (e.g. data sets),
get information about one resource given the ID,
download the resource to the local file system and get
and import the resource (e.g. data sets) into the site
database.
For data sets the implementation provides calls to: return the list of data sets that exist in the site; return the stored information about a data set; return the content of a data set; and return all information related to the data set in a format that can be inserted into the local site database. The services for papers and background knowledge are similar.
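These four operations can be summarised as the following client-side interface (a sketch only; the operation names are our own illustration):

```python
# The four data-set operations each site exposes, sketched as a
# client-side interface (names are illustrative, not from the WSDL).
from typing import Protocol

class DataSetService(Protocol):
    def list_datasets(self) -> list[dict]: ...               # all public data sets
    def get_dataset_info(self, ds_id: int) -> dict: ...      # metadata for one ID
    def get_dataset_content(self, ds_id: int) -> bytes: ...  # the files themselves
    def export_dataset(self, ds_id: int) -> dict: ...        # importable DB record
```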
Because some information may be very large (e.g., the data sets), the user is given the option to check whether a resource is really what he wants before retrieving the files. A specific piece of information is requested only when it is needed.
The advantage of using Web services instead of direct access to the database of the remote Business Logic layer is that a Web service implementation works across different platforms: each site may use a different operating system and a different implementation language.
5 THE ILP-WS FRAMEWORK
We have implemented and deployed a prototype of
the framework described in the previous sections. The
application domains of this case study are Structure-Activity Relationship (SAR) problems in chemoinformatics and genomics and proteomics problems in Molecular Biology. Our prototype has two web sites
where biologists, biochemists and ILP experts may
solve these kinds of problems. The ILP system
available for the experiments is Aleph.
Each site running the application allows domain
and ILP experts to implement active Collaborative
Data Mining tasks. Domain experts provide problems
and data (examples) and the ILP experts develop the
background knowledge predicates for those problems.
Each site has libraries of available predicates organised in a hierarchical structure defined by domain experts. Each stored predicate has an English description of its function, and its detailed implementation is hidden from the domain expert.
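The sketch below shows one possible shape of a library entry, keeping the implementation separate from the English description shown to domain experts; the field names are our own illustration.

```python
# Sketch of a predicate library entry: domain experts see only the
# category path and the English description; the source stays hidden.
from dataclasses import dataclass

@dataclass
class PredicateEntry:
    name: str                  # e.g. "has_group/2"
    category: tuple[str, ...]  # hierarchical path, e.g. ("chemistry", "groups")
    description: str           # English description shown in the UI
    source: str                # implementation, hidden from domain experts
    public: bool = False       # whether other sites may retrieve it

def in_category(entry: PredicateEntry, prefix: tuple[str, ...]) -> bool:
    """True if the entry sits under the given category prefix."""
    return entry.category[: len(prefix)] == prefix
```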
We have provided an interface (see Figure 2)
where the domain expert may assemble the back-
ground knowledge by searching and choosing pred-
icates from this hierarchically organised library of
predicates. At this stage Web services are used to
search other web sites where the application is de-
ployed, looking for predicates of the required cate-
gory. This procedure may save time in the develop-
ment of the background knowledge. An ILP expert is
required only when the domain expert decides to use some knowledge that is neither encoded as a predicate locally nor available through the Web services.
Before starting the data analysis experiments the
user may use the UI to inspect existing results of
other experiments on the data set, if publicly avail-
able. This will give him an idea of what background knowledge has been tried and what the corresponding results were, and therefore avoid repeating useless experiments or choosing predicates that seem to be of no use for the analysis of the data.
The expert may conduct a sequence of experiments in which models are constructed and presented to him. Each step of the experimental process is
recorded so the expert may inspect previously con-
structed models and in the end he may decide which
models to store as final results of the analysis process.
He may also decide which information to make pub-
lic.
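A sketch of how each experimental step might be recorded is shown below; the record fields and the public flag are our assumptions about what such a trace contains.

```python
# Sketch of recording experiment steps so earlier models can be
# inspected and selected results made public (field names illustrative).
import json
import time

def record_step(trace_path, parameters, model, accuracy):
    """Append one experimental step to the experiment trace."""
    step = {
        "timestamp": time.time(),
        "parameters": parameters,   # ILP system settings used in this run
        "model": model,             # the induced rules, as text
        "accuracy": accuracy,
        "public": False,            # the expert decides later what to publish
    }
    with open(trace_path, "a") as f:
        f.write(json.dumps(step) + "\n")
```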
6 CONCLUSIONS
In this paper we have described a framework for Col-
laborative Data Mining. At each site the framework
enables the solving of domain problems with the help
of ILP experts that develop the background knowl-
edge and use ILP systems. Web services search other sites for publicly available information relevant to solving those problems.
The use of Web services extends the traditional approach to Collaborative Data Mining by implementing a passive form of collaboration, in which web sites are automatically searched for relevant information.
ACKNOWLEDGEMENTS
This study was funded by FCT project “ILP-Web-
services” (PTDC/EIA/70841/2006).
REFERENCES
Blockeel, H. and Moyle, S. (2002). Collaborative data
mining needs centralised model evaluation. In Pro-
ceedings of the ICML-2002 Workshop on Data Mining
Lessons Learned, pages 21–28.
Booth, D. and Liu, C. K. (2007). Web services description
language (WSDL) version 2.0 part 0: Primer. Tech-
nical Report Second Edition, W3C Recommendation.
http://www.w3.org/TR/wsdl20-primer.
CRISP-DM (2007). Cross industry standard process for
data mining. http://www.crisp-dm.org/.
Dzeroski, S. (2001). Relational Data Mining. Springer-
Verlag New York, Inc., Secaucus, NJ, USA.
Lavrac, N., Motoda, H., Fawcett, T., Holte, R., Langley,
P., and Adriaans, P. (2004). Introduction: Lessons
learned from data mining applications and collabora-
tive problem solving. Machine Learning, 57(1-2):13–
41.
Mitra, N. and Lafon, Y. (2007). SOAP version 1.2 part 0:
Primer. Technical Report Second Edition, W3C Rec-
ommendation. http://www.w3.org/TR/soap12-part0/.
Moller, A. and Schwartzbach, M. I. (2006). An Introduction
to XML and Web Technologies. Addison Wesley.
Moyle, S., McKenzie, J., and Jorge, A. M. (2003). Col-
laboration in a data mining virtual organization. In
Data Mining and Decision Support: Integration and
Collaboration, The International Series in Engineer-
ing and Computer Science, chapter 5, pages 49–62.
Springer.
Muggleton, S. and De Raedt, L. (1994). Inductive logic
programming: Theory and methods. Journal of Logic
Programming, 19/20:629–679.
Papazoglou, M. P. and Georgakopoulos, D. (2003). Service-
oriented computing. Communications of the ACM,
46(10):25–28.
Srinivasan, A. (2003). The Aleph Manual. Available from http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph.