ARCHITECTURE-CENTRIC DATA MINING MIDDLEWARE SUPPORTING MULTIPLE DATA SOURCES AND MINING TECHNIQUES

Sai Peck Lee and Lai Ee Hen

Department of Software Engineering, Faculty of Computer Science & Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia

Keywords: Knowledge Discovery, Data Mining, Middleware, Data Mining Middleware.

Abstract: In today’s marketplace, information stored in a consumer database is the most valuable asset of an organization. It houses important hidden information that can be extracted to solve real-world problems in engineering, science, and business. This possibility has led to the increasing application of knowledge discovery in databases, and hence to the emergence of a variety of data mining tools in the market. These tools offer different strengths and capabilities, helping decision makers to improve business decisions. In this paper, we provide a high-level overview of a proposed data mining middleware whose architecture provides great flexibility for a wide spectrum of data mining techniques, supporting decision makers in generating useful knowledge for decision making. We describe the features that we consider important for the middleware to support, such as providing a wide spectrum of data mining algorithms and reports through plugins. We also briefly explain both the high-level architecture of the middleware and the technologies that will be used to develop it.

1 INTRODUCTION

In today’s information age, the growing volume of data made possible by technologies for generating and collecting data (Ong, 2000) has led to the need to turn these data into useful information for decision making. Knowledge Discovery in Databases (KDD) comes into the picture here, turning low-level data into high-level knowledge for decision support. According to Fayyad et al., “Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Understanding the implicit information in the data is important for strategic decision support. However, the data is often scattered throughout the corporation, and it must be integrated before analysis (Goebel & Gruenwald, 1999). Hence, the ability of data mining tools to access different data sources is essential. In addition, one of the challenges in data mining is the ability to mine diverse kinds of knowledge in databases (Han & Kamber, 2006). Different users may be interested in different kinds of knowledge, so a well-designed data mining tool should provide a wide spectrum of data mining techniques to solve different business problems. This research therefore proposes an architecture for a data mining middleware that supports a variety of data mining functionalities, with the flexibility to apply a wide spectrum of data mining techniques to multiple data sources. The intention is to design a platform- and language-independent middleware that allows organizations to mine data from a wide range of data sources, such as relational databases, multidimensional databases, flat files, hierarchical databases, object-oriented databases, XML files and others, to solve real-world business problems.

2 RELATED WORK

There are various data mining tools available in the market, such as IBM Intelligent Miner, SPSS Clementine, SAS Institute Enterprise Miner, Oracle Data Miner, and Microsoft Business Intelligence Development Studio. The majority of the studies conducted on these tools focus primarily on their functions rather than on their performance. Our study focuses mainly on the performance of the tools, based on five attributes: memory shortages, excess paging with a disk bottleneck, paging file fragmentation, memory leaks, and cache manager efficiency.

Peck Lee S. and Ee Hen L. (2007). ARCHITECTURE-CENTRIC DATA MINING MIDDLEWARE SUPPORTING MULTIPLE DATA SOURCES AND MINING TECHNIQUES. In Proceedings of the Second International Conference on Software and Data Technologies - Volume ISDM/WsEHST/DC, pages 224-227. DOI: 10.5220/0001326102240227. SciTePress.

Based on our study, data mining tools such as Oracle Data Miner and Microsoft Business Intelligence Development Studio cache a certain percentage of both unmined and mined data in the application tier. Such a strategy off-loads computing cycles from the backend systems (for example, those of Microsoft Business Intelligence Development Studio and Oracle Data Miner). However, neither the unmined nor the mined data is fully persisted or cached in the backend systems. As a result, two users may end up mining the same data set independently, so the same work is performed twice.

Our study also reveals that performing data mining in memory leads to better performance. For example, IBM Intelligent Miner consumes only 15% of Physical Disk\Disk Time and 100 MB of Memory\Available Bytes. This illustrates the trade-off between memory and disk: the more of the work that is done in memory, the less time is spent on disk activity (also referred to as disk I/O). Disk I/O is often a major bottleneck to data mining performance.

To improve the performance of data mining, our study suggests that the major data mining activities should be performed in memory at the server level. The proposed memory repository of the middleware will therefore be adopted from SQL Server Analysis Services. With the transition from 32-bit to 64-bit computing, we believe the proposed middleware will be able to make fuller use of memory. In the near future, we believe that major data mining tools such as Microsoft Business Intelligence Development Studio and Oracle Data Miner, which are largely vendor dependent, will do the same.

At the time of our study, tools such as Microsoft Business Intelligence Development Studio, Oracle Data Miner, and SAS Institute Enterprise Miner supported only a predefined set of data sources. For example, Microsoft Business Intelligence Development Studio supported only ODBC, OLE DB and other predefined types of data sources. Oracle Data Miner, on the other hand, supported only JDBC-compliant drivers such as OCI-based drivers. Implementing new data sources in such tools is difficult and often requires an understanding of the API specification of the data source concerned. In the case of Oracle Data Miner, for example, implementers need to understand the JDBC API specification.

A data mining tool might also be constrained by platform dependence (Purba, 2006). Tools such as Microsoft Business Intelligence Development Studio and SPSS Clementine are not platform independent. Microsoft Business Intelligence Development Studio depends on the .NET Framework, which currently supports only the Windows platform; supporting other platforms such as Linux requires tedious customization. SPSS Clementine, on the other hand, releases different binaries for different platforms. Oracle Data Miner uses the same binaries on different platforms and, as such, is platform independent.

3 PROPOSED DATA MINING MIDDLEWARE

This paper discusses our proposed architecture for a data mining middleware, yet to be developed, that employs the strengths and eliminates the weaknesses of other data mining tools available in the market. We will refer to this middleware as the Java-Based Data Mining Middleware (JDMM). The proposed architecture is a server-centric middleware that does not limit the set of data mining techniques: new techniques can be plugged into the middleware. In addition, JDMM will be a platform-, data source-, and data mining technique-independent middleware that is accessible from front-office, back-office and web-office environments. JDMM is designed to minimize the level of disk activity (disk I/O) over time during data mining by introducing a memory-optimized repository and other technologies. Disk I/O is an important performance metric during data mining, as disks are often a major bottleneck to data mining performance. The performance of any application that performs I/O will be limited by it, and further CPU performance improvements will be wasted (Chen & Patterson, 1993). This is particularly true for database-driven data mining. Hence, the JDMM architecture needs to be designed to address the I/O throughput of disks, to enable a highly scalable and almost instantly responsive server-centric data mining middleware.


3.1 Overview of Proposed Middleware

Figure 1 portrays the proposed high-level system architecture for JDMM. The middleware is proposed with three possible user roles: Administrator, Implementor, and Business Analyst. Administrators administer JDMM and ensure its uptime. Implementors are technical users who are able to plug in new adapters through the JDMM Web Configurator. Lastly, Business Analysts are non-technical users who are responsible for business decision making and access Web JDMM to solve real-world business problems. The Enterprise JavaBeans (EJB) server acts as a retrieval engine and consists of different adapters that connect different data sources to JDMM.

[Figure 1 (diagram not reproduced). Labelled elements: Administrators and Implementors; Business Analysts; Internet; Java-Based Data Mining Middleware; Tomcat Servlet Container; JDMM Web Configurator; Web JDMM; JDMM Engine; EJB Server; data sources (Oracle, SQL Server, MySQL, XML, TXT).]

Figure 1: JDMM Architecture Information Flow.

After data retrieval, JDMM organizes the data to create a data mining model using a specific data mining technique. Each technique is governed by pluggable rule adapters. At this stage, the result can be stored in a data mining repository, or written directly to a persistent data store at a specified time interval. The primary objective of the repository is to cache results so that results that have already been computed are not computed again. The result is an XML file that is then delivered to the client in any proprietary format incorporated in JDMM.
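The caching role of the repository can be illustrated with a minimal Java sketch. It is not the JDMM implementation: the class name MiningRepository, the method getOrMine, the request key format, and the XML string returned by the stand-in miner are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical in-memory repository: maps a mining request key to its result.
final class MiningRepository {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Returns the cached result for the key; the miner runs only on a cache miss.
    String getOrMine(String requestKey, Function<String, String> miner) {
        return results.computeIfAbsent(requestKey, miner);
    }

    int size() {
        return results.size();
    }

    public static void main(String[] args) {
        MiningRepository repository = new MiningRepository();
        AtomicInteger minerRuns = new AtomicInteger();
        // Stand-in for an expensive mining run that yields an XML result.
        Function<String, String> miner = key -> {
            minerRuns.incrementAndGet();
            return "<result request=\"" + key + "\"/>";
        };
        repository.getOrMine("issues:clustering", miner);
        repository.getOrMine("issues:clustering", miner); // served from the cache
        System.out.println("miner runs: " + minerRuns.get());
    }
}
```

In this sketch the second request for the same key is answered from memory, so the stand-in miner runs only once. Eviction and persistence to a data store at a time interval are left out.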

We believe that the architecture is able to address real-world business scenarios in areas such as human resources, business and project management, IT operations, finance, marketing and so forth. For example, in a typical bug tracking application, JDMM can be employed to analyse project-related metrics such as issues per state, priority, severity, category and resolution. These metrics are useful for measuring the success of future projects. If the number of projects to be measured is large, the memory repository of JDMM can reduce the time required for data mining. If the project-related data are stored in different sources, the JDMM adapters can be configured accordingly, and these data sources can be managed collectively within JDMM.
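As a small worked example of the bug tracking metrics mentioned above, the following Java sketch counts issues per state and per priority. The Issue class, its two fields and the sample data are hypothetical; the paper does not define a schema for bug tracking data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class IssueMetrics {
    // Hypothetical issue record; a real bug tracker would supply more fields.
    static final class Issue {
        final String state;
        final String priority;

        Issue(String state, String priority) {
            this.state = state;
            this.priority = priority;
        }
    }

    // Counts issues per value of the chosen attribute (state, priority, ...).
    static Map<String, Long> countBy(List<Issue> issues, Function<Issue, String> attribute) {
        return issues.stream()
                .collect(Collectors.groupingBy(attribute, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Issue> issues = Arrays.asList(
                new Issue("open", "high"),
                new Issue("open", "low"),
                new Issue("closed", "high"));
        System.out.println("per state: " + countBy(issues, issue -> issue.state));
        System.out.println("per priority: " + countBy(issues, issue -> issue.priority));
    }
}
```

The same countBy method serves every attribute, which mirrors the idea of computing several metrics over one retrieved data set.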

3.2 JDMM Detailed Architecture

The internal architecture of the proposed middleware is divided into two kinds of threads, as shown in Figure 2: Inbound Threads, which manage incoming uninterpreted operational data (raw data), and Outbound Threads, which manage all outgoing interpreted data (mined data).
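The division into inbound and outbound threads can be sketched in Java with a queue as the hand-off point. This is only an illustration under stated assumptions: the class ThreadPipeline, the queue, and the placeholder "mined(...)" step are hypothetical, and the sketch folds the mining step into the outbound thread, whereas in JDMM the mining is performed by the engine between the two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ThreadPipeline {
    // Distinct String instance used as an end-of-input marker (compared by identity).
    private static final String END = new String("END");

    static List<String> run(List<String> rawRecords) {
        BlockingQueue<String> handoff = new LinkedBlockingQueue<>();
        List<String> delivered = new ArrayList<>();

        // Inbound thread: accepts raw records and hands them to the middleware.
        Thread inbound = new Thread(() -> {
            try {
                for (String record : rawRecords) {
                    handoff.add(record);
                }
            } finally {
                handoff.add(END); // always signal the end so the consumer can stop
            }
        }, "inbound");

        // Outbound thread: interprets each record (placeholder step) and delivers it.
        Thread outbound = new Thread(() -> {
            try {
                for (String record = handoff.take(); record != END; record = handoff.take()) {
                    delivered.add("mined(" + record + ")");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "outbound");

        inbound.start();
        outbound.start();
        try {
            inbound.join();
            outbound.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("row-1", "row-2", "row-3")));
    }
}
```

With one producer and one consumer on a FIFO queue, records are delivered in the order they arrived. A pooled design, as suggested by the receive and send pools in Figure 2, would need additional coordination that this sketch does not attempt.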

Both the Send Adapter and the Receive Adapter are part of a framework known as the Adapter Framework. Through the JDMM Web Configurator, implementors of JDMM are able to plug in different adapter components to connect to different data sources. Each adapter is configurable, and each configurable parameter is stored in an XML file.
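A minimal Java sketch of a pluggable, XML-configured adapter follows. The names AdapterFramework, ReceiveAdapter, register, fromXml and configXml, as well as the "type" and "path" parameters, are assumptions for illustration. The XML layout is the one read by the standard java.util.Properties.loadFromXML method, chosen here for brevity; the paper does not specify the format of the JDMM configuration files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

final class AdapterFramework {
    // Hypothetical contract for an adapter that reads from one kind of data source.
    interface ReceiveAdapter {
        String describe();
    }

    // Adapter factories keyed by the "type" parameter of the configuration.
    private final Map<String, Function<Properties, ReceiveAdapter>> factories = new HashMap<>();

    // Plugs a new adapter type into the framework.
    void register(String type, Function<Properties, ReceiveAdapter> factory) {
        factories.put(type, factory);
    }

    // Builds an adapter from an XML document in the java.util.Properties XML format.
    ReceiveAdapter fromXml(String xml) {
        Properties config = new Properties();
        try {
            config.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException("invalid adapter configuration", e);
        }
        String type = config.getProperty("type");
        Function<Properties, ReceiveAdapter> factory = factories.get(type);
        if (factory == null) {
            throw new IllegalArgumentException("No adapter registered for type: " + type);
        }
        return factory.apply(config);
    }

    // Helper that produces a sample configuration document for the demo.
    static String configXml(String type, String path) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <entry key=\"type\">" + type + "</entry>\n"
                + "  <entry key=\"path\">" + path + "</entry>\n"
                + "</properties>\n";
    }

    public static void main(String[] args) {
        AdapterFramework framework = new AdapterFramework();
        framework.register("txt", cfg -> () -> "TXT adapter reading " + cfg.getProperty("path"));
        System.out.println(framework.fromXml(configXml("txt", "issues.txt")).describe());
    }
}
```

Supporting a new data source in this sketch means registering one more factory and supplying a configuration document, without changing the framework class itself.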

Java-Based Data Miner (JDM) will be a pure Java API for developing data mining applications. The idea is to have a common API for data mining that clients can use without being aware of, or affected by, the actual vendor implementations for data mining.

JDM Extension will be an extension to JDM that includes additional data mining models, data scoring and data transformations. JDM Extension will be designed as a highly generalized, object-oriented conceptual model for data mining, based on the Data Mining Group’s Predictive Model Markup Language (PMML) data mining standard. PMML is an XML markup language for describing statistical and data mining models ("Predictive Model Markup Language", 2005).

4 PROPOSED LOGICAL COMPONENTS OF JDMM

JDMM Web Configurator and Web JDMM are situated in the application system layer, which is built on top of the business-specific component systems, namely the Adapter Framework and JDM. The Adapter Framework is an extensible framework to which multiple adapters, each connecting to a different data source, can be added. This framework will be the component system that enables JDMM users to establish connections to a wide variety of data sources. JDM will be the engine that performs the data mining process.

ICSOFT 2007 - International Conference on Software and Data Technologies


[Figure 2 (diagram not reproduced). Labelled elements: Java-Based Data Mining Middleware; Inbound Thread; Receive Adapters (Receive Adapter 1 to Receive Adapter n); Receive Pools; Outbound Thread; Send Pools; Send Adapters (Send Adapter 1 to Send Adapter n); Store and Forward Thread; JDM; JDM Extension (Transformations, Models, Descriptors, DAOs and others); J2EE; Data Mining Repository; Incoming Uninterpreted Data; Outgoing Interpreted Data; data formats XLS, XML, MDB, TXT and Java Objects.]

Figure 2: Detailed software architecture of JDMM.

JDMM Web Configurator and Web JDMM communicate with each other during the data mining process. If a decision maker wishes to mine a different data source, an adapter corresponding to that data source needs to be configured and deployed into the system through the JDMM Web Configurator. From Web JDMM, the decision maker is then able to perform the data mining process through the deployed adapter.

5 FUTURE CONSIDERATION

An implementation of the JDMM architecture is in progress. At the current phase of our work, we believe that, conceptually, JDMM is able to provide an analytical platform that is configurable and extensible throughout the business enterprise. A known limitation of JDMM is that it relies heavily on memory. JDMM must be implemented carefully to ensure that unused objects are garbage collected efficiently and effectively; otherwise, memory will be a potential bottleneck. In the near future, JDMM will be extended to enable massive data sets to be analyzed. As data sets increase in size, traditional data mining tools become less efficient. We intend to address analytical scalability through grid computing and 64-bit computing. “Grid” computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation (Foster, Kesselman, & Tuecke, 2001). We foresee that grid computing will empower data mining with instant responsiveness and very high throughput when analyzing mission-critical data sets in real-time enterprises and industries. With grid computing, mined data can be accessed, captured, or updated many times faster, giving business analysts a fast response for business-critical decisions. With 64-bit computing, the memory limitation of the proposed architecture will be overcome.

REFERENCES

Cheng Soon Ong, 2000. Knowledge Discovery in Databases: An Information Retrieval Perspective. Malaysian Journal of Computer Science. Vol. 13, No. 2, pp. 54-63.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996. From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining. MIT Press. pp. 37-54.
Ian Foster, Carl Kesselman, and Steven Tuecke, 2001. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. pp. 1-25.
Jiawei Han and Micheline Kamber, 2006. Data Mining: Concepts and Techniques. Second Edition. Elsevier. pp. 5-45.
Michael Goebel and Le Gruenwald, 1999. A Survey of Data Mining and Knowledge Discovery Software Tools. SIGKDD Explorations. ACM SIGKDD. Volume 1, Issue 1, pp. 20-33.
Peter M. Chen and David A. Patterson, 1993. Storage Performance - Metrics and Benchmarks. Volume 81. pp. 1-33.
Predictive Model Markup Language (PMML), 2005. Technology Reports. Cover Pages. http://xml.coverpages.org/pmml.html. Retrieved August 8, 2005.
Sanjiv Purba, 2006. Handbook of Data Management. Viva Books Private Limited.
