BENEFICIAL SEQUENTIAL COMBINATION OF

DATA MINING ALGORITHMS

Mathias Goller

Department of Business Informatics - Data & Knowledge Engineering, Johannes-Kepler-University Linz/Austria

Markus Humer

Michael Schrefl

Department of Business Informatics - Data & Knowledge Engineering, Johannes-Kepler-University Linz/Austria

Keywords:

Sequences of data mining algorithms, pre-computing intermediate results, clustering, decision tree construction, naive Bayes.

Abstract:

Depending on the goal of an instance of the Knowledge Discovery in Databases (KDD) process, there are instances that require more than a single data mining algorithm to determine a solution. Sequences of data mining algorithms offer room for improvements that are as yet unexploited.

If it is known that an algorithm is the first of a sequence of algorithms and that there will be future runs of other algorithms, the first algorithm can determine intermediate results that the succeeding algorithms need. The anteceding algorithm can also determine helpful statistics for succeeding algorithms. As the anteceding algorithm has to scan the data anyway, computing intermediate results happens as a by-product of computing the anteceding algorithm's result.

On the one hand, a succeeding algorithm can save time because several steps of that algorithm have already been pre-computed. On the other hand, additional information about the analysed data can improve the quality of results, such as the accuracy of classification, as demonstrated in experiments with synthetic and real data.

1 INTRODUCTION

Sophisticated algorithms analyse large data sets for

patterns in the data mining phase of the KDD process.

Depending on the goal the analyst is striving for when performing an analysis, he or she must apply a combination of data mining algorithms to achieve it.

We observed in a project with NCR Teradata Austria that the same data is analysed several times, yet each time with a different purpose. Typically, several pre-analyses antecede an analysis. In most of these analyses data mining algorithms are involved.

Clustering is often used to segment a data set of heterogeneous objects into subsets of objects that are more homogeneous than the unsegmented data set. Succeeding algorithms analyse these homogeneous data sets for interesting patterns, for instance to find some kind of predictive model such as a churn model that can be used to predict the probability that a customer will cancel his or her contract in a given time frame. In other words, clustering and classification algorithms are used in combination to achieve a common goal, in this case to determine a churn model per segment.

A company can also use the combination of clustering and classification algorithms to determine segments such that it can easily identify a customer's segment. Otherwise, there could be too many customers who are erroneously assigned to a segment.

When there is a sequence of data mining algorithms analysing data, the conventional way is to compute the results of each algorithm individually. We will call that kind of sequential computation of a combination of data mining algorithms naïve combined data mining because this procedure fails to exploit the potential improvements the combination of algorithms offers.

Figure 1 illustrates (a) the conventional way of

combining data mining algorithms and (b) our im-

proved way of combining algorithms, which is

sketched in the remainder of this section.

In a sequence of algorithms, each algorithm within

that sequence has the role of antecessor or succes-

sor of another algorithm. Although algorithms in the

middle of a sequence with more than two algorithms

can have both roles, there is only one antecessor and

one successor in each pair of algorithms.

The antecessor retrieves tuples and performs some

operations on them such as determining groups of


similar tuples or marking some tuples as relevant according to a given criterion. The successor needs the antecessor's result or the modifications the antecessor has applied to the data; otherwise, both algorithms could have been run in parallel.

Figure 1: Conventional and improved way of sequences in algorithm combinations. (a) Naïve combined data mining: the succeeding algorithm S processes the result of the anteceding algorithm A. (b) Antecessor knows successor: S processes the result of A, and A additionally pre-computes intermediate results and auxiliary data for S.

In this work we examine only sequences of data mining algorithms that are connected by their results, i.e. the result of one algorithm is needed by another algorithm. Thus, parallelising unconnected data mining algorithms is not an issue in this work.

If it is known that there will be a run of the successor, the antecessor can do more than processing its own task: during a scan of the data, the antecessor can compute intermediate results the successor needs and auxiliary data the successor can profit from. As accessing tuples on a hard disk takes much more time than CPU operations do, additional computations are almost free once a tuple is loaded into main memory.

Profiting from intermediate results and auxiliary data means that the succeeding algorithm is either faster or returns results of higher quality than in the case of not using intermediate results and auxiliary data.

We call the approach of the antecessor preprocess-

ing items for the successor antecessor knows succes-

sor because the task of the successor must be known

when starting the antecessor.

We demonstrate in a series of experiments that the approach antecessor knows successor is beneficial for succeeding data mining algorithms. As some results have been declared a corporate secret by our industrial partner, only half of the tests shown in this paper are made with real data. Where free real data was unavailable, we re-did tests with synthetic data having similar characteristics.

The remainder of this paper is organised as follows: Section 2 discusses related work in the literature. Previous work concentrates on what we refer to as naïve combined data mining. Section 3 describes how an anteceding algorithm can pre-compute intermediate results for later usage by succeeding algorithms. The experimental results section, section 4, discusses the results of our experiments based on the costs and benefits of our approach. It also concludes this paper.

2 RELATED WORK

This section surveys previous work on using combinations of data mining algorithms. Previous work can be classified into (a) work on solving a complex problem of knowledge discovery by combining a set of algorithms of different type and (b) work on solving a problem by using a combination of algorithms of the same type.

Our discussion of previous work below shows that especially combining algorithms of different type offers much room for improvement.

2.1 Using Different Types of Data

Mining Algorithms to Solve

Complex Problems

The combination of clustering and classification is by far the most frequently used combination of different types of data mining algorithms we found in the literature. Hence, this paper focusses on the combination of clustering and classification. However, we also briefly present how to apply our concept to other combinations because the concept is not limited to clustering and classification.

Several approaches like (Kim et al., 2004) and (Genther and Glesner, 1994) use clustering to improve the accuracy of a succeeding classification algorithm when there are too many different attributes for classification. Having fewer attributes is desirable because fewer combinations of attributes are possible. As a consequence of fewer combinations, the resulting trees are less vulnerable to over-fitting.

Approaches of text classification such as (Dhillon et al., 2002) pre-cluster words into groups of words before predicting the category of a document. Again, the accuracy of classification rises due to the decreasing number of different words.

However, all above-mentioned approaches combine clustering and classification without taking potential improvements within the algorithms into account. The algorithms are merely executed sequentially. Hence, we call such approaches naïve combined data mining.

Combinations of association rule analysis (ARA)

and other data mining techniques include the com-

binations clustering succeeded by ARA, classiﬁcation

succeeded by ARA, and ARA succeeded by clustering.


There exist several approaches that reduce the number of found rules by clustering them, such as (Lent et al., 1997) and (Han et al., 1997). Unfortunately, the auxiliary data presented herein are unable to improve approaches of this kind. Thus, we omit this combination in further sections.

In web usage mining, an anteceding clustering of web sessions partitions users into groups having different profiles of accessing a web site. A succeeding association rule analysis examines patterns of navigation segment by segment, as shown in (Lai and Yang, 2000) or (Cooley, 2000). However, most papers focus on either ARA or clustering, as shown in (Facca and Lanzi, 2005).

Again in web usage mining, a classification algorithm can determine filter rules to distinguish accesses of human users from those of software agents, which is needed because only user accesses represent user behaviour.

Both a clustering algorithm and the scoring function of a classification algorithm scan all the data the succeeding ARA algorithm needs. Thus, they could efficiently pre-compute intermediate results for an association rule analysis.

However, all above-mentioned approaches are approaches of naïve combined data mining. Hence, they are candidates for being improved by our approach as shown in section 3.

There are data mining algorithms such as the approaches of (Kruengkrai and Jaruskulchai, 2002) and (Liu et al., 2000) that call other data mining algorithms during their run. Here, there is a strong coupling between both algorithms because the calling algorithm cannot exist on its own. The calling algorithm would become a different algorithm when removing the call of the other algorithm. Thus, we do not call this coupling of algorithms a sequence of algorithms because one cannot separate them. In contrast, our approach uses loosely coupled algorithms where each algorithm can exist on its own.

2.2 Combining Data Mining

Algorithms of the Same Type to

Improve Results

Combining data mining algorithms of the same type to solve a problem is a good choice when there is a trade-off between speed and quality. Fast but inaccurate algorithms find initial solutions for succeeding algorithms that deliver better results but need more time. k-means (MacQueen, 1967) followed by EM (Dempster et al., 1977) is such a combination, where k-means generates an initial solution for EM.

Some approaches that cluster streams use hierarchical clustering to compress the data into a dendrogram, which a partitioning clustering algorithm uses as a replacement for the tuples. Using this combination, only one scan of the data is needed, which is a necessary condition for applying an algorithm to a data stream, since partitioning clustering algorithms typically need multiple scans of the data they analyse. (O'Callaghan et al., 2002), (Guha et al., 2003), (Zhang et al., 2003), and (Chiu et al., 2001) are approaches that use a dendrogram to compute a partitioning clustering.

An algorithm clustering streams must scan the data at least once; thus, algorithms clustering data streams need a scan in any case.

As the above-mentioned approaches of clustering streams need only the minimum number of scans, there is not much room for improvement. (Bradley et al., 1998) have shown that the quality of clustering remains high when using aggregated representations of data such as a dendrogram instead of the data itself. Thus, the approaches clustering streams mentioned above are not limited to streams.

As combining clustering algorithms of the same type is established in the research community, the herein-presented approach focusses on combinations of data mining algorithms of different type.

3 ANTECESSOR KNOWS

SUCCESSOR

This section describes the concept antecessor knows successor in detail. It is organised as follows: Subsection 3.1 introduces different types of intermediate results and auxiliary data. Subsection 3.2 shows their efficient computation during the antecessor's run. The subsequent subsections describe the beneficial usage of intermediate results and auxiliary data in runs of succeeding algorithms that are association rule algorithms in subsection 3.3, decision tree algorithms in subsection 3.4, and naïve Bayes classification algorithms in subsection 3.5.

3.1 Intermediate Results and

Auxiliary Data

The antecessor can compute intermediate results or auxiliary data for the successor.

We call an item an intermediate result for the successor if the successor would have to compute that item during its run if it were not already computed by the antecessor.

Further, we call an item an auxiliary datum for the successor if that item is unnecessary for the successor to compute its result but either improves the quality of the result or simplifies the successor's computation.

Therefore, the type of the succeeding algorithm determines whether an item is an intermediate result or an auxiliary datum. An item can be an intermediate result in one application and an auxiliary datum in the other. For instance, statistics of the distribution of tuples are intermediate results for naïve Bayes classification but are auxiliary data for decision tree classification.

Both intermediate results and auxiliary data are either tuples fulfilling some special condition or statistics of the data. For instance, tuples that are mode, medoid, or outliers of the data set are tuples fulfilling a special condition: mode and medoid represent typical items of the data set, while outliers represent atypical items. Furthermore, all kinds of statistics such as mean and deviation parameters are potential characteristics of the data.

We call a tuple fulfilling a special condition an auxiliary tuple, while we denote an additionally computed statistic of the data an auxiliary statistic.

3.2 Requirements for Efﬁcient

Computation of Intermediate

Results and Auxiliary Data

This section shows what is required to efﬁciently

compute intermediate results and auxiliary data for

the successor. It concentrates on the cost of com-

puting and storing intermediate results and auxiliary

data. Their beneﬁcial usage depends on the type of

the succeeding algorithm. Hence, the description of

using intermediate results and auxiliary data is part of

the following subsections 3.3, 3.4, and 3.5.

For computing the antecessor's result, intermediate results, and auxiliary data in parallel, it must be possible to compute the intermediate results and auxiliary data without additional scans of the data. Otherwise, pre-computing intermediate results and auxiliary data would decrease the antecessor's performance without guaranteeing that the successor's performance gain exceeds the antecessor's performance loss. Consequently, our approach is limited to auxiliary data and intermediate results that can easily be determined.

Statistics such as mean and deviation are easy to compute, as count, linear sum, and sum of squares of the tuples are all that is needed to determine mean and deviation.

Hence, count, linear sum, and sum of squares are easily-computed items that enable the computation of several auxiliary statistics.
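To make this concrete, the following sketch (our illustration, not code from the paper) accumulates count, linear sum, and sum of squares for one numeric attribute in a single pass and derives mean and deviation from these aggregates; the class and method names are hypothetical.

```python
import math

class SufficientStats:
    """Count, linear sum, and sum of squares for one numeric attribute."""

    def __init__(self):
        self.n = 0       # count of tuples seen so far
        self.ls = 0.0    # linear sum of attribute values
        self.ss = 0.0    # sum of squared attribute values

    def add(self, x):
        # O(1) per tuple, so it can piggyback on the antecessor's scan
        self.n += 1
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.n

    def deviation(self):
        # standard deviation derived solely from the three aggregates
        variance = self.ss / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))
```

Keeping one such accumulator per cluster and per attribute is sufficient to derive the cluster-specific means and deviations used later in section 4.2.1.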

If the antecessor is a clustering algorithm, auxiliary statistics can be statistics of a cluster or statistics of all tuples of the data set. As the number of clusters of a partitioning clustering algorithm is typically much smaller than the number of tuples, storing auxiliary statistics of all clusters is inexpensive. Thus, we suggest storing all of them. Our experiments show that cluster-specific statistics enable finding good split points when the successor is a decision tree algorithm.

Other statistics like the frequencies of attribute val-

ues and the frequencies of the k most frequent pairs

of attribute values are also easy to compute as shown

in subsections 3.3 and 3.5, respectively.

The cost of determining whether a tuple is an auxiliary tuple or not depends on the special condition which that tuple must fulfill. The mode of a data set is easy to determine. However, determining whether a tuple is an outlier or a typical representative of the data set is difficult because one requires the tuple's vicinity to test the conditions "is outlier" and "is typical representative". Tuples in sparse areas are candidates for outliers. Analogously, tuples in dense areas are candidates for typical representatives.

However, if the anteceding algorithm is an iteratively optimising clustering algorithm such as EM or k-means, the task of determining whether a tuple is a typical representative or an outlier simplifies to the task of determining whether the tuple is within the vicinity of its cluster's centre or far away from it.
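As an illustration, the sketch below flags near and far tuples based on their distance to the centre of their assigned cluster; the quantile thresholds are assumptions made for this example only, and the distances could be taken from the last k-means iteration so that no extra scan of the data is required.

```python
import math

def distance(tuple_, centre):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(tuple_, centre)))

def flag_auxiliary_tuples(tuples, centres, assignment,
                          near_quantile=0.10, far_quantile=0.90):
    """Return (near, far): typical representatives and outlier candidates.
    assignment[i] is the index of the cluster of tuples[i]."""
    dists = [distance(t, centres[assignment[i]]) for i, t in enumerate(tuples)]
    ranked = sorted(dists)
    near_cut = ranked[int(near_quantile * (len(ranked) - 1))]
    far_cut = ranked[int(far_quantile * (len(ranked) - 1))]
    near = [t for t, d in zip(tuples, dists) if d <= near_cut]
    far = [t for t, d in zip(tuples, dists) if d >= far_cut]
    return near, far
```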

Further, intermediate results must be independent of any of the successor's parameters because the specific value of a parameter of the successor might be unknown when starting the antecessor. For instance, it could be known that there will be a succeeding classification but the analyst has not yet specified which attribute will be the class attribute.

Yet, if a parameter of the successor has a domain with a finite number of distinct values, the antecessor can pre-compute intermediate results for all potential values of this parameter. For instance, assume that the class attribute is unknown. Then, the class attribute must be one of the data set's attributes. To be more specific, it must be one of the ordinal or categorical attributes that are non-unique, as shown in section 4.2.2. Section 4.2.2 also shows that small buffers are sufficient to store auxiliary statistics for all potential class attributes.
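The sketch below illustrates that selection step: it lists the non-unique ordinal and categorical attributes as candidate class attributes, so that the antecessor can maintain one small frequency buffer per candidate during its scan. The schema representation and the uniqueness threshold are assumptions for this example, not part of the original approach.

```python
def candidate_class_attributes(schema, distinct_counts, tuple_count,
                               uniqueness_threshold=0.95):
    """schema: dict attribute -> type ('ordinal', 'categorical', 'continuous', ...).
    distinct_counts: dict attribute -> number of distinct values."""
    candidates = []
    for attr, attr_type in schema.items():
        if attr_type not in ('ordinal', 'categorical'):
            continue
        # attributes whose values are (almost) unique behave like keys
        if distinct_counts[attr] >= uniqueness_threshold * tuple_count:
            continue
        candidates.append(attr)
    return candidates
```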

3.3 Intermediate Results for

Association Rule Analyses

If the succeeding algorithm is an algorithm that mines association rules, such as Apriori or FP-growth, the frequency of all item sets with one element is a parameter-independent intermediate result: Apriori (Agrawal and Srikant, 1994) can then determine the frequent item sets of length 1 without scanning the data. FP-growth (Han et al., 2000) can then also determine the item to start the FP-tree with, which is the most frequent item, without scanning the data.

Therefore, having pre-computed the frequencies of 1-itemsets as intermediate results during the antecessor's run saves exactly one scan. If the additional time is less than the time of a scan, we receive an improvement in runtime.

As the saving is always exactly one scan and the results are identical with or without pre-computed frequencies, we omit presenting tests in the experimental results section.

If it is known which attribute represents the item of the succeeding association rule analysis, the antecessor can limit the number of potential 1-itemsets. Otherwise, the antecessor must compute the frequency of each attribute value for all ordinal and categorical attributes.

Even for large numbers of attributes and different attribute values per attribute, the space that is necessary to store the frequencies of 1-itemsets is small enough to easily fit in the main memory of an up-to-date computer. We exclude attributes with unique values because they would significantly increase the number of 1-itemsets while using them in an association rule analysis makes no sense.
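As a minimal sketch of this by-product computation (ours, with hypothetical names), the following pass counts the frequency of every value of the non-unique ordinal and categorical attributes; Apriori could read its frequent 1-itemsets from these counters instead of performing its first scan.

```python
from collections import defaultdict

def count_one_itemsets(tuples, attributes):
    """Count (attribute, value) frequencies in a single pass over the data.
    'attributes' is assumed to exclude attributes with unique values."""
    freq = defaultdict(int)
    for t in tuples:                 # piggybacks on the antecessor's scan
        for attr in attributes:
            freq[(attr, t[attr])] += 1
    return dict(freq)

def frequent_one_itemsets(freq, min_support_count):
    """What Apriori would otherwise compute in its first scan over the data."""
    return {item: f for item, f in freq.items() if f >= min_support_count}
```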

3.4 Auxiliary Data for Decision Tree

Classiﬁcation

Classification algorithms generating decision trees take a set of tuples where the class is known (the training set) to derive a tree where each path in the tree reflects a series of decisions and an assignment to a class. A decision tree algorithm iteratively identifies a split criterion using a single attribute and splits the training set into disjoint subsets according to the split criterion. For doing this, the algorithm tries to split the training set according to each attribute and finally uses that attribute for splitting which partitions the training set best according to some kind of metric such as entropy loss.

The composition of the training set and the way the training set is split influence the quality of the decision tree.

One can choose the training set randomly or one can choose tuples that fulfill a special condition as elements of the training set. Our experiments in subsection 4.2.1 show that the accuracy of classification increases when choosing tuples fulfilling a special condition.

Splitting ordinal or continuous attributes is typically done as a binary split at a split point: initially, the algorithm determines a split point. Then, tuples with an attribute value of the split attribute that is less than the split point become part of one subset; tuples with greater values become part of the other subset.

Determining the split point is non-trivial if the split attribute is continuous because the number of potential split points is unlimited.

However, auxiliary statistics describing the distribution of attributes simplify finding good split points. Assume that the anteceding algorithm is a partitioning clustering algorithm. Thus, the set of auxiliary statistics also includes a set of statistics for each cluster.

With the auxiliary statistics of each cluster one can determine a set of density functions for each continuous attribute the decision tree algorithm wants to test for splitting.

The experiments in section 4.2.1 show that the points of intersection of all pairs of density functions are good candidates for split points.
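To illustrate this step, the sketch below (ours, under the assumption that each cluster's values of an attribute are modelled by a normal density with the pre-computed mean and deviation) computes the intersection points of all pairs of per-cluster densities for one continuous attribute; these points can then be offered to the decision tree algorithm as candidate split points.

```python
import math
from itertools import combinations

def gaussian_intersections(mu1, s1, mu2, s2):
    """Solve f1(x) = f2(x) for two normal densities N(mu1, s1) and N(mu2, s2)."""
    if abs(s1 - s2) < 1e-12:
        return [(mu1 + mu2) / 2.0]          # equal spread: midpoint of the means
    a = 1.0 / (2 * s1 ** 2) - 1.0 / (2 * s2 ** 2)
    b = mu2 / s2 ** 2 - mu1 / s1 ** 2
    c = (mu1 ** 2 / (2 * s1 ** 2) - mu2 ** 2 / (2 * s2 ** 2)
         + math.log(s1 / s2))
    disc = b ** 2 - 4 * a * c
    if disc < 0:
        return []
    return [(-b + sgn * math.sqrt(disc)) / (2 * a) for sgn in (1.0, -1.0)]

def candidate_split_points(cluster_stats):
    """cluster_stats: list of (mean, deviation) pairs, one per cluster."""
    points = []
    for (m1, d1), (m2, d2) in combinations(cluster_stats, 2):
        points.extend(gaussian_intersections(m1, d1, m2, d2))
    return sorted(points)
```

For two clusters with equal deviations the intersection reduces to the midpoint of the two means, which matches the intuitive choice of a split point between two well-separated groups.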

3.5 Intermediate Results and

Auxiliary Data for Naive Bayes

Classiﬁcation

It is very easy to deduce a naive Bayes classifier using only pre-computed frequencies. A naive Bayes classifier classifies a tuple t = (t_1, ..., t_d) as the class c ∈ C of a set of classes C that has the highest posterior probability P(c|t). The naive Bayes classifier needs the conditional probabilities P(t_i|c) and the prior probability P(c) of class c, which the antecessor can pre-compute, to determine the posterior probabilities according to the formula

P(c|t) = P(c) \prod_{i=1}^{d} P(t_i|c).    (1)

Determining the probabilities on the right hand side of formula 1 requires the number of tuples n, the total frequency of each attribute value of the class attribute, and the total frequencies of pairs of attribute values where one element of a pair is an attribute value of the class attribute, as the probabilities P(c) and P(t_i|c) are approximated by the total frequencies F(c) and F(c ∩ t_i) as P(c) = F(c)/n and P(t_i|c) = F(c ∩ t_i)/F(c), respectively.

Frequency F(c) is also the frequency of the 1-itemset {c}, which is an intermediate result stored for an association rule analysis, as shown in subsection 3.3. As the count n is the sum of all frequencies of the class attribute's values, the frequency of pairs of attribute values is the only remaining item to determine.

Storing all potential combinations of attribute values is very expensive when there is a reasonable number of attributes, but storing the most frequent combinations is tolerable. As the Bayes classifier assigns a tuple to the class that maximises the posterior probability, a class with infrequent combinations is rarely the most likely class, because a low frequency in formula 1 influences the product more than several frequencies that represent the maximum probability of 1.

As a potential solution, one can store the top frequent pairs of attribute values in a buffer with fixed size and take the risk of having a small fraction of unclassified tuples.
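The following sketch (ours, with hypothetical data structures) shows how class scores can be derived directly from such a buffer according to formula 1; if a required pair frequency is missing from the buffer, the tuple is reported as unclassifiable, which corresponds to the behaviour evaluated in subsection 4.2.2.

```python
def classify_from_frequencies(tuple_values, classes, class_freq, pair_freq, n):
    """tuple_values: dict attribute -> value of the tuple to classify.
    class_freq: dict c -> F(c); pair_freq: dict ((attr, value), c) -> F(c ∩ t_i);
    n: total number of tuples."""
    scores = {}
    for c in classes:
        score = class_freq[c] / n                      # P(c) = F(c) / n
        for attr, value in tuple_values.items():
            f = pair_freq.get(((attr, value), c))
            if f is None:
                return None                            # tuple cannot be classified
            score *= f / class_freq[c]                 # P(t_i|c) = F(c ∩ t_i) / F(c)
        scores[c] = score
    return max(scores, key=scores.get)                 # class with highest posterior
```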

Counting the frequency of attribute value pairs is only appropriate when the attributes used for classification are ordinal or categorical, because continuous attributes potentially have too many distinct attribute values.

If a continuous attribute is to be used for classification, the joint probability density function replaces the probabilities of pairs of attribute values in formula 1.

The parameters necessary to determine the joint

probability density function such as the covariance

matrix are auxiliary statistics for the succeeding

Bayes classiﬁcation.

Hence, pre-computed frequencies and a set of pre-computed parameters of probability density functions are all that is needed to derive a naïve Bayes classifier. Subsection 4.2.2 shows that the resulting classifiers have high quality.

4 EXPERIMENTS

We exemplify our approach with a series of experiments organised in two scenarios: (a) combining clustering with decision tree construction and (b) combining clustering with naïve Bayes classification.

k-means clustering is the antecessor in both sce-

narios. It pre-computes statistics and identiﬁes spe-

ciﬁc tuples that could potentially be used as interme-

diate results and auxiliary data as mentioned in sub-

section 3.2. All intermediate results and auxiliary data

needed by the successors in scenarios (a) and (b) are

a subset of these items. We examine the costs for pre-

computing all items.

In scenario (a) we adapt the classiﬁcation algorithm

Rainforest (Gehrke et al., 2000) to use auxiliary data

consisting of auxiliary tuples and statistics. We inves-

tigate the improvement on classiﬁcation accuracy by

using these auxiliary data.

In scenario (b) we construct naïve Bayes classifiers using only pre-computed intermediate results consisting of frequent pairs of attribute values. We compare the classification accuracy using these pre-computed intermediate results with the conventional way of training a naïve Bayes classifier.

4.1 Cost of Anticipatory

Computation of Intermediate

Results and Auxiliary Data for

the Succeeding Algorithm

The cost of computing intermediate results and auxiliary data is two-fold. On the one hand, storing intermediate results and auxiliary data requires space.

Table 1: Computational cost of k-means with and without pre-computing auxiliary data and intermediate results.

tuples       iterations   time without (s)   time with (s)   overhead (%)
100'000           4              655                710              8
200'000           3              676                780             15
300'000           7            2'123              2'144              1
400'000           3            1'266              1'500             18
500'000           4            1'654              2'064             25
600'000           6            2'636              3'296             25
700'000           6            2'968              3'740             26
800'000           5            3'137              3'720             19
900'000           5            3'506              4'284             22
1'000'000         2            2'297              2'928             27
2'000'000         6            8'492             10'352             22
3'000'000         2            6'209              6'999             13
4'000'000         5           13'745             16'824             22
5'000'000         5           18'925             22'165             17

This space should fit into main memory to avoid additional write and read accesses to the hard disk, which would slow down the computation. On the other hand, additional computations need time regardless of whether there are additional disk accesses or not.

Table 1 shows the runtime of k-means without and with pre-computing intermediate results and auxiliary data. The data set used in this table is the largest data set of our test series. It is the synthetic data set we used in the test series described in section 4.2.1.

In the tests of table 1 we saved intermediate results and auxiliary data to disk, so that we measure the maximum overhead in time.

Yet, the average percentage of additional time is 20 percent when saving intermediate results and auxiliary data. In a third of all tests this additional time is approximately the same amount of time as needed for an additional scan.

However, in the majority of cases the additionally needed time is a minor temporal overhead.

The space that is necessary to store intermediate results and auxiliary data significantly depends on the type of data mining algorithm that succeeds the current algorithm. Hence, we postpone discussing the costs of space until we discuss the benefit for different types of data mining algorithms.

4.2 Beneﬁt of Pre-Computed

Intermediate Results

4.2.1 Beneﬁt for Decision Tree Classiﬁcation

For demonstrating the benefit of auxiliary data for decision tree classification we modified the Rainforest classification algorithm to use auxiliary data.

Figure 2: Height of the decision tree with and without auxiliary statistics, for training sets drawn randomly, from the near set, and from the far set of data sets A and B.

The data set used for testing is a synthetic data set having two dozen attributes of mixed type. The class attribute has five distinct values. Ten percent of the data is noisy to make the classification task harder. Additionally, classes overlap.

In an anteceding step we applied k-means to find the best-fitting partitioning of tuples. We use these partitions to select tuples that are typical representatives and outliers of the data set: we consider tuples that are near a cluster's centre as typical representatives. Analogously, we consider a tuple that is far away from a cluster's centre as an outlier. Both are auxiliary tuples according to our definition in section 3.1. Due to their distance to their cluster's centre we call them near and far.

We tested using both kinds of auxiliary tuples as

training set instead of selecting the training set ran-

domly.

In addition to auxiliary tuples we stored auxiliary

statistics such as mean and deviation of tuples of a

cluster in each dimension.

We used these statistics for determining split points

in a succeeding run of Rainforest. After each run we

compared the results of these tests with the results of

the same tests without using auxiliary statistics.

Compactness of the tree and accuracy are the measures we examined. More compact trees tend to be more resistant to over-fitting, e.g. (Freitas, 2002, p. 49). Hence, we prefer smaller trees. We measure the height of a decision tree to indicate its compactness.

Using statistics of the distribution for splitting returns a tree that is lower than or at most as high as the tree of the decision tree algorithm that uses no auxiliary statistics, as indicated in figure 2.

The inﬂuence of using auxiliary statistics on accu-

racy is ambiguous, as shown in ﬁgure 3. Some tests

show equal or slightly better accuracy, others show

worse accuracy than using no auxiliary statistics.

However, using auxiliary tuples as the training set significantly influences accuracy. Figure 3 shows that choosing tuples of the near set is superior to choosing tuples randomly.

Table 3: Classification accuracy of the Bayes classifier with pre-computed frequencies.

test                  S10%   S20%   S50%   aux400   aux1000
accuracy (%)          80.5   81.8   83.5    83.5      83.8
classified (%)        96.6   97.1   98.4    96.9      97.3
total accuracy (%)    77.8   79.4   82.2    80.9      81.2

buffer size                                  400      1000
pairs with class attribute churn             107       248

Considering each test series individually we ob-

serve that the number of tuples only slightly inﬂu-

ences accuracy. Except for the test series foA and

faA, accuracy is approximately constant within a test

series.

Thus, selecting the training set from auxiliary tuples is more beneficial than increasing the number of tuples in the training set. We suppose that the chance that there are noisy data in the training set is smaller when we select fewer tuples or select tuples out of the near set.

Summarising, if one is interested in compact and accurate decision trees, then selecting training data out of the near data set in combination with using statistics about the distribution for splitting is a good option.

4.2.2 Benefit for Naïve Bayes Classification

For demonstrating the benefit of pre-computing frequencies of frequent combinations for naïve Bayes classification, we compared naïve Bayes classifiers using a buffer of pre-computed frequencies with naïve Bayes classifiers determined in the traditional way.

We trained naïve Bayes classifiers on a real data set provided to us by a mobile phone company. We used demographic and usage data of mobile phone customers to predict whether a customer is about to churn or not. Most continuous and ordinal attributes such as age and sex have few distinct values. Yet, other attributes such as city have several hundreds of them. We used all non-unique categorical and ordinal attributes.

For checking the classifiers' accuracy, we reserved 20% of the available tuples, or 9'999 tuples, as test data. Further, we used the remaining tuples to draw samples of different size and to store the frequencies of frequent combinations in a buffer.

For storing the top frequent pairs of attribute values, we reserved buffers of different sizes. If the buffer is full when trying to insert a new pair, an item with low frequency is removed from the buffer.

To ensure that a newly inserted element of the list is not removed at the next insertion, we guarantee a minimum lifetime t_L for each element in the list. Thus, we remove the least frequent item that has survived at least t_L insertions.

Figure 3: Accuracy of the decision tree classifier for training sets of 100 to 5000 tuples, without and with auxiliary statistics. Legend of test series: random sample rsoA/rsoB (without auxiliary statistics) and rsaA/rsaB (with auxiliary statistics); near tuples noA/noB and naA/naB; far tuples foA/foB and faA/faB, for data sets A and B respectively.

Table 2: Results of naïve Bayes classification in detail. Each sub-table gives the number of tuples per estimated class (rows) and actual class (columns).

                               actual: false   actual: true
sample 10%        estimate false      7683           1394
                  estimate true        487             97
sample 20%        estimate false      7885           1449
                  estimate true        321             52
sample 50%        estimate false      8195           1494
                  estimate true        130             25
aux. buffer 400   estimate false      8054           1461
                  estimate true        140             35
aux. buffer 1000  estimate false      8124           1471
                  estimate true        108             31
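A minimal sketch of such a buffer (ours; the eviction search is kept deliberately simple): pairs are counted in a fixed-capacity dictionary, and when the buffer is full the least frequent entry that has already survived at least t_L insertions is evicted.

```python
class TopPairBuffer:
    """Fixed-size buffer of attribute-value pair frequencies with a
    minimum lifetime of t_L insertions before an entry may be evicted."""

    def __init__(self, capacity, t_l):
        self.capacity = capacity
        self.t_l = t_l
        self.freq = {}          # pair -> frequency
        self.born = {}          # pair -> insertion counter at time of insertion
        self.insertions = 0

    def add(self, pair):
        if pair in self.freq:
            self.freq[pair] += 1
            return
        self.insertions += 1
        if len(self.freq) >= self.capacity:
            self._evict()
        if len(self.freq) < self.capacity:   # eviction may have been impossible
            self.freq[pair] = 1
            self.born[pair] = self.insertions

    def _evict(self):
        # least frequent pair that has survived at least t_L insertions
        old_enough = [p for p in self.freq
                      if self.insertions - self.born[p] >= self.t_l]
        if old_enough:
            victim = min(old_enough, key=self.freq.get)
            del self.freq[victim]
            del self.born[victim]
```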

Estimating the class of a tuple needs the frequencies of all of that tuple's attribute values in combination with all values of the class attribute. If a frequency is not present, classification of that tuple is impossible.

A frequency can be unavailable because either (a) the combination does not occur in the training set or (b) it is not among the frequent pairs kept in the buffer. While case (b) can be solved by increasing the buffer size, case (a) is an inherent problem of Bayes classification.

Therefore, we used variable buffer sizes and sampling rates in our test series. We tested with buffers with space for 400 and 1000 pairs of attribute values. As the buffer with a capacity of 1000 pairs did not become full, we left out tests with larger buffers.

Table 3 contains the results of tests with sampling

S10%, S20%, and S50% and results of tests with

buffers as auxiliary data aux400 and aux1000 in sum-

marised form. Table 2 lists these results in detail.

Table 3 shows that small buffers are sufﬁcient for

generating Bayes classiﬁers with high accuracy.

Although the tests show that classification accuracy is very good when the frequencies of combinations are kept in the buffer, there are a few percent of tuples that cannot be classified. Thus, we split accuracy in table 3 into the accuracy of tuples that could be classified and the accuracy over all tuples. The tests show that the buffer size influences the number of classified tuples. They also show that small buffers have a high classification ratio.

Thus, small buffers are sufficient to generate naïve Bayes classifiers having high total accuracy using exclusively intermediate results.

4.3 Summary of Costs and Beneﬁts

Our experiments have shown that the costs for computing intermediate results and auxiliary data are low, even if one saves a broad range of different types of intermediate results, such as the top frequent attribute value pairs, and auxiliary data, such as typical members of clusters, to disk. In contrast, the benefit of intermediate results and auxiliary data is high.

In scenario (a) we examined the effect of using typical cluster members as auxiliary data on the accuracy of a decision tree classification that succeeds a k-means clustering. We observed that tuples which are near the clusters' centres are good candidates for the training set.

In scenario (b) we examined the accuracy of naïve Bayes classifiers that were generated only by using frequencies of attribute value pairs calculated as intermediate results for the naïve Bayes classifier by an anteceding k-means clustering. These classifiers had about the same accuracy as conventional classifiers using high sampling rates but could be computed quickly from the intermediate results.

Furthermore, the general principle of our approach is widely applicable. For instance, with regard to ARA we have shown that pre-computing the frequencies of 1-itemsets saves exactly one scan of an association rule analysis without changing the result.

REFERENCES

Agrawal, R. and Srikant, R. (1994). Fast algorithms for

mining association rules. In Bocca, J. B., Jarke, M.,

and Zaniolo, C., editors, Proc. 20th Int. Conf. Very

Large Data Bases, VLDB, pages 487–499. Morgan

Kaufmann.

Bradley, P. S., Fayyad, U. M., and Reina, C. (1998). Scaling

clustering algorithms to large databases. In Knowl-

edge Discovery and Data Mining, pages 9–15.

Chiu, T., Fang, D., Chen, J., Wang, Y., and Jeris, C. (2001).

A robust and scalable clustering algorithm for mixed

type attributes in large database environment. In Pro-

ceedings of the seventh ACM SIGKDD international

conference on Knowledge discovery and data mining,

pages 263–268. ACM Press.

Cooley, R. (2000). Web Usage Mining: Discovery and Ap-

plication of Interesting Patterns from Web Data. PhD

thesis, University of Minnesota.

Dempster, A. P., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Dhillon, S. I., Kumar, R., and Mallela, S. (2002). Enhanced word clustering for hierarchical text classification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 191–200.

Facca, F. M. and Lanzi, P. L. (2005). Mining interesting

knowledge from weblogs: a survey. Data and Knowl-

edge Engineering, 53(3):225–241.

Freitas, A. A. (2002). Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag, Berlin.

Gehrke, J., Ramakrishnan, R., and Ganti, V. (2000). Rain-

forest - a framework for fast decision tree construc-

tion of large datasets. Data Mining and Knowledge

Discovery, 4(2/3):127–162.

Genther, H. and Glesner, M. (1994). Automatic generation

of a fuzzy classiﬁcation system using fuzzy cluster-

ing methods. In SAC ’94: Proceedings of the 1994

ACM symposium on Applied computing, pages 180–

183, New York, NY, USA. ACM Press.

Guha, S., Meyerson, A., Mishra, N., Motwani, R., and

O’Callaghan, L. (2003). Clustering data streams: The-

ory and practice. IEEE Transactions on Knowledge

and Data Engineering, 15.

Han, E.-H., Karypis, G., Kumar, V., and Mobasher, B.

(1997). Clustering based on association rule hyper-

graphs. In Research Issues on Data Mining and

Knowledge Discovery.

Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns

without candidate generation. In Chen, W., Naughton,

J., and Bernstein, P. A., editors, 2000 ACM SIGMOD

Intl. Conference on Management of Data, pages 1–12.

ACM Press.

Kim, K. M., Park, J. J., and Song, M. H. (2004). Binary

decision tree using genetic algorithm for recognizing

defect patterns of cold mill strip. In Proc. of the

Canadian Conference on AI 2004, pages 461 – 466.

Springer.

Kruengkrai, C. and Jaruskulchai, C. (2002). A parallel

learning algorithm for text classiﬁcation. In KDD

’02: Proceedings of the eighth ACM SIGKDD inter-

national conference on Knowledge discovery and data

mining, pages 201–206, New York, NY, USA. ACM

Press.

Lai, H. and Yang, T.-C. (2000). A group-based inference

approach to customized marketing on the web inte-

grating clustering and association rules techniques. In

HICSS ’00: Proceedings of the 33rd Hawaii Interna-

tional Conference on System Sciences-Volume 6, page

6054, Washington, DC, USA. IEEE Computer Soci-

ety.

Lent, B., Swami, A. N., and Widom, J. (1997). Clustering

association rules. In ICDE, pages 220–231.

Liu, B., Xia, Y., and Yu, P. S. (2000). Clustering through

decision tree construction. In CIKM ’00: Proceedings

of the ninth international conference on Information

and knowledge management, pages 20–29, New York,

NY, USA. ACM Press.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297.

O’Callaghan, L., Mishra, N., Meyerson, A., Guha, S., and

Motwani, R. (2002). High-performance clustering of

streams and large data sets. In Proc. of the 2002 Intl.

Conf. on Data Engineering (ICDE 2002), February

2002.

Zhang, D., Gunopulos, D., Tsotras, V. J., and Seeger, B.

(2003). Temporal and spatio-temporal aggregations

over data streams using multiple time granularities.

Inf. Syst., 28(1-2):61–84.
