NEW FAST ALGORITHM FOR INCREMENTAL MINING OF ASSOCIATION RULES

Yasser El Sonbati, Rasha Kashef

Arab Academy for Science & Technology, College of Computing and Information Technology

P.O. Box 1029, Abu Kir, Alexandria, Egypt

Keywords: Data mining, Association rules, Incremental mining.

Abstract: Mining association rules is a well-studied problem, and several algorithms were presented for finding large

itemsets. In this paper we present a new algorithm for incremental discovery of large itemsets in an

increasing set of transactions. The proposed algorithm is based on partitioning the database and keeping a

summary of local large itemsets for each partition based on the concept of negative border technique. A

global summary for the whole database is also created to facilitate the fast updating of overall large

itemsets. When adding a new set of transactions to the database, the algorithm uses these summaries instead

of scanning the whole database, thus reducing the number of database scans. The results of applying the

new algorithm showed that the new technique is quite efficient, and in many respects superior to other

incremental algorithms like Fast Update Algorithm (FUP) and Update Large Itemsets (ULI).

1 INTRODUCTION

Data mining is the process of discovering potentially

valuable patterns, associations, trends, sequences

and dependencies in data [Agrawal et al., 1993][Agrawal et al., 1994][Agrawal and Yu 1998].

Mining association rules is one of the vital data

mining problems. An association rule is a relation

between items in a set of transactions. This rule must

have a statistical significance (support) with respect

to the whole database and its structure must have a

semantic perspective (confidence), as will be stated in more detail in Section 2. The Apriori algorithm [Agrawal et al., 1994] is the first successful algorithm for mining association rules. It introduces a method to generate the candidate itemsets $C_k$ in pass k using only the large itemsets $L_{k-1}$ of the previous pass.

The Direct Hashing and Pruning (DHP) algorithm [Park et al., 1997] is the next algorithm for efficient mining of association rules. It employs a hash technique to reduce the size of the candidate itemsets and the database. The Continuous Association Rule Mining Algorithm (CARMA) [Hidber 1998] allows the user to change the support threshold and continuously displays the resulting association rules with support and confidence bounds.

After some update activities, new transactions are added to the database. When this happens, rules that are no longer significant must be discarded, and new valid rules that satisfy the statistical and semantic constraints must be included. The Adaptive algorithm [Sarda and

Srinivas 1998] is not only incremental but also

adaptive in nature. By inferring the nature of the

incremental database, it can avoid unnecessary

database scans. The Fast Update algorithm (FUP) is

an incremental algorithm which makes use of past

mining results to speed up the mining process

[Cheung et al., 1996]. Update Large Itemsets

algorithm (ULI) [Thomas et al., 1997] uses negative

borders to decide when to scan the whole database.

Recently, the Fast Online Dynamic Association Rule Mining (FODARM) algorithm [Woon et al., 2002] was introduced for incremental mining in electronic commerce. It uses a novel tree structure known as a Support-Ordered Trie Itemset (SOTrieIT) structure to hold pre-processed transactional data. Another algorithm for Online Generation of Profile Association Rules is introduced in [Aggarwal et al., 2002]. A new approach to Online Generation of Association Rules [Aggarwal and Yu 2001] introduces the concept of storing the preprocessed data in such a way that online processing may be done by applying a graph theoretic search algorithm whose complexity is proportional to the size of the output.

El Sonbati Y. and Kashef R. (2004). NEW FAST ALGORITHM FOR INCREMENTAL MINING OF ASSOCIATION RULES. In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 275-281. DOI: 10.5220/0002624102750281. SciTePress.

The algorithm presented in this paper, NBP (Negative Border with Partitioning), is based on partitioning the database and keeping a summary for each partition. This

summary includes the locally large itemsets, their

negative border and any other previously counted

itemset in the partition. Another global summary

including the large and negative border itemsets is

also created for the whole database. When adding a

new set of transactions to the database, the NBP

applies the Update Large Itemsets (ULI)-like

algorithm [Thomas et al., 1997] that uses these

summaries instead of scanning the whole database,

thus reducing the number of database scans to less

than one scan.

The rest of the paper is organized as follows. The

next section gives a description of the association

rules mining (ARM) problem while section 3

presents the negative border with partitioning

algorithm. Section 4 describes performance analysis

of the proposed algorithm in comparison with some

related algorithms. Finally conclusions are discussed

in Section 5.

2 PROBLEM DESCRIPTION

The problem of association rules mining is described

in the following two subsections.

2.1 Mining of association rules

The problem of mining association rules is described as follows. Let the universal itemset $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals called items, and let D be a database of transactions, where each transaction T contains a set of items such that $T \subseteq I$. An itemset is a set of items, and a k-itemset is an itemset that contains exactly k items. For a given itemset $X \subseteq I$ and a given transaction T, T contains X if and only if $X \subseteq T$. The support count $\sigma_X$ of an itemset X is defined as the number of transactions in D containing X. An itemset is large, with respect to a support threshold of s%, if $\sigma_X \geq |D| \times s$, where |D| is the number of transactions in the database D. An association rule is an implication of the form "$X \Rightarrow Y$", where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The association rule $X \Rightarrow Y$ holds in the database with confidence c% if no less than c% of the transactions in D that contain X also contain Y. The rule $X \Rightarrow Y$ has support s% in D if $\sigma_{X \cup Y} = |D| \times s\%$. For a given pair of confidence and support thresholds, the problem of mining association rules is to find all the association rules that have confidence and support greater than the corresponding thresholds. This problem can be reduced to the problem of finding all large itemsets for the same support threshold [Agrawal et al., 1993].
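The definitions above can be illustrated with a minimal Python sketch. The function names (`support_count`, `is_large`, `confidence`) and the toy database are assumptions made for illustration only; they do not come from the paper, and the threshold s is passed as a fraction rather than a percentage.

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset` (sigma_X)."""
    return sum(1 for t in transactions if itemset <= t)


def is_large(itemset, transactions, s):
    """True when sigma_X >= |D| * s, with s given as a fraction (0.5 means 50%)."""
    return support_count(itemset, transactions) >= len(transactions) * s


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    sx = support_count(x, transactions)
    return support_count(x | y, transactions) / sx if sx else 0.0


# Hypothetical toy database D of four transactions.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

bread_milk = frozenset({"bread", "milk"})
print(support_count(bread_milk, D))                           # 2
print(is_large(bread_milk, D, 0.5))                           # True, since 2 >= 4 * 0.5
print(confidence(frozenset({"bread"}), frozenset({"milk"}), D))  # 2 of the 3 bread transactions
```

Transactions are stored as `frozenset` objects so that the containment test $X \subseteq T$ is simply `itemset <= t`.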

2.2 Update of association rules

After some update activities, new transactions are added to the original database D. When new transactions are added to the database, an old large itemset could potentially become small in the updated database. Similarly, an old small itemset could potentially become large in the new database. Let $\Delta^+$ be the set of newly added transactions (increment database), D' be the updated database where $D' = D \cup \Delta^+$, $\sigma'_X$ be the new support count of an itemset X in the updated database D', $L_{D'}$ be the set of large itemsets in D', $C_k$ be the set of candidate k-itemsets in D, and $\delta_X$ be the support count of an itemset X in the increment database $\Delta^+$.
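As a small worked example of this notation, the sketch below recomputes the status of one itemset after an increment, using $\sigma'_X = \sigma_X + \delta_X$ and the threshold $s \times (|D| + |\Delta^+|)$. The function name `update_itemset` and the numbers are hypothetical; s is passed as a fraction.

```python
def update_itemset(sigma, delta, d_size, inc_size, s):
    """Return (sigma', is_large) for one itemset after adding the increment.

    sigma    -- support count in the original database D
    delta    -- support count in the increment database
    d_size   -- |D|
    inc_size -- size of the increment database
    s        -- support threshold as a fraction
    """
    new_count = sigma + delta
    threshold = s * (d_size + inc_size)
    return new_count, new_count >= threshold


# |D| = 1000, increment of 200 transactions, s = 5%.
# An old large itemset (52 >= 50) becomes small: 57 < 60.
print(update_itemset(52, 5, 1000, 200, 0.05))    # (57, False)
# An old small itemset (45 < 50) becomes large: 65 >= 60.
print(update_itemset(45, 20, 1000, 200, 0.05))   # (65, True)
```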

3 THE PROPOSED ALGORITHM

In this section, we develop an efficient algorithm for

updating the association rules when new transactions

are added to the database. The proposed algorithm

uses negative borders [Thomas et al., 1997]. The

intuition behind the concept of negative border is

that, for a given set of large itemsets L, the negative

border NBd(L) contains the closest itemsets that

could be large too.

The symbols used in our algorithm are listed in Table 1.

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION


Table 1: Symbols of the proposed algorithm NBP.

Symbol            Definition
q                 Partition size
$|p_i|$           Cardinality of partition $p_i$
$L_{p_i}$         Large itemset of partition $p_i$
$NBd(L_{p_i})$    Negative border itemset of partition $p_i$
$NBd(L_D)$        Negative border itemset of original database D
$NBd(L_{D'})$     Negative border itemset of updated database D'
n                 Total number of partitions

3.1 Algorithm Description

The new algorithm can be described in two main

steps.

Preprocessing step

In this step we divide the original transactional database into a number of partitions of size q. For simplicity we assume that q is a multiple of $|\Delta^+|$ (in general, q can be of any size). The total number of partitions is assumed to be n. In the preprocessing step we evaluate, for each partition $p_i$, i = 1, 2, ..., n, the large itemset $L_{p_i}$ with its corresponding negative border itemset $NBd(L_{p_i})$. Each itemset in $L_{p_i} \cup NBd(L_{p_i})$ is stored with its corresponding support count in the partition $p_i$. We also compute the large itemset $L_D$ and the negative border itemset $NBd(L_D)$ for the whole database D.

The large and negative border itemsets of all partitions may be too numerous to fit into main memory, so we suggest adding another support threshold ls, the "local support threshold", expressed as a fraction of the global support threshold s, which limits the size of the negative border itemsets of all partitions. Itemsets with support count less than $ls \times s \times |p_i|$ are discarded from the negative border itemset of the partition $p_i$. The large itemsets of all partitions are kept unchanged, as they contain significant information about the most frequent itemsets.
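The pruning rule for a partition's negative border can be sketched as follows. This is only an illustration under stated assumptions: the function name `prune_negative_border` and the dictionary layout (itemset mapped to its local support count) are hypothetical, and both ls and s are passed as fractions.

```python
def prune_negative_border(nb_counts, ls, s, partition_size):
    """Keep only negative border itemsets with count >= ls * s * |p_i|.

    nb_counts      -- {itemset: support count in partition p_i}
    ls             -- local support threshold, a fraction of s
    s              -- global support threshold as a fraction
    partition_size -- |p_i|
    """
    cutoff = ls * s * partition_size
    return {x: c for x, c in nb_counts.items() if c >= cutoff}


# Partition of 1000 transactions, s = 2%, ls = 0.5: the cutoff is about 10.
nb = {
    frozenset("AB"): 15,
    frozenset("C"): 4,
    frozenset("AD"): 12,
}
kept = prune_negative_border(nb, ls=0.5, s=0.02, partition_size=1000)
print(sorted("".join(sorted(x)) for x in kept))   # ['AB', 'AD']
```

A smaller ls keeps more of the negative border in memory and so makes it less likely that a partition has to be rescanned later; a larger ls saves memory at the cost of more rescans.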

The pseudo code of the preprocessing step is given in Figures 1 and 2.

Function preprocessing(D, n, q)
  Divide the original database into n partitions, each partition with size q.
  for i = 1 to n do
     $L_{p_i}$ = large-itemset for partition $p_i$
     $NBd(L_{p_i})$ = Negativeborder_gen($L_{p_i}$)
  $L_D$ = large-itemset for the whole database
  $NBd(L_D)$ = Negativeborder_gen($L_D$)

Figure 1: A high-level description of the preprocessing step

Function Negativeborder_gen(L)
  Split L into $L_1, L_2, \ldots, L_r$, where r is the size of the largest itemset in L
  For all k = 1, 2, ..., r do
     Compute $C_{k+1}$ using apriori-gen($L_k$)   // Apriori [Agrawal et al., 1994]
  $L \cup NBd(L) = \bigcup_{i=2,\ldots,r+1} C_i \cup I_1$, where $I_1$ is the set of 1-itemsets.

Figure 2: The Negativeborder_gen function.
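The negative border computation of Figure 2 can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the names `apriori_gen` and `negative_border` are assumptions, itemsets are represented as `frozenset` objects, and candidate generation is done by a simple pairwise union rather than the prefix-based join of Apriori (the resulting candidate sets are the same).

```python
from itertools import combinations


def apriori_gen(large_k):
    """Candidates of size k+1 whose every k-subset belongs to `large_k`."""
    large_k = set(large_k)
    candidates = set()
    for a in large_k:
        for b in large_k:
            union = a | b
            # a and b must differ in exactly one item
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in large_k
                for sub in combinations(union, len(a))
            ):
                candidates.add(union)
    return candidates


def negative_border(large, items):
    """Itemsets outside `large` whose proper subsets are all in `large`."""
    large = set(large)
    by_size = {}
    for x in large:                      # split L into L_1, ..., L_r
        by_size.setdefault(len(x), set()).add(x)
    closure = {frozenset([i]) for i in items}   # I_1, all 1-itemsets
    for level in by_size.values():              # C_{k+1} from each L_k
        closure |= apriori_gen(level)
    return closure - large               # (union of C_i and I_1) minus L


L = {frozenset("A"), frozenset("B"), frozenset("C"),
     frozenset("AB"), frozenset("AC")}
nb = negative_border(L, "ABCD")
print(sorted("".join(sorted(x)) for x in nb))   # ['BC', 'D']
```

In the example, {B, C} is in the negative border because both B and C are large while {B, C} is not, and {D} is there because it is a 1-itemset that is not large. The candidate {A, B, C} is pruned since its subset {B, C} is not large.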

Updating Step

In this step we have the incremental database $\Delta^+$ and a set of n partitions with their corresponding large and negative border itemsets. With the assumption that the partition size q is a multiple of the incremental database size, $\Delta^+$ is added either to the last partition or to a new partition. The next step is to update the large itemsets $L_{p_n}$ and negative border itemsets $NBd(L_{p_n})$ of the partition $p_n$ using the NBP($p_n$, $\Delta^+$, $p_n$) function. If $\Delta^+$ is added to a new partition, evaluate $L_{p_{n+1}}$ and $NBd(L_{p_{n+1}})$ using Apriori as a level-wise algorithm (Apriori [Agrawal et al., 1994] generates only candidate itemsets; we get the negative border itemsets by applying the function Negativeborder_gen(L) to the large itemset resulting from the Apriori algorithm).

After updating the partitions, the next step is to update $L_D$ and $NBd(L_D)$ of the whole database to obtain the updated large itemset $L_{D'}$ and the updated negative border itemset $NBd(L_{D'})$. First we compute the large itemset $L_{\Delta^+}$ and the negative border itemset $NBd(L_{\Delta^+})$ of $\Delta^+$; simultaneously we count the support of all itemsets $t \in L_D \cup NBd(L_D)$ in $\Delta^+$. If an itemset $t \in L_D$ or $NBd(L_D)$ has minimum support in D', then t is added to $L_{D'}$; otherwise it is added to $NBd(L_{D'})$. For each itemset $x \in L_{\Delta^+} \cup NBd(L_{\Delta^+})$ with $x \notin L_D$ and $x \notin NBd(L_D)$, add x to $NBd(L_{D'})$. The change in $L_D$ could potentially change $NBd(L_D)$ also.


Therefore some itemsets may be missed in both $L_{D'}$ and $NBd(L_{D'})$. We define two sets: Large_to_Large (the set of itemsets that moved from $L_D$ to $L_{D'}$) and Negative_to_Large (the set of itemsets that moved from $NBd(L_D)$ to $L_{D'}$). Join Negative_to_Large with Large_to_Large to get a new set Self_Join_Set, using the function join($L_{k-1}$), which joins a set of large itemsets of length (k-1) with itself to get $C_k$, a set of candidate itemsets of length k. The join($L_{k-1}$) function is described in Figure 3.

function join($L_{k-1}$)
  $C_k = \emptyset$
  For each $X, Y \in L_{k-1}$
     if $X.item_1 = Y.item_1, \ldots, X.item_{k-2} = Y.item_{k-2}$, $X.item_{k-1} < Y.item_{k-1}$ then
        Z = $X.item_1, X.item_2, \ldots, X.item_{k-1}, Y.item_{k-1}$
        Insert Z into $C_k$
  // pruning step
  For all itemsets $c \in C_k$
     For all (k-1)-subsets s of c do
        if ($s \notin L_{k-1}$) then
           delete c from $C_k$
  Return $C_k$

Figure 3: High-level description of the join function
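The join and pruning steps of Figure 3 can be written as a short Python sketch. It assumes, for illustration only, that each itemset is stored as a tuple of items in sorted order, so that "the first k-2 items" is a tuple prefix; this representation is not specified in the paper.

```python
from itertools import combinations


def join(large_prev):
    """Build candidate k-itemsets from (k-1)-itemsets given as sorted tuples."""
    prev = set(large_prev)
    candidates = set()
    for x in prev:
        for y in prev:
            # join step: same first k-2 items, last item of x below last item of y
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                candidates.add(x + (y[-1],))
    # pruning step: every (k-1)-subset of a candidate must be in large_prev
    size = len(next(iter(prev))) if prev else 0
    return {c for c in candidates
            if all(sub in prev for sub in combinations(c, size))}


L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
print(join(L2))   # {('A', 'B', 'C')}
```

In this example the join step produces (A, B, C) and (B, C, D); the second one is removed by the pruning step because its subset (C, D) is not in the input set.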

For each itemset t ∈ Self_Join_Set, check all partitions $p_i$, i = 1, 2, ..., n. If t is found in the large itemset $L_{p_i}$ or the negative border itemset $NBd(L_{p_i})$, then update the support count of t. If t is found in neither $L_{p_i}$ nor $NBd(L_{p_i})$, then scan partition $p_i$ to get the support count of t. A partition is scanned once for all the itemsets that need to be counted in it. This means that at most one scan of the whole database (all partitions) is needed in the worst case. In general, the proposed algorithm needs a fraction of a scan to update the large and negative border itemsets of the updated database. We use the hash tree structure (Apriori [Agrawal et al., 1994]) to get the support count of a set of itemsets within a partition. If the support count of t is at least the support threshold of D', then add t to $L_{D'}$; otherwise add t to $NBd(L_{D'})$.

The NBP(D, $\Delta^+$, Partitions) function is described in Figure 4.

Function NBP(D, $\Delta^+$, Partitions)
  $L_{D'} = \emptyset$, $NBd(L_{D'}) = \emptyset$, Large_to_Large = $\emptyset$ and Negative_to_Large = $\emptyset$   // initialization
  Compute $L_{\Delta^+}$, $NBd(L_{\Delta^+})$
  If $|p_n| < q$ then
     Add $\Delta^+$ to $p_n$ and update the partition $p_n$
  else n++, Add $\Delta^+$ to the new partition $p_n$, Compute $L_{p_n}$, $NBd(L_{p_n})$
  For each itemset $x \in L_D$
     if ($\sigma_x + \delta_x \geq s \times (|D| + |\Delta^+|)$) then   // s: minimum support threshold
        add x to both $L_{D'}$ and Large_to_Large sets
     else add x to $NBd(L_{D'})$
  For each itemset $x \in NBd(L_D)$
     if ($\sigma_x + \delta_x \geq s \times (|D| + |\Delta^+|)$) then
        add x to both $L_{D'}$ and Negative_to_Large sets
     else add x to $NBd(L_{D'})$
  For each itemset $x \in L_{\Delta^+} \cup NBd(L_{\Delta^+})$, $x \notin L_D$ and $x \notin NBd(L_D)$ do
     add x to $NBd(L_{D'})$
  if $L_D \neq L_{D'}$ then
     ULNBd($L_{D'}$, $NBd(L_{D'})$, Large_to_Large, Negative_to_Large, Partitions)

Figure 4: Negative Border with Partitioning algorithm using function NBP()
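The two central loops of Figure 4, which redistribute the old large and negative border itemsets using only the stored counts and the counts in the increment, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `classify_global` and the dictionary-based summaries are hypothetical, the partition update and the handling of $L_{\Delta^+} \cup NBd(L_{\Delta^+})$ are left out, and s is a fraction.

```python
def classify_global(old_large, old_nb, inc_counts, new_total, s):
    """Redistribute old itemsets between L_D' and NBd(L_D').

    old_large, old_nb -- {itemset: support count in D}
    inc_counts        -- {itemset: support count in the increment}
    new_total         -- |D| plus the size of the increment
    s                 -- minimum support threshold as a fraction
    """
    threshold = s * new_total
    new_large, new_nb = {}, {}
    large_to_large, negative_to_large = set(), set()
    for source, moved in ((old_large, large_to_large),
                          (old_nb, negative_to_large)):
        for x, sigma in source.items():
            total = sigma + inc_counts.get(x, 0)   # sigma_x + delta_x
            if total >= threshold:
                new_large[x] = total
                moved.add(x)
            else:
                new_nb[x] = total
    return new_large, new_nb, large_to_large, negative_to_large


# |D| = 1000, increment of 100 transactions, s = 10%: the threshold is about 110.
old_large = {"A": 150, "B": 105}
old_nb = {"C": 95, "D": 20}
inc = {"A": 12, "B": 2, "C": 30, "D": 1}
new_large, new_nb, l2l, n2l = classify_global(old_large, old_nb, inc, 1100, 0.10)
print(new_large)   # {'A': 162, 'C': 125}
print(new_nb)      # {'B': 107, 'D': 21}
print(l2l, n2l)    # {'A'} {'C'}
```

No transaction of the original database is read in this step; a further pass over the partition summaries is only needed when the set of large itemsets has changed, as in the example where C moves from the negative border to the large itemsets.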

The pseudo code of the function ULNBd() is given in Figure 5.

ULNBd($L_{D'}$, $NBd(L_{D'})$, Large_to_Large, Negative_to_Large, Partitions)
  // generate all possible candidates "Self_Join_Set"
  // for the set of large items in the updated database D'
  $Self\_Join\_Set_1 = \emptyset$   // initialize Self_Join_Set of length 1 to be empty
  For k = 1, 2, ..., ℓ do   // ℓ: size of the largest itemset in Negative_to_Large
     $LL_k$ = set of itemsets with length k from Large_to_Large
     $NL_k$ = set of itemsets with length k from Negative_to_Large
     $Self\_Join\_Set_{k+1}$ = join($LL_k \cup NL_k \cup Self\_Join\_Set_k$)
  For i = 1, 2, ..., n do   // n: number of partitions
     $p_i$_itemsets = $\emptyset$   // $p_i$_itemsets: set of itemsets to be scanned within partition $p_i$
  For each itemset t ∈ Self_Join_Set do
     $\sigma'_t = 0$   // initialize support count of itemset t
     For i = 1, 2, ..., n do
        // search all partitions for the support count of all elements found in Self_Join_Set
        if $t \in L_{p_i}$ then
           $\sigma'_t = \sigma'_t$ + support count of t in $L_{p_i}$
        else if $t \in NBd(L_{p_i})$ then
           $\sigma'_t = \sigma'_t$ + support count of t in $NBd(L_{p_i})$
        else add t to $p_i$_itemsets
  For i = 1, 2, ..., n do
     if $p_i$_itemsets ≠ $\emptyset$ then
        Scan $p_i$ to get the support count of each itemset x ∈ $p_i$_itemsets
        // scanning using hash tree structure (Apriori [Agrawal et al., 1994])
  For each itemset t ∈ Self_Join_Set do
     // update support count of t after scanning all partitions
     if $\sigma'_t \geq minsup \times (|D| + |\Delta^+|)$ then
        add t to $L_{D'}$
     else add t to $NBd(L_{D'})$

Figure 5: Update Large and Negative Border of D' using ULNBd() function

The number of scans over the whole database needed by the NBP algorithm varies from 0 to 1. Zero scans are needed when the information required after adding the increment database is found in either the global summary of the whole database or the local summary of each partition. One scan is needed in the worst case, when the algorithm has to scan all partitions (the whole database) to get the count of some itemsets. In general, the algorithm needs a fraction of a scan to reach the final results.
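The "fraction of a scan" behaviour can be illustrated with a small Python sketch of the counting part of ULNBd. It is a simplification under stated assumptions: the name `count_with_summaries` is hypothetical, each partition summary is a single dictionary holding the counts of both its large and negative border itemsets, partitions are assumed to be of equal size, and a plain loop over the transactions stands in for the hash tree.

```python
def count_with_summaries(candidates, partitions):
    """Count candidates from partition summaries, scanning only when needed.

    partitions -- list of (summary, transactions) pairs, where summary maps an
                  itemset to its stored support count in that partition
    Returns (totals, fraction of partitions that had to be scanned).
    """
    totals = {t: 0 for t in candidates}
    scanned = 0
    for summary, transactions in partitions:
        missing = []
        for t in candidates:
            if t in summary:
                totals[t] += summary[t]      # count is already known
            else:
                missing.append(t)            # must be counted in this partition
        if missing:
            scanned += 1                     # one pass serves all missing itemsets
            for tx in transactions:
                for t in missing:
                    if t <= tx:
                        totals[t] += 1
    fraction = scanned / len(partitions) if partitions else 0.0
    return totals, fraction


ab = frozenset("AB")
p1 = ({ab: 3}, [])   # count found in the summary, transactions are not read
p2 = ({}, [frozenset("ABC"), frozenset("A"), frozenset("AB")])
totals, fraction = count_with_summaries([ab], [p1, p2])
print(totals[ab], fraction)   # 5 0.5
```

Here only the second partition is read, so the update costs half a scan of the database.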

4 PERFORMANCE ANALYSIS

In this section, the proposed algorithm is tested

using several test data to show its efficiency in

handling the problem of incremental mining of

association rules.

4.1 Generation of synthetic data

In this experiment, we used synthetic data as the

input database to the algorithms. The data are

generated using the same technique as introduced in

[Agrawal et al., 1994], modified in [Park et al.,

1997] and used in many algorithms like [Thomas et

al., 1997] and [Cheung et al., 1996]. Table 2 gives a

list of the parameters used in the data generation

method.

Table 2: Parameters for data generation

Parameter       Meaning
$|D|$           Number of transactions in original database
$|D'|$          Number of transactions in the updated database
$|\Delta^+|$    Number of added transactions
|T|             Mean size of transactions
|I|             Mean size of potentially large itemsets
|£|             Number of potentially large itemsets
N               Number of items

We use the notation Tx.Iy.Di+d, modified from the one used in [Agrawal et al., 1994], to denote an experiment using databases with the following sizes: |D| = i thousands, $|\Delta^+|$ = d thousands, |T| = x, and |I| = y. In the experiments we set N = 1000 and |£| = 2000. The increment database is generated as follows: we generate 100 thousand transactions, of which (100-d) thousand are used for the initial computation and d thousand are used as the increment, where d is the fractional size (in percentage) of the increment.

4.2 Experimental Results

In each experiment, we run the proposed algorithm

NBP on the previous test data. We compare the

execution time of the incremental algorithm NBP

with respect to running Apriori on the whole data

set. The proposed algorithm is tested using the

settings T10.I4.D100+d. The support threshold is

varied between 0.5% and 3.0%. For simplicity we assumed that the partition size q is a multiple of the size of the increment database $|\Delta^+|$. We ran the algorithm for q = 1, 2, 5 and 10 multiples of $|\Delta^+|$, with d = 1% of the whole database size. Figure 6 shows the speed up of the incremental algorithm over Apriori as the support threshold varies between 0.5% and 3.0%. On the test data, the NBP algorithm achieves an average speed up ranging from 6 to 67 in comparison with the Apriori algorithm.

Figure 6: Performance Ratio of NBP at $\Delta^+$ = 1% (speed up versus support threshold (%), from 3 down to 0.5, for partition sizes of 1, 2, 5 and 10 times $\Delta^+$)

Figures 7 and 8 show the experimental results when applying the new algorithm NBP to the same test data, but with d equal to 2% and 5% respectively. The support threshold varies between 0.5% and 3.0% in both experiments. It can be concluded from Figure 7 that the proposed algorithm has an average speed up ranging from 4 to 37 in comparison with the Apriori algorithm. From Figure 8, the NBP algorithm achieves an average speed up ranging from 2 to 14.


Figure 7: Performance Ratio of NBP at $\Delta^+$ = 2% (speed up versus support threshold (%), from 3 down to 0.5, for partition sizes of 1, 2, 5 and 10 times $\Delta^+$)

Figure 8: Performance Ratio of NBP at $\Delta^+$ = 5% (speed up versus support threshold (%), from 3 down to 0.5, for partition sizes of 1, 2, 5 and 10 times $\Delta^+$)

From Figures 6, 7 and 8, it can be seen that the proposed algorithm performs better at high support thresholds than at low ones. At high support thresholds, the possibility of getting new large itemsets from the original negative border is low, so the searching time within the partitions' large itemsets and negative border itemsets is small. At low support thresholds, there is a high probability that more itemsets migrate from the negative border to the set of large itemsets. This increases the possibility of scanning most partitions, which increases the execution time. Also, the speed up of the proposed algorithm is higher for smaller increment sizes, since the new algorithm needs to process less data. The NBP algorithm achieves better performance when the partition size is five times the size of the increment database $\Delta^+$ and the increment database is 1% of the whole database.

4.3 Comparisons with FUP

FUP may require O(k) scans over the whole database, where k is the size of the maximal large itemsets, while the new NBP algorithm needs a fraction of a scan to update the results. In this experiment, we run the proposed algorithm NBP on the previous test data and compare its execution time with that of FUP on the same data set. For support thresholds varying between 1.0% and 3.0% and $|\Delta^+|$ = 1%, Figure 9 shows that the proposed NBP algorithm has an average speed up ranging from 6 to 67, while the FUP algorithm achieves a speed up from 2 to 7, both measured against the Apriori algorithm.

Figure 9: Speed up of NBP against FUP (speed up versus support threshold (%), from 3 down to 1; series: NBP, FUP)

4.4 Comparisons with ULI

Running ULI is costly at high support thresholds, where the number of large itemsets is small, and at low support thresholds the probability that the negative border expands is higher, so ULI may have to scan the whole database. We run the proposed algorithm NBP on the previous test data and compare its execution time with that of ULI on the same data set. It is concluded from Figure 10 that, for support thresholds varying between 0.5% and 3.0% and $|\Delta^+|$ = 1%, the NBP algorithm has an average speed up ranging from 6 to 67, while the ULI algorithm achieves a speed up from 5 to 20, both measured against the Apriori algorithm.


Figure 10: Speed up of NBP against ULI (speed up versus support threshold (%); series: NBP, ULI)

5 CONCLUSIONS

In this paper a new algorithm, NBP (Negative Border with Partitioning), is presented for incremental mining of association rules. The

proposed algorithm is based on partitioning the

database, keeping a summary for each partition.

Another global summary including the large and

negative border itemsets is also created for the

whole database. When adding a new set of

transactions to the database, the NBP applies a ULI-

like algorithm that uses these summaries instead of

scanning the whole database, thus reducing the

number of database scans to less than one scan.

From the discussion of the algorithm and the experimental results, the following points can be concluded:

1. The new algorithm NBP can efficiently handle the problem of incremental mining of association rules. NBP shows better performance than the FUP and ULI algorithms.

2. The number of scans over the whole database needed by the NBP algorithm varies from 0 to 1.

3. NBP achieves a speed up from 6 to 67 against the Apriori algorithm for support thresholds varying from 0.5% to 3.0%.

REFERENCES

Agrawal, R., Imielinski, T. and Swami, A., 1993. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int. Conf.

Agrawal, R. and Srikant, R., 1994. Fast Algorithms for Mining Association Rules. Proc. VLDB Int. Conf.

Cheung, D.W., Lee, S.D. and Kao, B., 1998. A General Incremental Technique for Maintaining Discovered Association Rules. Proc. Database Systems for Advanced Applications, Int. Conf.

Park, J.S., Chen, M.S. and Yu, P.S., 1997. Using a Hash Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering.

Agrawal, C.C. and Yu, P.S., 1998. Mining Large Itemsets for Association Rules. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.

Savasere, A., Omiecinski, E. and Navathe, S., 1995. An Efficient Algorithm for Mining Association Rules in Large Databases. Proc. Very Large Databases (VLDB) Int. Conf.

Hidber, C., 1998. Online Association Rule Mining. Proc. ACM SIGMOD Int. Conf. on Management of Data.

Han, J., Pei, J. and Yin, Y., 2000. Mining Frequent Patterns without Candidate Generation. Proc. ACM SIGMOD Int. Conf. on Management of Data.

Woon, Ng, Y. W. and Das, A., 2002. Fast Online Association Rule Mining. IEEE Transactions on Knowledge and Data Engineering.

Sarda, N.L. and Srinivas, N.V., 1998. An Adaptive Algorithm for Incremental Mining of Association Rules. Proc. Database and Expert Systems, Int. Conf.

Thomas, S., Bodagala, S., Alsabti, K. and Ranka, S., 1997. An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases. Proc. Knowledge Discovery and Data Mining (KDD 97), Int. Conf.

Aggarwal, C.C., Sun, Z. and Yu, P.S., 2002. Fast Algorithms for Online Generation of Profile Association Rules. IEEE Transactions on Knowledge and Data Engineering, September 2002.

Cheung, D.W., Han, J., Ng, V.T. and Wong, C.Y., 1996. Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. Proc. Data Engineering, Int. Conf.

Aggarwal, C. and Yu, P., 2001. A New Approach for Online Generation of Association Rules. IEEE Transactions on Knowledge and Data Engineering.
