A DISTRIBUTED ALGORITHM FOR MINING FUZZY ASSOCIATION RULES
George Stephanides
University of Macedonia, Department of Applied Informatics
156 Egnatia Street, 540 06 Thessaloniki GREECE
Mihai Gabroveanu, Mirel Cosulschi, Nicolae Constantinescu
University of Craiova, Computer Science Department
13 A.I. Cuza Street, 200585 Craiova ROMANIA
Keywords:
data mining, fuzzy association rules, distributed mining.
Abstract:
Data mining, also known as knowledge discovery in databases, is the process of discovering potentially useful, hidden knowledge or relations among data in large databases. An important topic in data mining research is the discovery of association rules. Most databases today are distributed. This paper presents an algorithm for mining fuzzy association rules from such distributed databases. The algorithm is inspired by the DMA (Distributed Mining of Association rules) algorithm for mining boolean association rules.
1 INTRODUCTION
Data mining, also known as knowledge discovery in databases, is the process of discovering potentially useful, hidden knowledge or relations among data in large databases. An important task in the data mining process is the discovery of association rules. An association rule describes an interesting relationship among different attributes.
The task of discovering association rules was first introduced in (Agrawal R., 1993). Many of the proposed algorithms for mining association rules are sequential. The most popular are Apriori (Rakesh Agrawal, 1994), DHP, and DIC. The basic problem of finding fuzzy association rules was introduced in (Chan Man Kuok, 1998).
Mining association rules based on fuzzy sets can handle both quantitative and categorical data, which makes it possible to use uncertain data types with existing algorithms. Most databases today are distributed. One example is the set of transaction records for customer operations registered by a chain of stores spread over many locations. The main problem here is to discover association rules from this distributed data. In this paper we introduce an algorithm for mining fuzzy association rules from such distributed databases. The algorithm is an adaptation of the DMA algorithm to fuzzy association rules.
2 PROBLEM DEFINITION
2.1 Sequential problem definition
The formal problem definition as in (Chan Man Kuok,
1998) is the following:
Let $DB = \{t_1, \ldots, t_n\}$ be a transactional database. We consider that this database is characterized by a set of categorical or quantitative attributes (items). Let $I = \{i_1, \ldots, i_m\}$ be the set of these attributes. We denote by $dom(i_k)$ the domain of values of the attribute $i_k$. For each attribute $i_k$ $(k = 1, \ldots, m)$ we consider $n(k)$ associated fuzzy sets. Let $F_{i_k} = \{f^{1}_{i_k}, \ldots, f^{n(k)}_{i_k}\}$ be the set of these fuzzy sets. For an attribute $i_k$ and a fuzzy set $f^{j}_{i_k}$, the membership function is $\mu_{f^{j}_{i_k}}$.
Definition 2.1. We call a fuzzy itemset the tuple $\langle X, F_X \rangle$, where $X \subseteq I$ and $F_X$ is a set of fuzzy sets associated with the items from $X$. A fuzzy itemset $\langle X, F_X \rangle$ is called a k-fuzzy itemset if the number of attributes in $X$ is $k$.
Definition 2.2. A fuzzy association rule is an implication of the form $X \in A \Rightarrow Y \in B$, where $X, Y \subseteq I$, $X \cap Y = \emptyset$, $X = \{x_1, \ldots, x_p\}$, $Y = \{y_1, \ldots, y_q\}$. $A = \{a_1, \ldots, a_p\}$ and $B = \{b_1, \ldots, b_q\}$ are the fuzzy sets related to the attributes from $X$ and $Y$, respectively. More exactly, $a_i \in F_{x_i}$ $(i = 1, \ldots, p)$ and $b_i \in F_{y_i}$ $(i = 1, \ldots, q)$.
Stephanides G., Gabroveanu M., Cosulschi M. and Constantinescu N. (2005).
A DISTRIBUTED ALGORITHM FOR MINING FUZZY ASSOCIATION RULES.
In Proceedings of the First International Conference on Web Information Systems and Technologies, pages 206-209
DOI: 10.5220/0001228802060209
Copyright © SciTePress
We denote this rule by $\langle X, A \rangle \Rightarrow \langle Y, B \rangle$.
The intuitive meaning of this fuzzy association rule is: "if a transaction (tuple) satisfies the property $X \in A$, then it will also satisfy the property $Y \in B$ with high probability".
Definition 2.3. The fuzzy support value of the itemset $\langle X, F_X \rangle$ in $DB$ is:
$$FS_{\langle X, F_X \rangle} = \frac{\sum_{t_i \in DB} \prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])}{|DB|}$$
where $a_j \in F_X$ is the fuzzy set associated with the item $x_j$,
$$\alpha_{a_j}(t_i[x_j]) = \begin{cases} \mu_{a_j}(t_i[x_j]), & \text{if } \mu_{a_j}(t_i[x_j]) \ge \omega \\ 0, & \text{otherwise} \end{cases}$$
and $\omega$ is a user-specified minimum threshold for the membership function. Thus, membership values below this minimum threshold are ignored.
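The computation in Definition 2.3 can be illustrated with a minimal Python sketch. The attribute names, the membership functions `young` and `high`, and the four sample transactions are hypothetical and serve only as an illustration; they do not come from the paper.

```python
from math import prod

def alpha(mu, value, omega):
    """Thresholded membership: mu(value) if it reaches omega, otherwise 0."""
    m = mu(value)
    return m if m >= omega else 0.0

def fuzzy_support(db, itemset, omega):
    """Fuzzy support of an itemset (Definition 2.3).

    db      : list of transactions, each a dict attribute -> value
    itemset : dict attribute -> membership function of the chosen fuzzy set
    """
    total = sum(
        prod(alpha(mu, t[x], omega) for x, mu in itemset.items())
        for t in db
    )
    return total / len(db)

# Hypothetical fuzzy sets, defined by piecewise-linear membership functions.
def young(age):
    return max(0.0, min(1.0, (40 - age) / 20))

def high(income):
    return max(0.0, min(1.0, (income - 1000) / 2000))

db = [
    {"age": 25, "income": 2500},
    {"age": 30, "income": 2000},
    {"age": 38, "income": 3000},   # young(38) = 0.1 is below omega, so it is ignored
    {"age": 50, "income": 4000},
]
itemset = {"age": young, "income": high}
print(fuzzy_support(db, itemset, omega=0.2))   # 0.203125
```

With $\omega = 0.2$ the third transaction contributes nothing, because its membership in `young` falls below the threshold; lowering $\omega$ to 0 would let it contribute $0.1 \cdot 1.0$.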
Definition 2.4. A fuzzy itemset $\langle X, F_X \rangle$ is called a large (frequent) fuzzy itemset if its fuzzy support value is greater than or equal to the minimum support threshold (minsup), namely $FS_{\langle X, F_X \rangle} \ge \mathit{minsup}$.
An association rule is considered interesting if it has enough support and a high confidence value. Such a rule is also known as a strong rule.
Problem 1 (Sequential Mining of Fuzzy Association Rules). Given the database $DB$ characterized by a set of attributes $I$, the fuzzy sets associated with the attributes from $I$, the minimum threshold $\omega$ for the membership function, the minimum support threshold (minsup) and the minimum confidence threshold (minconf), extract all interesting fuzzy association rules.
Definition 2.5. Let $\langle X, A \rangle \Rightarrow \langle Y, B \rangle$ be a fuzzy association rule. The fuzzy support value of the rule is defined as the fuzzy support value of the itemset $\langle \{X, Y\}, \{A, B\} \rangle$:
$$FS_{\langle X, A \rangle \Rightarrow \langle Y, B \rangle} = FS_{\langle \{X, Y\}, \{A, B\} \rangle}$$
Definition 2.6. A fuzzy association rule is called a frequent rule if its fuzzy support value is greater than or equal to the minimum support threshold (minsup), namely $FS_{\langle X, A \rangle \Rightarrow \langle Y, B \rangle} \ge \mathit{minsup}$.
Based on discovered large fuzzy itemsets we can
generate all possible frequent rules, but in order to be
interesting they must have a high confidence value.
Definition 2.7. Let $\langle X, A \rangle \Rightarrow \langle Y, B \rangle$ be a fuzzy association rule. The fuzzy confidence value of the rule is defined as:
$$FC_{\langle X, A \rangle \Rightarrow \langle Y, B \rangle} = \frac{FS_{\langle Z, C \rangle}}{FS_{\langle X, A \rangle}}$$
where $Z = \{X, Y\}$ and $C = \{A, B\}$.
The confidence of the rule is thus the ratio of the fuzzy support of the fuzzy itemset $\langle Z, C \rangle$ to the fuzzy support of the fuzzy itemset $\langle X, A \rangle$.
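As a small worked example of Definition 2.7, the sketch below computes the confidence of a rule from precomputed $\alpha$ values. The `rows` table, the item labels and the convention of returning 0 when the antecedent has zero support are assumptions made for this illustration.

```python
from math import prod

# Hypothetical thresholded membership degrees (alpha values), one dict per
# transaction, keyed by (attribute, fuzzy set) pairs.
YOUNG = ("age", "young")
HIGH = ("income", "high")
rows = [
    {YOUNG: 0.75, HIGH: 0.75},
    {YOUNG: 0.50, HIGH: 0.50},
    {YOUNG: 0.00, HIGH: 1.00},
    {YOUNG: 0.00, HIGH: 1.00},
]

def fuzzy_support(rows, items):
    """Sum over transactions of the product of alpha values, divided by |DB|."""
    return sum(prod(row[i] for i in items) for row in rows) / len(rows)

def fuzzy_confidence(rows, antecedent, consequent):
    """FS of the combined itemset divided by FS of the antecedent."""
    fs_x = fuzzy_support(rows, antecedent)
    if fs_x == 0:
        return 0.0  # assumed convention: no support, no confidence
    return fuzzy_support(rows, antecedent + consequent) / fs_x

fs_rule = fuzzy_support(rows, (YOUNG, HIGH))          # 0.203125
fs_x = fuzzy_support(rows, (YOUNG,))                  # 0.3125
print(fuzzy_confidence(rows, (YOUNG,), (HIGH,)))      # approximately 0.65
```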
Lemma 1. If a fuzzy itemset $\langle X, F_X \rangle$ is a large fuzzy itemset in $DB$, $Y \subseteq X$ and $F_Y \subseteq F_X$, then the fuzzy itemset $\langle Y, F_Y \rangle$ is also large in $DB$.
From the above lemma we can conclude that any fuzzy sub-itemset of a large fuzzy itemset is also large.
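One way to see why Lemma 1 holds is sketched below. The sketch assumes, as is standard for fuzzy sets, that every membership value (and hence every $\alpha$ value) lies in $[0, 1]$, and it reads $F_Y \subseteq F_X$ as saying that each item of $Y$ keeps the fuzzy set it has in $F_X$.

```latex
% Sketch; assumes 0 <= \alpha_{a_j}(t_i[x_j]) <= 1 for every item and transaction.
\begin{align*}
\prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])
  &= \prod_{x_j \in Y} \alpha_{a_j}(t_i[x_j]) \cdot
     \prod_{x_j \in X \setminus Y} \alpha_{a_j}(t_i[x_j])
   \le \prod_{x_j \in Y} \alpha_{a_j}(t_i[x_j]) \\
\Rightarrow\quad
FS_{\langle Y, F_Y \rangle}
  &= \frac{\sum_{t_i \in DB} \prod_{x_j \in Y} \alpha_{a_j}(t_i[x_j])}{|DB|}
   \ge \frac{\sum_{t_i \in DB} \prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])}{|DB|}
   = FS_{\langle X, F_X \rangle} \ge \mathit{minsup}
\end{align*}
```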
The problem of sequential mining of fuzzy association rules can be decomposed into two subproblems:
1. find all large fuzzy itemsets.
2. generate the fuzzy association rules from the large fuzzy itemsets found.
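The second subproblem can be sketched as follows. The paper does not spell out the rule-generation step, so this is only an assumed Apriori-style approach: every large itemset is split into an antecedent and a consequent, and the rule is kept if its confidence reaches minconf. The function name `generate_rules` and the support values are hypothetical. By Lemma 1, every antecedent is itself a large itemset, so its support is available in the input table.

```python
from itertools import combinations

def generate_rules(large, minconf):
    """Generate rules from large fuzzy itemsets.

    large : dict mapping frozenset of (attribute, fuzzy set) pairs -> fuzzy support
    Returns a list of (antecedent, consequent, support, confidence) tuples.
    """
    rules = []
    for itemset, fs in large.items():
        for size in range(1, len(itemset)):          # proper, non-empty antecedents
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                confidence = fs / large[x]           # large[x] exists by Lemma 1
                if confidence >= minconf:
                    rules.append((x, itemset - x, fs, confidence))
    return rules

YOUNG = ("age", "young")
HIGH = ("income", "high")
large = {                                            # hypothetical supports
    frozenset({YOUNG}): 0.3125,
    frozenset({HIGH}): 0.8125,
    frozenset({YOUNG, HIGH}): 0.203125,
}
rules = generate_rules(large, minconf=0.6)
for x, y, fs, fc in rules:
    print(sorted(x), "=>", sorted(y), fs, fc)
```

With these numbers only the rule from `YOUNG` to `HIGH` is kept (confidence about 0.65); the reverse rule has confidence 0.25 and is discarded.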
Most algorithms for mining fuzzy association rules (see (Gyenesei, 2000), (Hong T.P., 2000)) are based on the Apriori algorithm (Rakesh Agrawal, 1994).
2.2 Distributed problem definition
Let $DB = \{DB_1, DB_2, \ldots, DB_n\}$ be a distributed database over $n$ sites $S_1, S_2, \ldots, S_n$. We denote by $D$ the number of transactions in $DB$, and by $D_i$ the number of transactions in $DB_i$, for all $i = 1, \ldots, n$.
Definition 2.8. For a given fuzzy itemset $\langle X, F_X \rangle$, we call global fuzzy support value the fuzzy support value of $\langle X, F_X \rangle$ in $DB$, defined as:
$$FS_{\langle X, F_X \rangle} = \frac{\sum_{t_i \in DB} \prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])}{|DB|}$$
and the global fuzzy support count in $DB$ is defined as:
$$CFS_{\langle X, F_X \rangle} = \sum_{t_i \in DB} \prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])$$
Definition 2.9. For a given fuzzy itemset $\langle X, F_X \rangle$ and a database $DB_i$, we call local fuzzy support value in $DB_i$ the fuzzy support value of $\langle X, F_X \rangle$ in $DB_i$, defined as:
$$FS^{i}_{\langle X, F_X \rangle} = \frac{\sum_{t_l \in DB_i} \prod_{x_j \in X} \alpha_{a_j}(t_l[x_j])}{|DB_i|}$$
and the local fuzzy support count in $DB_i$ is defined as:
$$CFS^{i}_{\langle X, F_X \rangle} = \sum_{t_l \in DB_i} \prod_{x_j \in X} \alpha_{a_j}(t_l[x_j])$$
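The relation between local and global values can be illustrated with a short sketch. It assumes that the local databases are disjoint and together make up $DB$, so that the global count is the sum of the local counts and $D = D_1 + \ldots + D_n$. The two partitions and their $\alpha$ values are hypothetical.

```python
from math import prod

A = ("age", "young")      # hypothetical (attribute, fuzzy set) pairs
B = ("income", "high")

# Hypothetical thresholded membership degrees (alpha values), one dict per transaction.
db_1 = [{A: 0.75, B: 0.75}, {A: 0.50, B: 0.50}]   # stored at site S_1
db_2 = [{A: 0.00, B: 1.00}, {A: 0.00, B: 1.00}]   # stored at site S_2

def local_count(partition, itemset):
    """CFS^i: sum over local transactions of the product of alpha values."""
    return sum(prod(row[item] for item in itemset) for row in partition)

def local_support(partition, itemset):
    """FS^i: local count divided by the number of local transactions D_i."""
    return local_count(partition, itemset) / len(partition)

def global_support(partitions, itemset):
    """FS: sum of the local counts divided by D = D_1 + ... + D_n."""
    total = sum(local_count(p, itemset) for p in partitions)
    return total / sum(len(p) for p in partitions)

itemset = (A, B)
print(local_support(db_1, itemset))              # 0.40625
print(local_support(db_2, itemset))              # 0.0
print(global_support([db_1, db_2], itemset))     # 0.203125
```

Only the local counts, not the transactions themselves, have to be exchanged between sites to obtain the global support.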
Let minsup be the minimum support threshold.
Definition 2.10. A fuzzy itemset $\langle X, F_X \rangle$ is called a globally large fuzzy itemset if $FS_{\langle X, F_X \rangle} \ge \mathit{minsup}$.
Definition 2.11. A fuzzy itemset $\langle X, F_X \rangle$ is called a locally large fuzzy itemset at site $S_i$ if $FS^{i}_{\langle X, F_X \rangle} \ge \mathit{minsup}$.
Definition 2.12. If a fuzzy itemset $\langle X, F_X \rangle$ is both globally large and locally large at a site $S_i$, it is called a gl-large fuzzy itemset at site $S_i$.
In the following, we denote by $L$ the set of all globally large fuzzy itemsets in $DB$, and by $L^{(k)}$ the set of all globally large k-fuzzy itemsets in $DB$.
Problem 2 (Distributed Mining of Fuzzy Association Rules). Given the set of items $I$, the distributed database $DB = \{DB_1, DB_2, \ldots, DB_n\}$, the fuzzy sets associated with the attributes from $I$, the minimum support threshold (minsup) and the minimum confidence threshold (minconf), extract all global fuzzy association rules.
3 THE DISTRIBUTED ALGORITHM
In (Cheung D.W., 1996), the authors proposed the DMA algorithm for mining boolean association rules from distributed databases. The algorithm presented in this section adapts DMA to fuzzy association rules.
3.1 Generate set of candidate fuzzy itemsets
The set of candidate fuzzy itemsets is reduced on the basis of the following properties of globally large and locally large fuzzy itemsets:
Lemma 2. If a fuzzy itemset $\langle X, F_X \rangle$ is locally large at a site $S_i$, then all its subsets are also locally large at site $S_i$.
Lemma 3. If a fuzzy itemset $\langle X, F_X \rangle$ is globally large, then there exists a site $S_i$ $(1 \le i \le n)$ such that $\langle X, F_X \rangle$ is locally large at site $S_i$.
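Lemma 3 can be justified by a short counting argument, sketched here by contradiction. The sketch assumes that the local databases $DB_1, \ldots, DB_n$ are non-empty, disjoint and together make up $DB$, so that $D = \sum_i D_i$ and the global count is the sum of the local counts; it uses the support counts $CFS$ and $CFS^{i}$ of Definitions 2.8 and 2.9.

```latex
% Suppose <X, F_X> is not locally large at any site, i.e. FS^i < minsup for all i.
\begin{align*}
CFS^{i}_{\langle X, F_X \rangle} &< \mathit{minsup} \cdot D_i \quad (i = 1, \ldots, n) \\
CFS_{\langle X, F_X \rangle} = \sum_{i=1}^{n} CFS^{i}_{\langle X, F_X \rangle}
  &< \mathit{minsup} \cdot \sum_{i=1}^{n} D_i = \mathit{minsup} \cdot D \\
FS_{\langle X, F_X \rangle} = \frac{CFS_{\langle X, F_X \rangle}}{D} &< \mathit{minsup}
\end{align*}
% So <X, F_X> would not be globally large, which contradicts the hypothesis.
```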
Lemma 4. If a fuzzy itemset $\langle X, F_X \rangle$ is a gl-large fuzzy itemset at a site $S_i$ $(1 \le i \le n)$, then all its sub-fuzzy itemsets $\langle Y, F_Y \rangle$, $Y \subseteq X$, are also gl-large fuzzy itemsets at site $S_i$.
We use $GL_i$ to denote the set of all gl-large fuzzy itemsets at site $S_i$, and $GL_i^{(k)}$ to denote the set of all gl-large k-fuzzy itemsets at site $S_i$.
Lemma 5. If $\langle X, F_X \rangle \in L^{(k)}$ (i.e., it is a globally large k-fuzzy itemset), then there exists a site $S_i$ $(1 \le i \le n)$ such that $\langle X, F_X \rangle$ and all its $(k-1)$ sub-fuzzy itemsets are gl-large fuzzy itemsets at site $S_i$.
As in the DMA algorithm, which is an adaptation of the Apriori algorithm, at the k-th iteration the set of candidate sets is obtained by applying the Fuzzy_Apriori_Gen function to $L^{(k-1)}$. We denote this set by $CA^{(k)}$. More exactly,
$$CA^{(k)} = \mathrm{Fuzzy\_Apriori\_Gen}(L^{(k-1)}).$$
For each site $S_i$ $(1 \le i \le n)$, we denote by $CG_i^{(k)}$ the set of candidate fuzzy itemsets generated by applying Fuzzy_Apriori_Gen to $GL_i^{(k-1)}$, i.e.,
$$CG_i^{(k)} = \mathrm{Fuzzy\_Apriori\_Gen}(GL_i^{(k-1)}).$$
Because $GL_i^{(k-1)} \subseteq L^{(k-1)}$, $CG_i^{(k)}$ is a subset of $CA^{(k)}$. In the following, we denote $CG^{(k)} = \bigcup_{i=1}^{n} CG_i^{(k)}$.
Theorem 1. For every $k > 1$, the set of all globally large k-fuzzy itemsets $L^{(k)}$ is a subset of $CG^{(k)} = \bigcup_{i=1}^{n} CG_i^{(k)}$.
By Theorem 1, we can use the set $CG^{(k)}$, which is a superset of $L^{(k)}$, as the candidate set instead of $CA^{(k)}$; $CG^{(k)}$ can be much smaller than $CA^{(k)}$.
Thus the candidate set for $L^{(k)}$ is generated at the k-th iteration in the following manner: first, the set of candidate sets $CG_i^{(k)}$ is generated locally at each site $S_i$. After this step, the sites exchange fuzzy support counts and compute the sets of gl-large fuzzy itemsets $GL_i^{(k)}$. Based on $GL_i^{(k)}$, the candidate fuzzy itemsets at $S_i$ for the $(k+1)$-st iteration can then be generated.
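The paper does not define Fuzzy_Apriori_Gen in detail. The sketch below is therefore only an assumed Apriori-style join and prune step: two $(k-1)$-itemsets are joined when their union has $k$ items, a candidate is rejected if it uses the same attribute with two different fuzzy sets, and every $(k-1)$-subset of a candidate must itself be in the input set. The representation of items as (attribute, fuzzy set) pairs is also an assumption.

```python
from itertools import combinations

def fuzzy_apriori_gen(prev):
    """Candidate k-fuzzy itemsets from a set of (k-1)-fuzzy itemsets.

    prev : iterable of frozensets of (attribute, fuzzy set) pairs, all of size k-1
    """
    prev = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue                      # join step: must differ in exactly one item
        attributes = {attr for attr, _ in union}
        if len(attributes) != len(union):
            continue                      # one fuzzy set per attribute at most
        if all(union - {item} in prev for item in union):
            candidates.add(union)         # prune step: all (k-1)-subsets are present
    return candidates

YOUNG, OLD = ("age", "young"), ("age", "old")
HIGH = ("income", "high")
MANY = ("purchases", "many")

# Applied to GL_i^{(1)} at one site (hypothetical content):
gl_1 = {frozenset({YOUNG}), frozenset({OLD}), frozenset({HIGH})}
cg_2 = fuzzy_apriori_gen(gl_1)
print(sorted(sorted(c) for c in cg_2))
```

Here `cg_2` contains the pairs `{OLD, HIGH}` and `{YOUNG, HIGH}`, but not `{YOUNG, OLD}`, since both items refer to the attribute `age`. Applying the same function at each site to its own $GL_i^{(k-1)}$ and taking the union gives $CG^{(k)}$.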
3.2 Local pruning of candidate sets
Lemma 3 can be used to perform a local pruning of the set of candidate fuzzy itemsets. At a site $S_i$, after the set of candidate fuzzy itemsets $CG^{(k)}$ is generated, finding out whether a candidate fuzzy itemset $\langle X, F_X \rangle \in CG_i^{(k)}$ is a gl-large fuzzy itemset requires its fuzzy support count to be requested from all other sites. A local pruning technique lets us skip this request for some candidates. The basic idea is that if a candidate fuzzy itemset $\langle X, F_X \rangle \in CG_i^{(k)}$ is not locally large at site $S_i$, there is no need for $S_i$ to compute its global support to find out whether it is globally large. This is possible because in this case either $\langle X, F_X \rangle$ is not globally large, or it is locally large at some other site, and hence only the sites where $\langle X, F_X \rangle$ is locally large need to be responsible for finding its global support count. We use $LL_i^{(k)}$ to denote those candidate fuzzy itemsets in $CG_i^{(k)}$ which are locally large at site $S_i$.
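The local pruning step can be sketched as follows. The candidate names, the local counts and the site sizes are hypothetical, and the message exchange is replaced by direct access to the other site's counts.

```python
def locally_large(local_counts, minsup, d_i):
    """LL_i: candidates whose local fuzzy support count reaches minsup * D_i."""
    return {c for c, count in local_counts.items() if count >= minsup * d_i}

minsup = 0.25
sizes = [100, 300]                           # D_1, D_2 (hypothetical)
counts = [                                   # local counts CFS^i per candidate
    {"c1": 40.0, "c2": 10.0, "c3": 30.0},    # site S_1, local threshold 25
    {"c1": 30.0, "c2": 60.0, "c3": 90.0},    # site S_2, local threshold 75
]

ll = [locally_large(counts[i], minsup, sizes[i]) for i in range(len(sizes))]

# Global counts are requested only for candidates that are locally large somewhere.
requested = set().union(*ll)
d = sum(sizes)
globally_large = {
    c for c in requested
    if sum(site[c] for site in counts) >= minsup * d
}
print(ll, requested, globally_large)
```

Candidate `c2` is not locally large at either site, so no count request is sent for it; by Lemma 3 it cannot be globally large, and indeed its global count of 70 is below the global threshold of 100. Candidate `c1` is locally large at $S_1$ but not globally large, while `c3` turns out to be globally large.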
3.3 The algorithm outline
Algorithm 1 presents in detail the FUZZY-DMA algorithm for distributed mining of association rules. At every iteration (the k-th iteration), each site $S_i$ computes the set of gl-large fuzzy itemsets $GL_i^{(k)}$ at the site, and from these the set of all globally large fuzzy itemsets $L^{(k)}$ is computed.
Algorithm 1 FUZZY-DMA
INPUT:
$DB_1, \ldots, DB_n$ - the database partition at each site.
minsup - the minimum support threshold.
$F$ - the set of fuzzy sets associated with attributes from $I$.
OUTPUT:
$L$ - the set of all globally large fuzzy itemsets in $DB$.
METHOD: For all $k \ge 1$, iterate the following algorithm distributively at each site $S_i$. At the end of each step a synchronization is required to develop the global count. The algorithm terminates when either the returned $L^{(k)}$ is empty or the candidate set $CG^{(k)} = \emptyset$.
if $k = 1$ then
    $T_i^{(1)}$ = Get_Local_Fuzzy_Count($DB_i$, $\emptyset$, 1)
else
    $CG^{(k)} = \bigcup_{i=1}^{n} CG_i^{(k)} = \bigcup_{i=1}^{n}$ Fuzzy_Apriori_Gen($GL_i^{(k-1)}$)
    $T_i^{(k)}$ = Get_Local_Fuzzy_Count($DB_i$, $CG^{(k)}$, $i$)
for all $\langle X, A \rangle \in CG_i^{(k)}$ do
    if $CFS^{i}_{\langle X, A \rangle} \ge \mathit{minsup} \times D_i$ then
        insert $\langle X, A \rangle$ into $LL_i^{(k)}$
{Broadcast support count request to compute global fuzzy support count}
for $j = 1, \ldots, n$; $j \ne i$ do
    Broadcast_Count_Request($LL_i^{(k)}$, $S_j$)
{Receive support count request}
for $j = 1, \ldots, n$; $j \ne i$ do
    receive $LL_j^{(k)}$, extract $CFS^{i}_{\langle X, A \rangle}$ from $T_i^{(k)}$ and send it to $S_j$
{Compute global fuzzy support count}
for all $\langle X, A \rangle \in LL_i^{(k)}$ do
    receive $CFS^{j}_{\langle X, A \rangle}$ from sites $S_j$, where $j \ne i$
    $CFS_{\langle X, A \rangle} = \sum_{p=1}^{n} CFS^{p}_{\langle X, A \rangle}$
    if $CFS_{\langle X, A \rangle} \ge \mathit{minsup} \times D$ then
        insert $\langle X, A \rangle$ into $G_i^{(k)}$
broadcast $G_i^{(k)}$
{Compute global $L^{(k)}$}
receive $G_j^{(k)}$ from all other sites $S_j$ $(i \ne j)$
$L^{(k)} = \bigcup_{i=1}^{n} G_i^{(k)}$
return $L^{(k)}$
Initially, each site $S_i$ generates the complete set of global candidate fuzzy itemsets $CG^{(k)}$ from the globally large $(k-1)$-fuzzy itemsets $L^{(k-1)}$ obtained at the end of step $k-1$, as well as its own candidate fuzzy itemsets, obtained by applying the function Fuzzy_Apriori_Gen to the gl-large fuzzy itemsets $GL_i^{(k-1)}$ found at site $S_i$ in step $k-1$ (candidate set generation).
For each $\langle X, A \rangle \in CG^{(k)}$, site $S_i$ scans the database $DB_i$ to compute the local fuzzy support count $CFS^{i}_{\langle X, A \rangle}$ and stores it in the hash tree $T_i^{(k)}$ using the function Get_Local_Fuzzy_Count, and then generates the set of locally large fuzzy itemsets $LL_i^{(k)}$. After this, $S_i$ broadcasts the candidate fuzzy itemsets from $LL_i^{(k)}$ to the other sites to collect their fuzzy support counts. These fuzzy support counts are needed to compute the global support counts and to generate the set of all gl-large k-fuzzy itemsets at site $S_i$.
Finally, the computed gl-large fuzzy itemsets are broadcast to all other sites, which can then compute $L^{(k)}$. The algorithm stops when either the returned $L^{(k)}$ is empty or the candidate set $CG^{(k)}$ is empty.
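The iteration described above can be tried out with a single-process simulation. This is a minimal sketch under several assumptions, not the authors' implementation: sites are list entries, message exchange is replaced by direct access to the other sites' count tables, transactions are given as precomputed $\alpha$ values keyed by (attribute, fuzzy set) pairs, and `fuzzy_apriori_gen` is an assumed Apriori-style join and prune. All names and data are hypothetical.

```python
from itertools import combinations
from math import prod

def local_count(partition, itemset):
    """CFS^i: sum over local rows of the product of alpha values (0 if absent)."""
    return sum(prod(row.get(item, 0.0) for item in itemset) for row in partition)

def fuzzy_apriori_gen(prev):
    """Assumed Apriori-style join and prune on (k-1)-fuzzy itemsets."""
    prev, out = set(prev), set()
    for a, b in combinations(prev, 2):
        cand = a | b
        attrs = {attr for attr, _ in cand}
        if len(cand) == len(a) + 1 and len(attrs) == len(cand):
            if all(cand - {item} in prev for item in cand):
                out.add(cand)
    return out

def fuzzy_dma(partitions, minsup):
    """Simulate FUZZY-DMA; returns globally large itemsets with their support."""
    n = len(partitions)
    sizes = [len(p) for p in partitions]            # D_i
    d = sum(sizes)                                  # D
    large, k, gl = {}, 1, None
    while True:
        if k == 1:                                  # all 1-itemsets are candidates
            singles = {frozenset([i]) for p in partitions for row in p for i in row}
            cg_i = [set(singles) for _ in range(n)]
        else:                                       # CG_i^(k) from GL_i^(k-1)
            cg_i = [fuzzy_apriori_gen(gl[i]) for i in range(n)]
        cg = set().union(*cg_i)                     # CG^(k)
        if not cg:
            break
        # T_i^(k): every site counts all candidates of CG^(k) locally.
        t = [{c: local_count(p, c) for c in cg} for p in partitions]
        # Local pruning: LL_i^(k) keeps the locally large candidates of CG_i^(k).
        ll = [{c for c in cg_i[i] if t[i][c] >= minsup * sizes[i]} for i in range(n)]
        # Count exchange: global counts only for candidates in some LL_i^(k).
        total = {c: sum(t[j][c] for j in range(n)) for c in set().union(*ll)}
        gl = [{c for c in ll[i] if total[c] >= minsup * d} for i in range(n)]
        l_k = set().union(*gl)                      # L^(k)
        if not l_k:
            break
        for c in l_k:
            large[c] = total[c] / d
        k += 1
    return large

A, B, C = ("age", "young"), ("income", "high"), ("age", "old")
site_1 = [{A: 0.75, B: 0.75}, {A: 0.50, B: 0.50}]
site_2 = [{C: 1.00, B: 1.00}, {C: 0.50, B: 1.00}]
result = fuzzy_dma([site_1, site_2], minsup=0.25)
for itemset, fs in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), fs)
```

In this example the pair `{A, B}` is locally large at the first site (local count 0.8125 against a local threshold of 0.5) but not globally large (global threshold 1.0), whereas `{B, C}` is gl-large at the second site and is reported with support 0.375.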
4 CONCLUSION
This article proposes an algorithm that mines fuzzy association rules from distributed databases more efficiently than a sequential algorithm. In future work, we will study ways of automatically finding the fuzzy sets associated with database attributes. Another direction for improvement is the study of new relationships between locally and globally large itemsets, in order to reduce the number of messages exchanged among sites.
REFERENCES
Agrawal R., Imielinski T., S. A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, USA.
Chan Man Kuok, Ada Fu, M. H. W. (1998). Mining fuzzy association rules in databases. SIGMOD Rec., 27(1):41–46.
Cheung D.W., Jiawei Han, N. V. F. A. Y. F. (1996). A fast distributed algorithm for mining association rules. In 4th International Conference on Parallel and Distributed Information Systems (PDIS '96), pages 31–43. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD.
Gyenesei, A. (2000). Mining weighted association rules for fuzzy quantitative items. In Principles of Data Mining and Knowledge Discovery, pages 416–423.
Hong T.P., Kuo C.S., C. S. W. S. (2000). Mining fuzzy rules from quantitative data based on the AprioriTid algorithm. In Proceedings of the 2000 ACM Symposium on Applied Computing, pages 534–536.
Rakesh Agrawal, R. S. (1994). Fast algorithms for mining association rules. In Bocca, J. B., Jarke, M., and Zaniolo, C., editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann.