
 
Example 2: Consider the taxonomy tree in Figure 1.
Let $S_1$ = {<Orange, Beef>, <Apple, Chicken, Beef>}; then
LCG($S_1$) = <Fruit, Beef>. LCG($S_1$) cannot be <Fruit, Meat>,
since <Fruit, Beef> is a more specialized common transaction.
For $S_2$ = {<Orange, Milk>, <Apple, Cheese, Butter>},
LCG($S_2$) = <Fruit, Dairy>. Dairy represents Milk in the first
transaction and represents one of Cheese and Butter in the second
transaction. Thus one of Cheese or Butter is considered a
suppressed item. For $S_3$ = {<Orange, Apple>, <Orange, Banana, Milk>,
<Banana, Apple, Beef>}, LCG($S_3$) = <Fruit, Fruit>, which indicates
that all three transactions contain at least two items under Fruit.
Milk and Beef are suppressed items. For $S_4$ = {<Orange, Beef>,
<Apple, Milk>}, LCG($S_4$) = <Fruit, Food>, where Food represents
Beef in the first transaction and Milk in the second transaction.
Here the LCG contains both a parent and a child item.
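The "common generalized transaction" condition used in Example 2 can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it assumes a taxonomy inferred from the items named in the examples (Figure 1 is not reproduced here), and the names PARENT, generalizes, covers and is_common_generalization are hypothetical. It only tests whether a candidate t is a common generalization of S; it does not search for the least one.

```python
from itertools import permutations

# ASSUMPTION: child -> parent map inferred from the examples in the text;
# the actual Figure 1 may differ.
PARENT = {
    "Fruit": "Food", "Meat": "Food", "Dairy": "Food",
    "Orange": "Fruit", "Apple": "Fruit", "Banana": "Fruit",
    "Beef": "Meat", "Chicken": "Meat",
    "Milk": "Dairy", "Cheese": "Dairy", "Butter": "Dairy",
}

def generalizes(g, item):
    """True if g equals item or is an ancestor of item in the taxonomy."""
    while item is not None:
        if item == g:
            return True
        item = PARENT.get(item)
    return False

def covers(t, transaction):
    """True if every item of t can represent a distinct item of transaction."""
    if len(t) > len(transaction):
        return False
    # Brute force: try every ordered choice of len(t) distinct positions.
    return any(
        all(generalizes(g, x) for g, x in zip(t, chosen))
        for chosen in permutations(transaction, len(t))
    )

def is_common_generalization(t, S):
    """True if t covers every transaction in S."""
    return all(covers(t, tx) for tx in S)

S1 = [("Orange", "Beef"), ("Apple", "Chicken", "Beef")]
print(is_common_generalization(("Fruit", "Beef"), S1))   # True
print(is_common_generalization(("Fruit", "Meat"), S1))   # True, but less specialized
print(is_common_generalization(("Fruit", "Dairy"), S1))  # False
```

Both <Fruit, Beef> and <Fruit, Meat> pass the check for $S_1$; the LCG is the more specialized of the two. The brute-force matching is exponential in the transaction length and is meant only for small illustrations.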
Various metrics have been proposed in the literature to measure
the quality of generalized data, including the Classification
Metric (CM), the Generalized Loss Metric (LM) (Iyengar, 2002), and
the Discernibility Metric (DM) (Bayardo et al., 2005). We use LM to
measure item generalization distortion. A similar notion, NCP, has
also been employed for set-valued data (Terrovitis et al., 2008;
He et al., 2009). Let M be the total number of leaf nodes in the
taxonomy tree T, and let $M_p$ be the number of leaf nodes in the
subtree rooted at a node p. The Loss Metric for an item p, denoted
by LM(p), is defined as $(M_p - 1) / (M - 1)$. For the root item p,
LM(p) is 1; for a leaf item, LM(p) is 0. In words, LM captures the
degree of generalization of an item by the fraction of the leaf
items in the domain that are indistinguishable from it after the
generalization. For example, for the taxonomy in Figure 1,
LM(Fruit) = 2/7.
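The LM definition can be computed directly from leaf counts. The sketch below assumes an eight-leaf taxonomy inferred from the items named in the examples (three leaves under Fruit, two under Meat, three under Dairy), which is consistent with LM(Fruit) = 2/7 but is not taken from Figure 1 itself. The names CHILDREN, ROOT, leaf_count and loss_metric are hypothetical.

```python
from fractions import Fraction

# ASSUMPTION: parent -> children map inferred from the examples in the text;
# the actual Figure 1 may differ.
CHILDREN = {
    "Food": ["Fruit", "Meat", "Dairy"],
    "Fruit": ["Orange", "Apple", "Banana"],
    "Meat": ["Beef", "Chicken"],
    "Dairy": ["Milk", "Cheese", "Butter"],
}
ROOT = "Food"

def leaf_count(node):
    """Number of leaf nodes in the subtree rooted at node (M_p)."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return 1
    return sum(leaf_count(c) for c in kids)

def loss_metric(node):
    """LM(p) = (M_p - 1) / (M - 1), as an exact fraction."""
    m = leaf_count(ROOT)
    return Fraction(leaf_count(node) - 1, m - 1)

print(loss_metric("Fruit"))  # 2/7
print(loss_metric("Beef"))   # 0
print(loss_metric("Food"))   # 1
```

Exact fractions are used so that the values can be compared with the ones in the text without rounding.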
Suppose that we generalize every transaction in a subset of
transactions S to a common generalized transaction t, and we want
to measure the distortion of this generalization. Recall that every
item in t represents one distinct item in each transaction in S
(Definition 1). Therefore, each item in t generalizes exactly |S|
items, one from each transaction in S, where |S| is the number of
transactions in S. The remaining items in a transaction (those not
generalized by any item in t) are suppressed items. The distortion
of this generalization is therefore the sum of the distortion for
generalized items, $|S| \sum_{i \in t} \mathrm{LM}(i)$, and the
distortion for suppressed items. For each suppressed item, we
charge the same distortion as if it were generalized to the root
item, i.e., 1.
Definition 3 (GGD). Suppose that we generalize every transaction
in a set of transactions S to a common generalized transaction t.
The Group Generalization Distortion of the generalization is
defined as
$\mathrm{GGD}(S, t) = |S| \sum_{i \in t} \mathrm{LM}(i) + N_s$,
where $N_s$ is the number of occurrences of suppressed items.
To minimize the distortion, we shall generalize S to the least
common generalization LCG(S), which has the distortion
GGD(S, LCG(S)).
Example 3: Consider the taxonomy in Figure 1 and
$S_1$ = {<Orange, Beef>, <Apple, Chicken, Beef>}. We have
LCG($S_1$) = <Fruit, Beef>, LM(Fruit) = 2/7, LM(Beef) = 0, and
$|S_1|$ = 2. Since Chicken is the only suppressed item, $N_s$ = 1.
Thus GGD($S_1$, LCG($S_1$)) = 2(2/7 + 0) + 1 = 11/7.
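The arithmetic of Example 3 can be reproduced with a few lines of code. This is a minimal sketch under two assumptions: the LM values come from an eight-leaf taxonomy inferred from the examples (not from Figure 1 itself), and t is already known to be a common generalization of S, so that every transaction contributes exactly len(tx) - len(t) suppressed items. The names LM, lm and ggd are hypothetical.

```python
from fractions import Fraction

# ASSUMPTION: LM values of the internal nodes under the inferred
# eight-leaf taxonomy; leaf items have LM = 0.
LM = {
    "Food": Fraction(1),
    "Fruit": Fraction(2, 7),
    "Meat": Fraction(1, 7),
    "Dairy": Fraction(2, 7),
}

def lm(item):
    """Loss Metric of an item; anything not listed is treated as a leaf."""
    return LM.get(item, Fraction(0))

def ggd(S, t):
    """GGD(S, t) = |S| * sum of LM(i) for i in t, plus N_s.

    Assumes t is a valid common generalization of S, so each transaction
    has len(tx) - len(t) suppressed items.
    """
    generalized = len(S) * sum(lm(i) for i in t)
    suppressed = sum(len(tx) - len(t) for tx in S)
    return generalized + suppressed

S1 = [("Orange", "Beef"), ("Apple", "Chicken", "Beef")]
print(ggd(S1, ("Fruit", "Beef")))  # 11/7
```

Applying the same function to $S_3$ with <Fruit, Fruit> gives 3(2/7 + 2/7) + 2 = 26/7, since Milk and Beef are each suppressed once.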
2.3  Problem Definition 
We adopt the transactional k-anonymity of He et al. (2009) as our
privacy notion. A transaction database D is k-anonymous if, for
every transaction in D, there are at least k-1 other identical
transactions in D. Therefore, in a k-anonymous D, if one
transaction is linked to an individual, so are at least k-1 other
transactions, and the adversary can link a specific transaction to
the individual with probability at most 1/k. For example, the last
column in Table 1 is a 2-anonymous transaction database.
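The k-anonymity condition amounts to counting identical transactions. The following sketch treats each transaction as a multiset of items (a generalized transaction such as <Fruit, Fruit> may repeat an item), so item order is ignored. The function name is_k_anonymous and the sample database are hypothetical; the database is not the content of Table 1.

```python
from collections import Counter

def is_k_anonymous(db, k):
    """True if every transaction in db occurs at least k times.

    Transactions are compared as multisets: the items are sorted so that
    <Fruit, Beef> and <Beef, Fruit> count as identical.
    """
    counts = Counter(tuple(sorted(tx)) for tx in db)
    return all(c >= k for c in counts.values())

# Hypothetical generalized database, for illustration only.
db = [
    ("Fruit", "Beef"),
    ("Beef", "Fruit"),
    ("Fruit", "Fruit"),
    ("Fruit", "Fruit"),
]
print(is_k_anonymous(db, 2))  # True
print(is_k_anonymous(db, 3))  # False
```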
Definition 5 (Transaction Anonymization). Given a transaction
database D, a taxonomy of items, and a privacy parameter k, we want
to find the clustering $C = \{S_1, \ldots, S_n\}$ of D such that
$S_1, \ldots, S_n$ are pair-wise disjoint subsets of D with each
$S_i$ containing at least k transactions from D, and
$\sum_{i=1}^{|C|} \mathrm{GGD}(S_i, \mathrm{LCG}(S_i))$
is minimized.
Let $C = \{S_1, \ldots, S_n\}$ be a solution to the above
anonymization problem. A k-anonymized database of D can be obtained
by generalizing every transaction in $S_i$ to LCG($S_i$),
$i = 1, \ldots, n$.
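The last step, turning a clustering into a released database, can be sketched as follows. This is an illustration only: the LCG of each cluster is passed in as precomputed input (here copied from Example 2), because computing LCG is a separate matter, and the function name anonymize is hypothetical.

```python
def anonymize(clusters, lcgs, k):
    """Replace every transaction in clusters[i] by lcgs[i].

    clusters: list of clusters, each a list of transactions.
    lcgs:     one generalized transaction per cluster (ASSUMED precomputed).
    k:        privacy parameter; every cluster must have at least k members.
    """
    if len(clusters) != len(lcgs):
        raise ValueError("need exactly one generalized transaction per cluster")
    if any(len(S) < k for S in clusters):
        raise ValueError("every cluster needs at least k transactions")
    released = []
    for S, t in zip(clusters, lcgs):
        # Each of the |S| transactions is published as the same tuple t.
        released.extend([tuple(t)] * len(S))
    return released

S1 = [("Orange", "Beef"), ("Apple", "Chicken", "Beef")]
S2 = [("Orange", "Milk"), ("Apple", "Cheese", "Butter")]
released = anonymize([S1, S2], [("Fruit", "Beef"), ("Fruit", "Dairy")], k=2)
print(released)
# [('Fruit', 'Beef'), ('Fruit', 'Beef'), ('Fruit', 'Dairy'), ('Fruit', 'Dairy')]
```

Because every cluster has at least k members and all of them are published as the same generalized transaction, each released transaction has at least k-1 identical copies.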
3 CLUSTERING APPROACH 
In this section we present our algorithm Clump for solving the
problem defined in Definition 5. In general, the problem of finding
an optimal k-anonymization is NP-hard for k ≥ 3 (Meyerson et al.,
2004). Thus, we focus on an efficient heuristic solution to this
problem and evaluate its effectiveness empirically. In this
section, we assume that the functions LCG(S) and GGD(S, LCG(S)) are
given. We will discuss the details of computing these functions in
Section 4.
The central idea of our algorithm is to group transactions in
order to reduce GGD($S_i$, LCG($S_i$)), subject to the constraint
that $S_i$ contains at least k
SECRYPT 2010 - International Conference on Security and Cryptography