Classification of Datasets with Frequent Itemsets is Wild
Natalia Vanetik
Department of Software Engineering, Sami Shamoon College of Engineering, Beer Sheva, Israel
Keywords:
Frequent Itemsets, Dataset Classification.
Abstract:
The problem of dataset classification with frequent itemsets is defined as the problem of determining whether or not two different datasets have the same frequent itemsets without computing these itemsets explicitly. The reasoning behind this approach is the high computational cost of computing frequent itemsets. Finding well-defined and understandable normal forms for this classification task would be a breakthrough in the field of dataset classification. The paper proves that classification of datasets with frequent itemsets is a hopeless task, since canonical forms do not exist for this problem.
1 INTRODUCTION
Suppose that we are given a dataset that consists of transactions (tuples), each containing one or more items. Frequent itemsets are subsets of items that appear in a large fraction of dataset tuples, where the exact fraction value is defined by the user and is called the support. Frequent pattern mining was proposed by Agrawal (Agrawal, Srikant 1994) for shopping basket analysis; both frequent itemsets and association rules were introduced in that paper. Many additional algorithms have been suggested over the years, such as FPGrowth (Han, Pei, Yin 2000), Eclat (Zaki 2000), GenMax (Gouda, Zaki 2005) and many others. This problem has numerous applications in both theoretical and practical knowledge discovery, but its computational complexity is another matter. It has been shown that generating and counting frequent itemsets is #P-complete (see (Yang 2004)).
We focus here on using frequent itemsets in datasets as the means for dataset characterization. Frequent itemsets are an important dataset property, and they have been used as a classifying feature in virus signature detection (see (Ye et al. 2007)), text categorization (see (Zaïane, Antonie 2002)) and biological data mining (see (Zaki et al. 2010)). A thorough experimental study of the issue can be found in (Flouvat, De Marchi, Petit 2010). The classifying feature problem is also computationally difficult, as it requires computing the frequent itemsets of each dataset. In (Palmerini, Orlando, Perego 2004), the authors proposed a statistical property of transactional datasets to characterize dataset density. The paper (Parthasarathy, Ogihara 2000) proposes a similarity measure for homogeneous datasets that is based on the frequent patterns appearing in these datasets; this measure is then used to enable dataset clustering. The diff operator, proposed in (Subramonian 1998), is another correlation indicator between datasets that captures user beliefs in terms of events and conditional probabilities.
This paper addresses the question of whether frequent itemsets can serve as a classifying feature of datasets. The answer is given in category-theoretic form: we prove that the task of finding well-structured normal forms for frequent itemsets is, in this case, a hopeless one.
2 PROBLEM STATEMENT
Let D be a dataset composed of transactions {t_1, ..., t_m}, where m is the dataset size. Each transaction contains items from a finite set V. The size of V, denoted by n, is called the cardinality of the dataset. The number of items may vary from transaction to transaction, but the items in each transaction form a set, i.e. they cannot appear more than once. Additionally, we are given a support value 1 ≤ S ≤ m. A set I of items (an itemset) is called frequent if it appears in at least S transactions as a subset.
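As an illustration of this definition, the following sketch enumerates frequent itemsets by brute force. It is not one of the efficient algorithms named below; the function name and the toy dataset are ours and serve only to make the definition of support concrete.

```python
# A minimal brute-force sketch of the definition above: an itemset I is
# frequent in dataset D if at least S transactions contain I as a subset.
from itertools import combinations

def frequent_itemsets(dataset, support):
    """Return all frequent itemsets of `dataset` for the given support count."""
    items = sorted(set().union(*dataset))            # the item set V
    frequent = []
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):     # every itemset I of size k
            count = sum(1 for t in dataset if set(candidate) <= t)
            if count >= support:                     # I appears in at least S transactions
                frequent.append(frozenset(candidate))
    return frequent

# Example: m = 4 transactions over V = {a, b, c}; with S = 2 the frequent
# itemsets are {a}, {b}, {c}, {a, b} and {a, c}.
D = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
print(frequent_itemsets(D, support=2))
```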
The three main approaches to frequent itemset generation are Apriori (Agrawal, Srikant 1994), FPGrowth (Han, Pei, Yin 2000) and Eclat (Zaki 2000). This task is computationally expensive, as the number of frequent itemsets in D can be exponential in |V|. We ask the following question:
Is it possible to classify datasets according to their
frequent itemsets without explicit computation?
Figure 1: Dataset equivalence as frequent itemset equality (Datasets 1 and 2 share the same family of frequent itemsets F_1, while Dataset 3 has a different family F_3).
3 CLASSIFICATION OF DATASETS
3.1 Canonical Forms
Suppose now that datasets D_1 and D_2 are defined on a set V of items, and we are given support constraints S_1 and S_2. We wish to check whether or not datasets D_1 and D_2 are equivalent up to their frequent itemsets. In other words, we ask whether the sets of frequent itemsets F_1 and F_2 of D_1 and D_2 are identical. Figure 1 shows an example of such classification.
There are three main ways of answering this question, some computationally harder than others.
1. (approach 1) Compute both sets of frequent itemsets and find their symmetric difference (a naive sketch of this approach appears after this list).
This type of answer is usually the hardest one to compute.
2. (approach 2) Find an algorithm that compares the sets of frequent itemsets without computing them and provides a "yes" or "no" answer.
This type of answer can sometimes be easier to achieve.
3. (approach 3) Find canonical representations for both datasets, i.e. find mathematical objects whose equality implies dataset equivalence and whose inequality implies lack of equivalence.
This type of answer tells us everything there is to know about dataset equivalence. Once canonical representations are found, their comparison should be easy and straightforward.
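The following sketch illustrates approach 1 only; the helper is a compact restatement of the brute-force enumeration from Section 2, and the function names and toy datasets are ours. It makes explicit why this approach is the most expensive one: both collections of frequent itemsets are computed in full before they are compared.

```python
# A minimal sketch of approach 1: compute both collections of frequent
# itemsets explicitly and compare them via symmetric difference.
from itertools import combinations

def frequent_itemsets(dataset, support):
    items = sorted(set().union(*dataset))
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(1 for t in dataset if set(c) <= t) >= support}

def equivalent(d1, s1, d2, s2):
    """True iff the two datasets have exactly the same frequent itemsets."""
    f1, f2 = frequent_itemsets(d1, s1), frequent_itemsets(d2, s2)
    return not (f1 ^ f2)             # empty symmetric difference means equivalence

D1 = [{"a", "b"}, {"a", "b"}, {"c"}]
D2 = [{"a", "b", "c"}, {"a", "b"}]
print(equivalent(D1, 2, D2, 2))      # True: both have {a}, {b} and {a, b} frequent
```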
In the case of dataset equivalence, approach 1 is very hard to implement, since enumeration of frequent itemsets in a dataset is a #P-complete problem by reduction from the problem of determining the number of maximal bipartite cliques in a bipartite graph. This fact is used in the proof of Claim 5 below. Approach 2 can provide the answer faster than computing the sets of frequent itemsets in full, since a difference between the frequent sets may be detected at an early stage of frequent itemset enumeration, thus saving a great deal of time and effort. But do we have any hope of finding a classification for the sets of frequent itemsets of various datasets using approach 3? The most prominent advantage of this approach is that such a representation needs to be computed only once for each dataset, thus eliminating the need for pairwise dataset comparison. However, the answer to this question is no, and the explanation follows.
Figure 2: Jordan normal form of a square complex matrix (a block-diagonal matrix of Jordan blocks, with eigenvalues λ_1, ..., λ_k on the diagonal and 1s on the superdiagonal of each block).
The problem of classification of objects in a small category is defined as the problem of finding canonical forms for the objects in that category. The issue of deciding on equivalence of two objects up to a certain set of transformations then reduces to the problem of comparing their canonical forms up to equality. In our case, the category of datasets consists of finite datasets identical up to row permutations.
For some problems, such as matrix similarity,
well-structured canonical forms exist.
Example 1. Let A, B be n × n matrices over an algebraically closed field (finite or infinite). The matrices are similar if there exists an invertible matrix S such that A = S^{-1}BS. The three approaches to equivalence discovery can be reformulated for matrix similarity as follows.
(1) Test all possible matrices S until one that ensures A = S^{-1}BS is found. This approach is infeasible over infinite fields and #P-hard over finite fields.
(2) Find an algorithm that attempts to construct the matrix S from A and B. Such algorithms exist, and they are quite efficient (see e.g. (Giesbrecht 1995)).
(3) Find canonical representations for both matrices. Canonical representations of square matrices over fields with respect to similarity exist and are called Jordan normal forms. These representations are also square matrices over the same field that have a special structure. Figure 2 shows the Jordan normal form for matrices over C.
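As a small illustration of approach (3) in this example, the sketch below compares two matrices through their Jordan normal forms using SymPy's jordan_form method. It assumes that SymPy orders Jordan blocks consistently for similar matrices, so equal outputs indicate similarity; the matrices used are our own toy examples.

```python
# Sketch of canonical-form comparison for matrix similarity via SymPy.
from sympy import Matrix

def similar(a, b):
    """True iff the two matrices have the same Jordan normal form."""
    _, ja = a.jordan_form()   # jordan_form() returns (P, J) with A = P*J*P**-1
    _, jb = b.jordan_form()
    return ja == jb           # assumes consistent ordering of Jordan blocks

A = Matrix([[2, 1], [0, 2]])            # a single Jordan block J_2(2)
S = Matrix([[1, 0], [1, 1]])
B = S.inv() * A * S                     # similar to A by construction
print(similar(A, B))                           # True
print(similar(A, Matrix([[2, 0], [0, 2]])))    # False: different block structure
```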
Existence of canonical forms is not limited to matrix problems. For instance, the fundamental theorem of finite abelian groups describes canonical forms that do not have matrix form.
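As a standard worked instance (not taken from the cited works), the canonical decomposition into cyclic groups of prime-power order distinguishes the two abelian groups of order 12:

```latex
% Fundamental theorem of finite abelian groups: every finite abelian group
% decomposes, uniquely up to ordering, into cyclic groups of prime-power
% order; this decomposition is its canonical form.
\[
\mathbb{Z}_{12} \;\cong\; \mathbb{Z}_{4} \oplus \mathbb{Z}_{3},
\qquad
\mathbb{Z}_{2} \oplus \mathbb{Z}_{6} \;\cong\; \mathbb{Z}_{2} \oplus \mathbb{Z}_{2} \oplus \mathbb{Z}_{3}.
\]
% The two canonical forms differ, so the two groups are not isomorphic.
```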
ClassificationofDatasetswithFrequentItemsetsisWild
387
3.2 Matrix Problems
Let us, following (Belitskii, Sergeichuk 2003), define a matrix problem as a pair 𝒜 = {𝒜_1, 𝒜_2}, where 𝒜_1 is a set of a-tuples (A_1, ..., A_a) of m × n matrices over an algebraically closed field and 𝒜_2 is a set of operations on the tuples from 𝒜_1. Given two matrix problems 𝒜 = {𝒜_1, 𝒜_2} and ℬ = {ℬ_1, ℬ_2}, 𝒜 is said to be contained in ℬ if there exists a b-tuple T(x) = T(x_1, ..., x_a) of matrices, whose entries are non-commutative polynomials in x_1, ..., x_a, such that
1. T(A) = T(A_1, ..., A_a) ∈ ℬ_1 whenever A = (A_1, ..., A_a) ∈ 𝒜_1;
2. for every A, A′ ∈ 𝒜_1, A reduces to A′ by the transformations in 𝒜_2 if and only if T(A) reduces to T(A′) by the transformations in ℬ_2.
Example 2. For the problem of matrix similarity, 𝒜_1 contains the square matrices over C and 𝒜_2 contains the operation of multiplying a matrix on the right by an invertible matrix S and on the left by its inverse S^{-1}.
The problem of simultaneous similarity for pairs of n × n matrices over a field is defined as follows: pairs (A, B) and (C, D) are simultaneously similar if and only if there exists an invertible matrix S such that C = S^{-1}AS and D = S^{-1}BS. In this case, canonical forms do not exist (see (Drozd 1980) and (Gabriel 1972) for a detailed explanation).
Problems containing the matrix problem of simultaneous similarity for pairs of matrices are called wild, as opposed to tame problems, for which canonical forms exist. In other words, instances of some problems cannot be packed into convenient and self-explanatory classes.
Example 3. The problem of classifying finite groups (even if they are 2-nilpotent) is wild, since it contains the problem of classifying pairs of matrices up to simultaneous similarity (see (Sergeichuk 1975)).
3.3 Dataset Classification
One surprising corollary of the work (Sergeichuk 1975) is the following.
Corollary 4 ((Vanetik, Lipyanski 2010)). The problem of classifying graphs up to isomorphism is wild, i.e. well-defined canonical forms for graphs up to isomorphism do not exist.
Corollary 4 implies that while the hope of finding an efficient algorithm that distinguishes graphs up to isomorphism still exists, such an algorithm cannot take advantage of some special canonical form of graphs, since these canonical forms do not exist. Unfortunately, this is also the case for sets of frequent itemsets, as the following shows.
Figure 3: A maximal bipartite clique (here, the bipartite clique K_{2,3}) in a bipartite graph.
Claim 5. Frequent itemset equivalence is as hard as the graph isomorphism problem.
Proof. Equivalence of datasets up to frequent itemsets is at least as hard as comparing the numbers of maximal bipartite cliques in bipartite graphs (see (Zaki 2000) for a proof), even if we restrict attention to closed or maximal itemsets only. A maximal bipartite clique in a bipartite graph G is a maximal complete bipartite subgraph K_{i,j} of G (see Figure 3 for an illustration). Finding the size of a maximal bipartite clique in a bipartite graph is an NP-complete problem (see (Kuznetsov 1989)), and therefore there exists a polynomial-time reduction from the subgraph isomorphism problem to the maximal bipartite clique problem. The reduction from the graph isomorphism problem to subgraph isomorphism is straightforward.
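To make the connection used in this proof concrete, the following sketch relies on the standard correspondence between a dataset and its bipartite transaction-item graph: every closed itemset, paired with its supporting transactions, forms a maximal bipartite clique, and vice versa. The brute-force helper and the toy dataset are ours and are not the construction of (Zaki 2000).

```python
# Closed itemsets of a dataset viewed as maximal bipartite cliques of its
# transaction-item bipartite graph (brute-force sketch).
from itertools import combinations

def closed_itemsets(dataset):
    """An itemset is closed if it equals the intersection of all transactions
    that contain it."""
    items = sorted(set().union(*dataset))
    closed = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            cover = [t for t in dataset if set(cand) <= t]
            if cover and set(cand) == set.intersection(*cover):
                closed.add(frozenset(cand))
    return closed

# Toy dataset as a bipartite graph: transactions t0, t1, t2 on one side,
# items x, y, z on the other, with an edge for every containment.
D = [{"x", "y", "z"}, {"x", "y", "z"}, {"y", "z"}]
for c in sorted(closed_itemsets(D), key=len):
    tids = [i for i, t in enumerate(D) if c <= t]
    # Each (itemset, transaction set) pair printed below is a maximal bipartite clique.
    print(set(c), "supported by transactions", tids)
```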
Theorem 6. The problem of classifying datasets up to frequent itemsets is wild.
Proof. Wildness of the classification problem for graphs follows from the famous result by V. Sergeichuk (see (Sergeichuk 1975)) and was given above as Corollary 4. Since the problem of classifying datasets up to their frequent itemsets contains (in the matrix-problem sense) the problem of classifying graphs up to isomorphism, by Claim 5 the classification problem for datasets up to frequent itemsets is wild.
Corollary 7. The dataset classification problem is wild for maximal or closed frequent itemsets as well.
Note that wildness of a problem does not necessarily imply the nonexistence of an efficient algorithm telling whether or not two objects are equivalent up to a certain set of transformations. It does, however, imply that no well-defined representation (a representation with normal forms) for the equivalence classes of these objects exists.
4 CONCLUSIONS
This paper addresses the problem of classifying
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
388
datasets with (maximal, closed, or all) frequent itemsets. We show that high computational cost is not the only problem with this approach and that there is, in fact, a deeper reason why the approach fails. We use category-theoretic results to prove that well-described normal forms for the dataset classification problem do not exist.
REFERENCES
Agrawal, R., Srikant, R. (1994). Fast algorithms for min-
ing association rules in large databases. Proceedings
of the 20th International Conference on Very Large
Data Bases, VLDB, pages 487–499, Santiago, Chile.
Belitskii, G., Sergeichuk, V. (2003). Complexity of matrix
problems, Linear Algebra Appl. 361, pp. 203-222.
Calders, T. (2004). Computational complexity of itemset
frequency satisfiability. In: Proc. 23rd ACM PODS
04, pp. 143–154, ACM Press.
Drozd, J. (1980). Tame and wild matrix problems. Lecture
Notes in Mathematics, Volume 832, 242–258.
Flouvat, F., De Marchi, F., Petit, JM. (2010), A new classi-
fication of datasets for frequent itemsets. J. Intell. Inf.
Syst. 34, pp. 1–19.
Friedland, S. (1983). Simultaneous similarity of matrices,
Adv. Math. 50 pp. 189–265.
Gabriel, P. (1972). Unzerlegbare Darstellungen I,
Manuscripta Math. 6 pp. 71–103.
Giesbrecht, M. (1995). Nearly Optimal Algorithms For
Canonical Matrix Forms, SIAM Journal on Comput-
ing, v.24 n.5, pp.948–969.
Gouda, K., Zaki, J. M. (2005). GenMax: An Efficient Al-
gorithm for Mining Maximal Frequent Itemsets. Data
Mining and Knowledge Discovery: An International
Journal, 11(3) pp.223–242.
Han, J., Pei, J., Yin, Y. (2000). Mining frequent pat-
terns without candidate generation. In: Proceeding of
the 2000 ACM-SIGMOD international conference on
management of data (SIGMOD00), Dallas, TX, pp 1–
12.
Kuznetsov, S. O. (1989). Interpretation on Graphs and Com-
plexity Characteristics of a Search for Specific Pat-
terns, Nauchn. Tekh. Inf., Ser. 2 (Automatic Docu-
ment Math Linguist), vol. 23, no. 1, pp. 23–37.
Palmerini, P., Orlando, S., Perego, R. (2004). Statistical
properties of transactional databases. In H. Haddad,
A. Omicini, R. L. Wainwright, L. M. Liebrock (Eds.),
SAC (pp. 515–519). New York: ACM.
Parthasarathy S., Ogihara, M. (2000). Clustering Dis-
tributed Homogeneous Datasets. in Proceedings
PKDD 2000, pp. 566–574
Sergeichuk, V. (1977), The classification of metabelian
p-groups (Russian), Matrix problems, Akad. Nauk
Ukrain. SSR Inst. Mat., Kiev, pp. 150–161.
Subramonian, R. (1998). Defining diff as a Data Mining
Primitive. in Proceedings of KDD1998, pp. 334–338
Lipyanski, R., Vanetik, N. (2010). The classification problem for graphs and lattices is wild, in Proceedings of the International Conference on Modern Algebra and its Applications, pp. 107–111, Batumi, September 20th–26th.
Yang, G. (2004). The complexity of mining maximal fre-
quent itemsets and maximal frequent patterns. Proc.
Int. Conf. Knowl. Discov. Data Mining, pp. 344–353.
Ye, Y., Wang, D., Li, T., Ye, D. (2007). IMDS: intelligent
malware detection system, Proceedings of the 13th
ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pp. 1043–1047.
Zaïane, O. R., Antonie, M. L. (2002). Classifying Text Documents by Associating Terms With Text Categories, Australasian Database Conference.
Zaki, M. J. (2000) Scalable algorithms for association min-
ing, IEEE TransKnowl Data Eng 12:372–390.
Zaki, M. J., Carothers, C. D., Szymanski B. K. (2010).
VOGUE: A Variable Order Hidden Markov Model
with Duration based on Frequent Sequence Mining.
ACM Transactions on Knowledge Discovery in Data,
4(1).
ClassificationofDatasetswithFrequentItemsetsisWild
389