LOGIC OF DISCOVERY, DATA MINING AND SEMANTIC WEB

Position Paper

Jan Rauch

Faculty of Informatics and Statistics, University of Economics

nám. W. Churchilla 4, 130 67 Prague, Czech Republic

Keywords:

Logic of discovery, Association rules, Logic of association rules, Data mining, Semantic web.

Abstract:

Logic of discovery was developed in the 1970s as an answer to the questions "Can computers formulate and justify scientific hypotheses?" and "Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world?". Logic of discovery is based on two semantic systems. The observational semantic system corresponds to observational data and statements about observational data. The theoretical semantic system concerns suitable state dependent structures. The two systems are related via inductive inference rules corresponding to statistical approaches. An attempt was made to adapt logic of discovery to data mining, and a framework that makes it possible to deal with domain knowledge in data mining was developed. The possibility of enhancing this framework to present results of data mining through the Semantic web is suggested and discussed.

1 INTRODUCTION

Logic of discovery was developed in the book (Hájek and Havránek, 1978) as an answer to the questions ($Q_1$) and ($Q_2$): ($Q_1$) Can computers formulate and justify scientific hypotheses? ($Q_2$) Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world? Answers to these questions are based on a scheme of inductive inference:

$$\frac{\text{theoretical assumptions, observational statement}}{\text{theoretical statement}}$$

Logic of discovery deals with two semantic systems: an observational semantic system and a theoretical semantic system. The observational semantic system has a language for speaking about observational data. The theoretical semantic system concerns state dependent structures. The two systems are connected by inductive inference rules based on statistical approaches.

An attempt to adapt logic of discovery to the needs of data mining resulted in a proposal of the system 4ft-Discoverer (Rauch, 2010), which is intended to be an experimental framework that makes it possible to deal with domain knowledge when mining a particular data set. The goal of this paper is to discuss the possibility of enhancing this framework to serve as a basis for disseminating results of data mining through the Semantic web.

The system 4ft-Discoverer is based on the logic of association rules (Rauch, 2005). An association rule is understood here as a general relation of two Boolean attributes. The main features of logic of discovery are summarized in section 2. The logic of association rules is introduced in section 3. Important features of 4ft-Discoverer are described in section 4. Possibilities of enhancing 4ft-Discoverer into a framework for disseminating results of data mining through the Semantic web are discussed in section 5.

2 LOGIC OF DISCOVERY

The scheme of inductive inference introduced in section 1 inspired five additional questions (Hájek and Havránek, 1978):

L0: In what languages does one formulate observational and theoretical statements?

L1: What are rational inductive inference rules bridging the gap between observational and theoretical sentences? (What does it mean that a theoretical statement is justified?)

L2: Are there rational methods for deciding whether a theoretical statement is justified (on the basis of given theoretical assumptions and observational statements)?


Rauch J..

LOGIC OF DISCOVERY, DATA MINING AND SEMANTIC WEB - Position Paper.

DOI: 10.5220/0003117203420351

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 342-351

ISBN: 978-989-8425-28-7

 2010 SCITEPRESS (Science and Technology Publications, Lda.)

L3: What are the conditions for a theoretical statement or a set of theoretical statements to be of interest with respect to the task of scientific cognition?

L4: Are there methods for suggesting such a set of statements which is as interesting (important) as possible?

Answering questions (L0) – (L2) leads to a logic of induction; answers to questions (L3) and (L4) lead to a logic of suggestion. Answers to questions (L0) – (L4) constitute the logic of discovery developed in (Hájek and Havránek, 1978). The rational inductive inference rules bridging the gap between observational and theoretical sentences are based on statistical approaches, i.e. estimates of various parameters or statistical hypothesis tests are used.

The notion of a semantic system is defined to formalize languages for observational and theoretical statements. A semantic system $S = \langle Sent, \mathcal{M}, V, Val \rangle$ is determined by a non-empty set $Sent$ of sentences, a non-empty set $\mathcal{M}$ of models, a non-empty set $V$ of abstract values and an evaluating function $Val : (Sent \times \mathcal{M}) \to V$. If $\varphi \in Sent$ and $M \in \mathcal{M}$, then $Val(\varphi, M)$ is the value of $\varphi$ in $M$. A semantic system $S = \langle Sent, \mathcal{M}, V, Val \rangle$ is observational if $Sent$, $\mathcal{M}$ and $V$ are recursive sets and $Val$ is a partial recursive function.

Two semantic systems of this form are developed: an observational semantic system, corresponding to the analyzed data, and a theoretical semantic system, corresponding to the whole set of objects we are interested in. The analyzed data may concern only a part of this whole set. Rationality of inductive inference rules is based on statistical approaches. This leads to observational semantic systems with formulas corresponding to statistical hypothesis tests. An example of an observational system is related to the logical calculus of association rules, see section 3.

3 LOGIC OF ASSOCIATION RULES

Most of the observational semantic systems studied in (Hájek and Havránek, 1978) are based on observational predicate calculi, which are introduced in section 3.1. Logical calculi of association rules can be understood as modifications of observational predicate calculi; they are informally defined in section 3.2. Deduction rules are very important in logical calculi of association rules; some practically important deduction rules are mentioned in section 3.3.

3.1 Observational Predicate Calculi

An observational predicate calculus results from modifications of the classical predicate calculus: only finite models are allowed and generalized quantifiers are added. Finite models correspond to data resulting from observation, and generalized quantifiers make it possible to express various assertions about analyzed data, including assertions corresponding to statistical hypothesis tests.

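The set $Sent_{\mathcal{P}}$ of all closed formulas of an observational predicate calculus $\mathcal{P}$ can be used to build an observational semantic system $S_{\mathcal{P}} = \langle Sent_{\mathcal{P}}, \mathcal{M}_{\mathcal{P}}, V_{\mathcal{P}}, Val_{\mathcal{P}} \rangle$, where $\mathcal{M}_{\mathcal{P}}$ is the set of all models (i.e. finite data structures) of $\mathcal{P}$, $V_{\mathcal{P}} = \{0,1\}$ and $Val_{\mathcal{P}}$ is a function assigning a value from $\{0,1\}$ to each couple $\langle \Phi, M \rangle$ where $\Phi \in Sent_{\mathcal{P}}$ and $M \in \mathcal{M}_{\mathcal{P}}$. If $Val_{\mathcal{P}}(\Phi, M) = 1$ then $\Phi$ is true in $M$, otherwise $\Phi$ is false in $M$.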

If we use a predicate calculus $\mathcal{P}$ with only unary predicates $P_1, \dots, P_n$, then each model $M \in \mathcal{M}_{\mathcal{P}}$ of $S_{\mathcal{P}}$ is a $\{0,1\}$-data matrix with $n$ columns. Expressions $\forall(x) P_1(x)$ and $\exists(x)(P_1(x) \vee P_2(y))$ are examples of formulas with the classical quantifiers $\forall$ and $\exists$.

Expressions $\Rightarrow^{!}_{p,\alpha,B}(x)(P_1(x), P_2(x))$ and $\equiv_{p,B}(x)(P_1(x) \wedge P_2(x), P_3(y) \vee P_4(x))$ are examples of formulas with the generalized quantifiers $\Rightarrow^{!}_{p,\alpha,B}$ and $\equiv_{p,B}$, which are introduced in table 1. These expressions concern the couples of derived predicates $\langle P_1(x); P_2(x) \rangle$ and $\langle P_1(x) \wedge P_2(x); P_3(y) \vee P_4(x) \rangle$; they can be understood as generalizations of association rules.

3.2 Logical Calculi of Association Rules

The boom of association rules in the 1990s (Agrawal et al., 1993) started a new effort in the study of association rules as formulas of observational calculi. The syntax of the formulas of observational predicate calculi has been significantly simplified: only calculi with monadic predicates are studied further, free and bound variables are omitted, and basic Boolean attributes are used instead of predicates. The resulting calculi can be understood as logical calculi of association rules (Rauch, 2005; Rauch, 2008; Rauch, 2009).

We informally outline the definition of the semantic system $AR_T = \langle Sent_T, \mathcal{M}_T, \{0,1\}, Val_T \rangle$ of type $T$ concerning association rules. Elements of $Sent_T$ are association rules $\varphi \approx \psi$, where $\varphi$ and $\psi$ are Boolean attributes derived from columns of an analyzed data matrix $\mathcal{M}$ of type $T$ and $\approx$ is a 4ft-quantifier. Such association rules are closed formulas of the language $\mathcal{L}_T$ of association rules, which is outlined in section 3.2.1. $\mathcal{M}_T$ is the set of all data matrices of type $T$, see section 3.2.2. $Val_T$ is an evaluating function assigning a value $Val_T(\varphi \approx \psi, \mathcal{M}) \in \{0,1\}$ to each couple of $\mathcal{M} \in \mathcal{M}_T$ and $\varphi \approx \psi \in Sent_T$. It is introduced in section 3.2.3.

3.2.1 Language $\mathcal{L}_T$

An association rule is an expression $\varphi \approx \psi$ where $\varphi$ and $\psi$ are Boolean attributes derived from columns of an analyzed data matrix and $\approx$ is a 4ft-quantifier. The Boolean attribute $\varphi$ is called the antecedent and the Boolean attribute $\psi$ is called the succedent.

Basic Boolean attributes are created first. A basic Boolean attribute is an expression $A(\alpha)$ where $\alpha \subset \{a_1, \dots, a_t\}$ and $\{a_1, \dots, a_t\}$ is the set of all categories of the attribute $A$. The basic Boolean attribute $A(\alpha)$ is true in row $o$ of $\mathcal{M}$ if $A(o) \in \alpha$, where $A(o)$ is the value of the attribute $A$ in row $o$. Examples of basic Boolean attributes are in figure 1. These Boolean attributes are derived from columns of a data matrix $\mathcal{M}$ with columns corresponding to attributes $A_1, \dots, A_K$.

    M      A_1   ...   A_K   A_1(1)   A_K(2,6)
    o_1    1     ...   6     1        1
    ...    ...   ...   ...   ...      ...
    o_n    3     ...   1     0        0

Figure 1: Data matrix $\mathcal{M}$ and basic Boolean attributes.

Boolean attributes $\varphi$ and $\psi$ are derived from basic Boolean attributes using the connectives $\vee$, $\wedge$ and $\neg$ in the usual way. The expression $A_1(1) \wedge A_2(4,5) \approx A_K(2,6)$ is an example of an association rule.
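The truth condition for a basic Boolean attribute and the derivation of Boolean attributes by connectives can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the paper: the names `basic_attribute`, `conj`, `disj` and `neg`, the dictionary representation of a row, and the keys `A1` and `AK` are all hypothetical. The two rows reproduce only the visible columns of figure 1.

```python
from typing import Callable, Dict, Iterable

Row = Dict[str, int]               # one row o: attribute name -> category
BoolAttr = Callable[[Row], bool]   # a Boolean attribute evaluated on a row


def basic_attribute(name: str, alpha: Iterable[int]) -> BoolAttr:
    """A(alpha): true in row o iff A(o) is an element of alpha."""
    allowed = frozenset(alpha)
    return lambda row: row[name] in allowed


def conj(phi: BoolAttr, psi: BoolAttr) -> BoolAttr:
    """Conjunction of two Boolean attributes."""
    return lambda row: phi(row) and psi(row)


def disj(phi: BoolAttr, psi: BoolAttr) -> BoolAttr:
    """Disjunction of two Boolean attributes."""
    return lambda row: phi(row) or psi(row)


def neg(phi: BoolAttr) -> BoolAttr:
    """Negation of a Boolean attribute."""
    return lambda row: not phi(row)


# First and last row of the data matrix in figure 1 (hypothetical keys).
rows = [{"A1": 1, "AK": 6}, {"A1": 3, "AK": 1}]
a1_1 = basic_attribute("A1", {1})
ak_26 = basic_attribute("AK", {2, 6})
print([int(a1_1(r)) for r in rows])   # values of the column A1(1)
print([int(ak_26(r)) for r in rows])  # values of the column AK(2,6)
```

Representing a Boolean attribute as a function on rows keeps the connectives compositional, so an antecedent or succedent of any depth is again just a function on rows.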

We consider only data matrices whose values are natural numbers. The natural numbers represent categories, i.e. possible values of the observed attributes $A_1, \dots, A_K$. Columns of a data matrix correspond to attributes and rows correspond to observed objects, e.g. to patients. An example of such a data matrix is in figure 1.

There is only a finite number of categories, i.e. possible values, for each attribute. Let us assume that the number of possible values of a column is $t$ and that the possible values in this column are the natural numbers $1, \dots, t$. All possible values in the data matrix are then described by the numbers of possible values for each column. The whole information on the number of columns and the possible values in the data matrix is then given by the type of the data matrix: a type of data matrix is a $K$-tuple $T = \langle t_1, \dots, t_K \rangle$ where $t_i \ge 2$ are natural numbers for $i = 1, \dots, K$.

Symbols of the language $\mathcal{L}_T$ of association rules of type $T = \langle t_1, \dots, t_K \rangle$ are the attributes $A_1, \dots, A_K$, the 4ft-quantifiers $\approx_1, \dots, \approx_Q$, the propositional connectives $\wedge$, $\vee$, $\neg$ and parentheses. The basic Boolean attributes $A(\alpha)$ are defined as above. Each basic Boolean attribute is a Boolean attribute; if $\varphi$ and $\psi$ are Boolean attributes, then $\neg\varphi$, $\varphi \wedge \psi$ and $\varphi \vee \psi$ are Boolean attributes.

The set $Sent_T$ of the semantic system of association rules of type $T$ is the set of all association rules, i.e. closed formulas of the language $\mathcal{L}_T$. A formal definition of the language of association rules is given e.g. in (Rauch, 2005).

3.2.2 Data Matrices $\mathcal{M}_T$

A more formal definition of a data matrix is used, in which the number of columns and the numbers of possible values in particular columns are given by the type $T = \langle t_1, \dots, t_K \rangle$. Let $T = \langle t_1, \dots, t_K \rangle$ be the type of data matrix. Then a data matrix of type $T$ is a $(K+1)$-tuple $\mathcal{M} = \langle M, f_1, \dots, f_K \rangle$, where $M$ is a non-empty finite set and $f_i$ is a unary function from $M$ to $\{1, \dots, t_i\}$ for $i = 1, \dots, K$. The set $M$ is the set of rows of the data matrix $\mathcal{M}$. It is called the domain of $\mathcal{M}$, and we write $M = Dom(\mathcal{M})$. An example of a data matrix $\mathcal{M} = \langle M, f_1, \dots, f_K \rangle$ is in figure 2. We assume that $M = \{o_1, \dots, o_n\}$.

    object   f_1        ...   f_K
    o_1      f_1(o_1)   ...   f_K(o_1)
    ...      ...        ...   ...
    o_n      f_1(o_n)   ...   f_K(o_n)

Figure 2: Data matrix $\mathcal{M} = \langle M, f_1, \dots, f_K \rangle$.

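3.2.3 Evaluation Function $Val_T$

An association rule $\varphi \approx \psi$ can be true or false in a given data matrix $\mathcal{M} \in \mathcal{M}_T$. The rule $\varphi \approx \psi$ is verified on the basis of the four-fold table $4ft(\varphi, \psi, \mathcal{M})$ of $\varphi$ and $\psi$ in $\mathcal{M}$, see figure 3.

    M      ψ    ¬ψ
    ϕ      a    b
    ¬ϕ     c    d

Figure 3: 4ft-table $4ft(\varphi, \psi, \mathcal{M})$.

Here $a$ is the number of objects (i.e. rows of $\mathcal{M}$) satisfying both $\varphi$ and $\psi$, $b$ is the number of objects satisfying $\varphi$ and not satisfying $\psi$, $c$ is the number of objects satisfying $\psi$ and not satisfying $\varphi$, and $d$ is the number of objects satisfying neither $\varphi$ nor $\psi$. $4ft(\varphi, \psi, \mathcal{M})$ is also written as $\langle a, b, c, d \rangle$ and called the 4ft-table.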
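Counting the four frequencies is a single pass over the rows. The following Python sketch is illustrative only: the function name `four_fold_table`, the representation of Boolean attributes as functions on rows, and the small example matrix are assumptions, not part of the paper.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
BoolAttr = Callable[[Row], bool]


def four_fold_table(rows: List[Row], phi: BoolAttr, psi: BoolAttr) -> Tuple[int, int, int, int]:
    """Return the quadruple (a, b, c, d) of phi and psi in the given rows."""
    a = b = c = d = 0
    for row in rows:
        p, s = bool(phi(row)), bool(psi(row))
        if p and s:
            a += 1      # phi and psi
        elif p:
            b += 1      # phi and not psi
        elif s:
            c += 1      # not phi and psi
        else:
            d += 1      # neither phi nor psi
    return a, b, c, d


# Hypothetical data matrix with two attributes A and B.
rows = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 2},
    {"A": 2, "B": 1},
    {"A": 2, "B": 2},
    {"A": 1, "B": 1},
]
table = four_fold_table(rows, lambda r: r["A"] == 1, lambda r: r["B"] == 1)
print(table)
```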

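The evaluation function $Val_T$ assigns a value 0 or 1 to each couple $\langle \varphi \approx \psi, \mathcal{M} \rangle$ where $\varphi \approx \psi$ is an association rule and $\mathcal{M} \in \mathcal{M}_T$. If $Val_T(\varphi \approx \psi, \mathcal{M}) = 1$ then we say that the rule $\varphi \approx \psi$ is true in $\mathcal{M}$, and if $Val_T(\varphi \approx \psi, \mathcal{M}) = 0$ then we say that the rule $\varphi \approx \psi$ is false in $\mathcal{M}$. $Val_T(\varphi \approx \psi, \mathcal{M})$ is defined using the 4ft-table $4ft(\varphi, \psi, \mathcal{M})$ of $\varphi$ and $\psi$ in $\mathcal{M}$ and the associated function $F_\approx$ of $\approx$.

The associated function $F_\approx$ of the 4ft-quantifier $\approx$ is a $\{0,1\}$-valued function defined for all quadruples $\langle a, b, c, d \rangle$ of natural numbers. The value of an association rule $\varphi \approx \psi$ in a data matrix $\mathcal{M} \in \mathcal{M}_T$ is defined such that $Val_T(\varphi \approx \psi, \mathcal{M}) = F_\approx(a, b, c, d)$ where $\langle a, b, c, d \rangle = 4ft(\varphi, \psi, \mathcal{M})$. Examples of 4ft-quantifiers $\approx$ and associated functions $F_\approx(a, b, c, d)$ are in table 1.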

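Table 1: Examples of 4ft-quantifiers.

    4ft-quantifier ≈                  $F_\approx(a,b,c,d) = 1$ iff
    $\Rightarrow_{p,B}$               $\frac{a}{a+b} \ge p \wedge a \ge B$
    $\Rightarrow^{!}_{p,\alpha,B}$    $\sum_{i=a}^{r} \binom{r}{i} p^{i} (1-p)^{r-i} \le \alpha \wedge a \ge B$
    $\equiv_{p,B}$                    $\frac{a+d}{a+b+c+d} \ge p \wedge a \ge B$
    $\approx_{\alpha,B}$              $\sum_{i=a}^{\min(r,k)} \frac{\binom{k}{i}\binom{n-k}{r-i}}{\binom{n}{r}} \le \alpha \wedge a \ge B$
    $\sim^{2}_{\alpha,B}$             $\frac{(ad-bc)^{2}}{rkls}\, n \ge \chi^{2}_{\alpha} \wedge a \ge B$
    $\sim^{+}_{q,B}$                  $\frac{a}{a+b} \ge (1+q)\frac{a+c}{a+b+c+d} \wedge a \ge B$

Here $r = a+b$, $k = a+c$, $l = c+d$, $s = b+d$ and $n = a+b+c+d$.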

The 4ft-quantifiers $\Rightarrow_{p,B}$ of founded implication and $\Rightarrow^{!}_{p,\alpha,B}$ of lower critical implication, Fisher's quantifier $\approx_{\alpha,B}$ and the $\chi^{2}$-quantifier $\sim^{2}_{\alpha,B}$ are defined in (Hájek and Havránek, 1978), the quantifier $\equiv_{p,B}$ of founded equivalence is defined in (Hájek et al., 1983) and the 4ft-quantifier $\sim^{+}_{q,B}$ of above average dependence is defined in (Rauch, 2005).
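The associated functions of table 1 translate directly into code. The sketch below uses only the Python standard library and covers five of the six quantifiers; the $\chi^{2}$-quantifier is left out because it needs a critical value of the $\chi^{2}$ distribution. Function names, parameter names (`base` stands for the parameter $B$) and the treatment of empty tables are assumptions made for illustration.

```python
from math import comb


def founded_implication(a, b, c, d, p, base):
    """a/(a+b) >= p and a >= base; a rule with a+b == 0 is treated as false."""
    return a + b > 0 and a / (a + b) >= p and a >= base


def lower_critical_implication(a, b, c, d, p, alpha, base):
    """Binomial tail sum over i = a..r with r = a+b is at most alpha, and a >= base."""
    r = a + b
    tail = sum(comb(r, i) * p**i * (1 - p) ** (r - i) for i in range(a, r + 1))
    return tail <= alpha and a >= base


def founded_equivalence(a, b, c, d, p, base):
    """(a+d)/n >= p and a >= base; an empty table is treated as false."""
    n = a + b + c + d
    return n > 0 and (a + d) / n >= p and a >= base


def fisher(a, b, c, d, alpha, base):
    """Hypergeometric tail sum over i = a..min(r,k) is at most alpha, and a >= base."""
    r, k, n = a + b, a + c, a + b + c + d
    tail = sum(comb(k, i) * comb(n - k, r - i) for i in range(a, min(r, k) + 1)) / comb(n, r)
    return tail <= alpha and a >= base


def above_average(a, b, c, d, q, base):
    """a/(a+b) >= (1+q)(a+c)/n and a >= base; a rule with a+b == 0 is treated as false."""
    n = a + b + c + d
    return a + b > 0 and a / (a + b) >= (1 + q) * (a + c) / n and a >= base
```

Each function takes the quadruple $\langle a, b, c, d \rangle$ first and the quantifier parameters after it, so a concrete quantifier can be obtained by fixing the parameters, e.g. with `functools.partial`.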

3.3 Deduction Rules in Logical Calculus of Association Rules

The language $\mathcal{L}_T$, the set of data matrices $\mathcal{M}_T$ and the evaluation function $Val_T$ constitute the logical calculus of association rules (Rauch, 2005). There are various theoretically interesting and practically useful results related to the logical calculus of association rules. Most of them are related to classes of 4ft-quantifiers (Rauch, 2008).

An example of a class of 4ft-quantifiers is the class of implicational 4ft-quantifiers. A 4ft-quantifier $\approx$ is implicational if $F_\approx(a, b, c, d) = 1 \wedge a' \ge a \wedge b' \le b$ implies $F_\approx(a', b', c', d') = 1$. Both 4ft-quantifiers $\Rightarrow_{p,B}$ and $\Rightarrow^{!}_{p,\alpha,B}$ (see table 1) are implicational.
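The implicational condition can be explored by brute force on small frequencies. The following Python sketch searches a bounded grid for a counterexample; finding none is not a proof, but a found counterexample does show that a quantifier is not implicational. The function name `find_counterexample`, the bound `limit` and the fixed parameter values are assumptions for illustration.

```python
from itertools import product


def find_counterexample(F, limit=4):
    """Search frequencies 0..limit for a violation of the implicational condition.

    Returns a pair ((a, b, c, d), (a2, b2, c2, d2)) with F true on the first
    quadruple, a2 >= a, b2 <= b and F false on the second quadruple,
    or None if no violation exists within the bound.
    """
    grid = range(limit + 1)
    for a, b, c, d in product(grid, repeat=4):
        if not F(a, b, c, d):
            continue
        for a2, b2, c2, d2 in product(range(a, limit + 1), range(b + 1), grid, grid):
            if not F(a2, b2, c2, d2):
                return (a, b, c, d), (a2, b2, c2, d2)
    return None


def founded_implication(a, b, c, d, p=0.8, base=1):
    return a + b > 0 and a / (a + b) >= p and a >= base


def founded_equivalence(a, b, c, d, p=0.8, base=1):
    n = a + b + c + d
    return n > 0 and (a + d) / n >= p and a >= base


print(find_counterexample(founded_implication))          # no violation on the grid
print(find_counterexample(founded_equivalence) is None)  # a violation is found
```

Founded equivalence fails because its value depends on $c$ and $d$, which the implicational condition leaves unconstrained.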

Important results concerning soundness of deduction rules of the form $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ were achieved (Rauch, 2008). Here both $\varphi \approx \psi$ and $\varphi' \approx \psi'$ are association rules. We outline these results for the class of interesting implicational quantifiers: if $\Rightarrow^{*}$ is an interesting implicational quantifier, then there are formulas $\omega_{1A}$, $\omega_{1B}$, $\omega_{2}$ of propositional calculus created from $\varphi$, $\psi$, $\varphi'$, $\psi'$ so that the deduction rule $\frac{\varphi \Rightarrow^{*} \psi}{\varphi' \Rightarrow^{*} \psi'}$ is sound if and only if at least one of the following conditions is satisfied: (1) both $\omega_{1A}$ and $\omega_{1B}$ are tautologies, (2) $\omega_{2}$ is a tautology.

All practically important implicational 4ft-quantifiers are interesting implicational quantifiers. Similar theorems are proved for additional classes of 4ft-quantifiers (Rauch, 2005; Rauch, 2008).

4 4FT-DISCOVERER

4ft-Discoverer $4ftD_T$ is a system $4ftD_T = \langle S_T, U_T, \text{4ft-Miner}, \text{4ft-Filter}, \text{4ft-Synt} \rangle$, where $S_T$ and $U_T$ are two semantic systems intended to express results of observation, properties of particular data matrices and various items of domain knowledge. Here $T$ is the type of data matrix, $T = \langle t_1, \dots, t_K \rangle$, see section 3.2.1. We say that 4ft-Discoverer $4ftD_T$ is of type $T$.

$S_T$ and $U_T$ are briefly described in section 4.1. They are related to each other by the function $Cons_T$, which assigns to each item of domain knowledge a set of its atomic consequences, see section 4.2.

4ft-Miner is a GUHA procedure, i.e. a data mining procedure, which mines for association rules, that is, couples of Boolean attributes created from columns of data matrices $\mathcal{M} \in \mathcal{M}_T$ (Rauch and Šimůnek, 2005). It has very fine tools for defining the set of association rules to be generated and verified. It is introduced in section 4.3. The procedures 4ft-Filter and 4ft-Synt are intended to interpret results of 4ft-Miner using domain knowledge expressed by the semantic system $U_T$. Both procedures are introduced in section 4.4.

4.1 Semantic Systems $S_T$ and $U_T$

The semantic system $S_T$ of type $T = \langle t_1, \dots, t_K \rangle$ is defined as $S_T = \langle \mathcal{M}_T, Sent_T, Val_T, Sent_T^{\mathcal{M}}, Val_T^{\mathcal{M}} \rangle$ where:

• $\mathcal{M}_T$ is the set of all data matrices $\mathcal{M}$ of type $T$, see section 3.2.2.
• $Sent_T$ is the set of all association rules $\varphi \approx \psi$ of type $T$, i.e. the set of all closed formulas of the language $\mathcal{L}_T$, see section 3.2.1.
• $Val_T$ is the evaluation function defined in section 3.2.3.
• $Sent_T^{\mathcal{M}}$ is a set of (closed) formulas of a language $\mathcal{L}_T^{\mathcal{M}}$ intended to express features of particular data matrices. Informal examples of such formulas are: a data matrix $\mathcal{M} \in \mathcal{M}_T$ concerns only pathological patients, and a data matrix $\mathcal{M} \in \mathcal{M}_T$ concerns patients from a given town (we assume that data matrices from $\mathcal{M}_T$ concern patients). This language is not defined in detail in (Rauch, 2010); more details are in section 5.
• $Val_T^{\mathcal{M}}$ is an evaluation function for features of data matrices, $Val_T^{\mathcal{M}} : (Sent_T^{\mathcal{M}} \times \mathcal{M}_T) \to \{0,1\}$. If $\theta \in Sent_T^{\mathcal{M}}$ and $\mathcal{M} \in \mathcal{M}_T$ then $Val_T^{\mathcal{M}}(\theta, \mathcal{M})$ is the value of feature $\theta$ for $\mathcal{M}$. If $Val_T^{\mathcal{M}}(\theta, \mathcal{M}) = 1$ then $\mathcal{M}$ has feature $\theta$, otherwise $\mathcal{M}$ does not have feature $\theta$.

Please note that we use the notion of a semantic system here in a broader sense than defined in (Hájek and Havránek, 1978); the same is true for the system $U_T$ introduced below. We call the system $S_T$ observational to express that $S_T$ concerns results of observation.

The semantic system $U_T$ of type $T = \langle t_1, \dots, t_K \rangle$ is defined as $U_T = \langle U, Sent_T^{U}, Cons_T \rangle$ where:

• $U = \bigcup \{Dom(\mathcal{M}) \mid \mathcal{M} \in \mathcal{M}_T\}$ is the union of the domains of all data matrices $\mathcal{M} \in \mathcal{M}_T$, see also section 3.2.2.
• $Sent_T^{U}$ is a set of (closed) formulas of a language $\mathcal{L}_T^{U}$ intended to express various items of knowledge related to the set $U$ or items of general knowledge. Thus each $I \in Sent_T^{U}$ is an item of knowledge. An example of an item of knowledge related to the set $U$ is information on a specific vaccination applied to all patients in a given region. We assume that each data matrix $\mathcal{M} \in \mathcal{M}_T$ concerns only patients from this region. An example of an item of general knowledge is the commonly accepted fact that if weight increases then blood pressure increases too. Examples of formulas from $Sent_T^{U}$ are given below.
• $Cons_T$ is a function assigning to each $I \in Sent_T^{U}$ a set of association rules which can be understood as consequences of the item $I$. This function is intended to connect the observational semantic system $S_T$ and the theoretical semantic system $U_T$ by adding semantics to items of domain knowledge. More information is in section 4.2.

The system $U_T$ is called theoretical because it talks about the whole set of objects we are interested in.

The language $\mathcal{L}_T^{U}$ is intended to express items of knowledge related to the set $U$ or items of general knowledge. Some examples of general knowledge follow. Here $A$ is one of the attributes $A_1, \dots, A_K$ of the language $\mathcal{L}_T$, and the same is true for $B$. In addition, $\omega$, $\omega_1$, $\omega_2$ are Boolean attributes of $\mathcal{L}_T$ and $\omega$ does not contain the attribute $A$.

• $A \uparrow\uparrow B$ means that if $A$ increases then $B$ increases
• $A \uparrow\downarrow B$ means that if $A$ increases then $B$ decreases
• $A \to^{+} \omega$ means that if $A$ increases then the relative frequency of $\omega$ increases
• $A \to^{-} \omega$ means that if $A$ increases then the relative frequency of $\omega$ decreases
• $\omega_1 \to^{+} \omega_2$ means that if $\omega_1$ is satisfied then the relative frequency of $\omega_2$ increases
• $\omega_1 \to^{-} \omega_2$ means that if $\omega_1$ is satisfied then the relative frequency of $\omega_2$ decreases.

We can imagine an additional parameter that makes it possible to express that a formula is the opinion of expert X or an assertion from paper Y.

4.2 $Cons_T$ – Atomic Consequences

The function $Cons_T$ is used, instead of the statistical approaches used in (Hájek and Havránek, 1978), to connect the observational and the theoretical semantic system. It is assumed that the function $Cons_T$ is defined with the help of a domain expert. It adds semantics to items of domain knowledge expressed by formulas from $Sent_T^{U}$, the set of sentences of the theoretical system (section 4.1).

We show how the function $Cons_T$ creates a set $Cons_T(A \uparrow\uparrow B, \mathcal{M})$ of association rules, i.e. formulas of the language $\mathcal{L}_T$, which can be considered as the set of all atomic consequences of the item of knowledge $A \uparrow\uparrow B$ in the data matrix $\mathcal{M}$. The function $Cons_T$ can be seen as a family of functions $Cons_\approx$ where $\approx$ is a 4ft-quantifier of the language $\mathcal{L}_T$. The function $Cons_\approx$ creates a set $Cons_\approx(A \uparrow\uparrow B, \mathcal{M})$ of association rules of $\mathcal{L}_T$ such that this set can be considered as the set of all atomic consequences of $A \uparrow\uparrow B$ of the form $\rho \approx \sigma$ in the data matrix $\mathcal{M}$. Then $Cons_T(A \uparrow\uparrow B, \mathcal{M})$ is defined as the union

$$\bigcup \{ Cons_\approx(A \uparrow\uparrow B, \mathcal{M}) \mid \approx \text{ belongs to } \mathcal{L}_T \}.$$

We outline how the function $Cons_{\Rightarrow_{p,B}}$ works for the 4ft-quantifier $\Rightarrow_{p,B}$ of founded implication (see table 1) and the item $A \uparrow\uparrow B$ of domain knowledge. The functions $Cons_\approx$ for additional 4ft-quantifiers and formulas of $\mathcal{L}_T^{U}$ are defined using similar principles, see also (Rauch, 2009).

We assume that the attribute $A$ has categories $1, \dots, u$ and the attribute $B$ has categories $1, \dots, v$. Our task is to define a set of rules $\rho \Rightarrow_{p,B} \sigma$ which can be naturally considered as the set of all consequences of the item $A \uparrow\uparrow B$ and which are as simple as possible. We assume the simplest rules, of the form $A(\alpha) \Rightarrow_{p,B} B(\beta)$ where $\alpha \subset \{1, \dots, u\}$ and $\beta \subset \{1, \dots, v\}$.

The rule $A(low) \Rightarrow_{p,B} B(low)$, saying that if $A$ is low then $B$ is low, can be understood as a natural consequence of $A \uparrow\uparrow B$. The only problem is to define the coefficients $\alpha$ and $\beta$ which can be understood as low. We choose a natural number $A_{low}$, $1 < A_{low} < u$, and a natural number $B_{low}$, $1 < B_{low} < v$, and then we consider $\alpha$ as low if and only if $\alpha \subset \{1, \dots, A_{low}\}$ and $\beta$ as low if and only if $\beta \subset \{1, \dots, B_{low}\}$, see also section 4.3.

The rule $A(high) \Rightarrow_{p,B} B(high)$, saying that if $A$ is high then $B$ is high, can also be understood as a natural consequence of $A \uparrow\uparrow B$. We choose a natural number $A_{high}$, $1 < A_{low} < A_{high} < u$, and a natural number $B_{high}$, $1 < B_{low} < B_{high} < v$, and then we consider $\alpha$ as high if and only if $\alpha \subset \{A_{high}, \dots, u\}$ and $\beta$ as high if and only if $\beta \subset \{B_{high}, \dots, v\}$.

It remains to define the values of the parameters $p$ and $B$ of $\Rightarrow_{p,B}$. We can choose any $p \ge 0.9$ and any $B$ not smaller than a bound given in terms of $n$, where $n$ is the number of rows of the data matrix $\mathcal{M}$. However, the boundaries of $p$ and $B$ as well as the values $A_{low}$, $A_{high}$, $B_{low}$, $B_{high}$ should be determined by a domain expert.

The set of all rules $A(low) \Rightarrow_{p,B} B(low)$ and $A(high) \Rightarrow_{p,B} B(high)$ satisfying the above conditions can be considered as $Cons_{\Rightarrow_{p,B}}(A \uparrow\uparrow B, \mathcal{M})$, a set of atomic consequences of $A \uparrow\uparrow B$ of the form $\rho \Rightarrow_{p,B} \sigma$ in $\mathcal{M}$.
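The construction of the low and high rules can be sketched as a simple enumeration. In the Python sketch below a rule $A(\alpha) \Rightarrow_{p,B} B(\beta)$ is represented only by the pair $(\alpha, \beta)$; the quantifier parameters are left out. The function names and the restriction to non-empty subsets are assumptions made for illustration, and the thresholds would in practice come from a domain expert.

```python
from itertools import combinations


def nonempty_subsets(values):
    """All non-empty subsets of the given categories, as frozensets."""
    values = list(values)
    return [frozenset(c) for size in range(1, len(values) + 1)
            for c in combinations(values, size)]


def cons_up_up(u, v, a_low, a_high, b_low, b_high):
    """Pairs (alpha, beta), read as A(alpha) => B(beta), for the item A up-up B.

    A has categories 1..u and B has categories 1..v. A coefficient is low if it
    is a subset of {1..low} and high if it is a subset of {high..last category}.
    """
    if not (1 < a_low < a_high < u and 1 < b_low < b_high < v):
        raise ValueError("expected 1 < low < high < number of categories")
    low = [(alpha, beta)
           for alpha in nonempty_subsets(range(1, a_low + 1))
           for beta in nonempty_subsets(range(1, b_low + 1))]
    high = [(alpha, beta)
            for alpha in nonempty_subsets(range(a_high, u + 1))
            for beta in nonempty_subsets(range(b_high, v + 1))]
    return set(low) | set(high)


rules = cons_up_up(u=5, v=5, a_low=2, a_high=4, b_low=2, b_high=4)
print(len(rules))
```

The number of atomic consequences grows exponentially with the width of the low and high ranges, which is one reason to keep these ranges narrow.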

The set $Cons_{\Rightarrow_{p,B}}(A \uparrow\uparrow B, \mathcal{M})$ can be defined in a finer way by rules $A(medium) \Rightarrow_{p,B} B(medium)$ with a suitable definition of "medium". Rules $A(low, medium) \Rightarrow_{p,B} B(medium)$, etc. can also be added.

There is a natural requirement of consistency of the set $Cons_T(A \uparrow\uparrow B, \mathcal{M})$ of atomic consequences; a detailed discussion is, however, beyond the scope of this paper.

4.3 GUHA Procedure 4ft-Miner

The 4ft-Miner procedure mines for association rules of the form $\varphi \approx \psi$ where $\varphi \in \Phi$, $\psi \in \Psi$, and $\varphi$ and $\psi$ have no common attributes. Input parameters define the analyzed data matrix $\mathcal{M}$, the 4ft-quantifier $\approx$, the set of relevant antecedents $\Phi$ and the set of relevant succedents $\Psi$.

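Each antecedent is a conjunction $\tau_1 \wedge \dots \wedge \tau_m$ of partial antecedents $\tau_1, \dots, \tau_m$. Each partial antecedent is either a conjunction $\lambda_1 \wedge \dots \wedge \lambda_q$ or a disjunction $\lambda_1 \vee \dots \vee \lambda_q$ of literals $\lambda_1, \dots, \lambda_q$. Each literal is a basic Boolean attribute $A(\alpha)$ or its negation $\neg A(\alpha)$. The definition of the set of relevant antecedents $\Phi$ consists of definitions of the sets of relevant partial antecedents $\Phi_1, \dots, \Phi_m$; $\tau_1 \wedge \dots \wedge \tau_m$ is a relevant antecedent if $\tau_1 \in \Phi_1, \dots, \tau_m \in \Phi_m$.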

The definition of a set of relevant partial antecedents is given by a list of attributes, by a minimal and maximal number of literals in particular partial antecedents and by the type of partial antecedent, i.e. conjunctions or disjunctions. In addition, for each attribute in the list, a set of relevant basic Boolean attributes, which are generated automatically, is defined. There are various detailed possibilities for defining all relevant basic Boolean attributes (Rauch and Šimůnek, 2005). We outline only one of them. We use an attribute $A$ with categories 1, 2, 3, 4, 5. The option intervals of length 2-3 gives the basic Boolean attributes $A(1,2)$, $A(2,3)$, $A(3,4)$, $A(4,5)$, $A(1,2,3)$, $A(2,3,4)$, $A(3,4,5)$. In this way we can get the basic Boolean attributes $A(low)$, $A(high)$, $B(low)$, $B(high)$, see section 4.2.
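The option intervals of length 2-3 amounts to sliding windows over the ordered categories. The following Python sketch reproduces the list given above; the function name `interval_coefficients` is hypothetical and is not taken from the 4ft-Miner implementation.

```python
def interval_coefficients(categories, min_len, max_len):
    """Coefficients alpha formed by runs of consecutive categories.

    Returns tuples of length min_len..max_len, shorter intervals first.
    """
    cats = sorted(categories)
    result = []
    for length in range(min_len, max_len + 1):
        for start in range(len(cats) - length + 1):
            result.append(tuple(cats[start:start + length]))
    return result


coefficients = interval_coefficients(range(1, 6), min_len=2, max_len=3)
for alpha in coefficients:
    print("A" + str(alpha))
```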

The set $\Psi$ of relevant succedents is defined analogously. The output of 4ft-Miner is the set $\Omega$ of all association rules $\varphi \approx \psi$ which are true in $\mathcal{M}$ and for which $\varphi \in \Phi$ and $\psi \in \Psi$. The 4ft-Miner procedure does not use the apriori algorithm; its implementation is based on a representation of the analyzed data by suitable strings of bits (Rauch and Šimůnek, 2005).

Let us note that the 4ft-Miner procedure also mines for conditional association rules $\varphi \approx \psi / \chi$ where $\varphi$, $\psi$ and $\chi$ are Boolean attributes. The association rule $\varphi \approx \psi / \chi$ is true in a data matrix $\mathcal{M}$ if and only if the rule $\varphi \approx \psi$ is true in the data matrix $\mathcal{M}/\chi$, where $\mathcal{M}/\chi$ is the data matrix consisting of all rows of $\mathcal{M}$ satisfying $\chi$.
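Evaluating a conditional rule therefore reduces to restricting the rows and evaluating an ordinary rule. A minimal Python sketch follows; the function names, the example data and the chosen quantifier parameters are assumptions for illustration only.

```python
def four_fold_table(rows, phi, psi):
    """Return (a, b, c, d) of phi and psi in the given rows."""
    a = b = c = d = 0
    for row in rows:
        p, s = bool(phi(row)), bool(psi(row))
        if p and s:
            a += 1
        elif p:
            b += 1
        elif s:
            c += 1
        else:
            d += 1
    return a, b, c, d


def holds_conditionally(rows, phi, psi, chi, quantifier):
    """Truth of the rule phi ~ psi / chi: evaluate phi ~ psi on the rows satisfying chi."""
    restricted = [row for row in rows if chi(row)]   # the data matrix M/chi
    return bool(quantifier(*four_fold_table(restricted, phi, psi)))


def founded_implication(a, b, c, d, p=0.9, base=2):
    return a + b > 0 and a / (a + b) >= p and a >= base


# Hypothetical data: attributes A and B, and S used only in the condition.
rows = [
    {"A": 1, "B": 1, "S": 1},
    {"A": 1, "B": 1, "S": 1},
    {"A": 1, "B": 2, "S": 2},
    {"A": 1, "B": 2, "S": 2},
    {"A": 2, "B": 2, "S": 1},
]
phi = lambda r: r["A"] == 1
psi = lambda r: r["B"] == 1
print(holds_conditionally(rows, phi, psi, lambda r: True, founded_implication))
print(holds_conditionally(rows, phi, psi, lambda r: r["S"] == 1, founded_implication))
```

In this example the rule is false in the whole matrix but true under the condition, which is the typical reason for mining conditional rules.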

The input of 4ft-Miner can also contain a definition of a set $\Xi$ of relevant conditions, in addition to the definitions of the set of relevant antecedents $\Phi$ and the set of relevant succedents $\Psi$. The set $\Xi$ is defined analogously to the sets $\Phi$ and $\Psi$.

The output of 4ft-Miner is then the set $\Omega$ of all conditional association rules $\varphi \approx \psi / \chi$ which are true in $\mathcal{M}$ and for which $\varphi \in \Phi$, $\psi \in \Psi$ and $\chi \in \Xi$.

4.4 Procedures 4ft-Filter and 4ft-Synt

The 4ft-Filter procedure filters out consequences of a given item of domain knowledge from the output of 4ft-Miner. The item of domain knowledge is expressed by a formula from $Sent_T^{U}$, the set of sentences of $U_T$ (section 4.1). The 4ft-Synt procedure recognizes groups of patterns which can be considered as consequences of a (yet unknown) item of knowledge.

The function $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$, defined for all formulas $I \in Sent_T^{U}$, association rules $\varphi \approx \psi \in Sent_T$ and data matrices $\mathcal{M} \in \mathcal{M}_T$, can be used to realize the 4ft-Filter procedure. It is defined such that $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 1$ if the rule $\varphi \approx \psi$ can be considered as a consequence of $I$, otherwise $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 0$.

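The value $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ is computed using the function $Cons_T$, see section 4.2, and using deduction rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$, see section 3.3. There are criteria of correctness of rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ for each 4ft-quantifier $\approx$ of the 4ft-Miner procedure (Rauch, 2005; Rauch, 2008; Rauch and Šimůnek, 2005). The function $Cons_T$ is defined for all $I \in Sent_T^{U}$ and $\mathcal{M} \in \mathcal{M}_T$ such that $Cons_T(I, \mathcal{M}) = \Lambda$, where $\Lambda$ is the set of all association rules $\rho \approx \sigma$ which can be considered as atomic consequences of $I$ in $\mathcal{M}$.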

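The value $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ is computed in two steps. In the first step we compute the set $\Lambda = Cons_T(I, \mathcal{M})$. In the second step we test the correctness of $\frac{\rho \approx \sigma}{\varphi \approx \psi}$ for each $\rho \approx \sigma \in \Lambda$. If there is such a correct rule, then $\varphi \approx \psi$ is considered as a consequence of $I$ in $\mathcal{M}$ and $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 1$. Otherwise $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 0$.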
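The two steps can be sketched with both ingredients passed in as functions. In the Python sketch below, `cons` and `is_correct_deduction` are hypothetical parameters standing for $Cons_T$ and for a correctness criterion of the deduction rule. The toy criterion used in the example (same antecedent, wider succedent coefficient) is only a simple sufficient condition valid for founded implication. It is not the general criterion of (Rauch, 2008).

```python
def is4ft_consequence(item, rule, matrix, cons, is_correct_deduction):
    """Return 1 if rule follows from some atomic consequence of item, else 0."""
    atomic = cons(item, matrix)                      # step 1: Lambda = Cons(I, M)
    return int(any(is_correct_deduction(atom, rule)  # step 2: test each deduction rule
                   for atom in atomic))


# Toy instantiation (hypothetical): a rule is a pair (alpha, beta),
# read as A(alpha) => B(beta) with fixed quantifier parameters.
def toy_cons(item, matrix):
    if item == "A_upup_B":
        return {(frozenset({1}), frozenset({1}))}
    return set()


def same_antecedent_wider_succedent(atom, rule):
    # A(alpha) => B(beta) entails A(alpha) => B(beta2) whenever beta is a subset of beta2.
    return atom[0] == rule[0] and atom[1] <= rule[1]


mined = [
    (frozenset({1}), frozenset({1, 2})),   # follows from the atomic consequence
    (frozenset({2}), frozenset({1, 2})),   # different antecedent, does not follow
]
flags = [is4ft_consequence("A_upup_B", r, None, toy_cons, same_antecedent_wider_succedent)
         for r in mined]
print(flags)
```

4ft-Filter would then remove from the output of 4ft-Miner exactly the rules flagged with 1.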

The function $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ can also be used to realize the procedure 4ft-Synt, which recognizes groups of rules $\varphi \approx \psi$ that can be considered as consequences of (yet unknown) items of knowledge. We assume that each item of knowledge, even one not yet known, is represented by a formula of $Sent_T^{U}$. The procedure 4ft-Synt can then be realized such that we choose a formula $\omega \in Sent_T^{U}$ and, using the function $Is4ftConsequence(\omega, \varphi \approx \psi, \mathcal{M})$, pick out all consequences of $\omega$ from the output of the 4ft-Miner procedure. However, we have to limit the set of tested formulas $\omega \in Sent_T^{U}$ in some way. A more detailed study of this problem is beyond the scope of this paper.

5 4FT-DISCOVERER AND SEMANTIC WEB

One of the 10 challenging problems in data mining research (see http://www.cs.uvm.edu/~icdm/) is characterized as mining complex knowledge from complex data. It is emphasized that all that current data mining systems can do is hand the results back to the user. However, it is necessary to relate results to the real-world decisions they affect. One way to do this is to arrange results of data mining into an analytical report structured according to both the analyzed problem and the user's needs. The core of such a report is a set of assertions about the analyzed data together with some explanatory comments. Such an analytical report can be considered as a formal structure. An idea of indexing such reports by logical formulas corresponding to patterns resulting from data mining is outlined in (Rauch, 1997). This means that such analytical reports are natural candidates for the Semantic Web.

The SEWEBAR project, which builds on these ideas, is described in (Rauch and Šimůnek, 2007). It is assumed that there are various institutions (e.g. hospitals) storing data in their databases. Local analytical reports giving answers to various local analytical questions are produced automatically or semi-automatically. It is further assumed that these reports are presented on the Internet. It is natural to try to get answers to various global analytical questions using these local analytical reports. It is again assumed that answers to global analytical questions will be presented on the Internet in the form of analytical reports. We call such reports global analytical reports. Various aspects of the SEWEBAR project are discussed in (Rauch, 2007; Rauch and Šimůnek, 2009; Kliegr et al., 2009), including the formulation of analytical questions using various items of domain knowledge. Some experiments are presented at http://sewebar.vse.cz/.

The SEWEBAR project is based on dealing with analytical reports which are considered as formal structures. So far, no unified formal framework has been given for the project. The goal of this section is to discuss possibilities of enhancing 4ft-Discoverer to serve as a formal framework for the SEWEBAR project. We identify the main related problems and sketch possible ways of solving them.

An overview of the currently known main problems related to the enhancement of 4ft-Discoverer is given in section 5.1. Possible solutions of particular problems are discussed in sections 5.2 – 5.4.

5.1 Enhancing 4ft-Discoverer

The core of 4ft-Discoverer is a formal framework for dealing with domain knowledge and association rules (i.e. interesting couples of Boolean attributes related in a given way in a given data matrix). We have formulas expressing items of domain knowledge, the procedures 4ft-Miner, 4ft-Filter and 4ft-Synt, and the function $Is4ftConsequence$, see section 4.4.

With these tools we are able to achieve interesting results in solving various local analytical questions related to a given data matrix. Our task is to arrange

KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval


these results into a local analytical report in such a way that the report can be dealt with as a formal object. Some remarks on this problem are given in section 5.2.

The current version of 4ft-Discoverer is tailored to the analysis of one particular data matrix using domain knowledge expressed by formulas from Sent, see section 4.1. Very little attention is given to knowledge related to particular data matrices. This knowledge is assumed to be formalized by Sent, i.e. the set of (closed) formulas of language L, which is a language intended to express characteristics of particular data matrices, see section 4.1. This requires more attention even when mining in one particular data matrix. Some remarks on knowledge related to particular data matrices are given in section 5.3.

Our goal is to get answers to various global analytical questions using local analytical reports presented on the Internet. The answers to global analytical questions will be presented as global analytical reports. It is further assumed that such global analytical reports will be used as input for answering additional global analytical questions. This approach brings many various problems. Initial comments on them are given in section 5.4.

The application of classical Semantic web technologies is very important in the SEWEBAR project. This topic is not addressed in this paper. Let us, however, emphasize that there are various activities in this direction, see e.g. (Kliegr et al., 2009). The current state is presented at http://sewebar.vse.cz/.

Let us also note that 4ft-Discoverer is tailored to association rules mined by the 4ft-Miner GUHA procedure. There are six additional GUHA procedures mining for various types of patterns (Hájek et al., 2010). A similar formal framework can be developed for these procedures.

5.2 Local Analytical Reports

An example of a local analytical question is the following: Are there any association rules which can be considered as exceptions from the generally accepted fact $A \uparrow\uparrow B$ in a given data matrix $M$? We assume that the exception concerns a subset of rows defined by the attributes $C_1, \ldots, C_K$, which are columns of $M$. Informally speaking, this task can be solved in the following steps:

1. We identify exceptions with conditional association rules $\tau \approx \sigma / \chi$ satisfying

   • $\tau \approx \sigma \in \mathrm{Cons}(A \uparrow\downarrow B)$, i.e. $\tau \approx \sigma$ is an atomic consequence of $A \uparrow\downarrow B$, which contradicts $A \uparrow\uparrow B$;

   • $\chi$ is a Boolean attribute derived from the attributes $C_1, \ldots, C_K$.

2. We take into account that $\mathrm{Cons}(A \uparrow\downarrow B)$ and $\mathrm{Cons}(A \uparrow\uparrow B)$ may have rules in common. For example, it can happen that both

   $A(\mathrm{medium}) \Rightarrow_{p,B} B(\mathrm{medium}) \in \mathrm{Cons}(A \uparrow\downarrow B)$,

   $A(\mathrm{medium}) \Rightarrow_{p,B} B(\mathrm{medium}) \in \mathrm{Cons}(A \uparrow\uparrow B)$,

   see section 4.2.

3. We use 4ft-Miner with input parameters such that

   • the set $\Phi$ of relevant antecedents is the set of all $\tau$ where $\tau \approx \sigma \in \mathrm{Cons}(A \uparrow\downarrow B)$. This can be done thanks to the option intervals for the set of all relevant basic Boolean attributes derived from attribute $A$, see section 4.3;

   • the set $\Psi$ of relevant succedents is the set of all $\sigma$ where $\tau \approx \sigma \in \mathrm{Cons}(A \uparrow\downarrow B)$;

   • the set $\Xi$ of relevant conditions is defined as a set of Boolean attributes derived from the attributes $C_1, \ldots, C_K$ in a suitable way;

   • we use the quantifier $\Rightarrow_{p,B}$ with $p = 0.9$ and with $B$ bounded from below by a value depending on $n$, the number of rows of data matrix $M$, see section 4.2.

4. The function $\mathrm{Is4ftConsequence}(A \uparrow\uparrow B, \varphi \approx \psi, M)$ (see section 4.4) is used to filter out from $\Omega$ all rules $\tau \approx \sigma / \chi$ satisfying $\tau \approx \sigma \in \mathrm{Cons}(A \uparrow\uparrow B)$.

5. The remaining conditional association rules correspond to the searched exceptions.
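
The filtering part of the steps above can be illustrated by a minimal Python sketch. It is not the 4ft-Miner or Is4ftConsequence implementation: the classes `Rule` and `ConditionalRule`, the helper names, and the toy rule sets standing in for the two Cons sets and for Ω are all assumptions made for illustration, and the 4ft-quantifier is left implicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Association rule 'antecedent ≈ succedent'; the 4ft-quantifier is implicit."""
    antecedent: str
    succedent: str


@dataclass(frozen=True)
class ConditionalRule:
    """Conditional association rule 'rule / condition'."""
    rule: Rule
    condition: str


def relevant_sides(cons_opposite):
    """Step 3: antecedents and succedents of the atomic consequences of A ↑↓ B."""
    antecedents = {r.antecedent for r in cons_opposite}
    succedents = {r.succedent for r in cons_opposite}
    return antecedents, succedents


def find_exceptions(mined, cons_accepted):
    """Steps 4-5: drop every mined rule whose unconditional part is also an
    atomic consequence of the accepted fact A ↑↑ B; the rest are exceptions."""
    return [cr for cr in mined if cr.rule not in cons_accepted]


# Toy stand-ins (assumptions) for Cons(A ↑↓ B) and Cons(A ↑↑ B).
cons_opposite = {
    Rule("A(low)", "B(high)"),
    Rule("A(medium)", "B(medium)"),
    Rule("A(high)", "B(low)"),
}
cons_accepted = {
    Rule("A(low)", "B(low)"),
    Rule("A(medium)", "B(medium)"),
    Rule("A(high)", "B(high)"),
}

# Toy stand-in for the mined set Ω, with a hypothetical condition C1(x).
omega = [
    ConditionalRule(Rule("A(low)", "B(high)"), "C1(x)"),
    ConditionalRule(Rule("A(medium)", "B(medium)"), "C1(x)"),
]

antecedents, succedents = relevant_sides(cons_opposite)
exceptions = find_exceptions(omega, cons_accepted)
```

In this toy run the rule on A(medium) and B(medium) is shared by both consequence sets, as in step 2, so it is filtered out and only the rule on A(low) and B(high) remains as an exception candidate.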

The steps informally described above can be formalized and automated. In addition, they can be described in such a way that the description can be understood as a local analytical report answering the given analytical question. This approach differs from that introduced in (Suzuki, 2004).

Such local analytical reports are formal structures and they can be indexed for automated search. Formulas like $\tau \approx \sigma / \chi$ and $A \uparrow\uparrow B$ can also be used for indexing and searching in order to deal with semantics. Many similar local analytical questions can be formalized and answered by local analytical reports in the way outlined above. Some of them are sketched in (Rauch and Šimůnek, 2009). Detailed elaboration of this topic is a subject of current research.
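
To illustrate what indexing reports by formulas could look like, the following Python sketch builds a simple inverted index from formula strings to report identifiers. The report identifiers, the representation of a report as a plain list of formula strings, and the function `build_index` are hypothetical; the paper does not prescribe any particular index structure.

```python
from collections import defaultdict


def build_index(reports):
    """Map each formula string to the identifiers of the reports that mention it."""
    index = defaultdict(set)
    for report_id, formulas in reports.items():
        for formula in formulas:
            index[formula].add(report_id)
    return index


# Hypothetical reports, each reduced to the formulas it refers to.
reports = {
    "hospital_1": ["A ↑↑ B", "A(low) ≈ B(high) / C1(x)"],
    "hospital_2": ["A ↑↑ B"],
}

index = build_index(reports)
matching = sorted(index["A ↑↑ B"])  # reports dealing with the item A ↑↑ B
```

A search for the item A ↑↑ B then returns the reports that refer to it, which is the kind of formula-based lookup the paragraph above has in mind.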

5.3 Knowledge on Data Matrices

Properties of the analyzed data are crucial for the analysis and for the interpretation of results. Ideally, the data satisfy all requirements for a correct application of statistical approaches. In data mining, however, this is rarely the case. Our goal is to use properties of the analyzed data both in the formulation and


solution of suitable analytical questions, in a way similar to how the domain knowledge expressed by formulas of Sent is used.

Language L is intended to express characteristics of particular data matrices. The set Sent of closed formulas of this language is assumed to be used to deal with knowledge on particular data matrices in the same way as the formulas expressing domain knowledge are used, see section 4.1. This means we have to:

• obtain formulas of Sent expressing important properties of data matrices, in a way similar to how the formulas $A \uparrow\uparrow B$, $A \uparrow\downarrow B$, ... express important items of domain knowledge, see section 4.1;

• define a function ConsM adding semantics to formulas from Sent, similarly to the way the function Cons gives semantics to the formulas expressing domain knowledge; $\mathrm{ConsM}(I, M)$ is a set of association rules that can be considered as the set of all atomic consequences of the item $I$ of knowledge on data matrix $M$;

• define a function $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M)$ for all items $I \in \mathrm{Sent}$, all association rules $\varphi \approx \psi$ and all data matrices $M$ such that $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M) = 1$ if the rule $\varphi \approx \psi$ can be considered as a consequence of $I$ in data matrix $M$, and $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M) = 0$ otherwise; $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M)$ is analogous to $\mathrm{Is4ftConsequence}(I, \varphi \approx \psi, M)$, see section 4.4.

We give a very simple example of a formula from Sent. It is the formula $\mathrm{Fr}_{\geq 0.9}(A(1))$, saying that at least 90 per cent of the rows of a data matrix satisfy the basic Boolean attribute $A(1)$. It holds that $\mathrm{Val}(\mathrm{Fr}_{\geq 0.9}(A(1)), M) = 1$ if at least 90 per cent of the rows of $M$ satisfy the basic Boolean attribute $A(1)$; otherwise $\mathrm{Val}(\mathrm{Fr}_{\geq 0.9}(A(1)), M) = 0$.
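
The evaluation of such a frequency formula can be sketched in a few lines of Python. This is an illustration only: the representation of a data matrix as a list of dictionaries, the function name `val_fr_at_least` and the toy matrices are assumptions, and exact fractions are used so that the 90 per cent boundary is not affected by floating-point rounding.

```python
from fractions import Fraction


def val_fr_at_least(matrix, attribute, category, threshold):
    """Return 1 if at least `threshold` of the rows satisfy attribute(category), else 0."""
    if not matrix:
        raise ValueError("data matrix has no rows")
    satisfied = sum(1 for row in matrix if row[attribute] == category)
    return 1 if Fraction(satisfied, len(matrix)) >= Fraction(threshold) else 0


# Toy data matrices with a single column A; each row is a dictionary.
matrix_90 = [{"A": 1}] * 9 + [{"A": 2}]        # 9 of 10 rows satisfy A(1)
matrix_80 = [{"A": 1}] * 8 + [{"A": 2}] * 2    # 8 of 10 rows satisfy A(1)

val_90 = val_fr_at_least(matrix_90, "A", 1, "0.9")
val_80 = val_fr_at_least(matrix_80, "A", 1, "0.9")
```

The first toy matrix reaches the 90 per cent threshold exactly and evaluates to 1; the second one does not and evaluates to 0.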

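The function ConsM can be seen as a family of functions $\mathrm{ConsM}_{\approx}$, where $\approx$ is a 4ft-quantifier of the language of association rules; it is analogous to $\mathrm{Cons}_{\approx}$. Then $\mathrm{ConsM}(\mathrm{Fr}_{\geq 0.9}(A(1)), M)$ is defined as the union

$$\bigcup \{ \mathrm{ConsM}_{\approx}(\mathrm{Fr}_{\geq 0.9}(A(1)), M) \mid {\approx} \text{ belongs to the language of association rules} \}.$$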

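We outline the function $\mathrm{ConsM}_{\Rightarrow_{p,B}}$ for the 4ft-quantifier $\Rightarrow_{p,B}$ of founded implication (see table 1). We can define $\mathrm{ConsM}_{\Rightarrow_{0.9,B}}(\mathrm{Fr}_{\geq 0.9}(A(1)), M)$ as the set of all rules $\varphi \Rightarrow_{p,B} A(1)$ where $0.85 \leq p \leq 0.95$ and $B$ is bounded from below by a value depending on $n$, the number of rows of data matrix $M$. However, the bounds for $p$ and $B$ should be determined by a domain expert.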
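
As a rough illustration, the following Python sketch evaluates the founded implication quantifier on a toy data matrix and tests membership in a set of the kind outlined above. It assumes the usual GUHA definition of founded implication ($a/(a+b) \geq p$ and $a \geq B$, where $a$ and $b$ are the frequencies from the 4ft-table); the function names, the row representation and the parameter `min_base` standing for the lower bound on $B$ are hypothetical, and the concrete bounds are exactly what a domain expert would have to supply.

```python
def fourfold_counts(matrix, antecedent, succedent):
    """Return (a, b): rows satisfying antecedent and succedent, and antecedent only."""
    a = sum(1 for row in matrix if antecedent(row) and succedent(row))
    b = sum(1 for row in matrix if antecedent(row) and not succedent(row))
    return a, b


def founded_implication(a, b, p, base):
    """Founded implication (usual GUHA form): a / (a + b) >= p and a >= base."""
    return a + b > 0 and a >= p * (a + b) and a >= base


def in_consm_sketch(succedent_label, p, base, min_base, p_low=0.85, p_high=0.95):
    """Membership test for the outlined set: a rule 'phi =>_{p,B} A(1)' qualifies
    if p lies in [p_low, p_high] and B reaches the assumed lower bound min_base."""
    return succedent_label == "A(1)" and p_low <= p <= p_high and base >= min_base


# Toy data matrix with a hypothetical condition column C and attribute A.
matrix = [{"C": "x", "A": 1}] * 19 + [{"C": "x", "A": 2}] + [{"C": "y", "A": 2}] * 5

a, b = fourfold_counts(matrix, lambda r: r["C"] == "x", lambda r: r["A"] == 1)
holds = founded_implication(a, b, p=0.9, base=10)
member = in_consm_sketch("A(1)", p=0.9, base=10, min_base=5)
not_member = in_consm_sketch("A(1)", p=0.8, base=10, min_base=5)
```

Keeping the interval for $p$ and the lower bound for $B$ as parameters mirrors the remark that these bounds are a matter of expert judgement rather than fixed constants.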

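$\mathrm{Is4ftConsequenceM}(\mathrm{Fr}_{\geq 0.9}(A(1)), \varphi \approx \psi, M)$ is computed in two steps, see also section 4.4. In the first step we compute the set $\Lambda = \mathrm{ConsM}(\mathrm{Fr}_{\geq 0.9}(A(1)), M)$ of rules $\rho \approx \sigma$ which can be considered as atomic consequences of $\mathrm{Fr}_{\geq 0.9}(A(1))$ in $M$. In the second step we test the correctness of the deduction rule $\frac{\rho \approx \sigma}{\varphi \approx \psi}$ for each $\rho \approx \sigma \in \Lambda$. If there is such a correct rule, then $\varphi \approx \psi$ is considered as a consequence of $\mathrm{Fr}_{\geq 0.9}(A(1))$ in $M$ and $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M) = 1$ for $I = \mathrm{Fr}_{\geq 0.9}(A(1))$. Otherwise $\mathrm{Is4ftConsequenceM}(I, \varphi \approx \psi, M) = 0$.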
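
The two-step computation can be sketched in Python as follows. The sketch only fixes the control flow: the function that produces atomic consequences and the test of correctness of a deduction rule are passed in as parameters, and the toy versions used below are stand-ins invented for illustration, not the criteria of section 4.4.

```python
def is_4ft_consequence_m(item, rule, matrix, cons_m, is_correct_deduction):
    """Two-step test: 1 if some atomic consequence of `item` in `matrix`
    yields `rule` by a correct deduction rule, else 0."""
    atomic_consequences = cons_m(item, matrix)           # step 1: the set Λ
    return int(any(is_correct_deduction(premise, rule)   # step 2: test each rule
                   for premise in atomic_consequences))


# Toy stand-ins (assumptions): a rule is a pair (antecedent, set of succedent literals).
def toy_cons_m(item, matrix):
    return {("C(x)", frozenset({"A(1)"}))}


def toy_is_correct_deduction(premise, conclusion):
    # Toy criterion only: same antecedent, and the succedent of the conclusion
    # (read as a disjunction of literals) extends the succedent of the premise.
    return premise[0] == conclusion[0] and premise[1] <= conclusion[1]


item = "Fr>=0.9(A(1))"
result_yes = is_4ft_consequence_m(
    item, ("C(x)", frozenset({"A(1)", "A(2)"})), [], toy_cons_m, toy_is_correct_deduction)
result_no = is_4ft_consequence_m(
    item, ("C(y)", frozenset({"A(1)"})), [], toy_cons_m, toy_is_correct_deduction)
```

Passing the two ingredients as functions keeps the sketch independent of how atomic consequences and correct deduction rules are actually defined.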

Detailed elaboration of the outlined approach is a

subject of current research.

5.4 Global Analytical Reports

The goal is to get answers to various global analytical questions using local analytical reports presented on the Internet. The answers to global analytical questions will be presented as global analytical reports. It is further assumed that such global analytical reports will be used as input for answering additional global analytical questions. This means that the global analytical reports must again be treated as formal objects.

The global analytical questions are formulated on the basis of available local analytical reports. Thus the research on global analytical questions must start with preparing a variety of local analytical questions and corresponding analytical reports. An example of a local analytical question, together with a sketch of its solution by means of 4ft-Discoverer, is given in section 5.2. Additional examples of local analytical questions are in (Rauch and Šimůnek, 2009).

Each local analytical question leads to several global analytical questions. We denote as $\mathrm{LAQ}_1$ the local analytical question introduced in section 5.2: Are there any association rules which can be considered as exceptions from the generally accepted fact $A \uparrow\uparrow B$ in a given data matrix $M$? We assume that the exception concerns a subset of objects defined by the attributes $C_1, \ldots, C_K$ of data matrix $M$. Then we can formulate, e.g., the following global analytical questions $\mathrm{GAQ}_1$ and $\mathrm{GAQ}_2$:

$\mathrm{GAQ}_1$: Which data matrices are similar to the given data matrix $M$ as far as solutions of $\mathrm{LAQ}_1$ are concerned?

$\mathrm{GAQ}_2$: Which data matrices differ from the given data matrix $M$ as far as solutions of $\mathrm{LAQ}_1$ are concerned?

Many additional global analytical questions can be formulated. The core problem related to the solution of such global analytical questions is the comparison of results concerning two data matrices $M_1$ and $M_2$. There are two possibilities:


1. Both $M_1$ and $M_2$ belong to one 4ft-Discoverer $\mathit{4ftD}_T$.

2. $M_1$ belongs to 4ft-Discoverer $\mathit{4ftD}_{T_1}$ and $M_2$ belongs to 4ft-Discoverer $\mathit{4ftD}_{T_2}$, where $T_1 \neq T_2$.

There is a research effort to solve the problem of comparing $M_1$ and $M_2$ for both possibilities. However, its description is out of the scope of this paper.

6 CONCLUSIONS

Logic of discovery was introduced in (Hájek and Havránek, 1978) and modified in (Rauch, 2010). The modification resulted in the system 4ft-Discoverer $\mathit{4ftD}_T$, which is a framework for mining association rules and for applying domain knowledge in the mining process. We have briefly introduced the 4ft-Discoverer $\mathit{4ftD}_T$ and then shown that it can be enhanced for the needs of the SEWEBAR project, which aims at disseminating results of data mining in the form of analytical reports answering reasonable analytical questions.

We have identified several research problems related to this enhancement and outlined possible ways of solving them. Further work concerns the solution of these problems.

ACKNOWLEDGEMENTS

This paper was prepared with the support of Institutional funds for support of a long-term development of science and research at the Faculty of Informatics and Statistics of the University of Economics, Prague.

REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In 19 ACM SIGMOD Conf. on the Management of Data, Washington, DC.

Hájek, P. and Havránek, T. (1978). Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, Berlin Heidelberg New York, 1st edition.

Hájek, P., Havránek, T., and Chytil, M. (1983). The GUHA Method (in Czech). Academia, Praha, 1st edition.

Hájek, P., Holeňa, M., and Rauch, J. (2010). The GUHA method and its meaning for data mining. Journal of Computer and System Sciences, 76(1):34–48.

Kliegr, T., Ralbovský, M., Svátek, V., Šimůnek, M., Jirkovský, V., Nemrava, J., and Zemánek, J. (2009). Semantic analytical reports: A framework for post-processing data mining results. In Rauch, J., Ras, Z. W., Berka, P., and Elomaa, T., editors, ISMIS, volume 5722 of Lecture Notes in Computer Science, pages 88–98. Springer.

Rauch, J. (1997). Logical calculi for knowledge discovery in databases. In Komorowski, J. and Zytkow, J., editors, Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, volume 1263 of LNAI, pages 47–57, Berlin. Springer.

Rauch, J. (2005). Logic of association rules. Applied Intelligence, 22(1):9–28.

Rauch, J. (2007). Project SEWEBAR considerations on semantic web and data mining. In IICAI, pages 1763–1782.

Rauch, J. (2008). Classes of association rules: An overview. In Lin, T. Y., Xie, Y., Wasilewska, A., and Liau, C.-J., editors, Data Mining: Foundations and Practice, volume 118 of Studies in Computational Intelligence, pages 315–337. Springer.

Rauch, J. (2009). Considerations on logical calculi for dealing with knowledge in data mining. In Ras, Z. W. and Dardzinska, A., editors, Advances in Data Management, volume 118 of Studies in Computational Intelligence, pages 177–199. Springer.

Rauch, J. (2010). Considerations on logic of discovery and data mining. Suggested for publication at CLA 2010.

Rauch, J. and Šimůnek, M. (2005). An alternative approach to mining association rules. In Lin, T. Y., Ohsuga, S., Liau, C.-J., Hu, X., and Tsumoto, S., editors, Foundations of Data Mining and Knowledge Discovery, volume 6 of Studies in Computational Intelligence, pages 211–231. Springer.

Rauch, J. and Šimůnek, M. (2007). Semantic web presentation of analytical reports from data mining - preliminary considerations. In Web Intelligence, pages 3–7. IEEE Computer Society.

Rauch, J. and Šimůnek, M. (2009). Dealing with background knowledge in the SEWEBAR project. In Berendt, B., Mladenič, D., de Gemmis, M., Semeraro, G., Spiliopoulou, M., Stumme, G., Svátek, V., and Železný, F., editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 89–106. Springer.

Suzuki, E. (2004). Discovering interesting exception rules with rule pair. In Fuernkranz, J., editor, Proceedings of the ECML/PKDD Workshop on Advances in Inductive Rule Learning, pages 163–178.
