LOGIC OF DISCOVERY, DATA MINING AND SEMANTIC WEB
Position Paper
Jan Rauch
Faculty of Informatics and Statistics, University of Economics
nám. W. Churchilla 4, 130 67 Prague, Czech Republic
Keywords:
Logic of discovery, Association rules, Logic of association rules, Data mining, Semantic web.
Abstract:
Logic of discovery was developed in the 1970s as an answer to the questions "Can computers formulate and justify scientific hypotheses?" and "Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world?". Logic of discovery is based on two semantic systems. The observational semantic system corresponds to observational data and to statements about observational data. The theoretical semantic system concerns suitable state-dependent structures. The two systems are related by inductive inference rules corresponding to statistical approaches. An attempt was made to adapt logic of discovery to data mining, and a framework that makes it possible to deal with domain knowledge in data mining was developed. The possibility of enhancing this framework to present results of data mining through the Semantic web is suggested and discussed.
1 INTRODUCTION
Logic of discovery is developed in the book (Hájek and Havránek, 1978) as an answer to questions $Q_1$, $Q_2$: ($Q_1$) Can computers formulate and justify scientific hypotheses? ($Q_2$) Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world? Answers to these questions are based on a scheme of inductive inference:
$$\frac{\text{theoretical assumptions, observational statement}}{\text{theoretical statement}}$$
Logic of discovery deals with two semantic systems: an observational semantic system and a theoretical semantic system. The observational semantic system has a language for speaking about observational data. The theoretical semantic system concerns state-dependent structures. The two systems are connected by inductive inference rules based on statistical approaches.
An attempt to adapt logic of discovery to the needs of data mining resulted in a proposal of the system 4ft-Discoverer (Rauch, 2010), which is intended to be an experimental framework that makes it possible to deal with domain knowledge when mining a particular data set. The goal of this paper is to discuss the possibility of enhancing this framework to serve as a basis for disseminating results of data mining through the Semantic web.
System 4ft-Discoverer is based on logic of associ-
ation rules (Rauch, 2005). The association rule is un-
derstood here as a general relation of two Boolean at-
tributes. Main features of logic of discovery are sum-
marized in section 2. The logic of association rules is
introduced in section 3. Important features of the 4ft-
Discoverer are described in section 4. Possibilities to
enhance 4ft-Discoverer to a framework for dissemi-
nating results of data mining through Semantic web
are discussed in section 5.
2 LOGIC OF DISCOVERY
The schema of inductive inference introduced in section 1 inspired five additional questions (Hájek and Havránek, 1978):
L0: In what languages does one formulate observa-
tional and theoretical statements?
L1: What are rational inductive inference rules bridg-
ing the gap between observational and theoretical sen-
tences? (What does it mean that a theoretical state-
ment is justified?)
L2: Are there rational methods for deciding whether
a theoretical statement is justified (on the basis of
given theoretical assumptions and observational state-
ments)?
Rauch J..
LOGIC OF DISCOVERY, DATA MINING AND SEMANTIC WEB - Position Paper.
DOI: 10.5220/0003117203420351
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 342-351
ISBN: 978-989-8425-28-7
Copyright © 2010 SCITEPRESS (Science and Technology Publications, Lda.)
L3: What are the conditions for a theoretical statement
or a set of theoretical statements to be of interest with
respect to the task of scientific cognition?
L4: Are there methods for suggesting such a set of
statements which is as interesting (important) as pos-
sible?
Answering questions (L0) – (L2) leads to a logic of induction; answers to questions (L3) and (L4) lead to a logic of suggestion. Answers to questions (L0) – (L4) constitute the logic of discovery developed in (Hájek and Havránek, 1978). The rational inductive inference rules bridging the gap between observational and theoretical sentences are based on statistical approaches, i.e. estimates of various parameters or statistical hypothesis tests are used.
A semantic system is defined to formalize languages for observational and theoretical statements. A semantic system $S = \langle Sent, \mathcal{M}, V, Val \rangle$ is determined by a non-empty set $Sent$ of sentences, a non-empty set $\mathcal{M}$ of models, a non-empty set $V$ of abstract values and an evaluating function $Val : (Sent \times \mathcal{M}) \to V$. If $\varphi \in Sent$ and $M \in \mathcal{M}$ then $Val(\varphi, M)$ is the value of $\varphi$ in $M$. A semantic system $S = \langle Sent, \mathcal{M}, V, Val \rangle$ is observational if $Sent$, $\mathcal{M}$, $V$ are recursive sets and $Val$ is a partial recursive function.
Two semantic systems are developed: an observational semantic system $S_O = \langle Sent_O, \mathcal{M}_O, V_O, Val_O \rangle$ corresponding to the analyzed data, and a theoretical semantic system $S_T = \langle Sent_T, U_T, V_T, Val_T \rangle$ corresponding to the whole set of objects we are interested in. The analyzed data can concern only a part of this whole set. Rationality of inductive inference rules is based on statistical approaches. This leads to observational semantic systems with formulas corresponding to statistical hypothesis tests. An example of an observational system is related to the logical calculus of association rules, see section 3.
3 LOGIC OF ASSOCIATION RULES
The observational semantic systems most studied in (Hájek and Havránek, 1978) are based on observational predicate calculi, which are introduced in section 3.1. Logical calculi of association rules can be understood as modifications of observational predicate calculi; they are informally defined in section 3.2. Deduction rules are very important in logical calculi of association rules; some practically important deduction rules are mentioned in section 3.3.
3.1 Observational Predicate Calculi
Observational predicate calculus results from two modifications of classical predicate calculus: only finite models are allowed and generalized quantifiers are added. Finite models correspond to data resulting from observation, and generalized quantifiers make it possible to express various assertions about analyzed data, including assertions corresponding to statistical hypothesis tests.
The set $Sent_P$ of all closed formulas of an observational predicate calculus $P$ can be used to build an observational semantic system $S_P = \langle Sent_P, \mathcal{M}_P, V_P, Val_P \rangle$, where $\mathcal{M}_P$ is the set of all models (i.e. finite data structures) of $P$, $V_P = \{0,1\}$ and $Val_P$ is a function assigning a value from $\{0,1\}$ to each couple $\langle M, \Phi \rangle$ where $M \in \mathcal{M}_P$ and $\Phi \in Sent_P$. If $Val_P(M, \Phi) = 1$ then $\Phi$ is true in $M$, otherwise $\Phi$ is false in $M$.
If we use a predicate calculus $P$ with only unary predicates $P_1, \ldots, P_n$, then each model $M \in \mathcal{M}_P$ of $S_P$ is a $\{0,1\}$ data matrix with $n$ columns. Expressions $(\forall x)P_1(x)$ and $(\exists x)(P_1(x) \wedge P_2(y))$ are examples of formulas with the classical quantifiers $\forall$ and $\exists$.
Expressions $\Rightarrow^{!}_{p,\alpha,B}(x)(P_1(x), P_2(x))$ and $\equiv_{p,B}(x)(P_1(x) \vee P_3(x), P_2(y) \wedge P_4(x))$ are examples of formulas with the generalized quantifiers $\Rightarrow^{!}_{p,\alpha,B}$ and $\equiv_{p,B}$, which are introduced in table 1. These expressions concern couples of derived predicates $\langle P_1(x); P_2(x) \rangle$ and $\langle P_1(x) \vee P_3(x); P_2(y) \wedge P_4(x) \rangle$; they can be understood as generalizations of association rules.
3.2 Logical Calculi of Association Rules
The boom of association rules in the 1990s (Agrawal et al., 1993) started a new effort in the study of association rules as formulas of observational calculi. The syntax of the formulas of observational predicate calculi was significantly simplified: only calculi with monadic predicates are studied further, free and bound variables are omitted, and basic Boolean attributes are used instead of predicates. The resulting calculi can be understood as logical calculi of association rules (Rauch, 2005; Rauch, 2008; Rauch, 2009).
We informally outline the definition of the semantic system $AR^T = \langle Sent^T_{AR}, \mathcal{M}^T, \{0,1\}, Val^T_{AR} \rangle$ of type $T$ concerning association rules. Elements of $Sent^T_{AR}$ are association rules $\varphi \approx \psi$ where $\varphi$ and $\psi$ are Boolean attributes derived from columns of an analyzed data matrix $\mathcal{M}$ of type $T$ and $\approx$ is a 4ft-quantifier. Such association rules are closed formulas of the language $L^T_{AR}$ of association rules, which is outlined in
section 3.2.1. $\mathcal{M}^T$ is the set of all data matrices of type $T$, see section 3.2.2. $Val^T_{AR}$ is an evaluating function assigning a value $Val^T_{AR}(\varphi \approx \psi, \mathcal{M}) \in \{0,1\}$ to each couple $\mathcal{M} \in \mathcal{M}^T$ and $\varphi \approx \psi \in Sent^T_{AR}$. It is introduced in section 3.2.3.
3.2.1 Language $L^T_{AR}$
An association rule is an expression $\varphi \approx \psi$ where $\varphi$ and $\psi$ are Boolean attributes derived from columns of an analyzed data matrix and $\approx$ is a 4ft-quantifier. The Boolean attribute $\varphi$ is called the antecedent and the Boolean attribute $\psi$ is called the succedent.
Basic Boolean attributes are created first. A basic Boolean attribute is an expression $A(\alpha)$ where $\alpha \subseteq \{a_1, \ldots, a_t\}$ and $\{a_1, \ldots, a_t\}$ is the set of all categories of the attribute $A$. The basic Boolean attribute $A(\alpha)$ is true in row $o$ of $\mathcal{M}$ if $A(o) \in \alpha$, where $A(o)$ is the value of the attribute $A$ in row $o$. Examples of basic Boolean attributes are in figure 1. These Boolean attributes are derived from columns of a data matrix $\mathcal{M}$ with columns corresponding to attributes $A_1, \ldots, A_K$.
M      A_1    ...    A_K    A_1(1)    A_K(2,6)
o_1    1      ...    6      1         1
...    ...    ...    ...    ...       ...
o_n    3      ...    1      0         0
Figure 1: Data matrix M and basic Boolean attributes.
Boolean attributes $\varphi$ and $\psi$ are derived from basic Boolean attributes using the connectives $\wedge$, $\vee$ and $\neg$ in the usual way. The expression
$A_1(1) \wedge A_2(4,5) \approx A_K(2,6)$
is an example of an association rule.
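As a minimal sketch of how basic and derived Boolean attributes can be evaluated on rows of a data matrix, consider the following Python fragment. The row representation (a dictionary from attribute names to categories), the helper names `basic`, `conj`, `disj`, `neg` and the sample rows are assumptions made for illustration only; they are not taken from any existing implementation.

```python
from typing import Callable, Dict, Set

Row = Dict[str, int]               # one observed object: attribute name -> category
BoolAttr = Callable[[Row], bool]   # a Boolean attribute, evaluated on a single row


def basic(attribute: str, alpha: Set[int]) -> BoolAttr:
    """Basic Boolean attribute A(alpha): true in row o iff A(o) is in alpha."""
    return lambda row: row[attribute] in alpha


def conj(phi: BoolAttr, psi: BoolAttr) -> BoolAttr:
    """Conjunction of two Boolean attributes."""
    return lambda row: phi(row) and psi(row)


def disj(phi: BoolAttr, psi: BoolAttr) -> BoolAttr:
    """Disjunction of two Boolean attributes."""
    return lambda row: phi(row) or psi(row)


def neg(phi: BoolAttr) -> BoolAttr:
    """Negation of a Boolean attribute."""
    return lambda row: not phi(row)


# Hypothetical rows in the spirit of figure 1.
rows = [{"A1": 1, "A2": 4, "AK": 6}, {"A1": 3, "A2": 5, "AK": 1}]
antecedent = conj(basic("A1", {1}), basic("A2", {4, 5}))
succedent = basic("AK", {2, 6})
print([antecedent(r) for r in rows])  # [True, False]
print([succedent(r) for r in rows])   # [True, False]
```

Representing each Boolean attribute as a function of a row keeps the connectives compositional, which mirrors the inductive definition of Boolean attributes.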
We consider only data matrices whose values are natural numbers. The natural numbers represent categories, i.e. possible values of the observed attributes $A_1, \ldots, A_K$. Columns of a data matrix correspond to attributes and rows correspond to observed objects, e.g. to patients. An example of such a data matrix is in figure 1.
There is only a finite number of categories, i.e. possible values, for each attribute. Let us assume that the number of possible values of a column is $t$ and that the possible values in this column are the natural numbers $1, \ldots, t$. All possible values in the data matrix are then described by the numbers of possible values for each column. The whole information on the number of columns and the possible values in the data matrix is then given by the type of the data matrix: a type of data matrix is a $K$-tuple $T = \langle t_1, \ldots, t_K \rangle$ where $t_i \geq 2$ are natural numbers for $i = 1, \ldots, K$.
Symbols of the language $L^T_{AR}$ of association rules of type $T = \langle t_1, \ldots, t_K \rangle$ are attributes $A_1, \ldots, A_K$, 4ft-quantifiers $\approx_1, \ldots, \approx_Q$, propositional connectives $\wedge$, $\vee$, $\neg$ and parentheses. The basic Boolean attributes $A(\alpha)$ are defined as described above. Each basic Boolean attribute is a Boolean attribute; if $\varphi$ and $\psi$ are Boolean attributes, then $\neg\varphi$, $\varphi \wedge \psi$ and $\varphi \vee \psi$ are Boolean attributes.
The set $Sent^T_{AR}$ of the semantic system $S^T_{AR}$ of association rules of type $T$ is the set of all association rules, i.e. closed formulas of the language $L^T_{AR}$. A formal definition of the language of association rules is given e.g. in (Rauch, 2005).
3.2.2 Data Matrices $\mathcal{M}^T$
A more formal definition of a data matrix, with the number of columns and the numbers of possible values in particular columns given by the type $T = \langle t_1, \ldots, t_K \rangle$, is used. Let $T = \langle t_1, \ldots, t_K \rangle$ be the type of data matrix. Then a data matrix of type $T$ is a $(K+1)$-tuple $\mathcal{M} = \langle M, f_1, \ldots, f_K \rangle$, where $M$ is a non-empty finite set and $f_i$ is a unary function from $M$ to $\{1, \ldots, t_i\}$ for $i = 1, \ldots, K$. The set $M$ is the set of rows of the data matrix $\mathcal{M}$; it is called the domain of $\mathcal{M}$ and we write $M = Dom(\mathcal{M})$. An example of a data matrix $\mathcal{M} = \langle M, f_1, \ldots, f_K \rangle$ is in figure 2. We assume that $M = \{o_1, \ldots, o_n\}$.
object    f_1         ...    f_K
o_1       f_1(o_1)    ...    f_K(o_1)
...       ...         ...    ...
o_n       f_1(o_n)    ...    f_K(o_n)
Figure 2: Data matrix $\mathcal{M} = \langle M, f_1, \ldots, f_K \rangle$.
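The definition of a data matrix of type $T$ can be illustrated by a small check. The sketch below assumes that a data matrix is stored as a sequence of rows, each row being a tuple $(f_1(o), \ldots, f_K(o))$; the function name `is_data_matrix_of_type` is hypothetical.

```python
from typing import Sequence, Tuple


def is_data_matrix_of_type(rows: Sequence[Tuple[int, ...]],
                           type_t: Tuple[int, ...]) -> bool:
    """Check that rows form a data matrix of type T = <t_1, ..., t_K>."""
    if not rows:
        return False  # the domain M must be non-empty
    for row in rows:
        if len(row) != len(type_t):
            return False  # every row needs exactly K values
        # each value f_i(o) must lie in {1, ..., t_i}
        if any(not (1 <= value <= t_i) for value, t_i in zip(row, type_t)):
            return False
    return True


print(is_data_matrix_of_type([(1, 6), (3, 1)], (3, 6)))  # True
print(is_data_matrix_of_type([(1, 7)], (3, 6)))          # False
```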
3.2.3 Evaluation Function $Val^T_{AR}$
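An association rule $\varphi \approx \psi$ can be true or false in a given data matrix $\mathcal{M} \in \mathcal{M}^T$. The rule $\varphi \approx \psi$ is verified on the basis of the four-fold table $4ft(\varphi, \psi, \mathcal{M})$ of $\varphi$ and $\psi$ in $\mathcal{M}$, see figure 3.
M      ψ    ¬ψ
ϕ      a    b
¬ϕ     c    d
Figure 3: 4ft-table $4ft(\varphi, \psi, \mathcal{M})$.
Here $a$ is the number of objects (i.e. rows of $\mathcal{M}$) satisfying both $\varphi$ and $\psi$, $b$ is the number of objects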
satisfying $\varphi$ and not satisfying $\psi$, etc. $4ft(\varphi, \psi, \mathcal{M})$ is also written as $\langle a, b, c, d \rangle$ and called the 4ft-table.
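The computation of the 4ft-table can be sketched as follows. The row representation and the function name `four_fold_table` are assumptions made for this illustration; Boolean attributes are modelled as predicates on rows.

```python
from typing import Callable, Dict, Iterable, Tuple

Row = Dict[str, int]


def four_fold_table(phi: Callable[[Row], bool],
                    psi: Callable[[Row], bool],
                    rows: Iterable[Row]) -> Tuple[int, int, int, int]:
    """Return <a, b, c, d> = 4ft(phi, psi, M) for the given rows of M."""
    a = b = c = d = 0
    for row in rows:
        p, s = phi(row), psi(row)
        if p and s:
            a += 1      # phi and psi
        elif p:
            b += 1      # phi and not psi
        elif s:
            c += 1      # not phi and psi
        else:
            d += 1      # neither phi nor psi
    return a, b, c, d


# Hypothetical data with two attributes A and B.
rows = [{"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 2},
        {"A": 2, "B": 1}, {"A": 2, "B": 2}]
print(four_fold_table(lambda r: r["A"] == 1, lambda r: r["B"] == 1, rows))
# (2, 1, 1, 1)
```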
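The evaluation function $Val^T_{AR}$ assigns a value 0 or 1 to each couple $\langle \varphi \approx \psi, \mathcal{M} \rangle$ where $\varphi \approx \psi$ is an association rule and $\mathcal{M} \in \mathcal{M}^T$. If $Val^T_{AR}(\varphi \approx \psi, \mathcal{M}) = 1$ then we say that the rule $\varphi \approx \psi$ is true in $\mathcal{M}$, and if $Val^T_{AR}(\varphi \approx \psi, \mathcal{M}) = 0$ then we say that the rule $\varphi \approx \psi$ is false in $\mathcal{M}$. $Val^T_{AR}(\varphi \approx \psi, \mathcal{M})$ is defined using the 4ft-table $4ft(\varphi, \psi, \mathcal{M})$ of $\varphi$ and $\psi$ in $\mathcal{M}$ and the associated function $F_\approx$ of $\approx$.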
The associated function $F_\approx$ of the 4ft-quantifier $\approx$ is a $\{0,1\}$-valued function defined for all quadruples $\langle a, b, c, d \rangle$ of natural numbers. The value of the association rule $\varphi \approx \psi$ in a data matrix $\mathcal{M} \in \mathcal{M}^T$ is defined such that $Val^T_{AR}(\varphi \approx \psi, \mathcal{M}) = F_\approx(a, b, c, d)$ where $\langle a, b, c, d \rangle = 4ft(\varphi, \psi, \mathcal{M})$. Examples of 4ft-quantifiers and associated functions $F_\approx(a, b, c, d)$ are in table 1.
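Table 1: Examples of 4ft-quantifiers.
$\approx$                        $F_\approx(a,b,c,d) = 1$ iff
$\Rightarrow_{p,B}$              $\frac{a}{a+b} \geq p \wedge a \geq B$
$\Rightarrow^{!}_{p,\alpha,B}$   $\sum_{i=a}^{r} \binom{r}{i} p^{i} (1-p)^{r-i} \leq \alpha \wedge a \geq B$
$\equiv_{p,B}$                   $\frac{a+d}{a+b+c+d} \geq p \wedge a \geq B$
$\approx_{\alpha,B}$             $\sum_{i=a}^{\min(r,k)} \frac{\binom{k}{i}\binom{n-k}{r-i}}{\binom{n}{r}} \leq \alpha \wedge a \geq B$
$\sim^{2}_{\alpha,B}$            $\frac{(ad-bc)^{2}}{rkls}\, n \geq \chi^{2}_{\alpha} \wedge a \geq B$
$\sim^{+}_{q,B}$                 $\frac{a}{a+b} \geq (1+q)\frac{a+c}{a+b+c+d} \wedge a \geq B$
Here $r = a+b$, $k = a+c$ and $n = a+b+c+d$; $l$ and $s$ are the two remaining marginal sums $b+d$ and $c+d$.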
The 4ft-quantifiers $\Rightarrow_{p,B}$ of founded implication, $\Rightarrow^{!}_{p,\alpha,B}$ of lower critical implication, Fisher's quantifier $\approx_{\alpha,B}$ and the $\chi^2$-quantifier $\sim^{2}_{\alpha,B}$ are defined in (Hájek and Havránek, 1978), the quantifier $\equiv_{p,B}$ of founded equivalence is defined in (Hájek et al., 1983) and the 4ft-quantifier $\sim^{+}_{q,B}$ of above average dependence is defined in (Rauch, 2005).
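Some of the associated functions from table 1 can be sketched directly in Python. The function names and the parameter name `base` (standing for the parameter $B$) are chosen for this illustration only. The guards against an empty row or an empty table are an added assumption, since table 1 does not say how a zero denominator is treated.

```python
from math import comb


def founded_implication(a, b, c, d, p, base):
    """a/(a+b) >= p and a >= B."""
    return a + b > 0 and a / (a + b) >= p and a >= base


def lower_critical_implication(a, b, c, d, p, alpha, base):
    """Binomial tail sum_{i=a}^{r} C(r,i) p^i (1-p)^(r-i) <= alpha and a >= B."""
    r = a + b
    tail = sum(comb(r, i) * p**i * (1 - p)**(r - i) for i in range(a, r + 1))
    return tail <= alpha and a >= base


def founded_equivalence(a, b, c, d, p, base):
    """(a+d)/(a+b+c+d) >= p and a >= B."""
    n = a + b + c + d
    return n > 0 and (a + d) / n >= p and a >= base


def above_average(a, b, c, d, q, base):
    """a/(a+b) >= (1+q)(a+c)/(a+b+c+d) and a >= B."""
    n = a + b + c + d
    return a + b > 0 and a / (a + b) >= (1 + q) * (a + c) / n and a >= base


print(founded_implication(95, 5, 10, 90, p=0.9, base=20))                # True
print(lower_critical_implication(5, 5, 0, 0, p=0.5, alpha=0.05, base=5))  # False
```

Each function returns a Boolean, so the value of a rule is obtained by applying the chosen function to the quadruple $\langle a, b, c, d \rangle$.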
3.3 Deduction Rules in Logical Calculus of Association Rules
The language $L^T_{AR}$, the set of data matrices $\mathcal{M}^T$ and the evaluation function $Val^T_{AR}$ constitute the logical calculus of association rules (Rauch, 2005). There are various theoretically interesting and practically useful results related to the logical calculus of association rules. Most of them are related to classes of 4ft-quantifiers (Rauch, 2008).
An example of a class of 4ft-quantifiers is the class of implicational 4ft-quantifiers. A 4ft-quantifier $\approx$ is implicational if $F_\approx(a, b, c, d) = 1 \wedge a' \geq a \wedge b' \leq b$ implies $F_\approx(a', b', c, d) = 1$. Both 4ft-quantifiers $\Rightarrow_{p,B}$ and $\Rightarrow^{!}_{p,\alpha,B}$ (see table 1) are implicational.
Important results concerning soundness of deduction rules of the form $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ were achieved (Rauch, 2008). Here both $\varphi \approx \psi$ and $\varphi' \approx \psi'$ are association rules. We outline these results for the class of interesting implicational quantifiers: if $\approx$ is an interesting implicational quantifier then there are formulas $\omega_{1A}$, $\omega_{1B}$, $\omega_2$ of propositional calculus created from $\varphi$, $\psi$, $\varphi'$, $\psi'$ so that the deduction rule $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ is sound if and only if at least one of the following conditions (1), (2) is satisfied: (1) both $\omega_{1A}$ and $\omega_{1B}$ are tautologies, (2) $\omega_2$ is a tautology.
All practically important implicational 4ft-
quantifiers are interesting implicational quantifiers.
Similar theorems are proved for additional classes of
4ft-quantifiers (Rauch, 2005; Rauch, 2008).
4 4FT-DISCOVERER
4ft-Discoverer $4ftD^T$ is a system $4ftD^T = \langle S^T_{AR}, U^T_{AR}, \text{4ft-Miner}, \text{4ft-Filter}, \text{4ft-Synt} \rangle$ where $S^T_{AR}$ and $U^T_{AR}$ are two semantic systems intended to express results of observation, properties of particular data matrices and various items of domain knowledge. Here $T = \langle t_1, \ldots, t_K \rangle$ is a type of data matrix, see section 3.2.1. We say that the 4ft-Discoverer $4ftD^T$ is of type $T$.
$S^T_{AR}$ and $U^T_{AR}$ are briefly described in section 4.1. They are related to each other by the function $Cons^T_{AR}$, which assigns to each item of domain knowledge a set of its atomic consequences, see section 4.2.
4ft-Miner is a GUHA procedure, i.e. a data mining procedure, which mines for association rules, that is, couples of Boolean attributes created from columns of data matrices $\mathcal{M} \in \mathcal{M}^T$ (Rauch and Šimůnek, 2005). It has very fine tools to define the set of association rules to be generated and verified. It is introduced in section 4.3. Procedures 4ft-Filter and 4ft-Synt are intended to interpret results of 4ft-Miner using domain knowledge expressed by the semantic system $U^T_{AR}$. Both procedures are introduced in section 4.4.
4.1 Semantic Systems $S^T_{AR}$ and $U^T_{AR}$
The semantic system $S^T_{AR}$ of type $T = \langle t_1, \ldots, t_K \rangle$ is defined as $S^T_{AR} = \langle \mathcal{M}^T, Sent^T_{AR}, Val^T_{AR}, Sent^T_M, Val^T_M \rangle$, where:
$\mathcal{M}^T$ is the set of all data matrices $\mathcal{M}$ of type $T$, see section 3.2.2.
$Sent^T_{AR}$ is the set of all association rules $\varphi \approx \psi$ of type $T$, i.e. the set of all closed formulas of the language $L^T_{AR}$, see section 3.2.1.
$Val^T_{AR}$ is the evaluation function defined in section 3.2.3.
$Sent^T_M$ is a set of (closed) formulas of the language $L^T_M$, which is intended to express features of particular data matrices. Informal examples of such formulas are: data matrix $\mathcal{M}_1 \in \mathcal{M}^T$ concerns only pathological patients, and data matrix $\mathcal{M}_2 \in \mathcal{M}^T$ concerns patients from a given town (we assume that data matrices from $\mathcal{M}^T$ concern patients). This language is not defined in detail in (Rauch, 2010); more details are in section 5.
$Val^T_M$ is an evaluation function for features of data matrices, $Val^T_M : (Sent^T_M \times \mathcal{M}^T) \to \{0,1\}$. If $\theta \in Sent^T_M$ and $\mathcal{M} \in \mathcal{M}^T$ then $Val^T_M(\theta, \mathcal{M})$ is the value of feature $\theta$ for $\mathcal{M}$. If $Val^T_M(\theta, \mathcal{M}) = 1$ then $\mathcal{M}$ has feature $\theta$, otherwise $\mathcal{M}$ does not have feature $\theta$.
Please note that we use the notion of semantic system here in a broader sense than defined in (Hájek and Havránek, 1978); the same is true for the system $U^T_{AR}$ introduced below. We call the system $S^T_{AR}$ observational to express that $S^T_{AR}$ concerns results of observation.
The semantic system $U^T_{AR}$ of type $T = \langle t_1, \ldots, t_K \rangle$ is defined as $U^T_{AR} = \langle U, Sent^T_U, Cons^T_{AR} \rangle$ where
$U = \bigcup \{Dom(\mathcal{M}) \mid \mathcal{M} \in \mathcal{M}^T\}$ is the union of domains of all data matrices $\mathcal{M} \in \mathcal{M}^T$, see also section 3.2.2.
$Sent^T_U$ is a set of (closed) formulas of the language $L^T_U$, which is intended to express various items of knowledge related to the set $U$ or items of general knowledge. Thus each $I \in Sent^T_U$ is an item of knowledge. An example of an item of knowledge related to the set $U$ is information on a specific vaccination applied to all patients in a given region. We assume that each data matrix $\mathcal{M} \in \mathcal{M}^T$ concerns only patients from this region. An example of an item of general knowledge is the commonly accepted fact that if weight increases then blood pressure increases too. Examples of formulas from $Sent^T_U$ are given below.
$Cons^T_{AR}$ is a function assigning to each $I \in Sent^T_U$ a set of association rules which can be understood as consequences of the item $I$. This function is intended to connect the observational semantic system $S^T_{AR}$ and the theoretical semantic system $U^T_{AR}$ by adding semantics to items of domain knowledge. More information is in section 4.2.
The system $U^T_{AR}$ is called theoretical because it talks about the whole set of objects we are interested in.
The language $L^T_U$ is intended to express items of knowledge related to the set $U$ or items of general knowledge. Some examples of general knowledge follow. Here $A$ is one of the attributes $A_1, \ldots, A_K$ of the language $L^T_{AR}$; the same is true for $B$. In addition, $\omega$, $\omega_1$, $\omega_2$ are Boolean attributes of $L^T_{AR}$ and $\omega$ does not contain the attribute $A$.
$A \uparrow\uparrow B$ means that if $A$ increases then $B$ increases.
$A \uparrow\downarrow B$ means that if $A$ increases then $B$ decreases.
$A \uparrow^{+} \omega$ means that if $A$ increases then the relative frequency of $\omega$ increases.
$A \uparrow^{-} \omega$ means that if $A$ increases then the relative frequency of $\omega$ decreases.
$\omega_1 \rightarrow^{+} \omega_2$ means that if $\omega_1$ is satisfied then the relative frequency of $\omega_2$ increases.
$\omega_1 \rightarrow^{-} \omega_2$ means that if $\omega_1$ is satisfied then the relative frequency of $\omega_2$ decreases.
We can imagine an additional parameter that makes it possible to express that a formula is the opinion of expert X or an assertion from paper Y.
4.2 $Cons^T_{AR}$ – Atomic Consequences
The function $Cons^T_{AR}$ is used instead of the statistical approaches used in (Hájek and Havránek, 1978) to connect the observational semantic system $S_O$ and the theoretical semantic system $S_T$. It is assumed that the function $Cons^T_{AR}$ is defined with the help of a domain expert. It adds semantics to items of domain knowledge expressed by formulas from $Sent^T_U$.
We show how the function $Cons^T_{AR}$ creates a set $Cons^T_{AR}(A \uparrow\uparrow B, \mathcal{M})$ of association rules (formulas of the language $L^T_{AR}$) which can be considered as the set of all atomic consequences of the item $A \uparrow\uparrow B$ of knowledge in the data matrix $\mathcal{M}$. The function $Cons^T_{AR}$ can be seen as a family of functions $Cons^T_{\approx}$ where $\approx$ is a 4ft-quantifier of the language $L^T_{AR}$. The function $Cons^T_{\approx}$ creates a set $Cons^T_{\approx}(A \uparrow\uparrow B, \mathcal{M})$ of association rules (formulas of the language $L^T_{AR}$) such that this set can be considered as the set of all atomic consequences of $A \uparrow\uparrow B$ of the form $\rho \approx \sigma$ in the data matrix $\mathcal{M}$. Then $Cons^T_{AR}(A \uparrow\uparrow B, \mathcal{M})$ is defined as the union
$$\bigcup \{Cons^T_{\approx}(A \uparrow\uparrow B, \mathcal{M}) \mid \approx \text{ belongs to } L^T_{AR}\}.$$
We outline how the function $Cons^T_{\Rightarrow_{p,B}}$ works for the 4ft-quantifier $\Rightarrow_{p,B}$ of founded implication (see table 1)
and the item $A \uparrow\uparrow B$ of domain knowledge. The functions $Cons^T_{\approx}$ for additional 4ft-quantifiers and formulas of $L^T_M$ are defined using similar principles, see also (Rauch, 2009).
We assume that the attribute $A$ has categories $1, \ldots, u$ and the attribute $B$ has categories $1, \ldots, v$. Our task is to define a set of rules $\rho \Rightarrow_{p,B} \sigma$ which can be naturally considered as the set of all consequences of the item $A \uparrow\uparrow B$ and which are as simple as possible. We consider the simplest rules, of the form $A(\alpha) \Rightarrow_{p,B} B(\beta)$ where $\alpha \subseteq \{1, \ldots, u\}$ and $\beta \subseteq \{1, \ldots, v\}$.
The rule $A(low) \Rightarrow_{p,B} B(low)$, saying that if $A$ is low then $B$ is low, can be understood as a natural consequence of $A \uparrow\uparrow B$. The only problem is to define which coefficients $\alpha$ and $\beta$ can be understood as low. We choose natural numbers $A_{low}$, $1 < A_{low} < u$, and $B_{low}$, $1 < B_{low} < v$, and then we consider $\alpha$ as low if and only if $\alpha \subseteq \{1, \ldots, A_{low}\}$ and $\beta$ as low if and only if $\beta \subseteq \{1, \ldots, B_{low}\}$, see also section 4.3.
The rule $A(high) \Rightarrow_{p,B} B(high)$, saying that if $A$ is high then $B$ is high, can also be understood as a natural consequence of $A \uparrow\uparrow B$. We choose natural numbers $A_{high}$, $1 < A_{low} < A_{high} < u$, and $B_{high}$, $1 < B_{low} < B_{high} < v$, and then we consider $\alpha$ as high if and only if $\alpha \subseteq \{A_{high}, \ldots, u\}$ and $\beta$ as high if and only if $\beta \subseteq \{B_{high}, \ldots, v\}$.
It remains to define the values of the parameters $p$ and $B$ of $\Rightarrow_{p,B}$. We can choose each $p \geq 0.9$ and $B \geq \frac{n}{20}$, where $n$ is the number of rows of the data matrix $\mathcal{M}$. However, the boundaries of $p$ and $B$ as well as the values $A_{low}$, $A_{high}$, $B_{low}$, $B_{high}$ should be determined by a domain expert.
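The set of all rules $A(low) \Rightarrow_{p,B} B(low)$ and $A(high) \Rightarrow_{p,B} B(high)$ satisfying the conditions given above can be considered as $Cons^T_{\Rightarrow_{p,B}}(A \uparrow\uparrow B, \mathcal{M})$, the set of atomic consequences of $A \uparrow\uparrow B$ of the form $\rho \Rightarrow_{p,B} \sigma$ in $\mathcal{M}$.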
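The construction of the low and high coefficient pairs can be sketched as follows. This is only an illustration under stated assumptions: a rule $A(\alpha) \Rightarrow_{p,B} B(\beta)$ is represented by the pair $(\alpha, \beta)$, only non-empty coefficients are generated, and the admissible ranges of $p$ and $B$ are left out. The function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(values):
    """All non-empty subsets of the given categories, as frozensets."""
    values = list(values)
    return [frozenset(combo)
            for size in range(1, len(values) + 1)
            for combo in combinations(values, size)]


def atomic_consequences(u, v, a_low, a_high, b_low, b_high):
    """Pairs (alpha, beta) for rules A(low) => B(low) and A(high) => B(high)."""
    if not (1 < a_low < a_high < u and 1 < b_low < b_high < v):
        raise ValueError("expected 1 < low < high < number of categories")
    rules = set()
    for alpha in nonempty_subsets(range(1, a_low + 1)):       # alpha is low
        for beta in nonempty_subsets(range(1, b_low + 1)):    # beta is low
            rules.add((alpha, beta))
    for alpha in nonempty_subsets(range(a_high, u + 1)):      # alpha is high
        for beta in nonempty_subsets(range(b_high, v + 1)):   # beta is high
            rules.add((alpha, beta))
    return rules


rules = atomic_consequences(u=5, v=5, a_low=2, a_high=4, b_low=2, b_high=4)
print(len(rules))  # 18
```

With five categories per attribute and the thresholds above, there are three low and three high coefficients per attribute, which gives nine low and nine high rules.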
The set $Cons^T_{\Rightarrow_{p,B}}(A \uparrow\uparrow B, \mathcal{M})$ can be defined in a finer way by adding rules $A(medium) \Rightarrow_{p,B} B(medium)$ with a suitable definition of "medium". Rules $A(low, medium) \Rightarrow_{p,B} B(medium)$, etc. can also be added.
There is a natural requirement of consistency of the set $Cons^T_{AR}(A \uparrow\uparrow B, \mathcal{M})$ of atomic consequences; a detailed discussion is, however, beyond the scope of this paper.
4.3 GUHA Procedure 4ft-Miner
The 4ft-Miner procedure mines for association rules of the form $\varphi \approx \psi$ where $\varphi \in \Phi$, $\psi \in \Psi$, and $\varphi$ and $\psi$ have no common attributes. Input parameters define the analyzed data matrix $\mathcal{M}$, the 4ft-quantifier $\approx$, the set of relevant antecedents $\Phi$ and the set of relevant succedents $\Psi$.
Each antecedent is a conjunction $\tau_1 \wedge \cdots \wedge \tau_m$ of partial antecedents $\tau_1, \ldots, \tau_m$. Each partial antecedent is either a conjunction $\lambda_1 \wedge \cdots \wedge \lambda_q$ or a disjunction $\lambda_1 \vee \cdots \vee \lambda_q$ of literals $\lambda_1, \ldots, \lambda_q$. Each literal is a basic Boolean attribute $A(\alpha)$ or its negation $\neg A(\alpha)$. The definition of the set of relevant antecedents $\Phi$ consists of definitions of the sets of relevant partial antecedents $\Phi_1, \ldots, \Phi_m$; $\tau_1 \wedge \cdots \wedge \tau_m$ is a relevant antecedent if $\tau_1 \in \Phi_1, \ldots, \tau_m \in \Phi_m$.
The definition of a relevant partial antecedent is given by a list $A'_1, \ldots, A'_u$ of attributes, by a minimal and maximal number of literals in particular partial antecedents and by a type of partial antecedent, i.e. conjunctions or disjunctions. In addition, for each attribute $A'$ a set of relevant basic Boolean attributes, which are automatically generated, is defined. There are various detailed possibilities for defining all relevant basic Boolean attributes $A'(\alpha)$ (Rauch and Šimůnek, 2005). We outline only one of them. We use an attribute $A$ with categories 1, 2, 3, 4, 5. The option intervals of length 2-3 gives the basic Boolean attributes $A(1,2)$, $A(2,3)$, $A(3,4)$, $A(4,5)$, $A(1,2,3)$, $A(2,3,4)$, $A(3,4,5)$. This way we can get the basic Boolean attributes $A(low)$, $A(high)$, $B(low)$, $B(high)$, see section 4.2.
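The option "intervals of length 2-3" can be sketched as a sliding window over the ordered categories. The function name `interval_coefficients` is an assumption made for this illustration; it is not the name of any 4ft-Miner routine.

```python
def interval_coefficients(categories, min_len, max_len):
    """Coefficients formed by runs of consecutive categories of the given lengths."""
    cats = list(categories)
    result = []
    for length in range(min_len, max_len + 1):
        # slide a window of the given length over the ordered categories
        for start in range(len(cats) - length + 1):
            result.append(tuple(cats[start:start + length]))
    return result


print(interval_coefficients(range(1, 6), 2, 3))
# [(1, 2), (2, 3), (3, 4), (4, 5), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
```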
The set $\Psi$ of relevant succedents is defined analogously. The output of 4ft-Miner is the set of association rules $\varphi \approx \psi$ which are true in $\mathcal{M}$ and for which both $\varphi \in \Phi$ and $\psi \in \Psi$. The 4ft-Miner procedure does not use the apriori algorithm; its implementation is based on a representation of the analyzed data by suitable strings of bits (Rauch and Šimůnek, 2005).
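The input and output of the procedure can be illustrated by a naive enumeration. The sketch below is a brute-force illustration only and does not reflect the bit-string implementation of 4ft-Miner; the candidate representation (label, set of used attributes, predicate on a row) and all names are assumptions.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Dict[str, int]
# A candidate Boolean attribute: (label, attributes it uses, predicate on a row).
Candidate = Tuple[str, FrozenSet[str], Callable[[Row], bool]]


def mine(rows: List[Row], antecedents: List[Candidate],
         succedents: List[Candidate], quantifier) -> list:
    """Return all rules phi ~ psi that are true in the data, with their 4ft-tables."""
    found = []
    for a_label, a_attrs, phi in antecedents:
        for s_label, s_attrs, psi in succedents:
            if a_attrs & s_attrs:
                continue  # phi and psi must have no common attributes
            table = [0, 0, 0, 0]  # a, b, c, d
            for row in rows:
                table[(0 if phi(row) else 2) + (0 if psi(row) else 1)] += 1
            a, b, c, d = table
            if quantifier(a, b, c, d):
                found.append((a_label, s_label, (a, b, c, d)))
    return found


def founded_implication(p, base):
    """Associated function of founded implication with fixed p and B."""
    return lambda a, b, c, d: a + b > 0 and a / (a + b) >= p and a >= base


# Hypothetical data and candidate sets.
rows = [{"A": 1, "B": 1}] * 4 + [{"A": 1, "B": 2}] + [{"A": 2, "B": 2}] * 5
antecedents = [("A(1)", frozenset({"A"}), lambda r: r["A"] == 1),
               ("B(2)", frozenset({"B"}), lambda r: r["B"] == 2)]
succedents = [("B(1)", frozenset({"B"}), lambda r: r["B"] == 1)]
print(mine(rows, antecedents, succedents, founded_implication(0.75, 3)))
# [('A(1)', 'B(1)', (4, 1, 0, 5))]
```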
Let us note that the 4ft-Miner procedure also mines for conditional association rules $\varphi \approx \psi / \chi$ where $\varphi$, $\psi$ and $\chi$ are Boolean attributes. The association rule $\varphi \approx \psi / \chi$ is true in a data matrix $\mathcal{M}$ if and only if the rule $\varphi \approx \psi$ is true in the data matrix $\mathcal{M}/\chi$, where $\mathcal{M}/\chi$ is the data matrix consisting of all rows of $\mathcal{M}$ satisfying $\chi$.
The input of 4ft-Miner can contain also a defini-
tion of set Ξ of relevant conditions in addition to defi-
nitions of set of relevant antecedents Φ and set of rel-
evant succedents Ψ. The set Ξ is defined analogously
to sets Φ and Ψ.
The output of 4ft-Miner is then the set of conditional association rules $\varphi \approx \psi / \chi$ which are true in $\mathcal{M}$ and for which $\varphi \in \Phi$, $\psi \in \Psi$ and $\chi \in \Xi$.
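The evaluation of a conditional association rule can be sketched by first restricting the rows to those satisfying the condition and then evaluating the rule on the restricted data. The function name and the sample data are assumptions made for this illustration.

```python
def conditional_rule_is_true(phi, psi, chi, rows, quantifier):
    """Evaluate phi ~ psi / chi: evaluate phi ~ psi on the rows satisfying chi."""
    restricted = [row for row in rows if chi(row)]  # rows of M/chi
    a = sum(1 for r in restricted if phi(r) and psi(r))
    b = sum(1 for r in restricted if phi(r) and not psi(r))
    c = sum(1 for r in restricted if not phi(r) and psi(r))
    d = sum(1 for r in restricted if not phi(r) and not psi(r))
    return bool(quantifier(a, b, c, d))


# Founded implication with p = 0.9 and B = 2 (values chosen for illustration).
def quantifier(a, b, c, d):
    return a + b > 0 and a / (a + b) >= 0.9 and a >= 2


rows = [{"A": 1, "B": 1, "S": 1}] * 3 + [{"A": 1, "B": 2, "S": 2}] * 3
phi = lambda r: r["A"] == 1
psi = lambda r: r["B"] == 1
print(conditional_rule_is_true(phi, psi, lambda r: True, rows, quantifier))         # False
print(conditional_rule_is_true(phi, psi, lambda r: r["S"] == 1, rows, quantifier))  # True
```

In this made-up data the rule fails on the whole matrix but holds on the subset defined by the condition, which is the kind of pattern conditional rules are meant to capture.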
4.4 Procedures 4ft-Filter and 4ft-Synt
The 4ft-Filter procedure filters out consequences of
given item of domain knowledge from the output of
4ft-Miner. Item of domain knowledge is expressed by
a formula from $Sent^T_U$. The 4ft-Synt procedure recognizes groups of patterns which can be considered as consequences of a (yet unknown) item of knowledge.
The function $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$, defined for all formulas $I \in Sent^T_U$, association rules $\varphi \approx \psi \in Sent^T_{AR}$ and data matrices $\mathcal{M} \in \mathcal{M}^T$, can be used to realize the 4ft-Filter procedure. It is defined such that $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 1$ if the rule $\varphi \approx \psi$ can be considered as a consequence of $I$, otherwise $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 0$.
The value $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ is computed using the function $Cons^T_{AR}$, see section 4.2, and using deduction rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$, see section 3.3. There are criteria of correctness of rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ for each 4ft-quantifier $\approx$ of the 4ft-Miner procedure (Rauch, 2005; Rauch, 2008; Rauch and Šimůnek, 2005). The function $Cons^T_{AR}$ is defined for all $I \in Sent^T_U$ and $\mathcal{M} \in \mathcal{M}^T$ such that $Cons^T_{AR}(I, \mathcal{M}) = \Lambda$, where $\Lambda$ is the set of all association rules $\rho \approx \sigma$ which can be considered as atomic consequences of $I$ in $\mathcal{M}$.
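The value $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ is computed in two steps. In the first step we compute the set $\Lambda = Cons^T_{AR}(I, \mathcal{M})$. In the second step we test the correctness of $\frac{\rho \approx \sigma}{\varphi \approx \psi}$ for each $\rho \approx \sigma \in \Lambda$. If there is such a correct rule, then $\varphi \approx \psi$ is considered a consequence of $I$ in $\mathcal{M}$ and $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 1$. Otherwise $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M}) = 0$.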
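The two-step computation and its use for filtering can be sketched as follows. The function $Cons^T_{AR}$ and the correctness test for deduction rules are passed in as parameters, because their actual definitions depend on the domain expert and on the criteria from the cited papers. The toy stand-ins `toy_cons` and `identity_only` below are placeholders invented for this example; in particular, `identity_only` recognizes only the trivial deduction of a rule from itself and is not one of the published criteria.

```python
def is_4ft_consequence(item, rule, matrix, cons, is_correct_deduction) -> int:
    """Return 1 if rule follows from some atomic consequence of item, else 0."""
    atomic = cons(item, matrix)                  # step 1: Lambda = Cons(I, M)
    for premise in atomic:                       # step 2: test premise / rule
        if is_correct_deduction(premise, rule):
            return 1
    return 0


def four_ft_filter(item, mined_rules, matrix, cons, is_correct_deduction):
    """Keep only the mined rules that are not consequences of the given item."""
    return [rule for rule in mined_rules
            if not is_4ft_consequence(item, rule, matrix, cons, is_correct_deduction)]


# Placeholder Cons: fixed atomic consequences for one hypothetical item.
def toy_cons(item, matrix):
    if item == "A upup B":
        return {("A(low)", "B(low)"), ("A(high)", "B(high)")}
    return set()


# Placeholder correctness test: only the trivial deduction rho ~ sigma / rho ~ sigma.
def identity_only(premise, rule):
    return premise == rule


mined = [("A(low)", "B(low)"), ("A(low)", "C(high)")]
print(four_ft_filter("A upup B", mined, None, toy_cons, identity_only))
# [('A(low)', 'C(high)')]
```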
The function $Is4ftConsequence(I, \varphi \approx \psi, \mathcal{M})$ can also be used to realize the procedure 4ft-Synt, which recognizes groups of rules $\varphi \approx \psi$ that can be considered as consequences of (yet unknown) items of knowledge. We assume that each item of knowledge, even a yet unknown one, is represented by a formula of $Sent^T_U$. The procedure 4ft-Synt can then be realized such that we choose a formula $\omega \in Sent^T_U$ and, using the function $Is4ftConsequence(\omega, \varphi \approx \psi, \mathcal{M})$, we pick all consequences of $\omega$ from the output of the 4ft-Miner procedure. However, we have to somehow limit the set of tested formulas $\omega \in Sent^T_U$. A more detailed study of this problem is beyond the scope of this paper.
5 4FT-DISCOVERER AND SEMANTIC WEB
One of the 10 challenging problems in data mining research (see http://www.cs.uvm.edu/icdm/) is characterized as mining complex knowledge from complex data. It is emphasized that all the current data mining systems can do is hand the results back to the user. However, it is necessary to relate the results to the real-world decisions they affect. One way to do this is to arrange results of data mining into an analytical report structured both according to the analyzed problem and according to the user's needs. The core of such a report is a set of assertions about the analyzed data together with explanatory comments. Such an analytical report can be considered as a formal structure. An idea of indexing such reports by logical formulas corresponding to patterns resulting from data mining is outlined in (Rauch, 1997). It means that such analytical reports are natural candidates for the Semantic Web.
The SEWEBAR project, which concerns these ideas, is described in (Rauch and Šimůnek, 2007). It is assumed that there are various institutions (e.g. hospitals) storing data in their databases. Local analytical reports giving answers to various local analytical questions are produced automatically or semi-automatically. It is further assumed that these reports are presented on the Internet. It is natural to try to get answers to various global analytical questions using these local analytical reports. It is again assumed that answers to global analytical questions will be presented on the Internet in the form of analytical reports. We call such reports global analytical reports. Various aspects of the SEWEBAR project are discussed in (Rauch, 2007; Rauch and Šimůnek, 2009; Kliegr et al., 2009), including the formulation of analytical questions using various items of domain knowledge. Some experiments are presented at http://sewebar.vse.cz/.
The SEWEBAR project is based on dealing with analytical reports which are considered as formal structures. No unified formal framework has been given for the project so far. The goal of this section is to discuss the possibilities of enhancing the 4ft-Discoverer to serve as a formal framework for the SEWEBAR project. We identify the main related problems and sketch possible ways of solving them.
An overview of the currently known main problems related to the enhancement of 4ft-Discoverer is given in section 5.1. Possible solutions of particular problems are discussed in sections 5.2 – 5.4.
5.1 Enhancing 4ft-Discoverer
The core of 4ft-Discoverer is a formal framework for dealing with domain knowledge and association rules (i.e. interesting couples of Boolean attributes related in a given way in a given data matrix). We have formulas expressing items of domain knowledge, the procedures 4ft-Miner, 4ft-Filter and 4ft-Synt, and the function $Is4ftConsequence$, see section 4.4.
With these tools we are able to achieve interesting results in solving various local analytical questions related to a given data matrix. Our task is to arrange
these results into a local analytical report such that it will be possible to deal with the report as a formal object. Some remarks on this problem are in section 5.2.
The current version of 4ft-Discoverer is tailored to the analysis of one particular data matrix using domain knowledge expressed by formulas from $Sent^{T}_{AR}$, see section 4.1. Only very little attention is given to knowledge related to particular data matrices. This knowledge is assumed to be formalized by $Sent^{T}_{M}$, i.e. the set of (closed) formulas of language $L^{T}_{M}$, which is a language intended to express characteristics of particular data matrices, see section 4.1. This requires more attention even when mining in one particular data matrix. Some remarks on knowledge related to particular data matrices are in section 5.3.
Our goal is to get answers to various global analytical questions using local analytical reports presented on the Internet. The answers to global analytical questions will be presented as global analytical reports. It is further assumed that such global analytical reports will be used as input for answering additional global analytical questions. This approach brings many problems. Initial comments on them are in section 5.4.
The application of classical Semantic web technologies in the SEWEBAR project is very important. This paper does not address this topic. Let us however emphasize that there are various activities in this direction, see e.g. (Kliegr et al., 2009). The current state is presented at http://sewebar.vse.cz/.
Let us also note that 4ft-Discoverer is tailored to association rules mined by the 4ft-Miner GUHA procedure. There are six additional GUHA procedures mining for various types of patterns (Hájek et al., 2010). A similar formal framework can be developed for these procedures.
5.2 Local Analytical Reports
An example of a local analytical question is the question: Are there any association rules which can be considered as exceptions from the generally accepted fact $A \uparrow\uparrow B$ in a given data matrix $M$? We assume that the exception concerns a subset of rows defined by attributes $C_1, \ldots, C_L$, which are columns of $M$. Informally speaking, this task can be solved in the following steps:
1. We identify exceptions with conditional association rules $\tau \approx \sigma / \chi$ satisfying
   • $\tau \approx \sigma \in Cons^{T}_{AR}(A \uparrow\downarrow B)$, i.e. $\tau \approx \sigma$ is an atomic consequence of $A \uparrow\downarrow B$, which is a contradiction to $A \uparrow\uparrow B$;
   • $\chi$ is a Boolean attribute derived from attributes $C_1, \ldots, C_L$.
2. We take into account that it is possible that $Cons^{T}_{AR}(A \uparrow\downarrow B)$ and $Cons^{T}_{AR}(A \uparrow\uparrow B)$ have common rules. For example, it can happen that
   $A(medium) \Rightarrow_{p,B} B(medium) \in Cons^{T}_{AR}(A \uparrow\downarrow B)$,
   $A(medium) \Rightarrow_{p,B} B(medium) \in Cons^{T}_{AR}(A \uparrow\uparrow B)$,
   see section 4.2.
3. We use 4ft-Miner with input parameters such that
   • set $\Phi$ of relevant antecedents is the set of all $\tau$ where $\tau \approx \sigma \in Cons^{T}_{AR}(A \uparrow\downarrow B)$. This can be done thanks to the possibility of using the option intervals for the set of all relevant basic Boolean attributes derived from attribute $A$, see section 4.3;
   • set $\Psi$ of relevant succedents is the set of all $\sigma$ where $\tau \approx \sigma \in Cons^{T}_{AR}(A \uparrow\downarrow B)$;
   • set $\Xi$ of relevant conditions is defined as a set of Boolean attributes derived from attributes $C_1, \ldots, C_L$ in a suitable way;
   • we use quantifier $\Rightarrow_{p,B}$ with $p = 0.9$ and $B \geq n/20$, where $n$ is the number of rows of data matrix $M$, see section 4.2.
4. Function $Is4ftConsequence(A \uparrow\uparrow B, \varphi \approx \psi, M)$ (see section 4.4) is used to filter out all rules $\tau \approx \sigma / \chi$ satisfying $\tau \approx \sigma \in Cons^{T}_{AR}(A \uparrow\uparrow B)$.
5. The remaining conditional association rules correspond to the searched exceptions.
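The steps above can be illustrated by a minimal Python sketch. It is not the 4ft-Miner procedure: the names holds, founded_implication and find_exceptions, the toy data, and the representation of $Cons^{T}_{AR}(A \uparrow\downarrow B)$ and $Cons^{T}_{AR}(A \uparrow\uparrow B)$ as plain sets of (antecedent, succedent) pairs are assumptions made only for illustration. The sketch also assumes the usual definition of founded implication ($a/(a+b) \geq p$ and $a \geq B$), takes the smallest allowed base $B = n/20$, and replaces the filter of step 4 by simple set membership.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Attr = Tuple[str, str]    # basic Boolean attribute, e.g. ("A", "low") stands for A(low)
Pair = Tuple[Attr, Attr]  # (antecedent tau, succedent sigma)


def holds(attr: Attr, row: Row) -> bool:
    """True if the row satisfies the basic Boolean attribute."""
    name, category = attr
    return row[name] == category


def founded_implication(rows: List[Row], tau: Attr, sigma: Attr,
                        p: float, base: float) -> bool:
    """Check a/(a+b) >= p and a >= base on the given rows."""
    a = sum(1 for r in rows if holds(tau, r) and holds(sigma, r))
    b = sum(1 for r in rows if holds(tau, r) and not holds(sigma, r))
    return a >= base and a + b > 0 and a / (a + b) >= p


def find_exceptions(rows: List[Row], cons_down: Set[Pair], cons_up: Set[Pair],
                    conditions: List[Attr], p: float = 0.9):
    """Return rules tau => sigma / chi that contradict A up-up B (steps 1 to 5)."""
    base = len(rows) / 20  # B = n/20, where n is the number of rows of the whole matrix
    exceptions = []
    for chi in conditions:
        restricted = [r for r in rows if holds(chi, r)]  # rows satisfying the condition
        # Steps 2 and 4 (simplified): drop pairs that also follow from A up-up B.
        for tau, sigma in sorted(cons_down - cons_up):
            if founded_implication(restricted, tau, sigma, p, base):
                exceptions.append((tau, sigma, chi))
    return exceptions


# Toy data matrix with 40 rows (purely illustrative).
rows = ([{"A": "low", "B": "high", "C": "x"}] * 20
        + [{"A": "low", "B": "low", "C": "y"}] * 20)
cons_down = {(("A", "low"), ("B", "high")), (("A", "medium"), ("B", "medium"))}
cons_up = {(("A", "low"), ("B", "low")), (("A", "medium"), ("B", "medium"))}
conditions = [("C", "x"), ("C", "y")]

exceptions = find_exceptions(rows, cons_down, cons_up, conditions)
```

In this toy matrix the only exception found is A(low) implying B(high) under the condition C(x); the pair A(medium), B(medium) is skipped because it belongs to both sets of atomic consequences.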
The steps informally described above can be formalized and automated. In addition, they can be described in such a way that the description can be understood as a local analytical report answering the given analytical question. This approach differs from that introduced in (Suzuki, 2004).
Such local analytical reports are formal structures and they can be indexed for automated search. Formulas like $\tau \approx \sigma / \chi$ and $A \uparrow\uparrow B$ can also be used for indexing and searching to deal with semantics. Many similar local analytical questions can be formalized and answered by local analytical reports in the way outlined above. Some of them are sketched in (Rauch and Šimůnek, 2009). Detailed elaboration of this topic is a subject of current research.
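As a rough illustration of indexing reports by the formulas they contain, the following Python sketch builds a simple inverted index. The class LocalReport, its fields, the function build_index and the report identifiers are hypothetical; the paper does not prescribe any particular data structure, and formulas are stored here simply as strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class LocalReport:
    """A local analytical report reduced to the parts needed for indexing."""
    report_id: str
    question: str
    formulas: FrozenSet[str]  # formulas occurring in the report, kept as strings


def build_index(reports: List[LocalReport]) -> Dict[str, Set[str]]:
    """Map each formula to the identifiers of the reports that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for report in reports:
        for formula in report.formulas:
            index[formula].add(report.report_id)
    return index


reports = [
    LocalReport("r1", "exceptions from A ↑↑ B",
                frozenset({"A ↑↑ B", "A(low) ≈ B(high) / C(x)"})),
    LocalReport("r2", "confirmations of A ↑↑ B",
                frozenset({"A ↑↑ B"})),
]
index = build_index(reports)
```

A search for all reports concerning the item of domain knowledge A ↑↑ B then reduces to a dictionary lookup, here index["A ↑↑ B"].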
5.3 Knowledge on Data Matrices
Properties of the analyzed data are crucial for the analysis and interpretation of results. Ideally, the data satisfies all requirements for the correct application of statistical approaches. However, in data mining this is only rarely the case. Our goal is to use properties of the analyzed data both in the formulation and
solution of suitable analytical questions, in a way similar to how the knowledge expressed by formulas of $Sent^{T}_{U}$ is used.
Language $L^{T}_{M}$ is intended to express characteristics of particular data matrices, and the set $Sent^{T}_{M}$ of closed formulas of this language is assumed to be used to deal with knowledge on particular data matrices in the same way as formulas of $Sent^{T}_{U}$ are used, see section 4.1. It means we have to:
• get formulas of $Sent^{T}_{M}$ expressing important properties of data matrices, in a way similar to how the formulas $A \uparrow\uparrow B$, $A \uparrow\downarrow B$, \ldots of $Sent^{T}_{U}$ express important items of domain knowledge, see section 4.1;
• define function $ConsM^{T}_{AR}$ adding semantics to formulas from $Sent^{T}_{M}$, similarly to the way function $Cons^{T}_{AR}$ gives semantics to formulas from $Sent^{T}_{U}$; $ConsM^{T}_{AR}(I, M)$ is a set of association rules (formulas of language $L^{T}_{AR}$) which can be considered as the set of all atomic consequences of item $I$ of knowledge on data matrix $M$;
• define function $Is4ftConsequenceM(I, \varphi \approx \psi, M)$ for all $I \in Sent^{T}_{M}$, association rules $\varphi \approx \psi \in Sent^{T}_{AR}$ and data matrices $M \in \mathcal{M}_{T}$ such that $Is4ftConsequenceM(I, \varphi \approx \psi, M) = 1$ if rule $\varphi \approx \psi$ can be considered as a consequence of $I$ in data matrix $M$ and $Is4ftConsequenceM(I, \varphi \approx \psi, M) = 0$ otherwise; $Is4ftConsequenceM(I, \varphi \approx \psi, M)$ is analogous to $Is4ftConsequence(I, \varphi \approx \psi, M)$, see section 4.4.
We give a very simple example of a formula from $Sent^{T}_{M}$. It is the formula $Fr_{0.9}(A_1(1))$, saying that at least 90 per cent of rows of a data matrix satisfy the basic Boolean attribute $A_1(1)$. It is $Val^{M}_{T}(Fr_{0.9}(A_1(1)), M) = 1$ if at least 90 per cent of rows of $M$ satisfy the basic Boolean attribute $A_1(1)$, otherwise it is $Val^{M}_{T}(Fr_{0.9}(A_1(1)), M) = 0$.
Function $ConsM^{T}_{AR}$ can be seen as a family of functions $ConsM^{T}_{\approx}$, where $\approx$ is a 4ft-quantifier of language $L^{T}_{AR}$; it is analogous to $Cons^{T}_{\approx}$. Then $ConsM^{T}_{AR}(Fr_{0.9}(A_1(1)), M)$ is defined as the union
$$\bigcup \{\, ConsM^{T}_{\approx}(Fr_{0.9}(A_1(1)), M) \mid \approx \text{ belongs to } L^{T}_{AR} \,\}.$$
We outline function $ConsM^{T}_{\Rightarrow_{p,B}}$ for the 4ft-quantifier $\Rightarrow_{p,B}$ of founded implication (see table 1). We can define $ConsM^{T}_{\Rightarrow_{0.9,B}}(Fr_{0.9}(A_1(1)), M)$ as the set of all rules $\varphi \Rightarrow_{p,B} A_1(1)$ where $0.85 \leq p \leq 0.95$ and $B \geq n/20$, where $n$ is the number of rows of data matrix $M$. However, the boundaries of $p$ and $B$ should be determined by a domain expert.
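The membership condition just outlined can be written as a simple predicate. The following Python sketch is only an illustration: the class FoundedRule, the function in_consm_founded and its parameter names are hypothetical, and the boundaries 0.85, 0.95 and n/20 are the example values from the text, which a domain expert would be expected to adjust.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FoundedRule:
    """A rule phi =>_{p,B} psi with basic Boolean attributes given as (name, category)."""
    antecedent: Tuple[str, int]
    succedent: Tuple[str, int]
    p: float
    base: float


def in_consm_founded(rule: FoundedRule, n_rows: int,
                     target: Tuple[str, int] = ("A1", 1),
                     p_low: float = 0.85, p_high: float = 0.95) -> bool:
    """True if the rule counts as an atomic consequence of Fr_0.9(A1(1))."""
    return (rule.succedent == target
            and p_low <= rule.p <= p_high
            and rule.base >= n_rows / 20)


# Example rules for a data matrix with 100 rows, so the base must be at least 5.
accepted = FoundedRule(("A2", 3), ("A1", 1), p=0.9, base=10)
too_weak = FoundedRule(("A2", 3), ("A1", 1), p=0.8, base=10)
too_small = FoundedRule(("A2", 3), ("A1", 1), p=0.9, base=4)
```

Keeping the boundaries as parameters reflects the remark that they should be set by a domain expert rather than fixed in the framework.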
$Is4ftConsequenceM(Fr_{0.9}(A_1(1)), \varphi \approx \psi, M)$ is computed in two steps, see also section 4.4. In the first step we compute the set $\Lambda = ConsM^{T}_{AR}(Fr_{0.9}(A_1(1)), M)$ of rules $\rho \approx \sigma$ which can be considered as atomic consequences of $Fr_{0.9}(A_1(1))$ in $M$. In the second step we test the correctness of the deduction rule $\frac{\rho \approx \sigma}{\varphi \approx \psi}$ for each $\rho \approx \sigma \in \Lambda$. If there is such a correct rule, then $\varphi \approx \psi$ is considered a consequence of $I = Fr_{0.9}(A_1(1))$ in $M$ and $Is4ftConsequenceM(I, \varphi \approx \psi, M) = 1$. Otherwise $Is4ftConsequenceM(I, \varphi \approx \psi, M) = 0$.
Detailed elaboration of the outlined approach is a subject of current research.
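The two-step computation can be sketched in Python as follows. This is a sketch under stated assumptions, not the procedure of section 4.4: the function names and the rule representation are hypothetical, and the correctness test for deduction rules is passed in as a parameter. The stand-in test widens_succedent covers only one sufficient case for implicational quantifiers such as founded implication, namely that from $\varphi \Rightarrow \psi$ one may deduce $\varphi \Rightarrow \psi \vee \chi$.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# Each side of a rule is (attribute name, set of categories), read as a
# disjunction of categories, e.g. ("A1", {"1", "2"}) stands for A1(1, 2).
Side = Tuple[str, FrozenSet[str]]
Rule = Tuple[Side, Side]  # (antecedent, succedent)


def widens_succedent(atomic: Rule, rule: Rule) -> bool:
    """Stand-in correctness test: same antecedent, succedent only gains categories."""
    (a_ante, (a_attr, a_cats)), (r_ante, (r_attr, r_cats)) = atomic, rule
    return a_ante == r_ante and a_attr == r_attr and a_cats <= r_cats


def is_4ft_consequence_m(atomic_consequences: Iterable[Rule], rule: Rule,
                         is_correct_deduction: Callable[[Rule, Rule], bool]) -> int:
    """Step 2: return 1 if some atomic consequence yields the rule by a correct deduction."""
    found = any(is_correct_deduction(atomic, rule) for atomic in atomic_consequences)
    return 1 if found else 0


# Step 1 is assumed to be done already: a toy set Lambda of atomic consequences.
lam = [(("A2", frozenset({"x"})), ("A1", frozenset({"1"})))]

wider_rule = (("A2", frozenset({"x"})), ("A1", frozenset({"1", "2"})))
other_rule = (("A3", frozenset({"y"})), ("A1", frozenset({"1"})))
```

Passing the deduction test as a function keeps the sketch independent of the particular criterion, which in the framework depends on the 4ft-quantifier used.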
5.4 Global Analytical Reports
The goal is to get answers to various global analytical questions using local analytical reports presented on the Internet. The answers to global analytical questions will be presented as global analytical reports. It is further assumed that such global analytical reports will be used as input for answering additional global analytical questions. It means that the global analytical reports must again be treated as formal objects.
The global analytical questions are formulated on the basis of available local analytical reports. Thus the research of global analytical questions must start with preparing a variety of local analytical questions and corresponding analytical reports. An example of a local analytical question together with a sketch of its solution by means of 4ft-Discoverer is given in section 5.2. Additional examples of local analytical questions are in (Rauch and Šimůnek, 2009).
Each local analytical question leads to several global analytical questions. We denote as $LAQ_1$ the local analytical question introduced in section 5.2: Are there any association rules which can be considered as exceptions from the generally accepted fact $A \uparrow\uparrow B$ in a given data matrix $M$? We assume that the exception concerns a subset of objects defined by attributes $C_1, \ldots, C_L$ concerning data matrix $M$. Then we can formulate e.g. the following global analytical questions $GAQ_1$ and $GAQ_2$:
$GAQ_1$: Which data matrices are similar to the given data matrix $M$ as regards solutions of $LAQ_1$?
$GAQ_2$: Which data matrices differ from the given data matrix $M$ as regards solutions of $LAQ_1$?
Many additional global analytical questions can be formulated. The core problem related to the solution of such global analytical questions is the comparison of results concerning two data matrices $M_A$ and $M_B$. There are two possibilities:
1. Both $M_A$ and $M_B$ belong to one 4ft-Discoverer $4ftD_{T}$.
2. $M_A$ belongs to 4ft-Discoverer $4ftD_{T_A}$ and $M_B$ belongs to 4ft-Discoverer $4ftD_{T_B}$, where $T_A \neq T_B$.
There is a research effort to solve the problem of comparing $M_A$ and $M_B$ for both possibilities. However, its description is beyond the scope of this paper.
6 CONCLUSIONS
Logic of discovery was introduced in (Hájek and Havránek, 1978) and modified in (Rauch, 2010). The modification resulted in the system 4ft-Discoverer $4ftD_{T}$, which is a framework for mining association rules and applying domain knowledge in the mining process. We have briefly introduced the 4ft-Discoverer $4ftD_{T}$ and then shown that it can be enhanced for the needs of the SEWEBAR project, which aims at disseminating results of data mining in the form of analytical reports answering reasonable analytical questions.
We have identified several research problems related to this enhancement and outlined possible ways of solving them. Further work concerns the solution of these problems.
ACKNOWLEDGEMENTS
This paper was prepared with the support of Institu-
tional funds for support of a long-term development
of science and research at the Faculty of Informat-
ics and Statistics of The University of Economics,
Prague.
REFERENCES
Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In 19 ACM SIGMOD Conf. on the Management of Data, Washington, DC.
Hájek, P. and Havránek, T. (1978). Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, Berlin Heidelberg New York, 1st edition.
Hájek, P., Havránek, T., and Chytil, M. (1983). The GUHA Method (in Czech). Academia, Praha, 1st edition.
Hájek, P., Holeňa, M., and Rauch, J. (2010). The GUHA method and its meaning for data mining. Journal of Computer and System Sciences, 76(1):34–48.
Kliegr, T., Ralbovský, M., Svátek, V., Šimůnek, M., Jirkovský, V., Nemrava, J., and Zemánek, J. (2009). Semantic analytical reports: A framework for post-processing data mining results. In Rauch, J., Ras, Z. W., Berka, P., and Elomaa, T., editors, ISMIS, volume 5722 of Lecture Notes in Computer Science, pages 88–98. Springer.
Rauch, J. (1997). Logical calculi for knowledge discovery in databases. In Komorowski, J. and Zytkow, J., editors, Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, volume 1263 of LNAI, pages 47–57, Berlin. Springer.
Rauch, J. (2005). Logic of association rules. Applied Intelligence, 22(1):9–28.
Rauch, J. (2007). Project SEWEBAR considerations on semantic web and data mining. In IICAI, pages 1763–1782.
Rauch, J. (2008). Classes of association rules: An overview. In Lin, T. Y., Xie, Y., Wasilewska, A., and Liau, C.-J., editors, Data Mining: Foundations and Practice, volume 118 of Studies in Computational Intelligence, pages 315–337. Springer.
Rauch, J. (2009). Considerations on logical calculi for dealing with knowledge in data mining. In Ras, Z. W. and Dardzinska, A., editors, Advances in Data Management, volume 118 of Studies in Computational Intelligence, pages 177–199. Springer.
Rauch, J. (2010). Considerations on logic of discovery and data mining. Suggested for publication at CLA 2010.
Rauch, J. and Šimůnek, M. (2005). An alternative approach to mining association rules. In Lin, T. Y., Ohsuga, S., Liau, C.-J., Hu, X., and Tsumoto, S., editors, Foundations of Data Mining and Knowledge Discovery, volume 6 of Studies in Computational Intelligence, pages 211–231. Springer.
Rauch, J. and Šimůnek, M. (2007). Semantic web presentation of analytical reports from data mining - preliminary considerations. In Web Intelligence, pages 3–7. IEEE Computer Society.
Rauch, J. and Šimůnek, M. (2009). Dealing with background knowledge in the SEWEBAR project. In Berendt, B., Mladenič, D., de Gemmis, M., Semeraro, G., Spiliopoulou, M., Stumme, G., Svátek, V., and Železný, F., editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 89–106. Springer.
Suzuki, E. (2004). Discovering interesting exception rules with rule pair. In J. Fuernkranz (Ed.), Proceedings of the ECML/PKDD Workshop on Advances in Inductive Rule Learning, pages 163–178.