IMPROVED REVISION OF RANKING FUNCTIONS FOR THE
GENERALIZATION OF BELIEF IN THE CONTEXT OF
UNOBSERVED VARIABLES
Klaus Häming and Gabriele Peters
University of Hagen, Universitätsstr. 1, 58097 Hagen, Germany
Keywords:
Ranking functions, Machine learning, Reinforcement learning, Belief revision, Hybrid learning system.
Abstract:
To enable a reinforcement learning agent to acquire symbolic knowledge, we augment it with a high-level
knowledge representation. This representation consists of ordinal conditional functions (OCF) which allow
the agent to rank world models. By this means the agent is enabled to complement the self-organizing capabilities of
the low-level reinforcement learning sub-system with the reasoning capabilities of a high-level learning component.
We briefly summarize the state-of-the-art method of including new information into the OCF. To improve
the emergence of plausible behavior, we then introduce a modification of this method. The viability of this
modification is examined, first, for the inclusion of conditional information with negated consequents and,
second, for the generalization of belief in the context of unobserved variables. Besides providing a theoretical
justification for this modification, we also show the advantages of our approach in comparison to the state-of-
the-art method of revision in a reinforcement learning application.
1 INTRODUCTION
The creation of a system with autonomous learning capabilities poses a variety of challenges. Such a
system (or "agent") has to figure out which actions are beneficial and which have to be avoided. We developed
the work described in this paper starting from three system requirements. First, an autonomous learning system
should be able to learn from experience. A widely
adopted approach to incorporate such a property is
given by reinforcement learning (RL) (Sutton and
Barto, 1998). We will use a basic Q-learning scheme
to model this. Since RL is not the primary topic of
this work, we describe the basic idea in a nutshell
only. Second, the system should generate a repre-
sentation of its knowledge that allows further rea-
soning. In this area belief revision (BR) techniques
can be found. We will examine the usefulness of
ordinal conditional functions (OCF) (Kern-Isberner,
2001; Spohn, 2009) in this work. Third, and most im-
portant, we want both mentioned approaches to bene-
fit from each other. A mixture of low-level learning-
by-doing and high-level deduction abilities is called
a two-level learning approach. Psychological find-
ings (Anderson, 1983; Gombert, 2003; Reber, 1989;
Sun et al., 2005) indicate that such two-level learn-
ing principles can explain some of the human learning
abilities. While humans are able to learn top-down or
bottom-up (Sun et al., 2006), we will focus on the
bottom-up part only. A combination of RL and BR
has been proposed before (Leopold et al., 2008), in-
fluenced by (Sun et al., 2001) and (Ye et al., 2003).
While we have already described the general idea of our approach in (Häming and Peters, 2010), we
present here the detailed formalism and give a theoretical justification. This work is also related to the topic
of relational reinforcement learning (RRL) (Dzeroski et al., 2001). However, RRL does not distinguish between
high-level and low-level knowledge. It represents the Q-function directly in the form of propositional clauses.
2 NOTATION AND
TERMINOLOGY
A variable a can represent a value from its domain D_a. Such a domain consists of discrete values. One
such realization of a variable is called a literal. We write literals by denoting the variable as a subscript
of its value (e.g., 3_a or t_a). A formula consists of literals and logical operators such as ∧, ∨, ¬, etc. It is
referred to by an uppercase letter, e.g., A := 0_a ∧ 1_b. The negation of a literal corresponds to a formula. For example,
if D_a := {1, 2, 3}, then

¬2_a = (1_a ∨ 3_a).    (1)
The set of all variables is V, while the set of variables that are realized in a formula A is denoted by V_A.
A model is a conjunction in which exactly one literal exists for each variable. The set of all models is
referred to as 𝓜. If we restrict the set of variables the models are derived from, we will write the variable
set as a subscript, e.g., 𝓜_V. A model M is said to be a model of a formula F if F is true for the literals in
M. We denote this as M |= F. If an agent believes a formula A, which means A can be inferred from its
knowledge base κ, we will write κ |= A.
A conditional is denoted by A → B, where A is the antecedent and B is the consequent. The set of conditionals
we obtain when the antecedent A is replaced by a set of formulas 𝓕 is referred to as {𝓕 → B} := {F → B | F ∈ 𝓕}.
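To make the notation concrete, the following minimal sketch shows one possible encoding of domains, literals, formulas, and models in Python; the representation (literals as (variable, value) pairs, formulas as predicates over models) is an assumption made for illustration and not part of the paper.

    from itertools import product

    # Domains of the variables (Section 2); the concrete values are just examples.
    domains = {"a": {1, 2}, "b": {1, 2, 3}}

    # A literal is a (variable, value) pair, e.g. ("a", 3) for 3_a;
    # a model assigns exactly one value to every variable.
    def all_models(domains):
        variables = sorted(domains)
        for values in product(*(sorted(domains[v]) for v in variables)):
            yield frozenset(zip(variables, values))

    # A formula is represented as a predicate over models; M |= F becomes F(M).
    def literal(var, value):
        return lambda model: (var, value) in model

    def neg(formula):
        # Negating the literal 2_a yields the formula 1_a ∨ 3_a (Equation 1).
        return lambda model: not formula(model)

    # Usage: the models of ¬2_a are exactly the models of 1_a ∨ 3_a.
    two_a = literal("a", 2)
    models_of_not_two_a = [m for m in all_models(domains) if neg(two_a)(m)]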
3 REINFORCEMENT LEARNING
AND BELIEF REVISION
Let us assume an environment that is described by a
set of states. State transitions are performed depend-
ing on the current state and the current action carried
out by the agent. The transitions are rewarded. A
goal of RL consists in the identification of beneficial
actions, i.e., those actions that produce high rewards.
So, we have a set of states S, a set of actions A, a transition function δ : S × A → S, and a reward
function r : S × A → ℝ. Knowledge about good and poor actions is learned by applying a learning technique.
In our approach we apply Q-learning. This technique has the convenient property of being off-policy, which
means that the learned values do not depend on the strategy with which the agent explores the environment.
The agent's experience is captured in the Q(uality)-function that assigns an expected reward to
each state-action pair. The Q-function is updated after each state transition in the following way:

Q(S, A) = r + γ · max_{A′} Q(S′, A′)    (2)

with

S′ := δ(S, A).    (3)
One can interpret this formula in the following way: the agent will believe an action A to be a best action if
it has the highest Q(S, A) value for a given state S. This is the point where we establish a connection to
the high-level knowledge using BR in the following.
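As an illustration of Equations 2 and 3, the following minimal sketch performs one tabular Q-learning update; the dictionary representation and the values of γ and of the optional learning rate α are illustrative assumptions, not prescribed by the text.

    from collections import defaultdict

    # Tabular Q-function: maps (state, action) pairs to expected rewards.
    Q = defaultdict(float)

    def q_update(state, action, reward, next_state, actions, gamma=0.9, alpha=1.0):
        # Target value of Equation (2): r + gamma * max_{A'} Q(S', A'),
        # with S' = delta(S, A) already observed as next_state (Equation (3)).
        best_next = max(Q[(next_state, a)] for a in actions)
        target = reward + gamma * best_next
        # With alpha = 1.0 this reduces to the plain assignment of Equation (2);
        # alpha < 1 gives the usual incremental Q-learning update.
        Q[(state, action)] += alpha * (target - Q[(state, action)])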
BR is a theory of maintaining a knowledge base in such a way that the current belief is represented in a
consistent manner (Alchourron et al., 1985; Darwiche and Pearl, 1996). We model our knowledge base κ
as an ordinal conditional function (OCF). This is a ranking function that maintains a list of all models.
The models the agent believes in are set to rank 0, while all ranks greater than 0 represent an increasing
disbelief. We denote the rank an OCF κ assigns to a model M as κ(M). By convention, contradictions
shall have the rank ∞. The operator "|=" of Section 2 is defined for an OCF as

κ |= A  :⇔  (∃ M_1, M_1 |= A : κ(M_1) = 0)  ∧  (∀ M_2, M_2 |= ¬A : κ(M_2) > 0)    (4)

which requires a believed formula to have a model with rank 0 and its negation to have a rank greater
than 0.
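A minimal sketch of an OCF and of the belief test of Equation 4, assuming the model and formula encoding introduced above; the concrete ranks are arbitrary example values.

    import math

    # kappa maps each model to its rank; 0 means believed, larger means disbelieved.
    kappa = {
        frozenset({("a", 2), ("b", 1)}): 0,
        frozenset({("a", 1), ("b", 1)}): 1,
    }

    def rank(kappa, satisfies):
        # kappa(A): the lowest rank among the models of A;
        # contradictions (no satisfying model) receive rank infinity by convention.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def believes(kappa, satisfies):
        # kappa |= A in the sense of Equation (4): some model of A has rank 0
        # and every model of ¬A has a rank greater than 0.
        return rank(kappa, satisfies) == 0 and rank(kappa, lambda m: not satisfies(m)) > 0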
In this work, the states and actions are described as
formulas. Therefore it is possible to store information
on them in a suitable OCF. The interplay between the
OCF and the Q-function is described in Section 8.
4 STATE-OF-THE-ART REVISION
OF ORDINAL CONDITIONAL
FUNCTIONS
The current belief represented by the OCF consists of models, i.e., propositional information in the form of
conjunctions. However, during exploration the information gathered and the information needed is in the
form of conditionals. To check whether an OCF believes a conditional, the agent can temporarily believe
its antecedent (known as conditioning) and check whether the conjunction of the antecedent and the consequent
is also believed. At the same time, the conjunction of the antecedent and the negation of the consequent
must not be believed (that is, κ(S¬A) > 0). Generally, we do not have to condition κ to find out whether a
conditional is believed. It is sufficient to compute the belief ranks r_1 = κ(SA) and r_2 = κ(S¬A). If r_1 < r_2,
the conditional will be believed. More difficult than querying the knowledge base is its update, called revision.
The revision operator is "∗". Conditionals in BR are usually denoted by (A|S), where S is the
antecedent and A the consequent. The meaning of (A|S) is not exactly the same as that of S → A (Kern-Isberner,
2001). The latter means that S implies A irrespective of the values of other variables. In contrast, (A|S)
expresses that A will be believed if κ is conditioned
with S and S alone; therefore, conditioning with ST may not result in A being believed. In our context of RL, if
S is a complete state description, it will capture all the available information. Then, an expression such as
ST with T ≠ S is necessarily a contradiction and therefore not believed. In this case, the meaning of S → A and
(A|S) is the same. Therefore, on a first attempt, we use (κ ∗ (A|S)) to revise κ with a conditional, analogous
to (Leopold et al., 2008). Then, we will examine the consequences of such a decision.
After a revision of κ with the conditional S → A, we want

(κ ∗ (A|S))(SA) < (κ ∗ (A|S))(S¬A)    (5)

to hold. If this is already the case, nothing has to be done. Otherwise the following holds:

Theorem 1. If κ(SA) ≥ κ(S¬A), then the OCF κ′ derived from κ by rearranging the models using

∀ M ∈ 𝓜 : κ′(M) := (κ ∗ (A|S))(M) =
    κ(M) − κ(¬S ∨ A)  :  M |= ¬S ∨ A
    a + b             :  M |= S¬A        (6)

with

a = κ(SA) − κ(¬S ∨ A) + 1
b = κ(M) − κ(S¬A)

will result in κ′(SA) < κ′(S¬A). Consequently, κ′ expresses the belief in S → A.
Proof. Let us partition the models in κ into three disjoint sets:

𝓜_1 = {M | M |= ¬S},
𝓜_2 = {M | M |= SA}, and
𝓜_3 = {M | M |= S¬A}.

We address the first rule of Equation 6 first. Its purpose is to let κ′(¬S ∨ A) = 0. The models
in 𝓜_1 ∪ 𝓜_2 are those that model ¬S ∨ A. Therefore we reduce the rank of all models in 𝓜_1 ∪ 𝓜_2 by
κ(¬S ∨ A), which is the rank of the lowest ranked model in 𝓜_1 ∪ 𝓜_2. Hence, κ′(¬S ∨ A) = 0. We
now consider term a of the second rule. We want κ′(SA) < κ′(S¬A) to hold. That means, after revision,
the lowest rank of the models in 𝓜_3 needs to be at least κ′(SA) + 1. Since the models of SA are
found in 𝓜_2 and are therefore shifted by the first rule, κ′(SA) = κ(SA) − κ(¬S ∨ A). Adding 1 is arbitrary
but sufficient to meet the requirements. Term a alone would shift the ranks of all models of 𝓜_3 to the rank
κ′(SA) + 1. To preserve the relative ranking of the models, we need to add term b to the second rule.
Since κ(S¬A) is the rank of the lowest ranked model of 𝓜_3, this very model is still shifted to the rank
κ′(SA) + 1. The other models, however, keep their relative distances.
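The rearrangement of Equation 6 can be written down directly. The following sketch assumes the dictionary-based OCF encoding used above (models mapped to ranks, formulas as predicates); it is an illustration, not the authors' implementation.

    import math

    def rank(kappa, satisfies):
        # kappa(F): lowest rank among the models of F, infinity for contradictions.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def revise_conditional(kappa, S, A):
        # Revision kappa * (A|S) following Equation (6).
        r_sa  = rank(kappa, lambda m: S(m) and A(m))        # kappa(SA)
        r_sna = rank(kappa, lambda m: S(m) and not A(m))    # kappa(S¬A)
        if r_sa < r_sna:
            return dict(kappa)                              # Equation (5) already holds
        r_nsa = rank(kappa, lambda m: (not S(m)) or A(m))   # kappa(¬S ∨ A)
        a = r_sa - r_nsa + 1
        revised = {}
        for model, r in kappa.items():
            if (not S(model)) or A(model):                  # first rule: models of ¬S ∨ A
                revised[model] = r - r_nsa
            else:                                           # second rule: models of S¬A
                revised[model] = a + (r - r_sna)            # a + b
        return revised

With S = 1_a and A = ¬1_b, for instance, this sketch reproduces the revision step shown in Equation 7 of the following section, provided the listed models are the only ones tracked.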
5 NEGATED CONSEQUENTS
What will happen if κ is revised with S → ¬A? Then, an application of Equation 6 will result in
(κ ∗ (¬A|S))(S¬A) < (κ ∗ (¬A|S))(SA).
This does not, however, mean that all models of S¬A receive a rank lower than the models of SA. We show this in the
following example. Let us define two variables a and b with their domains D_a := {1, 2} and D_b := {1, 2, 3}. The
current belief is represented by an OCF, written as an ordered list of models from the lowest rank upward; the
first entry represents the current belief, that is, its model has rank 0, and models grouped in braces share a
rank. A model such as 21 abbreviates the conjunction 2_a ∧ 1_b. Now, we want the following OCF κ_neg to
believe 1_a → ¬1_b:

κ_neg = ( 21, 11, 22, 12, 23, 13 )   ⟶ (κ_neg ∗ (¬1_b | 1_a)) ⟶   κ′_neg = ( 21, 22, 12, {11, 23}, 13 )    (7)

which believes 1_a → 2_b, but not 1_a → 3_b. This behavior is perfectly sane since (1_a → 2_b) ∧ (1_a → 3_b) is a
contradiction. But the belief in (1_a → 1_b) is stronger than the belief in (1_a → 3_b). If we revise κ′_neg with 1_a → ¬2_b,
then the result will be

κ″_neg = (κ′_neg ∗ (¬2_b | 1_a)) = ( 21, 22, {11, 23}, {13, 12} ).    (8)

This expresses a belief in 1_a → 1_b, which is certainly not what we expect an agent to believe if it has just
been exposed to the information 1_a → ¬1_b and 1_a → ¬2_b. Instead, a belief in 1_a → 3_b seems reasonable.
6 GENERALIZATION
We examine a revision by Equation 6 in the context of generalization by studying the effect of omitting
variables from a formula. Let us partition the set of variables V into three non-empty subsets:

V = X ∪ Y ∪ Z, with X ∩ Y = ∅, X ∩ Z = ∅, and Y ∩ Z = ∅.    (9)

Next, take a model from each of the subsets, such as

X ∈ 𝓜_X, Y ∈ 𝓜_Y, and Z ∈ 𝓜_Z.    (10)
The revision κ ∗ (Z|X) will lead to a knowledge base that believes a particular model M′ of the set {𝓜_X∪Y Z},
i.e., one of the full models that contain Z. Next, we consider the other models 𝓒 := {𝓜_X∪Y Z} \ {M′}.
First, there is the obvious restriction that a C ∈ 𝓒 is not allowed to contradict Z. We
already ruled this out in Equation 9. Let us look at the
following sample OCFs over three variables a, b, and c (where, e.g., 212 abbreviates the model 2_a ∧ 1_b ∧ 2_c):

κ_gen = ( 212, 211, 221, 222 )   and   κ′_gen = ( 211, 222, 212, 221 )    (11)

We can easily see that κ_gen believes 2_a → 2_c since κ_gen(2_a 2_c) = 0, but at the same time κ_gen((2_a ∧
2_b) → 2_c) = 1 > 0. A revision with 2_a → 2_c using Equation 6 would not change κ_gen at all.
The same issue occurs for a revision with a conditional that has a negated consequent, such as
2_a → ¬2_c. We show this for κ′_gen, which believes this conditional and would not be changed by a revision
with 2_a → ¬2_c using Equation 6. Nevertheless, it does not believe (2_a ∧ 2_b) → ¬2_c. We conclude that
Equation 6 does not produce an OCF that is capable of generalization.
7 AN ALTERNATIVE REVISION
Because of the described drawbacks we suggest an alternative revision technique. The proposed revision
introduced in this section, (κ ∗ (S → A)), utilizes a new operator κ[A] which returns the highest disbelief
among all models of A, i.e., κ[A] = max{κ(M) : M |= A}. After a revision of κ with the conditional S → A,
we still want the equivalent of Equation 5 to hold:

(κ ∗ (S → A))(SA) < (κ ∗ (S → A))(S¬A)    (12)

This is investigated in the following
Theorem 2. If κ(SA) ≥ κ(S¬A), then the OCF κ′ derived from κ by rearranging the models using

∀ M ∈ 𝓜 : κ′(M) := (κ ∗ (S → A))(M) =
    κ(M) − κ(¬S ∨ A)  :  M |= ¬S ∨ A
    a′ + b′           :  M |= S¬A        (13)

with

a′ = κ[SA] − κ(¬S ∨ A) + 1
b′ = κ(M) − κ(S¬A)

will result in κ′(SA) < κ′(S¬A). Consequently, κ′ expresses the belief in S → A.
Proof. Let κ_1 := (κ ∗ (A|S)) and κ_2 := (κ ∗ (S → A)). Since κ[A] ≥ κ(A), by application of Theorem 1
it can be deduced that κ_2(SA) = κ_1(SA) < κ_1(S¬A) ≤ κ_2(S¬A).
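For comparison with the sketch of Equation 6, the alternative rearrangement of Equation 13 only replaces κ(SA) by κ[SA] in the term a; the encoding is again the illustrative dictionary representation, not the authors' implementation.

    import math

    def rank_min(kappa, satisfies):
        # kappa(F): lowest rank among the models of F (infinity for contradictions).
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def rank_max(kappa, satisfies):
        # kappa[F]: highest rank among the models of F, the new operator of this section.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return max(ranks) if ranks else math.inf

    def revise_material(kappa, S, A):
        # Rearrangement of Equation (13), kappa * (S -> A).
        # No early return for already-believed conditionals is included here,
        # since the examples below apply the rearrangement in that situation as well.
        r_nsa = rank_min(kappa, lambda m: (not S(m)) or A(m))      # kappa(¬S ∨ A)
        r_sna = rank_min(kappa, lambda m: S(m) and not A(m))       # kappa(S¬A)
        a = rank_max(kappa, lambda m: S(m) and A(m)) - r_nsa + 1   # a' = kappa[SA] - kappa(¬S∨A) + 1
        revised = {}
        for model, r in kappa.items():
            if (not S(model)) or A(model):                          # first rule of Equation (13)
                revised[model] = r - r_nsa
            else:                                                   # second rule: a' + b'
                revised[model] = a + (r - r_sna)
        return revised

Under the assumption that the four models listed in Equation 11 are the only ones tracked, this rearrangement turns κ_gen into the ordering of Equation 18 below.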
So, concerning the preservation of current belief, this method works just as well as Equation 6, but it
introduces greater changes. In the following discussion of the properties of (κ ∗ (S → A)) with respect to
negation and generalization we justify these changes. First, we consider negation.
Theorem 3. Let t ∈ D_b and κ′ := (κ ∗ (A → ¬t_b)). Then

∀ r ∈ D_b \ {t} : κ′(A → r_b) ≤ κ′(A → t_b).    (14)

Proof. By applying Equation 13, we obtain κ′(A t_b) > κ′[A ¬t_b]. This is equivalent to

∀ r ∈ D_b \ {t} : κ′(A r_b) < κ′(A t_b).

Since κ′(A → r_b) = min(κ′(¬A), κ′(A r_b)) for every r, Equation 14 follows. Hence, if A is believed, the
inequality of Equation 14 will hold strictly. On the other hand, if ¬A is believed, then κ′(A → r_b) = 0
as well as κ′(A → t_b) = 0.
Theorem 3 implies that the inconsistency observed in Section 5 does not appear after the repeated
application of Equation 13. Indeed, a revision of κ_neg with (1_a → ¬1_b) now results in

κ′_neg = (κ_neg ∗ (1_a → ¬1_b)) = ( 21, 22, 12, 23, 13, 11 ).    (15)

Also,

κ″_neg = (κ′_neg ∗ (1_a → ¬2_b)) = ( 21, 22, 23, 13, 11, 12 )    (16)

which illustrates that 1_a → 3_b is now believed, as expected.
We now consider generalization with our alterna-
tive revision technique. Again, Equation 9 and Equa-
tion 10 are given.
Theorem 4. Let X ∈ 𝓜_X, Z ∈ 𝓜_Z, and κ′ := (κ ∗ (X → Z)). Then

∀ Y ∈ 𝓜_Y : κ′(X Y → Z) ≤ κ′(X Y → ¬Z).    (17)

Proof. The proof is an analog of the proof of Theorem 3. After applying Equation 13, we obtain
κ′(X ¬Z) > κ′[X Z]. This is equivalent to

∀ Y ∈ 𝓜_Y : κ′(X Y Z) < κ′(X Y ¬Z).
Hence, if X Y is believed, the inequality of Equation 17 will hold strictly. On the other hand, if
¬(X Y) is believed, then κ′(X Y → Z) = 0 as well as κ′(X Y → ¬Z) = 0.
A similar theorem holds if the consequent is negated.
To complete this section, we show that the previous counter-examples can be resolved using Equation 13.
A revision of κ_gen with 2_a → 2_c now yields

κ_gen = ( 212, 211, 221, 222 )   ⟶ (κ_gen ∗ (2_a → 2_c)) ⟶   ( 212, 222, 211, 221 ).    (18)

The revised OCF expresses a belief in 2_a ∧ 1_b → 2_c and in 2_a ∧ 2_b → 2_c. A revision of κ′_gen with
2_a → ¬2_c provides

κ′_gen = ( 211, 222, 212, 221 )   ⟶ (κ′_gen ∗ (2_a → ¬2_c)) ⟶   ( 211, 221, 222, 212 )    (19)

which now believes 2_a ∧ 1_b → ¬2_c as well as 2_a ∧ 2_b → ¬2_c.
8 APPLICATION
We examine the effect of the proposed algorithm
in a cliff-walk gridworld (Sutton and Barto, 1998)
(Figure 2). For this application six cases are exam-
ined: plain Q-learning, OCF-augmented Q-learning
with application of Equation 6, OCF-augmented Q-
learning with application of Equation 13, plain Q-
learning with futile information, OCF-augmented Q-
learning with application of Equation 6 and futile in-
formation, and OCF-augmented Q-learning with ap-
plication of Equation 13 and futile information. An
OCF-augmented Q-learner is a Q-learner that has
conditionals extracted from its Q-table. These condi-
tionals revise the learner’s OCF and this OCF acts as
a filter for the choice of actions afterwards. Figure 1
shows this architecture.
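The filter of Figure 1 is not spelled out as pseudo-code in the text; the following sketch shows one plausible reading in which actions that the OCF currently rules out for the observed state are removed before a greedy choice over the Q-table. The query function `disbelieved`, the ε-greedy exploration, and all names are assumptions of this sketch.

    import random

    def choose_action(state, actions, Q, disbelieved, epsilon=0.1):
        # Epsilon-greedy action selection with the OCF acting as a filter (cf. Figure 1).
        # `disbelieved(state, action)` stands for a query against the OCF: it returns
        # True if the high-level knowledge currently rules the action out for this state.
        allowed = [a for a in actions if not disbelieved(state, a)] or list(actions)
        if random.random() < epsilon:                               # exploration
            return random.choice(allowed)
        return max(allowed, key=lambda a: Q.get((state, a), 0.0))   # greedy among allowed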
We add futile information to model the case where
the agent perceives properties of its environment that
are not helpful with regard to its goal. The OCF-
augmented Q-learners are expected to be able to gen-
eralize and therefore identify the futile information.
The generalization is performed in the same way as
in (Leopold et al., 2008) by counting the pattern fre-
quency. The general idea is to keep track of how often
sub-patterns of antecedents are used in the context of
particular consequents. If they occur frequently enou-
gh, we will revise the OCF with the sub-pattern in-
stead of the complete state description. The state de-
scription is also adopted from (Leopold et al., 2008),
where a qualitative description is used which consists
of the relative position of the agent towards the goal
(north, south, east, west) and a distance (near, middle,
far) amended with information on adjacent obstacles.
Reaching the goal triggers a reward of 100, getting closer to the goal is rewarded by 0.5. Stepping
into the cliff is punished by −10, and every other step receives −1. After 100 steps the episode is forced to
end.
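A minimal sketch of the pattern-frequency bookkeeping described above; the data types, the enumeration of sub-patterns, and the support threshold are illustrative assumptions rather than the exact scheme of (Leopold et al., 2008).

    from collections import defaultdict
    from itertools import combinations

    # Counts how often a sub-pattern of an antecedent occurred for a given consequent.
    pattern_counts = defaultdict(int)

    def frequent_subpatterns(state_literals, consequent, min_support=5):
        # state_literals: the (variable, value) literals of the full state description.
        # Returns the proper sub-patterns that occurred often enough to be used for
        # revising the OCF instead of the complete state description.
        frequent = []
        for size in range(1, len(state_literals)):
            for sub in combinations(sorted(state_literals), size):
                pattern_counts[(sub, consequent)] += 1
                if pattern_counts[(sub, consequent)] >= min_support:
                    frequent.append(frozenset(sub))
        return frequent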
The results are depicted in Figure 3. It is evident
that a revision with Equation 13 clearly outperforms
a revision with Equation 6. The latter is worse than
a plain Q-learner and even seems to deteriorate over
time. An explanation for this may lie in the fact that
the OCF gets contaminated by harmful conditionals.
However, this has not been examined in this work.
The computational cost of the described improve-
ment depends on the representation of the OCF. If the
OCF is implemented by creating every possible con-
junction beforehand, then Equation 6 and Equation 13
will lead to roughly the same runtime, because κ[A] is
the rank of A in a reversed κ.
In a different approach we initialized the OCF
without any conjunctions to be able to handle larger
problems. Conjunctions not occurring in the OCF re-
ceived a rank of infinity. Then the revision process
generates conjunctions as needed. Clearly, this breaks
the symmetry between κ(A) and κ[A]. The runtime of this approach is about 1.5 times that of the previously
described approach.
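A sketch of this sparse representation, assuming a simple dictionary with a default rank of infinity; the class name and interface are illustrative, not taken from the paper.

    import math

    class SparseOCF:
        # Only explicitly ranked conjunctions are stored; everything else
        # defaults to rank infinity and entries are created lazily by revision.
        def __init__(self):
            self.ranks = {}                      # model -> rank; missing means infinity

        def rank(self, model):
            return self.ranks.get(model, math.inf)

        def set_rank(self, model, rank):
            self.ranks[model] = rank             # created on demand during revision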
Figure 1: OCF-augmented RL system. The OCF acts as a filter that limits the choices of the policy. (The diagram connects the environment, the decoded state, the Q-table, the OCF filter, and the policy.)
9 CONCLUSIONS
The theoretical considerations presented in this work alleviate a severe disadvantage of the state-of-the-art
revision of ordinal conditional functions with conditional information. The presented examples clearly indicate
the removal of quite apparent implausibilities.
Figure 2: Cliff-walk gridworld. The goal of a moving agent is to reach the green square, starting from the red one. Entering the dark squares (representing a cliff) results in a high negative reward. Superimposed is the learned path after 100 episodes. The path color indicates the expected reward by displaying the value of min(1, expected reward / goal reward) using the displayed color key.
Figure 3: Results. The diagrams show the rewards over a series of 300 episodes. On the top the results with futile information are depicted, on the bottom the results without futile information. plain/plainn show results of a plain Q-learner, old/oldn show results of revisions with Equation 6, and new/newn show results of revisions with Equation 13. For the OCF-augmented learners the values are averages of 1000 runs. Since the plain Q-learner exhibits large variations, its values have been averaged over 2000 runs.
The aptitude of this approach to create an agent which shows emergent understanding of its environment
needs to be examined in more detail. In particular, the analysis of the symbolic belief representation in
different contexts is on our agenda. First experiments indicate the accumulation of a proper symbolic
description of favorable state-action pairs.
The symbolic representation also allows for symbolic reasoning to incorporate a top-down path of
learning. The combination of these techniques is definitely of interest and needs to be addressed in future
publications.
ACKNOWLEDGEMENTS
This research was funded by the German Research Foundation (DFG) under Grant PE 887/3-3.
REFERENCES
Alchourrón, C. E., Gärdenfors, P., and Makinson, D. (1985). On the logic of theory change: Partial meet contraction and revision functions. J. Symbolic Logic, 50(2):510–530.
Anderson, J. R. (1983). The architecture of cognition. Harvard University Press, Cambridge, MA.
Darwiche, A. and Pearl, J. (1996). On the logic of iterated
belief revision. Artificial intelligence, 89:1–29.
Dzeroski, S., Raedt, L. D., and Driessens, K. (2001). Rela-
tional reinforcement learning. In Machine Learning,
volume 43, pages 7–52.
Gombert, J.-E. (2003). Implicit and explicit learning to
read: Implication as for subtypes of dyslexia. Current
psychology letters, 1(10).
Häming, K. and Peters, G. (2010). An alternative approach to the revision of ordinal conditional functions in the context of multi-valued logic. In Proceedings of the 20th International Conference on Artificial Neural Networks, volume LNCS 6353, pages 200–203.
Kern-Isberner, G. (2001). Conditionals in nonmonotonic
reasoning and belief revision: considering condition-
als as agents. Springer-Verlag New York, Inc.
Leopold, T., Kern-Isberner, G., and Peters, G. (2008). Combining reinforcement learning and belief revision: A learning system for active vision. In Proceedings of the 19th British Machine Vision Conference, pages 473–482.
Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 3(118):219–235.
Spohn, W. (2009). A survey of ranking theory. In Degrees
of Belief. Springer.
Sun, R., Merrill, E., and Peterson, T. (2001). From implicit
skills to explicit knowledge: A bottom-up model of
skill learning. In Cognitive Science, volume 25(2),
pages 203–244.
Sun, R., Terry, C., and Slusarz, P. (2005). The interaction of
the explicit and the implicit in skill learning: A dual-
process approach. Psychological Review, 112:159–
192.
Sun, R., Zhang, X., Slusarz, P., and Mathews, R. (2006).
The interaction of implicit learning, explicit hypothe-
sis testing, and implicit-to-explicit knowledge extrac-
tion. Neural Networks, 1(20):34–47.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge.
Ye, C., Yung, N. H. C., and Wang, D. (2003). A fuzzy
controller with supervised learning assisted reinforce-
ment learning algorithm for obstacle avoidance. IEEE
Transactions on Systems, Man, and Cybernetics, Part
B, 33(1):17–27.