IMPROVED REVISION OF RANKING FUNCTIONS FOR THE
GENERALIZATION OF BELIEF IN THE CONTEXT OF
UNOBSERVED VARIABLES
Klaus Häming and Gabriele Peters
University of Hagen, Universitätsstr. 1, 58097 Hagen, Germany
Keywords:
Ranking functions, Machine learning, Reinforcement learning, Belief revision, Hybrid learning system.
Abstract:
To enable a reinforcement learning agent to acquire symbolic knowledge, we augment it with a high-level
knowledge representation. This representation consists of ordinal conditional functions (OCF) which allow
the agent to rank world models. By this means the agent is enabled to complement the self-organizing capabilities of
the low-level reinforcement learning sub-system with the reasoning capabilities of a high-level learning component.
We briefly summarize the state-of-the-art method of including new information into the OCF. To improve
the emergence of plausible behavior, we then introduce a modification of this method. The viability of this
modification is examined, first, for the inclusion of conditional information with negated consequents and,
second, for the generalization of belief in the context of unobserved variables. Besides providing a theoretical
justification for this modification, we also show the advantages of our approach in comparison to the state-of-
the-art method of revision in a reinforcement learning application.
1 INTRODUCTION
The creation of a system with autonomous learning capabilities poses a variety of challenges. Such a
system (or "agent") has to figure out which actions are beneficial and which have to be avoided. We developed
the work described in this paper starting from three system requirements. First, an autonomous learning system
should be able to learn from experience. A widely
adopted approach to incorporate such a property is
given by reinforcement learning (RL) (Sutton and
Barto, 1998). We will use a basic Q-learning scheme
to model this. Since RL is not the primary topic of
this work, we describe the basic idea in a nutshell
only. Second, the system should generate a repre-
sentation of its knowledge that allows further rea-
soning. In this area belief revision (BR) techniques
can be found. We will examine the usefulness of
ordinal conditional functions (OCF) (Kern-Isberner,
2001; Spohn, 2009) in this work. Third, and most im-
portant, we want both mentioned approaches to bene-
fit from each other. A mixture of low-level learning-
by-doing and high-level deduction abilities is called
a two-level learning approach. Psychological find-
ings (Anderson, 1983; Gombert, 2003; Reber, 1989;
Sun et al., 2005) indicate that such two-level learn-
ing principles can explain some of the human learning
abilities. While humans are able to learn top-down or
bottom-up (Sun et al., 2006), we will focus on the
bottom-up part only. A combination of RL and BR
has been proposed before (Leopold et al., 2008), in-
fluenced by (Sun et al., 2001) and (Ye et al., 2003).
While we have already described the general idea of our approach in (Häming and Peters, 2010), we
present here the detailed formalism and give a theoretical justification. This work is also related to the topic
of relational reinforcement learning (RRL) (Dzeroski et al., 2001). However, RRL does not distinguish between
high-level and low-level knowledge. It represents the Q-function directly in the form of propositional clauses.
2 NOTATION AND
TERMINOLOGY
A variable a can represent a value from its domain D_a. Such a domain consists of discrete values. One
such realization of a variable is called a literal. We write literals by denoting the variable as a subscript
of its value (e.g., 3_a or t_a). A formula consists of literals and logical operators such as ∧, ∨, ¬, etc. It is
referred to by an uppercase letter, e.g., A := 0_a ∧ 1_b. The negation of a literal corresponds to a formula. For example,
if D_a := {1, 2, 3}, then

¬2_a = (1_a ∨ 3_a).    (1)
The set of all variables is V, while the set of variables that are realized in a formula A is denoted by V_A.
A model is a conjunction in which exactly one literal exists for each variable. The set of all models is
referred to as 𝓜. If we restrict the set of variables the models are derived from, we will write the variable
set as a subscript, e.g., 𝓜_V. A model M is said to be a model of a formula F if F is true for the literals in
M. We denote this as M |= F. If an agent believes a formula A, which means A can be inferred from its
knowledge base κ, we will write κ |= A.
A conditional is denoted by A → B, where A is the antecedent and B is the consequent. The set of conditionals
we obtain when the antecedent A is replaced by a set of formulas 𝓕 is referred to as {𝓕 → B} := {F → B | F ∈ 𝓕}.
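To make the notation concrete, the following minimal sketch shows one possible encoding of domains, literals, formulas, and models in Python; the representation (literals as (variable, value) pairs, formulas as predicates over models) is an assumption made for illustration and not part of the paper.

    from itertools import product

    # Domains of the variables (Section 2); the concrete values are just examples.
    domains = {"a": {1, 2}, "b": {1, 2, 3}}

    # A literal is a (variable, value) pair, e.g. ("a", 3) for 3_a;
    # a model assigns exactly one value to every variable.
    def all_models(domains):
        variables = sorted(domains)
        for values in product(*(sorted(domains[v]) for v in variables)):
            yield frozenset(zip(variables, values))

    # A formula is represented as a predicate over models; M |= F becomes F(M).
    def literal(var, value):
        return lambda model: (var, value) in model

    def neg(formula):
        # Negating the literal 2_a yields the formula 1_a ∨ 3_a (Equation 1).
        return lambda model: not formula(model)

    # Usage: the models of ¬2_a are exactly the models of 1_a ∨ 3_a.
    two_a = literal("a", 2)
    models_of_not_two_a = [m for m in all_models(domains) if neg(two_a)(m)]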
3 REINFORCEMENT LEARNING
AND BELIEF REVISION
Let us assume an environment that is described by a
set of states. State transitions are performed depend-
ing on the current state and the current action carried
out by the agent. The transitions are rewarded. A
goal of RL consists in the identification of beneficial
actions, i.e., those actions that produce high rewards.
So, we have a set of states S, a set of actions A, a transition function δ : S × A → S, and a reward
function r : S × A → ℝ. Knowledge about good and poor actions is learned by applying a learning technique.
In our approach we apply Q-learning. This technique has the convenient property of being off-policy, which
means that the learned values do not depend on the strategy with which the agent explores the environment.
The agent's experience is captured in the Q(uality)-function that assigns an expected reward to
each state-action pair. The Q-function is updated after each state transition in the following way:

Q(S, A) = r + γ · max_{A′} Q(S′, A′)    (2)

with

S′ := δ(S, A).    (3)
One can interpret this formula in the following way: the agent will believe an action A to be a best action if
it has the highest Q(S, A) value for a given state S. This is the point where we establish a connection to
the high-level knowledge using BR in the following.
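As an illustration of Equations 2 and 3, the following minimal sketch performs one tabular Q-learning update; the dictionary representation and the values of γ and of the optional learning rate α are illustrative assumptions, not prescribed by the text.

    from collections import defaultdict

    # Tabular Q-function: maps (state, action) pairs to expected rewards.
    Q = defaultdict(float)

    def q_update(state, action, reward, next_state, actions, gamma=0.9, alpha=1.0):
        # Target value of Equation (2): r + gamma * max_{A'} Q(S', A'),
        # with S' = delta(S, A) already observed as next_state (Equation (3)).
        best_next = max(Q[(next_state, a)] for a in actions)
        target = reward + gamma * best_next
        # With alpha = 1.0 this reduces to the plain assignment of Equation (2);
        # alpha < 1 gives the usual incremental Q-learning update.
        Q[(state, action)] += alpha * (target - Q[(state, action)])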
BR is a theory of maintaining a knowledge base in such a way that the current belief is represented in a
consistent manner (Alchourron et al., 1985; Darwiche and Pearl, 1996). We model our knowledge base κ
as an ordinal conditional function (OCF). This is a ranking function that maintains a list of all models.
The models the agent believes in are set to rank 0, while all ranks greater than 0 represent an increasing
disbelief. We denote the rank an OCF κ assigns to a model M as κ(M). By convention, contradictions
shall have the rank ∞. The operator "|=" of Section 2 is defined for an OCF as

κ |= A  :⇔  (∃ M_1, M_1 |= A : κ(M_1) = 0)  ∧  (∀ M_2, M_2 |= ¬A : κ(M_2) > 0)    (4)

which requires a believed formula to have a model with rank 0 and its negation to have a rank greater
than 0.
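A minimal sketch of an OCF and of the belief test of Equation 4, assuming the model and formula encoding introduced above; the concrete ranks are arbitrary example values.

    import math

    # kappa maps each model to its rank; 0 means believed, larger means disbelieved.
    kappa = {
        frozenset({("a", 2), ("b", 1)}): 0,
        frozenset({("a", 1), ("b", 1)}): 1,
    }

    def rank(kappa, satisfies):
        # kappa(A): the lowest rank among the models of A;
        # contradictions (no satisfying model) receive rank infinity by convention.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def believes(kappa, satisfies):
        # kappa |= A in the sense of Equation (4): some model of A has rank 0
        # and every model of ¬A has a rank greater than 0.
        return rank(kappa, satisfies) == 0 and rank(kappa, lambda m: not satisfies(m)) > 0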
In this work, the states and actions are described as
formulas. Therefore it is possible to store information
on them in a suitable OCF. The interplay between the
OCF and the Q-function is described in Section 8.
4 STATE-OF-THE-ART REVISION
OF ORDINAL CONDITIONAL
FUNCTIONS
The current belief represented by the OCF consists of models, i.e., propositional information in the form of
conjunctions. However, during exploration the information gathered and the information needed is in the
form of conditionals. To check whether an OCF believes a conditional, the agent can temporarily believe
its antecedent (known as conditioning) and check whether the conjunction of the antecedent and the consequent
is also believed. At the same time, the conjunction of the antecedent and the negation of the consequent
must not be believed (that is, κ(S¬A) > 0). Generally, we do not have to condition κ to find out whether a
conditional is believed. It is sufficient to compute the belief ranks r_1 = κ(SA) and r_2 = κ(S¬A). If r_1 < r_2,
the conditional will be believed. More difficult than querying the knowledge base is its update, called revision.
The revision operator is "∗". Conditionals in BR are usually denoted by (A|S), where S is the
antecedent and A the consequent. The meaning of (A|S) is not exactly the same as that of S → A (Kern-Isberner,
2001). The latter means that S implies A irrespective of the values of other variables. In contrast, (A|S)
expresses that A will be believed if κ is conditioned
with S and S alone; therefore, conditioning with ST may not result in A being believed. In our context of RL, if
S is a complete state description, it will capture all the available information. Then, an expression such as
ST with T ≠ S is necessarily a contradiction and therefore not believed. In this case, the meaning of S → A and
(A|S) is the same. Therefore, on a first attempt, we use (κ ∗ (A|S)) to revise κ with a conditional, analogous
to (Leopold et al., 2008). Then, we will examine the consequences of such a decision.
After a revision of κ with the conditional S → A, we want

(κ ∗ (A|S))(SA) < (κ ∗ (A|S))(S¬A)    (5)

to hold. If this is already the case, nothing has to be done. Otherwise the following holds:

Theorem 1. If κ(SA) ≥ κ(S¬A), then the OCF κ′ derived from κ by rearranging the models using

∀ M ∈ 𝓜 : κ′(M) := (κ ∗ (A|S))(M) =
    κ(M) − κ(¬S ∨ A)  :  M |= ¬S ∨ A
    a + b             :  M |= S¬A        (6)

with

a = κ(SA) − κ(¬S ∨ A) + 1
b = κ(M) − κ(S¬A)

will result in κ′(SA) < κ′(S¬A). Consequently, κ′ expresses the belief in S → A.
Proof. Let us partition the models in κ into three disjoint sets:

𝓜_1 = {M | M |= ¬S},
𝓜_2 = {M | M |= SA}, and
𝓜_3 = {M | M |= S¬A}.

We address the first rule of Equation 6 first. Its purpose is to let κ′(¬S ∨ A) = 0. The models
in 𝓜_1 ∪ 𝓜_2 are those that model ¬S ∨ A. Therefore we reduce the rank of all models in 𝓜_1 ∪ 𝓜_2 by
κ(¬S ∨ A), which is the rank of the lowest ranked model in 𝓜_1 ∪ 𝓜_2. Hence, κ′(¬S ∨ A) = 0. We
now consider term a of the second rule. We want κ′(SA) < κ′(S¬A) to hold. That means, after revision,
the lowest rank of the models in 𝓜_3 needs to be at least κ′(SA) + 1. Since the models of SA are
found in 𝓜_2 and are therefore shifted by the first rule, κ′(SA) = κ(SA) − κ(¬S ∨ A). Adding 1 is arbitrary
but sufficient to meet the requirements. Term a alone would shift the ranks of all models of 𝓜_3 to the rank
κ′(SA) + 1. To preserve the relative ranking of the models, we need to add term b to the second rule.
Since κ(S¬A) is the rank of the lowest ranked model of 𝓜_3, this very model is still shifted to the rank
κ′(SA) + 1. The other models, however, keep their relative distances.
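The rearrangement of Equation 6 can be written down directly. The following sketch assumes the dictionary-based OCF encoding used above (models mapped to ranks, formulas as predicates); it is an illustration, not the authors' implementation.

    import math

    def rank(kappa, satisfies):
        # kappa(F): lowest rank among the models of F, infinity for contradictions.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def revise_conditional(kappa, S, A):
        # Revision kappa * (A|S) following Equation (6).
        r_sa  = rank(kappa, lambda m: S(m) and A(m))        # kappa(SA)
        r_sna = rank(kappa, lambda m: S(m) and not A(m))    # kappa(S¬A)
        if r_sa < r_sna:
            return dict(kappa)                              # Equation (5) already holds
        r_nsa = rank(kappa, lambda m: (not S(m)) or A(m))   # kappa(¬S ∨ A)
        a = r_sa - r_nsa + 1
        revised = {}
        for model, r in kappa.items():
            if (not S(model)) or A(model):                  # first rule: models of ¬S ∨ A
                revised[model] = r - r_nsa
            else:                                           # second rule: models of S¬A
                revised[model] = a + (r - r_sna)            # a + b
        return revised

With S = 1_a and A = ¬1_b, for instance, this sketch reproduces the revision step shown in Equation 7 of the following section, provided the listed models are the only ones tracked.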
5 NEGATED CONSEQUENTS
What will happen if κ is revised with S → ¬A? Then, an application of Equation 6 will result in
(κ ∗ (¬A|S))(S¬A) < (κ ∗ (¬A|S))(SA).
This does not, however, mean that all models of S¬A receive a rank lower than the models of SA. We show this in the
following example. Let us define two variables a and b with their domains D_a := {1, 2} and D_b := {1, 2, 3}. The
current belief is represented by an OCF, written as an ordered list of models from the lowest rank upward; the
first entry represents the current belief, that is, its model has rank 0, and models grouped in braces share a
rank. A model such as 21 abbreviates the conjunction 2_a ∧ 1_b. Now, we want the following OCF κ_neg to
believe 1_a → ¬1_b:

κ_neg = ( 21, 11, 22, 12, 23, 13 )   ⟶ (κ_neg ∗ (¬1_b | 1_a)) ⟶   κ′_neg = ( 21, 22, 12, {11, 23}, 13 )    (7)

which believes 1_a → 2_b, but not 1_a → 3_b. This behavior is perfectly sane since (1_a → 2_b) ∧ (1_a → 3_b) is a
contradiction. But the belief in (1_a → 1_b) is stronger than the belief in (1_a → 3_b). If we revise κ′_neg with 1_a → ¬2_b,
then the result will be

κ″_neg = (κ′_neg ∗ (¬2_b | 1_a)) = ( 21, 22, {11, 23}, {13, 12} ).    (8)

This expresses a belief in 1_a → 1_b, which is certainly not what we expect an agent to believe if it has just
been exposed to the information 1_a → ¬1_b and 1_a → ¬2_b. Instead, a belief in 1_a → 3_b seems reasonable.
6 GENERALIZATION
We examine a revision by Equation 6 in the context of generalization by studying the effect of omitting
variables from a formula. Let us partition the set of variables V into three non-empty subsets:

V = X ∪ Y ∪ Z, with X ∩ Y = ∅, X ∩ Z = ∅, and Y ∩ Z = ∅.    (9)

Next, take a model from each of the subsets, such as

X ∈ 𝓜_X, Y ∈ 𝓜_Y, and Z ∈ 𝓜_Z.    (10)
The revision κ ∗ (Z|X) will lead to a knowledge base that believes a particular model M′ of the set {𝓜_X∪Y Z},
i.e., one of the full models that contain Z. Next, we consider the other models 𝓒 := {𝓜_X∪Y Z} \ {M′}.
First, there is the obvious restriction that a C ∈ 𝓒 is not allowed to contradict Z. We
already ruled this out in Equation 9. Let us look at the
following sample OCFs over three variables a, b, and c (where, e.g., 212 abbreviates the model 2_a ∧ 1_b ∧ 2_c):

κ_gen = ( 212, 211, 221, 222 )   and   κ′_gen = ( 211, 222, 212, 221 )    (11)

We can easily see that κ_gen believes 2_a → 2_c since κ_gen(2_a 2_c) = 0, but at the same time κ_gen((2_a ∧
2_b) → 2_c) = 1 > 0. A revision with 2_a → 2_c using Equation 6 would not change κ_gen at all.
The same issue occurs for a revision with a conditional that has a negated consequent, such as
2_a → ¬2_c. We show this for κ′_gen, which believes this conditional and would not be changed by a revision
with 2_a → ¬2_c using Equation 6. Nevertheless, it does not believe (2_a ∧ 2_b) → ¬2_c. We conclude that
Equation 6 does not produce an OCF that is capable of generalization.
7 AN ALTERNATIVE REVISION
Because of the described drawbacks we suggest an alternative revision technique. The proposed revision
introduced in this section, (κ ∗ (S → A)), utilizes a new operator κ[A] which returns the highest disbelief
among all models of A, i.e., κ[A] = max{κ(M) : M |= A}. After a revision of κ with the conditional S → A,
we still want the equivalent of Equation 5 to hold:

(κ ∗ (S → A))(SA) < (κ ∗ (S → A))(S¬A)    (12)

This is investigated in the following
Theorem 2. If κ(SA) ≥ κ(S¬A), then the OCF κ′ derived from κ by rearranging the models using

∀ M ∈ 𝓜 : κ′(M) := (κ ∗ (S → A))(M) =
    κ(M) − κ(¬S ∨ A)  :  M |= ¬S ∨ A
    a′ + b′           :  M |= S¬A        (13)

with

a′ = κ[SA] − κ(¬S ∨ A) + 1
b′ = κ(M) − κ(S¬A)

will result in κ′(SA) < κ′(S¬A). Consequently, κ′ expresses the belief in S → A.
Proof. Let κ_1 := (κ ∗ (A|S)) and κ_2 := (κ ∗ (S → A)). Since κ[A] ≥ κ(A), by application of Theorem 1
it can be deduced that κ_2(SA) = κ_1(SA) < κ_1(S¬A) ≤ κ_2(S¬A).
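For comparison with the sketch of Equation 6, the alternative rearrangement of Equation 13 only replaces κ(SA) by κ[SA] in the term a; the encoding is again the illustrative dictionary representation, not the authors' implementation.

    import math

    def rank_min(kappa, satisfies):
        # kappa(F): lowest rank among the models of F (infinity for contradictions).
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return min(ranks) if ranks else math.inf

    def rank_max(kappa, satisfies):
        # kappa[F]: highest rank among the models of F, the new operator of this section.
        ranks = [r for model, r in kappa.items() if satisfies(model)]
        return max(ranks) if ranks else math.inf

    def revise_material(kappa, S, A):
        # Rearrangement of Equation (13), kappa * (S -> A).
        # No early return for already-believed conditionals is included here,
        # since the examples below apply the rearrangement in that situation as well.
        r_nsa = rank_min(kappa, lambda m: (not S(m)) or A(m))      # kappa(¬S ∨ A)
        r_sna = rank_min(kappa, lambda m: S(m) and not A(m))       # kappa(S¬A)
        a = rank_max(kappa, lambda m: S(m) and A(m)) - r_nsa + 1   # a' = kappa[SA] - kappa(¬S∨A) + 1
        revised = {}
        for model, r in kappa.items():
            if (not S(model)) or A(model):                          # first rule of Equation (13)
                revised[model] = r - r_nsa
            else:                                                   # second rule: a' + b'
                revised[model] = a + (r - r_sna)
        return revised

Under the assumption that the four models listed in Equation 11 are the only ones tracked, this rearrangement turns κ_gen into the ordering of Equation 18 below.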
So, concerning the preservation of current belief, this method works just as well as Equation 6, but it
introduces greater changes. In the following discussion of the properties of (κ ∗ (S → A)) with respect to
negation and generalization we justify these changes. First, we consider negation.
Theorem 3. Let t ∈ D_b and κ′ := (κ ∗ (A → ¬t_b)). Then

∀ r ∈ D_b \ {t} : κ′(A → r_b) ≤ κ′(A → t_b).    (14)

Proof. By applying Equation 13, we obtain κ′(A t_b) > κ′[A ¬t_b]. This is equivalent to

∀ r ∈ D_b \ {t} : κ′(A r_b) < κ′(A t_b).

Since κ′(A → r_b) = min(κ′(¬A), κ′(A r_b)) for every r, Equation 14 follows. Hence, if A is believed, the
inequality of Equation 14 will hold strictly. On the other hand, if ¬A is believed, then κ′(A → r_b) = 0
as well as κ′(A → t_b) = 0.
Theorem 3 implies that the inconsistency observed in Section 5 does not appear after the repeated
application of Equation 13. Indeed, a revision of κ_neg with (1_a → ¬1_b) now results in

κ′_neg = (κ_neg ∗ (1_a → ¬1_b)) = ( 21, 22, 12, 23, 13, 11 ).    (15)

Also,

κ″_neg = (κ′_neg ∗ (1_a → ¬2_b)) = ( 21, 22, 23, 13, 11, 12 )    (16)

which illustrates that 1_a → 3_b is now believed, as expected.
We now consider generalization with our alterna-
tive revision technique. Again, Equation 9 and Equa-
tion 10 are given.
Theorem 4. Let X ∈ 𝓜_X, Z ∈ 𝓜_Z, and κ′ := (κ ∗ (X → Z)). Then

∀ Y ∈ 𝓜_Y : κ′(X Y → Z) ≤ κ′(X Y → ¬Z).    (17)

Proof. The proof is an analog of the proof of Theorem 3. After applying Equation 13, we obtain
κ′(X ¬Z) > κ′[X Z]. This is equivalent to

∀ Y ∈ 𝓜_Y : κ′(X Y Z) < κ′(X Y ¬Z).
Hence, if X Y is believed, the inequality of Equation 17 will hold strictly. On the other hand, if
¬(X Y) is believed, then κ′(X Y → Z) = 0 as well as κ′(X Y → ¬Z) = 0.
A similar theorem holds if the consequent is negated.
To complete this section, we show that the previous counter-examples can be resolved using Equation 13.
A revision of κ_gen with 2_a → 2_c now yields

κ_gen = ( 212, 211, 221, 222 )   ⟶ (κ_gen ∗ (2_a → 2_c)) ⟶   ( 212, 222, 211, 221 ).    (18)

The revised OCF expresses a belief in 2_a ∧ 1_b → 2_c and in 2_a ∧ 2_b → 2_c. A revision of κ′_gen with
2_a → ¬2_c provides

κ′_gen = ( 211, 222, 212, 221 )   ⟶ (κ′_gen ∗ (2_a → ¬2_c)) ⟶   ( 211, 221, 222, 212 )    (19)

which now believes 2_a ∧ 1_b → ¬2_c as well as 2_a ∧ 2_b → ¬2_c.
8 APPLICATION
We examine the effect of the proposed algorithm
in a cliff-walk gridworld (Sutton and Barto, 1998)
(Figure 2). For this application six cases are exam-
ined: plain Q-learning, OCF-augmented Q-learning
with application of Equation 6, OCF-augmented Q-
learning with application of Equation 13, plain Q-
learning with futile information, OCF-augmented Q-
learning with application of Equation 6 and futile in-
formation, and OCF-augmented Q-learning with ap-
plication of Equation 13 and futile information. An
OCF-augmented Q-learner is a Q-learner that has
conditionals extracted from its Q-table. These condi-
tionals revise the learner’s OCF and this OCF acts as
a filter for the choice of actions afterwards. Figure 1
shows this architecture.
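The filter of Figure 1 is not spelled out as pseudo-code in the text; the following sketch shows one plausible reading in which actions that the OCF currently rules out for the observed state are removed before a greedy choice over the Q-table. The query function `disbelieved`, the ε-greedy exploration, and all names are assumptions of this sketch.

    import random

    def choose_action(state, actions, Q, disbelieved, epsilon=0.1):
        # Epsilon-greedy action selection with the OCF acting as a filter (cf. Figure 1).
        # `disbelieved(state, action)` stands for a query against the OCF: it returns
        # True if the high-level knowledge currently rules the action out for this state.
        allowed = [a for a in actions if not disbelieved(state, a)] or list(actions)
        if random.random() < epsilon:                               # exploration
            return random.choice(allowed)
        return max(allowed, key=lambda a: Q.get((state, a), 0.0))   # greedy among allowed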
We add futile information to model the case where
the agent perceives properties of its environment that
are not helpful with regard to its goal. The OCF-
augmented Q-learners are expected to be able to gen-
eralize and therefore identify the futile information.
The generalization is performed in the same way as
in (Leopold et al., 2008) by counting the pattern fre-
quency. The general idea is to keep track of how often
sub-patterns of antecedents are used in the context of
particular consequents. If they occur frequently enou-
gh, we will revise the OCF with the sub-pattern in-
stead of the complete state description. The state de-
scription is also adopted from (Leopold et al., 2008),
where a qualitative description is used which consists
of the relative position of the agent towards the goal
(north, south, east, west) and a distance (near, middle,
far) amended with information on adjacent obstacles.
Reaching the goal triggers a reward of 100, getting closer to the goal is rewarded by 0.5. Stepping
into the cliff is punished by −10, and every other step receives −1. After 100 steps the episode is forced to
end.
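A minimal sketch of the pattern-frequency bookkeeping described above; the data types, the enumeration of sub-patterns, and the support threshold are illustrative assumptions rather than the exact scheme of (Leopold et al., 2008).

    from collections import defaultdict
    from itertools import combinations

    # Counts how often a sub-pattern of an antecedent occurred for a given consequent.
    pattern_counts = defaultdict(int)

    def frequent_subpatterns(state_literals, consequent, min_support=5):
        # state_literals: the (variable, value) literals of the full state description.
        # Returns the proper sub-patterns that occurred often enough to be used for
        # revising the OCF instead of the complete state description.
        frequent = []
        for size in range(1, len(state_literals)):
            for sub in combinations(sorted(state_literals), size):
                pattern_counts[(sub, consequent)] += 1
                if pattern_counts[(sub, consequent)] >= min_support:
                    frequent.append(frozenset(sub))
        return frequent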
The results are depicted in Figure 3. It is evident
that a revision with Equation 13 clearly outperforms
a revision with Equation 6. The latter is worse than
a plain Q-learner and even seems to deteriorate over
time. An explanation for this may lie in the fact that
the OCF gets contaminated by harmful conditionals.
However, this has not been examined in this work.
The computational cost of the described improve-
ment depends on the representation of the OCF. If the
OCF is implemented by creating every possible con-
junction beforehand, then Equation 6 and Equation 13
will lead to roughly the same runtime, because κ[A] is
the rank of A in a reversed κ.
In a different approach we initialized the OCF
without any conjunctions to be able to handle larger
problems. Conjunctions not occurring in the OCF re-
ceived a rank of infinity. Then the revision process
generates conjunctions as needed. Clearly, this breaks
the symmetry between κ(A) and κ[A]. The runtime of this approach is about 1.5 times that of the previously
described approach.
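A sketch of this sparse representation, assuming a simple dictionary with a default rank of infinity; the class name and interface are illustrative, not taken from the paper.

    import math

    class SparseOCF:
        # Only explicitly ranked conjunctions are stored; everything else
        # defaults to rank infinity and entries are created lazily by revision.
        def __init__(self):
            self.ranks = {}                      # model -> rank; missing means infinity

        def rank(self, model):
            return self.ranks.get(model, math.inf)

        def set_rank(self, model, rank):
            self.ranks[model] = rank             # created on demand during revision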
Figure 1: OCF-augmented RL system. The OCF acts as a filter that limits the choices of the policy. (The diagram connects the environment, the decoded state, the Q-table, the OCF filter, and the policy.)
9 CONCLUSIONS
The theoretical considerations presented in this work alleviate a severe disadvantage of the state-of-the-art
revision of ordinal conditional functions with conditional information. The presented examples clearly indicate
the removal of quite apparent implausibilities.
Figure 2: Cliff-walk gridworld. The goal of a moving agent is to reach the green square, starting from the red one. Entering the dark squares (representing a cliff) results in a high negative reward. Superimposed is the learned path after 100 episodes. The path color indicates the expected reward by displaying the value of min(1, expected reward / goal reward) using the displayed color key.
Figure 3: Results. The diagrams show the rewards over a series of 300 episodes. On the top the results with futile information are depicted, on the bottom the results without futile information. plain/plainn show results of a plain Q-learner, old/oldn show results of revisions with Equation 6, and new/newn show results of revisions with Equation 13. For the OCF-augmented learners the values are averages of 1000 runs. Since the plain Q-learner exhibits large variations, its values have been averaged over 2000 runs.
The aptitude of this approach to create an agent which shows emergent understanding of its environment
needs to be examined in more detail. In particular, the analysis of the symbolic belief representation in
different contexts is on our agenda. First experiments indicate the accumulation of a proper symbolic
description of favorable state-action pairs.
The symbolic representation also allows for symbolic reasoning to incorporate a top-down path of
learning. The combination of these techniques is definitely of interest and needs to be addressed in future
publications.
ACKNOWLEDGEMENTS
This research was funded by the German Research Foundation (DFG) under Grant PE 887/3-3.
REFERENCES
Alchourrón, C. E., Gärdenfors, P., and Makinson, D. (1985). On the logic of theory change: Partial meet contraction and revision functions. J. Symbolic Logic, 50(2):510–530.
Anderson, J. R. (1983). The architecture of cognition. Harvard University Press, Cambridge, MA.
Darwiche, A. and Pearl, J. (1996). On the logic of iterated
belief revision. Artificial intelligence, 89:1–29.
Dzeroski, S., Raedt, L. D., and Driessens, K. (2001). Rela-
tional reinforcement learning. In Machine Learning,
volume 43, pages 7–52.
Gombert, J.-E. (2003). Implicit and explicit learning to
read: Implication as for subtypes of dyslexia. Current
psychology letters, 1(10).
Häming, K. and Peters, G. (2010). An alternative approach to the revision of ordinal conditional functions in the context of multi-valued logic. In Proceedings of the 20th International Conference on Artificial Neural Networks, volume LNCS 6353, pages 200–203.
Kern-Isberner, G. (2001). Conditionals in nonmonotonic
reasoning and belief revision: considering condition-
als as agents. Springer-Verlag New York, Inc.
Leopold, T., Kern-Isberner, G., and Peters, G. (2008). Combining reinforcement learning and belief revision: A learning system for active vision. In Proceedings of the 19th British Machine Vision Conference, pages 473–482.
Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 3(118):219–235.
Spohn, W. (2009). A survey of ranking theory. In Degrees
of Belief. Springer.
Sun, R., Merrill, E., and Peterson, T. (2001). From implicit
skills to explicit knowledge: A bottom-up model of
skill learning. In Cognitive Science, volume 25(2),
pages 203–244.
Sun, R., Terry, C., and Slusarz, P. (2005). The interaction of
the explicit and the implicit in skill learning: A dual-
process approach. Psychological Review, 112:159–
192.
Sun, R., Zhang, X., Slusarz, P., and Mathews, R. (2006).
The interaction of implicit learning, explicit hypothe-
sis testing, and implicit-to-explicit knowledge extrac-
tion. Neural Networks, 1(20):34–47.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge.
Ye, C., Yung, N. H. C., and Wang, D. (2003). A fuzzy
controller with supervised learning assisted reinforce-
ment learning algorithm for obstacle avoidance. IEEE
Transactions on Systems, Man, and Cybernetics, Part
B, 33(1):17–27.