Hybrid POMDP-BDI
An Agent Architecture with Online Stochastic Planning and
Desires with Changing Intensity Levels
Gavin Rens and Thomas Meyer
Centre for Artificial Intelligence Research, University of KwaZulu-Natal, and CSIR Meraka, Pretoria, South Africa
Keywords:
POMDP, BDI, Online Planning, Desire Intensity, Preference.
Abstract:
Partially observable Markov decision processes (POMDPs) and the belief-desire-intention (BDI) framework
have several complementary strengths. We propose an agent architecture which combines these two powerful
approaches to capitalize on their strengths. Our architecture introduces the notion of intensity of the desire
for a goal’s achievement. We also define an update rule for the desire levels of goals, and a rule for deciding when to
select a new goal to focus on. To verify that the proposed architecture works, experiments were run with an agent
based on the architecture, in a domain where multiple goals must continually be achieved. The results show
that (i) while the agent is pursuing goals, it can concurrently perform rewarding actions not directly related to
its goals, (ii) the trade-off between goals and preferences can be set effectively and (iii) goals and preferences
can be satisfied even while dealing with stochastic actions and perceptions. We believe that the proposed
architecture furthers the theory of high-level autonomous agent reasoning.
1 INTRODUCTION
Imagine a scenario where a planetary rover has four
main tasks and one task it can do when it does not
interfere with performing the main tasks. The main
tasks could be, for instance, collecting gas (for indus-
trial use) from a natural vent at the base of a hill, tak-
ing a temperature measurement at the top of the hill,
performing self-diagnostics and repairs, and recharging its batteries at the solar charging station. The
less important task is to collect soil samples wherever
the rover is. The rover is programmed to know the
relative importance of collecting soil samples. The
rover also has a model of the probabilities with which
its various actuators fail and the probabilistic noise-
profile of its various sensors. The rover must be able
to reason (plan) in real-time to pursue the right task
at the right time while considering its resources and
dealing with unforeseen events, all while considering
the uncertainties about its actions (actuators) and per-
ceptions (sensors).
We propose an architecture for the proper con-
trol of an agent in a complex environment such as
the scenario described above. The architecture com-
bines belief-desire-intention (BDI) theory (Bratman,
1987; Rao and Georgeff, 1995) and partially observ-
able Markov decision processes (POMDPs) (Mona-
han, 1982; Lovejoy, 1991). Traditional BDI architec-
tures (BDIAs) cannot deal with probabilistic uncer-
tainties and they do not generate plans in real-time. A
traditional POMDP cannot manage goals (major and
minor tasks) as well as BDIAs can. Next, we analyse POMDPs and BDIAs in a little more detail.
One of the benefits of agents based on BDI theory
is that they need not generate plans from scratch; their
plans are already (partially) compiled, and they can
act quickly once a goal is focused on. Furthermore,
the BDI framework can deal with multiple goals.
However, their plans are usually not optimal, and it
may be difficult to find a plan which is applicable to
the current situation. On the other hand, POMDPs can generate optimal policies on the spot, which makes them highly applicable to the current situation. Moreover, poli-
cies account for stochastic actions and partially ob-
servable environments. Unfortunately, generating op-
timal POMDP policies is usually intractable. One so-
lution to the intractability of POMDP policy genera-
tion is to employ a continuous planning strategy, or
agent-centred search (Koenig, 2001). Aligned with
agent-centred search is the forward-search approach
or online planning approach in POMDPs (Ross et al.,
2008).
The traditional BDIA maintains goals as desires;
there is no reward for performing some action in some
state. The reward function provided by POMDP the-
ory is useful for modeling certain kinds of behavior
or preferences. For instance, an agent based on a
POMDP may want to avoid moist areas to prevent its
parts becoming rusty. Moreover, a POMDP agent can
generate plans which can optimally avoid moist areas.
But one would not say that avoiding moist areas is
the agent’s goal. A BDI agent cannot simultaneously
avoid moist areas and collect gold; it has to switch
between the two or combine the desire to avoid moist
areas with every other goal.
However, POMDP theory maintains a single re-
ward function; there is no possibility of weighing al-
ternative reward functions and pursuing one at a time
for a fixed period—all objectives must be considered
simultaneously, in one reward function. Reasoning
about objectives in POMDP theory is not as sophisti-
cated as in BDI theory.
We argue that maintenance goals like avoiding
moist areas (or collecting soil samples) should rather
be viewed as a preference and modeled as a POMDP
reward function. And specific tasks to complete (like
collecting gas or keeping its battery charged) should
be modeled as BDI desires.
Given the advantages of POMDP theoretic reason-
ing and the potentially sophisticated means-ends rea-
soning of BDI theory, we propose to combine the best
features of these two theories in a coherent agent ar-
chitecture. We call it the Hybrid POMDP-BDI agent
architecture (or HPB architecture, for short).
In BDI theory, one of the big challenges is to know
when the agent should switch its current goal and
what its new goal should be (Schut et al., 2004). To
address this challenge with an intuitive explanation,
we propose that an agent should maintain intensity
levels of desire for every goal. (This intensity of de-
sire could be interpreted as a kind of emotion.) The
goal most intensely desired should be the current goal
sought (the intention). We also define the notion of
how much an intention is satisfied in the agent’s cur-
rent belief-state.
Typically, BDI agents do not deal with stochastic
uncertainty. Integrating POMDP notions into a BDIA
addresses this. For instance, an HPB agent will main-
tain a (subjective) belief-state representing its proba-
bilistic (uncertain) belief about its current state. Plan-
ning with models of stochastic actions and percep-
tions is thus possible in the proposed architecture.
The tight integration of POMDPs and BDIAs is novel,
especially in combination with desires with changing
intensity levels.
Section 2 briefly reviews the necessary theory.
The proposed agent architecture is presented in Sec-
tion 3 and formally defined. Section 4 shows an im-
plementation of the architecture on an example do-
main and evaluates the performance on various di-
mensions, confirming that the approach may be useful
in some domains. The last section discusses some re-
lated work and points out some future directions for
research in this area.
2 PRELIMINARIES
The basic components of a BDI architecture
(Wooldridge, 1999, 2002) are
- a set or knowledge-base B of beliefs;
- an option generation function wish, generating the objectives the agent would ideally like to pursue (its desires);
- a set of desires D (goals to be achieved);
- a ‘focus’ function which selects intentions from the set of desires;
- a structure of intentions I of the most desirable options/desires returned by the focus function;
- a library of plans and subplans;
- a ‘reconsideration’ function which decides whether to call the focus function;
- an execution procedure, which affects the world according to the plan associated with the intention;
- a sensing or perception procedure, which gathers information about the state of the environment; and
- a belief update function, which updates the agent’s beliefs according to its latest observations and actions.
Exactly how these components are implemented results in a particular BDI architecture.
Algorithm 1 (adapted from Wooldridge (2000, Fig. 2.3)) is a basic BDI agent control loop. π is the current plan to be executed. getPercept(·) senses the environment and returns a percept (processed sensor data) which is an input to update(·). plan(·) returns a plan from the plan library to achieve the agent’s current intentions. wish : B × I → D generates a set of desires, given the agent’s beliefs, current intentions and possibly its innate motives. It is usually impractical for an agent to pursue the achievement of all its desires. It must thus filter out the most valuable desires and desires that are believed possible to achieve. This is the function of focus : B × D × I → I, taking beliefs, desires and current intentions as parameters. Together, the processes performed by wish and focus may be called deliberation, formally encapsulated by the deliberate procedure.
Algorithm 1: Basic BDI agent control loop.
Input: B0: initial beliefs
Input: I0: initial intentions
1  B ← B0;
2  I ← I0;
3  π ← null;
4  while alive do
5      p ← getPercept();
6      B ← update(B, p);
7      D ← wish(B, I);
8      I ← focus(B, D, I);
9      π ← plan(B, I);
10     execute(π);
Algorithm 2 (adapted from Schut and Wooldridge (2001b)) has some more sophisticated controls. It controls when the agent considers whether to re-deliberate, with the reconsider function (line 7) placed just before deliberation would take place. reconsider(·) is a Boolean function which tells the agent whether to reconsider its intentions (every time line 7 is reached).
The agent tests at every iteration of the main loop whether the currently pursued intention is still possibly achievable, using impossible(·). The algorithm also takes advantage of serendipity by periodically testing—using succeeded(·)—whether the intention has been achieved without the plan being fully executed.
Algorithm 2: Control loop for an agent with reconsideration.
Input: B0: initial beliefs
Input: I0: initial intentions
1  B ← B0;
2  I ← I0;
3  π ← null;
4  while alive do
5      p ← getPercept();
6      B ← update(B, p);
7      if reconsider(B, I) then
8          D ← wish(B, I);
9          I ← focus(B, D, I);
10     if not sound(π, I, B) then π ← plan(B, I);
11     if not empty(π) then
12         α ← head(π);
13         execute(α);
14         π ← tail(π);
15     I ← succeeded(I, B);
16     I ← impossible(I, B);
This agent is considered ‘reactive’ because it executes one action per loop iteration; this allows for deliberation between executions. The soundness (or applicability) of the plan to achieve the current intention is checked at every iteration of the loop.
There are various mechanisms that an agent might
use to decide when to reconsider its intentions. See,
for instance, Bratman (1987); Pollack and Ringuette
(1990); Kinny and Georgeff (1991, 1992); Schut and
Wooldridge (2000, 2001a); Schut et al. (2004).
In a partially observable Markov decision process
(POMDP), the actions the agent performs have non-
deterministic effects in the sense that the agent can
only predict with a likelihood in which state it will
end up after performing an action. Furthermore, its
perception is noisy. That is, when the agent uses its
sensors to determine in which state it is, it will have a
probability distribution over a set of possible states to
reflect its conviction for being in each state.
Formally (Kaelbling et al., 1998), a POMDP is a tuple ⟨S, A, T, R, Z, P, b0⟩ with
- S, a finite set of states of the world (that the agent can be in),
- A, a finite set of actions (that the agent can choose to execute),
- a transition function T(s, a, s′), the probability of being in s′ after performing action a in state s,
- R(a, s), the immediate reward gained for executing action a while in state s,
- Z, a finite set of observations the agent can perceive in its world,
- a perception function P(s′, a, z), the probability of observing z in state s′ resulting from performing action a in some other state, and
- b0, the initial probability distribution over all states in S.
A belief-state b is a set of pairs ⟨s, p⟩ where each state s in b is associated with a probability p. All probabilities must sum to one; hence, b forms a probability distribution over the set S of all states. To update the agent’s beliefs about the world, a special function SE(z, a, b) = b_n is defined as

    b_n(s′) = [ P(s′, a, z) Σ_{s∈S} T(s, a, s′) b(s) ] / Pr(z | a, b),   (1)

where a is an action performed in the ‘current’ belief-state b, z is the resultant observation and b_n(s′) denotes the probability of the agent being in state s′ in the ‘new’ belief-state b_n. Note that Pr(z | a, b) is a normalizing constant.
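To make the state-estimation step concrete, here is a minimal Python sketch of Equation (1), assuming the belief-state is a plain dictionary from states to probabilities and that T and P are supplied as ordinary functions; the names are our own illustrative choices, not prescribed beyond the formula itself.

```python
def belief_update(b, a, z, states, T, P):
    """Compute b_n = SE(z, a, b) as in Equation (1).

    b           : dict mapping state -> probability (current belief-state)
    a, z        : the action performed and the observation received
    states      : iterable of all states S
    T(s, a, s2) : probability of ending in s2 after doing a in s
    P(s2, a, z) : probability of observing z in s2 after doing a
    """
    unnormalized = {}
    for s2 in states:
        # P(s'|a,z) * sum_s T(s,a,s') b(s)
        unnormalized[s2] = P(s2, a, z) * sum(T(s, a, s2) * b[s] for s in states)
    pr_z = sum(unnormalized.values())   # Pr(z | a, b), the normalizing constant
    if pr_z == 0.0:
        raise ValueError("observation z has zero probability under belief b and action a")
    return {s2: p / pr_z for s2, p in unnormalized.items()}
```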
Let the planning horizon h (also called the look-
ahead depth) be the number of future steps the
HybridPOMDP-BDI-AnAgentArchitecturewithOnlineStochasticPlanningandDesireswithChangingIntensityLevels
7
agent plans ahead each time it selects its next action.
V*(b, h) is the optimal value of future courses of action the agent can take with respect to a finite horizon h, starting in belief-state b. This function assumes that at each step the action that maximizes the value will be selected.
Because the reward function R(a, s) provides feedback about the utility of a particular state s (due to a being executed in it), an agent which does not know in which state it is cannot use this reward function directly. The agent must consider, for each state s, the probability b(s) of being in s, according to its current belief-state b. Hence, a belief reward function ρ(a, b) is defined, which takes a belief-state as argument:

    ρ(a, b) = Σ_{s∈S} R(a, s) b(s).
The optimal state-value function is defined by

    V*(b, h) = max_{a∈A} [ ρ(a, b) + γ Σ_{z∈Z} Pr(z | a, b) V*(SE(z, a, b), h − 1) ],

where 0 ≤ γ < 1 is a factor to discount the value of future rewards and Pr(z | a, b) denotes the probability of reaching belief-state b_n = SE(z, a, b). While V* denotes the optimal value of a belief-state, the function Q* denotes the optimal action-value:

    Q*(a, b, h) = ρ(a, b) + γ Σ_{z∈Z} Pr(z | a, b) V*(SE(z, a, b), h − 1)

is the value of executing a in the current belief-state, plus the total expected value of belief-states reached thereafter.
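For illustration, the following sketch renders ρ, Q* and V* directly as recursive Python functions, reusing belief_update from the sketch above. It enumerates every observation branch, so it is exponential in h; it mirrors the definitions rather than being an efficient solver, the parameter names are our own assumptions, and the recursion bottoms out at h = 1.

```python
def rho(a, b, states, R):
    """Belief reward rho(a, b): expected immediate reward of a under belief b."""
    return sum(R(a, s) * b[s] for s in states)

def pr_obs(z, a, b, states, T, P):
    """Pr(z | a, b): probability of observing z after doing a in belief-state b."""
    return sum(P(s2, a, z) * sum(T(s, a, s2) * b[s] for s in states)
               for s2 in states)

def q_value(a, b, h, states, actions, observations, T, P, R, gamma):
    """Q*(a, b, h) as defined above (brute-force look-ahead)."""
    value = rho(a, b, states, R)
    if h <= 1:
        return value
    for z in observations:
        pz = pr_obs(z, a, b, states, T, P)
        if pz > 0.0:
            b_next = belief_update(b, a, z, states, T, P)
            value += gamma * pz * v_value(b_next, h - 1, states, actions,
                                          observations, T, P, R, gamma)
    return value

def v_value(b, h, states, actions, observations, T, P, R, gamma):
    """V*(b, h) = max over actions of Q*(a, b, h)."""
    return max(q_value(a, b, h, states, actions, observations, T, P, R, gamma)
               for a in actions)
```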
3 THE HPB ARCHITECTURE
A hybrid POMDP-BDI (HPB) agent maintains (i) a
belief-state which is periodically updated, (ii) a map-
ping from goals to numbers representing the level of
desire to achieve the goals, and (iii) the current in-
tention, the goal with the highest desire level. As the
agent acts, its desire levels are updated and it may
consider choosing a new intention based on new de-
sire levels.
The state of an HPB agent is defined by the tuple
hB,D,Ii, where B is the agent’s current belief-state
(i.e., a probability distribution over the states S, de-
fined below), D is the agent’s current desire function
and I is the agent’s current intention. We’ll have more
to say about D and I a little later.
An HPB agent could be defined by the tuple ⟨Atrb, G, A, Z, T, P, Util⟩, where
Atrb is a set of attribute-sort pairs (for short, the attribute set). For every (atrb : sort) ∈ Atrb, atrb is the name or identifier of an attribute of interest in the domain of interest, like BattryLevel or WeekDay, and sort is the set from which atrb can take a value, for instance, real numbers in the range [0, 55] or a list of values like {Mon, Tue, Wed, Thu, Fri}. So {(BattryLevel : [0, 55]), (WeekDay : {Mon, Tue, Wed, Thu, Fri})} could be an attribute set.
Let N = {atrb | (atrb : sort) ∈ Atrb} be the set of all attribute names. We define a state s induced from Atrb as one possible way of assigning values to attributes: s = {(atrb : v) | atrb ∈ N, (atrb : sort) ∈ Atrb, v ∈ sort} such that if (atrb : v), (atrb′ : v′) ∈ s and atrb = atrb′, then v = v′. The set of all possible states is denoted S.
G is a set of goals. A goal g ∈ G is a set defined as follows. Given a fixed set N_g ⊆ N, g = {(atrb : v) | atrb ∈ N_g, (atrb : sort) ∈ Atrb, v ∈ sort} such that if (atrb : v), (atrb′ : v′) ∈ g and atrb = atrb′, then v = v′. For instance, {(BattryLevel : 13), (WeekDay : Tue)} is a goal, and so are {(BattryLevel : 33)} and {(WeekDay : Wed)}. It is even possible to have one goal overlap or be a subset of another goal. For instance, one is allowed to have {(BattryLevel : 13), (WeekDay : Tue)} ∈ G and simultaneously {(BattryLevel : 13)}, {(BattryLevel : 14), (WeekDay : Tue)} ∈ G. In this architecture, it is assumed that the set of goals is given.
A is a finite set of actions.
Z is a finite set of observations.
T is the transition function of POMDPs.
P is the perception function of POMDPs.
Util consists of two functions, Pref and Satf, which allow an agent to determine the utilities of alternative sequences of actions: Util = ⟨Pref, Satf⟩.
Pref is the preference function, with range [0, 1] ⊂ R. It takes an action a and a state s, and returns the degree of preference for performing a in s. That is, Pref(a, s) ∈ [0, 1]. Numbers closer to 1 imply greater preference and numbers closer to 0 imply less preference. Except for the range restriction to [0, 1], it has the same definition as a POMDP reward function, but its name indicates that it models the agent’s preferences and not what is typically thought of as rewards. An HPB agent gets ‘rewarded’ by achieving its goals. The preference function is especially important to model action costs; the agent should prefer ‘inexpensive’ actions. Pref has a local flavor. Designing the preference function to have a value lying in [0, 1] may sometimes be challenging, but we believe it is always possible.
Satf is the satisfaction function, with range [0, 1] ⊂ R. It takes a state s and an intention I, and returns a value representing the degree to which the state satisfies the intention. That is, Satf(I, s) ∈ [0, 1]. It is completely up to the agent designer to decide how the satisfaction function is defined, as long as numbers closer to 1 mean more satisfaction and numbers closer to 0 mean less satisfaction. Satf has a global flavor.
The desire function D is a total function from goals in G into the positive real numbers R⁺. The real number represents the intensity or level of desire of the goal. For instance, ({(BattryLevel : 13), (WeekDay : Tue)}, 2.2) could be in D, meaning that the goal of having the battery level at 13 and the week-day Tuesday is desired with a level of 2.2. ({(BattryLevel : 33)}, 56) and ({(WeekDay : Wed)}, 444) are also examples of desires in D.
I is the agent’s current intention; an element of G;
the goal with the highest desire level. This goal will
be actively pursued by the agent, shifting the impor-
tance of the other goals to the background. The fact
that only one intention is maintained makes the HPB
agent architecture quite different to standard BDIAs.
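To make the formal definitions more tangible, here is a small illustrative Python encoding (our own choice, not prescribed by the architecture) of an attribute set, a state, two goals, a desire function D and an intention I, with goals represented as frozensets of attribute-value pairs so that they may overlap as described above.

```python
# Attribute set: attribute name -> sort (the set of allowed values).
# range(0, 56) is an integer approximation of the interval [0, 55].
Atrb = {
    "BattryLevel": range(0, 56),
    "WeekDay": {"Mon", "Tue", "Wed", "Thu", "Fri"},
}

# A state assigns a value to *every* attribute.
state = frozenset({("BattryLevel", 13), ("WeekDay", "Tue")})

# A goal assigns values to a *subset* of the attributes; goals may overlap.
g1 = frozenset({("BattryLevel", 13), ("WeekDay", "Tue")})
g2 = frozenset({("BattryLevel", 33)})

# Desire function D: goal -> intensity (a positive real), as in the examples above.
D = {g1: 2.2, g2: 56.0}

# The current intention is the most intensely desired goal.
I = max(D, key=D.get)

def goal_holds(goal, s):
    """True iff every attribute-value pair required by the goal holds in state s."""
    return goal <= s

print(goal_holds(g1, state))   # True
print(goal_holds(g2, state))   # False
```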
Figure 1 shows a flow diagram representing the
operational semantics of the HPB architecture.
The satisfaction an agent gets for an intention in its current belief-state is defined as

    Satf_β(I, B) = Σ_{s∈S} Satf(I, s) B(s),

where Satf(I, s) is defined above and B(s) is the probability of being in state s. The definition of Pref_β has the same form as the reward function ρ over belief-states in POMDP theory:

    Pref_β(a, B) = Σ_{s∈S} Pref(a, s) B(s),

where Pref(a, s) was discussed above.
We propose the following desire update rule:

    D(g) ← D(g) + 1 − Satf_β(g, B).   (2)

Rule (2) is defined so that as Satf_β(g, B) tends to one (total satisfaction), the intensity with which the incumbent goal is desired does not increase. On the other hand, as Satf_β(g, B) becomes smaller (more dissatisfaction), the goal’s intensity is incremented. The rule transforms D with respect to B and g. The more a goal is being satisfied, the less its intensity grows. The update rule thus defines how a goal’s intensity changes over time with respect to satisfaction.
Note that desire levels never decrease. This does
not reflect reality. It is however convenient to rep-
resent the intensity of desires like this: only relative
differences in desire levels matter in our approach and
we want to avoid unnecessarily complicating the ar-
chitecture.
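A minimal sketch of Satf_β and of update rule (2), assuming the dictionary belief-states and frozenset goals of the earlier sketches and a user-supplied Satf(I, s) with values in [0, 1]; the function names are ours.

```python
def satf_beta(satf, intention, belief):
    """Satf_beta(I, B) = sum over s of Satf(I, s) * B(s)."""
    return sum(satf(intention, s) * p for s, p in belief.items())

def update_desires(D, belief, satf):
    """Apply rule (2) to every goal: D(g) <- D(g) + 1 - Satf_beta(g, B)."""
    for g in D:
        D[g] += 1.0 - satf_beta(satf, g, belief)
    return D
```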
An HPB agent controls its behaviour according to
the policies it generates. Plan is a procedure which
generates a POMDP policy π of depth h. Essentially,
we want to consider all action sequences of length h
and the belief-states in which the agent would find it-
self if it followed the sequences. Then we want to
choose the sequence (or at least its first action) which
yields the highest preference and which ends in the
belief-state most satisfying with respect to the inten-
tion.
During planning, preferences and intention satisfaction must be maximized. The main function used in the Plan procedure is the HPB action-value function Q_HPB, giving the value of some action a, conditioned on the current belief-state B, intention I and look-ahead depth h:

    Q_HPB(a, B, I, h) = α Satf_β(I, B) + (1 − α) Pref_β(a, B) + γ Σ_{z∈Z} Pr(z | a, B) max_{a′∈A} Q_HPB(a′, B′, I, h − 1),

    Q_HPB(a, B, I, 1) = α Satf_β(I, B) + (1 − α) Pref_β(a, B),

where B′ = SE(z, a, B), 0 ≤ α ≤ 1 is the goal/preference ‘trade-off’ factor, γ is the normal POMDP discount factor and SE is the normal POMDP state estimation function.
To keep things simple for this introductory paper, we define Plan to return argmax_{a∈A} Q_HPB(a, B, I, h), the trivial policy of a single action. In general, Plan could return a policy of depth h, that is, a sequence of h actions, where the choice of exactly which action to take at each step depends on the observation received just prior.
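The Q_HPB recursion and the one-action Plan described above can be sketched as follows, reusing belief_update, pr_obs and satf_beta from the earlier sketches; `m` is assumed to bundle the domain ingredients (states, actions, observations, T, P, Satf, Pref), for example as a types.SimpleNamespace or a dataclass. Like the definition itself, the sketch enumerates all observation branches and is exponential in h.

```python
def pref_beta(pref, a, belief):
    """Pref_beta(a, B) = sum over s of Pref(a, s) * B(s)."""
    return sum(pref(a, s) * p for s, p in belief.items())

def q_hpb(a, B, I, h, m, alpha, gamma):
    """Q_HPB(a, B, I, h); m bundles states, actions, observations, T, P, satf, pref."""
    value = (alpha * satf_beta(m.satf, I, B)
             + (1.0 - alpha) * pref_beta(m.pref, a, B))
    if h <= 1:
        return value                                   # base case Q_HPB(a, B, I, 1)
    for z in m.observations:
        pz = pr_obs(z, a, B, m.states, m.T, m.P)       # Pr(z | a, B)
        if pz > 0.0:
            B_next = belief_update(B, a, z, m.states, m.T, m.P)
            value += gamma * pz * max(q_hpb(a2, B_next, I, h - 1, m, alpha, gamma)
                                      for a2 in m.actions)
    return value

def plan(B, I, h, m, alpha, gamma):
    """The trivial one-action policy used in this paper: argmax_a Q_HPB(a, B, I, h)."""
    # m could be built, e.g., as
    # m = types.SimpleNamespace(states=..., actions=..., observations=...,
    #                           T=..., P=..., satf=..., pref=...)
    return max(m.actions, key=lambda a: q_hpb(a, B, I, h, m, alpha, gamma))
```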
Focus is a function which returns one member of
G called the (current) intention I. Presently, we de-
fine it simply as selecting the goal with the highest
desire level. After every execution of an action in the
real-world, Refocus is called to decide whether to call
Focus to select a new intention. Refocus is a meta-
reasoning function analogous to the reconsider func-
tion discussed in Section 2. It is important to keep the
agent focused on one goal long enough to give it a rea-
sonable chance of achieving it. It is the job of Refocus
to recognize when the current intention seems impos-
sible or too expensive to achieve.
Let Satf_levels be the sequence of satisfaction levels of the current intention since it became active and let MEMORY be a designer-specified number representing the length of a sub-sequence of Satf_levels—the MEMORY last satisfaction levels.
[Figure 1 (flow diagram): the components Focus, Plan, Replan, Execute, Sense, Belief Update, Desire Update and Refocus, exchanging information (G, I, π, B, D, SL) along information-flow and control-flow edges.]
Figure 1: Operational semantics of the HPB architecture. SL stands for Satf_levels. Note that Satf_levels depends on the current belief-state and intention, but not on desire levels. Planning is also independent of desire levels. The focus function depends on desire levels, but not on satisfaction. Whether to refocus depends on satisfaction levels, but not on desire levels.
ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence
10
One possible definition of Refocus is

    Refocus(c, θ) =
        ‘no’   if |Satf_levels| < MEMORY,
        ‘yes’  if c < θ,
        ‘no’   otherwise,
where c is the average change from one satisfaction
level to the next in the agent’s ‘MEMORY’, and θ
is some threshold chosen by the agent designer. If
the agent is expected to increase its satisfaction by
at least, say, 0.1 on average for the current intention,
then θ should be set to 0.1. With this approach, if the
agent ‘gets stuck’ trying to achieve its current inten-
tion, it will not blindly keep on trying to achieve it,
but will start pursuing another goal (with the highest
desire level). Note that if an intention was not well
satisfied, its desire level still increases at a relatively
high rate. So whenever the agent focuses again, a goal
not well satisfied in the past will be a top contender to
become the intention (again).
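One possible Python rendering of Focus and the Refocus test above, computing c from a plain list of recorded satisfaction levels; MEMORY and θ are the designer-chosen constants from the text, and we assume MEMORY ≥ 2.

```python
def focus(D):
    """Select the goal with the highest desire level as the new intention."""
    return max(D, key=D.get)

def refocus(satf_levels, memory, theta):
    """Return True ('yes') if the agent should choose a new intention."""
    if len(satf_levels) < memory:
        return False                      # not enough history yet: keep the intention
    recent = satf_levels[-memory:]
    # c: average change from one satisfaction level to the next within MEMORY.
    avg_change = (recent[-1] - recent[0]) / (memory - 1)
    return avg_change < theta
```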
4 EVALUATION
We performed some tests on an HPB agent in a six-
by-six grid-world. In this world, the agent’s task is to
visit each of the four corners, while collecting items
on the way. That is, the agent’s goals are the states
representing the four corners, but the collecting of
items is regarded as a preferred behavior, not a goal
to be pursued.
States are quadruples ⟨x, y, d, t⟩, with x, y ∈ {1, . . . , 6} being the coordinates of the agent’s position in the world, d ∈ {North, East, West, South} the direction it is facing, and t ∈ {0, 1}, where t = 1 if an item is present in the cell with the agent, else t = 0. The agent can perform five actions {left, right, forward, see, collect}, meaning, turn left, turn right, move one cell forward, see whether an item is present, and collect an item. The only observation possible when executing one of the physical actions is obsNil, the null observation; see has possible observations from the set {0, 1}, indicating whether the agent sees the presence of an item (1) or not (0).
Next, we define the possible outcomes of each action. When the agent turns left or right, it can get stuck in the same direction, turn 90°, or overshoot by 90°. When the agent moves forward, it moves one cell in the direction it is facing or it gets stuck and does not move. The agent can see an item or see nothing (no item in the cell), and collecting is deterministic (if there is an item present and the agent executes collect, the item will be collected with certainty). All actions except collect are designed so that the correct outcome is achieved 95% of the time and incorrect outcomes occur 5% of the time.
So that the agent does not get lost too quickly, we have included an automatic localization action, that is, a sensing action that returns information about the agent’s approximate location. The action is automatic
because the agent cannot choose whether to perform
it; the agent localizes itself after every regular/chosen
action is executed. However, just as with regular ac-
tions, the localization sensor is noisy, and it correctly
reports the agent’s location with probability 0.95, else
the sensor reports a location adjacent to the agent with
probability uniformly distributed over 0.05.
Errors in the agent’s actions and perceptions are
thus modeled, not ignored.
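As an illustration of how the 95/5 action noise and the noisy localization sensor could be encoded, here is a small sketch; splitting the 5% error mass equally between the two turning failure modes is our assumption, since the text does not specify the split.

```python
import random

DIRECTIONS = ["North", "East", "South", "West"]

def turn_outcome(d, a):
    """Sample the direction after a turn: the intended 90-degree turn with
    probability 0.95, otherwise the agent gets stuck or overshoots by 90 degrees
    (the two error cases are assumed to share the remaining 0.05 equally)."""
    i = DIRECTIONS.index(d)
    step = 1 if a == "right" else -1
    r = random.random()
    if r < 0.95:
        return DIRECTIONS[(i + step) % 4]        # intended 90-degree turn
    elif r < 0.975:
        return d                                 # stuck: no turn
    else:
        return DIRECTIONS[(i + 2 * step) % 4]    # overshoot by 90 degrees

def localize(x, y):
    """Automatic localization: report the true cell with probability 0.95,
    otherwise a uniformly chosen adjacent cell (clipped to the 6x6 grid)."""
    if random.random() < 0.95:
        return (x, y)
    neighbours = [(x + dx, y + dy)
                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 1 <= x + dx <= 6 and 1 <= y + dy <= 6]
    return random.choice(neighbours)
```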
In the experiments which follow, the threshold θ
is set to 0.05, MEMORY is set to 5 and h = 4. De-
sire levels are initially set to zero for all goals. Three experiments were performed: first, visiting corners but not collecting items; second, collecting items but not intentionally visiting corners; and third, visiting corners while collecting items. For each experiment,
10 trials were run with the agent starting in random
locations and performing 100 actions per trial. We
let Satf(I, s) = 1 − dist/10, where 10 is the maximum Manhattan distance between two cells in the world and dist is the Manhattan distance between the cells represented by I and s, and we let

    Pref(a, s) = (1 − dist/10 + collUtil + sensUtil)/100,

where dist is the Manhattan distance between the cell representing s and the closest cell containing an item, collUtil is 98 if a is collect and there is actually an item in the cell represented by s, else 0, and sensUtil is 1 if the agent tries to see, else 0.¹ The division by 100 is to bring the value of Pref(·) within the limits of 0 and 1.
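Under the grid-world encoding above, Satf and Pref can be written down directly. The sketch below is our own rendering: it assumes the intention is given as the coordinates of the target corner, that states are simple dicts exposing x, y and the item flag t, and that item_cells is the set of cells currently containing items.

```python
def manhattan(c1, c2):
    return abs(c1[0] - c2[0]) + abs(c1[1] - c2[1])

def satf(I, s):
    """Satf(I, s) = 1 - dist/10, dist being the Manhattan distance
    between the intended corner I and the agent's cell in s."""
    return 1.0 - manhattan(I, (s["x"], s["y"])) / 10.0

def pref(a, s, item_cells):
    """Pref(a, s) = (1 - dist/10 + collUtil + sensUtil) / 100."""
    pos = (s["x"], s["y"])
    dist = min(manhattan(pos, c) for c in item_cells) if item_cells else 10
    coll_util = 98 if a == "collect" and s["t"] == 1 else 0
    sens_util = 1 if a == "see" else 0
    return (1.0 - dist / 10.0 + coll_util + sens_util) / 100.0
```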
First, we see how an HPB agent behaves when
it has no goal state (α = 0), but continually only
‘prefers’ to collect items. That is, we let
    Q_HPB(a, B, I, h) = Pref_β(a, B) + γ Σ_{z∈Z} Pr(z | a, B) max_{a′∈A} Q_HPB(a′, B′, I, h − 1).
On average, it collects 7.4 of 12 possible items. The
left-most results column of Table 1 shows how often
corners are (unintentionally) visited.
Next, if the HPB agent prefers to collect items while equally trying to reach corners (α = 0.5), it collects 4.3 of 12 possible items, and the corners it visits are summarized in the second-from-left results column of Table 1.
¹ Pref(·) is designed such that the agent collects a maximum number of items (ignoring goals). The agent collects more when it is encouraged to sense where items are, hence sensUtil is 1 if the agent tries to see.
HybridPOMDP-BDI-AnAgentArchitecturewithOnlineStochasticPlanningandDesireswithChangingIntensityLevels
11
Table 1: The average number of times each corner was visited (on separate occasions), the percentage of times all corners were visited, and the percentage of items (out of 12) collected.

Corner             α = 0    α = 0.5    α = 0.75    α = 1
(1,1)               2.2       2.8        2.7        2.9
(1,6)               2.1       2.6        2.7        3.2
(6,1)               2.0       2.7        2.6        3.0
(6,6)               1.7       2.6        2.9        3.0
All corners         8.0%     10.7%      10.9%      12.1%
Items collected    62%       36%        29%         0%
Then, we observe the agent’s behavior if we set
α = 0.75. In this case, the agent collects 3.5 items on
average, and its corner-visiting behavior—as given in
the second-from-right column of Table 1—is propor-
tional to the value of α, as expected.
Finally, we ignore the collection of items by set-
ting α = 1. That is, we let
    Q_HPB(a, B, I, h) = Satf_β(I, B) + γ Σ_{z∈Z} Pr(z | a, B) max_{a′∈A} Q_HPB(a′, B′, I, h − 1).
The right-most results column of Table 1 shows the
average number of times each corner was visited
when collecting items is not a preference. No items
were collected.
These experiments highlight five important features of an HPB agent: (1) While the agent is pursuing goals, it can concurrently perform rewarding actions not directly related to its goals. (2) Each of several goals can be pursued individually until satisfactorily achieved. (3) Goals must periodically be re-achieved. (4) The trade-off between goals and preferences can be set effectively. (5) Goals and preferences can be satisfied even while dealing with stochastic actions and perceptions.
5 RELATED WORK AND
CONCLUSION
Our work focuses on providing high-level decision-
making capabilities for robots and agents who live
in dynamic stochastic environments, where multiple
goals and goal types must be pursued. We introduced
a hybrid POMDP-BDI agent architecture, which may
display emergent behavior, driven by the intensities of its desires. In the past decade, several BDIAs have been augmented with capabilities to deal with uncertainty. The HPB architecture is novel in that, while the
agent is pursuing goals, it can concurrently perform
rewarding actions not directly related to its goals, and
goals must periodically be re-achieved, depending on
the goals’ desire levels, which change over time and
in proportion to how close the goals are to being sat-
isfied.
Walczak et al. (2007) and Meneguzzi et al. (2007)
have incorporated online plan generation into BDI
systems; however, the planners deal only with deter-
ministic actions and observations.
Nair and Tambe (2005) use POMDP theory to
coordinate teams of agents. However, their frame-
work is very different to our architecture. They use
POMDP theory to determine good role assignments
of team members, not for generating policies online.
Lim et al. (2008) provide a rather sophisticated
architecture for controlling the behavior of an emo-
tional agent. Their agents reason with several classes
of emotion and their agents are supposed to portray
emotional behavior, not simply to solve problems, but
to look believable to humans. Their architecture has
a “continuous planner [...] that is capable of partial
order planning and includes emotion-focused coping
[...]”. Their work has a different application from ours; however, we could take inspiration from them to im-
prove the HPB architecture.
Pereira et al. (2008) take a different approach to
use POMDPs to improve BDI agents. By leveraging
the relationship between POMDP and BDI models,
as discussed by Simari and Parsons (2006), they de-
vised an algorithm to extract BDI plans from optimal
POMDP policies. The main difference to our work
is that their policies are pre-generated and BDI-style
rules are extracted for all contingencies. The advan-
tage is that no (time-consuming) online plan/policy
generation is necessary. The disadvantage of their ap-
proach is that all the BDI plans must be stores and ev-
ery time the domain model changes, a new POMDP
must be solved and the policy-to-BDI-plan algorithm
must be run. It is not exactly clear from their paper
(Pereira et al., 2008) how or when intentions are cho-
sen. Although it is interesting to know the relation-
ship between POMDPs and BDI models (Simari and
Parsons, 2006, 2011), we did not use any of these in-
sights in developing our architecture. However, the
fact that the HPB architecture does integrate the two
frameworks is probably due to the existence of the
relationship.
Rens et al. (2009) also introduced a hybrid
POMDP-BDI architecture, but without a notion of de-
sire levels or satisfaction levels. Although their basic
approach to combining the POMDP and BDI frame-
works is the same as ours, there are at least two major
differences: Firstly, they define their architecture in
terms of the GOLOG agent language (Boutilier et al.,
ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence
12
2000). Secondly, their approach uses a computation-
ally intensive method for deciding whether to refocus;
performing short policy look-aheads to ascertain the
most valuable goal to pursue.
2
Our approach seems
much more efficient.
Chen et al. (2013) incorporate probabilistic graph-
ical models into the BDI framework for plan selection
in stochastic environments. An agent maintains epis-
temic states (with random variables) to model the un-
certainty about the stochastic environment, and cor-
responding belief sets of the epistemic state are de-
fined. The possible states of the environment, ac-
cording to sensory observations, and their relation-
ships are modeled using probabilistic graphical mod-
els: The uncertainty propagation is carried out by
Bayesian Networks and belief sets derived from the
epistemic states trigger the selection of relevant plans
from a plan library. For cases when more than one
plan is applicable due to uncertainty in an agent’s be-
liefs, they propose a utility-driven approach for plan
selection, where utilities of actions are modeled in in-
fluence diagrams. Our architecture is different in that
it does not have a library of pre-supplied plans; in our
architecture, policies (plans) are generated online.
None of the approaches mentioned maintain de-
sire levels for selecting intentions. The benefit of
maintaining desire levels is that intentions are not se-
lected only according to what they offer with respect to
their current expected reward, but also according to
when last they were achieved.
Although Nair and Tambe (2005) and Chen et al.
(2013) call their approaches hybrid, our architecture
can arguably more confidently be called hybrid be-
cause of its more intimate integration of POMDP and
BDI concepts.
We could take some advice from Antos and Pf-
effer (2011). They provide a systematic methodol-
ogy to incorporate emotion into a decision-theoretic
framework, and also provide “a principled, domain-
independent methodology for generating heuristics in
novel situations”.
Policies returned by Plan as defined in this paper
are optimal. A major benefit of a POMDP-based ar-
chitecture is that the literature on POMDP planning
optimization (Murphy, 2000; Roy et al., 2005; Paquet
et al., 2005; Li et al., 2005; Shani et al., 2007; Ross
et al., 2008; Cai et al., 2009; Shani et al., 2013) (for in-
stance) can be drawn upon to improve the speed with
which policies can be generated.
² Essentially, the goals in G are stacked in descending order of the value of V_HPB(B, g, h′), where h′ < h and B is the current belief-state. The goal on top of the stack becomes the intention.
Our architecture cannot yet control how often one
goal is sought relative to other goals. It would be ad-
vantageous to be able to do this.
Evaluating the proposed architecture in richer do-
mains would highlight problems in the architecture
and indicate new directions for research and develop-
ment in the area of hybrid POMDP-BDI architectures.
REFERENCES
Antos, D. and Pfeffer, A. (2011). Using emotions to en-
hance decision-making. In Walsh, T., editor, Pro-
ceedings of the 22nd Intl. Joint Conf. on Artif. In-
tell. (IJCAI-11), pages 24–30, Menlo Park, CA. AAAI
Press.
Boutilier, C., Reiter, R., Soutchanski, M., and Thrun, S.
(2000). Decision-theoretic, high-level agent program-
ming in the situation calculus. In Proceedings of the
Seventeenth National Conference on Artificial Intelli-
gence (AAAI-00) and of the Twelfth Conference on In-
novative Applications of Artificial Intelligence (IAAI-
00), pages 355–362. AAAI Press, Menlo Park, CA.
Bratman, M. (1987). Intention, Plans, and Prac-
tical Reason. Harvard University Press, Mas-
sachusetts/England.
Cai, C., Liao, X., and Carin, L. (2009). Learning to explore and exploit in POMDPs. In NIPS, pages 198–206.
Chen, Y., Hong, J., Liu, W., Godo, L., Sierra, C., and
Loughlin, M. (2013). Incorporating PGMs into a BDI
architecture. In Boella, G., Elkind, E., Savarimuthu,
B., Dignum, F., and Purvis, M., editors, PRIMA 2013:
Principles and Practice of Multi-Agent Systems, vol-
ume 8291 of Lecture Notes in Computer Science,
pages 54–69. Springer, Berlin/Heidelberg.
Kaelbling, L., Littman, M., and Cassandra, A. (1998). Plan-
ning and acting in partially observable stochastic do-
mains. Artificial Intelligence, 101(1–2):99–134.
Kinny, D. and Georgeff, M. (1991). Commitment and effec-
tiveness of situated agents. In Proceedings of the 12th
Intl. Joint Conf. on Artificial Intelligence (IJCAI-91),
pages 82–88.
Kinny, D. and Georgeff, M. (1992). Experiments in optimal
sensing for situated agents. In Proceedings of the the
2nd Pacific Rim Intl. Conf. on Artificial Intelligence
(PRICAI-92).
Koenig, S. (2001). Agent-centered search. Artificial Intelli-
gence Magazine, 22:109–131.
Li, X., Cheung, W., and Liu, J. (2005). Towards solv-
ing large-scale POMDP problems via spatio-temporal
belief state clustering. In Proceedings of IJCAI-05
Workshop on Reasoning with Uncertainty in Robotics
(RUR-05).
Lim, M., Dias, J., Aylett, R., and Paiva, A. (2008). Im-
proving adaptiveness in autonomous characters. In
Prendinger, H., Lester, J., and Ishizuka, M., editors,
Intelligent Virtual Agents, volume 5208 of Lecture
Notes in Computer Science, pages 348–355. Springer,
Berlin/Heidelberg.
HybridPOMDP-BDI-AnAgentArchitecturewithOnlineStochasticPlanningandDesireswithChangingIntensityLevels
13
Lovejoy, W. (1991). A survey of algorithmic methods for
partially observed Markov decision processes. Annals
of Operations Research, 28:47–66.
Meneguzzi, F., Zorzo, A., Móra, M., and Luck, M. (2007). Incorporating planning into BDI systems. Scalable Computing: Practice and Experience, 8(1):15–28.
Monahan, G. (1982). A survey of partially observable
Markov decision processes: Theory, models, and al-
gorithms. Management Science, 28(1):1–16.
Murphy, R. (2000). Introduction to AI Robotics. MIT Press,
Massachusetts/England.
Nair, R. and Tambe, M. (2005). Hybrid BDI-POMDP framework for multiagent teaming. Journal of Artificial Intelligence Research (JAIR), 23:367–420.
Paquet, S., Tobin, L., and Chaib-draa, B. (2005). Real-time
decision making for large POMDPs. In Advances in
Artificial Intelligence: Proceedings of the Eighteenth
Conference of the Canadian Society for Computa-
tional Studies of Intelligence, volume 3501 of Lecture
Notes in Computer Science, pages 450–455. Springer
Verlag.
Pereira, D., Gonçalves, L., Dimuro, G., and Costa, A. (2008). Constructing BDI plans from optimal POMDP policies, with an application to AgentSpeak programming. In G. Henning, M. G. and Goneet, S., editors, XXXIV Conferência Latinoamericana de Informática, Santa Fe. Anales CLEI 2008, pages 240–249.
Pollack, M. and Ringuette, M. (1990). Introducing the Tile-
world: Experimentally evaluating agent architectures.
In Proceedings of the AAAI-90, pages 183–189. AAAI
Press.
Rao, A. and Georgeff, M. (1995). BDI agents: From theory
to practice. In Proceedings of the ICMAS-95, pages
312–319. AAAI Press.
Rens, G., Ferrein, A., and Van der Poel, E. (2009). A
BDI agent architecture for a POMDP planner. In
Lakemeyer, G., Morgenstern, L., and Williams, M.-
A., editors, Proceedings of the 9th Intl. Symposium
on Logical Formalizations of Commonsense Reason-
ing (Commonsense 2009), pages 109–114, University
of Technology, Sydney. UTSe Press.
Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. (2008).
Online planning algorithms for POMDPs. Journal of
Artificial Intelligence Research (JAIR), 32:663–704.
Roy, N., Gordon, G., and Thrun, S. (2005). Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research (JAIR), 23:1–40.
Schut, M. and Wooldridge, M. (2000). Intention reconsider-
ation in complex environments. In Proceedings of the
the 4th Intl. Conf. on Autonomous Agents (AGENTS-
00), pages 209–216, New York, NY, USA. ACM.
Schut, M. and Wooldridge, M. (2001a). The control of rea-
soning in resource-bounded agents. The Knowledge
Engineering Review, 16(3):215–240.
Schut, M. and Wooldridge, M. (2001b). Principles of inten-
tion reconsideration. In Agents 2001: Proceedings of
the 5th Intl. Conf. on Autonomous Agents, pages 340–
347, New York, NY. ACM Press.
Schut, M., Wooldridge, M., and Parsons, S. (2004).
The theory and practice of intention reconsideration.
Experimental and Theoretical Artificial Intelligence,
16(4):261–293.
Shani, G., Brafman, R., and Shimony, S. (2007). Forward
search value iteration for POMDPs. In de Mantaras,
R. L., editor, Proceedings of the 20th Intl. Joint Conf.
on Artif. Intell. (IJCAI-07), pages 2619–2624, Menlo
Park, CA. AAAI Press.
Shani, G., Pineau, J., and Kaplow, R. (2013). A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51.
Simari, G. and Parsons, S. (2006). On the relationship between MDPs and the BDI architecture. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS ’06), pages 1041–1048, New York, NY, USA. ACM.
Simari, G. and Parsons, S. (2011). Markov Deci-
sion Processes and the Belief-Desire-Intention Model.
Springer Briefs in Computer Science. Springer, New
York, Dordrecht, Heidelberg, London.
Walczak, A., Braubach, L., Pokahr, A., and Lamersdorf,
W. (2007). Augmenting BDI agents with delibera-
tive planning techniques. In Bordini, R., Dastani,
M., Dix, J., and Seghrouchni, A., editors, Proceed-
ings of the 4th Intl. Workshop of Programming Multi-
Agent Systems (ProMAS-06), pages 113–127, Heidel-
berg/Berlin. Springer Verlag.
Wooldridge, M. (1999). Intelligent agents. In Weiss, G., ed-
itor, Multiagent Systems: A Modern Approach to Dis-
tributed Artificial Intelligence, chapter 1. MIT Press,
Massachusetts/England.
Wooldridge, M. (2000). Reasoning about Rational Agents.
MIT Press, Massachusetts/England.
Wooldridge, M. (2002). An introduction to multiagent sys-
tems. John Wiley & Sons, Chichester, England.
ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence
14