Modelling Evolving Voting Behaviour on Internet Platforms
Stochastic Modelling Approaches for Dynamic Voting Systems
Shikhar Raje¹, Navjyoti Singh¹ and Shobhit Mohan²
¹Center for Exact Humanities, International Institute of Information Technology, Hyderabad, Hyderabad, India
²Department of Economics, Hyderabad Central University, Hyderabad, India
Keywords:
Preference Aggregation, Stochastic Modelling, Dynamic Voting, Markov Decision Processes, Computational
Complexity, Algorithm Analysis.
Abstract:
Markov Decision Processes (MDPs) and their variants are standard models in various domains of Artificial Intelligence. However, each model captures a different aspect of real-world phenomena and results in a different kind of computational complexity. MDPs are also finding recent use in scenarios involving the aggregation of preferences (such as recommendation systems, e-commerce platforms, etc.). In this paper, we extend one such MDP variant to explore the effect of including observations made by stochastic agents on the complexity of computing optimal outcomes for voting results. The resulting model captures phenomena of greater complexity than current models, while being closer to a real-world setting. The utility of the theoretical model is discussed through an application to the real-world setting of crowdsourcing. We address a key question in the crowdsourcing domain, namely the Exploration vs. Exploitation problem, and demonstrate the flexibility with which MDP-based models can be adapted to Dynamic Voting scenarios.
1 INTRODUCTION
The Internet exhibits a variety of voting and preference aggregation schemes. This is immediately evident from the wide use of such schemes in ranking product features, ranking songs and artists, etc. (Altman and Tennenholtz, 2005). In many of these settings the aggregated ranking is dynamic, i.e. the system announces the ranking at each particular point in time, and may revise it when new agents arrive in the system and announce their votes, or when existing agents change their votes. The Internet therefore calls for additional study of voting and preference aggregation schemes that goes beyond the classical models. We refer to this setting as Dynamic Voting (Tennenholtz, 2004).
A recent treatment of Dynamic Voting is due to
(Parkes and Procaccia, 2013). The authors modelled preference aggregation as a discrete-space, discrete-time evolutionary system, which was then formalized as a stochastic process. In order to construct a tractable
algorithm for their model, a symmetry-based contrac-
tion was used to reduce the state space of the prob-
lem. The results generated from their approach dif-
fered from traditional results in Social Choice since
an axiomatic approach was traded for specifying the
behaviour of the system formally. The study focused
on the dynamic behaviour of the system as opposed
to axiomatic insights.
In this paper, we extend stochastic modelling techniques to scenarios in Dynamic Voting. The rest of the paper is organized as follows. We begin Section 2
by surveying the approach taken in (Parkes and Pro-
caccia, 2013). Section 3 extends the groundwork to
newer stochastic models to deal with more complex
settings. Specifically, we introduce a model that al-
lows us to account for observations made by voters
about their surroundings (including the votes of fel-
low voters) and interpret the findings from using the
model in that setting. In Section 4, a high-level ap-
plication of this model to existing scenarios presents
potential benefits, thereby justifying the use of such a
novel approach. We conclude in Section 5.
2 DYNAMIC VOTING AS AN MDP
In a voting or preference aggregation scenario, the un-
certainty stems from the possible action that an agent
may take from a set of actions, given a particular set
of inputs. Current methods in Dynamic Voting focus
on relating these actions to the future evolution of the
system. This relation is achieved by defining the pos-
sible actions that a voter can take as a random vari-
able for constructing a stochastic process. Random
variables from multiple voters can then be aggregated
into a single random variable that represents the ag-
gregate uncertainty of the entire system of voters.
In (Parkes and Procaccia, 2013), the class of
stochastic processes chosen to model Dynamic Vot-
ing are Markov Decision Processes (Puterman, 2014),
(Howard, 1960). A Markov Decision Process (MDP) is a tuple M = ⟨S, A, R, T⟩. S represents the finite state space of the process. A is the finite set of actions that can be taken to change the state of the system from one state to another. R : S × A → ℝ is a reward function where, for s ∈ S and a ∈ A, R(s, a) is the reward obtained by taking action a in state s. T : S × (S × A) → ℝ is the transition function, where T(s′|s, a) (or T(s, a, s′)) is the probability that the system will move to state s′ when action a is taken in state s. Notice that the rewards for the voter are defined only in terms of the action taken in the current state; this memorylessness is known as the Markov property. A deterministic optimal policy maps a set of preferences from multiple voters to a single aggregated preference that satisfies some optimality criterion.
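To make the definition concrete, the following is a minimal sketch of such an MDP and of extracting a deterministic optimal policy by value iteration. The toy states, actions, rewards and transition probabilities are invented purely for illustration and are not taken from the model described above.

# A minimal sketch of a voter MDP (toy numbers, invented for illustration)
# and of computing a deterministic optimal policy via value iteration.
S = ["a>b", "b>a"]                    # states: preference orderings (m = 2, so m! = 2)
A = ["a>b", "b>a"]                    # actions: switch to an ordering
R = {(s, a): (1.0 if s == a else 0.0) for s in S for a in A}          # R(s, a)
T = {(s, a): ({s: 1.0} if a == s else {a: 0.9, s: 0.1})               # T(s'|s, a)
     for s in S for a in A}

def value_iteration(gamma=0.9, sweeps=100):
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    # The optimal policy picks, in each state, the action maximising expected value.
    return {s: max(A, key=lambda a: R[(s, a)] + gamma *
                   sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in S}

print(value_iteration())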
2.1 Formal Model of Dynamic Voting as
an MDP
Formally, the approach taken by the authors is to
model individual voters as MDPs, then combine the
individual models into an aggregate model. This
model is then optimized for computation of the op-
timal policy. The voter MDPs are defined as follows:
The state space of the voter is the set of preference
orderings that the voter can vote for. For m alterna-
tives, S is a set of size m!.
The actions available to the voter are changes to his current ordering over the various preferences. Therefore, A is also a set of size m!, since the agent can change to any of the other orderings.
T and R are provided as input. In a real-world
setting, these functions could come from machine
learning techniques over past data for the voter.
In the aggregate setting, S becomes a state space of size (m!)^n, since every one of the n voters can be in one of the individual m! states. A, while still remaining a set of size m!, now represents the system declaring a particular preference ordering as the winner of the lot. Since each voter evolves independently, the aggregate T for states s and s′ is the product of the probabilities of each voter moving from s to s′. The rewards are simply the sum total of rewards for each voter in the state change.
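As a rough illustration of this aggregation step, the sketch below builds the joint transition distribution and reward for one aggregate state and one declared winner from hypothetical per-voter tables voter_T and voter_R (keyed as in the single-voter sketch above). The names and data layout are assumptions made for illustration, not the authors' implementation.

from itertools import product

def aggregate_step(voter_T, voter_R, joint_s, a):
    # joint_s is a tuple (s_1, ..., s_n) of individual voter states;
    # a is the preference ordering declared the winner by the system.
    n = len(joint_s)
    transitions = {}
    # Voters evolve independently: the joint probability is the product of
    # the individual transition probabilities.
    for joint_s2 in product(*[voter_T[i][(joint_s[i], a)] for i in range(n)]):
        p = 1.0
        for i in range(n):
            p *= voter_T[i][(joint_s[i], a)].get(joint_s2[i], 0.0)
        if p > 0.0:
            transitions[joint_s2] = p
    # The aggregate reward is the sum of the individual voters' rewards.
    reward = sum(voter_R[i][(joint_s[i], a)] for i in range(n))
    return transitions, reward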
Finally, the authors also note that the system can
be designed to comply with certain desirable be-
haviour. Briefly, they propose modifying the aggre-
gate reward function to heavily penalize certain ac-
tions in certain states.
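A minimal sketch of this penalization idea follows, assuming the aggregate reward is stored as a table keyed by (state, action) as above; the forbidden set and the penalty value are illustrative assumptions.

def penalize(R_agg, forbidden, penalty=-1.0e6):
    # Return a copy of the aggregate reward table in which every (state, action)
    # pair listed in `forbidden` receives a heavy negative reward, steering the
    # optimal policy away from those actions in those states.
    return {(s, a): (penalty if (s, a) in forbidden else r)
            for (s, a), r in R_agg.items()}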
3 PARTIALLY OBSERVABLE
MDPs IN DYNAMIC VOTING
A Partially Observable MDP (POMDP) differs from a basic MDP in that the state of the evolving agent is not completely visible to the agent. The agent, instead of being in a particular state, maintains a belief vector of probabilities of being in various states. The agent then receives certain observations on every state change, through which it updates its belief vector on which state it is likely to be in. Formally, a Partially Observable Markov Decision Process (POMDP) (Kaelbling et al., 1998), (Sondik, 1971) is a tuple ⟨S, A, Ω, T, O, R⟩. Besides the elements already defined as part of an MDP, Ω is a finite set of observations that the voter can receive on a state change, and O(o|s) (or O(s, o)) is the probability of making a given observation in a particular state.
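For reference, the standard Bayesian belief update performed by a POMDP agent (the τ function referred to later) can be sketched as follows; the dictionary-based conventions for T and O are assumptions made for illustration.

def belief_update(b, a, o, T, O, S):
    # One step of the standard POMDP belief update b' = tau(b, a, o):
    # b'(s') is proportional to O(o|s') * sum_s T(s'|s, a) * b(s).
    b_new = {s2: O.get((s2, o), 0.0) *
                 sum(T[(s, a)].get(s2, 0.0) * b[s] for s in S)
             for s2 in S}
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s2: p / norm for s2, p in b_new.items()}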
3.1 Formal Model of Dynamic Voting as
a POMDP
Formally, the voter POMDPs are defined the same
way as they are for MDPs, with the only difference
coming in the voter state space. In our model, we propose the voter state space as a set of size (m!)^n, containing all the possible preference orderings of the other agents as well. Therefore, if the voter does not have perfect information about the preferences reported by all the other voters in the setting, then he is unsure about his present state. The voter therefore maintains his current state as a probability distribution or belief vector b over the state space, where b(s) is the agent's belief that it is in state s. This lack of perfect information is a reasonable assumption in most real-world scenarios involving Dynamic Voting.
Additionally, we define the observation set as a subset of the preference orderings reported by the other voters. That is, we say that voter i observes the votes of some subset L_i = {1, · · · , l} of the other agents. Therefore, the observation space becomes a set of size (m!)^l for each agent. We observe that definitions of the observation set and observation function need not be limited to observations of the behaviour
of other voters. It need not even be a single observable phenomenon. Multiple classes of observations can be combined into a unified observation function and observation set through Bayesian operations. As we discuss later, this allows for greater adaptability in using this model in real-world settings. Also, as with the transition and reward functions, the observation function is provided as input.
The aggregate POMDP can be obtained from the individual voter POMDPs following the process described in Section 2.1. We observe that, in defining the observation set for the aggregate model, since each agent makes an independent observation, the size of the observation set becomes |Ω| = ∏_{i ∈ N} (m!)^{|L_i|} = (m!)^{Σ_{i ∈ N} |L_i|}. Finally, at the combination stage, a binary constraint function C(s, a) is also added, which is used to enforce desirable behaviour on the system. C(s, a) returns 1 if an action a is not allowed in a particular state, and 0 otherwise.
3.1.1 Computation of the Optimal Policy
There exists a large body of literature on the chal-
lenges and methods for computation of optimal poli-
cies for POMDPs. Readers are directed to (Kael-
bling et al., 1998), (Sondik, 1971), (Kaelbling et al.,
1996) and (Amato et al., 2014) for further reading.
For the purpose of our study, however, we choose
the algorithm presented in (Undurti and How, 2010).
We do this since the algorithm deals with constrained
POMDPs, is reasonably simple in complexity and
the authors emphasise the tractability of the algo-
rithm through offline, pre-computation methods. An
adapted version of the algorithm is presented in the
appendix. The algorithm computes the optimal pol-
icy by computing future belief states (encapsulated in
the τ function, which comes from (Kaelbling et al.,
1998)), while using a discount factor γ for deciding
the impact of future rewards on current optimal ac-
tions. We observe that, in the computation process for the optimal policy, the system anticipates each of the possible observations in the aggregate observation set, and calculates an expected reward in the event of making that observation, making the computational complexity of the algorithm polynomial in |Ω|. However, as defined earlier, this observation set grows exponentially in the number of voters that can be observed by each voter (assuming the number of alternatives remains constant).
3.1.2 Model Optimization
An assumption that we can make in the design of our model, which would improve tractability, is that each individual voter, instead of observing the preferences of a fixed subset L_i of all the other voters, simply observes |L_i| votes. That is, the reported preferences are disassociated from the people reporting the preference. Intuitively, one way to interpret this simplification is the difference between aggregation behaviour on social networks (where we can observe which of our connections “liked” or “followed” an alternative) versus aggregation behaviour on e-retail websites (where some people rated a product 5 stars, some others rated it 4 stars, etc.). This clearly reduces the size of the voter observation sets, and consequently, the aggregate observation set.
The exact size of the observation set in this new setting can be calculated as follows. If we assume an “alphabet” of size m! (where each “letter” is a preference ordering), then we wish to know how many multisets of size |L_i| can be formed from this alphabet. An established result in combinatorics (the count of combinations with repetition) allows us to compute this as ∏_{i ∈ N} C(m! + |L_i| − 1, |L_i|). Again, assuming the number of alternatives to be fixed (for example, 3), each factor evaluates to a polynomial of order 5 in |L_i|, i.e. O(|L_i|^5) complexity.
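As a quick sanity check of this count (under the assumption m = 3, so an alphabet of m! = 6 orderings), the closed form can be compared against a brute-force enumeration of multisets; this snippet is illustrative only.

from math import comb
from itertools import combinations_with_replacement

m_fact = 6  # m! for m = 3 alternatives
for l in range(1, 8):  # l plays the role of |L_i|, the number of observed votes
    closed_form = comb(m_fact + l - 1, l)                 # multiset coefficient
    brute_force = sum(1 for _ in combinations_with_replacement(range(m_fact), l))
    assert closed_form == brute_force
    print(l, closed_form)   # grows as a degree-5 polynomial in l, i.e. O(l^5)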
4 CROWDSOURCING AS
DYNAMIC VOTING
In this section, we apply the model to a real-world set-
ting and discuss the advantages. We apply this model
to crowdsourcing (Slivkins and Vaughan, 2014). We
begin by defining the problem in crowdsourcing plat-
forms, briefly analysing existing methodologies, and
comparing our model to these methodologies.
We identify three interacting components in a
crowdsourcing platform, namely, the workers, the re-
questers and the platform matching these two compo-
nents to one another. Two observations about crowd-
sourcing encourage an approach from a Dynamic Vot-
ing perspective. Firstly, with each iteration, workers
and requesters can form an insight into how the other
group is making decisions. The platform can gener-
ate profiles to predict the kinds of tasks that workers
might choose, or their performance in the completion
of tasks, or it might match requesters to a set of work-
ers which give the requesters the best output.
Secondly, we observe the presence of two different strategic elements in the system. In the first, local element, members of individual components seek to maximize their payoff through strategic interactions. These interactions can be between workers and requesters, such as when workers are debating how much effort to put in for the amount offered
by the requester (Mason and Watts, 2010), or
by requesters who want to incentivize good behaviour through mechanisms such as reputation systems (Dellarocas, 2005). In the second, global element, we observe that, for the long-term sustainability of a crowdsourcing platform, it should have certain desirable characteristics that are guaranteed in its performance over a horizon of repeated decisions.
These observations make crowdsourcing a good fit
for the constrained POMDP Dynamic Voting model
presented in the earlier sections.
4.1 Current Perspectives
A general model for the study of crowdsourcing is
due to Cesa-Bianchi et al., known in literature as
the multi-armed bandit model (Auer et al., 1995),
(Slivkins and Vaughan, 2014). The problem is stated as follows. In a simple scenario, an agent plays against a Vegas-style slot machine (colloquially referred to as a ‘one-armed bandit’). In such a setting, the agent does not know the probabilities of payoff under repeated play. The multi-armed scenario extends this setting to k arms or machines, with each machine having a unique probability distribution assigned to its payoffs.
This model addresses a key challenge in exist-
ing crowdsourcing literature known as the Explo-
ration Vs. Exploitation problem. This is a challenge
wherein, given a set of repeated plays against multi-
armed bandits, an agent has to decide whether a par-
ticular play should be allotted to the task of explo-
ration or exploitation. An in-depth presentation of the problem is due to Slivkins and Vaughan (Slivkins and Vaughan, 2014). Emphasis on the Exploration
Vs. Exploitation problem has the effect of localizing
the study of crowdsourcing to a per-agent basis. That
is, “efficiency” and “optimal behaviour” of such mod-
els are defined in terms of maximizing payoff for only
the task requesters and workers of the system, and not
in terms of desiderata which exist beyond those dis-
tributions.
While a few studies have been done which factor
in desiderata from literature on Social Choice theory
(Lee et al., 2014), (Mao et al., 2013), our model dif-
fers from these in two ways. Firstly, we focus on inte-
grating Dynamic Voting models into existing crowd-
sourcing models. This separates our work from the
work done by Lee et al. (Lee et al., 2014), where the
end product is a new model based on an alternative
definition of “exploration vs. exploitation”. Secondly,
our model aims to be closer to real-life situations by
factoring in the aspect of repeated decision-making,
which is not covered by the work of Mao et al. (Mao
et al., 2013).
4.2 An Initial Model
While the action and state spaces have a direct defini-
tion as per section 3, the definition of the observation
space provides some more insight. Defining obser-
vations and relating them to the progression of the
model is a challenge, since the original multi-armed
bandit model proposed updations of the static and dy-
namic elements of the MDP with each observation.
Essentially, this resulted in models of the multi-armed
bandit problem being constructed as one-state MDPs,
with every iteration changing the probabilities of out-
comes from playing. The POMDP version fixes this
issue by adding all the information of the system that
can be known as a static set, and the effects of an
observation made at one iteration are manifested in
the value function of the next iteration, not in the ele-
ments of the POMDP itself.
4.3 Adding Exploration Vs.
Exploitation
Adding elements to model the question of Exploration
Vs. Exploitation can now be achieved by modify-
ing the action space and reward function of the agent
POMDP.
1. We redefine the action space by adding a redundancy. Let one action space, A_exploration, be an m!-sized set of the preference orderings when the agent is taking an action from an exploration perspective. Similarly, let another action space, A_exploitation, be the m!-sized set of all the preference orderings when the agent is exploiting the current system with his current knowledge. The transition function for an agent is also redefined for this new action space. For example, a more exploration-prone agent would have greater transition probabilities for an action in A_exploration than for the same action in A_exploitation.
2. To reflect the trade-off between the two actions, we redefine the reward function. In a trivial case, if, for an action a, the reward function in all states is R(a_exploration, s) = R(a_exploitation, s), then the agent is playing a purely random strategy, where the information from the observations plays no role in deciding the next action that the agent should take. However, without loss of generality, we can define the relationship between the reward functions in the two cases as R(a_exploration, s) = R(a_exploitation, s) + ε_{a,s}, where the ε_{a,s} term defines
the extra value that the agent gains in terms of in-
formation about the system, in addition to the re-
ward from taking the action in that particular state.
3. Now, the Exploration Vs. Exploitation problem can be expressed by modifying the Value Function and Optimal Policy. We define this problem in terms of the difference between V_ideal(s), the ideal or maximum value that an agent can gain from the current state if he were playing with full observations and complete knowledge of the system, and V_actual(s). We now assert that the Exploration Vs. Exploitation problem is to minimize the distance between this ideal value and the value that the agent gains in its own iterations, i.e.

V_actual(s) = min_{a ∈ A} [ V_ideal(s) − R(s, a) − γ Σ_{s′ ∈ S} T(s′|s, a) · V_actual(s′) ]    (1)

Essentially, this optimality criterion asserts that the best possible outcome occurs when the agent's behaviour converges towards that shown by an identical agent operating under ideal conditions. The value of V_ideal(s), and the method for solving the recurrence relation, is explained in the chapter on Discounted MDPs (Chapter 6) of (Puterman, 2014). A small sketch of iterating this recurrence is given after this list.
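The following is a minimal sketch of solving the recurrence in Equation (1) by fixed-point iteration, assuming V_ideal has already been obtained (for example by solving the fully observable discounted MDP as in (Puterman, 2014)); the table-based representation is an illustrative assumption.

def solve_v_actual(S, A, T, R, V_ideal, gamma=0.9, sweeps=200):
    # Iterate Equation (1): V_actual(s) = min_a [ V_ideal(s) - R(s, a)
    #   - gamma * sum_{s'} T(s'|s, a) * V_actual(s') ]
    # The update is a gamma-contraction in the sup norm, so repeated sweeps converge.
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {s: min(V_ideal[s] - R[(s, a)]
                    - gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    return V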
4.4 Insights
While the usage of a POMDP presents a steep cost in
terms of tractability over the single-state, multi-armed
bandit model used in literature, we note the bene-
fits that this model presents. Firstly, we observe that
the definition of general observation sets and func-
tions provides a distinct modelling advantage for real-
world settings over even the most widely-used gener-
alization of the multi-armed bandit model.
The second advantage is that the process followed here to adapt the model can be generalized and used to extend Dynamic Voting POMDPs in ways similar to the various extensions of the multi-armed bandit model, for example to study regret minimization in a system or reward-based reputation systems (Auer et al., 1995).
In conclusion, we propose that models for crowd-
sourcing based on dynamic voting approaches can
supersede current multi-armed bandit based models.
The core phenomena being modelled by the latter can
be incorporated into the former, while the former pro-
vides a more generalized and easily extensible frame-
work for the crowdsourcing scenario.
5 RESULTS
In this paper, we began with the problem of modelling the evolution of preferences of agents on internet platforms. Recent approaches focus on preferences evolving over a horizon of repeated decision-making. This is a novel approach, since it studies aggregate decision-making by constructing it as an evolving system, rather than using the standard logic-based approaches that have been the mainstay of the domain so far. Those approaches focused on a single instance of decision aggregation, and the questions asked were along the lines of how best to map the individual inputs to the output so as to maximize overall social utility (Moulin et al., 2016).
With this work, therefore, this paper builds a case for further investigation into the use of MDP variants and stochastic modelling techniques in real-world Dynamic Voting scenarios. We show that stochastic modelling yields greater insight into the relation between factoring in a real-world aspect (such as the observations a voter makes about the votes of others) and the resulting change in computational complexity.
For example, Semi-Markov Decision Processes (SMDPs), a class of MDPs where different state changes are completed in different amounts of time, present a unique challenge at the aggregation step. A workaround from (Gosavi, 2014) reduces SMDPs to regular MDPs for this aggregation step. Therefore, while SMDP-based models would be as tractable as the basic MDP model (and much more tractable than the POMDP model), the process of conversion changes the optimal policy for the system. Additionally, we have the novel insight that factoring in voter behaviour related to the speed of changing one's own decisions has no effect on computational complexity, while making observations about other voters results in more complex computation.
Similarly, (Gmytrasiewicz and Doshi, 2004) and (Wiering et al., 2007) introduce Interactive POMDPs (IPOMDPs) and Multi-Objective MDPs (MOMDPs). IPOMDPs extend POMDPs by allowing for interactions between agents, and MOMDPs use multiple optimality criteria, instead of the straightforward criterion used here, which maximizes only the payoff of the entire system.
Finally, another key line of research would use
these theoretical models on actual platforms, with real
data. Of particular interest would be a study on how
different methods for approximating policies would
translate in a Dynamic Voting scenario, in terms of
the tradeoff between optimal rewards and computa-
tional tractability.
REFERENCES
Altman, A. and Tennenholtz, M. (2005). Ranking systems:
the pagerank axioms. In Proceedings of the 6th ACM
conference on Electronic commerce, pages 1–8. ACM.
Amato, C., Konidaris, G. D., and Kaelbling, L. P. (2014).
Planning with macro-actions in decentralized pomdps.
In Proceedings of the 2014 international confer-
ence on Autonomous agents and multi-agent systems,
pages 1273–1280. International Foundation for Au-
tonomous Agents and Multiagent Systems.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.
(1995). Gambling in a rigged casino: The adversarial
multi-armed bandit problem. In Foundations of Com-
puter Science, 1995. Proceedings., 36th Annual Sym-
posium on, pages 322–331. IEEE.
Dellarocas, C. (2005). Reputation mechanism design in on-
line trading environments with pure moral hazard. In-
formation Systems Research, 16(2):209–230.
Gmytrasiewicz, P. J. and Doshi, P. (2004). Interactive
pomdps: Properties and preliminary results. In Pro-
ceedings of the Third International Joint Confer-
ence on Autonomous Agents and Multiagent Systems-
Volume 3, pages 1374–1375. IEEE Computer Society.
Gosavi, A. (2014). Simulation-based optimization: para-
metric optimization techniques and reinforcement
learning, volume 55. Springer.
Howard, R. (1960). Dynamic programming and Markov
processes. MIT Press.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R.
(1998). Planning and acting in partially observable
stochastic domains. Artificial intelligence, 101(1):99–
134.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996).
Reinforcement learning: A survey. Journal of artifi-
cial intelligence research, pages 237–285.
Lee, D. T., Goel, A., Aitamurto, T., and Landemore, H.
(2014). Crowdsourcing for participatory democracies:
Efficient elicitation of social choice functions. In Sec-
ond AAAI Conference on Human Computation and
Crowdsourcing.
Mao, A., Procaccia, A. D., and Chen, Y. (2013). Better hu-
man computation through principled voting. In AAAI.
Citeseer.
Mason, W. and Watts, D. J. (2010). Financial incentives
and the performance of crowds. ACM SigKDD Explo-
rations Newsletter, 11(2):100–108.
Moulin, H., Brandt, F., Conitzer, V., Endriss, U., Lang, J.,
and Procaccia, A. D. (2016). Handbook of Computa-
tional Social Choice. Cambridge University Press.
Parkes, D. C. and Procaccia, A. D. (2013). Dynamic social
choice with evolving preferences. In AAAI.
Puterman, M. L. (2014). Markov decision processes: dis-
crete stochastic dynamic programming. John Wiley &
Sons.
Slivkins, A. and Vaughan, J. W. (2014). Online decision
making in crowdsourcing markets: Theoretical chal-
lenges. ACM SIGecom Exchanges, 12(2):4–23.
Sondik, E. J. (1971). The optimal control of partially ob-
servable markov processes. Technical report, Ph.D
Thesis, Stanford University.
Tennenholtz, M. (2004). Dynamic voting. In EC’04: Pro-
ceedings of the 5th ACM Conference on Electronic
Commerce, New York, New York, USA, May 17-20,
2004, page 230. Association for Computing Machin-
ery.
Undurti, A. and How, J. P. (2010). An online algorithm
for constrained pomdps. In Robotics and Automa-
tion (ICRA), 2010 IEEE International Conference on,
pages 3966–3973. IEEE.
Wiering, M., De Jong, E. D., et al. (2007). Computing op-
timal stationary policies for multi-objective markov
decision processes. In Approximate Dynamic Pro-
gramming and Reinforcement Learning, 2007. AD-
PRL 2007. IEEE International Symposium on, pages
158–165. IEEE.
APPENDIX
Procedure OnlinePOMDPsolver
    b : current belief state of the system
    T : a tree-like data structure containing the current state of the system
        and possible future transitions
    D : expansion depth for lookahead

    b ← b_0
    T ← b
    WHILE not ExecutionTerminationCondition DO
        (a*, V) ← Expand(b, D)
        Execute action a* returned by Expand
        Receive observation o
        Update b to reflect the new belief state of the system
        Update tree T
    END WHILE

Procedure Expand(b, d)
    IF d = 0 THEN
        V(s) ← 0                              // lookahead horizon reached
        RETURN (nil, V(s))
    END IF
    V(s) ← −∞
    FOR each action a ∈ A with C(s, a) = 0 DO
        P(o|b, a) ← Σ_{s′ ∈ S} O(o|s′) Σ_{s ∈ S} T(s′|s, a) b(s)
        V(s, a) ← Σ_{n ∈ N} [ R(s, a) + γ Σ_o P(o|b, a) · Expand(τ(b_n, a, o), d − 1) ]
        IF V(s, a) > V(s) AND C(s′, a) = 0 THEN
            V(s) ← V(s, a)
            a* ← a
        END IF
    END FOR
    RETURN (a*, V(s))