SELF-ORGANIZING SYNCHRONICITY AND DESYNCHRONICITY
USING REINFORCEMENT LEARNING
Mihail Mihaylov, Yann-Aël Le Borgne, Ann Nowé
Vrije Universiteit Brussel, Brussels, Belgium
Karl Tuyls
Maastricht University, Maastricht, The Netherlands
Keywords:
Reinforcement learning, Synchronicity and Desynchronicity, Wireless sensor networks, Wake-up scheduling.
Abstract:
We present a self-organizing reinforcement learning (RL) approach for coordinating the wake-up cycles of
nodes in a wireless sensor network in a decentralized manner. To the best of our knowledge we are the first
to demonstrate how global synchronicity and desynchronicity can emerge through local interactions alone,
without the need of a central mediator or any form of explicit coordination. We apply this RL approach to
wireless sensor nodes arranged in different topologies and study how agents, starting with a random policy,
are able to self-adapt their behavior based only on their interaction with neighboring nodes. Each agent
independently learns to which nodes it should synchronize to improve message throughput and, at the same
time, with whom to desynchronize in order to reduce communication interference. The obtained results show how
simple and computationally bounded sensor nodes are able to coordinate their wake-up cycles in a distributed
way in order to improve the global system performance through (de)synchronicity.
1 INTRODUCTION
Synchronicity, which can be defined as the ability
of a system to organize simultaneous collective ac-
tions (Werner-Allen et al., 2005), has long attracted
the attention of biologists, mathematicians, and com-
puter scientists. Seminal work on pulse-coupled os-
cillators of Mirollo and Strogatz (Mirollo and Stro-
gatz, 1990) provided a theoretical framework for
modeling the synchronization in biological systems
such as the flashing of fireflies, heart pacemaker cells, or
cricket chirps. The framework has recently been
applied to systems of digital agents such as in the
emerging field of wireless sensor networks (Knoester
and McKinley, 2009), and to its counterpart, desyn-
chronicity, in (Werner-Allen et al., 2005). In such
multi-agent systems (MASs), synchronicity can be
desirable to allow all agents to synchronously com-
municate, and desynchronicity to ensure that the
communication channel is evenly shared among the
agents.
Little attention has however been given to MASs
where either global synchronicity or global desyn-
chronicity of the system alone is impractical and/or
undesirable. Desynchronicity refers to the state where
the periodic activity of agents is shifted in time
relative to one another, as opposed to synchronic-
ity, where agents’ activities coincide in time. In
most MASs, an optimal solution is intuitively found
where sets of agents synchronize with one another,
but desynchronize with others. Nodes communicat-
ing in a wireless sensor network (WSN) are an ex-
ample, as routing requires the synchronization of sets
of nodes, while channel contention requires sets of
nodes to desynchronize in order to avoid packet col-
lisions. Other examples include traffic lights guiding
vehicles through crossings in traffic control problems
and jobs that have to be processed by the same ma-
chines at different times in job-scheduling problems.
In such cases applying synchronization or desynchro-
nization alone can be detrimental to the system (e.g.
all traffic lights showing green, or complementary
jobs processed at different times).
In these systems, agents should be logically orga-
nized in groups (or coalitions), such that the actions
of agents within the group need to be synchronized,
while at the same time being desynchronized with the
actions of agents in other groups. We refer to this
concept for short as (de)synchronicity (for the system
state), and (de)synchronization (for the process occur-
ring at that state). An important characteristic of these
systems is also that agents need to coordinate their ac-
tions without the need of centralized control.
In this paper, we show that coordinating the ac-
tions of agents (i.e., sensor nodes) can successfully
be done using the reinforcement learning framework
by rewarding successful interactions (e.g., transmis-
sion of a message in a sensor network) and penalizing
the ones with a negative outcome (e.g., overhearing
or packet collisions). This behavior drives nodes to
repeat actions that result in positive feedback more
often and to decrease the probability of unsuccessful
interactions. Agents that tend to select the same suc-
cessful action form a coalition.
A key feature of our approach is that no explicit
notion of coalition is necessary. Rather, these coali-
tions emerge from the global objective of the system,
and agents learn by themselves with whom they have
to (de)synchronize (e.g. to maximize throughput in
a routing problem). Here desynchronization refers to
the situation where one agent’s actions (e.g. waking
up the radio transmitter of a wireless node) are shifted
in time, relative to another, such that the (same) ac-
tions of both agents do not happen at the same time.
For example, two traffic lights at a crossing are desyn-
chronized when the light on one street shows green,
while the other one displays red. In addition, the pat-
terns of these traffic lights need to be synchronized
with those on the nearby crossings to maximize traf-
fic throughput. Thus, globally, one group of lights
at consecutive crossings should be synchronized with
each other and at the same time desynchronized with
others to reach the global objective of high throughput
taking into account the traffic flow.
We illustrate the benefits of combining syn-
chronicity and desynchronicity in a scenario involv-
ing coordination in wireless sensor networks (WSNs).
WSNs form a class of distributed and decentralized
multi-agent systems, whose agents have very limited
computational, communication and energy resources.
A key challenge in these networks lies in the design
of energy-efficient coordination mechanisms, in or-
der to maximize the lifetime of the applications. We
apply our self-adapting RL approach in three wire-
less sensor networks of different topologies, namely
line, mesh and grid. We show that nodes form coalitions
which allow them to reduce packet collisions and end-
to-end latency, and thus to increase the network life-
time. This (de)synchronicity is achieved in a decen-
tralized manner, without any explicit communication,
and without any prior knowledge of the environment.
To the best of our knowledge, we are the first to
present a decentralized approach to coordination in
WSNs without any communication overhead in the
form of additional data required by the learning algo-
rithm.
The paper is organized as follows. Section 2
guides the reader through some related work. It
presents the background of our study and outlines the
communication and routing protocols. The idea behind our approach is presented in Section 3, together with its application in WSNs of three different topologies. The results of this work are analyzed in Section 4, before we conclude in Section 5.
2 RELATED WORK
We proceed with the literature review on two
fronts. We first present the challenging problem of
(de)synchronization (also known as wake-up schedul-
ing) in sensor networks, and then provide an overview
of the synchronization (and desynchronization) mech-
anisms which have been proposed in the literature to
solve the problem.
2.1 Wireless Sensor Networks:
Coordination Challenges
A Wireless Sensor Network is a collection of densely
deployed autonomous devices, called sensor nodes,
which gather data with the help of sensors (Ilyas and
Mahgoub, 2005). The untethered nodes use radio
communication to transmit sensor measurements to a
terminal node, called the sink. The sink is the access
point of the observer, who is able to process the dis-
tributed measurements and obtain useful information
about the monitored environment. Sensor nodes com-
municate over a wireless medium, by using a multi-
hop communication protocol that allows data packets
to be forwarded by neighboring nodes to the sink.
When the WSN is deployed, the routing protocol
requires that nodes determine their hop distance to the
sink, i.e. the minimum number of nodes that will
have to forward their packets. This is achieved by
nodes broadcasting calibration packets only once im-
mediately after deployment. Communication is done
via a standard DATA/ACK protocol and messages are
routed according to a shortest path algorithm with re-
spect to the hop distance (Ilyas and Mahgoub, 2005).
Nodes will start sending a packet at a random slot
within the awake period to increase the chances of
successful transmission for nodes that wake up at the
same time.
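The following minimal sketch illustrates one way this calibration and routing setup could be realized; the names (Node, flood_calibration, next_hop_candidates) are hypothetical, since the paper does not prescribe a concrete implementation, but the shortest-path-to-the-sink logic follows the description above.

```python
from collections import deque

class Node:
    """Hypothetical sensor node: only neighbors and hop distance are modeled."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []        # filled in by the deployment topology
        self.hop_distance = None   # minimum number of forwarding hops to the sink

def flood_calibration(sink):
    """Assign hop distances by a one-time calibration flood started at the sink.

    Each node keeps the hop count of the first calibration packet it hears,
    which corresponds to its minimum hop distance (a breadth-first traversal).
    """
    sink.hop_distance = 0
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for nbr in node.neighbors:
            if nbr.hop_distance is None:        # first calibration packet heard
                nbr.hop_distance = node.hop_distance + 1
                queue.append(nbr)

def next_hop_candidates(node):
    """Shortest-path routing: forward to any neighbor strictly closer to the sink."""
    return [n for n in node.neighbors if n.hop_distance < node.hop_distance]
```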
Since communication is the most energy expen-
sive action (Langendoen, 2008), it is clear that in or-
der to save more energy, a node should turn off its
antenna (or go to sleep). However, when sleeping,
the node is not able to send or receive any messages,
which increases the latency of the network, i.e.,
the time it takes for messages to reach the sink. High
latency is undesirable in any real-time application.
On the other hand, a node does not need to listen to
the channel when no messages are being sent, since
doing so wastes energy. As a result, nodes should deter-
mine on their own when they should be awake within
a frame. This behavior is called wake-up scheduling.
Once a node wakes up, it remains active for a predefined
amount of time, called the duty cycle.
Wake-up scheduling in wireless sensor networks
is an active research domain (Liang et al., 2007;
Paruchuri et al., 2004; Liu and Elhanany, 2006; Co-
hen and Kapchits, 2009). A good survey on wake-up
strategies in WSNs is presented in (Schurgers, 2007).
In (Paruchuri et al., 2004) a randomized algorithm for
asynchronous wake-up scheduling is presented that
relies on densely deployed sensor nodes with means
of localization. It also requires additional data to be
piggybacked onto messages in order to allow for local
decisions based on other nodes' schedules. This book-
keeping of neighbors’ schedules, however, introduces
larger memory requirements and imposes significant
communication overhead. A different asynchronous
protocol for generating wake-up schedules (Zheng
et al., 2003) is formulated as a block design problem. The authors derive theoretical bounds under different communication models and propose a neighbor discovery and schedule bookkeeping protocol that operates on the derived optimal wake-up schedule. However, both proto-
cols rely on localization means and incur communica-
tion overhead by embedding algorithm-specific data
into packets. Adding such data to small packets will
decrease both the throughput and the lifetime of the
network.
A related approach that applies reinforcement
learning in WSNs is presented in (Liu and Elhanany,
2006). As in the former two protocols, this ap-
proach requires nodes to include additional data in the
packet header in order to measure the incoming traffic
load. Moreover, the learning algorithm requires frame
boundaries to be aligned in addition to slot bound-
aries.
2.2 Coordination and Cooperative
Behavior
Coordination and cooperative behavior has recently
been studied for digital organisms in (Knoester and
McKinley, 2009), where it is demonstrated how pop-
ulations of such organisms are able to evolve algo-
rithms for synchronization based on biologically in-
spired models for synchronicity while using minimal
information about their environment. Synchronization based on the Reachback Firefly Algorithm is applied specifically to WSNs in (Werner-Allen et al., 2005). The purpose
of the study is to investigate the realistic radio effects
of synchronicity in WSNs. Two complementary pub-
lications to the aforementioned work present the con-
cept of desynchronicity in wireless sensor networks as
the logical opposite of synchronicity (Degesys et al.,
2007; Patel et al., 2007), where nodes perform their
periodic tasks as far away as possible from all other
nodes. Agents achieve that in a decentralized way
by observing the firing messages of their neighbors
and adjusting their phase accordingly, so that all fir-
ing messages are uniformly distributed in time.
The latter three works are based on the firefly-
inspired mathematical model of pulse-coupled oscil-
lators, introduced by Mirollo and Strogatz (Mirollo
and Strogatz, 1990). In this seminal paper the authors
proved that, using a simple oscillator adjustment func-
tion, any number of pulse-coupled oscillators would
always converge to produce global synchronicity irre-
spective of the initial state. More recently, Lucarelli
and Wang (Lucarelli and Wang, 2004) applied this
concept in the field of WSNs by demonstrating that
it also holds for multi-hop topologies.
The underlying assumption in all of the above
work on coordination is that agents can observe each
other’s actions (or frequencies) and thus adapt their
own policy (or phase), such that the system is driven
to either full synchronicity or full desynchronicity, re-
spectively. However, there are cases where agents are
not able to observe the actions of others, thus render-
ing the above approaches inapplicable for such do-
mains. For example, sensor nodes could be in sleep mode while their neighbors wake up, leaving them unable to detect this event and adjust their wake-up schedules accordingly. Moreover, achieving either
global synchronicity or global desynchronicity alone
in most WSNs can be impractical or even detrimental
to the system (see Section 1). In Section 3 we will
present how we tackle these challenges using our de-
centralized reinforcement learning approach.
A related methodology is the collective intelli-
gence framework (Wolpert and Tumer, 2008). It stud-
ies how to design large multi-agent systems, where
selfish agents learn to optimize a private utility func-
tion, so that the performance of a global utility is in-
creased. This framework, however, requires agents
to store and propagate additional information, such
as neighborhood’s efficiency, in order to compute the
world utility, to which they compare their own perfor-
mance. The approach therefore incurs communication
overhead, which is detrimental to the network
lifetime.
3 (DE)SYNCHRONICITY WITH
REINFORCEMENT LEARNING
This section presents our decentralized approach to
(de)synchronicity using the reinforcement learning
framework. The proposed approach makes very few
assumptions about the underlying networking protocols,
which we discuss in Section 3.1. The subsequent sec-
tions detail the different components of the reinforce-
ment learning mechanism.
3.1 Motivations and Network Model
All the previous approaches discussed in Section 2 re-
quire the agents to exchange information to achieve
coordination. An important feature of our approach
is that coordination is not explicitly agreed upon, but
rather emerges as a result of the implicit communica-
tion generated by the data collection process.
Communication in WSNs is achieved by means
of networking protocols, and in particular by means
of the Medium Access Control (MAC) and the rout-
ing protocols (Ilyas and Mahgoub, 2005). The
MAC protocol is the data communication proto-
col concerned with sharing the wireless transmission
medium among the network nodes. The routing protocol determines where sensor nodes have to transmit their data so that the data eventually reach the sink. A vast amount of literature exists on these two
topics (Ilyas and Mahgoub, 2005), and we sketch in
the following the key requirements for the MAC and
routing protocols so that our reinforcement learning
mechanism presented in Section 3.2 can be imple-
mented. We emphasize that these requirements are
very loose.
We use a simple asynchronous MAC protocol
that divides the time into small discrete units, called
frames, where each frame is further divided into time
slots. The frame and slot duration are application de-
pendent and in our case they are fixed by the user prior
to network deployment. The sensor nodes then rely
on a standard duty cycle mechanism, in which the
node is awake for a predetermined number of slots
during each period. The awake period is fixed by the
user, while the wake-up slot is initialized randomly
for each node. These slots will be shifted as a result
of the learning, which will coordinate nodes’ wake-
up schedules in order to ensure high data throughput
and longer battery life. Each node will learn to be in
active mode when its parents and children are awake,
so that it forwards messages faster (synchronization),
and stay asleep when neighboring nodes on the same
hop are communicating, so that it avoids collisions
and overhearing (desynchronization).
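As a rough illustration of this wake-up model, the sketch below shows the per-node state assumed in the later examples. The SensorAgent class and its field names are hypothetical, the circular treatment of the frame is an additional simplifying assumption, and the example values S = 100 and D = 10 are taken from Section 4.1; only the frame of S slots, the duty cycle of D consecutive slots, and the random initial wake-up slot come from the description above.

```python
import random

S = 100   # number of slots per frame, fixed by the user prior to deployment
D = 10    # duty cycle: number of consecutive awake slots per frame (10% here)

class SensorAgent:
    """Hypothetical per-node wake-up state used in the subsequent sketches."""
    def __init__(self, num_slots=S, duty_cycle=D):
        self.num_slots = num_slots
        self.duty_cycle = duty_cycle
        # The wake-up slot is initialized randomly for each node;
        # learning will later shift it to a better position in the frame.
        self.wake_up_slot = random.randrange(num_slots)

    def is_awake(self, slot):
        """Awake for duty_cycle consecutive slots starting at wake_up_slot."""
        return (slot - self.wake_up_slot) % self.num_slots < self.duty_cycle
```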
The routing protocol is not explicitly part of the
learning algorithm and therefore any multi-hop rout-
ing scheme can be applied without losing the prop-
erties of our approach. In the experimental results,
presented in Section 4, the routing is achieved using a
standard shortest path multi-hop routing mechanism
(cf. Section 2.1). The forwarding nodes need not be explicitly known, as long as they ensure that their distance to the sink is lower than that of the sender. Com-
munication is done using a Carrier Sense Multiple
Access (CSMA) protocol. Successful data reception
is acknowledged with an ACK packet. We would like
to note that the acknowledgment packet is necessary
for the proper and reliable forwarding of messages.
Our algorithm does use this packet to indicate a “cor-
rect reception” in order to formulate one of its reward
signals (see Subsection 4.1). However, this signal is
not crucial for the RL algorithm and thus the latter
can easily function without acknowledgment packets.
Subsection 3.3 will further elaborate on the use of re-
ward signals.
It is noteworthy that the communication partners
of a node (and thus the formation of coalitions) are
influenced by the communication and routing proto-
cols that are in use and not by our algorithm itself.
These protocols only implicitly determine the direc-
tion of the message flow and not who will forward
those messages, since nodes should find out the latter
by themselves.
Figure 1: Examples of routing and coalition formation.
Depending on the routing protocol, coalitions (e.g.,
synchronized groups of nodes) logically emerge
across the different hops, such that there is, if possi-
ble, only one agent from a certain hop within a coali-
tion. Figure 1 illustrates this concept in three differ-
ent topologies. It shows as an example how coalitions
form as a result of the routing protocol. Intuitively,
nodes from one coalition need to synchronize their
wake-up schedules. As defined by the routing pro-
tocol, messages are not sent between nodes from the
same hop, hence these nodes should desynchronize
(or belong to separate coalitions) to avoid communi-
cation interference. The emergence of coalitions will
be experimentally illustrated for different topologies
in Section 4.
3.2 Reinforcement Learning Approach:
Methodology
Each agent in the WSN uses a reinforcement learn-
ing (RL) (Sutton and Barto, 1998) algorithm to learn
an efficient wake-up schedule (i.e. when to remain
active within the frame) that will improve through-
put and lifetime in a distributed manner. It is clear
that learning in multi-agent systems of this type re-
quires careful exploration in order to make the action-
values of agents converge. We use a value iteration
approach similar to single-state Q-learning (Watkins,
1989) with an implicit exploration strategy, on which
subsection 3.5 will further elaborate. However, our up-
date scheme differs from that of traditional Q-learning
(cf. subsection 3.4). The battery power required to run the algorithm is marginal compared to the communication costs and is thus neglected. The main challenge
in such a decentralized approach is to define a suit-
able reward function for the individual agents that will
lead to an effective emergent behavior as a group. To
tackle this challenge, we proceed with the definition
of the basic components of the reinforcement learning
algorithm as used by our methodology in the context
of our application.
3.3 Actions and Rewards
The actions of each agent are restricted to selecting
a time window (or a wake period) within a frame for
staying awake. Since the size of these frames remains
unchanged and they constantly repeat throughout the
network lifetime, our agents use no notion of states,
i.e. we say that our learning system is stateless. The
duration of this wake period is defined by the duty cy-
cle, fixed by the user of the system. In other words,
each node selects a slot within the frame when its ra-
dio will be switched on for the duration of the duty
cycle. Thus, the size of the action space of each
agent is determined by the number of slots within a
frame. In general, the more actions agents have, the
slower the reinforcement learning algorithm will con-
verge (Leng, 2008). On the other hand, a small action
space might lead to suboptimal solutions and will im-
pose an energy burden on the system. Setting the right
number of time slots within a frame requires a study
in itself, which we shall not undertake in this paper due
to space restrictions (see subsection 4.1 for exact val-
ues).
Every node stores a “quality value” (or Q-value)
for each slot within its frame. The value of each slot indicates how beneficial it is for the node to stay awake during that slot in every frame, i.e. what is
an efficient wake-up pattern, given its duty cycle and
considering its communication history. When a com-
munication event occurs at a node (overheard, sent
or received a packet) or if no event occurred during
the wake period (idle listening), that node updates the
quality value of the slot(s) in which this event happened.
The motivation behind this scheme is presented in
subsection 3.5.
3.4 Updates and Action Selection
The slots of agents are initiated with Q-values drawn
from a uniform random distribution between 0 and 1.
Whenever events occur during a node's active period, that node updates the quality values of the slots at which the corresponding events occurred, using the
following update rule:

$$Q^i_s \leftarrow (1 - \alpha) \cdot \hat{Q}^i_s + \alpha \cdot r^i_{s,e}$$
where $Q^i_s \in [0, 1]$ is the quality of slot $s$ within the frame of agent $i$. Intuitively, a high $Q^i_s$ value indicates that it is beneficial for agent $i$ to stay awake during slot $s$. This quality value is updated using the previous Q-value ($\hat{Q}^i_s$) for that slot, the learning rate $\alpha \in [0, 1]$, and the newly obtained reward $r^i_{s,e} \in [0, 1]$ for the event $e$ that (just) occurred in slot $s$. Thus, a node will update as many Q-values as there are events during its active period. In other words, agent $i$ will update the value $Q^i_s$ for each slot $s$ where an event $e$ occurred. The latter update scheme differs from that of traditional Q-learning (Watkins, 1989), where only the Q-value of the selected action is updated. The motivation behind this update scheme is presented in subsection 3.5. In addition, we set here the future discount parameter $\gamma$ to 0, since our agents are stateless (or single-state).
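A minimal sketch of this per-slot update follows; the SlotQTable class is hypothetical, while the uniform random initialization, the constant learning rate of 0.1, and the exponential-smoothing form of the update are taken from the text.

```python
import random

ALPHA = 0.1   # constant learning rate, chosen empirically (cf. subsection 3.5)

class SlotQTable:
    """Hypothetical per-slot Q-table implementing the update rule above."""
    def __init__(self, num_slots=100):
        # One Q-value per slot, initialized uniformly at random in [0, 1].
        self.q = [random.random() for _ in range(num_slots)]

    def update(self, slot, reward, alpha=ALPHA):
        """Q_s <- (1 - alpha) * Q_s + alpha * r_{s,e}.

        Called once per event (sent, received, overheard, idle) at the slot in
        which the event occurred, so several slots may be updated within one
        active period; the discount gamma is effectively 0 (stateless agents).
        """
        self.q[slot] = (1.0 - alpha) * self.q[slot] + alpha * reward
```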
Nodes will stay awake for those consecutive time slots that have the highest sum of Q-values. Put differently, each agent selects the action $a_{s_0}$ (i.e., wake up at slot $s_0$) that maximizes the sum of the Q-values of the $D$ consecutive time slots, where $D$ is the duty cycle, fixed by the user. Formally, agent $i$ will wake up at slot $s_0$, where

$$s_0 = \operatorname*{argmax}_{s \in S} \; \sum_{j=0}^{D-1} Q^i_{s+j}$$
For example, if the required duty cycle of the nodes
is set to 10% (D = 10 for a frame of S = 100 slots),
each node will stay active for those 10 consecutive
slots within its frame that have the highest sum of Q-
values. Conversely, for all other slots the agent will
remain asleep, since its Q-values indicate that it is less
beneficial to stay active during that time. Nodes will
update the Q-value of each slot for which an event
occurs within its duty cycle. Thus, when forwarding
messages to the sink, over time, nodes acquire suffi-
cient information on “slot quality” to determine the
best period within the frame to stay awake. This be-
havior makes neighboring nodes (de)synchronize
their actions, resulting in faster message delivery and
thus lower end-to-end latency.
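The following sketch shows how the wake-up slot could be selected according to this rule; the helper name is hypothetical, and treating the frame as circular (windows may wrap around its end) is an additional assumption.

```python
def select_wake_up_slot(q, duty_cycle):
    """Return s0 maximizing the sum of q over the D consecutive slots from s0.

    q is the list of per-slot Q-values (length S) and duty_cycle is D; the
    frame is treated as circular, so a window may wrap around its end.
    """
    num_slots = len(q)
    best_slot, best_sum = 0, float("-inf")
    for s in range(num_slots):
        window_sum = sum(q[(s + j) % num_slots] for j in range(duty_cycle))
        if window_sum > best_sum:
            best_slot, best_sum = s, window_sum
    return best_slot
```

For the 10% duty cycle in the example above (S = 100, D = 10), a node would thus stay awake for the ten consecutive slots starting at select_wake_up_slot(q, 10) and sleep for the remaining ninety.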
3.5 Exploration
As explained in the above two subsections, active
time slots are updated individually, regardless of
when the node wakes up. The reason for this choice
is threefold. Firstly, this allows each slot to be ex-
plored and updated more frequently. For example,
slot s will be updated when the node wakes up anywhere between slots $s - D + 1$ and $s$, i.e. in D out of S possible actions. Secondly, updating individ-
ual Q-values makes it possible to alter the duty cycle
of nodes at run time (as some preliminary results, not
displayed in this paper, suggest) without invalidating
the Q-values of slots. In contrast, if a Q-value was
computed for each start slot s, i.e. the reward was ac-
cumulated over the wake duration and stored at slot s
only, changing the duty cycle at run time would render
the computed Q-values useless, since the reward was
accumulated over a different duration. In addition,
slot s would be updated only when the agent wakes up at that slot, so a separate exploration strategy would be required to ensure that this action is explored sufficiently. Thirdly, our exploration scheme will contin-
uously explore and update not only the wake-up slot,
but all slots within the awake period. Treating slots
individually results in an implicit exploration scheme
that requires no additional tuning.
Even though agents employ a greedy policy (se-
lecting the action that gives the highest sum of Q-
values), this “smooth” exploration strategy ensures
that all slots are explored and updated regularly at the
start of the application (since values are initiated ran-
domly), until the sum of Q-values of one group of
slots becomes strictly larger than the rest. In that case
we say that the policy has converged and thus explo-
ration has stopped.
The speed of convergence is influenced by the
duty cycle, fixed by the user, and the learning rate,
which we empirically chose to be 0.1. A constant
learning rate is in fact desirable in a non-stationary
environment to ensure that policies will change with
respect to the most recently received rewards (Sutton
and Barto, 1998).
4 RESULTS
We proceed with the experimental comparison be-
tween our (de)synchronization approach and a fully
synchronized one that is typically used in practice,
such as S-MAC (Ye et al., 2004; Liu and Elhanany,
2006).
4.1 Experimental Setup
We applied our approach on three networks of dif-
ferent size and topology. In particular, we investi-
gate two extreme cases where nodes are arranged in
a 5-hop line (Figure 2(a)) and a 6-node single-hop
mesh topology (Figure 3(a)). The former requires nodes to synchronize in order to successfully forward messages to the sink. Intuitively, if any one node is awake while the others are asleep, that node would not be able to forward its messages to the sink. Con-
versely, in the mesh topology it is most beneficial for
nodes to fully desynchronize to avoid communication
interference with neighboring nodes. Moreover, the
sink is able to communicate with only one node at a
time. The third topology is a 4 by 4 grid (Figure 4(a))
where sensing agents need to both synchronize with
some nodes and at the same time desynchronize with
others to maximize throughput and network lifetime.
The latter topology clearly illustrates the importance
of combining synchronicity and desynchronicity, as
neither one of the two behaviors alone achieves the
global system objectives. Subsection 4.2 will con-
firm these claims and will elaborate on the obtained
results.
Table 1: Parameters for experimental topologies.

  topology   num.    num.    avg. num.    data rate
  type       nodes   hops    neighbors    (msg/sec)
  line         5       5        1.8          0.10
  mesh         6       1        3.0          0.33
  grid        16       4        3.25         0.05
Each network was run for 3600 seconds in the
OMNeT++ simulator (http://www.omnetpp.org/), a
C++ simulation library and framework. Results
were averaged over 50 runs with the same topologies,
but with different data sampling times. Table 1 sum-
marizes the differences between the experimental pa-
rameters of the three topologies.
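Purely for reference, the hypothetical configuration below collects the values of Table 1 and the simulation settings mentioned above; topology construction itself is not shown.

```python
# Experimental parameters from Table 1, one entry per topology.
TOPOLOGIES = {
    "line": {"nodes": 5,  "hops": 5, "avg_neighbors": 1.8,  "data_rate_msg_per_sec": 0.10},
    "mesh": {"nodes": 6,  "hops": 1, "avg_neighbors": 3.0,  "data_rate_msg_per_sec": 0.33},
    "grid": {"nodes": 16, "hops": 4, "avg_neighbors": 3.25, "data_rate_msg_per_sec": 0.05},
}

SIM_DURATION_SEC = 3600   # simulated time per run (OMNeT++)
NUM_RUNS = 50             # runs per topology, averaged in the reported results
```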
Frames were divided into S = 100 slots of 10 mil-
liseconds each, and we modeled five different events,
namely overhearing (r = 0), idle listening (r = 0 for
each idle slot), successful transmission (r = 1 if ACK
received), unsuccessful transmission (r = 0 if no ACK
received) and successful reception (r = 1). Maxi-
mizing the throughput requires both proper transmis-
sion and proper reception. Therefore, we treat
the two corresponding rewards equally. Furthermore,
most radio chips require nearly the same energy for
sending, receiving (or overhearing) and (idle) listen-
ing (Langendoen, 2008), making the three rewards
equal. We consider these five events to be the most energy-expensive or latency-critical ones in wireless communication. Additional events were also modeled,
but they were either statistically insignificant (such as
busy channel) or already covered (such as unsuccess-
ful transmissions instead of collisions).
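A compact sketch of this event-to-reward mapping follows; the event labels are hypothetical identifiers, while the reward values are those listed above.

```python
# Binary rewards r in {0, 1} for the five modeled communication events.
EVENT_REWARDS = {
    "overhearing": 0.0,               # overheard a packet addressed to another node
    "idle_listening": 0.0,            # awake slot without any communication event
    "successful_transmission": 1.0,   # DATA sent and ACK received
    "unsuccessful_transmission": 0.0, # no ACK received
    "successful_reception": 1.0,      # DATA packet correctly received
}

def reward_for(event):
    """Return the reward r_{s,e} used to update the Q-value of the slot in
    which the event occurred (cf. subsection 3.4)."""
    return EVENT_REWARDS[event]
```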
Due to the exponential smoothing nature of the
reward update function (cf. subsection 3.4) the Q-
values of slots will be shifted towards the latest reward
they receive. We would expect that the “goodness” of
slots will decrease for negative events, and will in-
crease for successful communication. Therefore, the
feedback agents receive is binary, i.e. $r^i_{s,e} \in \{0, 1\}$,
since it carries the necessary information. Other re-
ward signals were also evaluated, resulting in similar
performance.
To better illustrate the importance and effect of
(de)synchronicity in these topologies, we compare
our approach to networks where all nodes wake up at
the same time. All other components of the compared
networks, such as the routing and CSMA communi-
cation protocols, remain the same. In other words, we
compare our RL technique to networks with no coor-
dination mechanism, but which employ some means
of time synchronization, the small overhead of which
will be neglected for the sake of a clearer exposition.
This synchronized approach ensures high network
throughput and is already used by a number of state-
of-the-art MAC protocols, such as RL-MAC (Liu and
Elhanany, 2006) and S-MAC (Ye et al., 2004). How-
ever, as we will demonstrate in subsection 4.2, syn-
chronization alone will in fact decrease system per-
formance.
4.2 Evaluation
Figure 2(b) displays the resulting schedule of the line
topology (Figure 2(a)) after the action of each agent
converges. The results indicate that nodes have suc-
cessfully learned to stay awake at the same time in
order for messages to be properly forwarded to the
sink. In other words, we observe that all nodes belong
to the same coalition, as suggested in Figure 1. If any
one node in the line topology had remained active during the sleep period of the others, its messages, together with those of its higher-hop neighbors, would not have been delivered to the sink. Even though neighboring
nodes are awake at the same time (or have synchro-
nized), one can see that schedules are slightly shifted
in time. The reason for this desynchronicity is to re-
duce overhearing of messages from lower-hop nodes
and to increase throughput, a behavior that nodes
have learned by themselves. The size of this time
shift depends on the time at which nodes send data
within their awake period. As mentioned in subsec-
tion 2.1, messages are sent at a random slot within a
frame. Therefore, the difference between the wake-up
times is small enough to increase the chance of suc-
cessful transmissions and large enough to ensure fast
throughput, compensating for propagation delays.
In the line topology this time shift is, however, marginal
compared to the active period, and therefore the per-
formance of the learning nodes is comparable to that
of the fully synchronized network. This effect can
be observed in Figure 2(c), which displays the end-
to-end latency of the learning and the synchronized
nodes, respectively.

Figure 2: Experimental results for the line topology. (a) Line topology; (b) resulting wake-up schedule after convergence for a duty cycle of 10%; (c) end-to-end latency for different duty cycles.

We can conclude from the graph that in a line topology, where nodes have no neighbors on the same hop, the improvement of our learning algorithm over a synchronized network is marginal. As mentioned above, the reason for this com-
parable performance lies in the fact that a successful
message forwarding in the line topology requires syn-
chronicity. Nevertheless, our agents are able to inde-
pendently achieve this behavior without any commu-
nication overhead.
Figure 3: Experimental results for the mesh topology. (a) Mesh topology; (b) resulting wake-up schedule after convergence for a duty cycle of 10%; (c) end-to-end latency for different duty cycles.
In contrast to the previous topology, our second
set of experiments investigates the performance of the
network where all nodes lie on the same hop from
the sink. This setup presents agents with the opposite
challenge, namely to find an active period where no
other node is awake. The latter behavior will elim-
inate communication interference with neighboring
nodes and will ensure proper reception of messages
at the sink. Figure 3(b) displays the wake-up sched-
ule of the learning nodes for a duty cycle of 10% af-
ter the actions of agents converge. One can observe
that the state of desynchronicity has been success-
fully achieved where each node is active at a different
time within a frame. Put differently, each node has
chosen a different wake-up slot and therefore belongs to a different coalition. The benefit of this desynchronized pattern is clearly evident in Figure 3(c), where we compare it to the latency of the synchronized system. For duty cycles lower than 25% the reduction in latency is significant, as compared to a synchronized
network of the same topology.
Lastly, we investigate a “mix” of the above two
topologies, namely a grid shown in Figure 4(a).
Nodes here need to synchronize with those that lie
on the same branch of the routing tree to ensure high
throughput, while at the same time desynchronize
with neighboring branches to avoid communication
interference. The resulting schedule of the learning
nodes after convergence is displayed in Figure 4(b).
As expected, the four columns of nodes belong to
four different coalitions, where nodes in one coalition
are synchronized with each other (being active nearly
at the same time) and desynchronized with the other
coalitions (sleeping while others are active). This is
the state of (de)synchronicity. Nodes in one coali-
tion exhibit comparable behavior to those in a line
topology, i.e. they have synchronized with each other,
while still slightly shifted in time. At the same time
nodes on one hop have learned to desynchronize their
active times, similarly to the mesh topology.
The result of applying our learning approach in a
grid topology for various duty cycles can be observed
in Figure 4(c). It displays the average end-to-end la-
tency of the network when using synchronicity and
(de)synchronicity respectively. For duty cycles lower
than 20% the improvements of our learning nodes
over a synchronized network are substantial. Due
to the high data rate, the synchronized nodes are inca-
pable of delivering all packets within the short awake
period, resulting in increased latency and high packet
loss. The high number of dropped packets at low duty
cycles is simply intolerable for real-time applications.
This reduced performance at low duty cycles is due
to the large number of collisions and re-transmissions
necessary when all nodes wake up at the same time.
The learning approach on the other hand drives nodes
to coordinate their wake-up cycles and shift them in
time, such that nodes at neighboring coalitions desyn-
chronize their awake periods. In doing so, nodes ef-
fectively avoid collisions and overhearing, leading to
lower end-to-end latency and longer network lifetime.
When nodes coordinate their actions, they effec-
tively reduce communication interference with neigh-
boring nodes. This behavior results in a lower number of overheard packets, fewer collisions, and therefore fewer retries to forward a message, as compared to the fully synchronized network.

Figure 4: Experimental results for the grid topology. (a) Grid topology; (b) resulting wake-up schedule after convergence for a duty cycle of 10%; (c) end-to-end latency for different duty cycles.
Finally, we would like to mention the convergence
rate of the learning agents. The implicit exploration
scheme, described in subsection 3.5, makes nodes select different actions in the beginning of the simulation in order to determine their quality. As time progresses, the Q-values of slots are updated often enough to make the policy of the agents converge. We measured that, on average, after 500 iterations the actions of the agents no longer change, and thus the state of
(de)synchronicity has been reached. In other words,
after 500 seconds each node finds the wake-up sched-
ule that improves message throughput and minimizes
communication interference. This duration is suffi-
ciently small compared to the lifetime of the system
for a static WSN, which is on the order of several days
up to a couple of years, depending on the duty cycle
and the hardware characteristics (Ilyas and Mahgoub,
2005).
4.3 Discussion
The main feature of the proposed reinforcement
learning approach is that nodes learn to cooperate
by simply observing their own reward signals while
forwarding messages. As a result, it drives agents
to coordinate their wake-up cycles without any
communication overhead. This behavior ensures that
neighboring nodes avoid communication interference
without exchanging additional data. The synchro-
nized approach, on the other hand, lets all nodes be
awake at the same time. Thus, for low duty cycles, the
risks of collisions and therefore re-transmissions are
increased. This was confirmed by our experimental
results, where we observed that (de)synchronicity
can increase the system performance in mesh and
grid topologies for low duty cycles as compared to a
standard, fully synchronized approach.
When duty cycles of (synchronized) agents be-
come larger, nodes have less chance of collisions and
hence re-transmissions, leading to decreased latency
and no packet loss. As mentioned in subsection 2.1,
packets are sent at random slots within the active
period. Thus, the negative effect of being awake
at the same time becomes less pronounced as the
duty cycle increases. Similarly, as the number of
messages per active period decreases, the learning
agents receive fewer reward signals, leading to slower
convergence and poorer adaptive behavior. In such
cases it is less beneficial to apply our RL approach,
as it performs similarly to the fully synchronized
one. Nevertheless, synchronized nodes still overhear
packets of neighbors, resulting in higher battery
consumption as compared to nodes that use our
learning algorithm.
Finally, different directions may be considered to improve the efficiency of the proposed (de)synchronization policy. First, we only considered fixed duty cycles, whereas the traffic in a WSN is often uneven. In particular, nodes closer to the
sink often undergo higher traffic loads due to data
forwarding. A possible approach would be to adopt
a cross-layering strategy where information from
the routing layer, such as the node depth or the
expected traffic load, would be used to adapt the duty
cycle. Second, the RL approach is not well suited for
irregular data traffic, as this type of traffic is likely
to impair the convergence rate of the RL procedure.
A related issue concerns clock drift, which may also cause the learning procedure to fail to converge to a stable schedule. We plan to address these
issues in our future work.
5 CONCLUSIONS
In this paper we presented a decentralized rein-
forcement learning (RL) approach for self-organizing
wake-up scheduling in wireless sensor networks
(WSNs). Our approach drives nodes to coordinate
their wake-up cycles based only on local interactions.
In doing so, agents independently learn both to syn-
chronize their active periods with some nodes, so that
message throughput is improved, and at the same time
to desynchronize with others in order to reduce com-
munication interference. We refer to this concept
as (de)synchronicity. We investigated three different
topologies and showed that agents are able to inde-
pendently adapt their wake-up cycles to the routing tree of
the network. For high data rates this adaptive behav-
ior improves both the throughput and lifetime of the
system, as compared to a fully synchronized approach
where all nodes wake up at the same time. We demon-
strated how initially randomized wake-up schedules
successfully converge to the state of (de)synchronicity
without any form of explicit coordination. As a result,
our approach makes it possible for agent coordination to emerge rather than be agreed upon.
ACKNOWLEDGEMENTS
The authors of this paper would like to thank the
anonymous reviewers for their useful comments and
valuable suggestions. This research is funded by the
agency for Innovation by Science and Technology
(IWT), project IWT60837; and by the Research Foun-
dation - Flanders (FWO), project G.0219.09N.
REFERENCES
Cohen, R. and Kapchits, B. (2009). An optimal wake-
up scheduling algorithm for minimizing energy con-
sumption while limiting maximum delay in a mesh
sensor network. IEEE/ACM Trans. Netw., 17(2):570–
581.
Degesys, J., Rose, I., Patel, A., and Nagpal, R. (2007).
Desync: self-organizing desynchronization and tdma
on wireless sensor networks. In Proceedings of the 6th
IPSN, pages 11–20, New York, NY, USA. ACM.
OMNeT++: a C++ simulation library and framework.
http://www.omnetpp.org/.
Ilyas, M. and Mahgoub, I. (2005). Handbook of sensor net-
works: compact wireless and wired sensing systems.
CRC.
Knoester, D. B. and McKinley, P. K. (2009). Evolving vir-
tual fireflies. In Proceedings of the 10th ECAL, Bu-
dapest, Hungary.
Langendoen, K. (2008). Medium access control in wireless
sensor networks. Medium access control in wireless
networks, 2:535–560.
Leng, J. (2008). Reinforcement learning and convergence
analysis with applications to agent-based systems.
PhD thesis, University of South Australia.
Liang, S., Tang, Y., and Zhu, Q. (2007). Passive wake-up
scheme for wireless sensor networks. In Proceedings
of the 2nd ICICIC, page 507, Washington, DC, USA.
IEEE Computer Society.
Liu, Z. and Elhanany, I. (2006). Rl-mac: a reinforcement
learning based mac protocol for wireless sensor net-
works. Int. J. Sen. Netw., 1(3/4):117–124.
Lucarelli, D. and Wang, I.-J. (2004). Decentralized syn-
chronization protocols with nearest neighbor commu-
nication. In Proceedings of the 2nd SenSys, pages 62–
68, New York, NY, USA. ACM.
Mirollo, R. E. and Strogatz, S. H. (1990). Synchronization
of pulse-coupled biological oscillators. SIAM J. Appl.
Math., 50(6):1645–1662.
Paruchuri, V., Basavaraju, S., Durresi, A., Kannan, R., and
Iyengar, S. S. (2004). Random asynchronous wakeup
protocol for sensor networks. In Proceedings of the
1st BROADNETS, pages 710–717, Washington, DC,
USA. IEEE Computer Society.
Patel, A., Degesys, J., and Nagpal, R. (2007). Desynchro-
nization: The theory of self-organizing algorithms for
round-robin scheduling. In Proceedings of the 1st
SASO, pages 87–96, Washington, DC, USA. IEEE
Computer Society.
Schurgers, C. (2007). Wireless Sensor Networks and Appli-
cations, chapter Wakeup Strategies in Wireless Sensor
Networks, page 26. Springer.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learn-
ing: An Introduction. MIT Press.
Watkins, C. (1989). Learning from delayed rewards. PhD
thesis, University of Cambridge, England.
Werner-Allen, G., Tewari, G., Patel, A., Welsh, M., and
Nagpal, R. (2005). Firefly-inspired sensor network
synchronicity with realistic radio effects. In Proceed-
ings of the 3rd SenSys, pages 142–153, New York, NY,
USA. ACM.
Wolpert, D. H. and Tumer, K. (2008). An introduction
to collective intelligence. Technical Report NASA-
ARC-IC-99-63, NASA Ames Research Center.
Ye, W., Heidemann, J., and Estrin, D. (2004). Medium
access control with coordinated adaptive sleeping for
wireless sensor networks. IEEE/ACM Trans. Netw.,
12(3):493–506.
Zheng, R., Hou, J. C., and Sha, L. (2003). Asynchronous
wakeup for ad hoc networks. In Proceedings of the 4th
MobiHoc, pages 35–45, New York, NY, USA. ACM.