Andrea Bonarini, Alessandro Lazaric and Marcello Restelli
Department of Electronics and Information, Politecnico di Milano
piazza Leonardo Da Vinci, 32
20133, Milan, Italy
Keywords: Robot Learning, Reinforcement Learning.
Writing good behaviors for mobile robots is a hard task that requires a lot of hand tuning and often fails to
consider all the possible configurations that a robot may face. By using reinforcement learning techniques a
robot can improve its performance through a direct interaction with the surrounding environment and adapt its
behavior in response to some non-stationary events, thus achieving a higher degree of autonomy with respect
to pre-programmed robots. In this paper, we propose a novel reinforcement learning approach that addresses
the main issues of learning in real-world robotic applications: experience is expensive, explorative actions
are risky, control policy must be robust, state space is continuous. Preliminary results obtained on a real
robot suggest that on-line reinforcement learning, combined with some specific solutions, can be effective also in
real-world physical environments.
A lot of research efforts have been spent in robotics
to identify control architectures with the aim of mak-
ing the writing of control programs easier. Although
many advances have been made, it is often difficult
for a programmer to specify how to achieve the de-
sired solution. Furthermore, it is hard to take into
consideration all the possible configurations the robot
may face or the changes that may occur in the environment.
Reinforcement Learning (RL) (Sutton and Barto,
1998) is a well-studied set of techniques that allow
an agent to achieve, by trial-and-error, optimal poli-
cies (i.e., policies that maximize the expected sum of
observed rewards) without any a priori information
about the problem to be solved. In the RL paradigm,
the programmer, instead of programming how the
robot should behave, has just to specify a reward function
that models how good an action is when taken in
a given state. This level of abstraction makes it possible to write
specifications for the robot behavior in a short time
and to obtain better and more robust policies with re-
spect to hand-written control code.
Despite the huge research efforts in the RL field,
the application of RL algorithms to real-world robotic
problems is quite limited. The difficulty to gather ex-
perience, the necessity to avoid dangerous configura-
tions, the presence of continuous state variables are
some of the features that make the application of RL
techniques to robotic tasks complex.
In this paper, we analyze the main issues that must
be faced by learning robots and propose a set of tech-
niques aimed at making the RL approach more ef-
fective in real robotic tasks. In particular, our main
contributions are the introduction of the lower bound
update strategy, which makes it possible to learn robust
policies without a complete exploration of the
whole state-action space, and the use of piecewise-constant
policies with reward accumulation, which makes
it possible to learn efficiently even in the presence of coarse
discretizations of the state space. Experimental re-
sults carried out with a real robot show that the pro-
posed learning techniques are effective in making the
learning process more stable than traditional RL algorithms.
In the next section, we will briefly review the
main approaches proposed in literature to overcome
the problems described above. In Section 3, we intro-
duce the RL framework and present the details of our
Bonarini A., Lazaric A. and Restelli M. (2007).
In Proceedings of the Fourth International Conference on Informatics in Control, Automation and Robotics, pages 214-221.
DOI: 10.5220/0001649102140221
algorithm. The results of the experimental validation,
carried out on a real robot with a soccer task, are
reported in Section 4. We draw conclusions and propose
future research directions in Section 5.
Given the complexity of the development of robotic
applications, the possibility to exploit learning tech-
niques is really appealing. In particular, the reinforce-
ment learning research field provided a number of al-
gorithms that allow an agent to learn to behave opti-
mally by direct interaction with the environment with-
out any a priori information.
Up to now, excluding a few notable exceptions,
the application of the RL approach has been success-
ful only in small gridworlds and simple simulated
control problems. The success obtained in more com-
plex domains (Tesauro, 1995; Sutton, 1996) is mainly
due to ad-hoc solutions and the exploitation of domain
dependent information.
Several difficulties prevent the use of pure RL
methods in real world robotic applications. Since di-
rect experience is the main source of information for
an RL algorithm, the learning agent needs to repeat-
edly interact with the world executing each available
action in every state. While in software domains it
is possible to perform a large number of trials and
to place the agent in arbitrary states, in real robotic
domains there are several factors (limited battery ca-
pacity, blocking states, mechanical or electrical faults,
etc.) that make learning from scratch not feasible.
Several works in the robotic area have studied dif-
ferent solutions to reduce the amount of direct expe-
rience required by RL algorithms. A typical solution
consists of performing extensive training sessions us-
ing a physical simulator (Morimoto and Doya, 2000).
When the simulated robot achieves a good perfor-
mance the learned policy is applied on the physical
robot and the learning process goes on with the aim
of adjusting it to the real conditions. Although this
approach can be really effective, for many robotic do-
mains it is too hard to realize good enough simulation
environments. Furthermore, it may happen that the
approximation introduced in the simulation is such
that the knowledge gathered in the simulated learning
phase is almost useless for the real robot. Another ap-
proach that was originally proposed for robotic tasks,
but that has found common application in other do-
mains, is experience replay (Lin, 1992). The idea
is that the robot stores data about states, actions,
and rewards experienced, and fictitiously repeats the
same moves thus performing more updates that speed
up the propagation of the rewards and the conver-
gence to the optimal value function. Although this
approach succeeds in speeding up the learning pro-
cess, it still requires an expensive exploration phase
to gather enough information. To overcome this prob-
lem, Lin adopts a human teacher to show the robot
several instances of reactive sequences that achieve
the task in order to bias the exploration to promising
regions of the action space. The same goal is pursued
in (Millán, 1996), but instead of a teacher it requires
a set of pre-programmed behaviors to focus the ex-
ploration on promising parts of the action space when
the robot faces new situations. In this paper, we fol-
low the approach proposed by (Smart and Kaelbling,
2002), which effectively provides prior knowledge by
splitting the learning process in two phases. In the
first phase, example trajectories are supplied to the
robot (by automatic control or by human guidance)
through a control policy and the RL system passively
watches the experienced states, actions and rewards
with the aim of bootstrapping information into the
value-function. Once enough data has been collected,
the second phase starts and the robot is completely
controlled by the RL algorithm. In problems with
sparse reward functions, without any hint, the robot
would take a huge number of steps before collect-
ing some significant reward, thus making the learning
process prohibitive. On the contrary, the “supervised”
phase is an initialization of the learning process, so
that it can initially avoid a fully random exploration of
the environment. Furthermore, differently from imitation
learning methods, this approach makes it possible to
supply prior knowledge without knowing anything about
inverse kinematics.
Learning in real-world environments requires dealing
with dangerous actions that may harm the robot
or humans, and with stalling states, i.e., configura-
tions that prevent the robot from autonomously go-
ing on with the learning process. Again, example tra-
jectories can be effective to provide safe policies that
avoid harmful situations. Another way to reduce the
risk of performing dangerous actions is to use minimax
learning (like the Q̂-learning algorithm (Heger,
1994)), where the robot, instead of maximizing the
expected sum of discounted rewards, tries to maxi-
mize the value of the worst case. This kind of pes-
simistic learning has led to good results in stochas-
tic (Heger, 1994), partially observable (Buffet and
Aberdeen, 2006), and multi-agent (Littman, 1994)
problems, also proving robust with respect to
changes in the problem parametrization, thus allowing
the reuse of the learned policy in different operating
conditions. Unfortunately, algorithms like Q̂-learning
need to perform an exhaustive search through the
action space, but this is not feasible in a real-world
robotic context. In the following sections, we propose
a variant of Q̂-learning that is able to find safe policies
without requiring the complete exploration of the
action space.
Another relevant issue in real-world robotic appli-
cations is that both the state and action spaces are
continuous. Usually, this problem is faced by us-
ing function approximators such as state aggregation,
CMAC (Sutton, 1996) or neural networks, in order to
approximate the value function over the state space.
Although these techniques obtained relevant results
in supervised learning, they require long hand tun-
ing and may result in highly unstable learning pro-
cesses and even divergence. Although state aggregation
is one of the most stable function approximators,
its performance is strictly related
to the width of the aggregated states, and algorithms
like Q-learning may have very poor performance even
in simple continuous problems, unless a very fine dis-
cretization of the state space is used. In this paper, we
propose a learning technique that makes it possible to achieve
good policies even in the presence of large state aggrega-
tions, thus exploiting their generalization properties
to reduce the learning times.
As discussed in previous sections, learning in noisy
continuous state spaces is a difficult task for many
different reasons. In this section, after the introduc-
tion of the formal description of the RL framework,
we detail a novel algorithm based on the idea of the
computation of a lower bound for a piecewise con-
stant policy.
3.1 The Reinforcement Learning Framework
RL algorithms deal with the problem of learning how
to behave in order to maximize a reinforcement signal
by a direct interaction with a stochastic environment.
Usually, the environment is formalized as a finite-state
discrete Markov Decision Process (MDP), defined by:
1. A set of states S = {s_1, ..., s_N};
2. A set of actions A = {a_1, ..., a_M};
3. A transition model P(s, a, s') that gives the probability
of getting to state s' from state s by taking action a;
4. A reward function R(s, a) that gives the value of
taking action a in state s.
Furthermore, each MDP satisfies the Markov property:
P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0), (1)
that is, the probability of getting to state s_{t+1} at time t + 1
depends only on the state and action at the previous
time step and not on the history of the system.
A deterministic policy π : S → A is a function that
maps each state in the environment to the action to
be executed by the agent. The action value function
Q^π(s, a) measures the utility of taking action a in state s
and following the policy π thereafter:
Q^π(s, a) = R(s, a) + γ E_{s'}[Q^π(s', π(s'))], (2)
where γ ∈ [0, 1) is a discount factor that weights recent
rewards more than those in the future.
The goal of the agent is to learn the optimal policy
π* that maximizes the expected discounted reward in
each state. The action value function corresponding
to the optimal policy can be computed by solving the
following Bellman equation (Bellman, 1957):
Q*(s, a) = R(s, a) + γ Σ_{s'} P(s, a, s') max_{a'} Q*(s', a'). (3)
Thus, the optimal policy can be defined as the greedy
action in each state:
π*(s) = argmax_a Q*(s, a). (4)
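As a small illustration of Eq. 3 and Eq. 4 (the toy MDP below is ours, not from the paper), value iteration applied to the Bellman optimality equation yields Q* and the greedy policy:

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (next_state, probability); R[s][a] is the reward.
P = {0: {0: [(0, 1.0)], 1: [(1, 0.9), (0, 0.1)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
gamma = 0.9

# Solve the Bellman optimality equation (Eq. 3) by value iteration:
# repeatedly apply Q(s,a) <- R(s,a) + gamma * sum_s' P(s,a,s') max_a' Q(s',a').
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
for _ in range(200):
    Q = {s: {a: R[s][a] + gamma * sum(p * max(Q[s2].values())
                                      for s2, p in P[s][a])
             for a in (0, 1)} for s in (0, 1)}

# Greedy policy (Eq. 4): the argmax action in each state.
policy = {s: max(Q[s], key=Q[s].get) for s in Q}
```

Here the self-loop in state 1 under action 1 pays 1 per step, so Q*(1, 1) approaches 1/(1 − γ) = 10 and the greedy policy moves toward state 1.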
Q-learning (Watkins and Dayan, 1992) is a model-free
algorithm that incrementally approximates the
solution through a direct interaction with the environment.
At each time step, the action value function
Q(s, a) is updated according to the reward received by
the agent and to the estimation of the future expected reward:
Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [r + γ max_{a'} Q_t(s', a')],
where r = R(s, a) and α is the learning rate. In the
following, we will refer to the term in square brackets
as the target value:
U(s, a, s') = R(s, a) + γ max_{a'} Q_t(s', a'). (5)
When α decreases to 0 according to the Robbins-
Monro (Sutton and Barto, 1998) conditions and each
state-action pair is visited infinitely often, the algo-
rithm is proved to converge to the optimal action value
function. Usually, at each time step the action to be
executed is chosen according to an explorative pol-
icy that balances a wide exploration of the environ-
ment and the exploitation of the learned policy. One
of the most used exploration policies is ε-greedy (Sutton,
1996), which chooses the greedy action with probability
1 − ε and a random action with probability ε.
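The tabular Q-learning update with ε-greedy exploration can be sketched as follows (a hypothetical four-state chain task of our own, not from the paper):

```python
import random

random.seed(0)

# Hypothetical deterministic chain MDP: states 0..3, actions 0 (stay)
# and 1 (move right); reaching state 3 yields reward +1 and ends a trial.
def step(s, a):
    s2 = min(s + a, 3)
    return s2, (1.0 if s2 == 3 else 0.0)

gamma, alpha, eps = 0.9, 0.5, 0.1
Q = [[0.0, 0.0] for _ in range(4)]

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy: random action with probability eps, else greedy
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] >= Q[s][1] else 1
        s2, r = step(s, a)
        # Q-learning update with target r + gamma * max_a' Q(s', a')
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s2]))
        s = s2
        if s == 3:
            break
```

After training, the greedy action in every non-terminal state is "move right", since the discounted targets propagate the goal reward backwards along the chain.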
3.2 Local Exploration Strategy
In order to avoid an exhaustive exploration of the ac-
tion space, a more sophisticated exploration policy
can be adopted. As suggested in (Smart and Kael-
bling, 2002), a “supervised” phase performed using a
sub-optimal controller is an effective way to initial-
ize the action value function. This way, the learning
process is bootstrapped by a hand-coded controller
whose policy is optimized when the control of the
robot is passed to the learning algorithm. Even if this
technique is effective in avoiding an initial random exploration,
the ε-greedy exploration policy used thereafter
does not prevent many useless, and potentially
dangerous, actions from being explored. To reduce this prob-
lem, we propose to adopt a local ε-greedy exploration
policy. Since the learning process is initialized using
a sub-optimal controller, it is preferable to perform an
exploration in the range of the greedy action, instead
of completely random actions. Therefore, with probability
ε a locally explorative action is drawn from a
uniform probability distribution over a δ-interval around
the greedy action a*:
a ∼ U(a* − δ, a* + δ). (6)
Even if the local ε-greedy exploration policy is not
guaranteed to avoid dangerous explorative actions,
in many robotic applications it is likely to be safer
and to converge to the optimal policy in fewer learning
episodes than the usual ε-greedy policy. In fact, it is
based on the assumption (often verified in robotic ap-
plications) that the optimal policy can be obtained by
small changes to the sub-optimal controller.
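The local ε-greedy rule of Eq. 6 can be sketched in a few lines (the function name and parameters are illustrative, not from the paper's code):

```python
import random

# Local epsilon-greedy (Eq. 6): with probability eps, draw an action
# uniformly from a delta-interval around the greedy action, instead of
# from the whole continuous action range.
def local_eps_greedy(greedy_action, eps, delta):
    if random.random() < eps:
        return random.uniform(greedy_action - delta, greedy_action + delta)
    return greedy_action
```

For example, `local_eps_greedy(0.5, eps=0.2, delta=0.05)` always returns a value in [0.45, 0.55], so exploration stays close to the (sub-optimal) bootstrap policy.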
3.3 Q_LB-learning
As shown in many works (e.g., (Gaskett, 2003;
Morimoto and Doya, 2001)), traditional Q-learning
is often ill-suited for robotic applications character-
ized by noisy continuous environments with several
uncertain parameters in which non-stationary tran-
sitions may occur because of external unpredictable
factors. Many techniques (Gaskett, 2003; Morimoto
and Doya, 2001; Heger, 1994) improve the stability
and the robustness of the learning process, on the ba-
sis of the concept of maximization of performance in
the worst case, i.e., the min-max principle. This prin-
ciple deals with the problems introduced by highly
uncertain and stochastic environments. In fact, the
controller obtained at the end of the learning process
is optimized for the worst condition and not for the
average situation, as in Q-learning. (In case of problems
with multiple actuators, the explorative action is
obtained by the composition of explorative actions
for each actuator.) Although effective in principle, this approach cannot always be
applied in real world applications. The Robust Rein-
forcement Learning (RRL) paradigm (Morimoto and
Doya, 2001) relies on an estimation of the dynam-
ics and of the noise of the environment in order to
compute the min-max solution of the value function.
Unfortunately, the model of the environment is not
always available and the estimation of its dynamics
often requires many learning episodes. A model-free
solution, the Q̂-learning algorithm proposed in
(Heger, 1994), can learn a robust controller through a
direct interaction with the environment. The update
formula for the action value function is:
Q̂_{t+1}(s, a) = min(Q̂_t(s, a), U(s, a, s')), (7)
where U(s, a, s') is the target value. If the action value
function is initialized to the highest possible value,
this algorithm is proved to converge
to the min-max value function and policy, that is,
the policy that receives the highest expected reward in
the worst case. Although this algorithm is guaranteed
to find a robust controller, it can be applied only to
simulated environments, since the optimistic initialization
of the action value function makes the agent
explore randomly all the available actions until at
least the best action has converged to the min-max value
function. Thus, it is not suitable for robotic applications
where long exploration is too expensive.
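Heger's min update (Eq. 7) can be sketched in a few lines; the optimistic initial value of 10.0 below is a placeholder of our own, not a figure from the paper:

```python
# Q-hat update (Eq. 7): keep the minimum between the current estimate
# and the observed target value. Requires optimistic initialization.
q_max = 10.0                         # hypothetical upper bound
Q = {('s', 'a'): q_max}

def q_hat_update(Q, s, a, target):
    # The estimate can only decrease: it tracks the worst observed target.
    Q[(s, a)] = min(Q[(s, a)], target)

q_hat_update(Q, 's', 'a', 3.0)       # worst target seen so far: 3.0
q_hat_update(Q, 's', 'a', 7.0)       # larger targets are ignored
```

This monotone decrease is exactly why a single rare bad event is stored forever, which motivates the variant introduced next.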
Another drawback of Q̂-learning is that it finds an
optimal policy for the worst case even if caused by
non-stationary transitions. In fact, in real-world applications,
very negative conditions may occur during the
learning process because of very limited and uncontrolled
situations, possibly caused by non-stationarity
in the environment; with Q̂-learning, these conditions
are immediately stored in the action value function
and cannot be removed anymore.
In order to keep the robustness of a min-max controller,
to reduce the exploration, and to avoid the effects
of non-stationarity as much as possible, we propose
Q_LB-learning, a novel algorithm for the computation
of a lower bound of the action value function. Instead
of a minimization between the current estimation and
the target value (Eq. 5), we adopt the following update rule:
Q_{t+1}(s, a) = U(s, a, s')  if U(s, a, s') < Q_t(s, a),
Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α U(s, a, s')  otherwise.
As can be noticed, when the worst case is visited,
the action value estimation is set to the target value, as
in Q̂-learning. On the other hand, if the target received
by the agent is greater than the current estimation, the
usual Q-learning update rule is used. As a result, the
action value function may not take into consideration
the worst case ever visited in the learning process,
when this is the result of rare events not following the
real dynamics of the system (e.g., collisions against
moving obstacles). As learning progresses, the learning
rate α decreases (according to the Robbins-Monro
conditions), thus granting the convergence of the Q_LB-learning
algorithm, since it becomes more and more
similar to Q̂-learning. As can be noticed, this al-
gorithm does not require any particular initialization
of the action value function, as in Q̂-learning, and this
can reduce the exploration needed to learn a nearly
optimal solution. Furthermore, this algorithm is ef-
fective in case of continuous state spaces in which the
transitions between states may be affected also by the
policy the robot is performing (Moore and Atkeson,
1995). In this situation, the worst case depends on the
policy and not only on the dynamics of the environ-
ment, thus it is necessary to evaluate the action value
function according to the current policy and not with
respect to the worst possible case. Therefore, while
Q̂-learning would converge to the worst case independently
from the policy, Q_LB-learning learns the action
value function for the worst case of the current policy.
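The two-branch update above can be sketched as follows (names are illustrative, not from the paper's implementation):

```python
# Q_LB update: jump down to the target when it falls below the current
# estimate (worst case adopted at once), otherwise apply the standard
# Q-learning moving average, which lets rare bad events be forgotten.
def q_lb_update(q, target, alpha):
    if target < q:
        return target                        # worst case: adopt immediately
    return (1 - alpha) * q + alpha * target  # usual Q-learning step

q = 5.0
q = q_lb_update(q, 2.0, alpha=0.5)   # target below estimate -> q = 2.0
q = q_lb_update(q, 4.0, alpha=0.5)   # target above -> q = 0.5*2 + 0.5*4
```

Unlike the pure min update of Eq. 7, a later sequence of larger targets slowly pulls the estimate back up, so a single anomalous transition does not dominate the value function forever.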
3.4 PWC-Q-learning
Although the previous algorithm is effective in noisy
or non-stationary environments, it may experience
bad results (see Section 4) when applied to problems
with continuous state spaces. Usually, when applying
RL algorithms, a continuous state space is discretized
into intervals that are considered as aggregated states.
Unfortunately, if a coarse discretization is adopted,
the environment loses the Markov property and the
learning process is likely to fail. In fact, a very fine
resolution of the state space is required to make algorithms
such as Q-learning stable and effective. Since
the number of states has a strong impact on the speed
of the learning process, this often hinders their
application to robotic problems.
The reason for Q-learning to fail in learning on
coarsely discretized states is strictly related to its
learning process, which continuously performs updates
within the same state. With Q-learning, when the
robot takes the greedy action a in state s, receives a
reward r, and remains in the very same state, the action
value function is updated as:
Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [r + γ Q_t(s, a)]. (8)
(Let us notice that when α = 0, the Q_LB-learning update is the same
as the Q̂-learning one.)
Thus, the value of Q(s, a) is updated using its own estimation.
If the state is sufficiently large, the agent
is likely to remain in the same state for many steps,
and the Q-value tends to converge to its limit r/(1 − γ),
until either the state is left or another action is chosen.
As a result, the policy continuously changes and
the learning process may experience instability. This
phenomenon is much more relevant when a min-max-
based update rule is adopted; in that case, the conver-
gence to the limit is even faster.
In order to avoid the negative effect of the self-
update rule, we introduce a novel learning algo-
rithm: the Piecewise Constant Q-learning (PWC-Q-
learning). The main difference with respect to the tra-
ditional Q-learning is about the way the action value
function is updated. When the robot enters a state s
and selects an action a, PWC-Q-learning makes the
robot repeatedly execute the same action a and it ac-
cumulates the reward until a state transition occurs.
Only at that time, the update is performed according
to the SMDP Q-learning rule (Sutton et al., 1999):
Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [Σ_{k=0}^{N−1} γ^k r_k + γ^N max_{a'} Q_t(s', a')],
where N is the number of steps during which the agent has
repeatedly executed action a in state s, and r_k is the
reward collected at the k-th of these steps. As can be
noticed, in this way s' is always different from s and
no self-update is performed. As a result, the PWC-Q-
learning is more stable and guarantees a more reliable
learning process than the original Q-learning.
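The accumulate-then-update rule above can be sketched as follows (a simplified illustration with names of our own, not the paper's code):

```python
# PWC-style SMDP update: while the robot stays in the same aggregated
# state, the action is held constant and per-step rewards are discounted
# and accumulated; a single update is applied on the state transition,
# so no self-update ever occurs within a state.
def pwc_update(q_sa, rewards, q_next_max, gamma, alpha):
    n = len(rewards)                                   # N steps spent in s
    acc = sum(gamma ** k * r for k, r in enumerate(rewards))
    target = acc + gamma ** n * q_next_max             # SMDP target
    return (1 - alpha) * q_sa + alpha * target

# e.g. three steps of reward -1 inside the state, then a transition to a
# state whose best action is currently valued 10.0
q = pwc_update(0.0, [-1, -1, -1], q_next_max=10.0, gamma=0.9, alpha=0.5)
```

With γ = 0.9 the accumulated reward is −1 − 0.9 − 0.81 = −2.71, the target is −2.71 + 0.9³ · 10 = 4.58, and the update moves the estimate halfway there.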
Furthermore, the PWC-Q-learning algorithm also
matches the limitations that the discretization imposes
on the resolution of the controller that can actually
be learned. In fact, when the learning is over, the
learned policy π maps each state into one single action
that must be kept constant until a different state is perceived.
Therefore, a learning algorithm such as
PWC-Q-learning, which evaluates the real utility of an
action kept constant throughout a state, is more suitable than
Q-learning, which allows the action to change within a state.
While Q-learning needs a very fine discretization
to reduce the instability caused by the loss of the
Markov property, PWC-Q-learning (as shown in the
experimental section) proved to be more stable even
in coarsely discretized continuous state spaces. By
using coarse discretizations it is also possible to re-
duce the duration of the learning process.
Finally, PWC-Q-learning can be merged with the
computation of the lower bound action value function
introduced in Section 3.3 in order to obtain a learn-
ing algorithm that is robust in highly stochastic and
noisy environments and that, at the same time, can be
successfully applied to continuous robotic problems.
Figure 1: The RoboCup robot used for the experiments.
To verify the effectiveness of the proposed approach
we made real robotic experiments with the aim of
measuring speed and stability of the learning pro-
cess, and optimality and robustness of the learned pol-
icy. The experimental activity has been performed on
a real robot belonging to the Milan RoboCup Team
(MRT) (Bonarini et al., 2006), a team of soccer robots
that participates in the Middle Size League of the
RoboCup competition (Kitano et al., 1997). The robot
used for the experiments is a holonomic robot with
three omnidirectional wheels that can reach a maximum
speed of 1.8 m/s (see Figure 1). The robot is
equipped with an omnidirectional catadioptric vision
sensor able to detect objects (e.g., ball, robots) up to
a distance of about 6 m all around the robot.
Preliminary experiments have been carried out on
the task “go to ball”, in which the robot must learn
how to reach the ball as fast as possible. Although we
have chosen a quite simple task, the large discretiza-
tion adopted, the non-stationarity of the environment
(due to battery consumption during the learning pro-
cess), the noise affecting robot’s sensors and actua-
tors, and the limited amount of experience make the
results obtained significant to evaluate the benefits of
the PWC-Q_LB-learning algorithm. The state space of
this task is characterized by two continuous state variables:
the distance and the angle at which the robot
perceives the ball. The ball distance has been discretized
into five intervals: [0 : 50), [50 : 100), [100 : 200),
[200 : 350), [350 : 600] (in cm), while the angle has been
evenly split into 24 sectors 15° wide, thus obtaining
a state space with 120 states. As far as the action
space is concerned, thanks to its three omnidirectional
wheels, the robot can move on the plane with three
degrees of freedom mapped to three action variables:
- module of the tangential velocity, associated to
the speed at which the robot translates. Its value
is expressed in percentage of the maximum tangential
velocity and is discretized into 6 values:
0, 20, 40, 60, 80, 100;
- direction of the tangential velocity, the direction
along which the robot translates. Its value is expressed
in degrees and is discretized into 24 evenly
spaced values: 0°, 15°, ..., 345°;
- rotational velocity, associated to the speed at
which the robot changes its heading. Its value
is expressed in percentage of the maximum rotational
velocity and is discretized into 9 values:
−20, −15, −10, −5, 0, 5, 10, 15, 20.
The total number of available actions is 1,296.
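As an illustration of the discretization described above, the following sketch (helper names are ours, not from the paper's code) maps a continuous (distance, angle) reading to one of the 120 discrete states and checks the state and action space sizes:

```python
import bisect

# State discretization: 5 distance intervals x 24 angular sectors.
DIST_EDGES = [50, 100, 200, 350, 600]   # cm, upper bounds of the intervals
N_SECTORS = 24                          # 15-degree angular sectors

def state_index(distance_cm, angle_deg):
    # interval index 0..4: [0:50), [50:100), [100:200), [200:350), [350:600]
    d = bisect.bisect_right(DIST_EDGES[:-1], distance_cm)
    sector = int(angle_deg % 360) // 15  # sector index 0..23
    return d * N_SECTORS + sector

n_states = 5 * N_SECTORS                 # 120 discrete states
n_actions = 6 * 24 * 9                   # speed x direction x rotation = 1296
```

This mapping gives a flat state index usable as a row of the Q-table, with the 1,296 discrete actions indexing its columns.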
The reward function is such that the robot receives
−1 as reward at each step, except when its distance
from the ball is below 50 cm and the angle falls in the
range [−15° : 15°], in which case the reward is +10
and the trial ends. At the beginning of each learn-
ing trial, the robot starts from the center kickoff po-
sition and performs learning steps until it succeeds in
reaching the ball which is positioned at 350cm. Once
a trial is finished, the learning process is suspended,
and the robot autonomously performs a resetting pro-
cedure moving towards the starting position.
The sparsity of this reward function, although
making its definition easier and preventing the introduction
of biases, requires a long exploration period
before catching some positive rewards and propagating
the associated information to the rest of the state
space. In this task, the robot would cluelessly wander
around the field with little hope of success and a
high risk of bumping against objects around the field.
For this reason, as mentioned in Section 2, we split
the learning process into two phases. The first phase
consists of “supervised” trials, i.e., trials in which the
robot is controlled by hand-written behaviors and the
RL algorithm only observes and records the actions
taken in the visited states and the associated rewards.
On the basis of the observed data, the RL algorithm
builds a first approximation of the value function that
will be exploited in the second phase. The acquired
policy makes it possible to explore the environment
safely, thus considerably speeding up the learning
process towards the optimal policy. In the second
phase, the hand-written controller is bypassed by the
RL system which chooses which action must be exe-
cuted by the robot in each state, and, on the basis of
the collected reward and the reached state, performs
on-line updates of its knowledge.
In the following, we present comparative experiments
among Q-learning, Q_LB-learning (see Section 3.3),
and PWC-Q_LB-learning (see Section 3.4).
The “supervised” phase has been performed only
once and then the collected data have been reused
in all the experiments. It consists of 60 trials with
the ball placed in different positions around the robot
within 400cm. Since the actions produced by the
Figure 2: Performance of Q-learning, Q_LB-learning, and PWC-Q_LB-learning on the “go to ball” task (number of steps per trial vs. number of learning trials; the three panels show Q-learning, Q-learning with lower bounds, and piecewise constant Q-learning with lower bounds).
hand-written controller (based on fuzzy rules) are
continuous, and given that the RL algorithms work
with discrete actions, we replace each action pro-
duced by the controller with the one with the clos-
est value among those in the discrete set of the RL
system. Given the low complexity of the task, the
hand-written behavior is, on purpose, highly subopti-
mal (with only low speed commands) in order to bet-
ter highlight the improvements obtained by the learn-
ing processes in such a noisy environment.
Figure 2 reports the learning plots of the
three algorithms during the second learning phase.
The performance of the algorithms is measured by
the number of steps (each step lasts 70ms) required
to reach the ball. Every five trials the learning process
is suspended for one trial; in this trial the robot exe-
cutes the policy learned so far. Only the exploitation
trials are shown in the graphs. When the robot is not
able to reach the ball within 200 steps, the trial ends.
Each graph also reports the average performance
of the suboptimal hand-coded policy followed
by the controller (about 152 steps) and the average
performance of our best hand-coded policy (about 48 steps).
It is worth noting that, due to noise in the sensing and
actuating systems and to the low accuracy of the resetting
procedure, the performance of fixed policies is also
affected by a high variance (standard deviation of
about ±10 steps). The parametrization is the same
for each learning algorithm: the learning rate is 0.5
and decreases quite quickly, the ε-greedy exploration
starts at 0.5 and decreases slowly, the discount factor
is 0.99. The action values stored in the Q-table are
initialized with a low value (−100), so that the values
of the actions performed during the supervised trials
become larger, thus biasing the exploration to regions
near the example trajectories. Given the stochastic-
ity (due to sensor noise) and the non-stationarity (due
to battery discharging), each experiment has been re-
peated three times for each algorithm, for a total of
1,350 trials and more than 130,000 steps.
In the first two graphs we have reported typical
runs of Q-learning and Q_LB-learning. Both of them
show quick learning in the first trials, but then they
alternate good trials with bad trials without appearing to
converge to a stable solution. Trials that reach 200
steps typically mean that the robot, in some regions,
has learned a policy that stops the movement of the
robot. As explained above, this irrational behavior is
due to self-updates (see Equation 8). Unsurprisingly,
this problem is more frequent in the Q_LB-learning
algorithm. The third plot displays the performance
of the PWC-Q_LB-learning algorithm averaged over three
different runs. Here, the number of steps needed to
reach the ball decreases quite slowly, but, unlike the
other algorithms, the policy improves more and more
reaching performance close to the best hand-coded
policy, without any trial reaching the 200 steps limit.
It is worth briefly describing the policy learned by the robot with the
PWC-Q̂-learning algorithm: in the first trials the robot gradually learns to
increase its speed in different situations, until it learns to reach the ball
at maximum speed. Although this policy allows the robot to complete some trials
in very few steps (even fewer than 40), it is likely to fail, since a coarse
discretization causes control problems at high speed, especially in regions
where good accuracy is required. The problem is that, when the robot travels
fast, it may not be able to go straight to the ball; it generally hits the ball
and then has to chase it for many steps. Given the risk aversion typical of
minimax approaches, the PWC-Q̂-learning algorithm quickly learns to give up
fast movements when the robot is near the ball. The final result is that the
robot starts at high speed and slows down as the ball gets closer, thus
managing to reach the termination condition without touching the ball.
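The learned behavior can be pictured as a piecewise-constant speed profile over
the discretized ball distance; the thresholds and speed values below are
illustrative assumptions, not the values actually learned by the robot:

```python
def tangential_speed(ball_distance_m: float) -> float:
    """Piecewise-constant speed as a function of ball distance:
    fast when far, progressively slower near the ball (assumed values)."""
    if ball_distance_m > 1.0:
        return 0.6   # maximum speed far from the ball
    if ball_distance_m > 0.4:
        return 0.3   # intermediate speed while approaching
    return 0.1       # creep near the ball, to avoid hitting it
```

A profile of this shape trades a few extra steps far from the ball for the
accuracy needed close to it, which is exactly the risk-averse compromise
described above.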
During one of the experiments, another interesting characteristic of the
PWC-Q̂-learning algorithm emerged. After about 100 trials, one of the wheels
began to partially lose its grip on the engine axis, causing the robot to turn
slightly right instead of moving straight ahead. Our learning algorithm managed
to face this non-stationarity by reducing the tangential speed in all the
states while turning slightly right, and then gradually raising the tangential
speed again. This behavior highlights how this learning approach can
potentially increase robustness to mechanical faults, thus increasing the
autonomy of the robot. Further studies will focus on a detailed analysis of
this important aspect.
The application of RL techniques to robotics could lead to the autonomous
learning of optimal solutions for many different tasks. Unfortunately, most
traditional RL algorithms fail when applied to real-world problems because of
the time required to find the optimal solution, the noise that affects both
sensors and actuators, and the difficulty of managing continuous state spaces.
In this paper, we have described and experimentally tested a novel algorithm
(PWC-Q̂-learning) designed to overcome the main issues that arise when learning
is applied to real-world robotic tasks. PWC-Q̂-learning computes a lower bound
on the action value function while following a piecewise constant policy.
Unlike other min-max-based algorithms, PWC-Q̂-learning does not require a model
of the dynamics of the environment and avoids long, blind exploration phases.
Furthermore, it does not learn the optimal policy for the theoretically worst
case, but estimates the lower bound under the conditions actually experienced
by the robot, according to its current policy and the current dynamics of the
environment. Finally, the piecewise constant action selection and update
guarantee a stable learning process in continuous state spaces, even when the
discretization is such that the Markov property is lost.
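The update rule of PWC-Q̂-learning is not reproduced in this section; for
intuition, the classical model-free way to maintain such a lower bound, due to
Heger (1994), is to keep the worst outcome backed up so far (a sketch with
assumed sizes and an optimistic initialization):

```python
import numpy as np

GAMMA = 0.99
N_STATES, N_ACTIONS = 10, 3   # assumed sizes

# optimistic initialization: the min-operator below can only tighten it
Q_hat = np.zeros((N_STATES, N_ACTIONS))

def q_hat_update(s: int, a: int, r: float, s_next: int) -> None:
    """Minimax-style update: Q_hat(s, a) keeps the worst return backed up
    so far, i.e., a lower bound under the outcomes actually experienced."""
    candidate = r + GAMMA * np.max(Q_hat[s_next])
    Q_hat[s, a] = min(Q_hat[s, a], candidate)
```

Unlike a model-based worst-case computation, a bound of this kind only reflects
transitions the agent has actually experienced under its current policy, which
matches the point above about avoiding overly conservative worst-case planning.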
Although preliminary, the experiments showed that PWC-Q̂-learning succeeds in
learning a nearly optimal policy by optimizing the behavior of a suboptimal
controller in noisy continuous environments. Furthermore, it proved to be more
stable than Q-learning even when a coarse discretization of the state space is
used. We are currently investigating the theoretical properties of the proposed
algorithm and testing its performance on more complex robotic tasks, such as
the “align to goal” and “kick” tasks.
Bellman, R. (1957). Dynamic Programming. Princeton
University Press, Princeton.
Bonarini, A., Matteucci, M., Restelli, M., and Sorrenti, D. G. (2006). Milan
RoboCup Team 2006. In RoboCup-2006: Robot Soccer World Cup X.
Buffet, O. and Aberdeen, D. (2006). Policy-gradient for
robust planning. In Proceedings of the Workshop on
Planning, Learning and Monitoring with Uncertainty
and Dynamic Worlds (ECAI 2006).
Gaskett, C. (2003). Reinforcement learning under circum-
stances beyond its control. In Proceedings of Interna-
tional Conference on Computational Intelligence for
Modelling Control and Automation.
Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the 11th ICML.
Kitano, H., Asada, M., Osawa, E., Noda, I., Kuniyoshi, Y., and Matsubara, H.
(1997). RoboCup: The robot world cup initiative. In Proceedings of the First
International Conference on Autonomous Agents (Agents-97).
Lin, L.-J. (1992). Self-improving reactive agents based on
reinforcement learning, planning and teaching. Ma-
chine Learning, 8(3-4):293–321.
Littman, M. L. (1994). Markov games as a framework for
multi-agent reinforcement learning. In Proceedings of
the 11th ICML, pages 157–163.
Millán, J. d. R. (1996). Rapid, safe, and incremental learning of navigation
strategies. IEEE Transactions on Systems, Man, and Cybernetics (B),
26(3):408–420.
Moore, A. and Atkeson, C. (1995). The parti-game algorithm for variable
resolution reinforcement learning in multidimensional state-spaces. Machine
Learning.
Morimoto, J. and Doya, K. (2000). Acquisition of stand-up
behavior by a real robot using hierarchical reinforce-
ment learning. In Proceedings of the 17th ICML.
Morimoto, J. and Doya, K. (2001). Robust reinforcement
learning. In Advances in Neural Information Process-
ing Systems 13, pages 1061–1067.
Smart, W. D. and Kaelbling, L. P. (2002). Effective rein-
forcement learning for mobile robots. In Proceedings
of ICRA, pages 3404–3410.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful
examples using sparse coarse coding.
In Advances in Neural Information Processing Sys-
tems 8, pages 1038–1044.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge, MA.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A
framework for temporal abstraction in reinforcement learning. Artificial
Intelligence, 112(1-2):181–211.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications
of the ACM, 38.
Watkins, C. and Dayan, P. (1992). Q-learning. Machine
Learning, 8:279–292.