FOOTSTEP PLANNING FOR BIPED ROBOT BASED ON FUZZY Q-LEARNING APPROACH

Christophe Sabourin, Kurosh Madani

Laboratoire Images, Signaux, et Systèmes Intelligents (LISSI EA 3956), Université Paris-XII

IUT de Sénart, Avenue Pierre Point, 77127 Lieusaint, France

Weiwei Yu, Jie Yan

Flight Control and Simulation Institute, Northwestern Polytechnical University, Xi’an 710072, China

Keywords: Biped robots, Footstep planning, Fuzzy Q-Learning.

Abstract:

Biped robots have a more flexible mechanical system and can move in more complex environments than wheeled robots. Their ability to step over both static and dynamic obstacles allows biped robots to cross uneven terrain where ordinary wheeled robots fail. In this paper we present a footstep planning for biped robots allowing them to step over dynamic obstacles. Our footstep planning strategy is based on a fuzzy Q-learning concept. In comparison with previous works, one of the most appealing features of our approach is its good robustness: the proposed footstep planning is operational for both constant and random velocity of the obstacle.

1 INTRODUCTION

In contrast with wheeled robots, biped robots have a more flexible mechanical system and can thus move in more complex environments. In particular, their ability to step over both static and dynamic obstacles allows biped robots to cross uneven terrain where regular wheeled robots fail. Although a large number of papers deal with the field of biped and humanoid robots (see for example (Hackel, 2007) and (Carlos, 2007)), only a few published studies concern path planning for biped robots (Ayza, 2007), (Chestnutt, 2004), (Sabe, 2004). In fact, designing a path planning for biped robots in indoor and outdoor environments is more difficult than for wheeled robots because it must take their ability to step over obstacles into account. Consequently, a path planning with an obstacle-avoidance strategy, as used for wheeled robots, is not sufficient.

Generally, the previously proposed approaches in the field of path planning for biped robots are based on a tree search algorithm. In (Kuffner, 2001), Kuffner et al. proposed a footstep planning approach using a search tree over a discrete set of feasible footstep locations. This approach has been validated on the robots H6 (Kuffner, 2001) and H7 (Kuffner, 2003). Later, this strategy was extended to the Honda ASIMO robot (Chestnutt, 2005). Although the footstep planning proposed by Kuffner seems an interesting way to solve the problem of path planning for biped robots, its main drawbacks are, on the one hand, the limitation to 15 foot placements (Kuffner, 2001) in order to bound the computational time, and, on the other hand, the fact that the approach is operational only in predictable dynamic environments (Chestnutt, 2005). In this paper, we present a new concept of footstep planning for biped robots in dynamic environments. Our approach is based on a Fuzzy Q-learning (FQL) algorithm. The FQL, proposed by Glorennec et al. (Glorennec, 1997) (Jouffe, 1998), is an extension of the traditional Q-learning concept (Watkins, 1992) (Sutton, 1998) (Glorennec, 2000) which handles the continuous nature of the state-action space. In this case, both the actions and the Q-function may be represented by a Takagi-Sugeno Fuzzy Inference System (TS-FIS). After a training phase, our footstep planning strategy is able to adapt the step length of the biped robot using only a Fuzzy Inference System. However, our study is limited to the sagittal plane and does not take into account the feasibility of the joint trajectories of the legs: the footstep planning gives only the position of the landing point. But the first investigations show a real interest of this approach because:

Sabourin C., Madani K., Yu W. and Yan J. (2008). FOOTSTEP PLANNING FOR BIPED ROBOT BASED ON FUZZY Q-LEARNING APPROACH. In Proceedings of the Fifth International Conference on Informatics in Control, Automation and Robotics - RA, pages 183-188. DOI: 10.5220/0001486701830188. Copyright © SciTePress.

• The computing time is very short: after the learning phase, the footstep planning is based only on a FIS,

• The footstep planning is operational in both predictable and unpredictable dynamic environments, which increases its robustness.

This paper is organized as follows. In Section 2, the Fuzzy Q-learning concept is presented. Section 3 describes the footstep planning based on the Fuzzy Q-learning. In Section 4, the main results, obtained from simulations, are given. Conclusions and further developments are finally set out in Section 5.

2 FUZZY Q-LEARNING CONCEPT

Reinforcement learning (Sutton, 1998) (Glorennec, 2000) involves problems in which an agent interacts with its environment and estimates the consequences of its actions on the basis of a scalar signal in terms of reward or punishment. The goal of a reinforcement learning algorithm is to find the actions which maximize the reinforcement signal, which provides an indication of the interest of the last chosen actions. Q-Learning, proposed by Watkins (Watkins, 1992), is a very interesting way to use a reinforcement learning strategy. However, the Q-Learning algorithm developed by Watkins deals with discrete cases and assumes that the whole state space can be enumerated and stored in memory. Because the Q-matrix values are stored in a look-up table, this method becomes impossible to use when the state-action spaces are continuous. For a continuous state space, Glorennec et al. (Glorennec, 1997) (Jouffe, 1998) proposed to use fuzzy logic, where both the actions and the Q-function may be represented by a Takagi-Sugeno Fuzzy Inference System (TS-FIS). Unlike the TS-FIS, in which there is only one conclusion for each rule, the Fuzzy Q-Learning (FQL) approach admits several actions per rule. Therefore, the learning agent has to find the best action for each rule.

The FQL algorithm uses a set of N_k fuzzy rules such as:

IF x_1 is M_1^1 AND ... AND x_i is M_i^j THEN
    y_k = a_k^1 with q = q_k^1
    or y_k = a_k^l with q = q_k^l
    ...
    or y_k = a_k^{N_l} with q = q_k^{N_l}    (1)

x_i (i = 1..N_i) are the inputs of the FIS, which represent the state space; N_i is the size of the input space. Each fuzzy set j for the input i is modeled by a membership function M_i^j with membership value µ_i^j. a_k^l and q_k^l are respectively the l-th possible action for the rule k and its corresponding Q-value (k = 1..N_k; l = 1..N_l).

At each step time t, the agent observes the present state X(t). For each rule k, the learning system has to choose one action among the N_l possible actions using an Exploration/Exploitation Policy (EEP). In our approach, an ε-greedy algorithm is used to select the local action for each activated rule: the action with the best evaluation value (max(q_k^l), l = 1..N_l) has a probability P_ε of being chosen; otherwise, an action is chosen randomly among all possible actions. After the execution of the computed action, the agent may update the Q-values using the reinforcement signal. The FQL algorithm may be decomposed into four stages:

• After the fuzzification of the perceived state X(t), the rule truth values α_k(t) are computed using equation (2):

α_k(t) = µ_1^j · µ_2^j · ... · µ_{N_i}^j    (2)

• The final action Y(t) is computed through two levels: in the first level, the local action l of each activated rule is determined using the EEP, and in the second level, the global action is calculated as a combination of all local actions. Equations (3) and (4) give respectively the global action Y(t) and the corresponding quality Q(t) according to the truth values α_k(t):

Y(t) = Σ_{k=1}^{N_k} α_k(t) a_k^l(t)    (3)

Q(t) = Σ_{k=1}^{N_k} α_k(t) q_k^l(t)    (4)

• After executing the new action given by Y(t) and taking into account the environment's reply, Q(t) may be updated using equation (5):

∆Q(t) = β [r + γ V_max(t+1) − Q(t)]    (5)

where V_max(t+1) is the weighted maximum Q-value over the rules activated at the next step time t+1:

V_max(t+1) = Σ_{k=1}^{N_k} α_k(t+1) max_l(q_k^l(t+1))    (6)

γ is a discount factor chosen between 0 and 1: if it is close to 0, the agent considers mainly the immediate reward, while if it is close to 1, it weights future rewards more heavily. β is a learning rate parameter weighting the contributions of the old and new estimates in the update driven by the reinforcement signal r.

Figure 1: Footstep planning strategy.

• Finally, for each activated rule, the corresponding elementary quality q_k^l of the Q-matrix is updated with:

∆q_k^l = ∆Q(t) α_k(t)    (7)
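The four stages above can be condensed into a single update routine. The sketch below is a minimal pure-Python rendering of equations (2)-(7); the function and variable names are ours, the rule truth values α_k are assumed to be supplied by the fuzzification stage, and all rules are assumed to share the same discrete action set.

```python
import random

def fql_step(q, alpha_t, alpha_t1, r, actions, beta=0.1, gamma=0.8, p_eps=0.1):
    """One Fuzzy Q-Learning update following equations (2)-(7).

    q        : list of N_k rows, each a list of the N_l values q_k^l
    alpha_t  : truth values alpha_k(t) of the rules for the current state
    alpha_t1 : truth values alpha_k(t+1) for the next state
    r        : scalar reinforcement signal
    actions  : the N_l discrete local actions a^l, shared by all rules
    Returns the global action Y(t).
    """
    n_k, n_l = len(q), len(actions)
    # epsilon-greedy EEP: the best local action is taken with probability
    # P_eps, otherwise a random one (exploration is thus privileged)
    chosen = [max(range(n_l), key=lambda l: q[k][l])
              if random.random() < p_eps else random.randrange(n_l)
              for k in range(n_k)]
    # global action and its quality (equations (3) and (4))
    y = sum(alpha_t[k] * actions[chosen[k]] for k in range(n_k))
    q_t = sum(alpha_t[k] * q[k][chosen[k]] for k in range(n_k))
    # weighted best next value and TD error (equations (6) and (5))
    v_max = sum(alpha_t1[k] * max(q[k]) for k in range(n_k))
    dq = beta * (r + gamma * v_max - q_t)
    # distribute the increment over the activated rules (equation (7))
    for k in range(n_k):
        q[k][chosen[k]] += dq * alpha_t[k]
    return y
```

When the truth values sum to 1, the returned Y(t) is a convex combination of the selected discrete actions, which is how a continuous action emerges from a discrete action set.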

3 FOOTSTEP PLANNING

The proposed footstep planning is based on a FQL approach. Our aim is to design a control strategy that automatically adjusts the step length of a biped so that the robot avoids dynamic obstacles by stepping over them. As figure 1 shows, our footstep planning may be divided into four parts:

• The first part involves the fuzzification of the inputs of the state X(t),

• The second concerns the FQL algorithm used to compute the length of the step,

• The third part simulates the dynamic environment in which the robot moves,

• And the fourth part gives the reinforcement signal.

3.1 Virtual Dynamic Environment

Both the robot and the obstacle move in the sagittal plane, in opposite directions. We consider that the walking of the biped robot may include sequences of single-support phases (only one leg in contact with the ground) as well as instantaneous double-support phases (both legs in contact with the ground). The biped robot may adjust the length of its step, but we consider that the duration of each step is always equal to 1s. The size and the velocity of the obstacle lie in the ranges [0,0.4]m and [0,0.4]m/s respectively. Although the robot has the ability to adjust its step length, there are two situations in which the robot may crash into the obstacle. The first occurs when the length of the step is not correctly adapted to the position of the dynamic obstacle; in this case, the swing leg touches the obstacle directly during a double-support phase. The other corresponds to the situation where the obstacle collides with the stance leg during a single- or double-support phase.
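As a minimal illustration of this virtual environment, the sketch below advances both bodies through one 1s step and tests a collision condition; the one-dimensional coordinates and function names are our own simplification (the paper's swing-leg and stance-leg collision cases collapse into a single interval test under a point-foot assumption).

```python
def advance(x_rob, x_obs, step_length, v_obs, dt=1.0):
    """One environment step: the robot (walking in +x) lands a step of the
    chosen length while the obstacle (moving in -x) covers v_obs * dt;
    the step duration dt is fixed at 1 s in the paper."""
    return x_rob + step_length, x_obs - v_obs * dt

def collided(x_rob, x_obs, l_obs):
    """True when the front foot lies on the obstacle, whose near edge is
    at x_obs and whose far edge is at x_obs + l_obs."""
    return x_obs <= x_rob <= x_obs + l_obs
```

Starting from the 2.5m initial gap used later in the paper, each call to advance closes the robot-obstacle gap by step_length + v_obs * dt.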

3.2 Fuzzification

The design of our footstep planning is based on both Takagi-Sugeno FIS and Q-learning strategies. Consequently, it is necessary to use a fuzzification for each input. In the proposed approach, we use two inputs in order to perform a correct footstep planning: the distance between the robot and the obstacle, d_obs, and the velocity of the obstacle, v_obs. Both are updated at each double-support phase. d_obs corresponds to the distance between the front foot and the first side of the obstacle. v_obs is computed from the distance covered during 1s. The fuzzification of v_obs and d_obs is carried out using respectively 6 and 11 triangular membership functions. Figures 2(a) and 2(b) give the membership functions of the obstacle velocity and distance respectively.
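A strong triangular partition like the one in figure 2 can be sketched as follows. The paper fixes only the set counts (6 for v_obs, 11 for d_obs) and the velocity range [0,0.4]m/s; the evenly spaced centers and the distance range [0,2.5]m are our assumptions.

```python
def spaced(lo, hi, n):
    """n evenly spaced triangle centers from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def triangular_memberships(x, centers):
    """Membership degrees of x in a strong triangular partition.

    Each fuzzy set peaks at its own center and falls linearly to zero at
    the neighbouring centers, so at most two sets are active at any x and
    their degrees sum to 1 inside the input range.
    """
    mu = [0.0] * len(centers)
    for j, c in enumerate(centers):
        left = centers[j - 1] if j > 0 else c
        right = centers[j + 1] if j + 1 < len(centers) else c
        if left < x <= c:
            mu[j] = (x - left) / (c - left)    # rising edge of triangle j
        elif c <= x < right:
            mu[j] = (right - x) / (right - c)  # falling edge of triangle j
        elif x == c:
            mu[j] = 1.0                        # boundary sets at the range ends
    return mu

# Assumed partitions (counts from the paper, ranges partly our guess):
V_CENTERS = spaced(0.0, 0.4, 6)    # 6 sets for v_obs
D_CENTERS = spaced(0.0, 2.5, 11)   # 11 sets for d_obs
```

The products of these degrees over the two inputs give the 66 rule truth values α_k(t) of equation (2).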

Figure 2: Membership functions used for the input space: (a) obstacle velocity v_obs (in m/s); (b) obstacle distance d_obs (in m).

3.3 FQL-based Step Length

The FQL algorithm uses a set of fuzzy rules such as equation (1). For the proposed problem, the number of rules is 66 (6 and 11 membership functions for the velocity and the distance of the obstacle respectively). For each rule, we define 5 possible outputs, [0.1, 0.2, 0.3, 0.4, 0.5]m, which correspond to the length of the step. Consequently, at each step time, the Fuzzy Q-Learning algorithm has to choose one output among the five possible outputs for each activated rule. It must be pointed out that although the chosen outputs belong to a discrete set, the real output Y(t) is a real number, due to the fuzzy combination of equation (3). During the simulation, the size of the obstacle is constant but the velocity of the obstacle may be modified. At each episode, some parameters must be initialized: the initial distance between the biped robot and the obstacle is always equal to 2.5m, and the velocity of the obstacle is chosen randomly in the interval [0,0.4]m/s. During one episode, the step length of the robot is computed using the FQL algorithm described in Section 2, and the biped robot moves step by step towards the obstacle. The episode is finished either when the robot steps over the obstacle (success) or when the robot crashes into the obstacle (failure). The discount factor γ and the learning rate β are equal to 0.8 and 0.1 respectively; these parameters have been chosen empirically after several trials in order to ensure a good convergence of the FQL algorithm. The probability P_ε is equal to 0.1, which means that random exploration is privileged during the learning phase.
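The episode just described can be skeletonized as below; fuzzify and choose_and_update are hypothetical callbacks standing in for the fuzzification of Section 3.2 and the FQL selection/update of Section 2.

```python
import random

def run_episode(fuzzify, choose_and_update, l_obs=0.2, v_obs=None, dt=1.0):
    """One training episode: initial gap of 2.5 m, obstacle velocity drawn
    uniformly in [0, 0.4] m/s unless given, fixed 1 s steps, until the robot
    either steps over the obstacle (True) or crashes into it (False)."""
    x_rob, x_obs = 0.0, 2.5
    if v_obs is None:
        v_obs = random.uniform(0.0, 0.4)
    while True:
        alpha = fuzzify(x_obs - x_rob, v_obs)   # rule truth values (eq. (2))
        step = choose_and_update(alpha)         # FQL step length in [0.1, 0.5] m
        x_rob += step                           # robot advances
        x_obs -= v_obs * dt                     # obstacle approaches
        if x_rob > x_obs + l_obs:
            return True                         # success: stepped over
        if x_rob >= x_obs:
            return False                        # failure: crash
```

For instance, against a stationary 0.2m obstacle, a constant 0.4m step succeeds (the last step lands beyond the far edge), while a constant 0.5m step lands exactly on the near edge and fails.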

3.4 Reinforcement Signal

The reinforcement signal provides information in terms of reward or punishment; it informs the learning agent about the quality of the chosen action. In our case, the learning agent must find a succession of actions allowing the biped robot to step over an obstacle. But here the obstacle is a dynamic object which moves towards the biped robot, so the reinforcement information has to take the velocity of the moving obstacle into account. In addition, the position of the foot just before the stepping over is very important as well. On the basis of these considerations, we designed the reinforcement signal in two parts.

Firstly, if x_rob < x_obs, where x_rob and x_obs give the positions of the robot and of the obstacle respectively:

• r = 0, if the robot is still far from the obstacle,

• r = 1, if the position of the robot is appropriate to cross the obstacle at the next step,

• r = −1, if the robot is too close to the obstacle.

In this first case, r is computed with the following equation:

r = 0   if x_rob ≤ x_obs − 1.2 v_obs ∆t
r = 1   if (x_rob > x_obs − 1.2 v_obs ∆t) AND (x_rob ≤ x_obs − 1.1 v_obs ∆t)
r = −1  if x_rob > x_obs − 1.1 v_obs ∆t    (8)

x_rob and x_obs are updated after each action. v_obs ∆t represents the distance covered by the obstacle during the time ∆t. As the duration of a step is always equal to 1s, ∆t is always equal to 1s.

Secondly, if x_rob ≥ x_obs:

• r = −2, if the robot crashes into the obstacle at the next step,

• r = 2, if the robot crosses the obstacle at the next step.

In this last case, r is given by equation (9):

r = −2  if x_rob ≤ x_obs + L_obs
r = 2   if x_rob > x_obs + L_obs    (9)

where L_obs is the size of the obstacle.
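Equations (8) and (9) translate almost line for line into code; the sketch below only mirrors the cases stated above (the function name is ours).

```python
def reinforcement(x_rob, x_obs, v_obs, l_obs, dt=1.0):
    """Reinforcement signal r of equations (8) and (9); x_obs is the
    position of the first (near) side of the obstacle and dt = 1 s."""
    if x_rob < x_obs:                           # approach phase, equation (8)
        if x_rob <= x_obs - 1.2 * v_obs * dt:
            return 0                            # still far from the obstacle
        if x_rob <= x_obs - 1.1 * v_obs * dt:
            return 1                            # well placed for the step over
        return -1                               # too close to the obstacle
    if x_rob <= x_obs + l_obs:                  # stepping phase, equation (9)
        return -2                               # crash into the obstacle
    return 2                                    # obstacle crossed
```

Note that for a stationary obstacle (v_obs = 0) the r = 1 band of equation (8) is empty, so the signal reduces to the far/too-close/crash/crossed cases.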

4 SIMULATION RESULTS

In this section, we present the main results related to the FQL-based footstep planning, obtained using MATLAB software. It must be noticed that our goal is to design a control strategy giving a path planning in a dynamic environment for a biped robot; we do not take the dynamics of the biped robot into account and consider only the discrete information needed to compute the landing position of the foot. In addition, we consider only flat obstacles in the following simulations.


4.1 Training Phase

During the training phase, the goal of the learning agent is to find the best rules so that the biped robot crosses the obstacle. On the basis of the previous description, we trained the Q-matrix during 10000 episodes. After the full training, we tested the footstep planning approach with 1000 velocity samples covering uniformly the input range [0,0.4]m/s.

Table 1 gives the success rate for four sizes of the obstacle. The success rate corresponds to the ratio between the number of successes and the total number of trials (1000). Figure 3 shows an example of the repartition between successes and failures over the input range of v_obs when L_obs is equal to 0.2m: when the robot steps over the obstacle successfully, the result is 1, otherwise it is 0.

Table 1: Success rate according to obstacle size.

Size (m)          0.1   0.2   0.3   0.4
Success rate (%)  65.6  31.3  21.7  4.8

Figure 3: Success rate when the size of the obstacle is equal to 0.2m.

It must be pointed out that the larger the obstacle, the lower the success rate. Moreover, as figure 3 shows, there is a threshold (approximately 0.12m/s when L_obs = 0.2m) above which our footstep planning never finds a solution. Consequently, the velocity of the obstacle must be limited if we want the biped to cross the obstacle successfully.

4.2 Footstep Planning Examples

Figure 4 shows a footstep sequence when the robot crosses an obstacle. The size of the obstacle is equal to 0.2m and its velocity is constant during the whole simulation. Rectangles indicate the obstacle and the spots indicate the two positions of the feet (left and right) at each step. Table 2 gives the step length for all the steps. It must be pointed out that when the biped robot is close to the obstacle, the length of the step decreases in order to prepare the stepping over. Finally, the last step allows the robot to avoid the obstacle without collision.

Table 2: Length of the step L_step when v_obs = 0.1m/s and L_obs = 0.2m.

Step    1    2    3    4    5    6
L_step  0.50 0.22 0.50 0.45 0.13 0.50

Figure 4: Successful footstep planning when v_obs = 0.1m/s and L_obs = 0.2m.

It is pertinent to note that one of the most interesting points of our approach is its ability to operate when the velocity of the obstacle is not constant. Figure 5 shows the footstep sequence when the obstacle moves with a random velocity, obtained as the sum of a constant value equal to 0.1m/s and a random value in [−0.1,0.1]m/s. Table 3 gives v_obs and L_step for each step. The size of the obstacle is equal to 0.1m. It must be pointed out that the control strategy adapts the length of the step automatically according to the obstacle velocity, thanks to the FQL algorithm. For 1000 trials realized in the same conditions, the success rate is approximately 85%. This is very interesting because our strategy increases the robustness of the footstep planning.

Table 3: Obstacle velocity v_obs and step length L_step when v_obs is random and L_obs = 0.1m.

Step    1    2    3    4    5    6
v_obs   0.14 0.05 0.16 0.10 0.16 0.04
L_step  0.50 0.23 0.37 0.44 0.10 0.50

Figure 5: Successful footstep planning when v_obs is random and L_obs = 0.1m.

5 CONCLUSIONS

In this paper we have presented a footstep planning strategy for biped robots allowing them to step over

dynamic obstacles. Our footstep planning strategy is based on a fuzzy Q-learning concept. The most appealing feature of our approach is its robustness, related to the fact that the proposed footstep planning is operational for both constant and variable velocity of the obstacle.

Future works will focus on the improvement of our footstep planning strategy:

• First, our current control strategy does not take into account the duration of the step. However, this parameter is very important with dynamic obstacles. Therefore, our goal is to enhance the proposed footstep planning so that it handles both the length and the duration of the step,

• Second, in some cases the biped robot cannot step over the obstacle, for example when the obstacle is too large. Consequently, the footstep planning must be able to propose a path that makes the robot avoid the obstacle,

• Third, in the long term, our goal is to design a more general footstep planning based on both local footstep planning and global path planning,

• Finally, experimental validation may be considered on a real humanoid robot. In this case, it is necessary to design the joint trajectories based on the positions of the feet.

REFERENCES

M. Hackel. Humanoid Robots: Human-like Machines. I-Tech Education and Publishing, Vienna, Austria, June 2007.

A. Carlos, P. Filho. Humanoid Robots: New Developments. I-Tech Education and Publishing, Vienna, Austria, June 2007.

Y. Ayza, K. Munawar, M. B. Malik, A. Konno and M. Uchiyama. A Human-Like Approach to Footstep Planning. Humanoid Robots, I-Tech Education and Publishing, Vienna, Austria, June 2007, pp. 296-314.

J. Chestnutt, J. J. Kuffner. A Tiered Planning Strategy for Biped Navigation. Int. Conf. on Humanoid Robots (Humanoids'04), Santa Monica, California, 2004.

K. Sabe, M. Fukuchi, J. Gutmann, T. Ohashi, K. Kawamoto, and T. Yoshigahara. Obstacle Avoidance and Path Planning for Humanoid Robots using Stereo Vision. Int. Conf. on Robotics and Automation (ICRA), 2004, pp. 592-597.

J. J. Kuffner, K. Nishiwaki, S. Kagami, M. Inaba, H. Inoue. Footstep Planning Among Obstacles for Biped Robots. Proceedings of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2001, pp. 500-505.

J. J. Kuffner, K. Nishiwaki, S. Kagami, M. Inaba, H. Inoue. Online Footstep Planning for Humanoid Robots. Proceedings of IEEE Int. Conf. on Robotics and Automation (ICRA), 2003, pp. 932-937.

J. Chestnutt, M. Lau, G. Cheung, J. J. Kuffner, J. Hodgins, T. Kanade. Footstep Planning for the Honda ASIMO Humanoid. Proceedings of IEEE Int. Conf. on Robotics and Automation (ICRA), 2005, pp. 629-634.

C. Watkins, P. Dayan. Q-learning. Machine Learning, 8, 1992, pp. 279-292.

R. S. Sutton, A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

P. Y. Glorennec. Reinforcement Learning: an Overview. European Symposium on Intelligent Techniques (ESIT), 2000, pp. 17-35.

P. Y. Glorennec, L. Jouffe. Fuzzy Q-Learning. Proc. of FUZZ-IEEE'97, Barcelona, 1997.

L. Jouffe. Fuzzy Inference System Learning by Reinforcement Methods. IEEE Trans. on SMC, Part C, Vol. 28(3), August 1998.
