GENETIC REINFORCEMENT LEARNING OF FUZZY INFERENCE SYSTEM APPLICATION TO MOBILE ROBOTIC

Abdelkrim Nemra, Hacene Rezine and Abdelkrim Souici

Unit of Control, Robotic and Productic Laboratory

Polytechnical Military School

Keywords: Reinforcement learning, fuzzy controllers, genetic algorithms, mobile robotics.

Abstract: An efficient genetic reinforcement learning algorithm for designing a Fuzzy Inference System (FIS) without any prior knowledge is proposed in this paper. Reinforcement learning with Fuzzy Q-Learning (FQL) is applied to select the consequent action values of a fuzzy inference system. In this method the consequent value is selected from a predefined value set which is kept unchanged during learning, so if the optimal solution is not present in the randomly generated set, the performance may be poor. Genetic algorithms (GA) are therefore used to search on line for better consequent and premise parameters, using the learned Q-values as the fitness function. In the Fuzzy-Q-Learning Genetic Algorithm (FQLGA), the membership (premise) parameters are distributed equidistantly and the consequent parts of the fuzzy rules are generated randomly. The algorithm is validated in simulation and experimentation on mobile robot reactive navigation behaviours.

1 INTRODUCTION

In the last decade, fuzzy logic has supplanted conventional technologies in some scientific applications and engineering systems, especially in control systems, and particularly in the control of mobile robots moving in completely unknown environments. Fuzzy logic has the ability to express the ambiguity of human thinking and to translate expert knowledge into computable numerical data. Its relatively low computational complexity also makes it a good candidate for real-time applications. A fuzzy system consists of a set of fuzzy if-then rules. Conventionally, the selection of fuzzy if-then rules relies on a substantial amount of heuristic observation to express the knowledge of proper strategies.

Recently, many authors have shown that it is possible to reproduce the operation of any standard continuous controller using a fuzzy controller (Jouffe, 1996), (Watkins, 1992), (Glorennec, 1997), (Dongbing, 2003). However, it is difficult for human experts to examine complex systems, so it is not easy to design an optimized fuzzy controller. Generally, the performance of a fuzzy inference system (FIS) depends not only on the formulation of the rules but also on the numerical specification of all the linguistic terms used; a large number of choices must be made a priori, and it is not always easy or possible to extract these data from a human expert. These choices are made with empirical methods, so the design of the FIS can prove long and delicate given the large number of parameters to determine, and can lead to a solution with poor performance. To cope with this difficulty, many researchers have worked on learning algorithms for fuzzy system design. These automatic methods enable information to be extracted when the expert's prior knowledge is not available.

The most popular approach to designing a fuzzy logic controller (FLC) may be some kind of supervised learning, where training data are available. In real applications, however, the extraction of training data is not always easy and becomes impossible when the cost of obtaining it is prohibitive. For such problems, reinforcement learning is more suitable than supervised learning. In reinforcement learning, an agent receives from its environment a critic, called reinforcement, which can be thought of as a reward or a punishment. The objective is then to generate a policy maximizing, on average, the sum of the rewards over time, starting from experiments (state, action, reward).


This paradigm corresponds to one of the fundamental objectives of mobile robotics, which constitutes a privileged field of application for reinforcement learning. The paradigm suggested is to regard a behaviour as a sensor-effector correspondence function. The objective is to favour the robot's autonomy using learning algorithms.

In this article we use the reinforcement learning algorithm Fuzzy Q-Learning (FQL) (Jouffe, 1996), (Souici, 2005), which allows the adaptation of FIS-type learners (continuous states and actions); fuzzy Q-learning is applied to select the consequent action values of a fuzzy inference system. In these methods the consequent value is selected from a predefined value set which is kept unchanged during learning, so if an improper value set is assigned the algorithm may fail. The approach suggested here, called the Fuzzy-Q-Learning Genetic Algorithm (FQLGA), is a hybrid reinforcement-genetic method combining FQL and genetic algorithms for on-line optimization of the parametric characteristics of a FIS. In FQLGA the free parameters (precondition and consequent parts) are tuned by genetic algorithms (GAs), which are able to explore the space of solutions effectively.

This paper is organized as follows. Section 2 gives an overview of reinforcement learning. The implementation and the limits of the Fuzzy Q-Learning algorithm are introduced in Section 3. Section 4 describes the combination of reinforcement learning (RL) and the genetic algorithm (GA), and the architecture of the proposed algorithm, called the Fuzzy-Q-Learning Genetic Algorithm (FQLGA). In Section 5 this new algorithm is applied to the on-line learning of two elementary behaviours of mobile robot reactive navigation, "Go to Goal" and "Obstacles Avoidance". Finally, conclusions and prospects are drawn in Section 6.

2 REINFORCEMENT LEARNING

As previously mentioned, there are two ways to learn: either you are told what to do in different situations, or you get credit or blame for doing good or bad things respectively. The former is called supervised learning and the latter is called learning with a critic, of which reinforcement learning (RL) is the most prominent representative. The basic idea of RL is that agents learn behaviour through trial and error, and receive rewards for behaving in such a way that a goal is fulfilled.

The reinforcement signal measures the utility of the suggested outputs relative to the task to be achieved; the received reinforcement is the sanction (positive, negative or neutral) of the behaviour: this signal states what should be done without saying how to do it. The goal of reinforcement learning is to find the most effective behaviour, i.e. to know, in each possible situation, which action to take to maximize the cumulated future rewards. Unfortunately the sum of rewards could be infinite for some policies. To solve this problem a discount factor is introduced:

$$R = \sum_{k=0}^{\infty} \gamma^{k} r_{k} \qquad (1)$$

where $0 \le \gamma \le 1$ is the discount factor.

The idea of RL can be generalized into a model with two components: an agent that makes decisions and an environment in which the agent acts. At every time step the agent perceives information from the environment about the current state $s$. The information perceived could be, for example, the position of a physical agent, say its x and y coordinates. In every state the agent takes an action $u_t$, which transits the agent to a new state. As mentioned before, when taking that action the agent receives a reward.

Formally the model can be written as follows: at every time step $t$ the agent is in a state $s_t \in S$, where $S$ is the set of all possible states, and in that state the agent can take an action $u_t \in U_t$, where $U_t$ is the set of all possible actions in the state $s_t$. As the agent transits to a new state $s_{t+1}$ at time $t+1$ it receives a numerical reward $r_{t+1}$. It then updates its estimate of the action evaluation function using the immediate reinforcement $r_{t+1}$ and the estimated value of the following state, $V_t(s_{t+1})$, which is defined by:

$$V_t(s_{t+1}) = \max_{u \in U_{t+1}} Q_t(s_{t+1}, u) \qquad (2)$$

The Q-value of each state/action pair is updated by:

$$Q_{t+1}(s_t, u_t) = Q_t(s_t, u_t) + \beta \left\{ r_{t+1} + \gamma V_t(s_{t+1}) - Q_t(s_t, u_t) \right\} \qquad (3)$$

where $r_{t+1} + \gamma V_t(s_{t+1}) - Q_t(s_t, u_t)$ is the TD error and $\beta$ is the learning rate.
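As a minimal illustration of update rules (2)-(3), the following Python sketch maintains a tabular Q-function; the state/action space sizes and the values of $\beta$ and $\gamma$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes for the sketch
beta, gamma = 0.1, 0.9               # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, u, r, s_next):
    """One Q-Learning step after taking action u in state s and
    observing reward r and next state s_next."""
    v_next = Q[s_next].max()                  # V_t(s_{t+1}), eq. (2)
    td_error = r + gamma * v_next - Q[s, u]   # TD error
    Q[s, u] += beta * td_error                # eq. (3)
```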


This algorithm is called Q-Learning. It shows several interesting characteristics. The estimates of the function Q, also called the Q-values, are independent of the policy pursued by the agent. To calculate the evaluation function of a state it is not necessary to test all the possible actions in this state, but only to take the maximum Q-value in the new state (eq. 4). However, too quickly choosing the action with the greatest Q-value:

$$u_{t+1} = \arg\max_{u \in U_{t+1}} Q_t(s_{t+1}, u) \qquad (4)$$

can lead to local minima. To obtain a useful estimate of Q it is necessary to sweep and evaluate the whole set of possible actions for all the states: this is what is called the exploration phase (Jouffe, 1996), (Souici, 2005). The preceding algorithm, called TD(0), uses only the state which follows the robot's evolution; moreover, only the current state is concerned. Sutton (Souici, 2005) extended the evaluation to all states, according to their eligibility traces, which memorise the previously visited state-action pairs. Eligibility traces can be defined in several ways (Jouffe, 1996), (Watkins, 1992), (Souici, 2005). Accumulating eligibility is defined by:

$$e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma \lambda \, e_{t-1}(s) & \text{otherwise} \end{cases} \qquad (5)$$

The Q(λ) algorithm is a generalization of Q-Learning which uses eligibility traces (Souici, 2005); Q-Learning is then the particular case of Q(λ) with λ = 0.

$$Q_{t+1}(s_t, u_t) = Q_t(s_t, u_t) + \beta \left\{ r_{t+1} + \gamma V_t(s_{t+1}) - Q_t(s_t, u_t) \right\} e_t(s) \qquad (6)$$
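A corresponding sketch of the Q(λ) update (eqs. 5-6), under the same illustrative assumptions: here each state-action pair carries an accumulating eligibility trace, decayed by γλ at every step, and the TD error is propagated to all traced pairs.

```python
import numpy as np

n_states, n_actions = 10, 4
beta, gamma, lam = 0.1, 0.9, 0.7     # assumed values; lam = 0 recovers plain Q-Learning
Q = np.zeros((n_states, n_actions))
e = np.zeros((n_states, n_actions))  # accumulating eligibility traces

def q_lambda_update(s, u, r, s_next):
    """One Q(lambda) step with accumulating traces."""
    e[:] *= gamma * lam              # decay every trace (eq. 5, "otherwise" branch)
    e[s, u] += 1.0                   # accumulate on the visited pair (eq. 5)
    td_error = r + gamma * Q[s_next].max() - Q[s, u]
    Q[:] += beta * td_error * e      # propagate the error to all traced pairs (eq. 6)
```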

3 FUZZY Q-LEARNING ALGORITHM

In mobile robotics, the input and output variables given by the sensors and effectors are seldom discrete or, if they are, the number of states is very large. However, reinforcement learning as described above uses a discrete space of states and actions, which must be of reasonable size to enable the algorithms to converge in an acceptable time in practice.

The principle of the Fuzzy Q-Learning algorithm consists in choosing, for each rule $R_i$, a conclusion among a set of actions available for this rule. It thus implements a process of competition between actions. To this end, the actions corresponding to each rule $R_i$ have a quality $q_i$ which then determines the probability of choosing the action. The output action (discrete or continuous) is then the result of the inference between the various locally elected actions. The matrix $q$ makes it possible not only to implement the local policies of the rules, but also to represent the evaluation function of the overall t-optimal policy.

At every time step $t$ the agent is in a state $S_t \in S$, where $S$ is the set of all possible states, and in that state the agent can take an action $U_t \in U(S_t)$, where $U(S_t)$ is the set of all possible actions in the state $S_t$. The agent receives a numerical reward $r_{t+1} \in \mathbb{R}$ at time $t+1$ and transits to a new state $S_{t+1}$, which it perceives by means of the activation degrees of its rules. The FQL algorithm uses a Policy of Exploration/Exploitation (PEE) (Jouffe, 1996), (Souici, 2005), combining a random exploration part $\rho(U^i)$ and a deterministic guided part $\eta(U^i)$.

The steps are summarized as follows:

1. Calculation of the evaluation function:
$$Q_t^*(S_{t+1}) = \sum_{R_i \in A_t} \alpha_{R_i}(S_{t+1}) \, \max_{U \in U^i} q_t(U)$$

2. Calculation of the TD error:
$$\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma \, Q_t^*(S_{t+1}) - Q_t(S_t, U_t)$$

3. Update of the matrix of q-values:
$$q_{t+1}^i = q_t^i + \beta \, \tilde{\varepsilon}_{t+1} \, e_t^i, \quad \forall R_i$$

4. Election of the action to be applied:
$$U_{t+1}(S_{t+1}) = \sum_{R_i \in A_{t+1}} \mathrm{ElectionU}(q_{t+1}^i) \, \alpha_{R_i}(S_{t+1})$$

where Election is defined by:
$$\mathrm{ElectionU}(q_{t+1}^i) = \arg\max_{U \in U^i} \left( q_{t+1}(U) + \eta(U) + \rho(U) \right)$$

5. Update of the eligibility traces:
$$e_{t+1}(U^i) = \begin{cases} \gamma \lambda \, e_t(U^i) + \phi^i & \text{if } U^i = U_{t+1}^i \\ \gamma \lambda \, e_t(U^i) & \text{otherwise} \end{cases}$$

and a new calculation of the evaluation function:
$$Q_{t+1}(S_{t+1}, U_{t+1}) = \sum_{R_i \in A_{t+1}} q_{t+1}(U_{t+1}^i) \, \alpha_{R_i}(S_{t+1})$$

This value will be used to calculate the TD error at the next time step. However, the performance of the controller depends closely on the correct choice of the discrete action set, which is determined using a priori knowledge about the system. For complex systems like robots, prior knowledge is not available, and it then becomes difficult to determine a set of correct actions containing the optimal action for each fuzzy rule. To solve this problem and to improve the performance of reinforcement learning, genetic algorithms explore the broadest space of solutions to find the optimal solution (Dongbing, 2003), (Chia-Feng, 2005), (Min-Soeng, 2000), (Chia-Feng, 2003), and do so without any prior knowledge.
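To make steps 1-4 above concrete, here is a minimal Python sketch of one FQL time step; the FIS size, the exploration noise standing in for the terms η(U) and ρ(U), and the normalization of the rule activations are all illustrative assumptions.

```python
import numpy as np

n_rules, n_actions = 9, 5                          # assumed FIS size
beta, gamma = 0.05, 0.9                            # assumed learning parameters
actions = np.random.uniform(-1.0, 1.0, (n_rules, n_actions))  # candidate conclusions
q = np.zeros((n_rules, n_actions))                 # local qualities q^i

def elect_and_infer(alpha, noise=0.1):
    """Elect one action per rule (argmax of q plus exploration noise,
    standing in for eta + rho), then infer the global action as the
    activation-weighted sum of the locally elected actions (step 4)."""
    elected = (q + noise * np.random.randn(n_rules, n_actions)).argmax(axis=1)
    w = alpha / alpha.sum()                        # normalized activation degrees
    u_global = (w * actions[np.arange(n_rules), elected]).sum()
    return u_global, elected

def fql_update(alpha, elected, r, alpha_next):
    """Steps 1-3: evaluation function, TD error, and update of the
    elected local qualities (eligibility traces omitted for brevity)."""
    w, w_next = alpha / alpha.sum(), alpha_next / alpha_next.sum()
    q_star_next = (w_next * q.max(axis=1)).sum()                # step 1
    q_now = (w * q[np.arange(n_rules), elected]).sum()
    td_error = r + gamma * q_star_next - q_now                  # step 2
    q[np.arange(n_rules), elected] += beta * td_error * w       # step 3
```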

4 GENETIC REINFORCEMENT ALGORITHM

Genetic Algorithms (GA) are stochastic optimization algorithms founded on species evolution mechanisms (Goldberg, 1994). In a GA, a candidate solution for a specific problem is called an individual or a chromosome and consists of a linear list of genes. Each individual represents a point in the search space and, hence, a possible solution to the problem. A population consists of a finite number of individuals. Each individual is evaluated by an evaluating mechanism to obtain its fitness value. Based on this fitness value and undergoing genetic operators, a new population is generated iteratively, with each successive population referred to as a generation.

Genetic reinforcement learning makes it possible to determine the best set of parameters (antecedents/consequents) starting from a random initialization of these parameters; at each iteration, unlike in a traditional genetic algorithm, only one action is applied to the system. The GA uses three basic operators (selection, crossover and mutation) to manipulate the genetic composition of a population:

• Reproduction: individuals are copied according to their fitness values. The individuals with higher fitness values have more offspring than those with lower fitness values.

• Crossover: crossover happens between two parents that have high fitness values, with the crossover probability $p_c$. One-point crossover is used to exchange the genes.

• Mutation: real-valued mutation is done by adding a certain amount of noise (Gaussian in this paper) to new individuals to produce the offspring with the mutation probability $p_m$. For the $i$th variable in the $j$th individual, it can be expressed as:

$$a_{t+1} = a_t + \beta(i) \cdot N(0, \sigma) \qquad (7)$$

where $N$ denotes Gaussian noise and $\beta(i) = \exp(-i)$ for the $i$th generation.
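A small Python sketch of this mutation operator (eq. 7); the values of σ and $p_m$ are assumptions chosen for the example.

```python
import numpy as np

def mutate(a, i, sigma=0.1, p_m=0.1):
    """Real-valued mutation of eq. (7): with probability p_m per gene,
    add Gaussian noise N(0, sigma) scaled by beta(i) = exp(-i),
    where i is the generation index."""
    a = np.asarray(a, dtype=float)
    mask = np.random.rand(a.size) < p_m
    return a + mask * np.exp(-i) * np.random.normal(0.0, sigma, a.size)
```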

4.1 FQLGA Algorithm

Because of its simplicity, a Takagi-Sugeno fuzzy inference system with triangular membership functions is considered. The structure of the FIS (the partition of its input space and the number of IF-THEN rules) is predetermined. $a_j^i$ is a vector representing the discrete set of $K$ conclusions generated randomly for the rule $R_i$, with which is associated a vector $q_j^i$ representing the quality of each action ($i = 1 \ldots N$ and $j = 1 \ldots K$).

The principle of the approach is to use a population of $K$ FIS and to treat the output of each one of them as a possible action to apply to the system. The FQL algorithm exploits the local quality function $q$ associated with each action of a fuzzy rule (eq. 6), whereas the FQLGA algorithm uses as fitness function the sum of the local qualities $q$, given by:

$$f(Ind_j) = Q_{t+1}(S_{t+1}, SIF_j) = \sum_{i=1}^{N} q^i \qquad (8)$$

To reduce the training time, the quality matrix $Q$ is not reinitialized after each iteration but undergoes the same genetic operations as those applied to the set of individuals (selection, crossover).
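A sketch of one generation under this scheme: selection and one-point crossover are applied identically to the individuals and to their rows of the quality matrix, so the learned Q-values survive into the next generation. The array shapes and the softmax-based selection are assumptions for the example (the paper specifies fitness-proportional reproduction).

```python
import numpy as np

def next_generation(pop, q, fitness, p_c=0.8):
    """pop: (K, N) conclusions, q: (K, N) associated qualities,
    fitness: (K,) values from eq. (8). Returns the new population and
    its quality matrix, recombined with the same cut points."""
    K, N = pop.shape
    probs = np.exp(fitness - fitness.max())        # softmax so negative fitness works
    probs /= probs.sum()
    idx = np.random.choice(K, size=K, p=probs)     # reproduction
    new_pop, new_q = pop[idx].copy(), q[idx].copy()
    for k in range(0, K - 1, 2):                   # one-point crossover on pairs
        if np.random.rand() < p_c:
            cut = np.random.randint(1, N)
            for arr in (new_pop, new_q):
                tmp = arr[k, cut:].copy()
                arr[k, cut:] = arr[k + 1, cut:]
                arr[k + 1, cut:] = tmp
    return new_pop, new_q
```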

4.2 Optimization of the Consequent Part of a FIS

A population of $K$ individuals of a predetermined structure is adopted. The size of an individual is equal to the number $N$ of the FIS's rules. The architecture of the FQLGA algorithm proposed for the optimization of the conclusions is represented in figure 1.

Figure 1: Representation of the individuals and qualities of the actions in the FQLGA algorithm.


4.3 Optimization of the Antecedent Part of a FIS

To find the best set of premises generated by the GA, a population made up of $M$ FIS is created. Each individual (FIS) of the population encodes the parameters of the antecedents, i.e. the modal points of the FIS, and its performance is evaluated by the fitness function $Q$ (global quality). The conclusion part of each individual FIS remains fixed and corresponds to the values determined previously. The coding of the membership functions of the antecedent part of a FIS (individual) is done according to figure 2. To preserve the legibility and interpretability of the FIS, constraints are imposed during its evolution.

Figure 2: Coding of the parameters of the antecedent part.

Input $N$: $m_1^N < \cdots < m_{n_N-2}^N < m_{n_N-1}^N < m_{n_N}^N$ (the modal points of each input must remain ordered).

The fitness function used by the genetic algorithm for the optimization of the antecedent part is the global quality of the FIS, which uses the activation degrees of the fuzzy rules; this fitness function is given by the following equation:

$$f(ind) = Q(S(t), SIF) = \frac{\sum_{R_i \in A} \alpha_{R_i}(S(t)) \cdot q_t^i}{\sum_{R_i \in A} \alpha_{R_i}(S(t))} \qquad (9)$$
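A one-function Python sketch of this global-quality fitness (eq. 9); alpha holds the rule activation degrees for the current state and q_local the qualities of the (fixed) conclusions, both assumed one-dimensional arrays of length N.

```python
import numpy as np

def fis_fitness(alpha, q_local):
    """Global quality of a FIS (eq. 9): activation-weighted mean of the
    local qualities of its fixed conclusions."""
    return (alpha * q_local).sum() / alpha.sum()
```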

Figure 3: General architecture of FQLGA algorithm.

4.4 Optimization of the Consequent and Antecedent Parts of a FIS

A FIS consists of a set of fuzzy rules, each made of an antecedent and a consequent part. Optimizing the antecedent and the consequent part of the fuzzy controller at the same time is a complex problem which can admit several solutions that are not acceptable in reality; the great number of possible solutions makes the algorithm very heavy and can lead to risks of instability. The FQLGA algorithm proposed in figure 3 for the optimization of the premises and the conclusions allows the total parametric optimization of the FIS in three stages, represented in the flow chart of figure 4:

• At the beginning of the learning process the quality matrix is initialized to zero; the traditional FQL algorithm then evaluates each action using an exploration policy. This step finishes when a given number of negative reinforcements has been received.

• After the evaluation of the individuals, the genetic algorithm for the optimization of the consequent part of the fuzzy rules creates a new, better adapted generation. This stage is repeated until the conclusions converge or a certain number of generations has been reached. The algorithm then passes to the third stage.

• Once the conclusions of the FIS are optimized, the second genetic algorithm, for the optimization of the antecedent part, is carried out to adjust the positions of the input membership functions of the controller, which are initially equidistant on their universes of discourse.

Figure 4: Flow chart of FQLGA algorithm.


5 EXPERIMENTAL RESULTS

To verify the performance of FQLGA, two elementary behaviours of reactive navigation of a mobile robot, "Go to Goal" and "Obstacles Avoidance", are presented in this section; the algorithm was adopted in the experiments both in simulation (Saphira simulator) and on a real robot (Pioneer II).

5.1 «Go to Goal» Behaviour

The two input variables are the angle $\theta_{Rb}$ between the robot velocity vector and the robot-goal vector, and the robot-goal distance $\rho_b$. They are respectively defined by three (Negative, Zero, Positive) and two (Near, Far) fuzzy subsets (fig. 5). The two output variables are the rotation speed Vrot_CB and the translation speed Vtran_CB; each output is represented by nine actions initialized randomly.

Figure 5: Membership functions of the input space.

The reinforcement functions adopted for the two

outputs are respectively given by:

$$r_{Vrot\_CB}(t) = \begin{cases} 1 & \text{if } \dot{\theta} \cdot \theta < 0 \ \text{ or } -1° < \theta < +1° \\ 0 & \text{if } \dot{\theta} \cdot \theta = 0 \\ -1 & \text{otherwise} \end{cases} \qquad (10)$$

$$r_{Vtran\_CB}(t) = \begin{cases} 1 & \text{if } \dot{\rho} \le -0.15 \text{ and } \rho \ge 800 \\ 0 & \text{if } \dot{\rho} = 0 \\ 1 & \text{if } Vtrans \ge 220 \text{ and } \rho \ge 800 \text{ and } \mathrm{abs}(\theta) < 5 \\ -1 & \text{otherwise} \end{cases}$$

The parameters of the FQL algorithm and of the genetic algorithms are as follows: $L_p$ and $L_c$ respectively indicate the sizes of the chromosomes for the antecedent and conclusion parts, $N_p$ and $N_c$ respectively represent the sizes of the populations for the antecedent parameters and the conclusions, and $P_m$ is the mutation probability.

The simulation results for the first behaviour, "Go to Goal", are presented in figure 6; 28 generations were sufficient to find good actions.

Figure 6: Go to goal: Training/Validation.

Figure 7 shows the convergence of the fitness values of the genetic algorithms for the two output variables Vrot_CB and Vtran_CB, obtained during an experimental test.

Figure 7: Fitness functions for the two output variables.

5.2 «Obstacles Avoidance» Behaviour

The inputs of the FIS are the distances provided by the ultrasonic sensors in the three directions (Frontal, Left and Right), each defined by three fuzzy subsets: near (N), medium (M) and far (F) (fig. 8).

Figure 8: Membership functions of the input variables.

The conclusions (rotation speeds) are initialized randomly. The translation speed Vtran_EO is given analytically; it is linearly proportional to the frontal distance:

$$Vtran\_EO = \frac{V_{max}}{D_{max}} \, (Dis\_F - Ds) \qquad (11)$$

$V_{max}$ is the maximum speed of the robot, equal to 350 mm/s.


$D_{max}$ is the maximum value measured by the frontal sensors, estimated at 2000 mm, and $Ds$ is a safe distance fixed at 250 mm.

_

1 ( _ ) _ _

$$r_{Vrot\_EO}(t) = \begin{cases} 1 \times \mathrm{Signe}(Vrot\_EO) & \text{if } (Dis\_R < Dis\_L \text{ or } Dis\_F < Dis\_L) \text{ and } D_{min} < 800 \\ -1 \times \mathrm{Signe}(Vrot\_EO) & \text{if } (Dis\_L < Dis\_R \text{ or } Dis\_F < Dis\_R) \text{ and } D_{min} < 800 \\ 0 & \text{elsewhere} \end{cases} \qquad (12)$$

with $D_{min} = \min(Dis\_L, Dis\_F, Dis\_R)$.
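A Python sketch of this reinforcement function as reconstructed in eq. (12); the exact grouping of the conditions is an assumption, since only the thresholds and the Signe(Vrot_EO) factor are unambiguous in the source.

```python
import math

def r_vrot_eo(vrot_eo, dis_l, dis_f, dis_r):
    """Reward of eq. (12), as reconstructed: when an obstacle is near
    (D_min < 800 mm), reward rotation whose sign moves the robot away
    from the most obstructed side; 0 elsewhere."""
    d_min = min(dis_l, dis_f, dis_r)
    signe = math.copysign(1.0, vrot_eo)
    if d_min < 800:
        if dis_r < dis_l or dis_f < dis_l:     # obstacle rather to the right/front
            return 1.0 * signe
        if dis_l < dis_r or dis_f < dis_r:     # obstacle rather to the left/front
            return -1.0 * signe
    return 0.0
```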

The parameters of the FQL algorithm and of the genetic algorithms are identical to those of the preceding behaviour, except for the sizes of the chromosomes: $L_p = 12$ and $L_c = 27$.

Figure 9 represents the trajectories of the robot during the learning phase with the FQL algorithm and a random initialization of the consequent parts of the 27 fuzzy rules. Several failure situations are observed; this example shows the limits of the traditional FQL algorithm when prior knowledge about the system is not available.

Figure 9: Trajectories of the robot obtained by FQL

algorithm using a random initialization of parameters.

The trajectories of figure 10 show the effectiveness of the association of FQL reinforcement learning with the genetic algorithm as a stochastic exploration tool. The FQLGA algorithm makes it possible to find an optimal FIS for the desired behaviour (obstacle avoidance). The duration of learning depends on the genetic algorithm parameters and on the obstruction density of the environment. We observe that after each generation the quality of the FIS (the sum of the local qualities) increases, which gives the best individuals more chance of being in the next generations.

Figure 10: Learning/Validation Trajectories of the robot

with FQLGA algorithm for various environments.

Figure 11 shows the performance of the FQLGA algorithm compared to the FQL algorithm, which can become blocked in a local minimum when the optimal solution is not present in the randomly generated set of actions. The FQLGA algorithm, on the other hand, converges towards the optimal solution independently of the initialized values.

Figure 11: Evolution of the quality of the Fuzzy controller

with FQL and FQLGA algorithms.

5.3 Experimental Results with the Real

Robot Pioneer II

Figure 12 represents the results of the on-line learning of the Pioneer II robot for the "Go to Goal" behaviour. During the learning phase the robot does not follow a rectilinear trajectory (represented in green) between the starting point and the goal point, because several actions are tested (exploration). Finally the algorithm finds the good actions and the robot converges towards the goal, marked in red; the time necessary to find these good actions is estimated at 2 min. Each generation is created after twenty (20) failures have been noted. The learning process requires 32 and 38 generations respectively for the GA to determine the rotation and translation speeds.


Figure 12: On line learning of the real robot Pioneer II,

"Go to goal" behaviour.

Figure 13 represents the results of the on-line learning of the "Obstacles Avoidance" behaviour. For safety reasons we consider that the minimal distances detected by the frontal and left/right sonars are respectively 408 mm and 345 mm; a value lower than these distances is then considered a failure and involves a retreat (represented in green) of the mobile robot. A generation is created after 50 failures. The genetic algorithms require 14 generations to optimize the conclusions and 24 generations to optimize the parameters of the antecedents. The duration of the on-line learning is estimated at 20 min, an acceptable time compared with the heaviness of traditional genetic algorithms.

Figure 13: On line learning of the real robot Pioneer II, "Obstacles Avoidance" behaviour.

Figure 14 represents the evolution of the fitness function obtained during this experimentation.

Figure 14: Fitness function evolution, "Obstacles

avoidance" robot behaviour.

6 CONCLUSION

The combination of the Q-Learning reinforcement algorithm and genetic algorithms gives a new type of hybrid algorithm (FQLGA) which is more powerful than traditional learning algorithms. FQLGA proved its effectiveness when no prior knowledge about the system is available. Indeed, starting from a random initialization of the conclusion values and an equidistant distribution of the membership functions of the antecedent part, the genetic algorithm enables the best individual for the indicated task to be found using only the received reinforcement signal. The controller optimized by the FQLGA algorithm was validated on a real robot and satisfactory results were obtained. The next stage of this work is the on-line optimization of the structure of the fuzzy controller.

REFERENCES

L. Jouffe, "Actor-Critic Learning Based on Fuzzy Inference System", Proc. of the IEEE International Conference on Systems, Man and Cybernetics, Beijing, China, pp. 339-344, 1996.

C. Watkins and P. Dayan, "Q-Learning", Machine Learning, pp. 279-292, 1992.

P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning", Proc. of the IEEE Int. Conf. on Fuzzy Systems, pp. 659-662, 1997.

Dongbing Gu, Huosheng Hu and Libor Spacek, "Learning Fuzzy Logic Controller for Reactive Robot Behaviours", Proc. of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Japan, 20-24 July 2003.

A. Souici, "Apprentissage par Renforcement des Systèmes d'inférence Floue par des Méthodes par renforcement, application à la navigation réactive d'un robot mobile", Thèse de Magistère, Ecole Militaire Polytechnique, Janvier 2005.

Chia-Feng Juang, "Combination of Online Clustering and Q-Value Based GA for Reinforcement Fuzzy System Design", IEEE Transactions on Fuzzy Systems, Vol. 13, No. 3, pp. 289-302, June 2005.

Min-Soeng Kim and Ju-Jang Lee, "Constructing a Fuzzy Logic Controller Using Evolutionary Q-Learning", Proc. IEEE, pp. 1785-1790, 2000.

Chia-Feng Juang and Chun-Feng Lu, "Combination of On-line Clustering and Q-value Based Genetic Reinforcement Learning for Fuzzy Network Design", IEEE, 2003.

David Goldberg, "Algorithmes génétiques: Exploration, optimisation et apprentissage automatique", Addison-Wesley, France, 1994.
