STOCHASTIC CONTROL STRATEGIES AND ADAPTIVE CRITIC
METHODS
Randa Herzallah
Faculty of Engineering Technology, Al-Balqa’ Applied University, Jordan
David Lowe
NCRG, Aston University, U.K.
Keywords:
Adaptive critic methods, functional uncertainty, stochastic control.
Abstract:
Adaptive critic methods have common roots as generalizations of dynamic programming for neural reinforcement learning approaches. Since they approximate the dynamic programming solutions, they are potentially suitable for learning in noisy, nonlinear and nonstationary environments. In this study, a novel probabilistic dual heuristic programming (DHP) based adaptive critic controller is proposed. In contrast to current approaches, the proposed probabilistic DHP adaptive critic method takes the uncertainties of the forward model and the inverse controller into consideration. It is therefore suitable for deterministic and stochastic control problems characterized by functional uncertainty. The theoretical development of the proposed method is validated by analytically evaluating the correct value of the cost function which satisfies the Bellman equation in a linear quadratic control problem. The target value of the critic network is then calculated and shown to be equal to the analytically derived correct value.
1 INTRODUCTION
In recent research on stochastic control systems, much attention has been paid to the problem of characterizing and incorporating functional uncertainty in dynamical control systems. This is because there is an increasing demand for high reliability of complex control systems, which are accompanied by high levels of inherent uncertainty in modeling and estimation and are characterized by intrinsic nonlinear dynamics involving unknown functionals and latent processes. Several methods have been developed; examples include feedback linearization techniques (Botto et al., 2000; Hovakimyan et al., 2001), backstepping techniques (Sastry and Isidori, 1989; Zhang et al., 2000; Lewis et al., 2000), neural network based methods (Wang and Huang, 2005; Ge and Wang, 2004; Ge et al., 2001; Murray-Smith and Sbarbaro, 2002; Fabri and Kadirkamanathan, 1998), stochastic adaptive control methods (Karny, 1996; Wang and Zhang, 2001; Wang, 2002; Herzallah and Lowe, 2007; Herzallah and Lowe, ), and adaptive critic based methods (Herzallah, 2007).
In the feedback linearization, backstepping and neural network based methods, only parameter or forward model uncertainty has been considered. The inverse controller has been assumed to be deterministic or dependent on the forward model. Stochastic adaptive control methods, on the other hand, have considered modeling the distribution of the inverse controller. However, uncertainty in the stochastic adaptive control methods proposed in (Karny, 1996; Wang and Zhang, 2001; Wang, 2002) has been treated as a nuisance or perturbation and therefore did not affect the derivation of the optimal control law. In other words, uncertainty has been assumed to be input-independent and consequently did not contribute to the derivation of the optimal control law. The stochastic adaptive control methods developed in (Herzallah and Lowe, 2007; Herzallah and Lowe, ), on the other hand, consider input-dependent uncertainty, and these methods have been shown to significantly improve the performance of the controller.
Selected adaptive critic (AC) methods, known as action-independent adaptive critic methods, have been shown to implement useful approximations of Dynamic Programming, a method for designing
optimal control policies in the context of nonlinear
plants (Werbos, 1992). However, in their conventional form, the action-independent adaptive critic methods do not take model uncertainty into consideration. In the most recent development of these methods, a novel dual heuristic programming (DHP) adaptive-critic-based cautious controller was proposed (Herzallah, 2007). The proposed controller avoids the pre-identification training phase of the forward model and inverse controller by taking model uncertainty into consideration when calculating the control law. Only forward model uncertainty was considered in (Herzallah, 2007). The inverse controller was assumed to be accurate, and no knowledge of its uncertainty needed to be characterized. However, similar to the forward model, the parameters of the inverse controller of nonlinear dynamical systems are usually optimized using nonlinear optimization methods. This inevitably leads to an uncertain model of the inverse controller. Consequently, the uncertainty of the inverse controller should be estimated and considered in the derivation of the optimal control law.
As a result, the dual heuristic programming (DHP) adaptive-critic-based cautious control method (Herzallah, 2007) is still in need of further development. This forms the main purpose of this paper, where functional uncertainty of both the forward model and the inverse controller is characterized and used in deriving the optimal control law. Hence the novelty of this work stems from considering functional uncertainty in the inverse controller as well as the forward model. Furthermore, a new method for estimating the functional uncertainty of the models will be introduced in this work. In contrast to the method proposed in (Herzallah, 2007), this method allows for multiplicative noise on both the state and the control law. It also guarantees the positivity of the covariance matrix of the errors. This will lead to a novel theoretical development for stochastic adaptive control. Moreover, the Riccati solution for a linear quadratic infinite horizon control problem will also be derived and compared to the solution of the developed probabilistic DHP adaptive critic method. The method developed in this paper enhances the performance of the system by utilizing more fully the probabilistic information provided by the forward model and the inverse controller. No pre-identification will be needed for the forward model, the critic, or the inverse controller. All networks in the newly developed framework will be adapted at each instant of time.
2 PRELIMINARIES
This preparatory section recalls basic elements of modeling the conditional distributions of the system outputs and the inverse controller, and states the aim of fully probabilistic control.
2.1 Basic Elements
The behavior of a general class of stochastic discrete-time systems with input u_op(k) and measurable state vector x(k) is described by a stochastic model of the following form

x(k+1) = g[x(k), u_op(k)] + η̃(k+1),                      (1)

where η̃(k+1) is random independent noise which has zero mean and covariance P̃. This can generally be expressed as:

x(k+1) = f[x(k), u_op(k), η̃(k+1)].                       (2)
The randomized controller to be designed is described by the following stochastic model

u_op(k) = c[x(k)] + ẽ(k),                                 (3)

where ẽ(k) represents random independent noise of zero mean and covariance matrix Q̃. Notice that only state dependent controllers are considered. However, assuming a state dependent controller can be shown to represent no real restriction (Mine and Osaki, 1970) provided that the state can be measured. The stochastic model of the controller can be reexpressed in the following general form:

u_op(k) = h[x(k), ẽ(k)].                                  (4)
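For illustration, the short sketch below simulates one rollout of the stochastic plant of Equations (1)-(2) under the randomized controller of Equations (3)-(4). The particular choices of g and c, and all numerical values, are illustrative assumptions only, not the models used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, u):
    # Illustrative nonlinear plant map (an assumption, not the paper's model).
    return 0.8 * x + 0.5 * np.tanh(u)

def c(x):
    # Illustrative state feedback law (an assumption).
    return -0.6 * x

P_tilde = 0.01    # covariance of the plant noise eta~(k+1)
Q_tilde = 0.005   # covariance of the controller noise e~(k)

x = np.array([1.0])
for k in range(20):
    u_op = c(x) + rng.normal(0.0, np.sqrt(Q_tilde), size=x.shape)     # Eq. (3)
    x = g(x, u_op) + rng.normal(0.0, np.sqrt(P_tilde), size=x.shape)  # Eq. (1)
print("state after 20 steps:", x)
```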
All probability density functions in this paper are
assumed to be unknown and need to be estimated.
The estimation method of these probability density
functions will be discussed in Section 2.3, but first we
introduce the aim of designing a probabilistic control.
2.2 Problem Formulation
In dynamic programming, the randomized controller of the above stochastic control problem is obtained by minimizing the expected value of the Bellman equation

J[x(k)] = < U(x(k), u_op(k)) + γ J[x(k+1)] >,             (5)

where < . > denotes the expected value, J[x(k)] is the cost to go from time k to the final time, U(x(k), u_op(k)) is the utility, which is the cost of going from time k to time k+1, and J[x(k+1)] is assumed to be the minimum cost of going from time k+1 to the final time. The term γ is a discount factor (0 ≤ γ ≤ 1) which allows the designer to weight the relative importance of present versus future utilities. The objective is then to choose the control sequence u(k), k = 1, 2, . . ., so that the function J in (5) is minimized.
The critic network in the DHP scheme estimates a variable λ[x(k)], defined as the derivative of J[x(k)] with respect to the vector x(k):

λ[x(k)] = ∂U[x(k), u_op(k)]/∂x(k) + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂x(k))
          + < λ[x(k+1)] (∂x(k+1)/∂x(k)) >
          + < λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂x(k)) >,                      (6)
where γ has been given the value of 1. Since <λ[x(k+1)]>, U[x(k), u_op(k)] and the system model derivatives are known, λ[x(k)] can be calculated. The optimality equation is defined as

∂J[x(k)]/∂u_op(k) = ∂U[x(k), u_op(k)]/∂u_op(k) + < λ[x(k+1)] (∂x(k+1)/∂u_op(k)) > = 0.   (7)

The above two equations are usually used in dynamic programming to solve an infinite or finite horizon control policy.
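As a concrete illustration of how Equations (6) and (7) are used in the DHP scheme, the following sketch evaluates the critic target λ[x(k)] and the control gradient for a deterministic linear plant with quadratic utility. The matrices, the assumed critic λ(x) = Mx at time k+1, and the convention of writing the gradient of x^T O x simply as Ox (the convention used in Section 4) are illustrative assumptions.

```python
import numpy as np

# Illustrative matrices (assumptions, chosen only for demonstration).
n, m = 2, 1
S = np.array([[0.9, 0.1], [0.0, 0.8]])   # plant:      x(k+1) = S x(k) + R u(k)
R = np.array([[0.0], [0.5]])
O = np.eye(n)                            # utility:    U = x'Ox + u'Gu
G = 0.1 * np.eye(m)
A = np.array([[-0.4, -0.6]])             # controller: u(k) = A x(k)
M = np.eye(n)                            # assumed critic at k+1: lambda(x) = M x

x = np.array([[1.0], [0.5]])
u = A @ x
x_next = S @ x + R @ u
lam_next = M @ x_next                    # lambda[x(k+1)]

# Eq. (6) with gamma = 1 (gradients of the quadratic forms written without the
# factor 2, matching the convention used in Section 4):
lam_target = O @ x + A.T @ (G @ u) + S.T @ lam_next + A.T @ (R.T @ lam_next)

# Eq. (7): derivative of J with respect to the control; zero at the optimum.
dJ_du = G @ u + R.T @ lam_next
print(lam_target.ravel(), dJ_du.ravel())
```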
If the nonlinear function f[x(k), u_op(k), η̃(k+1)] were known or the system were noiseless, and given a deterministic function for the inverse controller, the optimal control law which achieves the above objective can be derived using techniques of dynamic programming, or DHP adaptive critic methods as approximation methods to dynamic programming (Herzallah, 2007). Even if the function f[x(k), u_op(k), η̃(k+1)] were unknown, researchers in the model based adaptive critic field would simply adapt a forecasting network which predicts the conditional mean of the state vector. This means that only deterministic models were considered in the conventional theory of adaptive critic methods. Recently (Herzallah, 2007), it has been proved that the control law of the DHP adaptive critic methods derived under the assumption of a deterministic forward model is suboptimal. It has been shown in (Herzallah, 2007) that if the function of the controlled system is unknown, then the problem should be formulated in an adaptive control scheme, which is known to involve functional uncertainty. Therefore, forward model uncertainty was quantified and used in the developed control algorithm. However, only forward model uncertainty was considered in (Herzallah, 2007), and it was assumed to follow a Gaussian distribution. The inverse controller, on the other hand, was assumed to be a deterministic function.
In the current paper, the forward model and the inverse controller are described by probability density functions as shown in Equations (2) and (4). These probability density functions are not limited to Gaussian densities; they can be of any shape. As mentioned in Section 2.1, the probability density functions of the forward model and the inverse controller are assumed to be unknown and need to be estimated in this paper. The objective of the current paper is then to develop an appropriate method for estimating the non-Gaussian distributions of both the forward model and the inverse controller, and then to use this probabilistic information in the derivation of the optimal control law. This yields a novel DHP adaptive critic control algorithm, which we refer to as the probabilistic DHP adaptive critic method. The developed theory will be illustrated on a linear quadratic infinite horizon control problem. The Riccati solution for this linear problem will also be derived.
2.3 Stochastic Model Estimation
In the neurocontrol field researchers usually adapt forecasting networks to predict the conditional mean of the system output or state vector, x̂(k+1). In most control applications this is probably enough. However, with the growing complexity of control systems, and because of the inherent uncertainty in modeling and estimation, researchers have recently considered modeling the conditional distribution of stochastic systems rather than relying on the single point estimate of the neural network.
To estimate the conditional distribution of the system output, a neural network model is optimized such that its output approximates the conditional expectation of the system output. Once the output of the neural network model has been optimized, the stochastic model of the system is simply shown to be given by (Herzallah and Lowe, 2007)

x(k+1) = x̂(k+1) + η(k+1),                                 (8)

where x̂(k+1) = ĝ[x(k), u_op(k)], and η(k+1) represents an input dependent random noise. The stochastic model in Equation (8) can in turn be reexpressed in the following general form:

x(k+1) = f̂[x(k), u_op(k), η(k+1)].                        (9)
Usually the noise η(k+1) is assumed to follow a Gaussian distribution of zero mean and covariance matrix P. In this work the assumption of a Gaussian distribution is relaxed. In other words, η(k+1) is an input dependent random noise which could follow any non-Gaussian distribution of zero mean. This is a more realistic assumption, since a nonlinear mapping of a random variable is in general non-Gaussian. This non-Gaussian distribution will be identified by evaluating the expectation and moments of the distribution. For example, the second moment of the distribution is represented by its covariance matrix P. This covariance matrix represents the covariance of the error in predicting x(k+1).
The method proposed in (Herzallah and Lowe, 2007) estimates the conditional distribution of the system output by using another neural network model to provide a prediction for the input dependent covariance matrix P = < η(k+1) η^T(k+1) >. In the current paper we propose a different method for estimating the conditional distribution of the system output, which could be non-Gaussian as well. This novel proposed method is based on estimating the distribution of the input dependent error η(k+1), and not the input dependent covariance matrix P, from which the covariance matrix P can then be evaluated. The distribution of the input dependent error is estimated by using a Gaussian radial basis function (RBF) neural network, which has the important property of linear transformation:

η(x(k), u_op(k)) = w φ(x(k), u_op(k)),                    (10)
where w_i is a random vector which has zero mean and a covariance matrix Σ_i = < φ^T η_i^T η_i φ >, and i is the output index. Here the RBF neural network is taken to be a probabilistic rather than a deterministic model. To adapt this probabilistic neural network model the following conditions are assumed to hold for the neural network:
Assumption 1. The state and control are always confined within the network approximation region defined by a subset Z whose boundaries are known. This approximation region is a design parameter and could be made arbitrarily large.
Assumption 2. The basis function centers and width parameters ensuring that this condition is satisfied are known a priori.
The second assumption is justified by the universal approximation property of neural networks, together with well known methods for choosing appropriate basis function centers and width parameters a priori (Sanner and Slotine, 1992).
Using the neural network as a probabilistic model for the input dependent error allows us to consider multiplicative noise on both the state and control. Besides, it ensures the positivity of the error covariance matrix P. Following the same procedure as for the forward model, the stochastic model of the inverse controller is given by

u_op(k) = u(k) + e(k).                                    (11)

The distribution of the error in predicting the control law is also estimated using the same method as for the distribution of the error of the forward model.
To reemphasize, the method proposed in this section for estimating the conditional distributions of the models ensures the positivity of the covariance matrix of the errors, uses the neural network as a probabilistic model, and allows multiplicative noise to be considered on both the state and the control. In contrast, the method proposed in (Herzallah and Lowe, 2007) does not guarantee the positivity of the covariance matrix, and it uses the neural network as a deterministic model.
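A minimal sketch of this estimation step is given below, under several simplifying assumptions: residuals left over by an already trained forecasting network are represented through a Gaussian RBF basis as in Equation (10), the weight covariance Σ is formed by a simple moment-style estimate (one possible choice, not necessarily the estimator used in the paper), and the input dependent error variance is the quadratic form φ^T Σ φ, which is nonnegative by construction. The centers, widths and synthetic data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_features(Z, centers, width):
    """Gaussian RBF basis phi(x, u); centers and width are design choices (Assumption 2)."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Synthetic illustration: inputs z = (x, u) and residuals r = x(k+1) - x_hat(k+1)
# left over by an already trained forecasting network (all values are assumptions).
Z = rng.uniform(-1.0, 1.0, size=(200, 2))
r = (0.1 + 0.2 * np.abs(Z[:, 0])) * rng.standard_normal(200)   # input dependent noise
centers, width = rng.uniform(-1.0, 1.0, size=(15, 2)), 0.5
Phi = rbf_features(Z, centers, width)                          # N x B basis matrix

# Probabilistic RBF model of Eq. (10): eta(z) = w' phi(z), with w zero mean and
# covariance Sigma. A simple moment-style estimate: represent each residual by the
# minimum-norm weight vector reproducing it, and average the outer products.
W = (r[:, None] * Phi) / (Phi ** 2).sum(1, keepdims=True)
Sigma = W.T @ W / len(r)                                       # positive semidefinite

def error_variance(z):
    """Input dependent error variance P(z) = phi(z)' Sigma phi(z) >= 0 by construction."""
    phi = rbf_features(z[None, :], centers, width)[0]
    return float(phi @ Sigma @ phi)

print(error_variance(np.array([0.0, 0.0])), error_variance(np.array([0.9, 0.0])))
```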
The theory developed in this section will be used in the next section for developing the theory behind the probabilistic DHP adaptive critic method proposed in this paper.
3 PROPOSED PROBABILISTIC
ADAPTIVE CRITIC METHOD
In this section we propose a probabilistic type DHP adaptive critic controller which takes uncertainty of the forward model and the inverse controller into consideration when calculating the control law. The proposed controller can be obtained directly by optimally solving the adaptive critic problem, which considers stochastic models rather than deterministic models. In the proposed probabilistic DHP adaptive critic method, the control law is derived so as to minimize the expected value of the cost-to-go J[x(k)] given in (5) using γ = 1, but with the uncertainty of the models' estimates being taken into consideration. This is
accomplished by treating the forward model and the
inverse controller as random variables.
Following the procedure presented in Section 2.3 the conditional distributions of the forward model and the inverse controller are estimated. Using this in equation (5), Bellman's equation could be reexpressed as:

J[x(k)] = < U(x(k), u_op(k)) > + < J[x(k+1)] >
        = < U(x(k), u_op(k)) > + < J[f̂(x(k), u_op(k), η(k+1))] >                    (12)
Since the errors η(k+1) and e(k) of the forward model and the inverse controller respectively are state dependent, the variable λ[x(k)] is shown to be given by the following theorem.
Theorem 1. The variable λ[x(k)] of the cost function of equation (12), subject to the stochastic models of equations (9) and (11), is given by

λ[x(k)] = < ∂U[x(k), u_op(k)]/∂x(k)
          + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) >
        + < λ[x(k+1)] (∂x(k+1)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) >   (13)
Proof. To prove the above theorem we simply differentiate the cost function of equation (12) with respect to the state x(k) at time k.
The error in predicting the state vector, η(k+1), is dependent on the control signal as well, so the optimality equation can be seen to be given by the following theorem.
Theorem 2. The optimality equation of the cost function of equation (12), subject to the stochastic models of equations (9) and (11), is given by

∂J[x(k)]/∂u(k) = < (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k))
               + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k))
               + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) > = 0   (14)
Proof. To prove the above theorem we simply differentiate the cost function of equation (12) with respect to the optimal control u(k) at time k.
The training process for the probabilistic type DHP adaptive critic proposed in this section is exactly the same as that for the conventional DHP adaptive critic. It consists of training the action network, which outputs the optimal control policy u[x(k)], and the critic network, which approximates the derivative of the cost function λ[x(k)]. As a first step both networks' parameters are initially randomized. Next, the difference between the target value of the critic, λ*[x(k)], calculated from Equation (13), and the critic network output λ[x(k)] is used to correct the critic network until it converges. The output from the converged critic is used in (14) to solve for the target u_op(k), which is then used to correct the action network. These two steps continue until a predetermined level of convergence is reached.
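The sketch below outlines this alternating training loop for a linear plant, a linear action network u(k) = A x(k) and a linear critic λ[x(k)] = M x(k), with linear error models analogous to those used in Section 4; the expectations in Equations (13) and (14) are approximated by Monte Carlo draws of the error-model matrices. The learning rates, update rules and all numerical values are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 1

# Illustrative linear plant, quadratic utility and error-model scale (assumptions,
# analogous to the linear quadratic example of Section 4).
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m)
sig = 0.05                        # scale of the random error-model matrices D, E, Q

A = np.zeros((m, n))              # action network:  u(k) = A x(k)
M = np.eye(n)                     # critic network:  lambda[x(k)] = M x(k)
n_mc, lr_M, lr_A = 300, 0.5, 0.1

for it in range(300):
    lam_t = np.zeros((n, n)); dJ_du = np.zeros((m, n))
    for _ in range(n_mc):
        D = sig * rng.standard_normal((n, n))
        E = sig * rng.standard_normal((n, m))
        Q = sig * rng.standard_normal((m, n))
        U_op = A + Q                               # u_op(k) = (A + Q) x(k)
        X_next = S + R @ U_op + D + E @ U_op       # x(k+1) = X_next x(k)
        # Eq. (13): critic target as a matrix, so that lambda*(x) = lam_t x.
        lam_t += (O + U_op.T @ G @ U_op + X_next.T @ M @ X_next) / n_mc
        # Eq. (14): gradient of J with respect to the control, as a matrix times x(k).
        dJ_du += (G @ U_op + (R + E).T @ M @ X_next) / n_mc
    M += lr_M * (lam_t - M)       # correct the critic toward its target
    A -= lr_A * dJ_du             # averaged gradient step on the action network gain

print("controller gain A:\n", A)
print("critic matrix M:\n", M)
```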
Because the proposed probabilistic DHP adaptive critic method takes model uncertainty into consideration, it is recommended that it be implemented on-line. The forward model of the plant to be controlled, the controller, and the critic networks can all be adapted on-line.
4 LINEAR QUADRATIC MODEL
The stochastic linear quadratic model is one of the most widely used models in modern control engineering and finance. To understand and prove the validity of the proposed probabilistic DHP adaptive critic method, the theory developed in the previous section is applied here to an infinite horizon linear quadratic control problem. Before we evaluate the proposed method itself, the correct values of various functions in this problem will be calculated, so that we have something to check the proposed method against. Besides evaluating the correct values of various functions, we also derive the Riccati solution of this nonstandard stochastic control problem.
4.1 Dynamic Programming Solution for
the Linear Quadratic Model
Suppose that the vector of observables, x(k), is the same as the state vector of the plant. Since we consider an infinite horizon problem, the objective is to minimize a measure of utility, U(k), summed from the present time to the infinite future, which is defined by:

U(k) = x^T O x + u_op^T G u_op.                           (15)

Suppose that the plant is described by the following stochastic model:

x(k+1) = S x(k) + R u_op(k) + η(k+1),                     (16)

where the error of the prediction, η(k+1), is estimated as described in Section 2.3. This error is shown in Section 2.3 to be control signal and state dependent.
Since it should have the same structure and the same inputs as the forward model of the plant, it is taken in this linear quadratic problem to be linear with two inputs, the state vector and the control signal:

η(k+1) = D x(k) + E u_op(k),                              (17)

where D and E are matrices of random numbers which contain the parameters of the error model. Suppose that the action network is described by the following stochastic model:

u_op(k) = u(k) + e(k),                                    (18)

where

u(k) = A x(k)                                             (19)

and where A is the matrix of the controller parameters, and e(k) is the error in predicting the optimal control, estimated as discussed in Section 2.3 and assumed to have the following form

e(k) = Q x(k),                                            (20)

where Q is a matrix of random numbers that describes the mapping from the state space to the error in predicting the optimal control law. Using the control expression of Equation (19) in Equation (18) and substituting back in Equation (16) yields:

x(k+1) = S̃ x(k) + R e(k) + η(k+1),                        (21)

where

S̃ = S + RA.                                               (22)

Similarly, the expression of the error in predicting the state vector as defined in Equation (17) can be rewritten in the following form

η(k+1) = D̃ x(k),                                          (23)

where we have used Equations (18), (19), and (20) and where

D̃ = D + EA + EQ.                                          (24)
As a preliminary step to calculating the correct value of the cost function of Bellman's equation, let us define M as the matrix that solves the following equation:

M = O + A^T G A + <Q^T G Q> + S̃^T M S̃ + <Q^T R^T M R Q> + <D̃^T M D̃>.                (25)
Following all the above definitions, the true value of the cost function J satisfying the Bellman equation is given in the following theorem:
Theorem 3. The true value of the cost function J, satisfying the Bellman equation (with γ = 1), subject to the system of equation (16) and the uncertainty models of the forward model and the inverse controller defined in Equations (17) and (20) respectively, and all other definitions previously mentioned, is given by:

J(x) = x^T M x.                                           (26)

Proof. To prove the above theorem we simply substitute into Bellman's equation (12) and verify that it is satisfied. For the left hand side of the equation, we get:

J[x(k)] = x^T(k) M x(k).                                  (27)
For the right hand side, we get:

< U(x(k), u_op(k)) + J[x(k+1)] >
   = < x^T O x + (Ax + e)^T G (Ax + e) > + < (S̃x + Re + η)^T M (S̃x + Re + η) >
   = x^T O x + x^T A^T G A x + x^T <Q^T G Q> x + x^T S̃^T M S̃ x
     + x^T <Q^T R^T M R Q> x + x^T <D̃^T M D̃> x,                                      (28)

where we used Equations (20) and (23) and where we made use of the fact that η and e are uncorrelated random variables of zero mean. Making use of Equation (25) in Equation (28) yields

J(x) = x^T M x.                                           (29)
Comparing Equations (26) and (29) we can see that Bellman's equation is satisfied. Howard has proven (Howard, 1960) that the optimal control law, based on the policy iteration method, can be derived by alternately calculating the cost function J for the current control law, modifying the control law so as to minimize the cost function J, recalculating J for the new control law, and so on.
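The matrix M of Equation (25) can also be obtained numerically as a fixed point. In the sketch below (all matrices and noise scales are illustrative assumptions), the right hand side of (25) is iterated with the expectations over Q, D and E estimated by Monte Carlo, and Theorem 3 is then checked independently by comparing x0^T M x0 with the average summed utility of simulated closed-loop rollouts, treating D, E and Q as fresh zero-mean draws at every time step, which is the reading consistent with the per-step expectations used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 1

# Illustrative system, utility, controller gain and noise scale (assumptions).
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m); A = np.array([[-0.3, -0.5]])
sig = 0.05                                   # scale of the random matrices D, E, Q
S_t = S + R @ A                              # Eq. (22)

def draw():
    """One joint draw of the zero-mean random matrices D, E and Q."""
    return (sig * rng.standard_normal((n, n)),
            sig * rng.standard_normal((n, m)),
            sig * rng.standard_normal((m, n)))

# Fixed-point iteration of Eq. (25); expectations estimated by Monte Carlo.
samples = [draw() for _ in range(5000)]
M = np.eye(n)
for _ in range(60):
    acc = O + A.T @ G @ A + S_t.T @ M @ S_t
    for D, E, Q in samples:
        D_t = D + E @ A + E @ Q              # Eq. (24)
        acc = acc + (Q.T @ G @ Q + Q.T @ R.T @ M @ R @ Q
                     + D_t.T @ M @ D_t) / len(samples)
    M = acc

# Independent check of Theorem 3: x0' M x0 should match the expected summed utility.
x0 = np.array([[1.0], [0.5]])
costs = []
for _ in range(1000):
    x, total = x0.copy(), 0.0
    for k in range(150):                     # long horizon approximates the infinite sum
        D, E, Q = draw()
        u_op = (A + Q) @ x                   # Eqs. (18)-(20)
        total += float(x.T @ O @ x + u_op.T @ G @ u_op)   # utility, Eq. (15)
        x = S @ x + R @ u_op + D @ x + E @ u_op           # Eqs. (16)-(17)
    costs.append(total)
print(float(x0.T @ M @ x0), float(np.mean(costs)))
```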
4.2 Proposed Probabilistic DHP
Adaptive Critic in the Linear
Quadratic Model
The objective of this section is to calculate the targets for the output of the critic network, λ*[x(k)], as they would be generated by the proposed probabilistic DHP adaptive critic, and then check them against the correct values. In other words we need to check that if the critic was initially correct it will stay correct after one step of adaptation.
From (29), the correct value of J(x) is x^T M x, and consequently the correct value of λ(x) is simply the gradient of J(x), i.e. λ(x) = M x(k). Hence λ(k+1) is given by

λ(k+1) = M x(k+1).

Next we carry out the calculations implied by equation (13), but with the expectation of the derivatives being evaluated at the end.
To calculate the first term on the right hand side of (13), we simply calculate the gradient of U(x(k), u_op(k)) with respect to x(k):

∂U[x(k), u_op(k)]/∂x(k) = O x(k).

For the second term, the values of the partial derivatives of u(k), u_op(k) and U(x(k), u_op(k)) with respect to x(k), u(k) and u_op(k) respectively need to be evaluated:

(∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = G(u + e)A.

The third term can be evaluated by calculating the partial derivatives of e(k), u_op(k) and U(x(k), u_op(k)) with respect to x(k), e(k) and u_op(k) respectively:

(∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = G(u + e)Q.
The fourth term requires propagating λ(k+1) through the model of equation (16) back to x(k), which yields

λ[x(k+1)] (∂x(k+1)/∂x(k)) = M x(k+1) S.

The fifth term requires propagating λ(k+1) through the model of the plant, x(k+1), back to u_op(k) and then through the action network, which yields

λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = M x(k+1) R A.

The sixth term can be calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to u_op(k) and then through the error network of the controller, which yields

λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = M x(k+1) R Q.
The seventh term can also be calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to the error network of the forward model, which yields

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂x(k)) = M x(k+1) D.

The eighth term is calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to the error network of the forward model, then to u_op(k), and then through the action network. This yields

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = M x(k+1) E A.

Finally, the last term is calculated as follows:

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = M x(k+1) E Q.
Adding all terms together and taking the expectation yields

λ*(k) = < Ox(k) + A^T G{Ax(k) + Qx(k)} + Q^T G{Ax(k) + Qx(k)}
        + S^T M{S̃x(k) + RQx(k) + D̃x(k)} + A^T R^T M{S̃x(k) + RQx(k) + D̃x(k)}
        + Q^T R^T M{S̃x(k) + RQx(k) + D̃x(k)} + D^T M{S̃x(k) + RQx(k) + D̃x(k)}
        + A^T E^T M{S̃x(k) + RQx(k) + D̃x(k)} + Q^T E^T M{S̃x(k) + RQx(k) + D̃x(k)} >,   (30)
where we used equations (21), (19), (20) and (23).
Evaluating the expectation of Equation (30) yields,
λ*(k) = Ox(k) + A^T G A x(k) + <Q^T G Q> x(k) + S̃^T M S̃ x(k)
        + <Q^T R^T M R Q> x(k) + <D̃^T M D̃> x(k),                                    (31)
where we made use of the fact that the expected value
of the random variables Q, E, and D is zero, that η
and e are uncorrelated and finally that Q and E are
uncorrelated random variables. Using equation (25)
in (31) yields,
λ*(k) = M x(k).                                           (32)
From (32) it can be clearly seen that the target vector
of the proposed probabilistic critic network is equal
to the correct value. This validates the theoretical
development of the probabilistic DHP adaptive critic
method proposed in this paper.
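The same machinery gives a numerical counterpart of this consistency check: with M obtained from Equation (25), the bracketed expression of Equation (30) is averaged over Monte Carlo draws of Q, D and E and compared with M x(k), as stated by Equation (32). All numerical values are illustrative assumptions; in the code, the utility terms of (30) are collapsed into (A+Q)^T G (A+Q) x and the propagated terms into X^T M X x, where X is the sampled closed-loop matrix with x(k+1) = X x(k).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2, 1

# Illustrative matrices and noise scale (assumptions), as in the earlier sketches.
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m); A = np.array([[-0.3, -0.5]])
sig, N = 0.05, 50000
Ds = sig * rng.standard_normal((N, n, n))
Es = sig * rng.standard_normal((N, n, m))
Qs = sig * rng.standard_normal((N, m, n))
S_t = S + R @ A                                    # Eq. (22)
D_t = Ds + Es @ A + Es @ Qs                        # Eq. (24), one sample per draw

def rhs(M):
    """Right hand side of Eq. (25), with the expectations taken by Monte Carlo."""
    return (O + A.T @ G @ A + S_t.T @ M @ S_t
            + np.mean(np.swapaxes(Qs, 1, 2) @ G @ Qs, axis=0)
            + np.mean(np.swapaxes(Qs, 1, 2) @ R.T @ M @ R @ Qs, axis=0)
            + np.mean(np.swapaxes(D_t, 1, 2) @ M @ D_t, axis=0))

M = np.eye(n)
for _ in range(60):                                # fixed-point iteration for M
    M = rhs(M)

# Critic target of Eq. (30), averaged sample by sample.
x = np.array([[1.0], [-0.7]])
U = A + Qs                                         # u_op(k) = U x(k)
X = S_t + R @ Qs + D_t                             # x(k+1) = X x(k)
lam_target = np.mean(O @ x + np.swapaxes(U, 1, 2) @ G @ U @ x
                     + np.swapaxes(X, 1, 2) @ M @ X @ x, axis=0)
print(lam_target.ravel(), (M @ x).ravel())         # Eq. (32): the two should agree closely
```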
5 CONCLUSIONS
The nonstandard formulation of the stochastic control design presented in this paper leads to a different form of optimal controller that depends on the solution of stochastic functional equations. It provides the complete solution for designing a stochastic controller for complex control systems accompanied by high levels of inherent uncertainty in modeling and estimation. All probability density functions needed in the proposed methods are assumed to be unknown. To estimate these probability density functions we propose using probabilistic neural network models to estimate the errors in predicting the conditional expectations of the functions. This proposed method always guarantees the positivity of the covariance of the errors and allows for considering multiplicative noise on both the state and control of the system.
The proposed probabilistic DHP critic method is suitable for deterministic and stochastic control problems characterized by functional uncertainty. Unlike currently established control methods, it takes uncertainty of the forward model and inverse controller into consideration when deriving the optimal control law.
The theoretical development in this paper is demonstrated through a linear quadratic control problem. There, the correct value of the cost function which satisfies the Bellman equation is evaluated and shown to be equal to its corresponding value produced by the proposed probabilistic critic network.
REFERENCES
Botto, M. A., Wams, B., van den Boom, and da Costa,
J. M. G. S. (2000). Robust stability of feedback lin-
earised systems modelled with neural networks: Deal-
ing with uncertainty. Engineering Applications of Ar-
tificial Intelligence, 13(6):659–670.
Fabri, S. and Kadirkamanathan, V. (1998). Dual adaptive
control of nonlinear stochastic systems using neural
networks. Automatica, 34(2):245–253.
Ge, S. S., Hang, C. C., Lee, T. H., and Zhang, T. (2001). Sta-
ble Adaptive Neural Network Control. Kluwer, Nor-
well, MA.
Ge, S. S. and Wang, C. (2004). Adaptive neural control of uncertain MIMO nonlinear systems. IEEE Transactions on Neural Networks, 15(3):674–692.
Herzallah, R. (2007). Adaptive critic methods for stochas-
tic systems with input-dependent noise. Automatica.
Accepted to appear.
Herzallah, R. and Lowe, D. A Bayesian perspective on stochastic neuro control. IEEE Transactions on Neural Networks. Re-submitted 2006.
Herzallah, R. and Lowe, D. (2007). Distribution model-
ing of nonlinear inverse controllers under a Bayesian
framework. IEEE Transactions on Neural Networks,
18:107–114.
Hovakimyan, N., Nardi, F., and Calise, A. J. (2001). A
novel observer based adaptive output feedback ap-
proach for control of uncertain systems. In Proceed-
ings of the American Control Conference, volume 3,
pages 2444–2449, Arlington, VA, USA.
Howard, R. A. (1960). Dynamic Programming and Markov
Processes. The Massachusetts Institute of Technology
and John Wiley and Sons, Inc., New York. London.
Karny, M. (1996). Towards fully probabilistic control de-
sign. Automatica, 32(12):1719–1722.
Lewis, F. L., Yesildirek, A., and Liu, K. (2000). Robust backstepping control of induction motors using neural networks. IEEE Transactions on Neural Networks, 11:1178–1187.
Mine, H. and Osaki, S., editors (1970). Markovian Decision
Processes. Elsevier, New York, N.Y.
Murray-Smith, R. and Sbarbaro, D. (2002). Nonlinear adaptive control using non-parametric Gaussian process prior models. In 15th IFAC Triennial World Congress, Barcelona.
Sanner, R. M. and Slotine, J. J. E. (1992). Gaussian net-
works for direct adaptive control. IEEE Transactions
on Neural Networks, 3(6).
Sastry, S. S. and Isidori, A. (1989). Adaptive control of
linearizable systems. IEEE Transactions on Automatic
Control, 34(11):1123–1131.
Wang, D. and Huang, J. (2005). Neural network-based
adaptive dynamic surface control for a class of uncer-
tain nonlinear systems in strict-feedback form. IEEE
Transactions on Neural Networks, 16(1):195–202.
Wang, H. (2002). Minimum entropy control of non-Gaussian dynamic stochastic systems. IEEE Transactions on Automatic Control, 47(2):398–403.
Wang, H. and Zhang, J. (2001). Bounded stochastic distribution control for pseudo ARMAX stochastic systems. IEEE Transactions on Automatic Control, 46(3):486–490.
Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control, chapter 13, pages 493–526. Multiscience Press, Inc, New York, N.Y.
Zhang, Y., Peng, P. Y., and Jiang, Z. P. (2000). Stable neural
controller design for unknown nonlinear systems us-
ing backstepping. IEEE Transactions on Neural Net-
works, 11:1347–1359.