STOCHASTIC CONTROL STRATEGIES AND ADAPTIVE CRITIC
METHODS
Randa Herzallah
Faculty of Engineering Technology, Al-Balqa’ Applied University, Jordan
David Lowe
NCRG, Aston University, U.K.
Keywords:
Adaptive critic methods, functional uncertainty, stochastic control.
Abstract:
Adaptive critic methods have common roots as generalizations of dynamic programming for neural reinforcement learning approaches. Since they approximate the dynamic programming solutions, they are potentially suitable for learning in noisy, nonlinear and nonstationary environments. In this study, a novel probabilistic dual heuristic programming (DHP) based adaptive critic controller is proposed. In contrast to current approaches, the proposed probabilistic DHP adaptive critic method takes the uncertainties of the forward model and the inverse controller into consideration. It is therefore suitable for deterministic and stochastic control problems characterized by functional uncertainty. The theoretical development of the proposed method is validated by analytically evaluating the correct value of the cost function which satisfies the Bellman equation in a linear quadratic control problem. The target value of the critic network is then calculated and shown to be equal to the analytically derived correct value.
1 INTRODUCTION
In recent research on stochastic control systems, much attention has been paid to the problem of characterizing and incorporating functional uncertainty in dynamical control systems. This is because there is an increasing demand for high reliability of complex control systems, which are accompanied by high levels of inherent uncertainty in modeling and estimation and are characterized by intrinsic nonlinear dynamics involving unknown functionals and latent processes. Several methods have been developed; examples include feedback linearization techniques (Botto et al., 2000; Hovakimyan et al., 2001), backstepping techniques (Sastry and Isidori, 1989; Zhang et al., 2000; Lewis et al., 2000), neural network based methods (Wang and Huang, 2005; Ge and Wang, 2004; Ge et al., 2001; Murray-Smith and Sbarbaro, 2002; Fabri and Kadirkamanathan, 1998), stochastic adaptive control methods (Karny, 1996; Wang and Zhang, 2001; Wang, 2002; Herzallah and Lowe, 2007; Herzallah and Lowe, ), and adaptive critic based methods (Herzallah, 2007).
In the feedback linearization, backstepping and neural network based methods, only parameter or forward model uncertainty has been considered. The inverse controller has been assumed to be deterministic or dependent on the forward model. Stochastic adaptive control methods, on the other hand, have considered modeling the distribution of the inverse controller. However, uncertainty in the stochastic adaptive control methods proposed in (Karny, 1996; Wang and Zhang, 2001; Wang, 2002) has been treated as a nuisance or perturbation and therefore did not affect the derivation of the optimal control law. In other words, uncertainty has been assumed to be input-independent and consequently did not contribute to the derivation of the optimal control law. The stochastic adaptive control methods developed in (Herzallah and Lowe, 2007; Herzallah and Lowe, ), on the other hand, consider input-dependent uncertainty, and these methods have been shown to significantly improve the performance of the controller.
Selected adaptive critic (AC) methods, known as action-independent adaptive critic methods, have been shown to implement useful approximations of Dynamic Programming, a method for designing
optimal control policies in the context of nonlinear
plants (Werbos, 1992). However, in their conventional form, the action-independent adaptive critic methods do not take model uncertainty into consideration. In the most recent development of these methods, a novel dual heuristic programming (DHP) adaptive-critic-based cautious controller was proposed (Herzallah, 2007). The proposed controller avoids the pre-identification training phase of the forward model and inverse controller by taking model uncertainty into consideration when calculating the control law. Only forward model uncertainty was considered in (Herzallah, 2007). The inverse controller was assumed to be accurate, and no knowledge of its uncertainty needed to be characterized. However, similar to the forward model, the parameters of the inverse controller of nonlinear dynamical systems are usually optimized using nonlinear optimization methods. This inevitably leads to an uncertain model of the inverse controller. Consequently, the uncertainty of the inverse controller should be estimated and considered in the derivation of the optimal control law.
As a result, the dual heuristic programming (DHP) adaptive-critic-based cautious control method (Herzallah, 2007) is still in need of further development. This forms the main purpose of this paper, where functional uncertainty of both the forward model and the inverse controller is characterized and used in deriving the optimal control law. Hence the novelty of this work stems from considering functional uncertainty in the inverse controller as well as the forward model. Furthermore, a new method for estimating the functional uncertainty of the models will be introduced in this work. In contrast to the method proposed in (Herzallah, 2007), this method allows for multiplicative noise on both the state and the control law. It also guarantees the positivity of the covariance matrix of the errors. This will lead to a novel theoretical development for stochastic adaptive control. Moreover, the Riccati solution for a linear quadratic infinite horizon control problem will also be derived and compared to the solution of the developed probabilistic DHP adaptive critic method. The method developed in this paper enhances the performance of the system by utilizing more fully the probabilistic information provided by the forward model and the inverse controller. No pre-identification will be needed for the forward model, the critic, or the inverse controller. All networks in the newly developed framework will be adapted at each instant of time.
2 PRELIMINARIES
This preparatory section recalls basic elements of modeling the conditional distributions of the system outputs and the inverse controller, and states the aim of fully probabilistic control.
2.1 Basic Elements
The behavior of a general class of stochastic discrete-time systems with input u_op(k) and measurable state vector x(k) is described by a stochastic model of the following form

x(k+1) = g[x(k), u_op(k)] + η̃(k+1),                      (1)

where η̃(k+1) is random independent noise which has zero mean and covariance P̃. This can generally be expressed as:

x(k+1) = f[x(k), u_op(k), η̃(k+1)].                       (2)
The randomized controller to be designed is described by the following stochastic model

u_op(k) = c[x(k)] + ẽ(k),                                 (3)

where ẽ(k) represents random independent noise of zero mean and covariance matrix Q̃. Notice that only state dependent controllers are considered. However, assuming a state dependent controller can be shown to represent no real restriction (Mine and Osaki, 1970) provided that the state can be measured. The stochastic model of the controller can be reexpressed in the following general form:

u_op(k) = h[x(k), ẽ(k)].                                  (4)
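For illustration, the short sketch below simulates one rollout of the stochastic plant of Equations (1)-(2) under the randomized controller of Equations (3)-(4). The particular choices of g and c, and all numerical values, are illustrative assumptions only, not the models used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, u):
    # Illustrative nonlinear plant map (an assumption, not the paper's model).
    return 0.8 * x + 0.5 * np.tanh(u)

def c(x):
    # Illustrative state feedback law (an assumption).
    return -0.6 * x

P_tilde = 0.01    # covariance of the plant noise eta~(k+1)
Q_tilde = 0.005   # covariance of the controller noise e~(k)

x = np.array([1.0])
for k in range(20):
    u_op = c(x) + rng.normal(0.0, np.sqrt(Q_tilde), size=x.shape)     # Eq. (3)
    x = g(x, u_op) + rng.normal(0.0, np.sqrt(P_tilde), size=x.shape)  # Eq. (1)
print("state after 20 steps:", x)
```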
All probability density functions in this paper are
assumed to be unknown and need to be estimated.
The estimation method of these probability density
functions will be discussed in Section 2.3, but first we
introduce the aim of designing a probabilistic control.
2.2 Problem Formulation
In dynamic programming, the randomized controller of the above stochastic control problem is obtained by minimizing the expected value of the Bellman equation

J[x(k)] = < U(x(k), u_op(k)) + γ J[x(k+1)] >,             (5)

where < . > denotes the expected value, J[x(k)] is the cost to go from time k to the final time, U(x(k), u_op(k)) is the utility, which is the cost of going from time k to time k+1, and J[x(k+1)] is assumed to be the minimum cost of going from time k+1 to the final time. The term γ is a discount factor (0 ≤ γ ≤ 1) which allows the designer to weight the relative importance of present versus future utilities. The objective is then to choose the control sequence u(k), k = 1, 2, . . ., so that the function J in (5) is minimized.
The critic network in the DHP scheme estimates a variable λ[x(k)], defined as the derivative of J[x(k)] with respect to the vector x(k):

λ[x(k)] = ∂U[x(k), u_op(k)]/∂x(k) + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂x(k))
          + < λ[x(k+1)] (∂x(k+1)/∂x(k)) >
          + < λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂x(k)) >,                      (6)
where γ has been given the value of 1. Since <λ[x(k+1)]>, U[x(k), u_op(k)] and the system model derivatives are known, λ[x(k)] can be calculated. The optimality equation is defined as

∂J[x(k)]/∂u_op(k) = ∂U[x(k), u_op(k)]/∂u_op(k) + < λ[x(k+1)] (∂x(k+1)/∂u_op(k)) > = 0.   (7)

The above two equations are usually used in dynamic programming to solve an infinite or finite horizon control policy.
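As a concrete illustration of how Equations (6) and (7) are used in the DHP scheme, the following sketch evaluates the critic target λ[x(k)] and the control gradient for a deterministic linear plant with quadratic utility. The matrices, the assumed critic λ(x) = Mx at time k+1, and the convention of writing the gradient of x^T O x simply as Ox (the convention used in Section 4) are illustrative assumptions.

```python
import numpy as np

# Illustrative matrices (assumptions, chosen only for demonstration).
n, m = 2, 1
S = np.array([[0.9, 0.1], [0.0, 0.8]])   # plant:      x(k+1) = S x(k) + R u(k)
R = np.array([[0.0], [0.5]])
O = np.eye(n)                            # utility:    U = x'Ox + u'Gu
G = 0.1 * np.eye(m)
A = np.array([[-0.4, -0.6]])             # controller: u(k) = A x(k)
M = np.eye(n)                            # assumed critic at k+1: lambda(x) = M x

x = np.array([[1.0], [0.5]])
u = A @ x
x_next = S @ x + R @ u
lam_next = M @ x_next                    # lambda[x(k+1)]

# Eq. (6) with gamma = 1 (gradients of the quadratic forms written without the
# factor 2, matching the convention used in Section 4):
lam_target = O @ x + A.T @ (G @ u) + S.T @ lam_next + A.T @ (R.T @ lam_next)

# Eq. (7): derivative of J with respect to the control; zero at the optimum.
dJ_du = G @ u + R.T @ lam_next
print(lam_target.ravel(), dJ_du.ravel())
```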
If the nonlinear function f[x(k), u_op(k), η̃(k+1)] were known or the system were noiseless, and given a deterministic function for the inverse controller, the optimal control law which achieves the above objective can be derived using techniques of dynamic programming, or DHP adaptive critic methods as approximation methods to dynamic programming (Herzallah, 2007). Even if the function f[x(k), u_op(k), η̃(k+1)] were unknown, researchers in the model based adaptive critic field would simply adapt a forecasting network which predicts the conditional mean of the state vector. This means that only deterministic models were considered in the conventional theory of adaptive critic methods. Recently (Herzallah, 2007), it has been proved that the control law of the DHP adaptive critic methods derived under the assumption of a deterministic forward model is suboptimal. It has been shown in (Herzallah, 2007) that if the function of the controlled system is unknown, then the problem should be formulated in an adaptive control scheme, which is known to involve functional uncertainty. Therefore, forward model uncertainty was quantified and used in the developed control algorithm. However, only forward model uncertainty was considered in (Herzallah, 2007), and it was assumed to follow a Gaussian distribution. The inverse controller, on the other hand, was assumed to be a deterministic function.
In the current paper, the forward model and the inverse controller are described by probability density functions as shown in Equations (2) and (4). These probability density functions are not limited to Gaussian densities; they can be of any shape. As mentioned in Section 2.1, the probability density functions of the forward model and the inverse controller are assumed to be unknown and need to be estimated in this paper. The objective of the current paper is then to develop an appropriate method for estimating the non-Gaussian distributions of both the forward model and the inverse controller, and then to use this probabilistic information in the derivation of the optimal control law. This yields a novel DHP adaptive critic control algorithm, which we refer to as the probabilistic DHP adaptive critic method. The developed theory will be illustrated on a linear quadratic infinite horizon control problem. The Riccati solution for this linear problem will also be derived.
2.3 Stochastic Model Estimation
In the neurocontrol field researchers usually adapt forecasting networks to predict the conditional mean of the system output or state vector, x̂(k+1). In most control applications this is probably enough. However, with the growing complexity of control systems, and because of the inherent uncertainty in modeling and estimation, researchers have recently considered modeling the conditional distribution of stochastic systems rather than relying on the single point estimate of the neural network.
To estimate the conditional distribution of the system output, a neural network model is optimized such that its output approximates the conditional expectation of the system output. Once the output of the neural network model has been optimized, the stochastic model of the system is simply shown to be given by (Herzallah and Lowe, 2007)

x(k+1) = x̂(k+1) + η(k+1),                                 (8)

where x̂(k+1) = ĝ[x(k), u_op(k)], and η(k+1) represents an input dependent random noise. The stochastic model in Equation (8) can in turn be reexpressed in the following general form:

x(k+1) = f̂[x(k), u_op(k), η(k+1)].                        (9)
Usually the noise η(k+1) is assumed to follow a Gaussian distribution of zero mean and covariance matrix P. In this work the assumption of a Gaussian distribution is relaxed. In other words, η(k+1) is an input dependent random noise which could follow any non-Gaussian distribution of zero mean. This is a more realistic assumption, since a nonlinear mapping of a random variable is in general non-Gaussian. This non-Gaussian distribution will be identified by evaluating the expectation and moments of the distribution. For example, the second moment of the distribution is represented by its covariance matrix P. This covariance matrix represents the covariance of the error in predicting x(k+1).
The method proposed in (Herzallah and Lowe, 2007) estimates the conditional distribution of the system output by using another neural network model to provide a prediction for the input dependent covariance matrix P = < η(k+1) η^T(k+1) >. In the current paper we propose a different method for estimating the conditional distribution of the system output, which could be non-Gaussian as well. This novel proposed method is based on estimating the distribution of the input dependent error η(k+1), and not the input dependent covariance matrix P, from which the covariance matrix P can then be evaluated. The distribution of the input dependent error is estimated by using a Gaussian radial basis function (RBF) neural network, which has the important property of linear transformation:

η(x(k), u_op(k)) = w φ(x(k), u_op(k)),                    (10)
where w_i is a random vector which has zero mean and a covariance matrix Σ_i = < φ^T η_i^T η_i φ >, and i is the output index. Here the RBF neural network is taken to be a probabilistic rather than a deterministic model. To adapt this probabilistic neural network model the following conditions are assumed to hold for the neural network:
Assumption 1. The state and control are always confined within the network approximation region defined by a subset Z whose boundaries are known. This approximation region is a design parameter and could be made arbitrarily large.
Assumption 2. The basis function centers and width parameters ensuring that this condition is satisfied are known a priori.
The second assumption is justified by the universal approximation property of neural networks, together with well known methods for choosing appropriate basis function centers and width parameters a priori (Sanner and Slotine, 1992).
Using the neural network as a probabilistic model for the input dependent error allows us to consider multiplicative noise on both the state and control. Besides, it ensures the positivity of the error covariance matrix P. Following the same procedure as for the forward model, the stochastic model of the inverse controller is given by

u_op(k) = u(k) + e(k).                                    (11)

The distribution of the error in predicting the control law is also estimated using the same method as for the distribution of the error of the forward model.
To reemphasize, the method proposed in this section for estimating the conditional distributions of the models ensures the positivity of the covariance matrix of the errors, uses the neural network as a probabilistic model, and allows multiplicative noise to be considered on both the state and the control. In contrast, the method proposed in (Herzallah and Lowe, 2007) does not guarantee the positivity of the covariance matrix, and it uses the neural network as a deterministic model.
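A minimal sketch of this estimation step is given below, under several simplifying assumptions: residuals left over by an already trained forecasting network are represented through a Gaussian RBF basis as in Equation (10), the weight covariance Σ is formed by a simple moment-style estimate (one possible choice, not necessarily the estimator used in the paper), and the input dependent error variance is the quadratic form φ^T Σ φ, which is nonnegative by construction. The centers, widths and synthetic data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_features(Z, centers, width):
    """Gaussian RBF basis phi(x, u); centers and width are design choices (Assumption 2)."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# Synthetic illustration: inputs z = (x, u) and residuals r = x(k+1) - x_hat(k+1)
# left over by an already trained forecasting network (all values are assumptions).
Z = rng.uniform(-1.0, 1.0, size=(200, 2))
r = (0.1 + 0.2 * np.abs(Z[:, 0])) * rng.standard_normal(200)   # input dependent noise
centers, width = rng.uniform(-1.0, 1.0, size=(15, 2)), 0.5
Phi = rbf_features(Z, centers, width)                          # N x B basis matrix

# Probabilistic RBF model of Eq. (10): eta(z) = w' phi(z), with w zero mean and
# covariance Sigma. A simple moment-style estimate: represent each residual by the
# minimum-norm weight vector reproducing it, and average the outer products.
W = (r[:, None] * Phi) / (Phi ** 2).sum(1, keepdims=True)
Sigma = W.T @ W / len(r)                                       # positive semidefinite

def error_variance(z):
    """Input dependent error variance P(z) = phi(z)' Sigma phi(z) >= 0 by construction."""
    phi = rbf_features(z[None, :], centers, width)[0]
    return float(phi @ Sigma @ phi)

print(error_variance(np.array([0.0, 0.0])), error_variance(np.array([0.9, 0.0])))
```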
The theory developed in this section will be used in the next section for developing the theory behind the probabilistic DHP adaptive critic method proposed in this paper.
3 PROPOSED PROBABILISTIC
ADAPTIVE CRITIC METHOD
In this section we propose a probabilistic type DHP adaptive critic controller which takes uncertainty of the forward model and the inverse controller into consideration when calculating the control law. The proposed controller can be obtained directly by optimally solving the adaptive critic problem, which considers stochastic models rather than deterministic models. In the proposed probabilistic DHP adaptive critic method, the control law is derived so as to minimize the expected value of the cost-to-go J[x(k)] given in (5) using γ = 1, but with the uncertainty of the models' estimates being taken into consideration. This is
accomplished by treating the forward model and the
inverse controller as random variables.
Following the procedure presented in Section 2.3 the conditional distributions of the forward model and the inverse controller are estimated. Using this in equation (5), Bellman's equation could be reexpressed as:

J[x(k)] = < U(x(k), u_op(k)) > + < J[x(k+1)] >
        = < U(x(k), u_op(k)) > + < J[f̂(x(k), u_op(k), η(k+1))] >                    (12)
Since the errors η(k+1) and e(k) of the forward model and the inverse controller respectively are state dependent, the variable λ[x(k)] is shown to be given by the following theorem.
Theorem 1. The variable λ[x(k)] of the cost function of equation (12), subject to the stochastic models of equations (9) and (11), is given by

λ[x(k)] = < ∂U[x(k), u_op(k)]/∂x(k)
          + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) >
        + < λ[x(k+1)] (∂x(k+1)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k))
          + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) >   (13)
Proof. To prove the above theorem we simply differentiate the cost function of equation (12) with respect to the state x(k) at time k.
The error in predicting the state vector, η(k+1), is dependent on the control signal as well, so the optimality equation can be seen to be given by the following theorem.
Theorem 2. The optimality equation of the cost function of equation (12), subject to the stochastic models of equations (9) and (11), is given by

∂J[x(k)]/∂u(k) = < (∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k))
               + λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k))
               + λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) > = 0   (14)
Proof. To prove the above theorem we simply differentiate the cost function of equation (12) with respect to the optimal control u(k) at time k.
The training process for the probabilistic type DHP adaptive critic proposed in this section is exactly the same as that for the conventional DHP adaptive critic. It consists of training the action network, which outputs the optimal control policy u[x(k)], and the critic network, which approximates the derivative of the cost function λ[x(k)]. As a first step both networks' parameters are initially randomized. Next, the difference between the target value of the critic, λ*[x(k)], calculated from Equation (13), and the critic network output λ[x(k)] is used to correct the critic network until it converges. The output from the converged critic is used in (14) to solve for the target u_op(k), which is then used to correct the action network. These two steps continue until a predetermined level of convergence is reached.
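The sketch below outlines this alternating training loop for a linear plant, a linear action network u(k) = A x(k) and a linear critic λ[x(k)] = M x(k), with linear error models analogous to those used in Section 4; the expectations in Equations (13) and (14) are approximated by Monte Carlo draws of the error-model matrices. The learning rates, update rules and all numerical values are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 1

# Illustrative linear plant, quadratic utility and error-model scale (assumptions,
# analogous to the linear quadratic example of Section 4).
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m)
sig = 0.05                        # scale of the random error-model matrices D, E, Q

A = np.zeros((m, n))              # action network:  u(k) = A x(k)
M = np.eye(n)                     # critic network:  lambda[x(k)] = M x(k)
n_mc, lr_M, lr_A = 300, 0.5, 0.1

for it in range(300):
    lam_t = np.zeros((n, n)); dJ_du = np.zeros((m, n))
    for _ in range(n_mc):
        D = sig * rng.standard_normal((n, n))
        E = sig * rng.standard_normal((n, m))
        Q = sig * rng.standard_normal((m, n))
        U_op = A + Q                               # u_op(k) = (A + Q) x(k)
        X_next = S + R @ U_op + D + E @ U_op       # x(k+1) = X_next x(k)
        # Eq. (13): critic target as a matrix, so that lambda*(x) = lam_t x.
        lam_t += (O + U_op.T @ G @ U_op + X_next.T @ M @ X_next) / n_mc
        # Eq. (14): gradient of J with respect to the control, as a matrix times x(k).
        dJ_du += (G @ U_op + (R + E).T @ M @ X_next) / n_mc
    M += lr_M * (lam_t - M)       # correct the critic toward its target
    A -= lr_A * dJ_du             # averaged gradient step on the action network gain

print("controller gain A:\n", A)
print("critic matrix M:\n", M)
```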
Because the proposed probabilistic DHP adaptive critic method takes model uncertainty into consideration, it is recommended that it be implemented on-line. The forward model of the plant to be controlled, the controller, and the critic networks can all be adapted on-line.
4 LINEAR QUADRATIC MODEL
The stochastic linear quadratic model is one of the most widely used models in modern control engineering and finance. To understand and prove the validity of the proposed probabilistic DHP adaptive critic method, the theory developed in the previous section is applied here to an infinite horizon linear quadratic control problem. Before we evaluate the proposed method itself, the correct values of various functions in this problem will be calculated, so that we have something to check the proposed method against. Besides evaluating the correct values of various functions, we also derive the Riccati solution of this nonstandard stochastic control problem.
4.1 Dynamic Programming Solution for
the Linear Quadratic Model
Suppose that the vector of observables, x(k), is the same as the state vector of the plant. Since we consider an infinite horizon problem, the objective is to minimize a measure of utility, U(k), summed from the present time to the infinite future, which is defined by:

U(k) = x^T O x + u_op^T G u_op.                           (15)

Suppose that the plant is described by the following stochastic model:

x(k+1) = S x(k) + R u_op(k) + η(k+1),                     (16)

where the error of the prediction, η(k+1), is estimated as described in Section 2.3. This error is shown in Section 2.3 to be control signal and state dependent.
Since it should have the same structure and the same inputs as the forward model of the plant, it is taken in this linear quadratic problem to be linear with two inputs, the state vector and the control signal:

η(k+1) = D x(k) + E u_op(k),                              (17)

where D and E are matrices of random numbers which contain the parameters of the error model. Suppose that the action network is described by the following stochastic model:

u_op(k) = u(k) + e(k),                                    (18)

where

u(k) = A x(k)                                             (19)

and where A is the matrix of the controller parameters, and e(k) is the error in predicting the optimal control, estimated as discussed in Section 2.3 and assumed to have the following form

e(k) = Q x(k),                                            (20)

where Q is a matrix of random numbers that describes the mapping from the state space to the error in predicting the optimal control law. Using the control expression of Equation (19) in Equation (18) and substituting back in Equation (16) yields:

x(k+1) = S̃ x(k) + R e(k) + η(k+1),                        (21)

where

S̃ = S + RA.                                               (22)

Similarly, the expression of the error in predicting the state vector as defined in Equation (17) can be rewritten in the following form

η(k+1) = D̃ x(k),                                          (23)

where we have used Equations (18), (19), and (20) and where

D̃ = D + EA + EQ.                                          (24)
As a preliminary step to calculating the correct value of the cost function of Bellman's equation, let us define M as the matrix that solves the following equation:

M = O + A^T G A + <Q^T G Q> + S̃^T M S̃ + <Q^T R^T M R Q> + <D̃^T M D̃>.                (25)
Following all the above definitions, the true value of the cost function J satisfying the Bellman equation is given in the following theorem:
Theorem 3. The true value of the cost function J, satisfying the Bellman equation (with γ = 1), subject to the system of equation (16) and the uncertainty models of the forward model and the inverse controller defined in Equations (17) and (20) respectively, and all other definitions previously mentioned, is given by:

J(x) = x^T M x.                                           (26)

Proof. To prove the above theorem we simply substitute into Bellman's equation (12) and verify that it is satisfied. For the left hand side of the equation, we get:

J[x(k)] = x^T(k) M x(k).                                  (27)
For the right hand side, we get:

< U(x(k), u_op(k)) + J[x(k+1)] >
   = < x^T O x + (Ax + e)^T G (Ax + e) > + < (S̃x + Re + η)^T M (S̃x + Re + η) >
   = x^T O x + x^T A^T G A x + x^T <Q^T G Q> x + x^T S̃^T M S̃ x
     + x^T <Q^T R^T M R Q> x + x^T <D̃^T M D̃> x,                                      (28)

where we used Equations (20) and (23) and where we made use of the fact that η and e are uncorrelated random variables of zero mean. Making use of Equation (25) in Equation (28) yields

J(x) = x^T M x.                                           (29)
Comparing Equations (26) and (29) we can see that Bellman's equation is satisfied. Howard has proven (Howard, 1960) that the optimal control law, based on the policy iteration method, can be derived by alternately calculating the cost function J for the current control law, modifying the control law so as to minimize the cost function J, recalculating J for the new control law, and so on.
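The matrix M of Equation (25) can also be obtained numerically as a fixed point. In the sketch below (all matrices and noise scales are illustrative assumptions), the right hand side of (25) is iterated with the expectations over Q, D and E estimated by Monte Carlo, and Theorem 3 is then checked independently by comparing x0^T M x0 with the average summed utility of simulated closed-loop rollouts, treating D, E and Q as fresh zero-mean draws at every time step, which is the reading consistent with the per-step expectations used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 1

# Illustrative system, utility, controller gain and noise scale (assumptions).
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m); A = np.array([[-0.3, -0.5]])
sig = 0.05                                   # scale of the random matrices D, E, Q
S_t = S + R @ A                              # Eq. (22)

def draw():
    """One joint draw of the zero-mean random matrices D, E and Q."""
    return (sig * rng.standard_normal((n, n)),
            sig * rng.standard_normal((n, m)),
            sig * rng.standard_normal((m, n)))

# Fixed-point iteration of Eq. (25); expectations estimated by Monte Carlo.
samples = [draw() for _ in range(5000)]
M = np.eye(n)
for _ in range(60):
    acc = O + A.T @ G @ A + S_t.T @ M @ S_t
    for D, E, Q in samples:
        D_t = D + E @ A + E @ Q              # Eq. (24)
        acc = acc + (Q.T @ G @ Q + Q.T @ R.T @ M @ R @ Q
                     + D_t.T @ M @ D_t) / len(samples)
    M = acc

# Independent check of Theorem 3: x0' M x0 should match the expected summed utility.
x0 = np.array([[1.0], [0.5]])
costs = []
for _ in range(1000):
    x, total = x0.copy(), 0.0
    for k in range(150):                     # long horizon approximates the infinite sum
        D, E, Q = draw()
        u_op = (A + Q) @ x                   # Eqs. (18)-(20)
        total += float(x.T @ O @ x + u_op.T @ G @ u_op)   # utility, Eq. (15)
        x = S @ x + R @ u_op + D @ x + E @ u_op           # Eqs. (16)-(17)
    costs.append(total)
print(float(x0.T @ M @ x0), float(np.mean(costs)))
```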
4.2 Proposed Probabilistic DHP
Adaptive Critic in the Linear
Quadratic Model
The objective of this section is to calculate the targets for the output of the critic network, λ*[x(k)], as they would be generated by the proposed probabilistic DHP adaptive critic, and then check them against the correct values. In other words we need to check that if the critic was initially correct it will stay correct after one step of adaptation.
From (29), the correct value of J(x) is x^T M x, and consequently the correct value of λ(x) is simply the gradient of J(x), i.e. λ(x) = M x(k). Hence λ(k+1) is given by

λ(k+1) = M x(k+1).

Next we carry out the calculations implied by equation (13), but with the expectation of the derivatives being evaluated at the end.
To calculate the first term on the right hand side of (13), we simply calculate the gradient of U(x(k), u_op(k)) with respect to x(k):

∂U[x(k), u_op(k)]/∂x(k) = O x(k).

For the second term, the values of the partial derivatives of u(k), u_op(k) and U(x(k), u_op(k)) with respect to x(k), u(k) and u_op(k) respectively need to be evaluated:

(∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = G(u + e)A.

The third term can be evaluated by calculating the partial derivatives of e(k), u_op(k) and U(x(k), u_op(k)) with respect to x(k), e(k) and u_op(k) respectively:

(∂U[x(k), u_op(k)]/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = G(u + e)Q.
The fourth term requires propagating λ(k+1) through the model of equation (16) back to x(k), which yields

λ[x(k+1)] (∂x(k+1)/∂x(k)) = M x(k+1) S.

The fifth term requires propagating λ(k+1) through the model of the plant, x(k+1), back to u_op(k) and then through the action network, which yields

λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = M x(k+1) R A.

The sixth term can be calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to u_op(k) and then through the error network of the controller, which yields

λ[x(k+1)] (∂x(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = M x(k+1) R Q.
The seventh term can also be calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to the error network of the forward model, which yields

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂x(k)) = M x(k+1) D.

The eighth term is calculated by propagating λ(k+1) through the model of the plant, x(k+1), back to the error network of the forward model, then to u_op(k), and then through the action network. This yields

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂u(k)) (∂u(k)/∂x(k)) = M x(k+1) E A.

Finally, the last term is calculated as follows:

λ[x(k+1)] (∂x(k+1)/∂η(k+1)) (∂η(k+1)/∂u_op(k)) (∂u_op(k)/∂e(k)) (∂e(k)/∂x(k)) = M x(k+1) E Q.
Adding all terms together and taking the expectation yields

λ*(k) = < Ox(k) + A^T G{Ax(k) + Qx(k)} + Q^T G{Ax(k) + Qx(k)}
        + S^T M{S̃x(k) + RQx(k) + D̃x(k)} + A^T R^T M{S̃x(k) + RQx(k) + D̃x(k)}
        + Q^T R^T M{S̃x(k) + RQx(k) + D̃x(k)} + D^T M{S̃x(k) + RQx(k) + D̃x(k)}
        + A^T E^T M{S̃x(k) + RQx(k) + D̃x(k)} + Q^T E^T M{S̃x(k) + RQx(k) + D̃x(k)} >,   (30)
where we used equations (21), (19), (20) and (23).
Evaluating the expectation of Equation (30) yields,
λ*(k) = Ox(k) + A^T G A x(k) + <Q^T G Q> x(k) + S̃^T M S̃ x(k)
        + <Q^T R^T M R Q> x(k) + <D̃^T M D̃> x(k),                                    (31)
where we made use of the fact that the expected value
of the random variables Q, E, and D is zero, that η
and e are uncorrelated and finally that Q and E are
uncorrelated random variables. Using equation (25)
in (31) yields,
λ*(k) = M x(k).                                           (32)
From (32) it can be clearly seen that the target vector
of the proposed probabilistic critic network is equal
to the correct value. This validates the theoretical
development of the probabilistic DHP adaptive critic
method proposed in this paper.
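The same machinery gives a numerical counterpart of this consistency check: with M obtained from Equation (25), the bracketed expression of Equation (30) is averaged over Monte Carlo draws of Q, D and E and compared with M x(k), as stated by Equation (32). All numerical values are illustrative assumptions; in the code, the utility terms of (30) are collapsed into (A+Q)^T G (A+Q) x and the propagated terms into X^T M X x, where X is the sampled closed-loop matrix with x(k+1) = X x(k).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2, 1

# Illustrative matrices and noise scale (assumptions), as in the earlier sketches.
S = np.array([[0.9, 0.1], [0.0, 0.8]]); R = np.array([[0.0], [0.5]])
O = np.eye(n); G = 0.1 * np.eye(m); A = np.array([[-0.3, -0.5]])
sig, N = 0.05, 50000
Ds = sig * rng.standard_normal((N, n, n))
Es = sig * rng.standard_normal((N, n, m))
Qs = sig * rng.standard_normal((N, m, n))
S_t = S + R @ A                                    # Eq. (22)
D_t = Ds + Es @ A + Es @ Qs                        # Eq. (24), one sample per draw

def rhs(M):
    """Right hand side of Eq. (25), with the expectations taken by Monte Carlo."""
    return (O + A.T @ G @ A + S_t.T @ M @ S_t
            + np.mean(np.swapaxes(Qs, 1, 2) @ G @ Qs, axis=0)
            + np.mean(np.swapaxes(Qs, 1, 2) @ R.T @ M @ R @ Qs, axis=0)
            + np.mean(np.swapaxes(D_t, 1, 2) @ M @ D_t, axis=0))

M = np.eye(n)
for _ in range(60):                                # fixed-point iteration for M
    M = rhs(M)

# Critic target of Eq. (30), averaged sample by sample.
x = np.array([[1.0], [-0.7]])
U = A + Qs                                         # u_op(k) = U x(k)
X = S_t + R @ Qs + D_t                             # x(k+1) = X x(k)
lam_target = np.mean(O @ x + np.swapaxes(U, 1, 2) @ G @ U @ x
                     + np.swapaxes(X, 1, 2) @ M @ X @ x, axis=0)
print(lam_target.ravel(), (M @ x).ravel())         # Eq. (32): the two should agree closely
```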
5 CONCLUSIONS
The nonstandard formulation of the stochastic control design presented in this paper leads to a different form of optimal controller that depends on the solution of stochastic functional equations. It provides the complete solution for designing a stochastic controller for complex control systems accompanied by high levels of inherent uncertainty in modeling and estimation. All probability density functions needed in the proposed methods are assumed to be unknown. To estimate these probability density functions we propose using probabilistic neural network models to estimate the errors in predicting the conditional expectations of the functions. This proposed method always guarantees the positivity of the covariance of the errors and allows for considering multiplicative noise on both the state and control of the system.
The proposed probabilistic DHP critic method is suitable for deterministic and stochastic control problems characterized by functional uncertainty. Unlike currently established control methods, it takes uncertainty of the forward model and inverse controller into consideration when deriving the optimal control law.
The theoretical development in this paper is demonstrated through a linear quadratic control problem. There, the correct value of the cost function which satisfies the Bellman equation is evaluated and shown to be equal to its corresponding value produced by the proposed probabilistic critic network.
REFERENCES
Botto, M. A., Wams, B., van den Boom, and da Costa,
J. M. G. S. (2000). Robust stability of feedback lin-
earised systems modelled with neural networks: Deal-
ing with uncertainty. Engineering Applications of Ar-
tificial Intelligence, 13(6):659–670.
Fabri, S. and Kadirkamanathan, V. (1998). Dual adaptive
control of nonlinear stochastic systems using neural
networks. Automatica, 34(2):245–253.
Ge, S. S., Hang, C. C., Lee, T. H., and Zhang, T. (2001). Sta-
ble Adaptive Neural Network Control. Kluwer, Nor-
well, MA.
Ge, S. S. and Wang, C. (2004). Adaptive neural control of uncertain MIMO nonlinear systems. IEEE Transactions on Neural Networks, 15(3):674–692.
Herzallah, R. (2007). Adaptive critic methods for stochas-
tic systems with input-dependent noise. Automatica.
Accepted to appear.
Herzallah, R. and Lowe, D. A Bayesian perspective on stochastic neuro control. IEEE Transactions on Neural Networks. Re-submitted 2006.
Herzallah, R. and Lowe, D. (2007). Distribution model-
ing of nonlinear inverse controllers under a Bayesian
framework. IEEE Transactions on Neural Networks,
18:107–114.
Hovakimyan, N., Nardi, F., and Calise, A. J. (2001). A
novel observer based adaptive output feedback ap-
proach for control of uncertain systems. In Proceed-
ings of the American Control Conference, volume 3,
pages 2444–2449, Arlington, VA, USA.
Howard, R. A. (1960). Dynamic Programming and Markov
Processes. The Massachusetts Institute of Technology
and John Wiley and Sons, Inc., New York. London.
Karny, M. (1996). Towards fully probabilistic control de-
sign. Automatica, 32(12):1719–1722.
Lewis, F. L., Yesildirek, A., and Liu, K. (2000). Robust backstepping control of induction motors using neural networks. IEEE Transactions on Neural Networks, 11:1178–1187.
Mine, H. and Osaki, S., editors (1970). Markovian Decision
Processes. Elsevier, New York, N.Y.
Murray-Smith, R. and Sbarbaro, D. (2002). Nonlinear adaptive control using non-parametric Gaussian process prior models. In 15th IFAC Triennial World Congress, Barcelona.
Sanner, R. M. and Slotine, J. J. E. (1992). Gaussian net-
works for direct adaptive control. IEEE Transactions
on Neural Networks, 3(6).
Sastry, S. S. and Isidori, A. (1989). Adaptive control of
linearizable systems. IEEE Transactions on Automatic
Control, 34(11):1123–1131.
Wang, D. and Huang, J. (2005). Neural network-based
adaptive dynamic surface control for a class of uncer-
tain nonlinear systems in strict-feedback form. IEEE
Transactions on Neural Networks, 16(1):195–202.
Wang, H. (2002). Minimum entropy control of non-Gaussian dynamic stochastic systems. IEEE Transactions on Automatic Control, 47(2):398–403.
Wang, H. and Zhang, J. (2001). Bounded stochastic distribution control for pseudo ARMAX stochastic systems. IEEE Transactions on Automatic Control, 46(3):486–490.
Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control, chapter 13, pages 493–526. Multiscience Press, Inc, New York, N.Y.
Zhang, Y., Peng, P. Y., and Jiang, Z. P. (2000). Stable neural
controller design for unknown nonlinear systems us-
ing backstepping. IEEE Transactions on Neural Net-
works, 11:1347–1359.