BATCH REINFORCEMENT LEARNING
An Application to a Controllable Semi-active Suspension System
Simone Tognetti, Marcello Restelli, Sergio M. Savaresi
Dipartimento di elettronica e informazione, Politecnico di Milano, via Ponzio 34/5, 20133 Milano, Italy
Cristiano Spelta
Dipartimento di Ingegneria dell'Informazione e Metodi Matematici, Università degli Studi di Bergamo,
viale Marconi 5, 24044 Dalmine (BG), Italy
Keywords:
Batch reinforcement learning, Control theory, Nonlinear optimal control, Semi-active suspension.
Abstract:
The design problem of optimal comfort-oriented semi-active suspension has been addressed with different
standard techniques, which failed to produce an optimal strategy because the system is strongly non-linear
and the solution is too complex to be found analytically. In this work, we aim at solving this complex
problem by applying Batch Reinforcement Learning (BRL), an artificial intelligence technique that
approximates the solution of optimal control problems without knowing the system dynamics. Recently,
a quasi-optimal strategy for semi-active suspension has been designed and proposed, the Mixed SH-ADD
algorithm, to which the strategy designed in this paper is compared. We show that an accurately tuned BRL
provides a policy able to guarantee the overall best performance.
1 INTRODUCTION
Among the many different types of controlled suspen-
sion systems (see e.g., (Sammier et al., 2003; Savaresi
et al., 2005; Silani et al., 2002)), semi-active sus-
pensions have received a lot of attention since they
provide the best compromise between cost (energy-
consumption and actuators/sensors hardware) and
performance. The research activity on controllable
suspensions develops along two mainstreams: the de-
velopment of reliable, high-performance, and cost-
effective semi-active controllable shock-absorbers
(Electro-Hydraulic or Magneto-Rheological see e.g.,
(Ahmadian et al., 2001; Guardabassi and Savaresi,
2001; Valasek et al., 1998; Williams, 1997)), and
the development of control strategies and algorithms
which can fully exploit the potential advantages of
controllable shock-absorbers. This work focuses on
the control-design issue for road vehicles.
The design problem of optimal comfort-oriented
semi-active suspension has been addressed with dif-
ferent standard techniques, which failed to come up
with an optimal strategy because the system is strongly
non-linear and the solution is too complex to be found
analytically. The literature offers many contributions
that provide approximate solutions to the non-linear
problem or, alternatively, partially remove the non-
linearity to exploit linear techniques (see e.g.,
(Karnopp and Crosby, 1974; Sammier et al., 2003;
Savaresi and Spelta, 2008; Valasek et al., 1998)).
In this work, we aimed at solving the optimal con-
trol problem of comfort-oriented semi-active suspen-
sion by using Batch Reinforcement Learning (BRL).
Developed in the artificial intelligence research field,
BRL provides numerical algorithms able to approxi-
mate the solution of an optimal-control problem with-
out knowing the system dynamics (see (Kaelbling
et al., 1996) and (Sutton and Barto, 1998)). The al-
gorithm is independent of the model complexity
and can be trained on the real system without knowing
its dynamics. We compared the strategy obtained by
BRL with the ones given by the state-of-the-art semi-
active control algorithms. We showed that an accu-
rately tuned BRL provides a policy able to guarantee
the overall best performance.
The outline of the paper is as follows. In Sec-
tion 2 the control problem is stated. Section 3 re-
calls the BRL technique. Section 4 sums up the de-
sign of the BRL-based control rule. Section 5 motivates
the choice of the algorithm parameters, Section 6 presents
the experimental results, and, finally, Section 7 ends the
paper with some concluding remarks.
2 PROBLEM STATEMENT AND
PREVIOUS WORK
The dynamic model of a quarter-car system equipped
with semi-active suspensions can be described by
the following set of differential equations (Williams,
1997):
$$
\begin{aligned}
&M\ddot z(t) = -c(t)\big(\dot z(t)-\dot z_t(t)\big) - k\big(z(t)-z_t(t)-\Delta_s\big) - Mg \\
&m\ddot z_t(t) = c(t)\big(\dot z(t)-\dot z_t(t)\big) + k\big(z(t)-z_t(t)-\Delta_s\big) - k_t\big(z_t(t)-z_r(t)-\Delta_t\big) - mg \qquad [\,z_t(t)-z_r(t) < \Delta_t\,] \\
&\dot c(t) = -\beta c(t) + \beta c_{in}(t) \qquad [\,\hat c_{min} \le c_{in}(t) \le \hat c_{max}\,]
\end{aligned}
\qquad (1)
$$
where the symbols in (1) are as follows (see also
Figure 1): z(t), z_t(t), z_r(t) are the vertical positions of
the body, the unsprung mass, and the road profile, re-
spectively; M is the quarter-car body mass; m is the
unsprung mass (tire, wheel, brake caliper, suspension
links, etc.); k and k_t are the stiffnesses of the suspen-
sion spring and of the tire, respectively; Δ_s and Δ_t are
the lengths of the unloaded suspension spring and tire,
respectively; c(t) and c_in(t) are the actual and the re-
quested damping coefficients of the shock-absorber,
respectively.
Figure 1: Quarter-car diagram.
The damping-coefficient variation is ruled by a
1st-order dynamic model, where β is the bandwidth;
consequently, the actual damping coefficient remains
in the interval c_min ≤ c_in(t) ≤ c_max, where c_min and
c_max are design parameters of the semi-active shock-
absorber. This limitation is the so-called “passivity
constraint” of a semi-active suspension.
For the above quarter-car model, the following set
of parameters is used (unless otherwise stated): M =
400 kg, m = 50 kg, k = 20 kN/m, k_t = 250 kN/m,
c_min = 300 Ns/m, c_max = 3000 Ns/m and β = 100π.
Notice that (1) is non-linear, since the damping co-
efficient c(t) is a state variable; in the case of a passive
suspension with a constant damping coefficient c, (1)
reduces to a 4th-order linear system by simply set-
ting β → ∞ and c_in(t) = c.
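To make the quarter-car model concrete, the following minimal sketch integrates (1) with an explicit Euler step, using the parameters listed above; the unloaded lengths Δ_s and Δ_t, the integration step and all function/variable names are our own illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np

# Quarter-car parameters of Section 2 (SI units)
M, m = 400.0, 50.0              # body and unsprung masses [kg]
k, k_t = 20e3, 250e3            # suspension and tire stiffness [N/m]
c_min, c_max = 300.0, 3000.0    # damping range [Ns/m]
beta = 100 * np.pi              # damper bandwidth
g = 9.81
d_s, d_t = 0.5, 0.3             # unloaded lengths Delta_s, Delta_t (assumed values)

def quarter_car_step(x, c_in, z_r, dt=1e-3):
    """One explicit-Euler step of model (1).
    State x = [z, z_t, dz, dz_t, c]; z_r is the current road-profile sample."""
    z, zt, dz, dzt, c = x
    f_susp = -c * (dz - dzt) - k * (z - zt - d_s)        # suspension force on the body
    ddz = (f_susp - M * g) / M
    # tire force acts only while the tire is in contact with the road
    f_tire = -k_t * (zt - z_r - d_t) if (zt - z_r) < d_t else 0.0
    ddzt = (-f_susp + f_tire - m * g) / m
    dc = -beta * c + beta * np.clip(c_in, c_min, c_max)  # 1st-order damper dynamics
    return x + dt * np.array([dz, dzt, ddz, ddzt, dc])
```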
The general high-level structure of a comfort-
oriented control architecture for a semi-active suspen-
sion device is the following. The control variable is
the requested damping coefficient c_in(t). The mea-
sured output signals are two: the vertical acceleration
z̈(t) and the suspension displacement z(t) − z_t(t). The
disturbance is the road profile z_r(t) (a non-measurable
and unpredictable signal).
The goal of a comfort-oriented semi-active con-
trol system is to manage the damping of the shock-
absorber to filter the road disturbance towards the
body dynamics. Thus the following cost function is
introduced: $J = \int_0^{t}\ddot z^2(\tau)\,d\tau$. It has been shown in
(Savaresi et al., 2005) that the optimal control strat-
egy is necessarily a rationale that switches from the
minimum to the maximum damping of the shock ab-
sorber (two-state algorithms).
In the literature there exist many control strategies
with a flavor of optimality: Skyhook (SH) (Karnopp
and Crosby, 1974) and Acceleration Driven Damping
(ADD) (Savaresi et al., 2005). Recently an almost
optimal control strategy has been developed: the so-
called Mixed SH-ADD (Savaresi and Spelta, 2007).
Similarly to SH, also this strategy requires a two-state
damper:
$$
c_{in}(t) = \begin{cases}
c_{max} & \text{if } \big(\ddot z^2 - \alpha^2\dot z^2 \le 0 \ \wedge\ \dot z(\dot z-\dot z_t) > 0\big)\ \vee\ \big(\ddot z^2 - \alpha^2\dot z^2 > 0 \ \wedge\ \ddot z(\dot z-\dot z_t) > 0\big) \\
c_{min} & \text{if } \big(\ddot z^2 - \alpha^2\dot z^2 \le 0 \ \wedge\ \dot z(\dot z-\dot z_t) \le 0\big)\ \vee\ \big(\ddot z^2 - \alpha^2\dot z^2 > 0 \ \wedge\ \ddot z(\dot z-\dot z_t) \le 0\big)
\end{cases}
\qquad (2)
$$
Notice that, according to the sign of z̈² − α²ż²,
an appropriate sub-strategy is selected. This quan-
tity is a frequency selector, and α represents the de-
sired cross-over frequency between the two suboptimal
strategies, namely SH and ADD (see (Savaresi and
Spelta, 2007)). A single-sensor implementation of this
strategy has been recently developed (the so-called 1-
Sensor-Mix, (Savaresi and Spelta, 2008)). SH, ADD
and Mixed SH-ADD have been already compared in
the time and frequency domains (Savaresi and Spelta,
2007).
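As a reference for the comparisons that follow, rule (2) can be written down in a few lines; this is a minimal sketch, with variable names of our choosing (zdd: body acceleration, zd: body velocity, zd_t: unsprung-mass velocity, alpha: cross-over frequency).

```python
def mixed_sh_add(zdd, zd, zd_t, alpha, c_min=300.0, c_max=3000.0):
    """Two-state Mixed SH-ADD switching rule of Eq. (2)."""
    selector = zdd**2 - alpha**2 * zd**2   # frequency selector
    stroke_rate = zd - zd_t                # suspension elongation speed
    if selector <= 0:                      # low-frequency content: SH-like branch
        return c_max if zd * stroke_rate > 0 else c_min
    return c_max if zdd * stroke_rate > 0 else c_min   # high-frequency content: ADD-like branch
```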
3 REINFORCEMENT LEARNING
Research in Reinforcement Learning (RL) aims at
designing algorithms by which autonomous agents
(controller) can learn to behave (estimation of control
policy) in some appropriate fashion in some environ-
ment (controlled system), from their interaction (con-
trol variable) with this environment or from observa-
tions gathered from the environment (see e.g. (Sutton
and Barto, 1998) for a broad overview).
The interaction between the agent and the environ-
ment is modeled as a discrete-time Markov Decision
Process (MDP). An MDP is a tuple ⟨S, A, P, R, γ⟩,
where S is the state space, A is the action space,
P : S × A → Π(S) is the transition model that assigns
to each state-action pair a probability distribution over
S, R : S × A → Π(ℝ) is the reward function, or cost
function, that assigns to each state-action pair a prob-
ability distribution over ℝ, and γ ∈ [0, 1) is the discount
factor. At each time step, the agent chooses an action
according to its current policy π : S → Π(A), which
maps each state to a probability distribution over ac-
tions. The goal of an RL agent is to maximize the
expected sum of discounted rewards, that is, to learn
an optimal policy π* that leads to the maximization
of the action-value function, or cost-to-go, from each
state.
The optimal action-value function Q*(s(t), a(t)), with
s(t) ∈ S and a(t) ∈ A, is defined by the Bellman equation:
$$
Q^*(s(t),a(t)) = \sum_{s(t+T)\in S} P\big(s(t+T)\,\big|\,s(t),a(t)\big)\Big[R(s(t),a(t)) + \gamma \max_{a(t+T)\in A} Q^*\big(s(t+T),a(t+T)\big)\Big]
\qquad (3)
$$
where R(s, a) = E[R(s, a)] is the expected reward.
From the Control Theory perspective, this equation
represents the optimal cost-to-go; indeed, it is the
discrete-time version of the Hamilton-Jacobi-Bellman
equation.
In order to manage the huge amount of samples
needed to solve real-world tasks, batch approaches
have been proposed (Riedmiller, 2005; Antos et al.,
2008; Ernst et al., 2005). The main idea is to distin-
guish between the exploration strategy that collects
samples (sampling phase), and the off-line learning
algorithm that, on the basis of the samples, computes
the approximation of the action-value function (learn-
ing phase) that is the solution of the control problem.
3.1 Batch Reinforcement Learning
Let us consider a system having a discrete-time dy-
namics. If the transition model or the reward function
are unknown, we cannot use dynamic programming
to solve the control problem. However, we suppose
that a sampling phase is performed, by which a set of samples
$$
F = \{\langle s(t)_i,\ a(t)_i,\ s(t+1)_i,\ r(t)_i\rangle,\ i = 1,\dots,K\}
\qquad (4)
$$
is obtained from one or more system trajectories gen-
erated starting from an initial state, following a given
policy.
In the learning phase we used Fitted Q-Iteration
(FQI, see (Ernst et al., 2005)), which reformulates value-
function estimation as a sequence of regression prob-
lems by iteratively extending the optimization horizon
(the Q_N function). First, Q_0(s, a) is set to 0; then the al-
gorithm iterates over the full sample set F. Given
the i-th sample ⟨s(t)_i, a(t)_i, s(t+1)_i, r(t)_i⟩ and the
approximation of the Q-function at iteration N (Q_N), the es-
timation of Q_{N+1} is performed by using the Q-learning
update rule (Watkins, 1989):
$$
Q_{N+1}(s(t)_i, a(t)_i) = (1-\alpha)\,Q_N(s(t)_i, a(t)_i) + \alpha\Big(r(t)_i + \gamma \max_{a'\in A} Q_N\big(s(t+1)_i, a'\big)\Big).
\qquad (5)
$$
For each sample, a new one is generated by replac-
ing the single-step reward with the estimated Q-value.
This defines a regression problem from ⟨s(t)_i, a(t)_i⟩
to Q_1(s(t)_i, a(t)_i) that enables the estimation of Q_1.
Thereafter, at each iteration N, a new estimation is
performed exploiting the approximation obtained at the previ-
ous iteration.
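The learning phase can be sketched as follows, assuming scikit-learn's ExtraTreesRegressor as the tree ensemble of Section 3.2 (its internals differ slightly from the extremely-randomized trees of (Geurts et al., 2006)) and the standard FQI target, i.e. update (5) with α = 1 as in (Ernst et al., 2005); the discount factor value is not stated in the paper and is an arbitrary choice here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(samples, actions, n_iterations=10, gamma=0.99,
                       n_trees=50, n_min=50):
    """samples: list of (s, a, s_next, r) with s, s_next 1-D arrays and a a scalar."""
    X = np.array([np.append(s, a) for s, a, _, _ in samples])   # regression inputs (s, a)
    r = np.array([ri for _, _, _, ri in samples])
    s_next = np.array([sn for _, _, sn, _ in samples])
    targets = r.copy()                                          # Q_1 targets are the rewards
    q = None
    for _ in range(n_iterations):
        q = ExtraTreesRegressor(n_estimators=n_trees,
                                min_samples_leaf=n_min).fit(X, targets)
        # Bellman backup: maximize over the (finite) action set at the next states
        q_next = np.column_stack([
            q.predict(np.column_stack([s_next, np.full(len(s_next), a)]))
            for a in actions])
        targets = r + gamma * q_next.max(axis=1)
    return q
```

A greedy control action is then obtained, at each time step, by evaluating the returned regressor on both admissible damping values and selecting the maximizer.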
3.2 Q-function Approximation
Tree-based regression methods produce one or more
trees (an ensemble) composed of a set of de-
cision nodes used to partition the input space. The
tree determines a constant prediction in each re-
gion of the partition by averaging the output val-
ues of the elements of the training set TS =
{(i_1, o_1), ..., (i_#TS, o_#TS)} which belong to that re-
gion. The Q-function is approximated by taking as inputs
i_l = ⟨s(t)_l, a(t)_l⟩, while the output o_l is the Q-value
of i_l.
We used an extremely-randomized tree ensem-
ble (Geurts et al., 2006), that is, a regressor com-
posed of a forest of M trees, each constructed by ran-
domly choosing K cut-points i_j, representing the j-th
component of the state-action space, and the corre-
sponding binary splits [i_j < t], representing the cut
directions. The construction proceeds by choosing the
test that maximizes a given score. The algo-
rithm stops splitting a node when the number of ele-
ments in the node is lower than a parameter n_min.
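For reference, the ensemble parameters described above map roughly onto scikit-learn's ExtraTreesRegressor as sketched below; the mapping is approximate, since that implementation differs in details from (Geurts et al., 2006), and the numerical values are the ones selected later in Sections 5 and 6.

```python
from sklearn.ensemble import ExtraTreesRegressor

# Approximate mapping: M trees -> n_estimators, K random cut directions -> max_features,
# n_min -> min_samples_leaf (values taken from Sections 5 and 6).
q_regressor = ExtraTreesRegressor(n_estimators=50,      # M
                                  max_features=4,       # K = |S| + |A|
                                  min_samples_leaf=50)  # n_min
```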
4 PROBLEM DEFINITION
Figure 2: Comparison of the vertical body velocity (J_2) and acceleration (J_1) indexes w.r.t. the number of samples and NMin. J_1 and J_2 are normalized by the Mixed SH-ADD policy (J_1 = J_2 = 1).
The state space of a dynamical system is defined by
the set of state variables that compose its system of
ODEs. System (1) in its canonical form (Savaresi
et al., 2005) has 5 continuous state variables: S =
⟨ż(t), ż_t(t), c(t), z(t) − z̄, z_t(t) − z̄_t⟩. The presence of
non-measurable road disturbances makes both the
transition model and the cost function stochastic.
Since not all the variables are measurable, the BRL
policy is built exploiting the same sensor measures
as those used in the Mixed SH-ADD one (Savaresi
and Spelta, 2007): ⟨z̈(t), ż(t), ż(t) − ż_t(t)⟩. The ac-
tion space contains only two values: A = ⟨c_in(t)⟩
with c_in(t) ∈ {c_min, c_max}.
The minimization of the squared vertical acceler-
ations can be defined as the maximization of the fol-
lowing reward function:
$$
R_1(s(t), a(t)) = -\ddot z^2(t+T). \qquad (6)
$$
The ideal goal of a semi-active suspension system
is to negate the body vertical movements around its
steady state conditions, with respect to any road dis-
turbance. Thus, we considered also the following re-
ward:
$$
R_2(s(t), a(t)) = -\dot z^2(t+T) \qquad (7)
$$
which aims at minimizing the squared vertical velocity
of the body.
5 EXPERIMENTAL RESULTS:
ALGORITHM PARAMETERS
We performed a set of experiments in order to com-
pare the performance obtained with different parameterizations.
Samples are generated by controlling System (1) with
a random policy and by feeding it with a road dis-
turbance z_r(t) designed as integrated band-limited
white noise. This signal is a realistic approximation
of a road profile and excites all the system dynam-
ics (Hrovat, 1997). The number of samples ranges
from 1000 (10 seconds of system simulation at 100 Hz)
to 50000 (500 seconds of simulation).
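As an illustration of this sampling phase, the sketch below collects tuples ⟨s, a, s', r⟩ at 100 Hz; it reuses quarter_car_step() and the parameters from the sketch in Section 2, the road-profile filter and the initial state are our own assumptions (the paper does not detail them), and the reward is R_2 of Equation (7).

```python
import numpy as np

def collect_samples(n_samples, dt=0.01, cutoff=10.0, seed=0):
    """Sampling phase: random two-state policy, road profile obtained by integrating
    low-pass filtered white noise. Reuses quarter_car_step, d_s, d_t, c_min, c_max
    from the quarter-car sketch of Section 2."""
    rng = np.random.default_rng(seed)
    x = np.array([d_s + d_t, d_t, 0.0, 0.0, c_min])     # rough rest state (illustrative)
    z_r, w = 0.0, 0.0                                    # road height and filtered noise
    s = np.array([0.0, x[2], x[2] - x[3]])               # measured state <ddz, dz, dz - dz_t>
    samples = []
    for _ in range(n_samples):
        a = c_max if rng.integers(2) else c_min          # random two-state policy
        w += dt * cutoff * (rng.standard_normal() - w)   # band-limited white noise
        z_r += dt * w                                    # integration -> road profile
        prev_dz = x[2]
        for _ in range(10):                              # finer internal Euler sub-steps
            x = quarter_car_step(x, a, z_r, dt / 10)
        ddz = (x[2] - prev_dz) / dt                      # body acceleration (finite difference)
        s_next = np.array([ddz, x[2], x[2] - x[3]])
        r = -x[2] ** 2                                   # reward R_2: minus squared body velocity
        samples.append((s, a, s_next, r))
        s = s_next
    return samples
```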
The optimization horizon has been fixed to 10, since
it does not play a central role in this control prob-
lem. The number K of the regressor's cut points is set
to the dimension of the input space (in this case
K = |S| + |A| = 4, see (Geurts et al., 2006)). The
number of trees depends mainly on the problem com-
plexity and ranges from 1 to 100 in our experiments.
Finally, the number of samples in a leaf, which affects
the regressor's generalization ability, ranges from 2 to
1000. For each parameterization two cost functions
have been evaluated, J_1 and J_2, which are obtained
by learning the control policy with R_1 and R_2, respec-
tively.
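For completeness, this is how the normalized indexes can be computed from simulated trajectories (a sketch; here J_1 is the sum of squared body accelerations and J_2 the sum of squared body velocities, each divided by the value measured under the Mixed SH-ADD policy).

```python
import numpy as np

def normalized_indexes(ddz, dz, ddz_ref, dz_ref):
    """J1 (squared body acceleration) and J2 (squared body velocity),
    normalized by the corresponding values of the reference (Mixed SH-ADD) run."""
    j1 = np.sum(np.square(ddz)) / np.sum(np.square(ddz_ref))
    j2 = np.sum(np.square(dz)) / np.sum(np.square(dz_ref))
    return j1, j2
```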
Figure 2 shows a comparison of the policies obtained
by varying both the number of samples and the num-
ber of samples in a leaf (NMin). Cost-function val-
ues are normalized by the Mixed SH-ADD policy. Re-
sults showed that, as the number of samples increases,
the cost decreases. The relationship between NMin and
the cost is instead roughly quadratic (U-shaped): lower
values of NMin lead to over-fitting, larger values lead
to a poor policy approximation, while intermediate
values (NMin = 50) obtain the best performance.
In Figure 3, a comparison of cost functions ob-
tained by varying both the number of samples and the
number of trees is presented. Again, index values are
normalized by the Mixed policy. The performance
improves by increasing both the number of samples
and the number of trees. Nonetheless, after a cer-
tain value (NTree >= 50) no significant cost reduc-
tion is observed, while the computational cost grows
linearly.
Figures 2 and 3 point out that the overall behavior of
the BRL policy is better than that of the Mixed SH-ADD
on both indexes, J_1 and J_2. The more samples we
have, the more accurate the learned policy will be.
Few samples can be used, but the choice of NMin and
of the number of trees then becomes critical.
Figure 3: Comparison of the vertical body velocity (J_2) and acceleration (J_1) indexes w.r.t. the number of samples and the number of trees. J_1 and J_2 are normalized by the Mixed SH-ADD policy (J_1 = J_2 = 1).
Figure 4: Graphical representation of the Mixed SH-ADD and BRL policies projected on (z̈(t), ż(t)) (a) and on (ż(t), ż(t) − ż_t(t)) (b).
6 EXPERIMENTAL RESULTS:
POLICY COMPARISON
The experiments of Section 5 identified good param-
eter values for the estimation of an optimal policy us-
ing the BRL technique: state space S_3 (the three mea-
sured variables), action space A, reward function
R_2(s(t), a(t)) (Equation 7), 50K samples (500 seconds
of system simulation), a fitted horizon of 10, 50 trees,
5 random splits and NMin = 50.
The BRL policy is a multi-dimensional control
map that associates to every measurable state
⟨z̈(t), ż(t), ż(t) − ż_t(t)⟩ a control action c_in(t). A
graphical representation of this map is depicted in
Figure 4, where it is compared to the one obtained
by controlling the semi-active system with the Mixed
SH-ADD rule (a quasi-optimal algorithm).
Figure 4 shows that the BRL policy is very similar to
the one associated with the Mixed SH-ADD algorithm.
The BRL policy tends to prefer a highly damped sus-
pension. The main differences between the BRL map
and the Mixed SH-ADD one can be observed around the
origin of the axes. Notice, however, that in such a sit-
uation any selected damping has little influence on
the body dynamics.
The performance of the semi-active suspension
system fed with a random signal z_r(t) and ruled by
the BRL policy has been evaluated in the time and fre-
quency domains. The frequency-domain analysis is
reported in Figure 5, which depicts the approximate
frequency response obtained as the ratio between the
power spectra of the output z̈(t) and of the input z_r(t).
Figure 5: Frequency response from road profile to vertical acceleration for different policies: c_max, c_min, SH, ADD, Mixed SH-ADD (2-sensors) and the BRL policy.
Figure 6: Comparison of the performance indexes obtained by different policies (c_max, c_min, SH, ADD, Mixed SH-ADD, RL), using both cost functions J_1 and J_2 (normalized performance index values).
The time-domain results are condensed in Figure 6,
where the cost functions J_1 and J_2 are reported for the dif-
ferent control strategies and compared to the extreme
passive configurations.
Figure 5 shows that the BRL policy outperforms the
Mixed SH-ADD at low frequency. This is paid in
terms of filtering at high frequencies, where the Mixed
SH-ADD shows a better behavior. Figure 6 points
out that the BRL policy provides the overall best perfor-
mance in terms of minimization of the integral of the
squared vertical body acceleration.
7 CONCLUSIONS
In this work we applied Batch Reinforcement Learn-
ing (BRL) to the design problem of optimal comfort-
oriented semi-active suspension which has not been
solved with standard techniques due to its complex-
ity. Results showed that the BRL policy provides the
best performance in terms of road-disturbance filtering.
However, the achieved performance is not far from that
obtained by the Mixed SH-ADD. Thus, comparing
the numerical approximation given by BRL against
the analytical approximation given by the Mixed ap-
proach, we showed that they result in a similar strat-
egy. This is an important finding, which shows how
numerical-based model-free algorithms can be used
to solve complex control problems. Since BRL tech-
niques can be applied to systems with unknown dy-
namics and are robust to noisy sensors, we expect to
obtain even larger improvements on real motorbikes,
as shown by preliminary experiments.
REFERENCES
Ahmadian, M., Reichert, B. A., and Song, X. (2001).
System non-linearities induced by skyhook dampers.
Shock and Vibration, 8(2):95–104.
Antos, A., Munos, R., and Szepesvari, C. (2008). Fitted
q-iteration in continuous action-space mdps. In Platt,
J., Koller, D., Singer, Y., and Roweis, S., editors, Ad-
vances in Neural Information Processing Systems 20,
pages 9–16. MIT Press, Cambridge, MA.
Ernst, D., Geurts, P., Wehenkel, L., and Littman, L. (2005).
Tree-based batch mode reinforcement learning. Jour-
nal of Machine Learning Research, 6:503–556.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely
randomized trees. Machine Learning, 63(1):3–42.
Guardabassi, G. and Savaresi, S. (2001). Approximate lin-
earization via feedback: an overview. Automatica,
37(1):1–15.
Hrovat, D. (1997). Survey of advanced suspension devel-
opments and related optimal control applications. Au-
tomatica, 33(10):1781–1817.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996).
Reinforcement learning: a survey. Journal of Artificial
Intelligence Research, 4:237–285.
Karnopp, D. and Crosby, M. (1974). System for Controlling
the Transmission of Energy Between Spaced Mem-
bers. US Patent 3,807,678.
Riedmiller, M. (2005). Neural fitted q iteration - first experi-
ences with a data efficient neural reinforcement learn-
ing method. In ECML, pages 317–328.
Sammier, D., Sename, O., and Dugard, L. (2003). Sky-
hook and H∞ Control of Semi-active Suspensions:
Some Practical Aspects. Vehicle System Dynamics,
39(4):279–308.
Savaresi, S., Silani, E., and Bittanti, S. (2005).
Acceleration-Driven-Damper (ADD): An Optimal
Control Algorithm For Comfort-Oriented Semiactive
Suspensions. Journal of Dynamic Systems, Measure-
ment, and Control, 127:218.
Savaresi, S. and Spelta, C. (2007). Mixed Sky-Hook and
ADD: Approaching the Filtering Limits of a Semi-
Active Suspension. Journal of Dynamic Systems,
Measurement, and Control, 129:382.
Savaresi, S. and Spelta, C. (2008). A single-sensor control
strategy for semi-active suspensions. To appear.
Silani, E., Savaresi, S., Bittanti, S., Visconti, A., and
Farachi, F. (2002). The Concept of Performance-
Oriented Yaw-Control Systems: Vehicle Model and
Analysis. SAE Transactions, Journal of Passenger
Cars - Mechanical Systems, 111(6):1808–1818. ISBN
0-7680-1290-2.
Sutton, R. and Barto, A. (1998). Reinforcement Learning:
An Introduction. MIT Press.
Valasek, M., Kortum, W., Sika, Z., Magdolen, L., and
Vaculin, O. (1998). Development of semi-active
road-friendly truck suspensions. Control Engineering
Practice, 6:735–744.
Watkins, C. (1989). Learning from Delayed Rewards. PhD
thesis, Cambridge University, Cambridge, England.
Williams, R. (1997). Automotive active suspensions Part
1: basic principles. Proceedings of the Institution of
Mechanical Engineers, Part D: Journal of Automobile
Engineering, 211(6):415–426.