for example, from a human player. All data points can even be reused for multiple training updates, as they never become outdated with respect to the state-action-value definition.
The optimization objectives of pure policy opti-
mization methods are usually defined as maximizing
the expected future return when following the cur-
rent policy. Because of this, most pure policy opti-
mization algorithms are naturally on-policy methods.
During training, states and actions are sampled from
the current policy, and probabilities or actions are ad-
justed according to their realized or expected future
return. Without further adjustments, it is crucial that state samples are drawn from the stationary state distribution of the current policy. While it is technically possible to collect policy-independent experience from the environment and use the policy's log probabilities of the chosen actions for policy updates, this objective would maximize the expected value under the wrong state distribution. The resulting policy might visit different states when acting on its own and would then achieve far-from-optimal expected future returns under its own state distribution.
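For reference, the on-policy objective and its gradient can be written as follows; the symbols d^{π_θ} for the stationary state distribution and Q^{π_θ} for the state-action value are introduced here purely for notation and are not taken verbatim from the text:

\[
J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[ Q^{\pi_\theta}(s, a) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big].
\]

Replacing d^{π_θ} with the state distribution of a different behavior policy changes the objective that is actually being maximized, which is exactly the mismatch described above.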
One way to overcome this limitation and train a
policy optimization algorithm using data that was col-
lected from a different state distribution is importance
sampling (Sutton and Barto, 2018). The importance
sampling theorem states that the expected value of
any deterministic function f when drawing from a
distribution π_θ is equal to the expected value of f
multiplied by the ratio of the two probabilities when
drawing from a distribution D.
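Written out, with x denoting the sampled quantity (introduced here purely for notation), the identity reads

\[
\mathbb{E}_{x \sim \pi_\theta}\big[ f(x) \big]
= \mathbb{E}_{x \sim D}\!\left[ \frac{\pi_\theta(x)}{D(x)}\, f(x) \right],
\]

provided that D(x) > 0 wherever π_θ(x) f(x) is nonzero.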
In the RL setting, the expectations of the policy optimization objectives cannot be computed exactly and are instead approximated using a batch of sam-
ples. By applying the importance sampling equa-
tion, the expectation can be approximated with sam-
ples from an arbitrary, known state distribution. As
per the central limit theorem, the approximations with
both methods follow a normal distribution around the
actual expected value. Still, they can have signifi-
cant differences in variance if the two distributions
are very different (Sutton and Barto, 2018). Because of this unreliability, to our knowledge, there are no pure policy optimization algorithms that operate fully off-policy. Instead, some
Monte Carlo RL variants use importance sampling to
reduce simulation and calculation overhead for en-
vironments with long trajectories. For those cases,
simulating the realized rewards for just a single net-
work update is inefficient. Using importance sam-
pling, the latest trajectories can be reused for multiple
network updates if the policy does not change drasti-
cally within a few updates.
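The following is a minimal sketch of this reuse scheme in Python/PyTorch, assuming a discrete-action policy network; the function name, the use of realized Monte Carlo returns, and the stored old log probabilities are illustrative assumptions rather than details from the text:

```python
import torch

def reuse_trajectories(policy, optimizer, states, actions, returns,
                       old_log_probs, num_updates=4):
    """Importance-sampled Monte Carlo policy gradient (sketch): reuse one
    batch of trajectories collected under the old policy for several
    network updates by re-weighting with the probability ratio."""
    for _ in range(num_updates):
        # Log probabilities of the stored actions under the *current* policy.
        logits = policy(states)                              # (N, num_actions)
        log_probs = torch.log_softmax(logits, dim=-1)
        new_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

        # Importance weight pi_theta(a|s) / pi_old(a|s); only reliable while
        # the policy stays close to the one that generated the data.
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Re-weighted surrogate objective: maximizing it follows the
        # importance-sampled policy gradient, so we minimize its negative.
        loss = -(ratio * returns).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice, algorithms such as PPO additionally clip or otherwise constrain this ratio to keep the variance of the estimator in check when the policy drifts away from the data-collecting policy.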
Actor-critic algorithms can be designed to oper-
ate on-policy or off-policy. The on-policy actor-critic
variants use a policy optimization objective similar
to pure policy optimization algorithms (Sutton and
Barto, 2018). In this case, the policy is improved
by altering the log probabilities of actions according
to their learned Q-values or advantage-values. Off-
policy actor-critic variants are possible by training the
policy to reflect the action distributions implicitly de-
fined by the learned Q-values. This is a valid objec-
tive, as the policy is trained to choose the Q-value-
maximizing action in every state.
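As a concrete illustration of the off-policy variant, the following is a minimal DDPG-style actor update sketch in Python/PyTorch; the actor and critic modules, the replay states, and the function name are assumptions made here for illustration:

```python
import torch

def off_policy_actor_update(actor, critic, actor_optimizer, replay_states):
    """Actor update driven purely by the learned Q-function (sketch):
    because the training signal comes from the critic rather than from
    realized returns, the states may stem from any behavior policy
    stored in a replay buffer (off-policy)."""
    # Actions proposed by the current (deterministic) policy.
    proposed_actions = actor(replay_states)

    # Maximize the critic's value estimate for those actions; the gradient
    # flows through the critic into the actor parameters.
    actor_loss = -critic(replay_states, proposed_actions).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```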
The choice between on-policy and off-policy algo-
rithms primarily affects its sample efficiency and the
bias-variance tradeoff of the network updates (Fakoor
et al., 2020). On-policy algorithms need to consis-
tently generate new data using their current policy,
making them sample inefficient. Accordingly, they are less suited for environments with high time complexity. Off-policy algorithms can operate on
problems with arbitrarily generated experience, even
without direct access to the environment.
Off-policy algorithms tend to be harder to tune
than on-policy alternatives because of the significant
bias from old data and value-learner initializations
(Fujimoto et al., 2018).
3.2 Stochastic & Deterministic Policies
RL algorithms are designed to learn stochastic or de-
terministic policies. While most stochastic methods
can be evaluated deterministically by selecting the ac-
tions with the highest probabilities, there is no sophis-
ticated way of converting deterministic solutions to
stochastic variants other than applying ε-greedy ac-
tions or temperature-controlled softmax.
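For completeness, the two conversion mechanisms mentioned above can be sketched as follows, assuming a discrete action space and access to the deterministic solution's Q-values; the function names are illustrative:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise the greedy (Q-value-maximizing) one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0, rng=None):
    """Temperature-controlled softmax: high temperatures approach a
    uniform distribution, low temperatures approach the greedy choice."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```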
It could be argued that deterministically choosing the optimal action should outperform stochastic policies, as the latter tend to deviate from the optimal policy. In practice, this does not hold; for example, SAC outperforms DDPG in peak performance (Haarnoja et al., 2018a). Moreover, there are problems with imperfect information that can only be reliably solved by stochastic policies, as demonstrated by Silver (2015).
Stochastic policies have advantages over their de-
terministic alternatives in multi-player games, as op-
ponents can quickly adapt to deterministic playstyles.
A simple example of such a game is rock-paper-
scissors (Silver, 2015).
3.3 Exploration-Exploitation Tradeoff
The exploration-exploitation tradeoff is a dilemma
that not only occurs in RL but also in many decisions
in real life. When selecting a restaurant for dinner,
one can revisit a favorite restaurant or try a new one.