for example, from a human player. All data points can even be reused for multiple training updates, as they never become outdated with respect to the state-action-value definition.
The optimization objectives of pure policy opti-
mization methods are usually defined as maximizing
the expected future return when following the cur-
rent policy. Because of this, most pure policy opti-
mization algorithms are naturally on-policy methods.
During training, states and actions are sampled from
the current policy, and probabilities or actions are ad-
justed according to their realized or expected future
return. Without further adjustments, it is crucial that state samples are drawn from the stationary state distribution of the current policy. While it is technically possible to collect policy-independent experience from the environment and use the policy's log probabilities of the chosen actions for policy updates, this objective would maximize the expected value under the wrong state distribution. The resulting policy might visit different states when acting on its own and would then achieve far-from-optimal expected future returns under its own state distribution.
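For reference, the on-policy objective and its gradient can be written as follows; the symbols d^{π_θ} for the stationary state distribution and Q^{π_θ} for the state-action value are introduced here purely for notation and are not taken verbatim from the text:

\[
J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[ Q^{\pi_\theta}(s, a) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big].
\]

Replacing d^{π_θ} with the state distribution of a different behavior policy changes the objective that is actually being maximized, which is exactly the mismatch described above.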
One way to overcome this limitation and train a
policy optimization algorithm using data that was col-
lected from a different state distribution is importance
sampling (Sutton and Barto, 2018). The importance
sampling theorem states that the expected value of
any deterministic function f when drawing from a
distribution π_θ is equal to the expected value of f
multiplied by the ratio of the two probabilities when
drawing from a distribution D.
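Written out, with x denoting the sampled quantity (introduced here purely for notation), the identity reads

\[
\mathbb{E}_{x \sim \pi_\theta}\big[ f(x) \big]
= \mathbb{E}_{x \sim D}\!\left[ \frac{\pi_\theta(x)}{D(x)}\, f(x) \right],
\]

provided that D(x) > 0 wherever π_θ(x) f(x) is nonzero.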
In the RL setting, the expectations of the policy optimization objectives cannot be computed exactly and are instead approximated using a batch of sam-
ples. By applying the importance sampling equa-
tion, the expectation can be approximated with sam-
ples from an arbitrary, known state distribution. As
per the central limit theorem, the approximations with
both methods follow a normal distribution around the
actual expected value. Still, they can have signifi-
cant differences in variance if the two distributions
are very different (Sutton and Barto, 2018). Because of this unreliability, to our knowledge, there are no pure policy optimization algorithms that operate fully off-policy. Instead, some
Monte Carlo RL variants use importance sampling to
reduce simulation and calculation overhead for en-
vironments with long trajectories. For those cases,
simulating the realized rewards for just a single net-
work update is inefficient. Using importance sam-
pling, the latest trajectories can be reused for multiple
network updates if the policy does not change drasti-
cally within a few updates.
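The following is a minimal sketch of this reuse scheme in Python/PyTorch, assuming a discrete-action policy network; the function name, the use of realized Monte Carlo returns, and the stored old log probabilities are illustrative assumptions rather than details from the text:

```python
import torch

def reuse_trajectories(policy, optimizer, states, actions, returns,
                       old_log_probs, num_updates=4):
    """Importance-sampled Monte Carlo policy gradient (sketch): reuse one
    batch of trajectories collected under the old policy for several
    network updates by re-weighting with the probability ratio."""
    for _ in range(num_updates):
        # Log probabilities of the stored actions under the *current* policy.
        logits = policy(states)                              # (N, num_actions)
        log_probs = torch.log_softmax(logits, dim=-1)
        new_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

        # Importance weight pi_theta(a|s) / pi_old(a|s); only reliable while
        # the policy stays close to the one that generated the data.
        ratio = torch.exp(new_log_probs - old_log_probs)

        # Re-weighted surrogate objective: maximizing it follows the
        # importance-sampled policy gradient, so we minimize its negative.
        loss = -(ratio * returns).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In practice, algorithms such as PPO additionally clip or otherwise constrain this ratio to keep the variance of the estimator in check when the policy drifts away from the data-collecting policy.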
Actor-critic algorithms can be designed to oper-
ate on-policy or off-policy. The on-policy actor-critic
variants use a policy optimization objective similar
to pure policy optimization algorithms (Sutton and
Barto, 2018). In this case, the policy is improved
by altering the log probabilities of actions according
to their learned Q-values or advantage-values. Off-
policy actor-critic variants are possible by training the
policy to reflect the action distributions implicitly de-
fined by the learned Q-values. This is a valid objec-
tive, as the policy is trained to choose the Q-value-
maximizing action in every state.
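As a concrete illustration of the off-policy variant, the following is a minimal DDPG-style actor update sketch in Python/PyTorch; the actor and critic modules, the replay states, and the function name are assumptions made here for illustration:

```python
import torch

def off_policy_actor_update(actor, critic, actor_optimizer, replay_states):
    """Actor update driven purely by the learned Q-function (sketch):
    because the training signal comes from the critic rather than from
    realized returns, the states may stem from any behavior policy
    stored in a replay buffer (off-policy)."""
    # Actions proposed by the current (deterministic) policy.
    proposed_actions = actor(replay_states)

    # Maximize the critic's value estimate for those actions; the gradient
    # flows through the critic into the actor parameters.
    actor_loss = -critic(replay_states, proposed_actions).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```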
The choice between on-policy and off-policy algo-
rithms primarily affects its sample efficiency and the
bias-variance tradeoff of the network updates (Fakoor
et al., 2020). On-policy algorithms need to consis-
tently generate new data using their current policy,
making them sample inefficient. Accordingly, they are less suited for environments with high time complexity. Off-policy algorithms can operate on
problems with arbitrarily generated experience, even
without direct access to the environment.
Off-policy algorithms tend to be harder to tune
than on-policy alternatives because of the significant
bias from old data and value-learner initializations
(Fujimoto et al., 2018).
3.2 Stochastic & Deterministic Policies
RL algorithms are designed to learn stochastic or de-
terministic policies. While most stochastic methods
can be evaluated deterministically by selecting the ac-
tions with the highest probabilities, there is no sophis-
ticated way of converting deterministic solutions to
stochastic variants other than applying ε-greedy ac-
tions or temperature-controlled softmax.
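For completeness, the two conversion mechanisms mentioned above can be sketched as follows, assuming a discrete action space and access to the deterministic solution's Q-values; the function names are illustrative:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise the greedy (Q-value-maximizing) one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0, rng=None):
    """Temperature-controlled softmax: high temperatures approach a
    uniform distribution, low temperatures approach the greedy choice."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```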
It could be argued that deterministically choosing the optimal action should outperform stochastic policies, as the latter tend to deviate from the optimal policy. In practice, this does not hold; for example, SAC outperforms DDPG in peak performance (Haarnoja et al., 2018a). Moreover, there are problems with imperfect information that can only be reliably solved by stochastic policies, as demonstrated by Silver (2015).
Stochastic policies have advantages over their de-
terministic alternatives in multi-player games, as op-
ponents can quickly adapt to deterministic playstyles.
A simple example of such a game is rock-paper-
scissors (Silver, 2015).
3.3 Exploration-Exploitation Tradeoff
The exploration-exploitation tradeoff is a dilemma
that not only occurs in RL but also in many decisions
in real life. When selecting a restaurant for dinner,
one can revisit a favorite restaurant or try a new one.