exploration and exploitation required for successful
control (Surriani et al., 2021).
Therefore, this paper aims to analyze and compare
the performance of QARSA, Q-learning, and SARSA
on the cart-pole problem. This comparison will
provide valuable insights into the relative strengths
and limitations of each algorithm. Additionally, this
paper analyzes the learning curves, final outcomes,
hyperparameter sensitivity, and the impact of various
function approximation techniques. Through this
comprehensive analysis, the strengths and
weaknesses of the new QARSA algorithm are
highlighted, offering insight into its
practical applicability and scalability in
reinforcement learning for control tasks.
2 BACKGROUND WORK
Two of the most fundamental model-free methods in
RL are Q-Learning and SARSA. Both fall under the
category of Temporal-Difference (TD) learning, and
update a value function based on the agent's
experience without requiring a model of the
environment.
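For concreteness, the core TD idea behind both algorithms can be illustrated with a tabular TD(0) state-value update (a minimal sketch for exposition, not code taken from any of the works surveyed here):

```python
# Minimal tabular TD(0) state-value update (illustrative sketch only).
# V: dict mapping states to value estimates; alpha: step size; gamma: discount factor.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # The TD target bootstraps from the current estimate of the next state's value,
    # so no model of the environment's dynamics is needed.
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```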
Sutton & Barto (2018) highlighted that SARSA's
conservative update mechanism can yield superior
performance in environments where cautious
exploration strategies are beneficial. Double Q-
learning (Van Hasselt et al., 2016) and Weighted Q-
learning (Cini et al., 2020) address the overestimation
bias inherent in traditional Q-learning, leading to
enhanced stability and improved outcomes,
particularly in environments characterized by high
complexity and variability in action-value functions.
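The double-estimator idea behind these variants can be sketched in tabular form as follows (variable names and hyperparameters are our own illustrative choices, not the cited implementations): two action-value tables are updated in alternation, one selecting the greedy action and the other evaluating it.

```python
import random
from collections import defaultdict

# Sketch of a tabular Double Q-learning update (illustrative only).
QA = defaultdict(float)  # first action-value table
QB = defaultdict(float)  # second action-value table

def double_q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    if random.random() < 0.5:
        # Select the greedy next action with QA, but evaluate it with QB,
        # which reduces the overestimation caused by the max over noisy estimates.
        a_star = max(actions, key=lambda x: QA[(s_next, x)])
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] - QA[(s, a)])
    else:
        # Symmetric update with the roles of the two tables swapped.
        b_star = max(actions, key=lambda x: QB[(s_next, x)])
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, b_star)] - QB[(s, a)])
```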
Several comparative analyses, such as Tokic
(2010), have investigated how RL algorithm
performance is influenced by exploration methods
(e.g., ε-greedy, softmax), reward shaping, and
discretization strategies. These elements are particularly critical
in continuous-state environments like the cart-pole
task, which require discretization for tabular RL
methods.
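As an example of the kind of binning these studies rely on, the four-dimensional cart-pole observation can be mapped to a discrete state index as sketched below; the bin counts and clipping bounds are illustrative assumptions, not values taken from any cited work.

```python
import numpy as np

# Illustrative discretization of the cart-pole observation
# (cart position, cart velocity, pole angle, pole angular velocity).
BINS = np.array([6, 6, 12, 12])             # assumed bin counts per dimension
LOW  = np.array([-2.4, -3.0, -0.21, -3.0])  # assumed clipping bounds
HIGH = np.array([ 2.4,  3.0,  0.21,  3.0])

def discretize(obs):
    # Clip the continuous observation, scale each dimension to [0, 1],
    # and map it to a bin index; the resulting tuple can key a tabular Q-function.
    ratios = (np.clip(obs, LOW, HIGH) - LOW) / (HIGH - LOW)
    return tuple((ratios * (BINS - 1)).astype(int))
```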
Nagendra et al. (2017) and Zhong (2024)
conducted comparisons of various RL algorithms,
emphasizing sample efficiency and stability.
However, a thorough comparative analysis that
explicitly evaluates Q-learning and SARSA on
cumulative reward per episode in this environment
is still lacking. Mothanna & Hewahi (2022)
demonstrated the applicability of Q-learning and
SARSA in solving the cart-pole problem, proposing
future work to extend the analysis to additional RL
algorithms.
Despite these developments, limited direct
comparative work specifically addresses the learning
efficiency, stability, and ultimate performance of Q-
learning and SARSA in the cart-pole environment.
Most existing comparative studies have either
examined more complex tasks or employed different
evaluation metrics. Those works that focus on the
cart-pole problem utilize the OpenAI Gym simulation
framework and include metrics such as average
reward, stability, sample efficiency, and overall
effectiveness (Brockman et al., 2016).
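A minimal evaluation loop of the kind used to compute such average-reward metrics is sketched below using the Gymnasium fork of the Gym API (the cited studies used the original OpenAI Gym, whose interface differs slightly); the policy argument and episode count are placeholders.

```python
import gymnasium as gym  # maintained fork of OpenAI Gym with the same CartPole-v1 environment

def average_return(policy, episodes=100):
    """Roll out a given policy on CartPole-v1 and report the mean episode return."""
    env = gym.make("CartPole-v1")
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)  # placeholder: any callable mapping an observation to an action
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total += reward
        returns.append(total)
    env.close()
    return sum(returns) / len(returns)
```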
In (Hazza et al., 2025), the performance of three
reinforcement learning algorithms was
compared under a sensitivity analysis in which all
hyperparameter values were systematically varied to
observe their impact on the learning process. While
Q-learning exhibited marginally higher average and
cumulative rewards, the differences among
Q-learning, SARSA, and Double Q-learning were not
substantial across the tested range, indicating that no
single method decisively outperforms the others.
Q-Learning and SARSA represent two
fundamentally different approaches to value-based
learning (Kommey et al., 2024). Q-Learning is an
off-policy algorithm that learns the value of the
optimal policy independently of the agent's actions.
While effective in many deterministic environments,
its reliance on the maximum Q-value in the update
step can lead to overestimation bias and instability in
noisy or stochastic environments. In contrast, SARSA
is an on-policy method that learns the value of the
policy the agent is actually following. It tends to be
more stable and risk-averse, especially in
unpredictable environments, but often converges
more slowly and may be overly conservative (Wang
et al., 2013).
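To make the off-policy/on-policy distinction concrete, the two tabular updates can be sketched side by side (a minimal illustration; the table Q and the hyperparameter values are placeholders, not the settings used later in this paper):

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular action-value function keyed by (state, action)

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (maximum) action value in the next state,
    # regardless of which action the behaviour policy actually takes next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the agent actually selects next (a_next),
    # so the cost of exploratory actions is reflected in the learned values.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```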
With this in mind, this work proposes a novel
hybrid approach, designed to merge the strengths of
on-policy and off-policy reinforcement learning
methods, and tested by simulation on the cart-pole
problem, as detailed in the rest of the paper.
2.1 Q-Learning Algorithm
Q-learning, introduced by Watkins in 1989
(Watkins & Dayan, 1992), is one of the most widely
used RL algorithms, renowned for its simplicity and
effectiveness. It is an off-policy algorithm that
estimates the optimal action-value function Q(s, a)
by learning from the maximum estimated future
reward, regardless of the agent's current policy.
Its update rule is defined by Equation (1):
Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]    (1)
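A minimal tabular training loop implementing Equation (1) on the cart-pole task is sketched below; the ε-greedy behaviour policy, state discretizer, and hyperparameter values are illustrative assumptions rather than the settings used in this paper.

```python
import random
import numpy as np
import gymnasium as gym  # maintained fork of OpenAI Gym; exposes the same CartPole-v1 environment
from collections import defaultdict

ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.99, 0.1, 500  # assumed hyperparameters
BINS = np.array([6, 6, 12, 12])                        # assumed binning, as sketched in Section 2
LOW  = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([ 2.4,  3.0,  0.21,  3.0])

def discretize(obs):
    # Map the continuous observation to a tuple of bin indices (tabular state key).
    ratios = (np.clip(obs, LOW, HIGH) - LOW) / (HIGH - LOW)
    return tuple((ratios * (BINS - 1)).astype(int))

Q = defaultdict(float)
env = gym.make("CartPole-v1")
n_actions = env.action_space.n

for _ in range(EPISODES):
    obs, _ = env.reset()
    s, done = discretize(obs), False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < EPSILON:
            a = env.action_space.sample()
        else:
            a = max(range(n_actions), key=lambda x: Q[(s, x)])
        obs, r, terminated, truncated, _ = env.step(a)
        s_next = discretize(obs)
        # Equation (1): the off-policy target uses the maximum action value in the next state.
        target = r + GAMMA * max(Q[(s_next, x)] for x in range(n_actions))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s, done = s_next, terminated or truncated
env.close()
```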