the exploration probability ε gradually decreases as
the training progresses. After performing the action,
calculate the reward based on the state of the bird,
update the Q-values using the Q-learning formula,
and adjust the network weights through
backpropagation.
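As a minimal sketch of this loop, assuming a small PyTorch Q-network over a handcrafted state vector with two actions (do nothing or flap); the layer sizes, exploration schedule, and hyperparameters are illustrative rather than the exact settings used in the papers discussed:

```python
import random
import torch
import torch.nn as nn

# Illustrative 2-action Q-network for Flappy Bird (0 = do nothing, 1 = flap).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma = 0.99
eps, eps_min, eps_decay = 1.0, 0.01, 0.995  # epsilon decays as training progresses

def select_action(state):
    """Epsilon-greedy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.randrange(2)
    with torch.no_grad():
        return int(q_net(state).argmax())

def q_update(state, action, reward, next_state, done):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    q_sa = q_net(state)[action]
    with torch.no_grad():
        max_next = q_net(next_state).max()
        target = torch.tensor(float(reward)) + gamma * max_next * (1.0 - float(done))
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()   # adjust the network weights through backpropagation
    optimizer.step()

# After each environment step: eps = max(eps_min, eps * eps_decay)
```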
2.2.3 T-Rex Runner Game
Environment Setup: Preprocess the game screen image (e.g., convert RGB to grayscale, remove irrelevant objects, and apply erosion and dilation operations) and stack multiple frames of historical images as the input. Q-learning
Implementation: Initialize the weights of the policy
network to random values and set relevant parameters.
At the same time, create the target network and initialize it identically to the policy network. In the training loop, select actions according to the ε-greedy strategy.
After performing the action, store the resulting transition in the experience replay pool. Sample batches from the replay pool, use the target network to calculate the expected return and optimize the weights of the policy network, and update the target network after a certain number of steps. For example, in Yue Zheng's paper (Zheng, 2019), the original image undergoes a series of preprocessing operations, including RGB-to-grayscale conversion, removal of irrelevant objects, and erosion and dilation to reduce noise, and is finally resized to 84×84. To capture the movement
of the dinosaur, the last four frames of historical
images undergo the same preprocessing and are
stacked into a data point as the input, as sketched in the code below. Overall, in
these three games, Q-learning continuously optimizes the agent's decision-making strategy by relying on the reward signal and the learned state-action values. The integration of deep learning
techniques, especially DQN, provides strong support
for handling complex game inputs and decision
spaces.
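As a minimal sketch of the preprocessing and frame stacking described above for T-Rex Runner (assuming OpenCV and NumPy; the kernel size, the padding at the start of an episode, and the omission of an explicit object-removal step are simplifying assumptions rather than the exact pipeline of Zheng, 2019):

```python
import collections
import cv2
import numpy as np

FRAME_STACK = 4                      # number of stacked historical frames
frames = collections.deque(maxlen=FRAME_STACK)

def preprocess(rgb_frame):
    """Grayscale -> erosion/dilation to reduce noise -> resize to 84x84."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    kernel = np.ones((3, 3), np.uint8)          # kernel size is an assumption
    gray = cv2.erode(gray, kernel, iterations=1)
    gray = cv2.dilate(gray, kernel, iterations=1)
    return cv2.resize(gray, (84, 84))

def stack_frames(rgb_frame):
    """Append the newest preprocessed frame and return an 84x84x4 input."""
    frames.append(preprocess(rgb_frame))
    while len(frames) < FRAME_STACK:            # pad with copies at episode start
        frames.append(frames[-1])
    return np.stack(frames, axis=-1)            # shape (84, 84, 4)

# Note: call frames.clear() at the start of each new episode.
```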
2.3 Problems and Optimization
Strategies
2.3.1 Experience Replay
Experience replay is a technique used in
reinforcement learning to improve training efficiency
and stability. In traditional Q-learning, the experiences of consecutive frames are often highly correlated, which hinders training and makes it inefficient. The purpose of experience replay is to solve this problem by decorrelating these experiences.
Specifically, in experience replay, the experience (s,
a, r, s') is stored in the replay memory at each frame.
The replay memory has a fixed size and stores the most recent experiences. It is updated continuously like a queue, so that the experiences in the memory remain related to the recent actions and the corresponding Q functions. When it is necessary to update the Deep
Q Network (DQN), instead of using the current
consecutive experiences, a batch of experiences is
uniformly sampled from the replay memory. The
advantage of this is that the sampled experiences are
no longer highly correlated, making the training more
stable and efficient. Through experience replay, the
model can better learn from diverse historical
experiences, avoid being misled by recent consecutive correlated experiences, and converge to a better policy faster.
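A minimal sketch of such a replay memory, using only the Python standard library (the capacity and batch size are illustrative values):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size queue of (s, a, r, s', done) transitions; old entries drop out."""
    def __init__(self, capacity=50000):         # capacity is an assumed value
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Uniformly sample a decorrelated batch for the DQN update."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Training-loop usage: only update once enough transitions have accumulated.
# if len(memory) >= 32:
#     batch = memory.sample(32)
#     ... compute targets and take one gradient step on the DQN ...
```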
For example, in the paper by Kevin Chen (Chen, 2015), during the training of Flappy Bird he found that consecutive experiences in traditional Q-learning were strongly correlated, which biased the training results. He therefore applied experience replay when updating the DQN and eventually obtained training samples with only weak correlation.
2.3.2 Target Network
The target network is a mechanism introduced in reinforcement learning, especially when using algorithms such as Q-learning, to increase training stability. During the training process of Q-learning, a neural network (such as the Deep Q-Network, DQN) is usually used to approximate the Q
function. When updating the Q value, it is necessary
to calculate the target value $y_i = \mathbb{E}_{s' \sim \mathcal{E}}[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \,]$, which involves evaluating all possible actions a′ in the next state s′ with the Q function and selecting the maximum value.
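In code, this target can be computed with a periodically updated copy of the Q-network supplying the older parameters $\theta_{i-1}$ (a minimal PyTorch-style sketch; the network shape, the sync period C, and the function names are illustrative):

```python
import copy
import torch
import torch.nn as nn

# Illustrative policy (Q) network; the target network is a structurally
# identical copy whose parameters stay fixed between synchronizations.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(policy_net)
target_net.eval()
gamma, C = 0.99, 1000   # C = policy updates between target syncs (assumed value)

def td_targets(rewards, next_states, dones):
    """y_i = r + gamma * max_a' Q_target(s', a'); frozen parameters keep y_i stable."""
    with torch.no_grad():
        max_q_next = target_net(next_states).max(dim=1).values
    return rewards + gamma * max_q_next * (1.0 - dones)

def maybe_sync(update_step):
    """Every C updates of the policy network, copy its weights into the target network."""
    if update_step % C == 0:
        target_net.load_state_dict(policy_net.state_dict())
```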
However, if the same network is directly used to
calculate both the current Q value and the target value
simultaneously, it may lead to training instability
because the parameters of the network are constantly
updated and the target value will also fluctuate,
making training difficult to converge. The target network, denoted $\hat{Q}(s, a)$, is introduced to tackle this problem. It has the same structure as the Q-network, but its parameters may differ. During the training process, the
parameters of the target network are updated to
synchronize with the parameters of the current DQN
only after the DQN is updated a certain number of
times (such as C times). The parameter update
approach of the target network is as follows: During
the training process, when the DQN is updated every