An ML Agent using the Policy Gradient Method to win a SoccerTwos Game

Victor Ulisses Pugliese
Federal University of São Paulo, Avenida Cesare Mansueto Giulio Lattes, 1201, São José dos Campos, Brazil
https://orcid.org/0000-0001-8033-6679

Keywords: Reinforcement Learning, Proximal Policy Optimization, Curriculum Learning, Video Games.

Abstract: We conducted an investigative study of Policy Gradient methods combined with Curriculum Learning applied to video games, motivated by a customized SoccerTwos environment that professors at the Federal University of Goiás created to evaluate the Machine Learning agents of students in a Reinforcement Learning course. We employed PPO and SAC as state-of-the-art methods in the on-policy and off-policy settings, respectively. Curriculum Learning can further improve performance, following the observation that it is easier to teach people complex material in a gradual order than in a random one. By combining these techniques, we expected our agents to win more matches than their adversaries. We measured the results by the minimum, maximum, and mean rewards and by the mean episode length at checkpoints. PPO achieved the best result with Curriculum Learning, modifying the players' settings (position and rotation) and the ball's settings (speed and position) at fixed time intervals, while also using fewer training hours than the other experiments.

1 INTRODUCTION

Artificial Intelligence (AI) plays an essential role in video games to generate responsive, adaptive, or intelligent behavior, mainly in non-player characters (NPCs), similar to human intelligence (Ranjitha et al., 2020). Thus, it keeps players engaged even when playing offline or when no players are available online.

Furthermore, several games provide interesting

and complex problems for Machine Learning (ML)

agents to solve, and gaming environments are secure,

controllable, and offer unlimited valuable data for the

algorithms. These characteristics make video games

a perfect domain for AI research (Shao et al., 2019).

Therefore, professors from the Artificial Intelligence Center of Excellence (Centro de Excelência de Inteligência Artificial - CEIA) at the Federal University of Goiás (Universidade Federal de Goiás - UFG) created a customized version of the SoccerTwos game and provided two baseline ML agents. The baseline agents were used to evaluate the students' agents in a Reinforcement Learning (RL) course.

The game simulates two soccer teams playing against each other and counts which team scores more goals within a specified time. Our goal was to identify which approach was the best recommendation to win the matches.


Thus, we proposed two Policy Gradient methods, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), because they are state-of-the-art in the on-policy and off-policy settings, respectively. We also employed them with Curriculum Learning (CL), a learning strategy built on the observation that humans and animals learn better when material is presented in a gradually increasing order of complexity rather than randomly (Bengio et al., 2009).

To contextualize our work, we surveyed related works. Then, we performed an evaluation comparing the methods with the baseline agents of CEIA/UFG. Finally, we present the main findings and conclude the paper.

2 BACKGROUND

2.1 Reinforcement Learning

Reinforcement learning (RL) is a subfield of machine learning (ML) that addresses the problem of automatically learning optimal decisions over time. It uses well-established supervised learning methods, such as deep neural networks for function approximation, stochastic gradient descent, and backpropagation, but applies them differently (Lapan, 2018): there is no supervisor, only a reward signal, and feedback is delayed rather than instantaneous. Therefore, an ML agent


using these methods tackles a problem by learning its behavior through trial-and-error interactions with a dynamic environment (Kaelbling et al., 1996), communicating with it through actions and states. Sutton and Barto (Sutton and Barto, 2018) model the reinforcement learning cycle as shown in Figure 1.

Figure 1: The agent–environment interaction in reinforcement learning (Sutton and Barto, 2018).
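To make this interaction cycle concrete, the following is a minimal, self-contained sketch of the agent-environment loop (our own illustration with a toy environment and a random policy, not code from the paper or from the SoccerTwos build):

    import random

    class ToyEnv:
        """Trivial stand-in environment: walk on a line; reaching +5 yields reward +1 (illustrative only)."""
        def reset(self):
            self.state = 0
            return self.state
        def step(self, action):
            self.state += action                      # action is -1 or +1
            done = abs(self.state) >= 5
            reward = 1.0 if self.state >= 5 else (-1.0 if done else 0.0)
            return self.state, reward, done

    env = ToyEnv()
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = random.choice([-1, 1])               # the agent's (here random) policy picks an action
        state, reward, done = env.step(action)        # the environment answers with next state and reward
        total_reward += reward                        # the return the agent tries to maximize

An RL algorithm such as PPO or SAC replaces the random choice with a learned policy and uses the observed rewards to improve it.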

2.2 Policy Gradient Method

Policy Gradient methods are a family of reinforcement learning techniques that optimize parameterized policies with respect to the expected return (long-term cumulative reward) by gradient-based updates (Huang et al., 2020). We selected two such methods for this work: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).
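As a point of reference for the two methods discussed below, the standard policy gradient estimator from the literature (not reproduced in the original text) can be written as

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right],

where \pi_\theta is the parameterized policy and \hat{A}_t is an estimate of the advantage of action a_t in state s_t. PPO builds directly on this estimator, while SAC optimizes an entropy-regularized variant of the expected return.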

2.2.1 PPO

Proximal Policy Optimization trains a stochastic policy in an on-policy way, which means that it explores by sampling actions according to the latest version of its stochastic policy. It can be implemented for either discrete or continuous action spaces (Sáenz Imbacuán, 2021) and (Achiam, 2018).

Furthermore, the method uses an actor-critic architecture: the actor maps an observation to an action, while the critic estimates the value of that observation. For each epoch, PPO collects a set of trajectories by sampling from the latest version of the stochastic policy. It then computes the rewards-to-go and the advantage estimates to update the policy and fit the value function. The policy is updated via a stochastic gradient ascent optimizer, while the value function is fitted via some gradient descent algorithm (Keras, 2022).
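For reference, the clipped surrogate objective that PPO maximizes (the standard formulation from the PPO literature, not reproduced in the original text) is

    L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \epsilon is the clip parameter (the Ray example cited later in this paper sets it to 0.2). Clipping the probability ratio keeps each policy update close to the policy that collected the data.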

The amount of randomness in action selection depends on the initial conditions and the training procedure. The policy typically becomes progressively less random throughout training, as the update rule encourages it to exploit rewards it has already found (Sáenz Imbacuán, 2021).

2.2.2 SAC

Soft Actor-Critic optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. It was initially designed for environments with continuous action spaces, but an alternative version for discrete ones already exists (Achiam, 2018).

The method is based on the maximum entropy RL framework: the actor aims to maximize the expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible. This connects to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on (Achiam, 2018) and (Haarnoja et al., 2018).

Unlike previous deep RL methods based on this framework, which were formulated as Q-learning methods, SAC works similarly to TD3, incorporating the clipped double-Q trick; due to the inherent stochasticity of its policy, it also benefits from something like target policy smoothing. Therefore, it outperforms prior on-policy and off-policy methods on continuous control benchmarks (Achiam, 2018) and (Haarnoja et al., 2018).
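For reference, the entropy-regularized objective that SAC maximizes (the standard formulation from the SAC literature, not reproduced in the original text) is

    J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],

where \mathcal{H} denotes the entropy of the policy at state s_t and the temperature \alpha controls the trade-off between reward and exploration.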

2.3 Curriculum Learning

We combined those methods with a training strategy, Curriculum Learning, which is based on the observation that humans and animals learn better when material is presented in a gradually increasing order of complexity rather than randomly (Bengio et al., 2009).

An easy way to illustrate this strategy is to think about how math students learn arithmetic, algebra, and calculus in the education system. Teachers usually teach arithmetic before algebra, and algebra before calculus; the skills and knowledge learned in earlier disciplines support the later lessons. We can apply the same principle in machine learning, where training the ML agents on the most straightforward tasks first provides scaffolding for future challenging tasks (Camargo and Sáenz, 2021).

3 RELATED WORKS

We searched for the term ’SoccerTwos’ on Google

Scholar and found eight academic papers related to it.

However, only six papers are about ML agents using

Reinforcement Learning.

Sáenz wrote a master's thesis about the impact of Curriculum Learning on the training process for an intelligent agent in a video game, using SoccerTwos as the case study and the SAC and PPO algorithms.


To measure the performance, he used the mean cumulative reward. In some cases, this approach shortened the training process by 40% and achieved better measures than the plain algorithms; however, in other cases it was worse or had no effect. PPO showed better results than SAC (Sáenz Imbacuán, 2021). Sáenz and Camargo published a paper in 2021 (Camargo and Sáenz, 2021), reporting part of this thesis using PPO.

Majumder also reported a significant improvement in training when Curriculum Learning is applied along with a Policy Gradient variant such as PPO. The incremental steps allow the agent to learn quickly in a new dynamic environment. Therefore, the author recommended it for competitive or collaborative contexts such as SoccerTwos (Majumder, 2021).

Juliani et al. implemented a solution in a randomly generated multiagent setting using the PPO method in the 'SoccerTwos' environment. They trained the agents in a two-versus-two self-play mode. The agents learned to reposition themselves defensively or offensively and to work cooperatively to score against an opponent without conceding a goal (Juliani et al., 2018).

Osipov and Petrosian applied a modern multiagent reinforcement learning algorithm, built with the TensorFlow library, explicitly created for SoccerTwos. They investigated different modeling tools and ran computational experiments to find the best training hyperparameters. Furthermore, they applied the COMA policy gradient algorithm and showed its effectiveness (Osipov and Petrosian, ). Unfortunately, the authors wrote the paper in Russian, and we could not translate it.

Albuainain and Gatzoulis proposed an ML agent using reinforcement learning to adapt to dynamic physics-based environments in a 2D version of a vehicular football game. The agents perform behaviors such as defending their goal and attacking the ball, driven by reward functions. The authors concluded that a reward function considering different state-space parameters can produce better-performing agents than those with a less defined reward function and state space (Albuainain and Gatzoulis, 2020).

4 EVALUATION OF THE METHODS USING ML AGENTS

We implemented our ML agents to play the SoccerTwos game customized by CEIA/UFG. The game is available on GitHub.

4.1 Explaining the Environment

The original SoccerTwos environment contains four players competing in a two vs. two toy soccer game, aiming to get the ball into the opponent's goal while preventing it from entering their own goal. The players have the same behavior parameters. The observation space consists of 336 dimensions, corresponding to 11 forward ray-casts distributed over 120 degrees and 3 backward ray-casts distributed over 90 degrees, each detecting 6 possible object types along with the object's distance. The forward ray-casts contribute 264 state dimensions and the backward ray-casts 72 state dimensions. The action space consists of 3 discrete branched actions (MultiDiscrete) corresponding to forward/backward movement, sideways movement, and rotation, for a total of 27 discrete action combinations (Tyagi, 2021).

Figure 2: Observation and action states of the SoccerTwos Game.
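As a consistency check on these counts (our own arithmetic, not stated in the environment documentation): 264 / 11 = 24 and 72 / 3 = 24, so each ray-cast apparently contributes 24 observation values, giving 11 × 24 + 3 × 24 = 336 dimensions in total; likewise, 3 action branches with (presumably) 3 options each give 3^3 = 27 joint actions.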

The customized game consists of a single 2-minute period, two ML agents that play against each other (representing two teams), and a ball. Each team has two players, one on the left and one on the right. Both teams start at pre-defined positions close to the middle of the field, as seen in Figure 3. The reward function consists of two items (Oliveira, 2021), as sketched in code after the list:
• +1 minus the accumulated time penalty: when the ball enters the opponent's goal. With each fixed update, the accrued time penalty is incremented by (1 / MaxSteps); it is reset to 0 at the beginning of an episode. In this build, MaxSteps is equal to 5000.
• -1: when the ball enters the team's own goal.
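A minimal sketch of how we read this reward specification (our own illustrative code; the class and method names are hypothetical and not part of the environment's API):

    MAX_STEPS = 5000  # value reported for this build

    class RewardTracker:
        def __init__(self):
            self.time_penalty = 0.0                 # reset to 0 at the start of each episode

        def on_fixed_update(self):
            self.time_penalty += 1.0 / MAX_STEPS    # each fixed update accrues 1 / MaxSteps

        def on_goal(self, scored_by_us: bool) -> float:
            # +1 minus the accrued time penalty when scoring; -1 when conceding.
            return (1.0 - self.time_penalty) if scored_by_us else -1.0

Under this reading, faster goals yield rewards closer to +1, which pushes the agents to score early in the episode.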

4.2 The ML Agents Provided by CEIA

In addition, the professors provided two baseline agents (CEIA DQN and CEIA PPO) to evaluate and test the performance of the students' experiments. Neither baseline agent uses Curriculum Learning.

Figure 3: SoccerTwos Game by CEIA/UFG.

CEIA DQN is an ML agent that uses the Deep Q-Network method, which combines a neural network with the classical reinforcement learning method Q-learning and the experience replay technique. The professors set optimized hyperparameters such as 0.999 for the epsilon decay, a 336x512x27 Q-network, and 5000 max steps.
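For illustration, a minimal PyTorch sketch of a Q-network with the reported 336x512x27 layout (our own reconstruction; the actual CEIA implementation may differ):

    import torch.nn as nn

    # 336 observation dimensions -> 512 hidden units -> 27 Q-values, one per discrete action.
    q_network = nn.Sequential(
        nn.Linear(336, 512),
        nn.ReLU(),
        nn.Linear(512, 27),
    )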

CEIA PPO is an ML agent that uses the PPO method with optimized hyperparameters such as 256x256 hidden layers and 5000 as the rollout fragment length, in a multiagent setting. The code is available on GitHub, and the trained ML agent is available through a separate link.

4.3 Design of Experiments

Using the Ray tools (v1.10.0) with PyTorch as the framework, we employed the Policy Gradient methods with Curriculum Learning. Ray aims to provide a simple universal API for distributed computing, supporting multiple libraries that solve machine learning problems, such as scalable hyperparameter tuning and industrial-grade reinforcement learning (Moritz et al., 2018).

Sáenz recommended hyperparameter ranges such as [0.00001; 0.001] for the learning rate, [128; 512] for the batch size, [32; 512] for the hidden units, and others for use with the PPO and SAC methods (Sáenz Imbacuán, 2021). There is also a Ray example (link) applied to SoccerTwos, which uses 0.0003 for the learning rate, 0.95 for lambda, 0.99 for gamma, 256 for the SGD minibatch size, 4000 for the train batch size, 0.2 for the clip parameter, 20 for the number of SGD iterations, two neural network layers of 512 units for PPO, and other settings.
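A minimal sketch of how these example values map onto an RLlib PPO configuration dictionary in Ray 1.x (our own illustration based on the numbers listed above; the actual configuration in the Ray example may differ):

    # Illustrative RLlib PPO configuration (Ray 1.x dict-style config).
    config = {
        "framework": "torch",
        "lr": 0.0003,                              # learning rate
        "lambda": 0.95,                            # GAE lambda
        "gamma": 0.99,                             # discount factor
        "sgd_minibatch_size": 256,
        "train_batch_size": 4000,
        "clip_param": 0.2,
        "num_sgd_iter": 20,
        "model": {"fcnet_hiddens": [512, 512]},    # two hidden layers of 512 units
    }
    # Such a dictionary would typically be passed to ray.rllib.agents.ppo.PPOTrainer
    # together with the registered (multi-agent) SoccerTwos environment.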

We employed the experiments listed below:
• The first experiment only employs the policy method for 24 hours, without opponent movement or Curriculum Learning. The method uses the recommended hyperparameters.
• The second experiment employs the policy methods with Curriculum A. It divides the 24 hours of training into 16 hours without opponent movement and 8 hours against a randomly moving opponent; the random movement is introduced in the middle of the 16 hours, resulting in three intervals of 8 hours. We also set new hyperparameter values.
• The last experiment employs the policy methods with Curriculum B. It defines difficulty levels (Very Easy, Easy, Medium, and Hard), modifying the players' settings (position and rotation) and the ball's settings (speed and position); a sketch of such a level schedule appears after this list.
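A minimal sketch of how such a difficulty schedule could be expressed (our own illustration; the level names come from the text above, but the parameter names and values are hypothetical and not taken from the actual environment configuration):

    # Hypothetical Curriculum B levels: noise scales and ball speed grow with difficulty.
    CURRICULUM_B = [
        {"level": "Very Easy", "player_pos_noise": 0.0, "player_rot_noise": 0.0, "ball_speed": 0.0, "ball_pos_noise": 0.0},
        {"level": "Easy",      "player_pos_noise": 0.5, "player_rot_noise": 0.0, "ball_speed": 0.5, "ball_pos_noise": 0.5},
        {"level": "Medium",    "player_pos_noise": 1.0, "player_rot_noise": 0.5, "ball_speed": 1.0, "ball_pos_noise": 1.0},
        {"level": "Hard",      "player_pos_noise": 1.0, "player_rot_noise": 1.0, "ball_speed": 1.5, "ball_pos_noise": 1.0},
    ]

    def level_for_step(step, steps_per_level=250_000):
        """Advance one curriculum level after a fixed number of training steps."""
        index = min(step // steps_per_level, len(CURRICULUM_B) - 1)
        return CURRICULUM_B[index]

In our experiments, the environment settings were switched over fixed training intervals in a similar spirit.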

4.4 Evaluation Measures

To measure the performance of the Policy Gradient methods employed in this study, we use the following metrics: mean length per episode, and the mean, maximum, and minimum reward (a small computation sketch follows the list).
• Mean length per episode refers to how many iterations the ML agent takes to complete a game, measured at a checkpoint.
• Mean reward is the average of the cumulative reward values at a checkpoint.
• Maximum and minimum reward are the largest and smallest cumulative reward values at a checkpoint.
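For clarity, a small sketch of how these checkpoint metrics can be computed from the per-episode returns and lengths collected since the last checkpoint (our own illustration; in practice, Ray reports equivalent quantities during training):

    import numpy as np

    def checkpoint_metrics(episode_rewards, episode_lengths):
        """Summarize the episodes gathered since the last checkpoint."""
        return {
            "reward_mean": float(np.mean(episode_rewards)),
            "reward_max": float(np.max(episode_rewards)),
            "reward_min": float(np.min(episode_rewards)),
            "episode_len_mean": float(np.mean(episode_lengths)),
        }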

5 RESULTS WITH THE ML AGENTS

This section presents the results of the Policy Gradient methods using Curriculum Learning for the SoccerTwos game.

We employed the 'PPO self-play' and 'PPO + Curriculum A' experiments. The 'self-play' experiment uses the Ray example hyperparameters, while we modified these settings for 'PPO + Curriculum A', removing train_batch_size, num_sgd_iter, rollout_fragment_length, no_done_at_end, evaluation_interval, and evaluation_num_episodes, and reducing the two neural network layers to 256 units. Figure 4 shows the results comparing their performance.

As seen in Figure 4, both experiments learned to score, and their values are similar. However, looking at the details, 'PPO + Curriculum A' (represented by the orange, red, and blue colors) converges first. It also ended with a better mean reward than PPO self-play.

Figure 4: Results of PPO self-play (pink) and PPO + Curriculum A (orange, red, and blue).

We also evaluated 'PPO + Curriculum A' against 'CEIA PPO', running the game 200 times. Our experiment won 125 matches, i.e., 62.5% of victories, as seen in Figure 5.

Figure 5: Using CEIA PPO to evaluate the performance of

PPO + Curriculum A.

We employed the 'PPO + Curriculum B' experiment, modifying the players' (position and rotation) and ball's (speed and position) settings. We compared it with 'PPO self-play', as seen in Figure 6.

As shown in Figure 6, 'PPO + Curriculum B' (blue) converges faster than PPO self-play (orange). It achieved a mean episode reward above 1.8 in just 250k iterations, which did not happen with self-play. Furthermore, the other measures are also better for it.

Figure 6: Comparing the performance of PPO + Curriculum

B versus PPO self-play.

6 CONCLUSIONS

This study investigated Policy Gradient methods using the Curriculum Learning strategy, applied to a SoccerTwos game customized by CEIA/UFG. We employed the PPO and SAC methods in this environment. Performance was measured using the minimum, maximum, and mean reward and the mean episode length.

We had to deal with different challenges, such as teaching the ML agent to move towards the ball, to kick towards the opponent's goal to earn a positive reward, and to defend our own goal from the opponent, among others. Therefore, we recommend that an ML agent learn these skills in a gradual order.

We obtained the best results in this game using 'PPO + Curriculum B', completing its training in just 2 hours. We also found a better recommended set of hyperparameters than Ray's example.

Unfortunately, despite the hyperparameters recommended by Sáenz (Sáenz Imbacuán, 2021) for these methods, we did not achieve convergence in the SAC experiments, as Ray's API returned an error message for some parameters, such as buffer_init_steps, init_entcoef, save_replay_buffer, and steps_per_update. Therefore, we did not report SAC results in this paper.

As future work, we will reproduce Sáenz's research (Sáenz Imbacuán, 2021) using our recommended hyperparameters. We also want to discover the recommended settings for SAC with the Ray API for this game and to continue evolving the Curriculum B strategy.


ACKNOWLEDGEMENTS

We would like to thank the CEIA/UFG professors for

providing the game environment and support in the

Reinforcement Learning course.

REFERENCES

Achiam, J. (2018). OpenAI Spinning Up. GitHub repository.

Albuainain, A. R. and Gatzoulis, C. (2020). Reinforcement

learning for physics-based competitive games. In

2020 International Conference on Innovation and In-

telligence for Informatics, Computing and Technolo-

gies (3ICT), pages 1–6. IEEE.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J.

(2009). Curriculum learning. In Proceedings of

the 26th annual international conference on machine

learning, pages 41–48.

Camargo, J. E. and Sáenz, R. (2021). Evaluating the impact of curriculum learning on the training process for an intelligent agent in a video game. Inteligencia Artificial, 24(68):1–20.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).

Soft actor-critic: Off-policy maximum entropy deep

reinforcement learning with a stochastic actor. In

International conference on machine learning, pages

1861–1870. PMLR.

Huang, R., Yu, T., Ding, Z., and Zhang, S. (2020). Policy

gradient. In Deep reinforcement learning, pages 161–

212. Springer.

Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J.,

Elion, C., Goy, C., Gao, Y., Henry, H., Mattar, M.,

et al. (2018). Unity: A general platform for intelligent

agents. arXiv preprint arXiv:1809.02627.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

Keras, F. (2022). PPO proximal policy optimization.

Lapan, M. (2018). Deep Reinforcement Learning Hands-

On: Apply modern RL methods, with deep Q-

networks, value iteration, policy gradients, TRPO, Al-

phaGo Zero and more. Packt Publishing Ltd.

Majumder, A. (2021). Competitive networks for ai agents.

In Deep Reinforcement Learning in Unity, pages 449–

511. Springer.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. (2018). Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577.

Oliveira, B. (2021). A pre-compiled soccer-twos reinforce-

ment learning environment with multi-agent gym-

compatible wrappers and human-friendly visualizers.

https://github.com/bryanoliveira/soccer-twos-env.

Osipov, A. and Petrosian, O. Application of the contract-structured gradient group learning algorithm for modeling conflict-controlled multi-agent systems.

Ranjitha, M., Nathan, K., and Joseph, L. (2020). Artificial intelligence algorithms and techniques in the computation of player-adaptive games. In Journal of Physics: Conference Series, volume 1427, page 012006. IOP Publishing.

Sáenz Imbacuán, R. (2021). Evaluating the impact of curriculum learning on the training process for an intelligent agent in a video game.

Shao, K., Tang, Z., Zhu, Y., Li, N., and Zhao, D. (2019). A

survey of deep reinforcement learning in video games.

arXiv preprint arXiv:1912.10944.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-

ing: An introduction. MIT press.

Tyagi, D. (2021). Reinforcement-learning: Implemen-

tations of deep reinforcement learning algorithms

and benchmarking with pytorch. https://github.com/

deepanshut041/reinforcement-learning.
