Deep vs. Deep Bayesian: Faster Reinforcement Learning on a Multi-robot Competitive Experiment
Jingyi Huang 1, Fabio Giardina 2 and Andre Rosendo 1
1 School of Information Science and Technology, ShanghaiTech University, China
2 John A. Paulson School of Engineering and Applied Sciences, Harvard, U.S.A.
Keywords: Reinforcement Learning, Policy Search, Robotics.
Abstract: Deep Learning experiments commonly require hundreds of trials to properly train neural networks, often labeled as Big Data, while Bayesian learning leverages scarce data points to infer the next iterations, also known as Micro Data. Deep Bayesian Learning combines the complexity of multi-layered neural networks with probabilistic inference, and it allows a robot to learn good policies within few trials in the real world. Here we propose, for the first time, an application of Deep Bayesian Reinforcement Learning (RL) to a real-world multi-robot confrontation game, and compare the algorithm with a model-free Deep RL algorithm, Deep Q-Learning. Our experiments show that Deep Bayesian RL significantly outperforms Deep RL in learning efficiency and scalability. The results of this work point to the advantages of Deep Bayesian approaches in bypassing the Reality Gap and sim-to-real implementations, as the time taken for real-world learning can quickly outperform data-intensive Deep alternatives.
1 INTRODUCTION
Deep Q-Learning (DQL) algorithms have been commonly used in robotic control and decision making ever since (Mnih et al., 2016) first proposed the Deep Q-Network framework. Because DQL trains on samples drawn from a replay buffer without learning a transition model, it usually requires a tremendous number of trials to learn a specific task. As a consequence, most applications have been performed in simulated environments. (Lillicrap et al., 2016) and (Gu et al., 2016) applied DQL to the continuous control domain only in simulation. (Rusu et al., 2017) accomplished a real-world robot manipulation task by presenting progressive networks to bridge the sim-to-real gap.
Compared to model-free DQL, model-based RL algorithms are more sample-efficient, allowing a robot to learn good policies within fewer trials. This sample efficiency can be improved significantly further by learning a probabilistic or Bayesian transition model (Deisenroth and Rasmussen, 2011; Gal et al., 2016; Chua et al., 2018; Depeweg et al., 2019). (Gal et al., 2016) proposed a deep
Bayesian model-based RL algorithm, Deep PILCO, which relies on a Bayesian neural network (BNN) transition model. It improved upon the DQL algorithms used in (Lillicrap et al., 2016) and (Gu et al., 2016) by at least an order of magnitude in the number of trials on the cart-pole swing-up benchmark task.
Figure 1: The arena used for experiments. As the enemies do not change position during iterations, we use two plastic boxes (in black in the figure) to emulate their positioning, forcing our robots to use LiDAR sensors to localize them.
Until now, Deep PILCO has been applied to a number of robotic tasks. (Gamboa Higuera et al., 2018) improved Deep PILCO by using random numbers
and clipping gradients, and applied it to learning swimming controllers for a simulated six-legged autonomous underwater vehicle. In (Kahn et al., 2017), the authors used Deep PILCO with bootstrapping (Efron, 1982) to let a quadrotor and an RC car navigate an a priori unknown environment while avoiding collisions. The advantages of Deep PILCO in learning speed have thus far been demonstrated in simulations and single-robot experiments.
Here we propose, for the first time in the real world, applications of Deep Learning and Deep Bayesian Learning to a multi-robot confrontation game. Our experiments aim to solve the decision making problem of robots at the international IEEE ICRA AI Challenge, a problem which was tackled with a different approach in our previous work (Zhang and Rosendo, 2019). We compare a Deep PILCO implementation to a Deep Q-Learning algorithm on this very same experiment. Our results show that Deep PILCO significantly outperformed DQL in learning efficiency and scalability; these results are discussed in Section 4. We conclude by pointing to the advantages of Deep Bayesian Reinforcement Learning over Deep Reinforcement Learning when implemented in the real world.
2 PROBLEM DEFINITION
The real-world experiments were run using a robot manufactured by DJI (Fig. 2). The robot's hardware components sense the environment and provide sensory data for the experiment. A LiDAR and an IMU collect data for the robot to localize both itself and the enemy robots on the map. The camera helps detect the armors of the enemy robots, which is essential for automatic firing. A Raspberry Pi and an NVIDIA TX2 are the robot's two computing units.
Figure 2: Hardware of the adopted robot. The robot is ca-
pable of recognizing the enemy through a combination of a
LiDAR and a camera, both sensors sampled by a TX2 and
a Raspberry Pi.
The multi-robot competitive problem was divided into several sub-modules, as shown in Fig. 3. The LiDAR-based localization module and the enemy detection module pass the Cartesian coordinates of the robots to the decision making module as inputs. In the decision making module, we run RL algorithms to obtain a policy search strategy. The strategy then generates a goal position for the robot based on the current circumstances and sends it to the path planning module. Afterwards, the path planning module plans a feasible path on the map to let the robot reach the goal position. In this paper, we focus on the implementation details of the RL methods for the decision making module.
Figure 3: Main modules of the multi-robot competitive
problem. We use Deep PILCO and DQL algorithms to train
a policy search strategy for the decision making module.
To fulfill the Markovian property required by RL problems, we re-formulated the multi-robot competitive problem as a Markov Decision Process (MDP) as follows. The MDP is composed of states, actions, transitions, rewards and a policy, and can be represented by a tuple ⟨S, A, T, R, π⟩.
State: S is the state space which contains all possible states. Since the map of the arena is fully known and the path planning module is independent of the RL algorithm, we projected the 3-dimensional coordinate of the robot (x, y, z) onto a 1-dimensional coordinate p, where p represents the position on the map. As shown in Fig. 4, the original map was divided into 30 strategic areas in advance. The size of each area depends on how likely the robot is to appear there during a match. With this treatment, a state can be denoted by a tuple (p_M, p_{E_n}, N_E). p_M represents the position of the robot itself, p_{E_n} represents the positions of the enemy robots, where n ∈ {1, 2} is the index of the enemy robots, and N_E represents the number of enemy robots discovered by the LiDAR-based enemy detection function.
Action: A is the action space, which consists of the actions the robot can take. For our problem, an action (p_G) is the next goal position for the robot.
Transition: T(s'|s, a) is the transition distribution over the next state s', given that the robot took action a in state s.
Figure 4: The original arena on the top is divided into 30 strategic areas to discretize the state space, as shown in the bottom figure. In the original map, the red and blue squares indicate special regions of the competition, such as starting zones and bonus zones. These marks can be ignored in our experiments.
Reward: R(s) is the immediate reward function over state s. For this experiment, the reward was computed solely from the number of visible enemy robots in state s:

R(s) =
\begin{cases}
0, & s[N_E] \neq n \quad \text{(1a)}\\
1, & s[N_E] = n \quad \text{(1b)}
\end{cases}

where n is the target number of visible enemy robots.
Policy: π(a|s) is a probability distribution over all actions given the state s. The action to be taken is given by the policy based on the current state.
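For concreteness, the discretized state and the reward of Eq. (1) can be written as a short Python sketch. This is only an illustration of the formulation above, not the code used in our experiments; the names State, reward, p_m, p_e and n_e are our own.

```python
from dataclasses import dataclass
from typing import Tuple

N_AREAS = 30  # the arena is discretized into 30 strategic areas

@dataclass
class State:
    p_m: int                 # strategic-area index of our robot (p_M)
    p_e: Tuple[int, ...]     # strategic-area indices of the enemy robots (p_{E_n})
    n_e: int                 # number of enemies detected by the LiDAR (N_E)

def reward(state: State, n_target: int) -> float:
    """Immediate reward of Eq. (1): 1 when the target number of
    enemy robots is visible in this state, 0 otherwise."""
    return 1.0 if state.n_e == n_target else 0.0

# Example: in the 1v2 case (n_target = 2), seeing both enemies gives reward 1.
s = State(p_m=12, p_e=(3, 27), n_e=2)
assert reward(s, n_target=2) == 1.0
```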
3 MATERIALS AND METHODS
3.1 Experimental Design
We ran the experiments in two cases. The first was the 1v1 design, in which one robot faced one enemy robot; the second was the 1v2 design, in which one robot faced two enemy robots.
The enemy robots remained static during each episode. We could therefore use boxes of a size similar to the robot to represent the enemy robots, as shown in Fig. 1.
3.2 Experimental Methods
3.2.1 Deep Q-Learning
Q-Learning (Watkins and Dayan, 1992) algorithms
aim to solve an MDP by learning the Q value func-
tion Q(s, a). Q(s, a) is a state-action value function,
which gives the expected future return starting from a
particular state-action tuple. The basic idea is to esti-
mate the optimal Q value function Q*(s, a) by using the Bellman equation as an update:

Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \,\right]. \quad (2)
DQL is a variant of the Q-Learning algorithm which uses a deep neural network as a function approximator for the Q value function, with training samples drawn from an experience replay buffer. Note that DQL is model-free: it solves the RL task directly using samples from the emulator, without explicitly constructing an estimate of the emulator (or transition model) (Mnih et al., 2016). Unlike the model-based algorithm PILCO (Deisenroth and Rasmussen, 2011), which updates the policy once after each episode, DQL updates the Q-network with samples from the replay buffer at every step.
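As a rough illustration of this per-step update, the sketch below performs one gradient step on a batch sampled from the replay buffer, using the Bellman target of Eq. (2). It is a generic DQN update, not the internals of the Tianshou implementation we used; the batch layout and the absence of a separate target network are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.9):
    """One gradient step on a replay-buffer batch (s, a, r, s', done).

    Simplified: the bootstrap target reuses q_net instead of a separate
    target network, and the batch layout is assumed, not Tianshou's.
    """
    s, a, r, s_next, done = batch                          # sampled from the buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
    with torch.no_grad():                                  # Bellman target of Eq. (2)
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                        # TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```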
We implemented the DQL algorithm using the Tianshou library (Weng et al., 2020), which relies on the PyTorch library (Paszke et al., 2019) for neural network-related computations. As for the model architecture, the input to the Q-network is a state vector. The two hidden layers consist of 128 neurons for simulations and 16 neurons for the real-world experiments, activated by the ReLU function (Nair and Hinton, 2010). The output layer is a fully-connected linear layer producing the action output. The policy during training is ε-greedy with ε = 0.1. The learning rate is 0.001, the discount factor is 0.9, and the size of the replay buffer is 20000.
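The architecture and behaviour policy described above roughly correspond to the following PyTorch sketch. It mirrors the stated hyperparameters (two hidden layers, ReLU, ε = 0.1, learning rate 0.001), but the exact layer layout, the one-Q-value-per-goal output head, and the choice of the Adam optimizer are our assumptions rather than the Tianshou configuration used in the experiments.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network with two hidden ReLU layers (16 units in the real-world
    experiments, 128 in simulation), one Q-value per candidate goal area."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def select_action(q_net: QNet, s: torch.Tensor, n_actions: int, eps: float = 0.1) -> int:
    """Epsilon-greedy behaviour policy used during training."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax(dim=1).item())

# q_net = QNet(state_dim=4, n_actions=30)
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # optimizer choice assumed
```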
3.2.2 Deep PILCO
Compared to model-free deep RL algorithms, model-
based RL allows higher sample efficiency, which can
be further improved with a probabilistic transition
model. Deep PILCO is a prominent example which
utilizes a Bayesian neural network (BNN) (MacKay,
1992) to estimate the transition model (Gal et al.,
2016; Gamboa Higuera et al., 2018).
The algorithm can be summarized as follows. A functional form for the policy π is chosen, with randomly initialized parameters φ. Deep PILCO then executes the current policy on the real agent from the current state until the time horizon T. The new observations are recorded and appended to the
whole dataset, from which a new probabilistic transition model (more precisely, the model parameters of the BNN) is re-trained. Based on this probabilistic transition model, Deep PILCO predicts state distributions from the current initial state distribution p(X_0) up to p(X_T). In detail, the state input and output uncertainty are encoded using particle methods. Given the multi-state distribution p(X_0, ..., X_T), the cumulative expected cost J(φ) is computed with a user-defined cost function. By minimizing this objective function (using gradient descent), a newly optimized policy π_φ is obtained. Note that here we defined the cost function as the opposite of the reward: Cost(X) = 1 − R(X).
We implemented Deep PILCO in an episodic way, so that the algorithm updates the policy after every episode based on the episodic reward. The episodic reward is the sum of the iteration rewards, and each episode consists of 10 iterations. During one iteration, the robot moves from its current position to the goal position given by the action, along the planned path. The code is a modified version of an open-source implementation of the Deep PILCO algorithm (Gamboa Higuera et al., 2018).
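The episodic procedure above can be summarized by the following sketch of a single Deep PILCO training iteration. All callables (rollout_real, sample_init_states, reward_fn) and the bnn_model/policy interfaces are placeholders introduced for illustration; the actual code adapts the open-source implementation of (Gamboa Higuera et al., 2018) rather than this sketch.

```python
import torch

def deep_pilco_iteration(policy, policy_optimizer, bnn_model, dataset,
                         rollout_real, sample_init_states, reward_fn,
                         horizon=10, n_particles=10, opt_steps=100):
    """One Deep PILCO iteration (illustrative sketch, placeholder interfaces).

    1. Execute the current policy on the real robot and record the episode.
    2. Re-train the BNN transition model on the accumulated dataset.
    3. Optimize the policy by gradient descent on the expected cumulative
       cost J(phi), estimated with particle rollouts of the learned model
       and the cost convention Cost(X) = 1 - R(X).
    """
    dataset += rollout_real(policy, horizon)            # real-world episode
    bnn_model.fit(dataset)                              # re-train transition model

    for _ in range(opt_steps):
        particles = sample_init_states(n_particles)     # samples from p(X_0)
        cost = torch.zeros(())
        for _t in range(horizon):
            actions = policy(particles)
            particles = bnn_model.propagate(particles, actions)  # predict p(X_{t+1})
            cost = cost + (1.0 - reward_fn(particles)).mean()    # expected cost
        policy_optimizer.zero_grad()
        cost.backward()
        policy_optimizer.step()
    return policy, dataset
```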
3.2.3 LiDAR-based Enemy Detection Function
We used a 2d obstacle detection algorithm to extract
obstacles information from lidar data. Since we knew
the map, we knew where the walls are. If the center
of a robot was inside a wall, we filtered out this cir-
cle (Zhang and Rosendo, 2019). A screenshot taken
during the algorithm running is presented in Figure 5.
Figure 5: Visualization of the LiDAR-based enemy detection algorithm. The positions of the plastic boxes (enemies) are shown as the two green circles. The navigation stack in ROS depicts the contour of the obstacles in yellow, and displays the local costmap in blue, red and purple.
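A minimal sketch of the wall-filtering step is shown below. It assumes the known map is available as a boolean occupancy grid (True = wall) in the map frame; the function and variable names, and the grid representation itself, are illustrative assumptions rather than the ROS-based implementation we ran.

```python
from typing import List, Tuple
import numpy as np

def filter_enemy_candidates(circles: List[Tuple[float, float, float]],
                            wall_mask: np.ndarray,
                            resolution: float) -> List[Tuple[float, float, float]]:
    """Keep detected circles (x, y, radius) whose centers do NOT lie inside a wall.

    `wall_mask[i, j]` is True if grid cell (i, j) of the known map is a wall;
    `resolution` is the cell size in meters (map representation is assumed).
    """
    enemies = []
    for x, y, r in circles:
        i, j = int(y / resolution), int(x / resolution)   # grid cell containing the center
        inside_map = 0 <= i < wall_mask.shape[0] and 0 <= j < wall_mask.shape[1]
        if inside_map and not wall_mask[i, j]:
            enemies.append((x, y, r))                     # plausible enemy robot
    return enemies
```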
4 RESULTS
We first compare the episodic rewards of DQL and Deep PILCO. In order to reveal the learning trend, we also plot the rolling mean reward over six neighboring episodes for DQL. For the 1v1 case, both algorithms learned optimal solutions after training. In Fig. 6(a), we can see that Deep PILCO found the solution within 11 episodes, far fewer than DQL, which took around 90 episodes. Furthermore, the reward of Deep PILCO stayed maximal after that point, while the reward of DQL was more unstable.
For the 1v2 case, the results of both algorithms fluctuated more than in the 1v1 case, as shown in Fig. 6(b). While Deep PILCO needed a similar number of episodes when moving from the 1v1 case to the 1v2 case, DQL failed to converge to an optimal solution even after 400 training episodes.
Considering the expensive training cost of real-world experiments, we stopped the experiment after 400 episodes. To eliminate the impact of the hyperparameters, we changed the learning rate of the DQL algorithm and reran the experiments, but DQL was still unable to find a stable optimal solution, as can be seen in Fig. 7.
With regard to computation time, fewer training episodes do not necessarily translate into shorter training time, since each episode costs a different amount of wall-clock time for DQL and Deep PILCO. For both algorithms, each episode contains 10 iterations, and each iteration takes about 10 seconds to run on the robot. On top of this, the policy-update computation takes approximately 1 minute per episode for Deep PILCO and about 3 seconds per episode for DQL. In total, training takes roughly 160 seconds per episode for Deep PILCO and 103 seconds per episode for DQL.
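For clarity, the per-episode totals follow directly from these figures (a simple bookkeeping, assuming the 10-second iterations and the per-episode computation times quoted above):

```latex
\begin{align*}
t_{\text{Deep PILCO}} &\approx 10 \times 10\,\mathrm{s} + 60\,\mathrm{s} = 160\,\mathrm{s} \text{ per episode},\\
t_{\text{DQL}}        &\approx 10 \times 10\,\mathrm{s} + 3\,\mathrm{s}  = 103\,\mathrm{s} \text{ per episode}.
\end{align*}
```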
Fig. 8 displays snapshots of our experiments with Deep PILCO for the 1v2 case. The four pictures show different phases of the optimal policy that was found. We can see that our robot started from the initial state, where it saw none of the enemy robots, and finally navigated to an optimal position where it could see both enemy robots at the same time. For the rest of the episode, it stayed at this optimal position in order to achieve the highest episodic reward.
5 DISCUSSION
Experimental results show that Deep Bayesian RL surpassed Deep RL in both learning efficiency and learning speed. This confirms the findings of previous works (Deisenroth and Rasmussen, 2011; Gal et al., 2016).
Figure 6: Learning curves tracking the rewards over episodes. Deep PILCO and Deep Q-Learning were run with learning rate α = 0.001. (a) Training rewards of DQL and Deep PILCO for the 1v1 case. The red curve ends earlier because Deep PILCO converged to the optimal reward within fewer training episodes. (b) Training rewards of DQL and Deep PILCO for the 1v2 case.
Figure 7: Training results of three different learning rate setups for Deep Q-Learning in the 1v2 case. Empirically, large learning rates hinder convergence in the DQL experiments. With that in mind, we ran more trials with the smallest learning rate, 0.001. Nonetheless, all three experiments failed to achieve a reasonably high reward.
Although the per-iteration computation time required by DQL is much shorter than that of Deep PILCO, the learning efficiency of the latter makes up for the cost. Moreover, as Moore's law predicts, the number of transistors on a chip doubles with each generation of technology; we therefore presume that, with foreseeably more advanced computation hardware, Deep Bayesian RL algorithms will learn policies much faster than Deep RL.
The first few training episodes of Deep PILCO in the 1v2 case achieved higher initial rewards than in the 1v1 case, as can be seen in Figure 6. This is the result of better exploration by the initial random rollouts in the 1v2 case. Yet in our experiments, a larger number of initial random rollouts did not guarantee higher rewards, which agrees with the finding in (Nagabandi et al., 2018). In that work, the authors evaluated various design decisions in model-based RL algorithms, including the number of initial random trajectories. They found that low-data initialization runs were able to reach a high final performance level as well, thanks to the data aggregation during reinforcement learning.
6 CONCLUSIONS
We proposed a new application of Deep PILCO on
a real-world multi-robot combat game. We further
compared this Deep Bayesian RL algorithm with
the Deep Learning-based RL algorithm, DQL. Our
results showed that Deep PILCO significantly out-
performs Deep Q-Learning in learning speed and
scalability. We conclude that sample-efficient Deep
Bayesian learning algorithms have great prospects on
competitive games where the agent aims to win the
opponents in the real world, as opposed to being lim-
ited to simulated applications.
Figure 8: Snapshots of the real-world experiment for the 1 vs 2 situation. After about 15 training episodes, the robot found the optimal position from which to see the two enemies simultaneously. During the episode with the highest reward, the robot started from the initial position and navigated to the optimal place by the end of the first iteration. The robot stayed at that place for the rest of the episode to obtain the maximal reward.
REFERENCES
Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. Advances in Neural Information Processing Systems, (NeurIPS):4754–4765.
Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO:
A model-based and data-efficient approach to policy
search. Proceedings of the 28th International Confer-
ence on Machine Learning, ICML 2011, pages 465–
472.
Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F.,
and Udluft, S. (2019). Learning and policy search
in stochastic dynamical systems with Bayesian neural
networks. 5th International Conference on Learning
Representations, ICLR 2017 - Conference Track Pro-
ceedings, pages 1–14.
Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans, volume 38. SIAM.
Gal, Y., McAllister, R. T., and Rasmussen, C. E. (2016). Im-
proving PILCO with Bayesian Neural Network Dy-
namics Models. Data-Efficient Machine Learning
Workshop, ICML, pages 1–7.
Gamboa Higuera, J. C., Meger, D., and Dudek, G. (2018).
Synthesizing Neural Network Controllers with Proba-
bilistic Model-Based Reinforcement Learning. IEEE
International Conference on Intelligent Robots and
Systems, pages 2538–2544.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016).
Continuous deep q-learning with model-based accel-
eration. 33rd International Conference on Machine
Learning, ICML 2016, 6:4135–4148.
Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and
Levine, S. (2017). Uncertainty-aware reinforcement
learning for collision avoidance. arXiv preprint
arXiv:1702.01182.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., and Wierstra, D. (2016). Contin-
uous control with deep reinforcement learning. 4th In-
ternational Conference on Learning Representations,
ICLR 2016 - Conference Track Proceedings.
MacKay, D. J. (1992). Bayesian Methods for Adaptive
Models. PhD thesis, California Institute of Technol-
ogy.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2016). Playing Atari with Deep Reinforcement
Learning.
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S.
(2018). Neural Network Dynamics for Model-Based
Deep Reinforcement Learning with Model-Free Fine-
Tuning. Proceedings - IEEE International Conference
on Robotics and Automation, pages 7579–7586.
Nair, V. and Hinton, G. E. (2010). Rectified linear units
improve restricted boltzmann machines. In Icml.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., et al. (2019). Pytorch: An imperative
style, high-performance deep learning library. arXiv
preprint arXiv:1912.01703.
Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu,
R., and Hadsell, R. (2017). Sim-to-Real Robot Learn-
ing from Pixels with Progressive Nets. 1st Conference
on Robot Learning, CoRL 2017, (CoRL):1–9.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine
learning, 8(3-4):279–292.
Weng, J., Zhang, M., Duburcq, A., You, K., Yan, D., Su,
H., and Zhu, J. (2020). Tianshou. https://github.com/
thu-ml/tianshou.
Zhang, Y. and Rosendo, A. (2019). Tactical reward shaping:
Bypassing reinforcement learning with strategy-based
goals. IEEE International Conference on Robotics
and Biomimetics, ROBIO 2019, (December):1418–
1423.