Figure 8: Evaluation reward as a function of the training time-step on the AntBulletEnv-v0, HalfCheetahBulletEnv-v0, HopperBulletEnv-v0, and Walker2DBulletEnv-v0 environments. The data are collected from mixed-policy training. The red curve is the Q2F-Opt agent with PER attached, the blue curve the Q2F-Opt agent, the green curve the Q2R-Opt agent with PER attached, the purple curve the Q2R-Opt agent, the orange curve the QT-Opt agent with PER attached, and the yellow curve the vanilla QT-Opt agent. All curves are trained with the same random seed, 254306. The semi-transparent area shows the range between the maximum and minimum values at each time-step over three runs. Note that the total number of training time-steps in Ant is 839,680, which is slightly lower than in the other environments.
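The shaded band in Figure 8 is simply the per-time-step minimum and maximum over the three runs, drawn around the plotted curve. The following is an illustrative plotting sketch only, not the paper's code; the array names (timesteps, rewards, colors) and the random placeholder data are assumptions.

```python
# Illustrative sketch: evaluation-reward curves with a semi-transparent
# min/max band over three runs, in the style of Figure 8 (placeholder data).
import numpy as np
import matplotlib.pyplot as plt

timesteps = np.arange(0, 839_680, 4096)          # x-axis: training time-steps
# rewards[agent] has shape (3, len(timesteps)): three runs per agent (assumed).
rewards = {
    "Q2F-Opt + PER": np.random.rand(3, len(timesteps)),
    "Q2F-Opt":       np.random.rand(3, len(timesteps)),
}
colors = {"Q2F-Opt + PER": "red", "Q2F-Opt": "blue"}

fig, ax = plt.subplots()
for name, runs in rewards.items():
    # Main curve for this agent.
    ax.plot(timesteps, runs.mean(axis=0), color=colors[name], label=name)
    # Semi-transparent band between the min and max value at each time-step.
    ax.fill_between(timesteps, runs.min(axis=0), runs.max(axis=0),
                    color=colors[name], alpha=0.3)
ax.set_xlabel("Training time-step")
ax.set_ylabel("Evaluation reward")
ax.legend()
plt.show()
```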
vestigate the conflict between Noisy Net and QT-Opt. Although Noisy Net is unstable in the current version of QT-Opt, the phenomenon remains of theoretical research interest.
REFERENCES
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR.
Berg, M. d., Kreveld, M. v., Overmars, M., and Schwarzkopf, O. (1997). Computational geometry. In Computational Geometry, pages 1–17. Springer.
Bodnar, C., Li, A., Hausman, K., Pastor, P., and Kalakrishnan, M. (2020). Quantile QT-Opt for risk-aware vision-based robotic grasping. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA.
Coumans, E. and Bai, Y. (2016–2021). PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org.
Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018a). Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1096–1105. PMLR.
Dabney, W., Rowland, M., Bellemare, M., and Munos, R. (2018b). Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. (2005). A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. (2017). Noisy networks for exploration. arXiv preprint arXiv:1706.10295.
Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR.
Fujimoto, S., Meger, D., and Precup, D. (2020). An equivalence between loss functions and non-uniform sampling in experience replay. Advances in Neural Information Processing Systems, 33.
Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23.
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. (2018). Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673. PMLR.
Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143–156.
Li, Y. (2017). Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Vaserstein, L. N. (1969). Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.