Keywords: Reinforcement Learning, Distributional Reinforcement Learning, Risk, AI Safety, Conditional Value-at-Risk, CVaR, Value Iteration, Q-learning, Deep Learning, Deep Q-learning.

Abstract: Conditional Value-at-Risk (CVaR) is a well-known measure of risk that has been directly equated to robustness, an important component of Artificial Intelligence (AI) safety. In this paper we focus on optimizing CVaR in the context of Reinforcement Learning (RL), as opposed to the usual risk-neutral expectation. As a first original contribution, we improve the CVaR Value Iteration algorithm (Chow et al., 2015), reducing its computational complexity from polynomial to linear time. Secondly, we propose a sampling version of CVaR Value Iteration, which we call CVaR Q-learning. We also derive a distributional policy improvement algorithm and later use it as a heuristic for extracting the optimal policy from the converged CVaR Q-learning algorithm. Finally, to show the scalability of our method, we propose an approximate Q-learning algorithm by reformulating the CVaR Temporal Difference update rule as a loss function, which we later use in a deep learning context. All proposed methods are analyzed experimentally, including the Deep CVaR Q-learning agent, which learns to avoid risk from raw pixels.
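To illustrate the contrast between CVaR and the risk-neutral expectation mentioned in the abstract, the following is a minimal sketch and not the paper's algorithm. It assumes the common sample-based convention in which CVaR at level alpha is the mean of the worst alpha-fraction of returns (the lower tail); the function name `empirical_cvar` and the sample returns are hypothetical and chosen only for illustration.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns (lower tail).

    Assumed convention: higher returns are better, so risk lives in the
    lowest values. This is a plain sample estimate, not the paper's method.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    # Number of samples in the tail; the small epsilon guards against
    # floating-point overshoot in alpha * n before taking the ceiling.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    tail = ordered[:k]
    return sum(tail) / k


# Hypothetical returns: mostly positive, with one rare large loss.
returns = [-10.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

mean_return = sum(returns) / len(returns)  # risk-neutral objective
cvar_20 = empirical_cvar(returns, 0.2)     # average of the worst 20%

print(mean_return, cvar_20)
```

With alpha = 1 the estimate reduces to the ordinary mean, while smaller alpha values weight the rare large loss more heavily, which is the sense in which a CVaR objective is risk-averse.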

Stanko, S. and Macek, K. (2019). Risk-averse Distributional Reinforcement Learning: A CVaR Optimization Approach. In Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA, (IJCCI 2019), ISBN 978-989-758-384-1, pages 412-423. DOI: 10.5220/0008175604120423

@conference{ncta19,
  author={Silvestr Stanko and Karel Macek},
  title={Risk-averse Distributional Reinforcement Learning: A CVaR Optimization Approach},
  booktitle={Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA, (IJCCI 2019)},
  year={2019},
  pages={412-423},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0008175604120423},
  isbn={978-989-758-384-1}
}

TY  - CONF
JO  - Proceedings of the 11th International Joint Conference on Computational Intelligence - Volume 1: NCTA, (IJCCI 2019)
TI  - Risk-averse Distributional Reinforcement Learning: A CVaR Optimization Approach
SN  - 978-989-758-384-1
AU  - Stanko, S.
AU  - Macek, K.
PY  - 2019
SP  - 412
EP  - 423
DO  - 10.5220/0008175604120423