Exploration Versus Exploitation Trade-off in Infinite Horizon Pareto Multi-armed Bandits Algorithms

Madalina Drugan, Bernard Manderick

2015

Abstract

Multi-objective multi-armed bandits (MOMAB) are multi-armed bandits (MAB) extended to reward vectors. We use the Pareto dominance relation to assess the quality of reward vectors, as opposed to scalarization functions. In this paper, we study the exploration vs exploitation trade-off in infinite horizon MOMAB algorithms. Single-objective MABs explore the suboptimal arms and exploit a single optimal arm. MOMABs also explore the suboptimal arms, but in addition they need to exploit all the optimal arms fairly. We study the exploration vs exploitation trade-off of the Pareto UCB1 algorithm. We extend UCB2, another popular infinite horizon MAB algorithm, to reward vectors using the Pareto dominance relation. We analyse the properties of the proposed MOMAB algorithms in terms of upper regret bounds. We experimentally compare the exploration vs exploitation trade-off of the proposed MOMAB algorithms on a bi-objective Bernoulli environment from control theory.
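For intuition, the following is a minimal Python sketch of the kind of selection rule Pareto UCB1 uses: each arm receives a vector-valued upper confidence index, and one arm is drawn uniformly at random from the Pareto front of those indices, so that all apparently optimal arms are exploited fairly. The function and parameter names are ours, the confidence bonus reflects our reading of the Pareto UCB1 index (a UCB1-style term corrected for the number of objectives and Pareto-optimal arms), and in practice the number of Pareto-optimal arms is unknown, so the total number of arms can serve as an upper bound; see the paper for the exact algorithm and its regret analysis.

import math
import random

def dominates(u, v):
    # u Pareto-dominates v: at least as good in every objective, strictly better in one.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_ucb1_select(means, counts, n, n_objectives, n_pareto_arms):
    """Pick the arm to pull in one round of a Pareto UCB1-style rule.

    means   -- per-arm empirical mean reward vectors (one sequence per arm)
    counts  -- number of pulls of each arm so far (all assumed > 0)
    n       -- total number of pulls so far
    """
    def bonus(n_i):
        # UCB1-style confidence radius with a correction for the number of
        # objectives and the (estimated) number of Pareto-optimal arms.
        return math.sqrt(2.0 * math.log(n * (n_objectives * n_pareto_arms) ** 0.25) / n_i)

    # Vector-valued index: empirical mean plus the same bonus in every objective.
    indices = [[m + bonus(counts[i]) for m in means[i]] for i in range(len(means))]

    # Pareto front of the indices: arms whose index is not dominated by any other arm.
    front = [i for i, u in enumerate(indices)
             if not any(dominates(v, u) for j, v in enumerate(indices) if j != i)]

    # Exploit all apparently optimal arms fairly: choose among them uniformly at random.
    return random.choice(front)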



Paper Citation


in Harvard Style

Drugan M. and Manderick B. (2015). Exploration Versus Exploitation Trade-off in Infinite Horizon Pareto Multi-armed Bandits Algorithms. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 66-77. DOI: 10.5220/0005195500660077


in Bibtex Style

@conference{icaart15,
author={Madalina Drugan and Bernard Manderick},
title={Exploration Versus Exploitation Trade-off in Infinite Horizon Pareto Multi-armed Bandits Algorithms},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={66-77},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005195500660077},
isbn={978-989-758-074-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Exploration Versus Exploitation Trade-off in Infinite Horizon Pareto Multi-armed Bandits Algorithms
SN - 978-989-758-074-1
AU - Drugan M.
AU - Manderick B.
PY - 2015
SP - 66
EP - 77
DO - 10.5220/0005195500660077
ER -