# The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study

### Miguel Martin, Antonio Jiménez-Martín, Alfonso Mateos

#### Abstract

Allocation strategies for the multi-armed bandit problem have been proposed in the literature from both frequentist and Bayesian perspectives. In this paper, we propose a novel allocation strategy, the possibilistic reward method. First, possibilistic reward distributions are used to model the uncertainty about the expected reward of each arm. These are then converted into probability distributions using a pignistic probability transformation. Finally, a simulation experiment is carried out to identify the arm with the highest expected reward, which is then pulled. A parametric probability transformation of the proposed method is then introduced together with a dynamic optimization, which implies that neither previous knowledge nor a simulation of the arm distributions is required. A numerical study shows that the proposed method outperforms other policies in the literature in five scenarios: Bernoulli distributions with very low success probabilities, with success probabilities close to 0.5, and with success probabilities close to 0.5 and Gaussian rewards; and Poisson and exponential distributions truncated in [0,10].
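The three steps described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the possibility function uses a Hoeffding-style bound `exp(-2 n (mu - mean)^2)`, the conversion to probabilities is a plain normalisation on a grid standing in for the pignistic transformation, and the names `possibility`, `normalise`, `choose_arm` and `run_bernoulli` are hypothetical.

```python
import math
import random


def possibility(candidates, mean, pulls):
    """Possibility of each candidate expected reward for one arm.

    Assumed Hoeffding-style form: with no pulls every candidate is fully
    possible; more pulls concentrate the function around the sample mean.
    """
    return [math.exp(-2.0 * pulls * (mu - mean) ** 2) for mu in candidates]


def normalise(poss):
    """Turn possibilities into probabilities by dividing by their sum.

    Assumption: a simple stand-in for the pignistic transformation.
    """
    total = sum(poss)
    return [p / total for p in poss]


def choose_arm(means, pulls, candidates, rng):
    """Sample one expected reward per arm and return the index of the largest."""
    draws = []
    for mean, n in zip(means, pulls):
        probs = normalise(possibility(candidates, mean, n))
        draws.append(rng.choices(candidates, weights=probs, k=1)[0])
    return max(range(len(draws)), key=draws.__getitem__)


def run_bernoulli(true_probs, horizon, seed=0):
    """Play Bernoulli arms for `horizon` rounds; return pull counts per arm."""
    rng = random.Random(seed)
    candidates = [i / 200 for i in range(201)]  # grid over [0, 1]
    k = len(true_probs)
    sums = [0.0] * k
    pulls = [0] * k
    for _ in range(horizon):
        means = [s / n if n else 0.0 for s, n in zip(sums, pulls)]
        arm = choose_arm(means, pulls, candidates, rng)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        sums[arm] += reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(run_bernoulli([0.2, 0.8], horizon=2000))
```

In this sketch, exploration comes from the width of each arm's distribution: rarely pulled arms keep a wide distribution and are still sampled occasionally, while well-explored arms concentrate near their sample mean.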


#### Paper Citation

#### in Harvard Style

Martin M., Jiménez-Martín A. and Mateos A. (2017). **The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study**. In *Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES*, ISBN 978-989-758-218-9, pages 75-84. DOI: 10.5220/0006186400750084

#### in Bibtex Style

@conference{icores17,

author={Miguel Martin and Antonio Jiménez-Martín and Alfonso Mateos},

title={The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study},

booktitle={Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES},

year={2017},

pages={75-84},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0006186400750084},

isbn={978-989-758-218-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 6th International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES

TI - The Possibilistic Reward Method and a Dynamic Extension for the Multi-armed Bandit Problem: A Numerical Study

SN - 978-989-758-218-9

AU - Martin M.

AU - Jiménez-Martín A.

AU - Mateos A.

PY - 2017

SP - 75

EP - 84

DO - 10.5220/0006186400750084