Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit

Saba Yahyaa; Madalina Drugan; Bernard Manderick

doi:10.5220/0005184400550065

2015

Abstract

In the stochastic multi-objective multi-armed bandit (MOMAB), each arm generates a vector of stochastic, normally distributed rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation. The goal of the agent is to find the Pareto front. To find the optimal arms, the agent can use a linear scalarization function, which transforms a multi-objective problem into a single-objective problem by summing the weighted objectives. Selecting the weights is crucial, since different weights can select different optimal arms from the Pareto front. Usually, a predefined set of weights is used. This can be computationally inefficient when several weights optimize the same Pareto optimal arm while other arms in the Pareto front are never identified. In this paper, we propose a number of techniques that adapt the weights on the fly in order to improve the performance of the scalarized MOMAB. We use genetic and adaptive scalarization functions from multi-objective optimization to generate new weights. We propose using a Thompson sampling policy to select more frequently the weights that identify new arms on the Pareto front. We show experimentally that Thompson sampling improves the performance of the genetic and adaptive scalarization functions. All the proposed techniques improve on the performance of the standard scalarized MOMAB with a fixed set of weights.
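The following is a minimal sketch of Pareto dominance and linear scalarization as described in the abstract; it is not the authors' implementation. It assumes maximisation, works on known mean reward vectors rather than sampled rewards, and uses hypothetical arm means and weight vectors (`MEANS`, `WEIGHTS`) chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dominates(u: Vector, v: Vector) -> bool:
    """True if u Pareto-dominates v: no worse in every objective, better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(means: Sequence[Vector]) -> List[int]:
    """Indices of arms whose mean vector is not dominated by any other arm."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


def scalarize(weights: Vector, mean: Vector) -> float:
    """Linear scalarization: weighted sum of the objectives."""
    return sum(w * m for w, m in zip(weights, mean))


def best_arm(weights: Vector, means: Sequence[Vector]) -> int:
    """Index of the arm that maximises the scalarized mean for these weights."""
    return max(range(len(means)), key=lambda i: scalarize(weights, means[i]))


# Hypothetical two-objective problem with five arms (illustrative values only).
MEANS = [(0.9, 0.1), (0.6, 0.6), (0.1, 0.9), (0.4, 0.4), (0.45, 0.5)]

# Hypothetical fixed, evenly spaced weight set; each vector sums to 1.
WEIGHTS = [(1.0, 0.0), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]

if __name__ == "__main__":
    print("Pareto front:", pareto_front(MEANS))
    for w in WEIGHTS:
        print("weights", w, "-> arm", best_arm(w, MEANS))
```

In this toy setting, several of the five weight vectors select the same arm, which illustrates the redundancy of a fixed weight set that motivates adapting the weights.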

Paper Citation

in Harvard Style

Yahyaa S., Drugan M. and Manderick B. (2015). Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 55-65. DOI: 10.5220/0005184400550065

in Bibtex Style

@conference{icaart15,
author={Saba Yahyaa and Madalina Drugan and Bernard Manderick},
title={Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={55-65},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005184400550065},
isbn={978-989-758-074-1},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit
SN - 978-989-758-074-1
AU - Yahyaa S.
AU - Drugan M.
AU - Manderick B.
PY - 2015
SP - 55
EP - 65
DO - 10.5220/0005184400550065