Authors:
Francis Maes, Louis Wehenkel and Damien Ernst
Affiliation:
University of Liège, Belgium
Keyword(s):
Multi-armed bandit problems, Reinforcement learning, Exploration-exploitation dilemma.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Symbolic Systems; Uncertainty in AI
Abstract:
We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information
describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed
bandit problems from the given prior, and then chooses, from a space of candidate policies, the one that gives the
best average performance over these problems. The candidate policies use an index for ranking the arms and
pick at each play the arm with the highest index; the index for each arm is computed in the form of a linear
combination of features describing the history of plays (e.g., number of draws, average reward, variance of
rewards and higher order moments), and an estimation of distribution algorithm is used to determine its optimal
parameters in the form of feature weights. We carry out simulations in the case where the prior assumes a
fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms.
These simulations show that the learned strategies perform very well compared with several strategies previously
proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY); they also highlight
the robustness of the learned strategies with respect to wrong prior information.
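The approach described in the abstract can be sketched in a few lines: an index policy scores each arm by a linear combination of history features and pulls the arm with the highest index, and a simple estimation-of-distribution loop (here a cross-entropy-style Gaussian refit, a common EDA variant) searches for good feature weights over Bernoulli problems sampled from a uniform prior. The feature set (draw count, empirical mean, empirical variance), the specific EDA variant, and all parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import random

def make_index_policy(weights):
    """Index policy: rank arms by a linear combination of history
    features (draw count, empirical mean, empirical variance) and
    pull the arm with the highest index. Feature set is an assumption."""
    def play(probs, horizon, rng):
        K = len(probs)
        counts = [0] * K
        sums = [0.0] * K
        sq_sums = [0.0] * K
        total = 0.0
        for t in range(horizon):
            if t < K:
                arm = t  # initialization: play each arm once
            else:
                def index(k):
                    mean = sums[k] / counts[k]
                    var = sq_sums[k] / counts[k] - mean * mean
                    return (weights[0] * counts[k]
                            + weights[1] * mean
                            + weights[2] * var)
                arm = max(range(K), key=index)
            r = 1.0 if rng.random() < probs[arm] else 0.0  # Bernoulli reward
            counts[arm] += 1
            sums[arm] += r
            sq_sums[arm] += r * r
            total += r
        return total
    return play

def learn_weights(n_iters=10, pop=20, elite=5, K=2, horizon=100,
                  n_problems=10, seed=0):
    """Cross-entropy-style EDA: sample weight vectors from a Gaussian,
    score each by its mean reward over bandit problems drawn from a
    uniform prior on Bernoulli parameters, then refit the Gaussian on
    the elite candidates."""
    rng = random.Random(seed)
    mu = [0.0, 0.0, 0.0]
    sigma = [1.0, 1.0, 1.0]
    # Sample training problems from the prior (uniform Bernoulli params).
    problems = [[rng.random() for _ in range(K)] for _ in range(n_problems)]
    for _ in range(n_iters):
        cands = [[rng.gauss(mu[i], sigma[i]) for i in range(3)]
                 for _ in range(pop)]
        scored = []
        for w in cands:
            policy = make_index_policy(w)
            score = sum(policy(p, horizon, random.Random(rng.randrange(10**9)))
                        for p in problems) / n_problems
            scored.append((score, w))
        scored.sort(key=lambda s: -s[0])
        top = [w for _, w in scored[:elite]]
        # Refit the sampling distribution on the elite set.
        for i in range(3):
            vals = [w[i] for w in top]
            mu[i] = sum(vals) / len(vals)
            var = sum((v - mu[i]) ** 2 for v in vals) / len(vals)
            sigma[i] = max(var ** 0.5, 0.05)  # floor to keep exploring
    return mu
```

Once learned, the weight vector is fixed and the resulting policy runs with no further optimization at play time, which is the point of pre-computing policies offline from the prior.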