Dataset Characteristics and Their Impact on Offline Policy Learning of
Contextual Multi-Armed Bandits
Piotr Januszewski^a, Dominik Grzegorzek^b and Paweł Czarnul^c
Department of Computer Architecture, Gdańsk University of Technology, Gdańsk, Poland
a https://orcid.org/0000-0003-3817-3479
b https://orcid.org/0009-0006-0310-7104
c https://orcid.org/0000-0002-4918-9196
Keywords:
Contextual Multi-Armed Bandits, Offline Policy Learning, Dataset Quality.
Abstract:
The Contextual Multi-Armed Bandits (CMAB) framework is pivotal for learning to make decisions. However,
due to challenges in deploying online algorithms, there is a shift towards offline policy learning, which relies
on pre-existing datasets. This study examines the relationship between the quality of these datasets and the
performance of offline policy learning algorithms, specifically, Neural Greedy and NeuraLCB. Our results
demonstrate that NeuraLCB can learn from various datasets, while Neural Greedy necessitates extensive cov-
erage of the action-space for effective learning. Moreover, the way data is collected significantly affects offline
methods’ efficiency. This underscores the critical role of dataset quality in offline policy learning.
1 INTRODUCTION
The Contextual Multi-Armed Bandits (CMAB) framework (Lattimore and Szepesvári, 2020) enables
a myriad of services such as recommendation sys-
tems (Zhao et al., 2013), personalized healthcare (Ko-
morowski et al., 2018), online advertisement (Li et al.,
2010), and resource allocation (Badanidiyuru et al.,
2013) to actively learn and adapt from interactions
with their respective environments.
However, the challenge of continuously deploy-
ing online learning algorithms due to costs, privacy
concerns, or constraints on exploration has brought
offline policy learning to the forefront (Levine et al.,
2020). Offline policy learning, which learns policies
from previously collected datasets without interacting
with the environment, is beneficial in settings where
online interactions are costly, risky, or unethical.
One major determinant of offline policy learning
performance is the quality and nature of the offline
dataset (Fujimoto et al., 2019). This component is
often overlooked, with the majority of studies empha-
sizing algorithmic improvements. Hence, there exists
a research gap in fully understanding the impact of
the offline dataset on the performance of the offline
policy learning algorithms.
This paper seeks to shed light on the relationship between offline datasets and the performance of
offline policy learning algorithms in the contextual
bandits setting. In our empirical investigation, we
gather datasets that meet various conditions. For the
uniform data coverage assumption (Brandfonbrener
et al., 2021), actions need to be chosen uniformly at
random. In contrast, the single-policy concentration
condition (Nguyen-Tang et al., 2022) necessitates the
behavior policy to only cover the target optimal pol-
icy. We also explore datasets that interpolate between these two conditions or that satisfy neither, particu-
larly those that predominantly issue suboptimal ac-
tions.
Our work considers two offline methods, Neural
Greedy, referred to as the Direct Method in (Dudik
et al., 2011), and NeuraLCB (Nguyen-Tang et al.,
2022). We chose NeuraLCB due to its state-of-the-
art performance in offline policy learning, offering
cutting-edge insights into how different dataset char-
acteristics impact algorithm efficiency. On the other
hand, Neural Greedy is known for its simplicity and
ease of implementation, and represents a method that
practitioners are likely to adopt first (Dutta et al.,
2019).
We present a comprehensive empirical investiga-
tion that elucidates the relationship between offline
dataset characteristics and the performance of offline
policy learning algorithms. This analysis specifi-
cally contrasts two distinct methods: Neural Greedy
and NeuraLCB, providing a nuanced understanding
of how different dataset conditions influence their ef-
ficiency. Our findings offer practical guidance for
practitioners in preparing offline datasets and select-
ing appropriate offline methods, thus contributing to
improved performance in real-world applications.
2 PRELIMINARIES
We consider the Contextual Multi-Armed Bandits (CMAB) framework (Lattimore and Szepesvári, 2020). Let the context space $\mathcal{X}$ be infinite and the action space $\mathcal{A}$ be finite with $|\mathcal{A}| = K < \infty$. Additionally, let a reward vector $r \in [r_{\min}, r_{\max}]^K$ and a context $x \in \mathcal{X}$ be drawn from a joint distribution $\rho$.
Let $\mathcal{H}$ represent the history of past interactions up to round $t-1$, where a history $h_{1:t-1}$ is a sequence $h_{1:t-1} = \{(x_1, a_1, r_{a_1}), \ldots, (x_{t-1}, a_{t-1}, r_{a_{t-1}})\}$.
We now define a policy $\pi : \mathcal{X} \times \mathcal{H} \to \mathcal{P}(\mathcal{A})$ that maps the current context and the history up to that round to a distribution over actions. At each round $t$, a contextual bandit observes a context $x_t$, decides on an action $a_t \sim \pi(x_t, h_{1:t-1})$, and receives an action reward $r_{a_t}$, which is the component of the vector $r_t$ corresponding to the action $a_t$; this is termed bandit feedback.
For every policy $\pi$, the regret $R_T$ after $T$ rounds is defined as:
$$R_T = \sum_{t=1}^{T} \left( \max_{a \in \mathcal{A}} r_{t,a} - r_{a_t} \right),$$
where $x_t, r_t \sim \rho$ and $a_t \sim \pi(x_t, h_{1:t-1})$ at each round $t$, and $r_{t,a}$ denotes the component of $r_t$ corresponding to action $a$.
The suboptimality of a policy $\pi$ is defined as its average regret after $T$ rounds:
$$\mathrm{SubOpt}(\pi) = \frac{R_T}{T}.$$
In the offline setting, we define a finite dataset consisting of CMAB rounds played with a behavior policy $\beta$:
$$S_\beta = \{(x_i, a_i, r_{a_i})\}_{i=1}^{N}.$$
The goal of an offline CMAB method is to learn an optimal policy $\pi^* = \arg\min_\pi \mathrm{SubOpt}(\pi)$, i.e., the policy that minimizes the suboptimality, using only the bandit feedback in the dataset $S_\beta$.
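To make the definitions above concrete, here is a minimal sketch (our own, not the authors' code) of the interaction loop, the regret $R_T$, and $\mathrm{SubOpt}(\pi)$; the linear-Gaussian environment and the uniformly random behavior policy are hypothetical placeholders, only the bookkeeping mirrors the definitions.

```python
# Sketch of the CMAB interaction loop and the suboptimality metric defined above.
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 4, 8, 1000                      # actions, context dimension, rounds
W = rng.normal(size=(K, D))               # hidden linear reward model (assumption)

def sample_round():
    """Draw (context, reward vector) from a joint distribution rho."""
    x = rng.normal(size=D)
    r = W @ x + 0.1 * rng.normal(size=K)  # reward vector r in R^K
    return x, r

def uniform_policy(x, history):
    """A behavior policy: uniformly random actions (full exploration)."""
    return int(rng.integers(K))

history, regret = [], 0.0
for t in range(T):
    x_t, r_t = sample_round()
    a_t = uniform_policy(x_t, history)
    regret += r_t.max() - r_t[a_t]        # max_a r_{t,a} - r_{a_t}
    history.append((x_t, a_t, r_t[a_t]))  # only bandit feedback is recorded

print("SubOpt =", regret / T)             # average regret after T rounds
```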
3 METHODS
3.1 Behavior Policies
In this section, we briefly describe the four online
learning algorithms used in our research to collect
datasets: Neural Greedy, Bayes by Backprop, Lin-
UCB, and NeuralUCB.
The Neural Greedy method was also used as one of the baselines in (Blundell et al., 2015). At each round, it learns from the observations gathered so far to estimate the reward vectors $r$ with a reward function $f(x; \theta)$, a neural network with parameters $\theta$. The estimate is then used in place of the actual reward to evaluate the policy value and to maximize it by selecting the highest-valued action, $\pi(x) = \arg\max_a f_a(x; \theta)$, where $f_a(x; \theta)$ is the component of the network output corresponding to action $a$. This approach requires an accurate model of rewards, which can be a restrictive assumption in some cases. However, when the reward model is accurate, it provides a reliable estimate of the policy value.
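A minimal PyTorch sketch of this rule, under our own assumptions about the network size and learning rate; it shows the greedy selection $\pi(x) = \arg\max_a f_a(x; \theta)$ and a single online regression step, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

K, D = 4, 8                                   # hypothetical action/context sizes
f = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))  # reward model f(x; theta)
opt = torch.optim.SGD(f.parameters(), lr=1e-2)

def act(x):
    """Greedy action: the highest-valued component of the predicted reward vector."""
    with torch.no_grad():
        return int(f(x).argmax())

def update(x, a, r):
    """Regress the chosen action's predicted reward toward the observed reward."""
    loss = (f(x)[a] - r) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

# Example round with a dummy context and reward:
x = torch.randn(D)
a = act(x)
update(x, a, r=1.0)
```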
Bayes by Backprop is an algorithm proposed in (Blundell et al., 2015) that regularizes the neural network model weights by minimizing a compression cost. It introduces uncertainty in the weights of
the neural network, which can improve generalization
in non-linear regression problems. The uncertainty
is modeled by a variational posterior $q_\theta(w)$ over the weights $w$ of the network, where $\theta$ are the variational parameters learned by the algorithm. The compres-
sion cost is given by the KL divergence between the
variational posterior and a prior p(w), and is min-
imized by adjusting the variational parameters. In
areas where the network is less confident in its pre-
dictions (high uncertainty), Bayes by Backprop ex-
plores more, gathering more data to reduce the un-
certainty. In areas where the network is more con-
fident in its predictions (low uncertainty), Bayes by
Backprop exploits its knowledge, making decisions
that maximize the expected reward. This approach
can help to focus more on approximating the reward
in areas that are important for the policy (high un-
certainty), leading to more accurate estimates of the
policy value and better decision-making overall. This
can be particularly beneficial in addressing the lim-
itations of Neural Greedy, where the reward esti-
mate might focus on approximating the reward in ar-
eas that are irrelevant to the policy (low uncertainty).
The policy used by Bayes by Backprop is given by $\pi(x) = \arg\max_a f_a(x; w)$, where $f_a(x; w)$ is the component corresponding to action $a$ of the output of the neural network with weights $w \sim q_\theta$, sampled from the variational posterior, and input context $x$. Effectively, this policy selects actions that maximize the expected output of the network under the variational posterior over the weights.
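A simplified sketch of this action rule: sample weights $w \sim q_\theta$ and act greedily on the sampled reward model. For brevity we use a mean-field Gaussian posterior over a single linear layer; the real method places it over a full network and trains $\theta = (\mu, \rho)$ by minimizing the KL-regularized loss (Blundell et al., 2015), which is not shown here. All sizes are assumptions.

```python
import torch
import torch.nn.functional as F

K, D = 4, 8                                         # hypothetical sizes
mu = torch.zeros(K, D, requires_grad=True)          # variational means
rho = torch.full((K, D), -3.0, requires_grad=True)  # softplus(rho) = std dev

def sample_weights():
    sigma = F.softplus(rho)
    return mu + sigma * torch.randn_like(sigma)     # reparameterization trick

def act(x):
    w = sample_weights()                            # one posterior sample per decision
    return int((w @ x).argmax())                    # argmax_a f_a(x; w)

a = act(torch.randn(D))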
The LinUCB method is described in (Li et al.,
2010). The authors model personalized recommen-
dation of news articles as a contextual bandit prob-
lem. The LinUCB algorithm assumes that the ex-
pected payoff of an arm is linear in its d-dimensional
feature vector with some unknown coefficient vector.
The algorithm uses ridge regression to estimate the
coefficients θ and computes a confidence interval for
the expected payoff of each arm. The arm with the
highest upper confidence bound (UCB) is selected in
each round. The policy used by LinUCB is given by $\pi(x) = \arg\max_a x^\top \theta_a + \alpha \sqrt{x^\top A_a^{-1} x}$, where $\theta_a$ is the ridge regression estimate of the coefficients for action $a$, $A_a$ is the design matrix for action $a$, and $\alpha$ is a hyperparameter controlling the exploration-exploitation trade-off.
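A compact sketch of this selection rule and its ridge-regression update (disjoint arms, as in Li et al., 2010); the sizes and the value of alpha are arbitrary assumptions.

```python
import numpy as np

K, D, alpha = 4, 8, 1.0
A = [np.eye(D) for _ in range(K)]        # per-arm design matrices A_a
b = [np.zeros(D) for _ in range(K)]      # per-arm reward-weighted context sums

def act(x):
    scores = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta_a = A_inv @ b[a]           # ridge estimate of the coefficients
        ucb = x @ theta_a + alpha * np.sqrt(x @ A_inv @ x)
        scores.append(ucb)
    return int(np.argmax(scores))

def update(x, a, r):
    A[a] += np.outer(x, x)
    b[a] += r * x

x = np.random.default_rng(0).normal(size=D)
a = act(x); update(x, a, r=1.0)
```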
The NeuralUCB method, introduced in (Zhou
et al., 2020), extends the LinUCB approach by using
a neural network f (x; θ) to model the reward func-
tion, allowing it to handle more complex, non-linear
relationships between the context and the expected re-
ward. This is a significant departure from the Lin-
UCB method, which assumes a linear relationship.
The policy used by NeuralUCB is given by $\pi(x) = \arg\max_a f_a(x; \theta) + \mathrm{UCB}(x, a, \theta)$, where $\mathrm{UCB}(x, a, \theta)$ is an exploration bonus described in the paper.
3.2 Offline Methods
In this section, we briefly describe the two offline
policy learning algorithms trained on the collected
datasets: Neural Greedy and NeuraLCB.
The offline variant of the Neural Greedy method
is used in the context of Offline Contextual Multi-
Armed Bandits (CMAB). Unlike its online counter-
part, the offline Neural Greedy operates under the
constraint that it cannot interact with the environment
or collect more data. It aims to learn an optimal policy
solely from a fixed offline dataset of bandit feedback, denoted as $S_\beta$. The method uses a neural network to estimate the reward vectors $r$ from the observations in the dataset $S_\beta$. The policy $\pi(x)$ is then defined as the action that maximizes the estimated reward, i.e., $\pi(x) = \arg\max_a f_a(x; \theta)$, where $f_a(x; \theta)$ is the component corresponding to action $a$ of the output of the neural network.
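A minimal sketch, under our own assumptions about the architecture and optimizer, of offline Neural Greedy: fit $f(x; \theta)$ to the logged rewards of the logged actions only, then act greedily. The paper's batch-mode replay schedule is sketched separately at the end of this subsection; here a plain full-batch regression stands in for it.

```python
import torch
import torch.nn as nn

K, D, N = 4, 8, 512                          # hypothetical sizes
# S_beta: logged contexts, actions, and action rewards (dummy data here)
X = torch.randn(N, D)
A = torch.randint(0, K, (N,))
R = torch.randn(N)

f = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for _ in range(200):                         # simple full-batch regression
    pred = f(X).gather(1, A.unsqueeze(1)).squeeze(1)  # predicted reward of logged action
    loss = ((pred - R) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def policy(x):
    """Greedy policy pi(x) = argmax_a f_a(x; theta) on the fitted model."""
    with torch.no_grad():
        return int(f(x).argmax())
```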
The NeuraLCB method is a novel approach to Offline Contextual Multi-Armed Bandits (CMAB) introduced by (Nguyen-Tang et al., 2022). It uses a neural net-
work to model any bounded reward function with-
out assuming any functional form. The key feature
of NeuraLCB is its pessimistic formulation, which
constructs a lower confidence bound of the reward
functions for decision-making. The lower confidence
bound is computed for each context and action based
on the current network parameter, and the network
is updated by minimizing a regularized squared loss
function using stochastic gradient descent (SGD).
The policy used by NeuraLCB is given by $\pi(x) = \arg\max_a f_a(x; \theta) - \mathrm{LCB}(x, a, \theta)$, where $\mathrm{LCB}(x, a, \theta)$ is the lower-confidence-bound term for action $a$ at context $x$ with network parameters $\theta$.
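A hedged sketch of the pessimistic selection rule only: act on a lower confidence bound of the predicted reward. The uncertainty width below is a stand-in computed from a small ensemble; NeuraLCB itself derives $\mathrm{LCB}(x, a, \theta)$ from the network's gradient features (Nguyen-Tang et al., 2022), which we do not reproduce here, and the sizes and pessimism strength are assumptions.

```python
import torch
import torch.nn as nn

K, D, M = 4, 8, 5                               # actions, context dim, ensemble size
ensemble = [nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))
            for _ in range(M)]                  # stand-in for uncertainty estimation
beta = 1.0                                      # pessimism strength (assumption)

def policy(x):
    with torch.no_grad():
        preds = torch.stack([f(x) for f in ensemble])   # (M, K) reward predictions
        mean, width = preds.mean(0), preds.std(0)
        lcb = mean - beta * width                       # pessimistic value per action
        return int(lcb.argmax())

a = policy(torch.randn(D))
```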
Offline policy learning faces the distributional shift problem (Levine et al., 2020), particularly due to behavior policy mismatch. This problem arises when the policy used to collect the dataset (the behavior policy) differs from the policy learned by the offline method, leading to situations in which the learned policy exploits actions about which it has little to no data. Neural Greedy addresses this problem by assuming uniform data coverage (Brandfonbrener et al., 2021; Xie et al., 2023). Specifically, Neural Greedy
assumes that the behavior policy has already explored
the entire action space to a sufficient degree. How-
ever, this assumption may not always hold in prac-
tice. To address this limitation, NeuraLCB utilizes
a pessimism principle that constructs lower confi-
dence bounds for the reward functions for conserva-
tive decision-making (Rashidinejad et al., 2022). By
doing so, NeuraLCB reduces the requirement for uni-
form data coverage to the single-policy concentration
condition, which ensures that the behavior policy only
needs to cover the target optimal policy, rather than
the entire action space.
In our approach, we adopt the batch-mode training inspired by NeuraLCB. We replay the dataset $S_\beta$ in the order it was collected. During each replay of a round $t$, the reward model is updated using 100 steps of stochastic gradient descent (SGD) on random batches of size 50 sampled from the subset $S_\beta^{(1:t)}$ of the dataset accumulated from round 1 to $t$.
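A sketch of this replay schedule: after revealing round $t$, run 100 SGD steps on batches of size 50 drawn from $S_\beta^{(1:t)}$. Only the schedule follows the text above; the reward model, loss, and learning rate are our own assumptions.

```python
import torch
import torch.nn as nn

K, D = 4, 8
f = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.SGD(f.parameters(), lr=1e-2)

def replay_training(X, A, R, steps=100, batch_size=50):
    """X, A, R hold the dataset S_beta in the order it was collected."""
    N = X.shape[0]
    for t in range(1, N + 1):                      # replay rounds 1..N in order
        for _ in range(steps):
            idx = torch.randint(0, t, (min(batch_size, t),))   # sample from S_beta^(1:t)
            pred = f(X[idx]).gather(1, A[idx].unsqueeze(1)).squeeze(1)
            loss = ((pred - R[idx]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

# Example with a small dummy logged dataset:
N = 20
replay_training(torch.randn(N, D), torch.randint(0, K, (N,)), torch.randn(N))
```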
3.3 Problems
The algorithms are evaluated on real-world problems
obtained from the UCI Machine Learning Repository (Markelle et al.), including Mushroom, Shut-
tle, Adult, and MNIST. These problems are diverse in
terms of size, dominant actions, and stochastic versus
deterministic rewards. Details on each problem can
be found in Table 1.
Table 1: Problems characteristics.

Problem | size | context dimension | number of actions | dominant action | stochastic rewards
Adult | 45,222 | 94 | 14 | >80% of data belongs to half of the actions | no
MNIST | 70,000 | 784 | 10 | balanced | no
Mushroom | 8,124 | 22 | 2 | balanced | yes
Shuttle | 43,500 | 9 | 7 | 80% of data belongs to one action | no

The Mushroom problem contains equal numbers of edible and poisonous mushroom examples. Each mushroom is represented by a set of attributes. The learner's task is to choose from two actions: to eat a given mushroom or not. If the mushroom is edible and the learner chooses to eat it, they receive a reward of +5. On the other hand, if the mushroom is poisonous and the learner chooses to eat it, there is a probability of 0.5 that they receive a reward of +5 and an equal probability that they receive a penalty of -35 instead. If the learner chooses not to eat the mushroom, they always receive a reward of 0.
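For clarity, a small sketch of this reward scheme; the function name and the 0/1 action encoding (1 = eat) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mushroom_reward(is_edible: bool, action: int) -> float:
    if action == 0:                      # do not eat
        return 0.0
    if is_edible:                        # eat an edible mushroom
        return 5.0
    # eat a poisonous mushroom: +5 or -35 with equal probability
    return 5.0 if rng.random() < 0.5 else -35.0
```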
All the other problems are K-class classification
problems:
1. The Adult problem contains personal information from the US Census Bureau database, including occupations, which we take as actions.
2. The MNIST problem contains images of various
handwritten digits from 0 to 9.
3. The Shuttle problem contains data about a space
shuttle flight where the goal is to predict the state
of the radiator subsystem of the shuttle.
We convert these problems into K-armed contex-
tual bandit problems based on the methodology pro-
posed by (Nguyen-Tang et al., 2022). Specifically, the
learner receives a reward of 1 if it selects the action y,
where y is the correct label and 0 otherwise.
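As an illustration, a short sketch (our own, not from the paper) of this conversion, assuming contexts and integer labels stored as NumPy arrays and a uniformly random behavior policy as a placeholder.

```python
import numpy as np

def log_bandit_feedback(contexts, labels, n_actions, seed=0):
    """Turn a K-class classification set into logged bandit feedback."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(n_actions, size=len(labels))   # behavior policy choices
    rewards = (actions == labels).astype(float)           # 1 if correct label, else 0
    return contexts, actions, rewards                     # the offline dataset S_beta
```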
3.4 Datasets
We collect datasets that satisfy the uniform data coverage assumption, the single-policy concentration condition, interpolate between the two, or satisfy neither condition. To do this, we run the following behavior policies to collect the datasets:
1. Uniform and stationary policies that fulfill the uniform data coverage assumption:
   - full exploration, which issues uniformly random actions (we later refer to it as uniformly random data).
2. Non-uniform, but stationary policies that fulfill the single-policy concentration condition:
   - full exploitation, which issues only optimal actions, i.e., actions minimizing the suboptimality.
3. Non-uniform, but still stationary policies that interpolate between the two conditions:
   - ε-greedy, which issues uniformly random actions with probability ε and optimal actions otherwise.
4. Non-uniform and non-stationary policies that might not satisfy either of the two conditions, described in detail in Section 3.1:
   - Neural Greedy,
   - LinUCB,
   - NeuralUCB,
   - Bayes by Backprop.
It is crucial to note that the last category includes
learning algorithms that: 1) will gradually make less
and less diverse decisions and 2) may or may not converge to the optimal solution. That is why they might not satisfy either of the aforementioned conditions.
We run each behavior policy 10 times, with dif-
ferent seeds, for 15,000 rounds in each problem and
record the observed contexts, issued actions, and received action rewards, which constitute our offline datasets.
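For concreteness, a hedged sketch of the ε-greedy collection from item 3 of the list above; the optimal-action oracle and the classification-style 0/1 reward are assumptions (the Mushroom problem uses the reward scheme of Section 3.3 instead). Setting epsilon = 1 recovers full exploration and epsilon = 0 full exploitation.

```python
import numpy as np

def collect_dataset(contexts, optimal_actions, n_actions, epsilon, seed=0):
    """Collect logged bandit feedback with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    data = []
    for x, a_star in zip(contexts, optimal_actions):
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))   # exploratory, uniformly random action
        else:
            a = int(a_star)                    # optimal action for this context
        r = 1.0 if a == a_star else 0.0        # classification-style bandit reward
        data.append((x, a, r))
    return data
```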
4 RESULTS & DISCUSSION
The experimental results show that the performance of offline contextual bandit algorithms depends heavily on the nature of the datasets used for training and on the specific characteristics of the problem being addressed. In this section,
we explore this observation from different angles.
4.1 Offline Training on Datasets with
Varying Degrees of Exploration
Our study investigates whether the NeuraLCB and
Neural Greedy algorithms learn better from datasets
collected with the ε-greedy behavior policy, which
mixes the optimal actions with the uniformly random
actions, or the full exploration behavior policy, which
always picks the actions uniformly at random. Ta-
ble 2 shows that the NeuraLCB algorithm performs
better when trained on datasets coming from behav-
ior policies that are more focused on optimal actions,
with one exception. In Figure 5, the NeuraLCB al-
gorithm performed better when trained on uniformly
random data in the Mushroom problem. This excep-
tion might be related to the method’s failure to learn
from the optimal actions in this particular problem, as described in Section 4.2. Figure 7 reveals that NeuraLCB can fail to achieve stable convergence with the dataset coming from the full exploration behavior policy, which is the case in the Shuttle problem. Table 2 shows that the Neural Greedy offline method performed better when trained on uniformly random data. This was expected, as Neural Greedy requires uniform data coverage for efficient learning under distributional shift (Nguyen-Tang et al., 2022). In Figure 8, in the Shuttle problem, the dataset collected by the uniformly random behavior policy is the only one on which Neural Greedy converges.

Table 2: Which random behavior policy collects datasets on which the offline method achieves better performance (AOC)?

Method | Adult | MNIST | Mushroom | Shuttle
NeuraLCB | ε-greedy | ε-greedy | u. random | ε-greedy
Neural Greedy | u. random | u. random | u. random | u. random

Table 3: Which offline method achieves better performance (AOC) for each dataset?

Dataset | Adult | MNIST | Mushroom | Shuttle
ε-greedy | NeuraLCB | NeuraLCB | NeuraLCB | NeuraLCB
u. random | Neural Greedy | Neural Greedy | Neural Greedy | Neural Greedy
greedy | NeuraLCB | NeuraLCB | both cannot learn | NeuraLCB
Bayes by Backprop | NeuraLCB | Neural Greedy | NeuraLCB | NeuraLCB
LinUCB | NeuraLCB | Neural Greedy | NeuraLCB / both cannot learn | NeuraLCB
Neural Greedy | NeuraLCB / draw | Neural Greedy | NeuraLCB | NeuraLCB
NeuralUCB | NeuraLCB | NeuraLCB | NeuraLCB | NeuraLCB
4.2 Offline Training on Datasets with
Only Optimal Actions
We investigate whether the offline methods can learn
from datasets collected with the full exploitation be-
havior policy, which always picks the optimal actions.
The results show that the Neural Greedy algorithm is unable to learn from datasets consisting solely of optimal actions. In contrast, the NeuraLCB algorithm is capable of learning from such datasets. Neverthe-
less, its performance still benefits from some level of
exploration as can be seen in Figures 1 and 3. The
Mushroom problem is one exception in which Neu-
raLCB cannot learn from the dataset collected by the
full exploitation behavior policy. This is because the dataset does not include the consequences of eating a poisonous mushroom, leading the offline method to learn a suboptimal policy that treats eating mushrooms as having a higher reward than not eating them, regardless of their type. Figure 5 shows that
even a small level of exploration, such as 10% in the
ε-greedy collected dataset, can mitigate this problem.
These findings suggest that offline training of contex-
tual bandits from optimal actions can be challenging,
and at least some level of exploration may be neces-
sary for effective training.
4.3 Offline Training on Datasets
Collected by Learning Behavior
Policies
In this experiment, we look at the offline methods' performance when trained on datasets collected by behavior policies that start from random initialization and actively learn to issue better actions.
As Figures 1 and 2 show, in the Adult problem,
there is not much difference in performance across
the Bayes by Backprop, LinUCB, Neural Greedy, and NeuralUCB behavior policies used for dataset collection. Nor does the performance of the offline methods trained on these datasets differ.
In the MNIST problem, Figures 3 and 4 illustrate a situation where, although the behavior policies' performance differs significantly, the NeuraLCB and Neural Greedy offline methods achieve similar performance across the collected datasets, in most cases exceeding the performance of the corresponding behavior policies. This suggests that the offline methods can pick the optimal solution from a variety of datasets. However, the slower a behavior policy learns, the slower the NeuraLCB algorithm learns on the dataset it collects.
For the Mushroom problem, in Figures 5 and 6, all
the behavior policies reach similar, near-optimal per-
formance, with the LinUCB reaching it much quicker.
The NeuraLCB algorithm quickly converges on the optimal policy across the datasets, except the one collected by LinUCB, on which it struggles to learn. This is most probably because the LinUCB behavior policy quickly starts to issue mostly optimal actions, which causes the difficulties described in Section 4.2. For the Neural Greedy algorithm the situation is similar, but it does not achieve optimal performance; it retains a small bias. It is worth noting that Neural
Greedy can learn the optimal policy from the uni-
formly random data. This suggests that the inability to directly model the uncertainty about the less common actions (in this case, eating poisonous mushrooms, which are rare in datasets collected by mostly optimal behavior policies) contributes to the worse performance of Neural Greedy.
Figures 7 and 8 present the results in the Shuttle problem. Behavior policies reach similar final performance, but at different rates, except for LinUCB, which quickly converges to a worse solution than the other learning behavior policies. Similarly, as in
the MNIST problem, the NeuraLCB algorithm learns
slower if the behavior policy learns slowly. However, it reaches the same final performance across datasets, except for the one collected by LinUCB, in which optimal actions are missing. Generally, the faster a behavior policy converges on some solution, either optimal (see the Neural Greedy behavior policy) or suboptimal (see the LinUCB behavior policy), the more difficulty the Neural Greedy algorithm has learning from the collected dataset. This confirms that the Neural Greedy algo-
rithm requires a substantial degree of exploration in
the behavior policies. This conclusion is further rein-
forced by the observation that Neural Greedy recovers around the middle of training on the dataset collected by the Bayes by Backprop behavior policy, just when Bayes by Backprop again starts to pick more exploratory actions. This increase in exploration can be observed as a slight deterioration in Bayes by Backprop's performance.
4.4 Can NeuraLCB Learn Where
Neural Greedy Fails and Vice
Versa?
At a high level, the aim of these experiments was to investigate whether NeuraLCB can learn from datasets where Neural Greedy fails and vice versa. The results
in Figures 7 and 8 show that, in the Shuttle problem,
NeuraLCB learns well from all datasets, while Neu-
ral Greedy can only learn from the uniformly random
data. Across all problems but the Mushroom, Neu-
raLCB can learn from the datasets consisting solely
of optimal actions, whereas Neural Greedy fails to do
so. Interestingly, NeuraLCB was not always better than Neural Greedy (see Table 3), particularly when trained on the uniformly random data, which is most visible in the Adult problem (Figures 1 and 2) and in the MNIST problem (Figures 3 and 4). However, we argue that it is unreason-
able to expect the deployed policies to be uniformly
random in their decisions in real-world scenarios.
5 RELATED WORK
The existing literature provides valuable insights into
offline policy learning, contextual bandits, and behav-
ior policies. However, the specific exploration of how
different behavior policies influence the performance
of offline policy learning algorithms in the contextual
bandits setting appears to be a novel and underex-
plored aspect.
5.1 Offline Policy Learning
Offline policy learning has been a subject of inter-
est in various domains. Recent works have proposed
frameworks for offline reinforcement learning (Fuji-
moto et al., 2019), with advances like conservative
Q-learning (Kumar et al., 2020), general methods for
data reuse (Xiao and Wang, 2021), and methods bridging offline reinforcement learning and imitation learning (Rashidinejad et al., 2022).
Uncertainty-aware approaches have also gained trac-
tion (Hu et al., 2023), with methodologies like dif-
fusion models being proposed (Wang et al., 2023).
Moreover, information-theoretic considerations in of-
fline policy learning have been explored in (Chen and
Jiang, 2019). However, these works primarily focus
on general frameworks and methodologies rather than
the nuanced influence of different behavior policies
on offline policy learning.
5.2 Contextual Bandits
Contextual bandits have been applied to diverse areas
such as interactive recommendation (Li et al., 2010)
and energy optimization (Vannella et al., 2023). The
domain has seen studies that investigated offline pol-
icy optimization (Nguyen-Tang et al., 2022) and those
that delve into the relationship of offline optimiza-
tion with overparametrized models (Brandfonbrener
et al., 2021). Furthermore, Joachims et al. have stud-
ied the use of a counterfactual risk minimization ap-
proach for training deep networks with logged bandit
feedback (Joachims et al., 2018). While these works
consider aspects of offline datasets, they do not delve
into the specific impact of various behavior policies
on offline policy learning performance.
5.3 Off-Policy Evaluation
There is significant research related to off-policy eval-
uation on offline datasets, touching on aspects like
doubly robust policy evaluation (Dudik et al., 2011),
distributionally robust policy gradients (Yang et al.,
2023), and methodologies like variance-minimizing
augmentation logging (Tucker and Joachims, 2023).
The literature has also included works on neural con-
textual bandits with UCB-based exploration (Zhou
et al., 2020) and PAC-Bayesian approaches (Sakhi
et al., 2022). However, a comprehensive examination
of how different behavior policies directly influence
the efficiency and effectiveness of offline policy learn-
ing algorithms remains an underexplored area.
6 CONCLUSIONS
Our study demonstrates the critical impact of dataset
characteristics on offline policy learning of Contex-
tual Multi-Armed Bandits, offering key insights for
their practical application. The Neural Greedy algo-
rithm requires datasets with a substantial degree of ex-
ploration for effective learning. In practical scenarios, however, it is unreasonable to expect deployed policies to consistently make highly exploratory decisions, e.g., ones based on uniformly random criteria. We ad-
vise employing the NeuraLCB method, as it can learn
effectively from datasets collected by behavior poli-
cies that leverage problem-specific knowledge. The more optimal the actions in the dataset, the better its performance. Nonetheless, we show that NeuraLCB still
benefits from some exploratory actions. We recom-
mend ensuring that each action gets chosen multiple
times in the datasets.
Future work should tune the proportion of exploratory to optimal actions for the best performance. The experiments with learning behavior policies in Section 4.3 could also be extended by dropping early, uninformed decisions and checking how this influences the offline methods' performance.
Our investigation adds a new dimension to the
body of knowledge concerning offline policy learn-
ing. While algorithms undoubtedly form the learning
engine, our research underscores the importance of
fuel quality the offline dataset for the journey
toward efficient offline policy learning and decision-
making. We hope to contribute to the ongoing dia-
logue on improving the implementation of offline pol-
icy learning in real-world scenarios.
REFERENCES
Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2013).
Bandits with Knapsacks. IEEE Computer Society.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. (2015). Weight Uncertainty in Neural Networks.
Brandfonbrener, D., Whitney, W. F., Ranganath, R., and
Bruna, J. (2021). Offline Contextual Bandits with
Overparameterized Models.
Chen, J. and Jiang, N. (2019). Information-Theoretic Con-
siderations in Batch Reinforcement Learning. PMLR.
Dudik, M., Langford, J., and Li, L. (2011). Doubly Robust
Policy Evaluation and Learning.
Dutta, P., Cheuk, M. K., Kim, J. S., and Mascaro, M.
(2019). Automl for contextual bandits. CoRR,
abs/1909.03212.
Fujimoto, S., Meger, D., and Precup, D. (2019). Off-Policy
Deep Reinforcement Learning without Exploration.
PMLR.
Hu, B., Xiao, Y., Zhang, S., and Liu, B. (2023). A Data-
Driven Solution for Energy Management Strategy of
Hybrid Electric Vehicles Based on Uncertainty-Aware
Model-Based Offline Reinforcement Learning.
Joachims, T., Swaminathan, A., and de Rijke, M. (2018).
Deep learning with logged bandit feedback. OpenRe-
view.net.
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C.,
and Faisal, A. A. (2018). The Artificial Intelligence
Clinician learns optimal treatment strategies for sepsis
in intensive care.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020).
Conservative Q-Learning for Offline Reinforcement
Learning. Curran Associates, Inc.
Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
reinforcement learning: Tutorial, review, and perspec-
tives on open problems.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A
contextual-bandit approach to personalized news arti-
cle recommendation.
Markelle, K., Longjohn, R., and Nottingham, K. The UCI Machine Learning Repository.
Nguyen-Tang, T., Gupta, S., Nguyen, A. T., and Venkatesh,
S. (2022). Offline neural contextual bandits: Pes-
simism, optimization and generalization.
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S.
(2022). Bridging offline reinforcement learning and
imitation learning: A tale of pessimism.
Sakhi, O., Chopin, N., and Alquier, P. (2022). Pac-bayesian
offline contextual bandits with guarantees.
Tucker, A. D. and Joachims, T. (2023). Variance-
minimizing augmentation logging for counterfactual
evaluation in contextual bandits.
Vannella, F., Jeong, J., and Proutière, A. (2023). Off-policy
learning in contextual bandits for remote electrical tilt
optimization.
Wang, Z., Hunt, J. J., and Zhou, M. (2023). Diffusion poli-
cies as an expressive policy class for offline reinforce-
ment learning.
Xiao, T. and Wang, D. (2021). A general offline reinforce-
ment learning framework for interactive recommenda-
tion.
Xie, T., Foster, D. J., Bai, Y., Jiang, N., and Kakade, S. M.
(2023). The role of coverage in online reinforcement
learning.
Yang, Z., Guo, Y., Xu, P., Liu, A., and Anandkumar, A.
(2023). Distributionally robust policy gradient for of-
fline contextual bandits.
Zhao, X., Zhang, W., and Wang, J. (2013). Interactive col-
laborative filtering.
Zhou, D., Li, L., and Gu, Q. (2020). Neural contextual
bandits with ucb-based exploration.
Figure 1: The NeuraLCB offline method results in the Adult problem.
Figure 2: The NeuralGreedy offline method results in the Adult problem.
Figure 3: The NeuraLCB offline method results in the MNIST problem.
Figure 4: The NeuralGreedy offline method results in the MNIST problem.
Figure 5: The NeuraLCB offline method results in the Mushroom problem.
Figure 6: The NeuralGreedy offline method results in the Mushroom problem.
Figure 7: The NeuraLCB offline method results in the Shuttle problem.
Figure 8: The NeuralGreedy offline method results in the Shuttle problem.