From Targets to Rewards: Continuous Target Sets in the Algorithmic Search Framework
Milo Knell^{1,a}, Sahil Rane^{1,b}, Forrest Bicker^{1,c}, Tiger Che^{1,d}, Alan Wu^{2,e} and George D. Montañez^{1,f}
^1 AMISTAD Lab, Department of Computer Science, Harvey Mudd College, Claremont, CA 91711, U.S.A.
^2 Department of Computer Science, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Keywords: Algorithmic Search Framework, Satisfaction, Fuzzy Membership.
Abstract:
Many machine learning tasks have a measure of success that is naturally continuous, such as error under a
loss function. We generalize the Algorithmic Search Framework (ASF), used for modeling machine learning
domains as discrete search problems, to the continuous space. Moving from discrete target sets to a continuous
measure of success extends the applicability of the ASF by allowing us to model fundamentally continuous
notions like fuzzy membership. We generalize many results from the discrete ASF to the continuous space and
prove novel results for a continuous measure of success. Additionally, we derive an upper bound for the expected
performance of a search algorithm under arbitrary levels of quantization in the success measure, demonstrating
a negative relationship between quantization and the performance upper bound. These results improve the
fidelity of the ASF as a framework for modeling a range of machine learning and artificial intelligence tasks.
1 INTRODUCTION
The Algorithmic Search Framework (ASF) is a theoretical model that has been used to rigorously study properties of machine learning (ML), artificial intelligence (AI), and search problems (Montañez, 2017; Montañez et al., 2019; Montañez et al., 2021). This framework has been used to bound the performance of learning models, prove trade-offs between bias and expressivity (Lauw et al., 2020), derive generalization bounds for supervised classification (Ramalingam et al., 2022), and quantify performance bounds on transfer learning (Williams et al., 2020). However, one fundamental limitation of the ASF is that it measures the performance of a machine learning algorithm with a binary target set to which elements in the search space (also referred to as hypotheses) either belong or do not belong, rendering them indistinguishable from one another. This limitation makes it impossible to account for the fuzzy membership of hypotheses over a search space where each hypothesis may have a varying degree of fidelity. As a result, the strongest and weakest satisfactory hypotheses are treated equally. By using a continuous metric instead, we can incorporate meaningful information about the relative certainty of our hypotheses, allowing us to both strengthen existing results and prove novel theorems.
a https://orcid.org/0009-0002-2951-8324
b https://orcid.org/0009-0001-3986-1129
c https://orcid.org/0009-0000-9872-7619
d https://orcid.org/0009-0000-3586-4288
e https://orcid.org/0009-0006-2454-4354
f https://orcid.org/0000-0002-1333-4611
Denotes equal authorship.
Many modern applications of machine learning
could benefit from a continuous success measure,
which we term the satisfaction of a hypothesis. Hence,
we propose a degree of satisfaction as a continuous-scale
measure of the quality of a hypothesis function,
instead of having the notion of binary membership
in a target set. Examples of continuous measures of
success include cross-entropy loss, mean squared error,
hinge loss, accuracy, and F1 score. To accurately
model such problems in the ASF, we must account for
continuous membership measures. Prior work avoided
this limitation by implicitly defining some threshold
of acceptability, where the target set was defined as
the set of all elements with acceptable satisfaction values
(Montañez, 2017). By defining such a threshold
we lose information about the underlying satisfaction
structure between different hypotheses in the target set.
However, employing a framework that directly interfaces with the underlying satisfaction structure enables us to generalize the results of the ASF and to measure the performance of machine learning algorithms more effectively, accounting for the fuzzy membership of hypothesis functions.
Knell, M., Rane, S., Bicker, F., Che, T., Wu, A. and Montañez, G.
From Targets to Rewards: Continuous Target Sets in the Algorithmic Search Framework.
DOI: 10.5220/0012370600003636
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 3, pages 558-567
ISBN: 978-989-758-680-4; ISSN: 2184-433X
Proceedings Copyright © 2024 by SCITEPRESS Science and Technology Publications, Lda.
We examine related work on machine learning as
search and on fuzzy membership, rigorously define
the continuous ASF, present novel results, and show
a real-world example applying these novel bounds.
2 RELATED WORK
Machine learning can be modeled as search (Mitchell,
1982; Montañez, 2017). The conversion of machine
learning problems to search problems enables us to
perform a variety of analyses on their performance,
using a mathematical and information-theoretic per-
spective. This approach helps us prove bounds on
the performance of machine learning algorithms and
gain an improved understanding of ‘big picture’ con-
cepts in machine learning such as the bias-expressivity
trade-off (Lauw et al., 2020).
Since its introduction, many results have been
proven within the context of the ASF. For instance,
researchers have demonstrated that a well-performing
machine learning algorithm cannot exist without a
predisposition to a certain group of outcomes (bias)
(Montañez et al., 2019). Defining expressivity as the
variability of outputs a machine learning algorithm
can generate with different training data (Lauw et al.,
2020), the ASF has been employed to prove the ex-
istence of fundamental trade-offs between bias and
expressivity in machine learning (Lauw et al., 2020).
Therefore, the ASF has proven effective in establishing
fundamental properties of machine learning and ma-
chine learning algorithm performance. The framework
has been used to prove ‘famine’ results, such as the
fact that favorable algorithms for a specific task are
scarce (Montañez, 2017).
In the ASF, researchers make several simplifying
assumptions, including the use of binary target sets
to measure the satisfaction of hypotheses (Montañez,
2017). Our work extends the ASF and generalizes
the previously mentioned results. Moreover, we relax
some simplifying assumptions within the ASF and gen-
eralize existing results proven within the framework
to a continuous measure of satisfaction of hypotheses
(Montañez, 2017; Montañez et al., 2019; Lauw et al.,
2020). By extending the framework to account for
continuous measures of satisfaction, we pave the way for
future progress within the ASF, offering a more gen-
eral framework applicable to a larger set of machine
learning problems.
This generalization is especially valuable in the
context of recent machine learning advances that in-
corporate fuzzy membership functions in a variety of
capacities. These uses include improving the accuracy of
models and performing tasks such as image classification
and various engineering applications
(Hüllermeier, 2005; Gottwald, 2005; Resti et al., 2022;
Ghofrani et al., 2014). Moreover, expanding the
framework to encompass continuous target sets broadens its
applicability, making it relevant to a broader range of
machine learning challenges, thereby increasing its
practicality.
3 THE ALGORITHMIC SEARCH
FRAMEWORK (ASF)
3.1 The Search Problem
The ASF recasts machine learning problems as search problems, simplifying proofs for results on their performance. Following Montañez (2017), we model the search problem as a modular system of three parts, $(\Omega, T, F)$, where $\Omega$ represents the discrete, finite search space comprising hypotheses. We search within this space to find an element in the non-empty subset $T$, known as the target set. The external information resource $F$ guides this search, providing initialization information and offering evaluations on sampled elements from the search space to further steer our search. For instance, in a machine learning context, the external information resource $F$ could be a training dataset with an accompanying loss function. Therefore, evaluating the external information resource on a particular element of the search space yields the loss function value for a specific hypothesis.
The target set $T$ corresponds to the set of hypotheses that attain sufficiently high levels of satisfaction on a dataset for some desired threshold value of satisfaction. In the context of machine learning, we can interpret the satisfaction level of a hypothesis as a notion of accuracy or performance on a test dataset. The loss function included in $F$ directs the algorithm in searching through $\Omega$ for a hypothesis in $T$. This implies that we use our training data in our external information resource $F$ to find a hypothesis in the target set $T$.
3.2 The Search Algorithm
The search algorithm A iteratively assigns a probabil-
ity distribution over the search space, drawing from
its search history and the evaluation of the external
information resource on each element, as shown in
Figure 1. The search history comprises a query trace
and a resource evaluation trace. The query trace holds
the history of the elements that have been sampled by
the search algorithm, and the resource evaluation trace
records the history of the evaluations of the external
information resource on these elements. A search algo-
rithm within this framework is considered successful
if it samples an element of the target set during its
search. Importantly, only the external information re-
source, not the target set, guides the algorithm during
the search process.
Figure 1: A black-box search algorithm. Reproduced from (Montañez et al., 2019).
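The iterative process described in Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's reference implementation: the names `black_box_search` and `greedy_propose_factory`, the toy resource, and the convention that lower resource values are better are all hypothetical choices made for the example. The point it shows is structural: the algorithm only ever sees its history and the evaluations of the information resource, never the target set.

```python
import random


def black_box_search(omega, resource, propose, num_queries, seed=0):
    """Run a black-box search; return the history and the sequence of P_i.

    omega    -- list of hypotheses (the finite search space)
    resource -- callable giving the information-resource evaluation F(w)
    propose  -- callable(history) -> list of probabilities over omega
    """
    rng = random.Random(seed)
    history = []        # (query trace entry, resource evaluation trace entry)
    distributions = []  # the distribution P_i assigned at each step
    for _ in range(num_queries):
        p_i = propose(history)
        element = rng.choices(omega, weights=p_i, k=1)[0]
        history.append((element, resource(element)))
        distributions.append(p_i)
    return history, distributions


def greedy_propose_factory(omega, explore=0.2):
    """Hypothetical strategy: favor the best element seen so far.

    Assumes lower resource values are better (e.g. a loss).
    """
    def propose(history):
        n = len(omega)
        if not history:
            return [1.0 / n] * n  # no information yet: uniform distribution
        best, _ = min(history, key=lambda pair: pair[1])
        return [explore / n + (1.0 - explore) * (1.0 if w == best else 0.0)
                for w in omega]
    return propose


# Toy usage: ten hypotheses and an illustrative loss-like resource.
omega = list(range(10))
history, distributions = black_box_search(
    omega, lambda w: (w - 7) ** 2, greedy_propose_factory(omega), num_queries=25)
```

Note that no target set or satisfaction function appears anywhere in the loop; success would be judged afterwards, from outside the algorithm.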
4 CONTINUOUS ASF
4.1 Definitions
The satisfaction measure serves as the continuous case
analog of the target set. It provides an indication of the
quality of a hypothesis in the search space, signifying
how good or bad a particular hypothesis is.
Definition 4.1 (Satisfaction). The satisfaction function $s : \Omega \to [0, c]$ maps from the search space $\Omega$ to real-valued quantities denoted satisfactions. We assume that these satisfactions exist in the range $[0, c]$ where $c$ is a finite, positive real number. It is possible to assume that total satisfaction sums to 1 over the search space, which is achievable without loss of generality since we can linearly transform any satisfaction space to satisfy this property.¹
Definition 4.2 (Continuous Search Problem). Let the tuple $(\Omega, s, F)$ define a search problem. The search space $\Omega$ contains the elements (hypotheses) to be queried/explored. For each $\omega \in \Omega$, $s(\omega)$ denotes the level of satisfaction corresponding to the hypothesis $\omega$. The function $s(\omega)$ can be represented by a vector $s \in \mathbb{R}^{|\Omega|}$ where $s_\omega = s(\omega)$. This deviation from binary membership target sets allows us to account for a continuous measure of satisfaction for hypotheses. $F$ denotes the external information resource available to the learning algorithm, and for each element $\omega \in \Omega$, let $F(\omega)$ be the evaluation of the external information resource corresponding to the element of the search space $\omega$. Thus, the only departure from the classic ASF lies in replacing binary $T$ with continuous satisfaction measure $s$.
¹ One such transformation is $s'(\omega) = \dfrac{s(\omega) - \min_{\omega'} s(\omega')}{\sum_{\omega'' \in \Omega} \left( s(\omega'') - \min_{\omega'} s(\omega') \right)}$.
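The rescaling mentioned in the footnote to Definition 4.1 can be sketched as follows. The function name `normalize_satisfaction` and the sample values are assumptions made for illustration; the code simply subtracts the minimum and divides by the sum of the shifted values so that the result sums to 1.

```python
def normalize_satisfaction(s):
    """Shift by the minimum, then rescale so the satisfactions sum to 1."""
    lowest = min(s)
    shifted = [value - lowest for value in s]
    total = sum(shifted)
    if total == 0:
        # Every hypothesis has the same satisfaction; the rescaling is undefined.
        raise ValueError("all satisfactions are equal")
    return [value / total for value in shifted]


raw = [2.0, 3.0, 5.0, 6.0]          # illustrative raw satisfactions
normalized = normalize_satisfaction(raw)
```

One consequence worth noticing is that the least satisfactory hypothesis is always mapped to a satisfaction of 0 under this particular transformation.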
Definition 4.3 (Expected Per-Query Satisfaction). In Montañez's ASF (Montañez, 2017), success is measured using an expected per-query probability of success metric. In the continuous case, this generalizes to an expected per-query satisfaction metric. We do so by weighting the probability that each element is sampled by a search algorithm with its corresponding satisfaction level. Let $H$ be the history of the search algorithm, $F$ the external information resource, $\tilde{P}$ a sequence of probability distributions over the search space assigned by the search algorithm, and $P_i$ be the probability distribution assigned by the search algorithm over the search space at a time step $i$ in the search history $H$. Formally, we define the expected per-query satisfaction as
$$q(s, F) = \mathbb{E}_{\tilde{P},H}\left[ \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} s^\top P_i \,\middle|\, F \right].$$
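Definition 4.4 (Decomposability). We observe that each $q(s, F)$ can be decomposed into the inner product of $s$ and $\overline{P}_F$:
$$q(s, F) = \mathbb{E}_{\tilde{P},H}\left[ \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} s^\top P_i \,\middle|\, F \right] = s^\top \mathbb{E}_{\tilde{P},H}\left[ \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i \,\middle|\, F \right] = s^\top \overline{P}_F, \qquad (1)$$
where we have defined $\overline{P}_F := \mathbb{E}_{\tilde{P},H}\left[ \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i \,\middle|\, F \right]$ as the expected average conditional distribution on the search space given $F$. Intuitively, this is equivalent to weighting each satisfaction value with its corresponding probability mass in $\overline{P}_F$.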
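The decomposition in Equation (1) can be checked numerically for a single search run. The sketch below is a simplification: it fixes one sequence of distributions and therefore omits the outer expectation over histories, and the helper names and numbers are hypothetical. It shows that averaging $s^\top P_i$ over steps gives the same value as taking the inner product of $s$ with the averaged distribution.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def per_query_satisfaction(s, distributions):
    """Average of s^T P_i over the steps of one search run."""
    return sum(dot(s, p_i) for p_i in distributions) / len(distributions)


def average_distribution(distributions):
    """Element-wise average of the P_i (one run's analog of the averaged distribution)."""
    steps = len(distributions)
    size = len(distributions[0])
    return [sum(p_i[j] for p_i in distributions) / steps for j in range(size)]


s = [0.0, 0.2, 0.8]                      # illustrative satisfaction vector
distributions = [
    [1 / 3, 1 / 3, 1 / 3],               # step 1: uniform
    [0.1, 0.3, 0.6],                     # step 2
    [0.0, 0.2, 0.8],                     # step 3
]
lhs = per_query_satisfaction(s, distributions)
rhs = dot(s, average_distribution(distributions))
```

Both sides agree because the inner product is linear, which is the same linearity argument that lets the expectation be moved inside in Equation (1).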
Our notation is summarized in Table 1. We include
a real-world example in Section 6, anchoring the ASF
to a practical machine-learning problem. Note that
there are many learning processes, and some may dif-
fer from the examples below (for example, an unsuper-
vised learning problem must have a different measure
of success from a supervised classification problem).
The ASF is general enough to encompass any algorith-
mic search problem.
Table 1: Notation.
SYMBOL | DEFINITION
$\Omega$, $\omega$ | Search space, element of search space, e.g., an element could be a set of parameters, like weights.
$s$, $s(\omega)$ | Satisfaction vector, satisfaction of element $\omega$, e.g., how good a particular hypothesis is, such as performance on the test set.
$T$, $t$ | Target set in the binary ASF (the equivalent of $s$ in the continuous ASF), e.g., a set of hypotheses that are sufficiently satisfactory, possibly performing above some threshold on the test set. The binary vector representation of this set is given by $t$.
$F$, $F(\omega)$ | External information resource, evaluation of external information resource on an element $\omega$, e.g., our training measure of how good a hypothesis is, such as training data and some loss function.
$q(s, F)$ | Expected per-query satisfaction.
$q(t, F)$ | Expected per-query probability of success in ASF.
$\phi$ | Decomposable satisfaction metric, e.g., $q$.
$\tau_k$ | A closed-under-permutation set of $s$ such that all $s^\top \mathbf{1} = k$.
$D_{\tau_k}$ | Distribution over a set of satisfaction vectors in $\tau_k$.
$\mathcal{A}$ | An abstract search algorithm which iteratively explores the search space.
$P_i$ | Distribution assigned by $\mathcal{A}$ over $\Omega$ at step $i$ during the iterative search process. This could be conditioned on $F$.
$\tilde{P}$ | Sequence of probability distributions assigned by algorithm $\mathcal{A}$; each element is a $P_i$. This could be conditioned on $F$.
$\overline{P}_F$ | Expected averaged conditional distribution assigned by the search algorithm given the external information resource $F$.
5 RESULTS
5.1 Famine of Favorable Satisfactions
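Theorem 5.1. For fixed $k \in \mathbb{R}_{\geq 0}$, fixed information resource $F$, decomposable, non-negative satisfaction metric $\phi$ such as $q$, and minimum acceptable per-query satisfaction $q_{\min}$, we define
$$\tau_k = \left\{ s \in \mathbb{R}^{|\Omega|} \;\middle|\; \sum_{i=1}^{|\Omega|} s_i = k \right\}, \quad \text{and} \quad \tau_{q_{\min}} = \left\{ s \in \tau_k \mid \phi(s, F) \geq q_{\min} \right\}.$$
Then
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} \leq \frac{p}{q_{\min}}$$
where $p$ is the expected per-query satisfaction under uniform random sampling and $\mu$ is Lebesgue measure.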
This theorem shows that the proportion of satisfaction functions for which our algorithm performs extremely well (with at least $q_{\min}$ expected satisfaction) is small. In most practical applications, $p$ is extremely small, as $k$ will typically be extremely small in comparison to the size of the search space. The upper bound on this proportion is inversely proportional to the threshold value $q_{\min}$, so doubling the required satisfaction halves the bound.
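A small Monte Carlo experiment can illustrate the famine bound. The sketch below makes several assumptions that are not in the theorem statement: satisfaction vectors are restricted to be non-negative so that uniform sampling over $\tau_k$ is well defined (here via normalized exponential draws), the averaged distribution `p_bar` is an arbitrary fixed vector, and $p$ is taken to be $k/|\Omega|$, the expected satisfaction of a uniformly sampled element. All names and numbers are illustrative.

```python
import random


def sample_simplex(n, k, rng):
    """Draw s uniformly from {s >= 0, sum(s) = k} via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [k * d / total for d in draws]


def famine_check(p_bar, k, q_min, trials=20000, seed=0):
    """Estimate the fraction of satisfaction vectors with s^T p_bar >= q_min."""
    rng = random.Random(seed)
    n = len(p_bar)
    hits = 0
    for _ in range(trials):
        s = sample_simplex(n, k, rng)
        if sum(si * pi for si, pi in zip(s, p_bar)) >= q_min:
            hits += 1
    fraction = hits / trials
    bound = (k / n) / q_min   # p / q_min, with p = k / |Omega| (assumption)
    return fraction, bound


# A fixed, fairly concentrated averaged distribution over six hypotheses.
p_bar = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
fraction, bound = famine_check(p_bar, k=1.0, q_min=0.3)
```

In runs of this kind the empirical fraction should sit at or below the bound, and often well below it, since Markov's inequality (the tool used in the proof) is generally loose.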
5.2 Success Under Dependence
Theorem 5.2. Let $c$ be a finite positive constant, and restrict $s$ to an arbitrary quantization $Q = \{ i \cdot c \mid i \in \{1, \ldots, m\} \}$. Let $\tau_k$ be the set of satisfaction vectors such that $\tau_k = \{ s \mid s \in Q^{|\Omega|},\ s^\top \mathbf{1} = k \}$, and let $H(U_{\tau_k;|\Omega|})$ denote the information-theoretic entropy of the uniform distribution over $\tau_k$ for a search space of cardinality $|\Omega|$.
Let the satisfaction vector $S \sim D_{\tau_k}$ be a vector-valued random variable over the set $\tau_k$. Let $X$ be the random variable such that $X \sim \overline{P}_F$ over the elements of $\Omega$. $S(X)$ is similar to $s(\omega)$, except we are dealing with random variables $S$ and $X$ rather than specific realizations $s$ and $\omega$. Then for any non-negative constant $u$,
$$\Pr(S(X) \geq u) \leq \frac{I(S; F) + D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k}) + H(S(X) \mid X)}{H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1})}.$$
Theorem 5.2 provides a bound on the probability of sampling an element with a sufficiently large satisfaction defined by threshold $u$. This expression tells us that the upper bound of the probability of success monotonically improves as dependence between the satisfaction vector values and information resource values increases. The term $D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k})$ represents the Kullback-Leibler (KL) divergence between the actual distribution over the set $\tau_k$, $D_{\tau_k}$, and the uniform distribution over the same set, $U_{\tau_k}$. This can be interpreted as the predictability of the distribution of satisfaction vectors, where large values of KL-divergence represent more probability mass concentrated on a few elements. The $H(S(X) \mid X)$ term indicates the conditional entropy or surprisal associated with the possible satisfaction values for an element $X$ sampled from the search space. This term is large when there are large variations in the satisfaction values, thus resulting in an increase in the upper bound for the probability of sampling an element with a sufficiently large satisfaction value. The denominator in the bound essentially serves as a normalizing factor appropriately scaling the value of the upper bound.
We see that our upper bound increases with an in-
crease in the predictability of the satisfaction vector,
the dependence between the satisfaction vector and
the external information resource, and the conditional
entropy in the satisfaction values associated with an el-
ement in the search space. Thus, this theorem gives us
an interpretable upper bound on the probability of sam-
pling an element with a sufficiently large satisfaction
value. Moreover, this theorem is particularly useful
to allow us to determine situations where we cannot
expect to perform well.
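To make the roles of the terms concrete, the following sketch evaluates the right-hand side of the bound in Theorem 5.2 from user-supplied values. It relies on assumptions: the entropy of the uniform distribution over $\tau_k$ is computed with the counting formula given in the appendix proof, $\log \binom{|\Omega| + k/c - 1}{k/c}$; logarithms are taken base 2; and the result is clipped at 1 because a probability bound above 1 is vacuous. The function names and the input values are hypothetical and are not taken from the paper's experiments.

```python
import math


def uniform_entropy(omega_size, k, c):
    """log2 of the number of ways to place k/c units of size c on omega_size elements."""
    units = round(k / c)
    return math.log2(math.comb(omega_size + units - 1, units))


def success_bound(mutual_info, kl_div, cond_entropy, omega_size, k, u, c):
    """Right-hand side of the Theorem 5.2 bound (requires 0 < u <= k)."""
    numerator = mutual_info + kl_div + cond_entropy
    denominator = (uniform_entropy(omega_size - 1, k, c)
                   - uniform_entropy(omega_size - 1, k - u, c))
    return min(1.0, numerator / denominator)


# Illustrative inputs only: 1001 hypotheses, k = 1, spacing c = 0.5, threshold u = 0.5.
example_bound = success_bound(mutual_info=0.5, kl_div=0.0, cond_entropy=0.5,
                              omega_size=1001, k=1.0, u=0.5, c=0.5)
```

Holding the other inputs fixed, raising `mutual_info` raises the bound, which mirrors the observation that more dependence between the satisfaction vector and the information resource leaves more room for success.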
5.3 Expected Satisfaction Under
Dependence
Theorem 5.3. We will continue using all the definitions from Theorem 5.2. Let
$$q = \mathbb{E}[S(X)] = \sum_{c \in Q} \Pr(S(X) = c) \cdot c.$$
Then,
$$q \leq \frac{I(S; F) + D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k}) + H(S(X) \mid X)}{\frac{1}{c_m}\left( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c_0};|\Omega|-1}) \right)},$$
where $c_0 = \inf Q$ and $c_m = \sup Q$.
Extending from the bound on $\Pr(S(X) \geq u)$ presented
in Theorem 5.2, we present a similar bound for
the expected satisfaction, i.e., $q = \mathbb{E}[S(X)]$, without the
need to specify a target satisfaction value defined by a
constant threshold $u$. Compared to Theorem 5.2, this
bound gives more context of the search problem, and
can serve as a more robust metric since it’s not suscep-
tible to the skewness and kurtosis of the distribution
of satisfaction values over the search space.
The interpretation of this bound is similar to The-
orem 5.2 with a small change in the scaling factor
in the denominator. Here, the bounded quantity is
the expected satisfaction instead of the probability of
exceeding a certain satisfaction. Comparing the two
bounds in 5.2 and 5.3, we see that the bound in 5.3
is useful when sub-optimal elements contribute to the
success of the search algorithm, whereas the bound
in 5.2 is useful when sub-optimal elements do not
contribute to the success of the search algorithm.
5.4 Difference in Satisfaction
We next quantify and bound the difference between
expected per-query satisfaction (i.e., for continuous
targets) and expected per-query probability of suc-
cess (i.e., for binary targets), beginning with a helpful
lemma.
Lemma 5.4. Let $g$ be the threshold value for converting a continuous target set into a discrete (binary) target set where all elements with satisfaction greater than or equal to the threshold $g$ are included in the target set and the rest are excluded. Given a probability vector $w$, satisfaction vector $s$, target vector $t$, and vector $v = s - t$,
$$|v^\top w| \leq \max(1 - g, g).$$
Theorem 5.5. Let $\overline{P}_{F,s}$ be the average conditional distribution assigned by the search algorithm under a continuous satisfaction measure, and let $\overline{P}_{F,t}$ be the averaged conditional distribution assigned by the search algorithm under a discrete target set. Let $r$ be the maximum rounding amount defined as $\max(1 - g, g)$. Then,
$$|q(s, F) - q(t, F)| \leq |T| \sqrt{\tfrac{1}{2} D_{KL}\left( \overline{P}_{F,s} \,\|\, \overline{P}_{F,t} \right)} + r.$$
Theorem 5.5 bounds the difference in the success
measure in the discrete and continuous case using the
KL-divergence between the distributions learned in
the continuous and discrete cases. It indicates that the
potential for improved performance obtained by tran-
sitioning from a discrete target set to continuous target
sets relies on the chosen threshold value
g
and the size
of the target set. The degree of divergence between the
outputs of the search algorithm in the two scenarios is
measured by KL-divergence. This relationship is logi-
cal since the potential for performance improvement
between the case with discrete and continuous target
sets should be proportional to the amount of rounding
required, which is related to both the threshold and
the size of the target set. By transitioning from using
discrete to continuous target sets, we also have the
potential to gain from the divergence between the av-
erage conditional probability mass functions produced
by the search algorithm in both cases. This is because
the external information resource has the potential to
be more useful in the case of a continuous satisfaction
measure.
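The rounding term $r$ comes from Lemma 5.4, which can be checked numerically. The sketch below assumes satisfactions lie in $[0, 1]$, so that thresholding at $g$ changes each entry by at most $\max(1 - g, g)$: an included element moves up by at most $1 - g$, and an excluded element moves down by less than $g$. Since $w$ is a probability vector, $v^\top w$ is a weighted average of those per-entry changes. The function name and the random test data are illustrative.

```python
import random


def rounding_gap(s, w, g):
    """Return |v^T w| for v = s - t, where t is s thresholded at g."""
    t = [1.0 if value >= g else 0.0 for value in s]
    return abs(sum((si - ti) * wi for si, ti, wi in zip(s, t, w)))


rng = random.Random(1)
g = 0.7
worst = 0.0
for _ in range(1000):
    s = [rng.random() for _ in range(8)]       # satisfactions in [0, 1)
    raw = [rng.random() for _ in range(8)]
    norm = sum(raw)
    w = [x / norm for x in raw]                # a random probability vector
    worst = max(worst, rounding_gap(s, w, g))
```

Across the random trials, `worst` should never exceed `max(1 - g, g)`, which is 0.7 in this configuration.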
6 EXAMPLE
6.1 Setup
To anchor this theoretical framework, we provide an example of how it can be applied to a simple machine learning regression problem. We first create an independent variable $X$, then use some stochastic data generating process to obtain our dependent variable $Y$. For the purposes of this example we use $Y = 2X + 5 + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, 100)$.
Suppose we try to model the data from this learning problem with a linear regression of the form $Y = aX + b$. We would then be able to model the training process within the ASF as a search over the space of possible learned parameters. We consider a finite set of values from which to take $a$ and $b$. Let $A$ and $B$ be sets of evenly spaced numbers in the finite intervals $[a_{\min}, a_{\max}]$ and $[b_{\min}, b_{\max}]$ with a step size $x$. That is, $A = \{a_{\min}, a_{\min} + x, \ldots, a_{\max}\}$ and likewise for $B$. Then, $\Omega = A \times B$, the Cartesian product of $A$ and $B$. For this example, we selected $A = [0, 4]$ with a step size of $0.01$ and $B = [1, 7]$, also with a step size of $0.01$. In general, these ranges are chosen by inspecting the distribution of $X$ and $Y$.
We must also determine an external information resource $F$ that will guide our search through the parameter space. For this regression problem, we use the mean squared error (MSE) of our hypothesized model calculated on the training set. That is, for a particular hypothesis $\omega$ corresponding to a pair of parameters $(a, b)$ where $a \in A$ and $b \in B$, the evaluation of the external information resource is the mean squared error on the training set, or, $F(\omega) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - (aX_i + b) \right)^2$.
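The setup above can be sketched directly. Several details here are assumptions rather than the authors' experimental code: the distribution of $X$ (uniform on $[0, 10]$), the sample size, reading $\mathcal{N}(0, 100)$ as variance 100 (standard deviation 10), and a coarse step of 0.5 in place of 0.01 to keep the pure-Python example fast. The helper names are hypothetical.

```python
import random


def make_grid(lo, hi, step):
    """Evenly spaced values lo, lo + step, ..., hi."""
    count = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(count)]


def mse(a, b, xs, ys):
    """Mean squared error of the hypothesis y = a * x + b on the data."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def evaluate_resource(grid_a, grid_b, xs, ys):
    """F(w) for every hypothesis w = (a, b) in the Cartesian product A x B."""
    return {(a, b): mse(a, b, xs, ys) for a in grid_a for b in grid_b}


rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]        # assumed design for X
ys = [2 * x + 5 + rng.gauss(0.0, 10.0) for x in xs]      # noise std 10 (variance 100)

grid_a = make_grid(0.0, 4.0, 0.5)   # coarse stand-in for A = [0, 4], step 0.01
grid_b = make_grid(1.0, 7.0, 0.5)   # coarse stand-in for B = [1, 7], step 0.01
resource = evaluate_resource(grid_a, grid_b, xs, ys)
best_hypothesis = min(resource, key=resource.get)
```

With noisy training data, `best_hypothesis` need not equal the generating parameters (2, 5); on noiseless data the minimum of the training MSE falls exactly on that grid point.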
The overarching goal of the search algorithm is
to find elements in the search space that have large
satisfaction values associated with them. Our search
algorithm determines the distribution $\overline{P}_F$ to attempt
to maximize $s^\top \overline{P}_F$, but it only has the information
provided by $F$. In this example, the satisfaction values
can be interpreted as the mean squared error on the
test set. While we perform our search we don’t use
these values of satisfaction to guide our search, only
the evaluation of $F$. In machine learning terms, the
algorithm does not have access to the test data during
the training step.
6.2 Example Result
Let us evaluate the bound presented in Theorem 5.2
both on differing levels of quantization and with dif-
ferent evaluation scenarios to gain more insight into
the theorem. If our training data was generated by the
same process as our testing data and success measured
similarly, we would expect our mean squared error on
the training data to be reflective of the mean squared
error of the test data, thus giving us a high satisfac-
tion for a trained model. However, if the test data was
generated via a different process or the measure of suc-
cess between training and testing data were different,
we would not expect our trained model to have high
satisfaction.
Consider a case where mean squared error is used
to evaluate a hypothesized model on the training data
in the information resource, but the satisfaction mea-
sure is the mean absolute error. In this case, the infor-
mation a learning algorithm is guided by during search
is systematically different from the information it is
supposed to learn. This would be reflected by a lower
value in the mutual information term I(S; F).
By making reasonable assumptions about the structure of our example problem, we can compute the value of $I(S; F)$ and the bound for Theorem 5.2. First, we set $k = 1$, and $|\Omega| = |A| \times |B| = 400 \times 600$ (the bounds assigned in the previous section). We compute the bounds for levels of quantization $m = 2$ (binary) and $m = 3$ (ternary). For binary we set $c = 0.5$ so $s_i \in \{0, 0.5\}$. For ternary, we set $c = \frac{1}{3}$ so $s_i \in \{0, \frac{1}{3}, \frac{2}{3}\}$. We assume that $D_{\tau_k} = U_{\tau_k;|\Omega|}$, that is, $S \sim U_{\tau_k;|\Omega|}$. While this assumption is not necessary, it simplifies our calculations.
We report these results in Table 2. The column Match means that MSE is used for evaluation in both the train and test phases (i.e., for both $F$ and $s$), while the column Not Match means that MSE was used in the train phase and MAE was used in the test phase. The rows $m = 2$ and $m = 3$ correspond to a binary and a ternary level of quantization, respectively.
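Table 2: Example Results.
      | Match | Not Match
m = 2 | $I(S;F) = 0.60$, $\Pr(S(X) \geq \frac{1}{2}) \leq 0.31$ | $I(S;F) = 0.27$, $\Pr(S(X) \geq \frac{1}{2}) \leq 0.14$
m = 3 | $I(S;F) = 0.97$, $\Pr(S(X) \geq \frac{2}{3}) \leq 0.34$ | $I(S;F) = 0.55$, $\Pr(S(X) \geq \frac{2}{3}) \leq 0.19$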
From the table, we can observe that having the same evaluation metric for our train and test set raises our potential for performing well. This change is largely driven by the higher mutual information term $I(S; F)$ in the Match column compared with the Not Match column. Comparing across levels of quantization does not necessarily make sense, especially since the selection of $u$ (i.e., $\frac{1}{2}$ and $\frac{2}{3}$) differs in the two cases based on the level of quantization.
This demonstrates how these bounds can be applied
to real-world problems, and shows how changes
in $I(S; F)$ influence our ability to perform well. Our
ability to do well on a problem is limited by the quality
of our information with respect to what we are trying
to learn. In other words: garbage in, garbage out.
7 CONCLUSION
We extend the Algorithmic Search Framework from
discrete target sets to a continuous measure of success,
addressing one of the framework's core limitations
and increasing its versatility. We generalize theorems
previously proven using the discrete ASF to the con-
tinuous and quantized cases, and derive novel results.
Specifically, we prove an upper bound on performance
under an arbitrary level of quantization, demonstrating
that increasing the granularity of our success metric
reduces our maximum theoretical performance. We
bound the absolute difference in performance between
the binary and continuous cases. We provide an example
of how the ASF can be applied to a regression
problem and show how different processes for generat-
ing data or measuring success change key terms, like
I(S; F), thus varying our bound on performance.
These results improve the ability of the ASF to
model machine learning problems that naturally have
continuous measures of success, unlocking the poten-
tial to further the body of existing ASF research. There
remain many opportunities for extension. One possible
application of this framework is that it can be used for
an information theory-based analysis of auto-ML algo-
rithms by giving us a framework to better understand
the performance of this domain of machine learning
algorithms. Strengthening this theoretical framework
will give researchers the tools to analyze learning algo-
rithms with a naturally continuous measure of success.
REFERENCES
Ghofrani, F., Helfroush, M. S., Danyali, H., and Kazemi, K.
(2014). Improving the performance of machine learn-
ing algorithms using fuzzy-based features for medical
x-ray image classification. In Journal of Intelligent &
Fuzzy Systems, volume 6, pages 3169–3180.
Gottwald, S. (2005). Mathematical aspects of fuzzy sets
and fuzzy logic: Some reflections after 40 years. In
Mathematical aspects of fuzzy sets and fuzzy logic:
Some reflections after 40 years, volume 156, pages
357–364. 40th Anniversary of Fuzzy Sets.
Hüllermeier, E. (2005). Fuzzy methods in machine learning
and data mining: Status and prospects. In Fuzzy
methods in machine learning and data mining: Sta-
tus and prospects, volume 156, pages 387–406. 40th
Anniversary of Fuzzy Sets.
Lauw, J., Macias, D., Trikha, A., Vendemiatti, J., and
Montañez, G. D. (2020). The Bias-Expressivity Trade-off.
In Rocha, A. P., Steels, L., and van den Herik, H. J.,
editors, Proceedings of the 12th International Confer-
ence on Agents and Artificial Intelligence, ICAART
2020, Volume 2, Valletta, Malta, February 22-24, 2020,
pages 141–150. SCITEPRESS.
Mitchell, T. M. (1982). Generalization as search. Artificial
Intelligence, 18(2):203–226.
Montañez, G. D. (2017). The Famine of Forte: Few Search
Problems Greatly Favor Your Algorithm. In 2017
IEEE International Conference on Systems, Man, and
Cybernetics (SMC), pages 477–482. IEEE.
Montañez, G. D., Bashir, D., and Lauw, J. (2021). Trading
Bias for Expressivity in Artificial Learning. In
Rocha, A. P., Steels, L., and van den Herik, J., edi-
tors, Agents and Artificial Intelligence, pages 332–353,
Cham. Springer International Publishing.
Montañez, G. D., Hayase, J., Lauw, J., Macias, D., Trikha,
A., and Vendemiatti, J. (2019). The Futility of Bias-
Free Learning and Search. In 32nd Australasian Joint
Conference on Artificial Intelligence, pages 277–288.
Springer.
Ramalingam, R., Dice, N. E., Kaye, M. L., and Montañez,
G. D. (2022). Bounding Generalization Error Through
Bias and Capacity. In 2022 International Joint Confer-
ence on Neural Networks (IJCNN), pages 1–8.
Resti, Y., Irsan, C., Amini, M., Yani, I., Passarella, R., and
Zayantii, D. A. (2022). Performance improvement
of decision tree model using fuzzy membership func-
tion for classification of corn plant diseases and pests.
In Performance Improvement of Decision Tree Model
using Fuzzy Membership Function for Classification
of Corn Plant Diseases and Pests, volume 7, page
284–290.
Williams, J., Tadesse, A., Sam, T., Sun, H., and Montañez,
G. D. (2020). Limits of Transfer Learning. The Sixth
International Conference on Machine Learning, Opti-
mization, and Data Science (LOD 2020).
APPENDIX
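Theorem 5.1. For fixed $k \in \mathbb{R}_{\geq 0}$, fixed information resource $F$, decomposable, non-negative satisfaction metric $\phi$, and minimum acceptable per-query satisfaction $q_{\min}$, we define
$$\tau_k = \left\{ s \in \mathbb{R}^{|\Omega|} \;\middle|\; \sum_{i=1}^{|\Omega|} s_i = k \right\}, \quad \text{and} \quad \tau_{q_{\min}} = \left\{ s \in \tau_k \mid \phi(s, F) \geq q_{\min} \right\}.$$
Then
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} \leq \frac{p}{q_{\min}}$$
where $p$ is the per-query expected satisfaction under uniform random sampling and $\mu$ is Lebesgue measure.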
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
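Proof. Under uniform sampling on $\tau_k$, we have
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} = \Pr(\varphi(s, F) \geq q_{\min}) \leq \frac{\mathbb{E}_{U[\tau_k]}[\varphi(s, F)]}{q_{\min}},$$
where $U[\tau_k]$ is the uniform distribution over $\tau_k$, $s \sim U[\tau_k]$, and the second step follows from Markov's inequality. By the decomposability of $\varphi$ and linearity of expectation, we have:
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} \leq \frac{\mathbb{E}_{U[\tau_k]}\big[s^\top P_{\varphi,F}\big]}{q_{\min}} = \frac{\mathbb{E}_{U[\tau_k]}\big[s^\top\big] P_{\varphi,F}}{q_{\min}}.$$
As $U[\tau_k]$ is uniform, $\mathbb{E}_{U[\tau_k]}[s^\top] = p \cdot \mathbf{1}^\top$. It follows that
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} \leq \frac{p \cdot \mathbf{1}^\top P_{\varphi,F}}{q_{\min}}.$$
Furthermore, as $P_{\varphi,F}$ is a probability distribution, $\sum_{\omega \in \Omega} P_{\varphi,F}(\omega) = 1$ and $\mathbf{1}^\top P_{\varphi,F} = 1$. Hence, we conclude
$$\frac{\mu(\tau_{q_{\min}})}{\mu(\tau_k)} \leq \frac{p}{q_{\min}}.$$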
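The Markov-inequality argument above can be sanity-checked numerically. The following is a minimal Monte Carlo sketch, not part of the original proof. It assumes that $s$ is restricted to non-negative entries (so that $\tau_k$ is a scaled simplex on which uniform sampling is well defined), that $\varphi(s, F) = s^\top P$ for a fixed probability vector `P` whose values are hypothetical, and that $p = k/|\Omega|$ by symmetry. The function names are illustrative.

```python
import random

def sample_simplex(n, k, rng):
    """Draw s uniformly from {s >= 0, sum(s) = k} via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [k * d / total for d in draws]

def markov_check(P, k, q_min, trials=20000, seed=0):
    """Estimate mu(tau_qmin) / mu(tau_k) and return it with the bound p / q_min."""
    rng = random.Random(seed)
    n = len(P)
    hits = 0
    for _ in range(trials):
        s = sample_simplex(n, k, rng)
        phi = sum(si * pi for si, pi in zip(s, P))  # decomposable: phi = s^T P
        if phi >= q_min:
            hits += 1
    ratio = hits / trials   # Monte Carlo estimate of the measure ratio
    p = k / n               # E[s_i] under uniform sampling (assumed reading of p)
    return ratio, p / q_min

P = [0.4, 0.3, 0.2, 0.1]    # hypothetical probability vector, not from the paper
ratio, bound = markov_check(P, k=1.0, q_min=0.3)
print(ratio, bound)
```

With these hypothetical inputs the bound is $0.25 / 0.3$, and the estimated ratio should fall below it; the gap illustrates how loose Markov's inequality can be.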
Theorem 5.2. Let $X$ be the random variable such that $X \sim P_F$. For any non-negative $u \in Q$:
$$\Pr(S(X) \geq u) \leq \frac{I(S; F) + D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k}) + H(S(X) \mid X)}{H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1})}.$$
Proof. $Q$ is a set of integer multiples of a constant spacing $c$ (an arbitrary quantization) and can be expressed as $Q = \{0, c_0, c_1, \ldots, c_m\}$, where $c_0$ corresponds to the minimum positive value and $c_m$ corresponds to the maximum value. Now, observe that
$$H(U_{\tau_k;|\Omega|}) = \log(|\tau_k|) = \log \binom{|\Omega| + \frac{k}{c} - 1}{\frac{k}{c}}.$$
Note that $H(U_{\tau_k;|\Omega|})$ is monotonically increasing in $|\Omega|$ and $k$. For notational simplicity, let $P_g = \Pr(S(X) \geq u)$. We see that:
$$\begin{aligned}
H(S \mid S(X), X) &= (1 - P_g)\, H(S \mid S(X) < u, X) + P_g\, H(S \mid S(X) \geq u, X) \\
&\leq (1 - P_g)\, H(U_{\tau_k;|\Omega|-1}) + P_g\, H(U_{\tau_{k-u};|\Omega|-1}) \\
&= H(U_{\tau_k;|\Omega|-1}) - P_g \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1}) \big).
\end{aligned}$$
The inequality follows from the fact that the entropy of any distribution over $s$ is no larger than the entropy of the uniform distribution over $s$.
Also, by the chain rule of conditional entropy,
$$H(S, S(X) \mid X) = H(S \mid S(X), X) + H(S(X) \mid X),$$
$$H(S, S(X) \mid X) = H(S(X) \mid S, X) + H(S \mid X) = H(S \mid X) + 0.$$
By the data processing inequality (with the Markov chain $S \to F \to X$),
$$\begin{aligned}
H(S \mid F) &\leq H(S \mid X) \\
&= H(S, S(X) \mid X) \\
&= H(S \mid S(X), X) + H(S(X) \mid X).
\end{aligned}$$
Then, we have:
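$$\begin{aligned}
H(S \mid F) &\leq H(S \mid S(X), X) + H(S(X) \mid X) \\
&\leq \underbrace{H(U_{\tau_k;|\Omega|-1})}_{\leq H(U_s)} - P_g \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1}) \big) + H(S(X) \mid X),
\end{aligned}$$
and by the definition of conditional entropy, $H(S \mid F) = H(S) - I(S; F)$, so
$$H(S) - I(S; F) \leq H(U_s) - P_g \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1}) \big) + H(S(X) \mid X).$$
Rearranging for $P_g$ and using $H(U_s) - H(S) = D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k})$, where $S \sim D_{\tau_k}$ and $U_s = U_{\tau_k}$, gives the result. Thus,
$$\Pr(S(X) \geq u) \leq \frac{I(S; F) + D_{KL}(D_{\tau_k} \,\|\, U_{\tau_k}) + H(S(X) \mid X)}{H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-u};|\Omega|-1})}.$$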
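To make the quantities in this bound concrete, here is a small sketch of how the stars-and-bars entropy and the right-hand side of Theorem 5.2 could be evaluated. It is an illustration under stated assumptions, not the authors' code: everything is measured in integer units of the spacing $c$ (so `k_units` stands for $k/c$ and `u_units` for $u/c$), entropies are in bits, and the numeric inputs for $I(S;F)$, the KL term, and $H(S(X) \mid X)$ are hypothetical placeholders.

```python
from math import comb, log2

def tau_size(n, units):
    """Stars and bars: number of length-n non-negative integer vectors summing to `units`."""
    if units < 0:
        return 0
    return comb(n + units - 1, units)

def entropy_uniform(n, units):
    """H(U_{tau_k; n}) in bits, where k = units * c."""
    return log2(tau_size(n, units))

def theorem_5_2_bound(mi, kl, h_sx_given_x, n, k_units, u_units):
    """Right-hand side of Theorem 5.2 (requires u_units >= 1 so the denominator is positive)."""
    denom = entropy_uniform(n - 1, k_units) - entropy_uniform(n - 1, k_units - u_units)
    return (mi + kl + h_sx_given_x) / denom

# Hypothetical inputs, chosen only to exercise the formula.
bound = theorem_5_2_bound(mi=0.5, kl=0.0, h_sx_given_x=0.7, n=10, k_units=4, u_units=2)
print(bound)
```

Because the denominator grows with `u_units`, raising the threshold $u$ tightens the bound for fixed numerator terms, which matches the monotonicity remark in the proof.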
Theorem 5.3. Let $Q$ be a set of integer multiples of a constant spacing $c$, expressed as $Q = \{0, c_0, c_1, \ldots, c_m\}$, where $c_0$ corresponds to the minimum positive value and $c_m$ corresponds to the maximum value. Let $P_c = \Pr(S(X) = c)$ and $q = \sum_{c \in Q,\, c \neq 0} P_c\, c$. Then we have:
$$q \leq \frac{I(S; F) + D_{KL}(D_s \,\|\, U_s) + H(S(X) \mid X)}{\frac{1}{c_m} \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c_0};|\Omega|-1}) \big)}.$$
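Proof.
$$\begin{aligned}
H(S \mid S(X), X) &= \Big(1 - \sum_{c \in Q,\, c \neq 0} P_c\Big) H(S \mid S(X) = 0, X) + \sum_{c \in Q,\, c \neq 0} P_c\, H(S \mid S(X) = c, X) \\
&\leq \Big(1 - \sum_{c \in Q,\, c \neq 0} P_c\Big) H(U_{\tau_k;|\Omega|-1}) + \sum_{c \in Q,\, c \neq 0} P_c \cdot c \cdot \frac{H(U_{\tau_{k-c};|\Omega|-1})}{c} \\
&= H(U_{\tau_k;|\Omega|-1}) - \sum_{c \in Q,\, c \neq 0} P_c\, c \cdot \frac{1}{c} \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c};|\Omega|-1}) \big) \\
&\leq H(U_{\tau_k;|\Omega|-1}) - \Big( \sum_{c \in Q,\, c \neq 0} P_c\, c \Big) \cdot \frac{1}{c_m} \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c_0};|\Omega|-1}) \big).
\end{aligned}$$
The last inequality is due to the monotonicity of $H(U_{\tau_k;|\Omega|})$ together with $1/c \geq 1/c_m$ for every $c \in Q$, $c \neq 0$. Then, following the same steps as used in Theorem 5.2,
$$H(S) - I(S; F) \leq \underbrace{H(U_{\tau_k;|\Omega|-1})}_{\leq H(U_s)} - \Big( \underbrace{\sum_{c \in Q,\, c \neq 0} P_c\, c}_{= q} \Big) \frac{1}{c_m} \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c_0};|\Omega|-1}) \big) + H(S(X) \mid X).$$
After simplification,
$$q \leq \frac{I(S; F) + D_{KL}(D_s \,\|\, U_s) + H(S(X) \mid X)}{\frac{1}{c_m} \big( H(U_{\tau_k;|\Omega|-1}) - H(U_{\tau_{k-c_0};|\Omega|-1}) \big)}.$$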
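The step labelled "the last inequality" replaces each per-level factor $\frac{1}{c}\big(H(U_{\tau_k}) - H(U_{\tau_{k-c}})\big)$ by the single smaller factor $\frac{1}{c_m}\big(H(U_{\tau_k}) - H(U_{\tau_{k-c_0}})\big)$. The sketch below checks that comparison numerically for one hypothetical configuration. It assumes the stars-and-bars count from the proof of Theorem 5.2, works in integer units of the spacing (so $c_0$ is 1 unit and $c_m$ is `m_units` units), and uses illustrative function names.

```python
from math import comb, log2

def H(n, units):
    """Entropy in bits of the uniform distribution over tau_k with n elements, k = units * c."""
    return log2(comb(n + units - 1, units))

def per_level_gaps(n, k_units, m_units):
    """(1/c) * (H_k - H_{k-c}) for each level c = 1..m_units, in units of the spacing."""
    hk = H(n - 1, k_units)
    return [(hk - H(n - 1, k_units - c)) / c for c in range(1, m_units + 1)]

def uniform_gap(n, k_units, m_units):
    """(1/c_m) * (H_k - H_{k-c_0}) with c_0 = 1 unit and c_m = m_units units."""
    return (H(n - 1, k_units) - H(n - 1, k_units - 1)) / m_units

# Hypothetical sizes: |Omega| = 8, k = 6 units, largest level c_m = 3 units.
gaps = per_level_gaps(n=8, k_units=6, m_units=3)
lower = uniform_gap(n=8, k_units=6, m_units=3)
print(gaps, lower)
```

Every entry of `gaps` should be at least `lower`, which is what allows the weighted sum over levels to be bounded using $q$ alone.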
Lemma 5.4. Given a probability vector $w$ and a vector $v = s - t$,
$$|v^\top w| \leq \max(1 - g, g).$$
Proof. First, note that $g - 1 \leq v_i < g$. This is because $v_i$ is the value that must be subtracted from the $i$th element of the search space to attain the value $t_i$, since $t = s - v$ and $g$ is the cutoff threshold for either rounding $s_i$ up to 1 or down to 0. Therefore, in the case that we are rounding up, the most extreme value that can be subtracted is $g - 1$ (which is equivalent to adding $1 - g$ to arrive at 1). In the case that we are rounding down, the largest value we could subtract is strictly less than $g$. Now, we see that $|v^\top w| \leq \|v\| \|w\|$ (reading $\|v\|$ as the maximum norm and $\|w\|$ as the $\ell_1$ norm, this is Hölder's inequality), and since $w$ is a probability vector, $\|w\| \leq 1$. Therefore, $|v^\top w| \leq \|v\|$ and $\|v\| \leq \max(g, 1 - g)$, given that $g - 1$ is negative. Thus, $|v^\top w| \leq \max(g, 1 - g)$.
Theorem 5.5. Given that $P_{F,s}$ is the averaged conditional distribution assigned by the search algorithm where our target set has a continuous satisfaction measure, $P_{F,t}$ refers to the averaged conditional distribution assigned by the search algorithm when we use a discrete target set, and $g$ is the threshold value for converting a continuous target set into a discrete target set (all elements with satisfaction greater than or equal to the threshold $g$ are included in the target set and the rest are excluded):
$$|q(s, F) - q(t, F)| \leq |T| \sqrt{\tfrac{1}{2} D_{KL}(P_{F,s} \,\|\, P_{F,t})} + \max(1 - g, g).$$
This theorem bounds the difference in the success measure between the discrete and continuous cases using the KL divergence between the distributions learned in the continuous and discrete cases.
Proof. Consider $|q(s, F) - q(t, F)|$. Using the decomposability of the probability-of-success metric, we get:
$$|q(s, F) - q(t, F)| = |s^\top P_{F,s} - t^\top P_{F,t}|.$$
Now, we define a vector $v$ such that $v = s - t$. Therefore,
$$\begin{aligned}
|s^\top P_{F,s} - t^\top P_{F,t}| &= |(t + v)^\top P_{F,s} - t^\top P_{F,t}| \\
&= |t^\top P_{F,s} - t^\top P_{F,t} + v^\top P_{F,s}| \\
&\leq |t^\top P_{F,s} - t^\top P_{F,t}| + |v^\top P_{F,s}|.
\end{aligned}$$
Using Lemma 5.4, $|v^\top P_{F,s}| \leq \max(1 - g, g)$. Therefore,
$$|t^\top P_{F,s} - t^\top P_{F,t}| + |v^\top P_{F,s}| \leq |t^\top (P_{F,s} - P_{F,t})| + \max(g, 1 - g).$$
Defining $r := \max(g, 1 - g)$, we note that
$$\begin{aligned}
|t^\top (P_{F,s} - P_{F,t})| + \max(g, 1 - g) &\leq |T| \sup_{\omega} |P_{F,s}(\omega) - P_{F,t}(\omega)| + r \\
&\leq |T| \sqrt{\tfrac{1}{2} D_{KL}(P_{F,s} \,\|\, P_{F,t})} + r,
\end{aligned}$$
where the last step follows from Pinsker's inequality. Hence,
$$|q(s, F) - q(t, F)| \leq |T| \sqrt{\tfrac{1}{2} D_{KL}(P_{F,s} \,\|\, P_{F,t})} + r.$$
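As a worked illustration of the two sides of this bound, the sketch below evaluates them for a small hand-picked instance. The vectors `s`, `P_s`, and `P_t` and the threshold `g = 0.5` are hypothetical values, not taken from the paper. The KL divergence is computed in nats, the unit in which Pinsker's inequality takes the form $\sqrt{D_{KL}/2}$, and $|T|$ is taken to be the number of elements whose satisfaction meets the threshold.

```python
from math import log, sqrt

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def theorem_5_5_sides(s, P_s, P_t, g):
    """Return (|q(s,F) - q(t,F)|, bound) for the thresholded target vector t."""
    t = [1.0 if si >= g else 0.0 for si in s]        # discrete target vector
    q_s = sum(si * pi for si, pi in zip(s, P_s))      # q(s, F) = s^T P_{F,s}
    q_t = sum(ti * pi for ti, pi in zip(t, P_t))      # q(t, F) = t^T P_{F,t}
    size_T = sum(t)                                   # |T|: number of target elements
    bound = size_T * sqrt(0.5 * kl_divergence(P_s, P_t)) + max(g, 1.0 - g)
    return abs(q_s - q_t), bound

s = [0.9, 0.7, 0.4, 0.1]      # hypothetical continuous satisfaction values
P_s = [0.4, 0.3, 0.2, 0.1]    # hypothetical distribution in the continuous case
P_t = [0.5, 0.3, 0.1, 0.1]    # hypothetical distribution in the discrete case
lhs, rhs = theorem_5_5_sides(s, P_s, P_t, g=0.5)
print(lhs, rhs)
```

In this instance the additive term $\max(g, 1 - g)$ dominates the bound, which suggests the result is most informative when the threshold error, not the distribution shift, is the quantity of interest.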
Example. We set $k = 1$. For all levels of quantization, note that $I(S; F)$ can be found computationally by directly computing $s(\omega)$ and $F(\omega)$ for all $\omega$. We know that the $D_{KL}$ term will equal $0$,
since the KL divergence between two identical distributions is $0$. Since $S(X)$ is independent from $X$, $H(S(X) \mid X) = H(S(X))$. We also know that
$$H(U_{\tau_k;|\Omega|}) = -\sum_{s \in \tau_k} \frac{1}{|\tau_k|} \log_2 \frac{1}{|\tau_k|}.$$
So then we simply need to find $|\tau_k|$ for a given $|\Omega|$. This can be done by simple combinatorics.
First, we set $m = 2$ and $u = 0.5$ with $c = 0.5$ so $s_i \in \{0, 0.5\}$. We know that $P(S(X) = 0) = \frac{n-2}{n}$ since there are exactly two non-zero elements (both $0.5$ to sum up to $k = 1$), and then $P(S(X) = 0.5) = \frac{2}{n}$ so the distribution is known. To find $|\tau_k|$, we know that it is formed from all sets $s$ that sum up to $1$, so that is any set with exactly two elements with value $0.5$, so then there are $\binom{n}{2}$ such sets. We also need to compute $|\tau_{k-u}|$ for the subtracted term in the bound; this is $\tau_{0.5}$. We know this happens when exactly one element is $0.5$, so there are $\binom{n}{1} = n$ such sets.
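The counts in this first case are easy to confirm by brute force. The following sketch enumerates the vectors for a small hypothetical size ($n = 5$, read here as $|\Omega|$, which is an assumption about the notation) and works in integer units of $c = 0.5$ to avoid floating-point comparisons. It also assumes $X$ is uniform over the $n$ elements, which is what the stated probabilities $\frac{n-2}{n}$ and $\frac{2}{n}$ imply.

```python
from fractions import Fraction
from itertools import product
from math import comb

def enumerate_tau(n, levels, total):
    """All length-n vectors with entries in `levels` (integer units of c) summing to `total`."""
    return [s for s in product(levels, repeat=n) if sum(s) == total]

n = 5                                          # hypothetical |Omega|
tau_k = enumerate_tau(n, (0, 1), 2)            # k = 1 is two units of c = 0.5
tau_k_minus_u = enumerate_tau(n, (0, 1), 1)    # k - u = 0.5 is one unit

# Distribution of S(X) for any fixed s in tau_k when X is uniform over the n elements.
s = tau_k[0]
p_zero = Fraction(s.count(0), n)               # P(S(X) = 0)
p_half = Fraction(s.count(1), n)               # P(S(X) = 0.5)
print(len(tau_k), len(tau_k_minus_u), p_zero, p_half)
```

For $n = 5$ this yields $|\tau_k| = \binom{5}{2} = 10$ and $|\tau_{0.5}| = 5$, in agreement with the closed forms above.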
Second, we set $m = 3$ and $u = \frac{2}{3}$ with $c = \frac{1}{3}$ so $s_i \in \{0, \frac{1}{3}, \frac{2}{3}\}$. There are two possible cases for satisfactory vectors in $\tau_k$: those that have three elements equal to $\frac{1}{3}$, and those with one $\frac{1}{3}$ and one $\frac{2}{3}$. There are $\binom{n}{3}$ sets satisfying the first case, and $\binom{n}{2} \cdot 2!$ sets satisfying the second, so $|\tau_k| = \binom{n}{3} + \binom{n}{2} \cdot 2!$. We must also find $|\tau_{k-u}| = |\tau_{1/3}|$; this happens when exactly one element is $\frac{1}{3}$, so there are $\binom{n}{1} = n$ such sets. Next, we must find the distribution over $S(X)$. Let $a$ be the probability that we have the case of three $\frac{1}{3}$ (there are $\binom{n}{3}$ such sets) and $b$ be the complementary probability of having one $\frac{1}{3}$ and one $\frac{2}{3}$ (there are $\binom{n}{2} \cdot 2!$ such sets). Then, $P(S(X) = 0) = a \cdot \frac{n-3}{n} + b \cdot \frac{n-2}{n}$ since there are 3 non-zero elements in the $a$ case and 2 in the $b$ case. $P(S(X) = \frac{1}{3}) = a \cdot \frac{3}{n} + b \cdot \frac{1}{n}$ since there are three $\frac{1}{3}$ entries in the $a$ case and one in the $b$ case. Finally, $P(S(X) = \frac{2}{3}) = b \cdot \frac{1}{n}$ since a $\frac{2}{3}$ entry occurs only in the $b$ case, and only once.
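The second case can be verified the same way. This sketch enumerates $\tau_k$ in integer units of $c = \frac{1}{3}$ for a hypothetical $n = 5$ (again reading $n$ as $|\Omega|$). It additionally assumes that $s$ is drawn uniformly from $\tau_k$, so that $a$ and $b$ are simply the relative frequencies of the two cases; that assumption is consistent with the $D_{KL}$ term being zero, but the choice is made here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

n = 5                                   # hypothetical |Omega|
levels = (0, 1, 2)                      # 0, 1/3, 2/3 in units of c = 1/3
tau_k = [s for s in product(levels, repeat=n) if sum(s) == 3]   # k = 1 is three units

case_a = [s for s in tau_k if s.count(1) == 3]                       # three entries of 1/3
case_b = [s for s in tau_k if s.count(1) == 1 and s.count(2) == 1]   # one 1/3 and one 2/3

# Assumption: s is uniform over tau_k, so a and b are the case frequencies.
a = Fraction(len(case_a), len(tau_k))
b = Fraction(len(case_b), len(tau_k))

p0 = a * Fraction(n - 3, n) + b * Fraction(n - 2, n)   # P(S(X) = 0)
p1 = a * Fraction(3, n) + b * Fraction(1, n)           # P(S(X) = 1/3)
p2 = b * Fraction(1, n)                                # P(S(X) = 2/3)
print(len(tau_k), a, b, p0, p1, p2)
```

The enumeration reproduces $|\tau_k| = \binom{n}{3} + \binom{n}{2} \cdot 2!$, and the three probabilities sum to one, which is a quick consistency check on the formulas for the distribution of $S(X)$.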
Now that every term has been determined either
mathematically or computationally, we can combine
them to compute the bounds as given in Section 6.