The Labeling Distribution Matrix (LDM): A Tool for Estimating Machine Learning Algorithm Capacity

Pedro Sandoval Segura^{1,2}, Julius Lauw^{1}, Daniel Bashir^{1}, Kinjal Shah^{1}, Sonia Sehra^{1}, Dominique Macias^{1} and George Montañez^{1}

^{1} AMISTAD Lab, Department of Computer Science, Harvey Mudd College, Claremont, CA, U.S.A.
^{2} Department of Computer Science, University of Maryland, College Park, MD, U.S.A.
Keywords:
Machine Learning, Model Complexity, Algorithm Capacity, VC Dimension, Label Recorder.
Abstract:
Algorithm performance in supervised learning is a combination of memorization, generalization, and luck. By
estimating how much information an algorithm can memorize from a dataset, we can set a lower bound on
the amount of performance due to other factors such as generalization and luck. With this goal in mind, we
introduce the Labeling Distribution Matrix (LDM) as a tool for estimating the capacity of learning algorithms.
The method attempts to characterize the diversity of possible outputs by an algorithm for different training
datasets, using this to measure algorithm flexibility and responsiveness to data. We test the method on several
supervised learning algorithms, and find that while the results are not conclusive, the LDM does allow us
to gain potentially valuable insight into the prediction behavior of algorithms. We also introduce the Label
Recorder as an additional tool for estimating algorithm capacity, with more promising initial results.
1 INTRODUCTION
Determining the representational complexity of a
learning algorithm is a long-standing problem in ma-
chine learning. Well-known methods for doing so in-
clude upper bounding an algorithm’s model complex-
ity using the VC dimension and measuring an algo-
rithm’s ability to fit noise with Rademacher complex-
ity. We let algorithm capacity denote the represen-
tational complexity of the outputs of a learning al-
gorithm, which is a measure of the algorithm’s abil-
ity to store information in its trained models that can
subsequently be used to make inferences; this stored
information might take the form of memorized data
points or underlying information about how the data
was generated. A method capable of providing an es-
timate of an algorithm’s capacity instead of bound-
ing it offers concrete progress towards answering the
question of whether a given algorithm will overfit or
underfit on a particular dataset.
Building a formal framework for the overfitting
and underfitting of machine learning models is of
prime importance as researchers desire to tune their
models in ways that will lead to greater generaliza-
tion performance. We want to train our models to
capture inherent data relationships and be able to per-
form accurately over unseen data. It is a common belief
that machine learning algorithms perform better when
their capacity is appropriate for the true complexity of
the task, such that the algorithm is able to capture re-
lationships within the provided training data. In other
words, underfitting is a result of a model having in-
sufficient capacity, while overfitting is a repercussion
of models with excess capacity. In this manuscript,
we introduce a proxy for model capacity through em-
pirically observable Labeling Distribution Matrices
(LDMs).
1.1 Existing Characterizations
To the authors’ knowledge, there is no general method
to empirically measure algorithm capacity for an ar-
bitrary machine learning method. If such a procedure
existed, we would make progress toward determining
whether overfitting or underfitting could occur on a
dataset, given information on the complexity of said
dataset. It is important to note the distinctions among the different definitions of model capacity as currently understood.
Representational Capacity: specifies the family of functions a learning algorithm can choose from when varying its parameters in order to reduce a training objective.
Effective Capacity: reflects how imperfections
and assumptions made in the optimization of a learning algorithm can limit its representational capacity.
Our method for measuring the entropy of the la-
beling distributions produced from a classification
model aims to capture representational capacity. We
train a given classification model on noisy datasets,
then measure how the training affects the distribu-
tion of output labels. Training a model on noise and
analyzing how that affects the labeling distributions
that are produced should serve as a reasonable proxy
for a classification model’s ability to model arbitrarily
complex relationships.
1.2 Experimental Setup
Our goal is to characterize the capacity of a classi-
fication algorithm by measuring its expressiveness.
In this exploratory work, we will focus our attention
on determining the capacity of classification models.
The procedure involves analyzing the probability dis-
tributions over all possible ways of labeling a holdout
set, given an information resource that the model was
trained on. Note that we use information resource to
mean training data within this context. We want to determine to what extent an arbitrary machine learning classification model is capable of capturing the relationship between features and labels for a given set of information resources. Our hypothesis is that if, on datasets without any inherent relationship between features and labels, the model is able to classify correctly beyond what is expected by random guessing, then this suggests the model has the storage capacity to memorize elements of the dataset.
As a concrete example, consider a set of $K$ training datasets, each with $N$ data points. Each data point can be labeled as one of $C$ possible classes. We define the $i$-th training dataset formally as follows:
$$f_i = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$
where $x_j$ is the $j$-th data point and $y_j$ is the label assigned to data point $x_j$, for all $1 \leq i \leq K$. Also, let us consider a holdout set $H$ of size $N'$:
$$H = \{z_1, z_2, \ldots, z_{N'}\}$$
which we label using a trained classification model $M$. Here, $z_j$ is the $j$-th data point of the holdout set. The label for the data point $z_j$ of the holdout set is $M(z_j)$.
Based on the above definitions, we see that for any classification model trained on dataset $f_i$, there is a total of $C^{N'}$ possible labeling combinations for a given holdout set $H$, where $C$ represents the number of classes available for the given dataset. For example, if the training dataset contains binary labelings (where $y_j \in \{0, 1\}$), then $C = 2$ and there are $2^{N'}$ possible labelings for the holdout set.
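To make this counting concrete, the following short sketch enumerates all labelings of a holdout set in lexicographic order (the same ordering used later when visualizing LDMs); the particular values of $C$ and $N'$ here are arbitrary illustrative choices.

```python
# Illustrative only: enumerate all C**N' labelings of a holdout set of size N'.
from itertools import product

C, N_prime = 2, 3
labelings = list(product(range(C), repeat=N_prime))
print(len(labelings))   # 2**3 = 8 possible labelings
print(labelings[:3])    # (0, 0, 0), (0, 0, 1), (0, 1, 0)
```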
1.3 Simplex Vectors
We will be using a probabilistic model to evaluate algorithm expressiveness. The first step is to construct a series of simplex vectors $P_{f_i}$, each of which is represented as a vector of probabilities of size $C^{N'}$ for a given training dataset $f_i$. The idea is that, for each data point $z_j$ in a given holdout set $H$, we can find the probability that the trained classification model $M$ will assign each of the possible $C$ classes. By determining the probability distribution of a classification model's assignments over the $C^{N'}$ possible labeling combinations for a given training dataset $f_i$, we can gain a better understanding of the capacity of the model based on the nature of the resulting probability distributions. If we see that the model typically assigns probability mass on the same subset of possible labelings regardless of the training data, then we can say that the model is less expressive and has lower capacity. We construct the simplex vector $P_{f_i}$ for a given training dataset $f_i$ as follows:
1. Train classification model $M$ on $f_i$.
2. Label holdout set $H$ using $M$.
3. For every $j$-th possible labeling $l$ of the holdout set, compute the probability that $M$ would have assigned $l$. Set this probability as the $j$-th entry of $P_{f_i}$.
4. Normalize $P_{f_i}$.
Given that each entry in the simplex vector corresponds to the probability of a particular labeling combination for a given holdout set $H$, the entry is computed by taking the product of the probabilities assigned to each class for the given holdout feature sample. Consider a holdout set $H = \{z_1, z_2, z_3\}$ of size 3. If a given training dataset has classes 0, 1, or 2, then the probability assigned to the entry corresponding to the labeling $\{z_1\!: 0,\ z_2\!: 1,\ z_3\!: 2\}$ will be the product of the probability of class 0 being assigned, the probability of class 1 being assigned, and the probability of class 2 being assigned. In other words,
$$P(M(z_1) = 0, M(z_2) = 1, M(z_3) = 2 \mid z_1, z_2, z_3) = P(M(z_1) = 0)\,P(M(z_2) = 1)\,P(M(z_3) = 2),$$
where we assume that the probability of a single label is conditionally independent of the probability of any other label, since we are randomly generating training datasets.
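As a rough illustration of this construction, the sketch below builds a simplex vector using a scikit-learn-style classifier with a predict_proba method. The helper name get_simplex_vector and its signature are ours, for illustration only, and we assume every one of the $C$ classes appears among the training labels so that the columns of predict_proba line up with class indices.

```python
# A minimal sketch of simplex-vector construction (Section 1.3); illustrative,
# not the original implementation.
import numpy as np
from itertools import product

def get_simplex_vector(model, train_X, train_y, holdout_X, num_classes):
    """Return a vector of length C**N' over all labelings of the holdout set."""
    model.fit(train_X, train_y)
    probs = model.predict_proba(holdout_X)   # per-point class probabilities, shape (N', C)
    simplex = []
    for labeling in product(range(num_classes), repeat=len(holdout_X)):
        # Conditional-independence assumption: a joint labeling's probability is
        # the product of the per-point class probabilities.
        simplex.append(np.prod([probs[j, c] for j, c in enumerate(labeling)]))
    simplex = np.asarray(simplex)
    return simplex / simplex.sum()           # step 4: normalize
```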
1.4 Dirichlet Characterization
Now that we have a series of simplex vectors $P_{f_i}$ for all $1 \leq i \leq K$, we seek to measure the diversity of these probability distributions. Because the support of a Dirichlet distribution can be viewed as a set of probability distributions, we worked to infer the parameters of the Dirichlet distribution from which the simplex vectors $P_{f_i}$ were drawn. The idea is that we can determine how expressive an algorithm is based on the diversity of the probability distributions generated over labelings of a given holdout set $H$, for different information resources $f_i$. We hypothesize that the greater the expressiveness of an algorithm, the more diverse the probability distributions will be, since such algorithms will not tend to have a fixed preference for output responses regardless of the training data. An expressive algorithm tends to be more "flexible" such that, given multiple information resources $f_i$ to train on and a fixed set of holdout features, it will be more responsive to the differences between the information resources it trained on. Thus, the more responsive an algorithm is to changes in training data, the greater the diversity the resulting simplex vectors for different datasets should show.
Supposing the simplex vectors were drawn independently from a Dirichlet distribution, we could use them to infer a vector of alpha priors $\vec{\alpha}$ corresponding to that Dirichlet distribution. We claim that the entropy of this distribution gives us a means to estimate expressiveness because it tells us how uniformly the inferred Dirichlet distribution assigns probabilities to its support. The more uniform the distribution, the more expressive the algorithm, since the simplex vectors could then be drawn from anywhere on the simplex. Diversity of sampled simplex vectors implies more uniform mass over the support. We compute the entropy of the Dirichlet distribution by computing the entropy of the distribution parameterized by $\vec{\alpha}$.
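The differential entropy of a Dirichlet distribution is available in closed form, so this step is straightforward once $\vec{\alpha}$ is in hand. The snippet below is a minimal sketch using SciPy, with a hypothetical $\vec{\alpha}$ standing in for an inferred one.

```python
# A minimal sketch of the entropy step (Section 1.4), assuming alpha has
# already been inferred. The value is a differential entropy and may be negative.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([0.7, 1.3, 2.0])   # hypothetical inferred Dirichlet parameters
print(dirichlet.entropy(alpha))
```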
1.5 Connections to Rademacher
Complexity
One can view this technique, of training an algorithm
on noise and analyzing the distribution of possible la-
belings, as a means of better understanding the space
of functions that the classification algorithm is rea-
soning over. Similarly, in computational learning the-
ory, Rademacher complexity provides us with another
way of measuring hypothesis space complexity.
Definition: (Rademacher Complexity) Let $\mathcal{H} \subseteq \mathcal{F} = \{f : X \to \mathbb{R}\}$ be a class of functions we are exploring, defined on domain $X \subseteq \mathcal{X}$, and let $S = \{x_i\}_{i=1}^{n}$ be a set of samples generated by some unknown distribution $\mathcal{D}_{X}$ on the same domain $X$. Define $\sigma_i$ to be a uniform random variable on $\{\pm 1\}$, for any $i$. The empirical Rademacher complexity or Rademacher average is defined as follows:
$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right].$$
The supremum measures the maximum correlation between $f(x_i)$ and $\sigma_i$ over all $f \in \mathcal{H}$, for a given set $S$ and Rademacher vector $\sigma$. Because we are taking the expectation over $\sigma$, the empirical Rademacher complexity of $\mathcal{H}$ measures the ability of functions within this space to fit random noise. The Rademacher complexity, therefore, can be thought of as the expected noise-fitting ability of $\mathcal{H}$ over all data sets $S$ drawn from $\mathcal{D}_{X}$ (Balcan, 2011).
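For intuition, the empirical Rademacher complexity of a small finite function class can be approximated by Monte Carlo sampling of Rademacher vectors. The sketch below is purely illustrative and is not part of the LDM procedure; the representation of $\mathcal{H}$ by a matrix of its outputs on $S$ is our own simplification.

```python
# A minimal Monte Carlo sketch of empirical Rademacher complexity for a finite
# class H, represented by its outputs on the sample S.
import numpy as np

def empirical_rademacher(H_outputs, num_draws=1000, seed=0):
    """H_outputs: array of shape (|H|, n) holding f(x_i) for each f in H, x_i in S."""
    rng = np.random.default_rng(seed)
    n = H_outputs.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)    # Rademacher vector
        total += np.max(H_outputs @ sigma) / n     # sup over f of (1/n) sum sigma_i f(x_i)
    return total / num_draws                       # approximate expectation over sigma
```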
1.6 Estimating Complexity from the
Labeling Distribution Matrix
(LDM)
As described in Section 1.4, we take an arbitrary classification model $M$, train it on an information resource (training dataset) $f_i$, and build a probability distribution over $\Omega$, the space of all possible labelings of our holdout set $H$. Because a classification algorithm can only provide us with class probabilities given the features of an example, we build probability distributions of length $|\Omega|$, which are simplex vectors. Each element in a simplex vector represents the probability of encountering a particular labeling combination. Iterating this process over all $K$ training datasets, we eventually generate a series of simplex vectors $P_{f_1}, P_{f_2}, \ldots, P_{f_K}$, each of length $|\Omega|$, to characterize the probabilities that $M$ places on each element of $\Omega$. This produces our labeling distribution matrix $L$,
$$L = \begin{bmatrix} | & | & & | \\ P_{f_1} & P_{f_2} & \cdots & P_{f_K} \\ | & | & & | \end{bmatrix}.$$
Given these $K$ simplex vectors, whose entries sum to 1 after the normalization described in Section 1.3, we propose a method to measure the capacity of model $M$ that works as follows:
1. We suppose each of the column vectors in our matrix is an independent sample from some Dirichlet distribution parameterized by $\vec{\alpha}$.
2. Use an iterative method (Sklar, 2014; Minka, 2003) to infer the parameters of the Dirichlet distribution from our sample vectors, as seen in Line 5 of Algorithm 1.
3. Compute the entropy of the Dirichlet distribution
we inferred.
The pseudocode for constructing an LDM of $M$ and estimating its complexity by measuring the entropy of $\vec{\alpha}$ is below:
Variable definitions:
M = input classification model
K = number of datasets (columns) for LDM
D = input dataset
Algorithm 1: LDM Entropy Calculation Algorithm.
1: for i = 1, ..., K do
2:     f_i, H ← splitIntoTrainHoldout(D)
3:     P_{f_i} ← getSimplexVector(M, f_i, H)
4: L ← [P_{f_1}, P_{f_2}, ..., P_{f_K}]
5: dirichletAlphas ← findDirichlet(L)
6: entropy ← computeEntropy(dirichletAlphas)
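A minimal end-to-end sketch of Algorithm 1 is given below, reusing the get_simplex_vector helper sketched in Section 1.3. The helper names are illustrative rather than the original implementation, and findDirichlet is approximated here by a crude moment-matching estimate of $\vec{\alpha}$ rather than the fixed-point MLE of (Minka, 2003; Sklar, 2014). As in our experiments, labels are permuted so that each generated training dataset has no inherent feature-label relationship.

```python
# A minimal sketch of the LDM entropy pipeline (Algorithm 1); illustrative only.
import numpy as np
from scipy.stats import dirichlet
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def split_into_train_holdout(X, y, holdout_size, rng):
    idx = rng.permutation(len(X))
    hold, train = idx[:holdout_size], idx[holdout_size:]
    # Permute labels so there is no inherent feature-label relationship.
    return X[train], rng.permutation(y[train]), X[hold]

def find_dirichlet_moments(L, eps=1e-12):
    """Rough moment-matching estimate of the Dirichlet alpha vector from LDM columns."""
    P = np.clip(L, eps, None)
    P /= P.sum(axis=0, keepdims=True)
    mean, var = P.mean(axis=1), P.var(axis=1) + eps
    alpha0 = max(np.median(mean * (1.0 - mean) / var - 1.0), eps)
    return np.clip(mean * alpha0, eps, None)

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
K, holdout_size, num_classes = 25, 3, 3
columns = []
for _ in range(K):                                   # lines 1-3 of Algorithm 1
    Xtr, ytr, Xho = split_into_train_holdout(X, y, holdout_size, rng)
    model = DecisionTreeClassifier(max_depth=5)
    columns.append(get_simplex_vector(model, Xtr, ytr, Xho, num_classes))
L = np.column_stack(columns)                         # line 4: the LDM, shape (C**N', K)
alpha = find_dirichlet_moments(L)                    # line 5 (approximation)
print(dirichlet.entropy(alpha))                      # line 6: differential entropy
```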
We claim that this method can be used to measure
model complexity. To give an intuitive sense of why,
we first look at the labeling distribution matrix itself.
If our model $M$ was trained on dataset $D_i$, then the output is a column simplex vector $P_{D_i}$, indicating a probability distribution placed on $\Omega$ when $M$ was trained on $D_i$. If we train on $K$ distinct datasets $D_1, D_2, \ldots, D_K$, then $M$ produces $K$ such column simplex vectors. Intuitively, if $M$ is a particularly expressive algorithm, then $M$ will be able to adapt well to different datasets, capturing the true relationships (and perhaps noise). As a result, we would expect more variance in the potential distributions over $\Omega$ that $M$ would be able to produce. On the other hand, if the simplex vectors in our algorithm's matrix are mostly very similar, then our algorithm $M$ does not have much capacity to adapt its parameters to the given dataset.
In general, if the LDM is “compressible” (con-
taining redundant simplex vectors whose distributions
are similar) then our algorithm is not very expressive.
Thus, we consider the simplex vectors of our matrix
themselves as samples from a Dirichlet distribution.
If these vectors imply that the Dirichlet distribution
covers most of its support, i.e., has a uniform distri-
bution, then that means that our algorithm is as ex-
pressive as possible. This is because any vector in the
support has the same probability assigned to it as any
other vector, meaning that our algorithm is capable of
producing anything in the support with equal likeli-
hood. Therefore, the entropy of the inferred distribu-
tion is effectively a measure of the expressiveness of
the algorithm that produced our labeling distribution
matrix—a low entropy indicates a relatively less ex-
pressive algorithm, while high entropy implies a more
expressive algorithm.
1.7 Results
Using the methodology described in our experimen-
tal setup, we calculated the entropy of the labeling
distribution matrices of a variety of machine learn-
ing models averaged over 20 runs. Every classifica-
tion algorithm was trained on scikit-learn's Iris
Dataset (Fisher, 1936), which consists of 3 classes,
4 real-valued features for every example, and 150 ex-
amples total. Our Gaussian Process Classifier uses a
default RBF kernel. Our Decision Tree uses a maxi-
mum depth of 5. Our Random Forest Classifier uses
10 estimators, a maximum of 1 feature, and a maxi-
mum depth of 5.
The entropies computed give us some insight as
to a model’s capacity and expressiveness. Every in-
formation resource on which we train a classification
model is randomly generated, since we permute the
labels of the original dataset when generating a col-
umn of an LDM. Thus, computing the entropy of
~
α
parameterizing a Dirichlet distribution over an LDM
gives us a measure of the diversity of distributions
over labelings that the classification algorithm is ca-
pable of producing. The results are given in Table 1.
Additionally, the LDM heatmaps shown in Figure 1 provide a visualization of the probability distributions of an LDM, with dark cells representing low probability and brighter cells representing higher probability. Every index $i$ of the x-axis represents a particular labeling distribution $P_{f_i}$, while every index $j$ of the y-axis represents a unique labeling of the holdout set $H$, of which there are $C^{N'}$.
The average entropy of different classification models, as outlined in Table 1, coincides with what the average LDM (Figure 1) looks like for the same model.
Note, for example, that the highest entropy values are
achieved by models like the Gaussian Process Classi-
fier and AdaBoost Classifier.
If we look at the Gaussian Process Classifier’s
LDM, as represented by the heatmap in Figure 1c, we
see that the texture of the matrix is consistent (in color
and pattern) over different information resources and
labelings. Notice, it has an overall “brightness” that
exceeds that of other classifiers, meaning that its mod-
els place more probability on different labelings than
other models do. This means that, despite being
trained on different training datasets where the re-
lationship between features and labels is broken or
nonexistent, the Gaussian Process Classifier is still
able to assign nonzero probability to most labelings
of the holdout set.
In a similar vein, for the AdaBoost Classifier’s
LDM, as represented by the heatmap in Figure 1d, we
see that the texture of the matrix is distinct from oth-
Table 1: Average Entropy of LDM for a variety of classifiers.

Model                            | Average Entropy of LDM
Random Forest                    | -3100
Gaussian Naïve Bayes             | -1728
Gaussian Process Classifier      | -1244
AdaBoost Classifier              | -1264
Quadratic Discriminant Analysis  | -1128
Decision Tree Classifier         | -4908
K-Nearest Neighbors (K = 10)     | -1178
K-Nearest Neighbors (K = 5)      | -8429
K-Nearest Neighbors (K = 3)      | -18175
K-Nearest Neighbors (K = 1)      | -17227
Figure 1: LDM heatmaps for various classifiers: (a) Random Forest, (b) Gaussian Naive Bayes, (c) Gaussian Process, (d) AdaBoost, (e) QDA, (f) KNN-3, (g) Decision Tree.
ers, but still regular. It seems like there are three bands
or sections regardless of the information resource that
the model was trained on. In generating this matrix,
we used a holdout set of size 5, where there were 3
classes. Since the labelings along the y-axis are ordered lexicographically (with “00000” as
the first labeling of the y-axis), the ridges of the matrix
are potentially caused by the shift from labelings that
start with a “0”, to labelings that start with a “1”, and
finally to labelings that start with a “2”. This is fur-
ther verified by the fact that there are $3^4 = 81$ unique
labelings of the holdout set that begin with a “0” and
we can see that the first ridge of the heatmap occurs
at exactly index 81.
Conversely, the heatmap for the Decision Tree
Classifier is dark overall, scattered with tiny spikes of
light color. This means that its model confidently as-
signs most probability mass to a single or few possible
labelings, and no probability mass to other possible
labelings. The Random Forest and KNN-3 classifiers
have an appearance that is somewhere in between the
extremes represented by the Gaussian Process Clas-
sifier and the Decision Tree Classifier. Comparing to
the table, however, we see that the heatmap appear-
ance doesn’t fit neatly into the pattern of increasing
entropy values for successive models.
The LDM attempts to capture two aspects simul-
taneously: the variation between output simplex vec-
tors for models trained on different datasets, and the
confidence (sparsity) within each individual simplex
vector which determines how much probability mass
is placed on the preferred labeling for a given dataset.
In combining these two dimensions into a single num-
ber, the LDM estimation procedure loses some infor-
mation, in that we do not know if a final high en-
tropy value represents the averaging of many diverse
high confidence vectors, or the averaging of less con-
fident models, which produce homogeneous simplex
vectors and lead to the same observed value. Flexi-
ble models should make confident predictions and be
responsive to new data, leading to diverse output sim-
plex vectors. Combining these features into a single
number that measures capacity is the goal of the LDM
process, at which it is only partially successful.
2 SHORTCOMINGS
As noted in our Results section, the full LDM pro-
cess seems to struggle in combining the two aspects
of flexible models in an unambiguous way. Further-
more, we observe some values and trends which dis-
agree with our traditional understanding of the rel-
ative flexibility of various methods. For example,
KNN-1 should be the most prone to overfitting, hav-
ing the greatest flexibility, yet its average entropy
value is lower than that of KNN-10, which should be
far more constrained and thus far less flexible.
The problem could stem from one or more aspects
of the procedure. Perhaps crucial information was lost
as a result of averaging the values of the simplex vec-
tors, as suggested in the previous section. In addition,
making algorithms output probabilities based on conditional independence of test instance labelings allows an algorithm like KNN-10 to place posi-
tive probability mass on many more individual test in-
stances (likely having some nonzero number of neigh-
bors with any chosen class label), whereas KNN-1
can only ever assign positive probability to the la-
bel of its single neighbor. Treating arbitrary simplex
vectors as samples from a Dirichlet model may also
be problematic, since this modeling assumption was
made for simplicity.
Lastly, given the negative entropy values of the
LDM process, it is difficult to understand these as
positive capacity values, undermining the purpose for
which the LDM was proposed. Negative entropy val-
ues can arise when using differential entropy, as when
estimating the entropy of a continuous Dirichlet dis-
tribution. For the LDM-Dirichlet process to be used,
one would still need a way of correlating entropy
scores to storage capacity in bits.
3 FUTURE WORK
Given the aforementioned shortcomings of the LDM
process and the continued need for methods of esti-
mating algorithm capacity, other approaches should
continue to be pursued. The question of how to es-
timate algorithm capacity is important, and failing to
find a general solution to the question does not render
the question any less important.
One particularly promising idea, inspired by re-
search in deep neural networks, is to use a form of
autoencoder (Doersch, 2016; Olshausen and Field,
1996; Lee et al., 2007; Bengio et al., 2014; Bengio
et al., 2013; Kingma and Welling, 2014) as applied
to training data with labels that are independent of
features. Generalization requires being able to pre-
dict labels given knowledge of the true relationship
between features and labels. For a dataset with no
relationship (in other words, with independence be-
tween features and labels), the only way an algorithm
can reproduce the labels from the training dataset con-
sistently is to memorize them, which it can do only in
proportion to its capacity. Thus, for binary labels, the number of labels the algorithm can correctly retrieve in testing reflects its capacity (in bits) for memorizing labels, plus some small number of luckily guessed labels (the number of which can be bounded with high probability).
A label recorder takes randomly generated labels
and independent training features comprising a train-
ing dataset, trains on that dataset, then tests on the
same training set. The number of correctly repro-
duced labels will give a point estimate of the capacity,
subject to random variation. Repeating this process
and taking the average of the observed capacities will
allow one to get an increasingly tighter estimate of the
true algorithm capacity, arguably with fewer assump-
tions and steps than the LDM process.
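A minimal sketch of this procedure might look as follows, assuming a scikit-learn-style classifier; the function name label_recorder_estimate and its parameters are illustrative rather than the exact implementation used to produce the results below.

```python
# A minimal label-recorder sketch: train on labels generated independently of the
# features, test on the same set, and average the number of labels recovered.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def label_recorder_estimate(make_model, X, num_classes, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    recovered = []
    for _ in range(trials):
        y_random = rng.integers(num_classes, size=len(X))  # labels independent of features
        model = make_model()
        model.fit(X, y_random)
        recovered.append(np.sum(model.predict(X) == y_random))
    return float(np.mean(recovered))                       # point estimate of capacity

X, _ = load_iris(return_X_y=True)
print(label_recorder_estimate(lambda: KNeighborsClassifier(n_neighbors=3), X, num_classes=3))
```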
Table 2 shows preliminary label recorder results
for the models tested. Each method was tested on
a set of 150 instances from the Iris dataset (Fisher,
1936), with labels generated independently and uni-
formly at random. The point estimates were the aver-
age number of labels correctly recovered at test time,
averaged over 1000 independent trials for each model.
As can be seen from the table, the unpruned Decision Tree and the Random Forest Classifier have the highest estimated capacities, while more bias-heavy models such as Quadratic Discriminant Analysis and Gaussian Naïve Bayes have less capacity. Furthermore, the estimated capacities for KNN as a function of the regularization parameter K show decreasing capacity with increasing K, aligning better with our intuition than the LDM-inferred entropies. Thus,
label recorders present a promising avenue for esti-
mating algorithm capacity. Creating label recorders
and using them to provide rigorous bounds on algo-
rithm capacity is the subject of future work, which
we hope will complement (if not supersede) the work
presented here.
Table 2: Estimated capacity of a variety of classifiers using label recorders, with 95% confidence intervals.

Model                            | Estimated Capacity | 95% CI
Random Forest                    | 144.79             | [144.66, 144.92]
Gaussian Naïve Bayes             | 60.79              | [60.53, 61.06]
Gaussian Process Classifier      | 75.00              | [74.75, 75.26]
AdaBoost Classifier              | 92.58              | [92.24, 92.91]
Quadratic Discriminant Analysis  | 69.53              | [69.22, 69.83]
Decision Tree Classifier         | 149.34             | [149.31, 149.37]
K-Nearest Neighbors (K = 10)     | 73.26              | [72.91, 73.61]
K-Nearest Neighbors (K = 5)      | 82.85              | [82.50, 83.19]
K-Nearest Neighbors (K = 3)      | 93.74              | [93.40, 94.08]
K-Nearest Neighbors (K = 1)      | 149.34             | [149.31, 149.37]
4 CONCLUSIONS
In an attempt to estimate the capacity of algorithms,
as a measure of the amount of data their models can
store, we introduce the LDM. We tested the LDM pro-
cedure on several learning models and observed the
average entropy values over a variety of datasets. Our
results highlighted interesting behaviors of the algo-
rithms tested, but were not as conclusive or consistent
as initially hoped.
Although the LDM process falls short in several
respects, the underlying idea of a Labeling Distribu-
tion Matrix could prove useful in developing other
methods for measuring algorithm capacity. Perhaps
other ways of averaging or combining data from the
LDM could lead to better, more interpretable results.
Recent theoretical work on entropic expressivity may
be one such application of the LDM (Lauw et al.,
2019). Thus, even though the LDM process is not
a perfect capacity estimation method, the LDM ob-
ject itself might still prove useful as a component of
such a method. Lastly, the label recorders briefly pre-
sented here hold promise for serving as a method of
estimating algorithm capacity, as shown through our
preliminary results.
REFERENCES
Balcan, M.-F. (2011). Rademacher Complexity.
Bengio, Y., Laufer, E., Alain, G., and Yosinski, J. (2014).
Deep generative stochastic networks trainable by
backprop. In International Conference on Machine
Learning, pages 226–234.
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013). Gen-
eralized denoising auto-encoders as generative mod-
els. In Advances in neural information processing sys-
tems, pages 899–907.
Doersch, C. (2016). Tutorial on variational autoencoders.
arXiv preprint arXiv:1606.05908.
Fisher, R. A. (1936). The use of multiple measurements in
taxonomic problems. Annals of Eugenics, 7(7):179–
188.
Kingma, D. P. and Welling, M. (2014). Stochastic gradi-
ent vb and the variational auto-encoder. In Second In-
ternational Conference on Learning Representations,
ICLR, volume 19.
Lauw, J., Macias, D., Trikha, A., Vendemiatti, J., and Mon-
tanez, G. D. (2019). The Bias-Expressivity Trade-off.
arXiv e-prints, page arXiv:1911.04964.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. (2007). Effi-
cient sparse coding algorithms. In Advances in neural
information processing systems, pages 801–808.
Minka, T. (2003). Estimating a Dirichlet distribution. Tech-
nical report.
Olshausen, B. A. and Field, D. J. (1996). Emergence
of simple-cell receptive field properties by learn-
ing a sparse code for natural images. Nature,
381(6583):607.
Sklar, M. (2014). Fast MLE computation for the Dirichlet
multinomial.