Decoupling the Backward Pass Using Abstracted Gradients
Kyle Rogers
1, a
, Hao Yu
1, b
, Seong-Eun Cho
2 c
, Nancy Fulda
1 d
, Jordan Yorgason
3 e
and Tyler J. Jarvis
2 f
1
Department of Computer Science, Brigham Young University, Provo, Utah, U.S.A.
2
Department of Mathematics, Brigham Young University, Provo, Utah, U.S.A.
3
Cellular Biology and Physiology, Center for Neuroscience, Brigham Young University, Provo, Utah, U.S.A.
Keywords:
Machine Learning, Matrix Abstraction, Biologically Inspired Learning Algorithm, Model Parallelization,
Network Modularization, Backpropagation, Skip Connections, Neuromorphic.
Abstract:
In this work we introduce a novel method for decoupling the backward pass of backpropagation using math-
ematical and biological abstractions to approximate the error gradient. Inspired by recent findings in neuro-
science, our algorithm allows gradient information to skip groups of layers during the backward pass, such that
weight updates at multiple depth levels can be calculated independently. We explore both gradient abstrac-
tions using the identity matrix as well as an abstraction that we derive mathematically for network regions that
consist of piecewise-linear layers (including layers with ReLU and leaky ReLU activations). We validate the
derived abstraction calculation method on a fully connected network with ReLU activations. We then test both
the derived and identity methods on the transformer architecture and show the capabilities of each method on
larger model architectures. We demonstrate empirically that a network trained using an appropriately chosen
abstraction matrix can match the loss and test accuracy of an unmodified network, and we provide a roadmap
for the application of this method toward depth-wise parallelized models and discuss the potential of network
modularization by this method.
1 INTRODUCTION
There are numerous types of deep neural networks
which excel on various tasks, but they heavily rely
on a rigid error backpropagation procedure. From
multilayer perceptrons to convolutional neural nets to
transformer-based architectures, these models com-
pute the gradient of the loss with respect to each
model parameter to find a local minimum on the
model’s loss surface. Gradient computation is an ex-
pensive process and requires the gradient to be calcu-
lated layerwise backward through the neural network.
This learning paradigm is incredibly successful, but
also inflexible as gradient computation requires dif-
ferentiable operations and sequential processing of
data through the network.
a
https://orcid.org/0000-0001-8494-5121
b
https://orcid.org/0009-0003-5106-6042
c
https://orcid.org/0009-0003-3416-461X
d
https://orcid.org/0000-0001-9391-8301
e
https://orcid.org/0000-0002-5687-0676
f
https://orcid.org/0000-0002-3767-029X
These authors contributed equally
In this work we present a new tool, termed ab-
straction matrices, which enable gradient information
to be passed backward to multiple locations in the net-
work in a decoupled fashion. We show that break-
ing up the backward pass in this way does not hinder
model performance and allows more flexibility during
backpropagation. Given this result, we explore sev-
eral implications of our method: 1) theoretical depth-
wise model parallelization, 2) network modulariza-
tion, and 3) algorithm innovation.
Our method introduces a set of matrices
{M
1
,...,M
n
} which correspond to the abstracted
network regions. These matrices are calculated dur-
ing each forward pass in such a way that, when M
k
is
multiplied by gradient information from the network
layer immediately following the kth abstracted
region, the result is a reasonable approximation of the
gradient information which would have been passed
to the preceding layer via traditional backpropagation
methods. Said another way, during the backward
pass, the abstraction matrices {M
1
,...,M
n
} are used
to quickly transmit error information backward across
multiple layer blocks via a simple matrix multiplica-
Rogers, K., Yu, H., Cho, S., Fulda, N., Yorgason, J. and Jarvis, T.
Decoupling the Backward Pass Using Abstracted Gradients.
DOI: 10.5220/0012362800003636
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 3, pages 507-518
ISBN: 978-989-758-680-4; ISSN: 2184-433X
Proceedings Copyright © 2024 by SCITEPRESS Science and Technology Publications, Lda.
507
tion rather than via more complex backpropagation
calculations.
The biological inspiration for this method, which
both motivates and, to an extent, justifies the use of
imperfect gradient approximation in lieu of rigorously
calculated gradients, lies in observed findings from
foundational neuroscience studies that identified feed-
back signals in biological brains are backpropagated
both through localized synaptic retrograde signalling
and through shortened feedback loops to distant lay-
ers (Seger and Miller, 2010; Sesack and Grace, 2010).
The synaptic updates are more precise, whereas the
shortened feedback loops are less accurate but facil-
itate a faster training response because they bypass
many of the intervening neurons (Gerdeman et al.,
2002; Alger, 2002). These biological foundations
provide both the inspiration and a motivating prece-
dent for our study of approximate signal mediation
via abstraction matrices.
The contributions of this paper are as follows: (a)
We present a biologically inspired paradigm for neu-
ral network training based on abstracted gradient in-
formation mediated via simple matrix multiplications
(Sections 3.1 and 3.2); (b) We present a justifica-
tion for a least-squares method for computing abstrac-
tion matrices {M
1
,...,M
k
} in the case that the cor-
responding layers are comprised of piecewise-linear
functions; (c) We introduce a simplification paradigm
M
k
= I k that reduces calculation overhead and is
rooted in biological precedents (Section 3.4); (d) We
validate the effectiveness of abstraction matrices in
both multilayer perception and transformer architec-
tures, and show that the abstraction of layer blocks
via M
k
can be achieved without a drop in training ac-
curacy (Sections 4.1 and 4.2); (e) We examine antici-
pated speedups that could be obtained by implement-
ing our abstracted architecture in a fully parallelized
environment (Section 5.1): and (d) we discuss the ap-
plication of our method in algorithm innovation and
network modularization (Section 5.2).
2 RELATED WORK
Incomplete Gradient-Based Learning: Computing
the error gradient with respect to model parameters
is, in its pure form, a prohibitively intensive pro-
cess. True gradient descent involves iterating through
the entire dataset, computing each weight’s gradi-
ent with respect to calculated error, and then updat-
ing the parameters in proportion to the learning rate.
This is highly impractical, and thus gradients are usu-
ally computed for only a subset of the data at a time
(Amari, 1993). Despite using only an approximation
of the true gradient, SGD methods have proved to be
quite effective in training neural networks in a super-
vised manner. Our work builds on this precedent by
using abstraction matrices to quickly transmit approx-
imations of the calculated error gradient.
Further approximations of the error gradient
have been utilized to implement layer-parallelization
(G
¨
unther et al., 2019; Song et al., 2021). Both works
use optimized approximations of the forward pass and
(Song et al., 2021) requires additional external com-
pute power. Our method does not interfere with the
forward pass, which does exclude the possibility of a
parallelized forward pass, but addresses the more ex-
pensive backward pass. Additionally, our method is
lightweight and only requires additional computation
for solving the least squares problem.
One unexpected aspect of our work (see Section
4.2) is the superior effectiveness of a simplified ap-
proximation of the gradients over a more theoretically
sound abstraction on certain models. While this result
appears highly unintuitive, it is similar to prior work
by (Neftci et al., 2017), who have shown that in neu-
romorphic contexts a neural network can be trained
using random feedback weights multiplied by the er-
ror gradient. More generally, feedback alignment uti-
lizes randomly initialized backward weight matrices
which still facilitate learning as presented by (Lilli-
crap et al., 2016). (Lillicrap et al., 2016) also provide
some justification as to why feedback alignment is ef-
fective which we also rely on partially in motivating
our use of the identity matrix. Despite connections to
these works, we draw our inspiration for, and to some
extent justify, use of the identity matrix from biology
as described in Section 3.4.
Residual Connections: Our work also is themati-
cally related to residual connections as originally pre-
sented in (He et al., 2015; Srivastava et al., 2015).
Conceptually, our work can be viewed as an extension
of this concept to multilayer blocks, with the residual
connection taking the form of an abstraction matrix M
that delivers an approximation of what the calculated
gradients would have been.
3 METHODOLOGY
3.1 Biological Foundations
In order for supervised machine learning or biological
learning to occur, there must be an update in synap-
tic weights based on some error and resulting adjust-
ment. In traditional machine learning this adjustment
is often performed using an error signal that is back-
propagated through the same pathway as the forward
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
508
propagating signal, a method which is very effec-
tive and in some ways analogous to the neurobiologi-
cal mechanisms of backpropagating action potentials
(Stuart and H
¨
ausser, 2001; Letzkus et al., 2006) and
release of retrograde neurotransmitters (e.g., cannabi-
noids and nitric oxide) (Wilson and Nicoll, 2001;
Hardingham et al., 2013). These are effective mech-
anisms for transmitting learning signals across local
connections (i.e., one layer of neurons). However,
such signaling mechanisms do not typically propagate
across multiple layers in biology due to interference
from ongoing activity such as ion channel activation
refractory periods (Burke et al., 2001). Instead, bio-
logical systems seem to prefer a combined approach
where local tuning is performed by backpropagating
action potentials and retrograde transmitters, while
more distant upstream layers are connected and tuned
via long indirect and short direct feedback loops that
bypass the initial layers (Sesack and Grace, 2010).
These feedback loops provide a faster method for tun-
ing upstream neurons and are used throughout the
brain, including, for example, the cortico-basal gan-
glia network for reward learning (Sesack and Grace,
2010).
Our work utilizes an abstraction matrix M which
is computed to allow the gradient to flow around cer-
tain groups of layers of a neural network, a function
analogous to the role feedback loops play in biologi-
cal brains. In traditional backpropagation, the gradi-
ent is computed from the output layer sequentially up
through the rest of the network. Using the matrix M,
however, the gradient calculation can be divided such
that the gradient in different regions of the network
does not have to be computed sequentially.
3.2 Layer Abstraction to Compute the
Gradient
In an effort to design a learning scheme more anal-
ogous to the human brain in deep neural networks
(DNNs), we design a method to abstract the gradient
computation process of several sequential layers of a
DNN using a single matrix we denote M. The lay-
ers abstracted by M thus become a localized learning
region with neurons whose gradient propagation pro-
cess is detached from that of upstream layers. A visu-
alization of this abstraction using M is shown in Fig-
ure 1. In some cases the identity matrix will be used in
lieu of M, as depicted in Figure 1.(3-1) and described
in Section 3.4. In all cases, we assume that the default
regions of the network are trained using backpropa-
gation. As such, during the backward pass of training
the error gradient with respect to the model weights
is computed sequentially backward through the net-
Figure 1: (1-1) shows a model composed of 4 layers, la-
beled as L1 to L4. Input is given through L1, and the for-
ward data flow is indicated by green arrows. Loss is intro-
duced through L4, and the backward data flow is indicated
by yellow arrows. (2-1) and (3-1) show the same model im-
plementing our method, with M abstracting the backward
processes. (3-1) uses identity matrix as M. As illustrated
in (1-2), backward processes of the traditional model are
sequential. In comparison,shown by (2-2) and (3-2), back-
ward processes using our method can become parallelized,
since L2 obtains loss values through M, instead of L3.
work until reaching the region of layers abstracted by
M. Then, instead of continuing the standard back-
propagation procedure, the gradient is approximated
for the abstracted layers using M, and the gradient
computation continues around these layers according
to Equation 1. G
i
represents the gradient of the layer
after (from the backward pass perspective) the layers
abstracted by M, and G
j
is the gradient of the layer
immediately before the layers abstracted by M. Layer
j is among the ancestor layers of layer i.
G
i
M = G
j
(1)
The layers abstracted by M can then either be
trained according to standard backpropagation or a
more simple learning rule which more closely mim-
ics biological learning behavior. Assume, however,
that the layers abstracted are also trained using back-
propagation. In this case G
i
is used to continue the
backward pass through the layers abstracted by M,
but, critically, there now exist two gradient compu-
tation paths after layer i. These two paths can be
computed in parallel, which thus introduces a poten-
tial new type of parallelism in which model training
can be distributed depth-wise.
3.3 Derivation of the Abstraction
Matrix
Consider a neural network N defining a function
F
N
(x) = a
k
(W
k
a
k1
(W
k1
...a
1
(W
1
x))), where a
i
is
the activation function for the ith layer and the matrix
W
i
consists of the weights for layer i. (Note that if
Decoupling the Backward Pass Using Abstracted Gradients
509
the input vector is extended with an additional 1, the
bias term can be included as an additional column in
the weight matrix). For a given i < j k and input
x let L
i
= a
i
(W
i
a
i1
(W
i1
...a
1
(W
1
x))) be the output
of the ith layer, let L
j
= a
j
(W
j
a
j1
(W
j1
...W
i+1
L
i
)))
be the output of the jth layer (thought of as a function
of L
i
), and let L
k
= a
k
(W
k
a
k1
(W
k1
...W
j+1
L
j
))) be
the output of the kth layer.
As a fundamental part of backpropagation we
must compute gradients G
j
= G
i
L
j
L
i
T
, where
L
j
L
i
=
L
j
L
j1
L
j1
L
j2
···
L
i+1
L
i
(2)
is the derivative of the layer L
j
as a function of L
i
. It’s
relatively expensive to compute these derivatives by
computing the corresponding matrix products in (2).
Moreover, each of these matrix derivatives depends
on the value of the input x, so the product must be re-
computed for each x
in a given batch. To emphasize
this dependence, we use a superscript x
on the layers:
L
x
j
L
x
i
.
Expressed mathematically, the main idea of this
paper is to approximate all the different transposed
derivatives
L
x
j
L
x
i
T
with a single abstraction matrix
M, which depends on the batch, but is the same for all
choices of x
.
Our choice of M is motivated by the observation
that any piecewise-linear function f satisfies the dif-
ferential relation f (x) = D
x
f (x) · x, where D
x
f (x) is
the derivative of f with respect to x. Specifically, if
the activation functions in the neural network N are
all piecewise linear (e.g., ReLU or leaky ReLU), then
for any input x we have
L
x
j
=
L
x
j
L
x
i
L
x
i
.
A matrix M
T
that approximates every derivative
L
x
j
L
x
i
should, therefore, give a good approximate solu-
tion to the system of equations
M
T
L
x
i
= L
x
j
B, (3)
where B is the set of all indices in the batch. Assem-
bling the various columns L
x
i
together into a single
matrix L
B
i
and the columns L
x
j
together into a single
matrix L
B
j
, we can write the system (3) as
M
T
L
B
i
= L
B
j
. (4)
The natural choice for an approximate solution to
any (potentially non-square) linear system is the least-
squares solution of (4), which can be written as
M =
L
B
i
T+
L
B
j
T
, (5)
where
L
B
i
T+
is the Moore–Penrose pseudoinverse
of (L
B
i
)
T
. This motivates our choice of the abstraction
matrix M to be defined by (5).
3.4 A Simplified Abstraction Matrix
In the cortico-basal ganglia brain region from which
we take our inspiration, feedback loops that bypass
initial layers do not use an estimation of those layers’
gradients, but instead pass the error signal directly to
the more distant neurons (Sesack and Grace, 2010;
Seger and Miller, 2010). To mimic this behavior, we
also ran a number of experiments with M equal to the
identity matrix rather than the derived value given in
Eq. (5). This simplification (k,M
k
=I) reduces calcu-
lation overhead and is better aligned with biological
precedents; however, it is a less accurate way of esti-
mating the abstracted gradients. Our expectation was
that it would result in reduced neural network perfor-
mance as compared to the more rigorously calculated
M; however, as described in Section 4.2, this was not
the case.
4 EXPERIMENTS
We explore the effectiveness of the layer abstraction
M on a variety of models and training tasks, with
the goal of establishing (a) the performance of mod-
els trained using abstraction matrices as compared to
unmodified models, and (b) the theoretical speedup
which might be gained if the model were parallelized
along the layer blocks approximated by abstraction
matrices. We further consider two distinct meth-
ods for calculating M: The theoretical derivation de-
scribed in Sec. 3.3, and a biologically motivated sim-
plification using the identity matrix (M=I).
4.1 Multilayer Perceptron
While small multilayer perceptron (MLP) (Block
et al., 1962) models are not the most suitable candi-
dates for the downstream implications of our method,
we chose them as an initial testbed due to their sim-
plicity and conformity with the constraints of Section
3.3. Our aim in this experiment is two-fold: (1) to en-
sure that training using M (which creates a decoupled
backward pass) does not decrease final model accu-
racy and (2) to establish an algorithm that enables the
separation of the backward pass into multiple proce-
dures.
To validate the mathematical theory behind layer
abstraction we verify that we can compute and use
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
510
M on a five-layer multilayer perceptron (MLP) com-
prised of fully connected layers with ReLU activa-
tions. We train this MLP model on three simple
image recognition datasets—MNIST (Deng, 2012),
EMNIST (Cohen et al., 2017), and FashionMNIST
(FMNIST) (Xiao et al., 2017)—and compare the
abstracted model’s performance to a baseline MLP
model trained without any abstractions. In this exper-
iment different models are used for different datasets,
with some base code from (Koehler, 2020). In the
model, the computed matrix M spanned three of the
five MLP layers, leaving the input and output layers
unmodified. All models contained hidden layers rang-
ing from dimension (392,196) to (49, 10) or (49,26)
for EMNIST, with larger layers on input side and
smaller layers closer to output layers, as defined in
(Gregor Koehler and Markovics, 2020). Models were
trained for 10 epochs, using the Adam optimizer and
negative log likelihood loss, on batches of size 64 and
an initial learning rate of 0.0001.
For this experiment, we also studied the simple
M = I abstraction. One limitation of using such a sim-
ple abstraction, however, is that the gradient vectors,
G
i
,G
j
, must be of the same size since I is a square
matrix. Thus, to apply the M = I abstraction to this
MLP model we instead utilize a block identity ma-
trix, I
block
= [I 0]. This effectively adds zero padding
to maintain the proper gradient chain. Observe that
using I
block
is essentially dropping gradient informa-
tion in order to project the gradient to a different di-
mension size.
Results are shown in Table 1. We see that the de-
rived matrix M matches the performance of the non-
abstracted model on the MNIST and FMNIST dataset
and nearly matches on EMNIST. As the model archi-
tecture is the same in both cases, this suggests that
like many other aspects of neural architecture design,
the effectiveness of the abstracted gradient calculation
technique is partially dependent on the specific task
being solved. The M = I
block
abstraction performs
measurably worse on all three datasets, validating the
worth of deriving the matrix M as presented in Sec-
tion 3.3.
4.2 Seq2Seq Transformer
Our next experiment leverages the popular Seq2Seq
transformer as presented by (Vaswani et al., 2017),
using its implementation from (Klein et al., 2017).
This model has been leveraged as a base architecture
for many modern DNNs including the GPT line of
language models (Radford et al., 2019; Brown et al.,
2020; Black et al., 2021; Ouyang et al., 2022), audio
processing models (Dong et al., 2018; Gulati et al.,
2020; Chen et al., 2021), and computer vision appli-
cations (Carion et al., 2020; Dosovitskiy et al., 2021).
Thus, examining layer abstractions in the base model
presented by (Vaswani et al., 2017) offers valuable
intuition and preliminary information about layer ab-
stractions in other, more modern, transformer-based
architectures.
We begin by highlighting that the transformer ar-
chitecture does not meet the constraints required by
Section 3.3, as the M matrix must abstract n lin-
ear layers with ReLU activations to be an exact ab-
straction. Therefore, we compare the derived ap-
proximation to both the unmodified baseline and a
biologically-inspired value for M using the identity
matrix (M=I), as discussed in Section 3.4. Per-
formance of both models was evaluated using the
GermanEnglish translation task from the Multi30k
dataset (Elliott et al., 2016). Our transformer model
consisted of six encoder layers and six decoder lay-
ers. In each abstraction model the last 3 attention lay-
ers were abstracted by a single M within the encoder
portion of the network. We evaluated performance us-
ing BLEU (Papineni et al., 2002), METEOR (Baner-
jee and Lavie, 2005), and COMET (Rei et al., 2020)
scores, all of which are established metrics in the field
of machine translation.
Results are shown in Table 2. Somewhat unex-
pectedly, we find that the biologically-inspired gradi-
ents M=I performed much better than the mathemati-
cally derived gradients from Section 3.3. Despite the
transformer model not meeting the constraints of our
derived abstraction we did not expect performance to
suffer as it did. We hypothesize that this is due to vari-
ations in the magnitude and direction of the difference
between M and the true gradients G
i
. To validate this
unusual result, we applied the gradient approximation
M = I to a much larger dataset, IWSLT17 (Cettolo
et al., 2017), again using the GermanEnglish trans-
lation as our benchmark and comparing to our base-
line model. The results, shown in Table 3, confirm
that the approximate gradients transmitted by M=I are
sufficient for effective learning. This means that de-
coupling of the backward pass can be achieved with-
out any significant reduction in model performance.
4.3 Ablation Study
This study shows that the strong performance of M=I
is not caused by skipping unnecessary layers on the
backward pass and that M=I does not cause the ab-
stracted layers to become irrelevant. For this exper-
iment, we set up multiple Seq2Seq transformers us-
ing the same structure presented by (Vaswani et al.,
2017). We held the model dimension fixed at 512, us-
Decoupling the Backward Pass Using Abstracted Gradients
511
Table 1: Test set accuracy and standard deviation across five training runs for each MNIST variant. Test set accuracy is
selected as the maximum test set accuracy after each epoch. Model architectures are as described in Section 4.1.
MNIST FMNIST EMNIST
(acc, stddev) (acc, stddev) (acc, stddev)
baseline model (0.9763, 0.002) (0.8793, 0.003) (0.9022, 0.002)
abstracted gradients (derived) (0.9762, 0.001) (0.8724, 0.002) (0.8948, 0.002 )
abstracted gradients (M = I
block
) (0.9612, 0.002) (0.8642, 0.003) (0.8584, 0.003)
Table 2: Model performance of a baseline transformer as compared to models leveraging both derived and biologically
inspired abstraction matrices M. Evaluations were performed using the Multi30k dataset, ende task. The first number of
each tuple shows the average accuracy across ten training runs. The second number shows the standard deviation across the
ten trials.
baseline model abstracted gradients (M=I) abstracted gradients (derived)
(acc, stddev) (acc, stddev) (acc, stddev)
BLEU (0.386, 0.008) (0.383, 0.008) (0.199, 0.033)
METEOR (0.708, 0.006) (0.705, 0.004) (0.477, 0.042)
COMET (0.774, 0.004) (0.772, 0.004) (0.627, 0.025)
ing a batch size of 32 and Adam optimizer with adap-
tive learning rate. While the original structure from
(Vaswani et al., 2017) used 6 encoder layers and 6
decoder layers, we also tried variants with 3 encoder
layers and 3 decoder layers. We used n to represent
the numbers of encoder and decoder layers. When
using M and n = 6, the last 3 layers of the encoder
block are abstracted. When n = 3, the last layer of the
encoder block are abstracted. We trained each model
setup for 14 epochs with ve trials on the Multi30k
GermanEnglish training data. Then, we picked the
model with lowest validation loss from each trial to
perform the translation task on Multi30k test set. We
measure each trial’s BLEU and METEOR score and
take average across five trials with the same model
setup. We also measured scores after removing those
layers which would have become abstracted layers if
M had been used. The results are in Table 4.
As seen in Table 4, we can first conclude that
the effectiveness of M=I isn’t due to redundancy in
the trained model. Before ablation of the baseline
model, n = 6 baseline performs better than n = 3 base-
line, demonstrating that the extra complexity of the
model is matched by a corresponding increase in per-
formance. It is therefore not the case that the abstrac-
tion matrix is merely approximating a smaller model;
it is instead successfully retaining the complexity of
the larger one. Moreover, ablation of the baseline
model resulted in reduced performance, suggesting
that the layers that would have been abstracted by M
are impactful to the tasks. Therefore, we can con-
clude that M did not skip unnecessary layers during
the backward pass. To determine whether layers still
retain their importance after abstraction using M=I,
we compare the performance of M=I and the baseline
both before and after abstraction. Our data indicates
that both M=I models and the corresponding baseline
models lost similar amounts of performance after re-
moving layers of abstraction locations. This indicates
that those layers retain their importance even after the
abstraction process. We present additional ablation
studies varying the position and size of M in the ap-
pendix.
5 DISCUSSION
5.1 Theoretical Speedup
Efficient, large-scale parallelization of deep learn-
ing models is a highly specialized field, requir-
ing the successful navigation of challenges includ-
ing partitioning, re-materialization, and data trans-
fer (Griewank and Walther, 2000; Chen et al., 2016;
Huang et al., 2019). Such an endeavor is beyond
the scope of this work, and we note in particu-
lar that a naive parallelization implementation of
this novel decoupling method using, for example,
torch.multiprocessing (Foundation, 2023) is un-
likely to be effective. However, we provide here a
small theoretical analysis showing the predicted im-
pact on wall clock time of the backward pass of a
parallelized implementation of our abstracted neural
network.
In Section 3.2, we showed that we can approxi-
mate the gradient calculation of certain groups of lay-
ers which are abstracted by M. Our gradient deriva-
tion method for nonadjacent layers can be written as,
G
i
M = G
j
(6)
where G
i
represents the ith layer’s gradient and G
j
represents the jth layer’s gradient. This allows the
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
512
Table 3: Model performance of a baseline transformer compared to a model using the M=I abstraction matrix. Evaluations
were performed using the IWSLT17 dataset, ende task. Column values show average accuracy and standard deviation
across ten data runs.
baseline model abstracted gradients (M=I) abstracted gradients (derived)
(acc, stddev) (acc, stddev) (acc, stddev)
BLEU (0.294, 0.002) (0.291, 0.002) (0.194, 0.006)
METEOR (0.706, 0.004) (0.705, 0.005) (0.652, 0.006)
Table 4: Model performance of deen translations on Multi30k test set, average of ve trials. Only models from each trial,
scored lowest validation loss, were picked for translation tasks. The term “ablated” following a scoring metric means that
translation tasks were performed after removing certain layers from the models. n refers to the number of encoder layers.
When n=6, ablation removed the last 3 attention layers from model’s encoder block. When n=3, ablation removed the last
attention layer from model’s encoder. The removed layers occupied the same positions as the layers replaced by M in the
abstracted models.
n=6 (baseline) n=3 (baseline) n=6 (M=I) n=3 (M=I)
BLEU 0.386 0.385 0.383 0.374
BLEU (ablated) 0.374 0.380 0.374 0.372
METEOR 0.708 0.709 0.705 0.702
METEOR (ablated) 0.698 0.704 0.696 0.697
gradient for the layers after layer i to be computed us-
ing M, rather than sequentially computing the gradi-
ent through layers j through i. Importantly, the layers
abstracted by M are still updated using backpropaga-
tion, but this occurs after the abstracted matrix M has
mediated the approximate gradients. Networks can
use more than one M to have a parallelized backward
pass through the layers abstracted by M, as shown in
Figure 2. A speedup can be obtained even though
computing M for each backward pass requires addi-
tional matrix operations.
We can model the backward process computation
time of the layers abstracted by M (the light blue
boxes in Figure 2) as shown below:
t
R
ml and t
M
mo + l. (7)
where t
R
is the backward time on a regular neu-
ral network without M, t
M
is the backward time con-
sumption on a neural network with M implemented.
In these equations m is the amount of M matrices we
have in a network. l is the estimated computation time
needed to perform the backward pass on the layers
skipped by a single M matrix. o is the amount of over-
head needed to derive M and passing gradient through
M.
If we require the backward pass of a network to be
δ times faster, then:
t
M
t
R
1
δ
om + l
1
δ
ml (8)
For example, when m = 6, o = 2 and l = 6, we
have a δ = 2 times speed up on a transformer model’s
backward processes. In other words, a 2 times speed
up can be achieved when there are 6 M and overhead
time for each M is only one third of the amount of
time of a group of abstracted layers’ backpropagation.
5.2 Optimization Algorithm Innovation
and Network Modularization
With abstraction matrices used to transfer error sig-
nals across intermediate layers, abstracted layers are
no longer required to perform traditional gradient de-
scent to generate loss values for their upstream layers.
Consequently, abstracted layers could potentially em-
ploy optimization algorithms other than gradient de-
scent, while gradient descent could still be used on
some layers to maintain the network’s performance.
This could open up research opportunities for new
optimization algorithms. More concretely, a network
trained with backpropagation and another optimiza-
tion algorithm, denoted as algorithm A and algorithm
B respectively, could utilize algorithm A in all layers
except the layers abstracted by M and the layers ab-
stracted by M could learn according to algorithm B.
The incoming gradient to the layers abstracted by M
could be ignored, modified or substituted according
to whatever details are required by algorithm B. Thus,
training using M is a robust approach to utilizing dif-
ferent optimization strategies in different regions of
a network. We leave the exploration of these alter-
native optimization strategies to future work, but we
acknowledge their potential in introducing network
modularization.
Assuming a network trained using one such opti-
mization method on the abstracted regions of the net-
work can still achieve acceptable test accuracy, then
an imperfect learning signal could coerce the network
to learn a sort of network modularization. This be-
havior has been proven to be true in biological brains,
where some neurons exhibit learning behavior more
Decoupling the Backward Pass Using Abstracted Gradients
513
Figure 2: A comparison of backpropagation computation with and without M. Observe that the backward computation of the
layers abstracted by multiple M matrices have their gradient computation decoupled.
attuned to learning XOR logic than to other logic (Gi-
don et al., 2020).
While network modularization of this form has
yet to be explored in artificial neural networks, bi-
ological neural network modularization is well un-
derstood. There is a link between how optimization
algorithms are associated with network modulariza-
tion in the biological brain. In the biological brain,
neuronal connections are maintained and updated by
different neurotransmitters, which influences neurons
to exhibit different weight update behaviors (Amunts
et al., 2010; Huang and Reichardt, 2001). More-
over, in different brain regions, neurotransmitter types
vary (Amunts et al., 2010; Amunts and Zilles, 2015;
Huang and Reichardt, 2001; Paxinos and Mai, 2003).
Such variation results in different weight update be-
haviors forming distinct brain regions which carry out
different functions (Amunts et al., 2010), (Amunts
and Zilles, 2015; Huang and Reichardt, 2001; Paxinos
and Mai, 2003). With our M algorithm enabling usage
of different optimization algorithms, we can mimic
the existence of different neurotransmitters in differ-
ent brain regions. Therefore, training with M enables
a way toward bringing biological brain modulariza-
tion into artificial neural networks.
6 CONCLUSIONS
This work has presented a biologically-inspired learn-
ing mechanism whereby approximate gradient infor-
mation is propagated quickly through the network via
a set of abstraction matrices M
k
. This decouples gra-
dient computation of each set of abstracted layers.
Decoupled computation allows the weight updates
within each block of abstracted layers to be theoret-
ically executed in parallel, with potential applications
for speeding up the backward pass of large compu-
tationally expensive networks. The next logical step
in this line of research would be the utilization of ab-
straction matrices to create a depth-wise parallelized
network architecture, and to explore potential appli-
cations toward online learning and real-time network
updates.
The gradient abstraction techniques introduced in
this work have research potential that extends beyond
gradient decoupling. In biological brains, cortico-
basal ganglia pathways mimicked in our work by
abstraction matrices – and localized logic updates are
not mutually exclusive. It is often the case that im-
precise learning signals are propagated quickly via
the cortico-basal ganglia feedback loops, then fol-
lowed by more precise updates mediated by retro-
grade synaptic connections between neurons (Sesack
and Grace, 2010; Wilson and Nicoll, 2001). Our
method for propagating abstracted gradients could be
leveraged toward a similar setup where network pa-
rameters are updated both via the abstraction matrix
and also via more traditional methods. This idea has
particular relevance in the domain of neuromorphic
computing and spiking neural networks.
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
514
REFERENCES
Alger, B. E. (2002). Retrograde signaling in the regulation
of synaptic transmission: Focus on endocannabinoids.
Progress in Neurobiology, 68(4):247 – 286. Cited by:
486.
Amari, S. (1993). Backpropagation and stochastic gradient
descent method. Neurocomputing, 5:185–196.
Amunts, K., Lenzen, M., Friederici, A. D., Schleicher, A.,
Morosan, P., Palomero-Gallagher, N., and Zilles, K.
(2010). Broca’s region: Novel organizational princi-
ples and multiple receptor mapping. PLOS Biology,
8(9):1–16.
Amunts, K. and Zilles, K. (2015). Architectonic map-
ping of the human brain beyond brodmann. Neuron,
88(6):1086–1107.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic
metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the ACL
Workshop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summarization,
pages 65–72, Ann Arbor, Michigan. Association for
Computational Linguistics.
Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S.
(2021). GPT-Neo: Large Scale Autoregressive Lan-
guage Modeling with Mesh-Tensorflow.
Block, H. D., Knight Jr, B., and Rosenblatt, F. (1962). Anal-
ysis of a four-layer series-coupled perceptron. ii. Re-
views of Modern Physics, 34(1):135.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler,
D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,
E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners.
Burke, D., Kiernan, M. C., and Bostock, H. (2001). Ex-
citability of human axons. Clinical Neurophysiology,
112(9):1575 – 1585. Cited by: 368.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In Computer Vision–ECCV
2020: 16th European Conference, Glasgow, UK, Au-
gust 23–28, 2020, Proceedings, Part I 16, pages 213–
229. Springer.
Cettolo, M., Federico, M., Bentivogli, L., Niehues, J.,
St
¨
uker, S., Sudoh, K., Yoshino, K., and Federmann,
C. (2017). Overview of the IWSLT 2017 evaluation
campaign. In Proceedings of the 14th International
Conference on Spoken Language Translation, pages
2–14, Tokyo, Japan. International Workshop on Spo-
ken Language Translation.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Train-
ing deep nets with sublinear memory cost. arXiv
preprint arXiv:1604.06174.
Chen, X., Wu, Y., Wang, Z., Liu, S., and Li, J. (2021).
Developing real-time streaming transformer trans-
ducer for speech recognition on large-scale dataset.
In ICASSP 2021-2021 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), pages 5904–5908. IEEE.
Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A.
(2017). Emnist: Extending mnist to handwritten let-
ters. In 2017 international joint conference on neural
networks (IJCNN), pages 2921–2926. IEEE.
Deng, L. (2012). The mnist database of handwritten digit
images for machine learning research. IEEE Signal
Processing Magazine, 29(6):141–142.
Dong, L., Xu, S., and Xu, B. (2018). Speech-transformer: a
no-recurrence sequence-to-sequence model for speech
recognition. In 2018 IEEE international conference
on acoustics, speech and signal processing (ICASSP),
pages 5884–5888. IEEE.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale.
Elliott, D., Frank, S., Sima’an, K., and Specia, L. (2016).
Multi30k: Multilingual english-german image de-
scriptions. In Proceedings of the 5th Workshop on
Vision and Language, pages 70–74. Association for
Computational Linguistics.
Foundation, T. P. (2023). Multiprocess-
ing package - torch.multiprocessing.
https://pytorch.org/docs/stable/multiprocessing.html.
Gerdeman, G. L., Ronesi, J., and Lovinger, D. M. (2002).
Postsynaptic endocannabinoid release is critical to
long-term depression in the striatum. Nature Neuro-
science, 5(5):446 – 451. Cited by: 607.
Gidon, A., Zolnik, T. A., Fidzinski, P., Bolduan, F., Pa-
poutsi, A., Poirazi, P., Holtkamp, M., Vida, I., and
Larkum, M. E. (2020). Dendritic action potentials and
computation in human layer 2/3 cortical neurons. Sci-
ence, 367(6473):83–87.
G
¨
unther, S., Ruthotto, L., Schroder, J. B., Cyr, E. C., and
Gauger, N. R. (2019). Layer-parallel training of deep
residual neural networks.
Gregor Koehler, A. A. and Markovics, P. (2020).
Mnist handwritten digit recognition in pytorch.
https://nextjournal.com/gkoehler/pytorch-mnist.
Griewank, A. and Walther, A. (2000). Algorithm 799: re-
volve: an implementation of checkpointing for the
reverse or adjoint mode of computational differenti-
ation. ACM Transactions on Mathematical Software
(TOMS), 26(1):19–45.
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu,
J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R.
(2020). Conformer: Convolution-augmented Trans-
former for Speech Recognition. In Proc. Interspeech
2020, pages 5036–5040.
Hardingham, N., Dachtler, J., and Fox, K. (2013). The role
of nitric oxide in pre-synaptic plasticity and home-
ostasis. Frontiers in Cellular Neuroscience, (OCT).
Cited by: 176; All Open Access, Gold Open Access,
Green Open Access.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-
ual learning for image recognition.
Decoupling the Backward Pass Using Abstracted Gradients
515
Huang, E. and Reichardt, L. (2001). Neurotrophins: Roles
in neuronal development and function. Annual Review
of Neuroscience, 24:677 736. Cited by: 3394; All
Open Access, Green Open Access.
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen,
M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. (2019).
Gpipe: Efficient training of giant neural networks us-
ing pipeline parallelism. Advances in neural informa-
tion processing systems, 32.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M.
(2017). Opennmt: Open-source toolkit for neural ma-
chine translation. In Proc. ACL.
Koehler, G. (2020). Mnist handwritten digit recognition in
pytorch.
Letzkus, J. J., Kampa, B. M., and Stuart, G. J. (2006).
Learning rules for spike timing-dependent plasticity
depend on dendritic synapse location. Journal of Neu-
roscience, 26(41):10420 – 10429. Cited by: 217; All
Open Access, Bronze Open Access, Green Open Ac-
cess.
Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman,
C. J. (2016). Random synaptic feedback weights sup-
port error backpropagation for deep learning. Nature
communications, 7(1):13276.
Neftci, E. O., Augustine, C., Paul, S., and Detorakis, G.
(2017). Event-driven random back-propagation: En-
abling neuromorphic deep learning machines. Fron-
tiers in Neuroscience, 11.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K.,
Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller,
L., Simens, M., Askell, A., Welinder, P., Christiano,
P., Leike, J., and Lowe, R. (2022). Training language
models to follow instructions with human feedback.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th Annual Meet-
ing of the Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania, USA.
Association for Computational Linguistics.
Paxinos, G. and Mai, J. K. (2003). The Human Nervous
System. Cited by: 24.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. OpenAI blog, 1(8):9.
Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. (2020).
COMET: A neural framework for MT evaluation. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 2685–2702, Online. Association for Computa-
tional Linguistics.
Seger, C. A. and Miller, E. K. (2010). Category learning
in the brain. Annual Review of Neuroscience, 33:203
219. Cited by: 242; All Open Access, Green Open
Access.
Sesack, S. R. and Grace, A. A. (2010). Cortico-basal
ganglia reward network: Microcircuitry. Neuropsy-
chopharmacology, 35(1):27 47. Cited by: 732; All
Open Access, Bronze Open Access, Green Open Ac-
cess.
Song, Y., Meng, C., Liao, R., and Ermon, S. (2021). Ac-
celerating feedforward computation via parallel non-
linear equation solving.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015).
Highway networks. arXiv preprint arXiv:1505.00387.
Stuart, G. J. and H
¨
ausser, M. (2001). Dendritic coincidence
detection of epsps and action potentials. Nature Neu-
roscience, 4(1):63 – 71. Cited by: 279.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need.
Wilson, R. and Nicoll, R. (2001). Endogenous cannabi-
noids mediate retrograde signalling at hippocampal
synapses. Nature, 410(6828):588 592. Cited by:
1267.
Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-
mnist: a novel image dataset for benchmarking ma-
chine learning algorithms. CoRR, abs/1708.07747.
APPENDIX
Extended Ablation Results and
Additional Experiments
Extended Results. We present here an expanded ta-
ble demonstrating results on both derived M and M=I,
along with the additional SacreBLEU scoring metric.
This experiment seeks to answer the following
question: Is it possible that the use of the M=I ab-
straction is effective, not because M=I is a reasonable
approximation for the gradients, but because the net-
work is learning, in effect, to ignore the intervening
network layers. In other words, would it be more ef-
fective to simple train a smaller network rather than
using M=I as an abstracted gradient representation?
We address this question by comparing abstracted
models of various size with corresponding non-
abstracted models in which the layers bridged by M
have been entirely removed. If the ablated version of
each model matches the performance of the abstracted
model, then that would suggest that the abstraction is
in fact not useful for learning, and is instead simply
functioning as a mechanism to simulate a model with
fewer layers overall. Results are shown in Table 5.
Abstraction Position Experiment. Here we in-
vestigate the impact of abstraction layer positioning
on model performance. We provide an ablation study
where we vary the position of the abstraction matrix
within the transformer model. Results are shown in
Table 6.
Abstraction Size Experiment. The power of
potential parallelization increases as we define addi-
tional abstractions or increase abstraction size. As the
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
516
results show in Table 7, abstraction on different po-
sitions perform similar to each other before ablation.
This is a positive result as the abstraction’s usefulness
is not necessarily limited by position. However, after
ablation, A3(D234) has the most amount of perfor-
mance drop. We do not yet have an conrete explana-
tion for this behavior, so we will explore it in future
works.
We observe that for each pair of corresponding
models, the ablated version performs less well than
the abstracted version, suggesting that the model is
indeed leveraging the inherent learning capacity of
the additional layers. Additionally, the performance
drop from the abstracted models is about the same as
the drop seen in the baseline model when the same
layers are removed. We therefore conclude that the
question above can be answered in the negative. The
abstraction M=I is indeed preserving useful learning
capacity in the abstracted layers. We note, however,
that the performance of the model does seem slightly
better when only one layer is abstracted rather than
three. This raises the question of how many network
layers can be effectively abstracted at one time before
network performance begins to degrade. Further re-
search is needed before this question can be answered
with confidence.
Additional Biological Foundations
Our work is inspired by recent findings in neuro-
science. In both biological learning and machine
learning via artificial neural networks, an update
mechanism must exist which adjusts the synaptic
weights based on observed error in the output signal.
Traditional machine learning achieves this via back-
propagated error signals, a method which is analo-
gous to the neurobiological mechanisms of backprop-
agating action potentials (Stuart and H
¨
ausser, 2001;
Letzkus et al., 2006) and release of retrograde neuro-
transmitters (e.g. cannabinoids and nitric oxide) (Wil-
son and Nicoll, 2001; Hardingham et al., 2013)
Our work expands upon this foundation by intro-
ducing an alternate approach to the transmission of
error gradients. Researchers have observed that, in
biological brains, neither backpropagating action po-
tentials nor retrograde neurotransmitter signals typi-
cally propagate across multiple layers due to interfer-
ence from ongoing activity and ion channel activation
refractory periods (Burke et al., 2001). Instead, bio-
logical systems seem to rely on feedback loops more
distant upstream layers are connected via feedback
loops that bypass the initial layers (Sesack and Grace,
2010). We attempt to implement a similar system via
the abstraction matrix M, which is inspired in partic-
ular by the cortico-basal ganglia network for reward
learning (Sesack and Grace, 2010).
Research on cortico-basal ganglia dopamine net-
work connectivity and behavioral implications is on-
going, and much of the circuit framework is still hy-
pothetical. However, a consistent theme is that ven-
tral tegmental area (VTA) dopamine cell bodies re-
ceive sensory input and project to the ventral striatum
(i.e. nucleus accumbens) where dopamine release oc-
curs in response to rewards and associated sensory
stimuli to encode valence and form learned associ-
ations (Sesack and Grace, 2010). When a sensory
stimuli is reinforcing, it drives dopamine release onto
output medium spiny neurons (MSNs), concomitant
to glutamate signals from cortical, thalamic, amyg-
dala and hippocampal inputs encoding additional im-
portant aspects of the stimuli (such as its emotional
value) (Seger and Miller, 2010). The dopamine sig-
nal acts as a gain modulator to facilitate or diminish
propagation of that signal through MSNs. The MSNs
then propagate the signals to their respective output
layers called the direct and indirect pathways. Im-
portantly, those two pathways also form two parallel
feedback loops that have different numbers of layers
and can thus influence future VTA dopaminergic ac-
tivity through either a short or long feedback mech-
anism (Seger and Miller, 2010; Sesack and Grace,
2010). Furthermore, local striatal synaptic activity
is still tuned by retrograde cannabinoid neurotrans-
mission (Gerdeman et al., 2002; Alger, 2002). Thus,
biological mechanisms can include backpropagating
techniques (i.e. retrograde transmission) or feedback
loops that skip layers to tune upstream activity.
The complexity of neurobiological feedback
mechanisms in biological brains are too complex to
imitate in their entirety. However, we take inspiration
from the behavior of dopamine signals in the cortico-
basal ganglia network in the creation of an abstraction
matrix M which allows error signals to bypass clus-
tered groups of layers in an artificial neural network.
Traditionally, artificial neural networks have ignored
these longer feedback loops and have typically fo-
cused only on backpropagating error signals between
proximate neurons. We believe that this oversight
fundamentally limits the opportunities for learning in
deep neural networks. The abstraction matrix M in-
troduces an alternative pathway for the propagation of
error signals, and as such may open new computation
paradigms for deep learning systems.
Decoupling the Backward Pass Using Abstracted Gradients
517
Table 5: Ablation study. B=BLEU, SB=SacreBLEU, M=METEOR, (a)=ablated model. Each tuple shows average final
accuracy and standard deviation across five training runs.
n=6 (baseline) n=3 (baseline) n=6 (M=I) n=3 (M=I) n=6 (derived) n=3 (derived)
B (0.386, 0.008) (0.385, 0.011) (0.383, 0.008) (0.374, 0.010) (0.199, 0.033) (0.317, 0.017)
B(a) (0.374, 0.010) (0.380, 0.006) (0.374, 0.007) (0.372, 0.006) (0.153, 0.038) (0.299, 0.009)
SB (0.386, 0.008) (0.385, 0.011) (0.383, 0.008) (0.374, 0.010) (0.199, 0.033) (0.317, 0.017)
SB(a) (0.374, 0.007) (0.380, 0.006) (0.374, 0.007) (0.372, 0.006) (0.153, 0.038) (0.299, 0.009)
M (0.708, 0.006) (0.709, 0.007) (0.705, 0.004) (0.702, 0.007) (0.477, 0.042) (0.623, 0.016)
M(a) (0.698, 0.006) (0.704, 0.004) (0.696, 0.007) (0.697, 0.007) (0.407, 0.058) (0.603, 0.011)
Table 6: Performances of different abstraction sizes on Multi30k dataset over 10 trials. Models are transformer models with
6 encoder layers and 6 decoder layers. Abstraction method is M = I. A3(E345) means abstract 3 consecutive layers with
a single M, from 3rd to 5th encoder layers. E and D means the encoder and decoder layers respectively. And (a) means
performances measured after ablated abstracted layers.
A3(E234) A3(E345) A3(D234) A3(D345)
(acc, stddev) (acc, stddev) (acc, stddev) (acc, stddev)
BLEU (0.368, 0.010) (0.371, 0.008) (0.362, 0.007) (0.362, 0.010)
BLEU(a) (0.332, 0.006) (0.353, 0.011) (0.307, 0.010) (0.334, 0.011)
COMET (0.759, 0.004) (0.759, 0.005) (0.751, 0.004) (0.753, 0.005)
COMET(a) (0.734, 0.005) (0.749, 0.006) (0.679, 0.017) (0.715, 0.008)
METEOR (0.692, 0.007) (0.690, 0.006) (0.679, 0.007) (0.683, 0.007)
METEOR(a) (0.642, 0.009) (0.671, 0.009) (0.614, 0.014) (0.644, 0.009)
Table 7: Performances of different abstraction sizes on Multi30k dataset over 10 trials. Models above are transformer models
with 6 encoder layers and 6 decoder layers. Abstraction method in this experiment is M = I. A6(L3-5) means abstracted 6
layers in total, with two blocks of 3 consecutive layers, from 3rd to 5th layers in both the encoders and decoders.
A4(L4-5) A6(L3-5) A8(L2-5)
(acc, stddev) (acc, stddev) (acc, stddev)
BLEU (0.362, 0.010) (0.359, 0.008) (0.360, 0.011)
COMET (0.754, 0.006) (0.750, 0.005) (0.746, 0.007)
METEOR (0.684, 0.009) (0.678, 0.008) (0.678, 0.008)
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
518