Investigation into the Training Dynamics of Learned Optimizers
Jan Sobotka, Petr Šimánek and Daniel Vašata
Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, Prague, Czech Republic
Keywords: Learning to Optimize, Meta-Learning, Optimization.
Abstract:
Optimization is an integral part of modern deep learning. Recently, the concept of learned optimizers has
emerged as a way to accelerate this optimization process by replacing traditional, hand-crafted algorithms
with meta-learned functions. Despite the initial promising results of these methods, issues with stability and
generalization still remain, limiting their practical use. Moreover, their inner workings and behavior under
different conditions are not yet fully understood, making it difficult to come up with improvements. For
this reason, our work examines their optimization trajectories from the perspective of network architecture
symmetries and parameter update distributions. Furthermore, by contrasting the learned optimizers with their
manually designed counterparts, we identify several key insights that demonstrate how each approach can
benefit from the strengths of the other.
1 INTRODUCTION
In the last decade, the field of machine learning has
undergone many trends, the most significant being
the replacement of manual feature engineering with
learned features. Relatively recently, a similar per-
spective has been applied to optimization, the driving
force behind deep learning. Specifically, the subfield
of meta-learning called learning to optimize (L2O)
has the ambitious goal of learning the optimization
itself, more or less replacing the hand-engineered al-
gorithm approach.
There is a substantial body of literature devoted to
learning to optimize with available gradient informa-
tion. These works try to learn a "gradient descent-like" method to optimize the parameters (weights and biases) of neural networks. The approaches range from learning an adaptive step size (learning rate) to learning the whole gradient descent algorithm with recurrent neural networks (Andrychowicz et al. (2016), Lv et al. (2017), Metz et al. (2020), Metz et al. (2021), Simanek et al. (2022)).
Compared to the relatively long history of hand-
engineered optimizers, where extensive research into
the theory, together with many empirical findings,
informed numerous advances (momentum Polyak
(1964), AdaGrad Duchi et al. (2011), RMSProp
Tieleman et al. (2012), Adam Kingma and Ba
(2017)), the field of learning to optimize is in its
nascent stages. Many fundamental questions remain
unanswered, and an extensive investigation into the
training dynamics of learned optimizers is still lack-
ing, which hinders well-informed development and
further progress of the whole field. This is especially
important since these methods have been shown to be
brittle, difficult to scale, and relatively ineffective at
generalizing across diverse problems.
Additionally, it might be possible to improve man-
ually engineered optimization algorithms by noticing
which strategies L2O has found to be useful. In other
words, research on hand-designed and learned opti-
mizers can be mutually beneficial.
In light of the aforementioned lack of understand-
ing and our desire to improve traditional optimizers,
we empirically study various properties of the opti-
mization trajectories. Specifically, we 1) analyze the
impact of symmetries introduced by network archi-
tectures; 2) examine the heavy-tailedness of noise in
the predicted parameter updates; 3) investigate the up-
date covariance; and lastly, 4) inspect the progression
of the update size.
Our results show several major differences as well
as similarities between the training dynamics under
traditional and learned optimizers. Moreover, we no-
tice close parallels to the recently proposed optimizer
Lion (Chen et al., 2023) and shed more light on the
strengths of these two approaches.
In particular, our experiments demonstrate that
similarly to Lion, learned optimizers break the ge-
ometric constraints on gradients that stem from ar-
chitectural symmetries and that deviations from these
constraints are significantly larger than those ob-
served with previous optimizers like Adam or SGD.
In the case of learned optimizers, we observe that a
large deviation from these geometric constraints al-
most always accompanies the initial rapid decrease in
loss during optimization. More importantly, regular-
izing against it severely damages performance, hint-
ing at the importance of this freedom in L2O parame-
ter updates.
Furthermore, by studying the noise and covariance
in the L2O parameter updates, we also demonstrate
that, on the one hand, L2O updates exhibit less heavy-
tailed stochastic noise, and, on the other hand, the
variation in updates across different samples is larger.
The paper is organized as follows. In Section 2,
learned optimizers and the Lion optimizer are intro-
duced. Then Section 3 describes the theoretical anal-
ysis of symmetries, gradient geometry, stochastic gra-
dient noise, and update covariance. This, in turn,
leads to a series of experiments presented in Section 4
followed by a discussion in Section 5. Section 6 connects our work to previous studies and shows promising parallels, and Section 7 concludes.
2 BACKGROUND
Let us start by explaining the learning to optimize
method, which is the main subject of our study, and
then follow with the recently introduced Lion opti-
mizer.
2.1 Learning to Optimize
The primary task is to minimize a given function L(θ)
by optimizing its vector of parameters θ. We focus
primarily on first-order optimization methods, where
at each step t of the algorithm, the optimizer has access to the gradient $\nabla L(\theta_t)$ and suggests an update $g_t$ to get $\theta_{t+1}$ as
$$\theta_{t+1} = \theta_t + g_t. \quad (1)$$
Our approach follows Andrychowicz et al. (2016).
The core idea is to use a recurrent neural network M, parameterized by φ, that acts as the optimizer. Specifically, at each time step t, this network takes its hidden state $h_t$ together with the gradient $\nabla L(\theta_t)$ and produces an update $g_t$ and a new hidden state $h_{t+1}$,
$$[g_t, h_{t+1}] = M\left(\nabla_\theta L(\theta_t), h_t, \phi\right). \quad (2)$$
The sequence obtained by (1) then aims to con-
verge to a local minimum of L. Therefore, M is called
the optimizer, or meta-learner, and L(θ) is called the
optimizee.
Figure 1: Single optimization step of L2O.
The parameters φ of the optimizer M are learned
by stochastic gradient descent and updated every u-th
training step where the hyper-parameter u is called the
unroll. The loss of the optimizer is the expectation of
the weighted unrolled trajectory of the optimizee,
$$\mathcal{L}(\phi) = \mathbb{E}_L\left[\sum_{\tau=1}^{u} w_\tau L(\theta_{\tau + ju - 1})\right], \quad (3)$$
where $\mathbb{E}_L$ is the expectation with respect to some distribution of optimizee functions L and $w_\tau$ are weights that are typically set to 1. Additionally, j denotes the number of previously unrolled trajectories; therefore, the (j+1)-th unrolled trajectory corresponds to training steps $t = ju, ju+1, \dots, (j+1)u - 1$. In practice, there is often only one optimizee function L, so the expectation in (3) is removed.
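To make the unrolled meta-objective in (3) concrete, the following PyTorch-style sketch accumulates the weighted optimizee loss over one unroll of u steps and keeps the computation graph so the result can be differentiated with respect to φ. The function and argument names (optimizer_net, optimizee_loss) are our own illustrative choices rather than the authors' implementation, and details such as batching and state detachment between unrolls are omitted.

import torch

def unrolled_meta_loss(optimizer_net, optimizee_loss, theta, hidden, u, w=None):
    """Accumulate the weighted optimizee loss over one unroll of u steps, cf. Eq. (3).

    optimizer_net  : callable (grad, hidden) -> (update, new_hidden), parameterized by phi
    optimizee_loss : callable theta -> scalar loss L(theta)
    theta          : current optimizee parameters (flat tensor with requires_grad=True)
    hidden         : optimizer hidden state
    """
    meta_loss = 0.0
    for tau in range(u):
        loss = optimizee_loss(theta)
        # Gradient w.r.t. the optimizee parameters; create_graph=True keeps the graph
        # so that the meta-loss remains differentiable w.r.t. the optimizer parameters phi.
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        update, hidden = optimizer_net(grad, hidden)
        theta = theta + update                    # Eq. (1)
        weight = 1.0 if w is None else w[tau]     # w_tau, typically 1
        meta_loss = meta_loss + weight * optimizee_loss(theta)
    return meta_loss, theta, hidden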
In addition to updates of φ along a single opti-
mizee training (an inner loop), there is also an outer
loop where the entire optimizee training is restarted
from some initial $\theta_{t=0}$ while the parameters φ con-
tinue to learn. In particular, the outer loop takes place
only during the meta-training phase, where the main
goal is to learn a good set of parameters φ. Evalu-
ation of the learned optimizer then happens in what
is sometimes called meta-testing where φ is fixed and
only the optimizee’s parameters θ are updated.
In practice, when there are several thousand or
more parameters in θ, applying a general recurrent
neural network is almost impossible. The authors in
Andrychowicz et al. (2016) avoid this issue by imple-
menting the update rule coordinate-wise using a two-
layer LSTM network with shared parameters. This
means that the optimizer M is a small network with
multiple instances operating on each parameter of the
optimizee separately while sharing its parameters φ
across all of them. A visualization of a single opti-
mization step is presented in Figure 1.
We refer the reader to Andrychowicz et al. (2016)
for further algorithmic details and preprocessing.
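As a rough illustration of this coordinate-wise design, the sketch below applies a small two-layer LSTM with shared weights to every scalar parameter independently. The hidden size of 20, the linear output head, and the tensor shapes are our own assumptions for illustration; they are not the exact architecture or preprocessing of Andrychowicz et al. (2016).

import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    """Two-layer LSTM applied to each optimizee parameter separately,
    sharing the same weights phi across all coordinates."""

    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, num_layers=2)
        self.head = nn.Linear(hidden_size, 1)  # maps the hidden state to a scalar update

    def forward(self, grad, hidden=None):
        # grad: flat tensor of optimizee gradients, shape (n_params,).
        # Each coordinate is treated as an independent element of a length-1 sequence.
        x = grad.view(1, -1, 1)                  # (seq_len=1, batch=n_params, input_size=1)
        out, hidden = self.lstm(x, hidden)
        update = self.head(out).view(-1)         # one scalar update per parameter
        return update, hidden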
2.2 Lion Optimizer
By applying program search techniques to the discov-
ery of optimization algorithms, the authors of Chen
et al. (2023) came up with a simple yet highly ef-
fective adaptive algorithm called Lion (EvoLved Sign
Momentum). The major distinction between Lion and
other adaptive optimizers lies in its uniform update
magnitude for each parameter, calculated through the
sign operation. Remarkably, the authors demon-
strated that this relatively simple optimization algo-
rithm can outperform standard optimizers, such as
Adam or Adafactor, across a wide range of bench-
marks.
A single step of the Lion optimizer starts with a
standard weight decay step with a strength parameter
λ. Then, Lion updates the parameters with the sign
of the interpolation between the current gradient and
momentum. The size of the step is determined by a learning rate η and the interpolation is controlled by an exponential moving average (EMA) factor $\beta_1$. Finally, an update of the EMA is carried out using the EMA factor $\beta_2$.
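For reference, a minimal sketch of a single Lion step following the description above; the default values $\beta_1 = 0.9$ and $\beta_2 = 0.99$ are taken from Chen et al. (2023) and should be treated as assumptions rather than part of this work.

import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter tensor (in-place sketch)."""
    # Decoupled weight decay with strength lambda (here: weight_decay).
    param.mul_(1.0 - lr * weight_decay)
    # Update direction: sign of the interpolation between momentum and current gradient.
    update = torch.sign(beta1 * momentum + (1.0 - beta1) * grad)
    param.add_(update, alpha=-lr)
    # EMA update of the momentum with factor beta2.
    momentum.mul_(beta2).add_(grad, alpha=1.0 - beta2)
    return param, momentum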
According to the authors, the sign operation in
Lion introduces noise into the updates, which serves
as a form of regularization. Empirical observations by
the authors and previous studies Foret et al. (2021),
Neelakantan et al. (2015) suggest that this noise can
contribute to improved generalization. Additionally,
their analysis indicated that Lion’s updates generally
exhibit a larger norm compared to those of Adam.
3 METHODS
3.1 Symmetries and Gradient Geometry
In this section, we formalize the notion of differen-
tiable symmetries, discuss the associated geometric
properties of gradients in the context of neural net-
works, and present important examples.
3.1.1 Differentiable Symmetries
Symmetries of a function express its property of be-
ing invariant under a certain group of transformations.
When these transformations are differentiable, we call
them differentiable symmetries. In this paper, we con-
sider symmetries of loss functions, and the subject of
transformations in our case are the neural network’s
parameters.
More formally, a function f(θ) where $\theta \in \mathbb{R}^n$ possesses a differentiable symmetry if it is invariant under the differentiable action $\psi : \mathbb{R}^n \times G \to \mathbb{R}^n$ of a group G on the function's argument space. In other words, for all θ and all $\alpha \in G$, the output of the function does not change:
$$f(\psi(\theta, \alpha)) = f(\theta). \quad (4)$$
Figure 2: Example function $g(x, y) = x^2 + y^2$.
Note that our topic of interest is different from the
work on equivariant neural networks, where the sym-
metry groups act on the input space, output space, and
hidden feature spaces. Specifically, their focus is on
the equivariance or invariance of $y = f_\theta(x)$ with re-
spect to x and y, but in the following sections, we
are interested in symmetries acting on the parameter
space instead.
3.1.2 Geometric Constraints on Gradients
Let us start with an intuitive example to introduce the
concept of gradient geometry. Consider the function
$g(x, y) = x^2 + y^2$ depicted in Figure 2 that possesses rotational symmetry
$$\psi(x, y, \alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \quad (5)$$
i.e. rotating around the origin by some angle α leaves the output of g unchanged.
Taking the partial derivative at the identity α = 0, we get
$$\partial_\alpha \psi \big|_{\alpha=0} = \begin{pmatrix} -y \\ x \end{pmatrix}, \quad (6)$$
where $\partial_\alpha \psi |_{\alpha=0}$ is the vector field that generates the symmetry. Taking the gradient of g at (x, y) yields
$$\nabla g = 2 \begin{pmatrix} x \\ y \end{pmatrix}. \quad (7)$$
Hence, the gradient is perpendicular to the vector field:
$$\left\langle \nabla g, \partial_\alpha \psi \big|_{\alpha=0} \right\rangle = 0 \quad \text{for all } x, y. \quad (8)$$
See Figure 3 for a depiction of this geometrical property.
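As a quick numerical sanity check of (5)-(8), the short sketch below verifies that the gradient of g(x, y) = x^2 + y^2 is orthogonal to the symmetry-generating vector field (-y, x) at randomly sampled points; it is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

def grad_g(x, y):
    # Gradient of g(x, y) = x^2 + y^2, cf. Eq. (7).
    return np.array([2.0 * x, 2.0 * y])

def symmetry_field(x, y):
    # Generator of the rotational symmetry, cf. Eq. (6).
    return np.array([-y, x])

for _ in range(5):
    x, y = rng.normal(size=2)
    inner = grad_g(x, y) @ symmetry_field(x, y)
    assert abs(inner) < 1e-12                 # orthogonality, cf. Eq. (8)
print("gradient is orthogonal to the symmetry generator at all sampled points")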
As we now show, such perpendicularity is not specific to the presented example; it
Figure 3: Gradient geometry for the function $g(x, y) = x^2 + y^2$. Left: Directions of the gradient $\nabla g$. Middle: Directions of the vector field $\partial_\alpha \psi |_{\alpha=0}$ that generates the symmetry. Right: Orthogonality of the gradient and the symmetry-generating vector field.
holds generally. Let us assume that f is almost ev-
erywhere differentiable. By taking the derivative with
respect to α at the identity $\alpha = I \in G$ of both sides of (4) and using the chain rule of differentiation, we get
$$\left\langle \nabla f, \partial_\alpha \psi \big|_{\alpha=I} \right\rangle = 0, \quad (9)$$
which implies that for all θ, the gradient $\nabla f$, whenever it exists, is perpendicular to the vector field $\partial_\alpha \psi |_{\alpha=I}$ that generates the symmetry.
Similar relationships hold for the Hessian matrices
as can be found in Kunin et al. (2021) together with
formal proofs and a more detailed discussion of the
above.
Since our analysis focuses on learning to optimize
methods, we investigate how the predicted parame-
ter updates adhere to the constraints on gradients and
how deviations from these geometric constraints com-
pare to those of SGD, Adam, and Lion.
Additionally, part of our analysis includes regu-
larization of learned optimizers during meta-training
against the aforementioned deviations. In such cases,
we calculate the absolute deviations for each pre-
dicted update during meta-training using the opti-
mizee’s detached parameters and include these abso-
lute deviations as part of the optimizer’s loss.
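A minimal sketch of how such a regularized meta-objective could be assembled, combining the unrolled loss of (3) with a penalty of strength β on the symmetry-breaking deviations. The function symmetry_deviation stands for any of the deviation measures defined below, and all names are placeholders of ours, not the authors' code.

def regularized_meta_loss(meta_loss, updates, detached_params, symmetry_deviation, beta):
    """Meta-loss with a symmetry-breaking penalty of strength beta.

    updates            : per-layer optimizer updates predicted during the unroll
    detached_params    : corresponding optimizee parameters, detached from the graph
    symmetry_deviation : callable (update, params) -> absolute deviation (e.g. Eq. 13, 16, or 19)
    """
    penalty = sum(symmetry_deviation(u, p) for u, p in zip(updates, detached_params))
    return meta_loss + beta * penalty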
A description of the three symmetries considered,
along with methods for calculating deviations from
the associated geometric constraints, is provided be-
low.
Translation Symmetry. In this paper, translation
symmetry refers to the invariance of a function to the action $\psi(\theta, \alpha) : \theta \mapsto \theta + \alpha\xi$, where $\xi \in \mathbb{R}^n$ is some constant vector and $\alpha \in \mathbb{R}$. We will consider only cases where $\xi = \mathbb{1}_X$, which is the indicator vector for some subset X of $\{1, \dots, n\}$. An example is the vector $\mathbb{1}_{\{1,3\}} = (1, 0, 1, 0, \dots, 0)$. From (9), we get
$$\left\langle \nabla f, \partial_\alpha \psi \big|_{\alpha=0} \right\rangle = \left\langle \nabla f, \mathbb{1}_X \right\rangle = 0. \quad (10)$$
In other words, the gradient is orthogonal to the indi-
cator vector.
Let us have a subset $X = \{i_1, i_2, \dots, i_k\}$ of $\{1, \dots, n\}$ and denote by $\nabla_{\theta_X} f$ the gradient of f with respect to the subset of its arguments given by $\theta_X = (\theta_{i_1}, \theta_{i_2}, \dots, \theta_{i_k}) \in \mathbb{R}^k$. Then, we may rewrite the previous orthogonality as
$$\left\langle \nabla f, \mathbb{1}_X \right\rangle = \left\langle \nabla_{\theta_X} f, \mathbb{1} \right\rangle = 0, \quad (11)$$
where $\mathbb{1} = (1, \dots, 1)$.
In the context of deep learning, the translation
symmetry is present in the softmax function $\sigma_i(z) = e^{z_i} / \sum_j e^{z_j}$ with $z = Wx + b$. Shifting any column $W_{:,i}$ or the bias vector b by a real constant leaves the output of the softmax unchanged. Therefore, the softmax function is invariant under translation, and this symmetry induces from (11) the following gradient property:
$$\left\langle \nabla_{W_{:,i}} L, \mathbb{1} \right\rangle = \left\langle \nabla_b L, \mathbb{1} \right\rangle = 0, \quad (12)$$
where we denote by L the loss function.
The equation above provides us with a simple way
to calculate absolute deviation from this symmetry-
induced constraint as follows:
$$\left| \left\langle g_b(L), \mathbb{1} \right\rangle \right| + \sum_i \left| \left\langle g_{W_{:,i}}(L), \mathbb{1} \right\rangle \right|, \quad (13)$$
where the function g denotes the update rule of an optimizer with the corresponding components $g_b$, $g_{W_{:,i}}$, and the index i iterates over all the columns of W.
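A minimal sketch of how the deviation in (13) could be computed for a softmax output layer, assuming update_W and update_b hold the optimizer's proposed updates for W and b; the tensor names and shapes are our own assumptions.

import torch

def translation_symmetry_deviation(update_W, update_b):
    """Absolute deviation from the translation-symmetry constraint, cf. Eq. (13).

    update_W : optimizer update for the softmax weight matrix W, shape (out, in)
    update_b : optimizer update for the bias b, shape (out,)
    """
    # |<g_b, 1>| plus the sum over columns i of |<g_{W_{:,i}}, 1>|.
    deviation = update_b.sum().abs()
    deviation = deviation + update_W.sum(dim=0).abs().sum()  # column sums of W's update
    return deviation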
Rescale Symmetry. Rescale symmetry is in our case defined by the group $GL_1^+(\mathbb{R})$ and the action $\psi(\theta, \alpha) : \theta \mapsto a_{X_1}(\alpha) \odot a_{X_2}(\alpha^{-1}) \odot \theta$, where $X_1$ and $X_2$ are two disjoint subsets of $\{1, \dots, n\}$, $\alpha \in (0, +\infty)$, $a_{X_1}(\alpha) = \mathbb{1} + (\alpha - 1)\mathbb{1}_{X_1}$, and $\odot$ is the element-wise multiplication. An example of the notation is $a_{\{1,3\}}(\alpha) = (\alpha, 1, \alpha, 1, \dots, 1)$. Function f possesses rescale symmetry if $f(\theta) = f(\psi(\theta, \alpha))$ for all $\alpha \in GL_1^+(\mathbb{R})$. Taking similar steps as before, this symmetry implies gradient orthogonality as
$$\left\langle \nabla f, \partial_\alpha \psi \big|_{\alpha=1} \right\rangle = \left\langle \nabla_{\theta_{X_1}} f, \theta_{X_1} \right\rangle - \left\langle \nabla_{\theta_{X_2}} f, \theta_{X_2} \right\rangle = 0. \quad (14)$$
Full formal proofs can be found in Kunin et al. (2021).
Interestingly, the rescale symmetry is present at
every hidden neuron of networks with continuous, ho-
mogeneous activation functions such as ReLU and
Leaky ReLU. A simple illustration of this comes from considering a single neuron's computational path $w_2 \sigma(w_1^T x + b)$ and noticing that scaling $w_1$ and b by a positive constant and $w_2$ by its inverse has no effect, since the constant can be passed through the homogeneous activation function, canceling out. Thus, the gradient constraints in this case are
$$\left\langle \nabla_{w_1} L, w_1 \right\rangle + \left\langle \nabla_b L, b \right\rangle - \left\langle \nabla_{w_2} L, w_2 \right\rangle = 0. \quad (15)$$
The absolute deviation of the optimizer's updates given by g from the rescale symmetry constraints can therefore be calculated as
$$\left| \left\langle g_{w_1}(L), w_1 \right\rangle + \left\langle g_b(L), b \right\rangle - \left\langle g_{w_2}(L), w_2 \right\rangle \right|. \quad (16)$$
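A corresponding sketch of (16) for a single hidden neuron, where g_w1, g_b, g_w2 denote the optimizer's updates for the incoming weights, the bias, and the outgoing weights; the names are illustrative.

import torch

def rescale_symmetry_deviation(g_w1, g_b, g_w2, w1, b, w2):
    """Absolute deviation from the rescale-symmetry constraint, cf. Eq. (16),
    for one hidden neuron with a homogeneous activation (e.g. Leaky ReLU)."""
    return (torch.dot(g_w1, w1) + g_b * b - torch.dot(g_w2, w2)).abs()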
Scale Symmetry. Scale symmetry can be defined by the group $GL_1^+(\mathbb{R})$ and the action $\psi(\theta, \alpha) : \theta \mapsto a_X(\alpha) \odot \theta$, where the notation is the same as for the rescale symmetry. Given this definition, a function possesses scale symmetry if $f(\theta) = f(\psi(\theta, \alpha)) = f(a_X(\alpha) \odot \theta)$ for all $\alpha \in GL_1^+(\mathbb{R})$. The gradient orthogonality arising from this symmetry is
$$\left\langle \nabla f, \partial_\alpha \psi \big|_{\alpha=1} \right\rangle = \left\langle \nabla f, \mathbb{1}_X \odot \theta \right\rangle = \left\langle \nabla_{\theta_X} f, \theta_X \right\rangle = 0, \quad (17)$$
i.e. the gradient of the function is everywhere perpen-
dicular to the parameter vector itself.
Batch normalization, as used in deep learning, has
this scale invariance during training. To see this, con-
sider the incoming weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ of a neuron with batch normalization $\frac{z - \bar{z}}{\sqrt{\mathrm{var}(z)}}$, where $z = w^T x + b$, $\bar{z}$ is the sample average, and $\mathrm{var}(z)$ is some standard estimate of the variance. Scaling w and b by a positive real constant has no effect on the output since the constant factors out of $z$, $\bar{z}$, and $\sqrt{\mathrm{var}(z)}$, canceling its effect. Thus, w and b observe scale symmetry in the loss, and their gradients satisfy
$$\left\langle \nabla_w L, w \right\rangle + \left\langle \nabla_b L, b \right\rangle = 0. \quad (18)$$
More details can be found in Ioffe and Szegedy (2015)
and Kunin et al. (2021).
Therefore, we can calculate the absolute deviation
of the updates given by g from the scale symmetry
constraints as
$$\left| \left\langle g_w(L), w \right\rangle + \left\langle g_b(L), b \right\rangle \right|. \quad (19)$$
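Analogously, a sketch of the scale-symmetry deviation (19) for a batch-normalized neuron; parameter names are again our own.

import torch

def scale_symmetry_deviation(g_w, g_b, w, b):
    """Absolute deviation from the scale-symmetry constraint, cf. Eq. (19),
    for a neuron whose pre-activation is batch-normalized."""
    return (torch.dot(g_w, w) + g_b * b).abs()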
3.2 Stochastic Gradient Noise and
α-Stable Distributions
3.2.1 Stochastic Gradient Noise
Stochastic gradient noise refers to the fluctuations in
the gradient of the loss function. This noise can be
characterized by the gradient noise vector, defined as
$$U_k = \nabla L_k - \nabla L, \quad (20)$$
where $\nabla L_k$ is the k-th mini-batch gradient and $\nabla L$ is the full-batch gradient of the loss function.
Although many theoretical works have assumed
$U_k$ to be Gaussian with independent components, the
authors of Simsekli et al. (2019) have shown that
this is not entirely the case. Their theory suggests that the gradient noise converges to a heavy-tailed α-stable random variable, and they proposed a novel analysis of SGD that their experiments showed to be more appropriate. Specifically, in a series of experiments
with common deep learning architectures, the gradi-
ent noise exhibited a highly non-Gaussian and heavy-
tailed distribution.
More importantly, they proved that gradient noise
with heavier tails increases the probability of ending
up in a wider basin of the loss landscape, an indication
of better generalization (Hochreiter and Schmidhuber
(1997)).
3.2.2 Symmetric α-Stable Distributions
One can view the symmetric α-stable (SαS) distri-
bution as a heavy-tailed generalization of a centered
Gaussian distribution. The SαS distributions are de-
fined through their characteristic function
$$X \sim S\alpha S(\sigma) \iff \mathbb{E}\left[e^{i\omega X}\right] = e^{-|\sigma\omega|^\alpha}. \quad (21)$$
In general, the probability density function does not have a closed-form formula, but it is known that the density decays with a power-law tail like $|x|^{-\alpha-1}$, where $\alpha \in (0, 2]$ is called the tail-index: as α gets smaller, the distribution has a heavier tail.
Under the assumption of the same tail-index for all
components of the gradient noise and of the parameter
update noise defined by
$$U_k = g_k - g, \quad (22)$$
where $g_k$ is the k-th mini-batch parameter update and g is the full-batch update, we estimate the tail-index
α of the distribution of the gradient noise components
and the parameter update noise components using the
estimator introduced in Mohammadi et al. (2015) and
specifically its implementation from Simsekli et al.
(2019).
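A sketch of this tail-index estimator as we understand it from Simsekli et al. (2019): the components are split into non-overlapping blocks, and the difference between the mean log absolute block sum and the mean log absolute component, divided by the log block size, estimates 1/α. The square-root block split and the numerical guard are our own choices.

import numpy as np

def estimate_alpha(x):
    """Tail-index estimator for symmetric alpha-stable data
    (Mohammadi et al., 2015, as used by Simsekli et al., 2019)."""
    x = np.asarray(x, dtype=np.float64)
    K = len(x)
    K1 = int(np.floor(np.sqrt(K)))           # block size (our choice of split)
    K2 = K // K1
    x = x[: K1 * K2]
    y = x.reshape(K2, K1).sum(axis=1)        # sums over non-overlapping blocks
    eps = 1e-12                              # guard against log(0)
    inv_alpha = (np.log(np.abs(y) + eps).mean()
                 - np.log(np.abs(x) + eps).mean()) / np.log(K1)
    return 1.0 / inv_alpha

For Gaussian inputs this returns a value close to 2, while heavier-tailed noise yields smaller estimates.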
3.3 Update Covariance
To further investigate the noise in the mini-batch pa-
rameter updates for different optimizers, we study the
magnitude of their variation across different samples.
We do so by considering the update covariance matrix
$$K = \frac{1}{N} \sum_{i=1}^{N} (g_i - g)(g_i - g)^T, \quad (23)$$
where $g_i$ is the parameter update on the sample $x_i$, g is the full-batch update, and N is the number of training samples.
Computing and storing K in memory is expensive
due to the quadratic cost in the number of parame-
ters. However, since we want to analyze the magni-
tude of the deviations of the parameter updates from
the full-batch update, it is sufficient to determine only
the largest eigenvalue of K.
To address this, we use the mini-batch updates as
follows. We begin by sampling L mini-batch updates of size M and compute the corresponding $L \times L$ Gram matrix $K^M$ with entries
$$K^M_{i,j} = \frac{1}{L} \left\langle g_i - g, g_j - g \right\rangle, \quad (24)$$
where $g_i$ is the i-th mini-batch update and g is the estimate of the full-batch update based on the L mini-batches (in our analysis, we use L = 93). Subsequently, we find the maximum eigenvalue of $K^M$ using the power iteration method. The empirical findings of Jastrzebski et al. (2020) demonstrate that the largest eigenvalue of $K^M$ approximates the largest eigenvalue of K quite well.
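A compact sketch of this procedure: build the L × L Gram matrix of (24) from mini-batch updates and extract its largest eigenvalue with power iteration. Stacking the updates into a matrix and the fixed iteration count are our own implementation choices.

import numpy as np

def largest_update_covariance_eigenvalue(updates, n_iter=100, seed=0):
    """Largest eigenvalue of the update covariance via the L x L Gram matrix, cf. Eq. (24).

    updates : array of shape (L, n_params), one mini-batch update per row.
    """
    updates = np.asarray(updates, dtype=np.float64)
    L = updates.shape[0]
    centered = updates - updates.mean(axis=0)   # g_i - g, with g estimated from the L batches
    gram = centered @ centered.T / L            # K^M

    # Power iteration for the dominant eigenvalue of the symmetric matrix K^M.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=L)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = gram @ v
        v /= np.linalg.norm(v)
    return float(v @ gram @ v)                  # Rayleigh quotient at convergence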
4 EXPERIMENTS
All our meta-training is performed on feed-forward
neural network optimizees with 1 hidden layer of 20
neurons with either sigmoid, Leaky ReLU, or batch
normalization followed by ReLU. We put the soft-
max activation function at the output and train the
optimizees on the MNIST classification task with the
cross-entropy loss function and batch size of 128. We
set unroll to 20 iterations, learning rate to 0.001, and
perform meta-training for 50 epochs, each consisting
of 20 separate optimization runs with the maximum
iteration number of 200.
During meta-testing, we evaluate L2O on the op-
timizee architectures from meta-training as well as on
larger networks described in each experiment sepa-
rately.
For SGD, Adam, and Lion, by performing a hy-
perparameter search, we chose a learning rate of 0.1
and momentum of 0.9 for SGD, a learning rate of 0.05
for Adam, and learning rates 0.001, 0.005, 0.01 for
sigmoid, Leaky ReLU, and ReLU with batch normal-
ization, respectively, for Lion.
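For concreteness, the optimizee used in these experiments might look like the following sketch (MNIST inputs flattened to 784 features, one hidden layer of 20 units, 10 output classes); the exact placement of batch normalization and the omission of the explicit softmax (folded into the cross-entropy loss) are assumptions on our part.

import torch.nn as nn

def make_optimizee(hidden=20, activation="leaky_relu"):
    """Feed-forward MNIST classifier with one hidden layer, as described above."""
    if activation == "sigmoid":
        act = [nn.Sigmoid()]
    elif activation == "leaky_relu":
        act = [nn.LeakyReLU()]
    elif activation == "batchnorm_relu":
        act = [nn.BatchNorm1d(hidden), nn.ReLU()]
    else:
        raise ValueError(activation)
    # The softmax at the output is folded into the cross-entropy loss during training.
    return nn.Sequential(nn.Flatten(), nn.Linear(784, hidden), *act, nn.Linear(hidden, 10))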
4.1 Breaking the Geometric Constraints
To assess the importance of learned optimizers being
free from the geometric constraints that might bind
classical optimizers, we follow a two-step procedure.
First, we measure the deviations of L2O from the
aforementioned geometric constraints for the trans-
lation, rescale, and scale symmetries. Second, we
regularize the learned optimizer during meta-training
against breaking these constraints and observe its ef-
fects on the performance during meta-testing. The
goal is to better understand the difference between
the optimization trajectories of L2O and classical op-
timizers through the lens of symmetries in the opti-
mizee architecture.
Deviations from the Constraints. We meta-train
and subsequently meta-test L2O on the aforemen-
tioned optimizee models and track the progression of
Figure 4: Deviations from the geometric constraints. Left:
Rescale symmetry breaking on the Leaky ReLU optimizee.
Right: Scale symmetry breaking on the optimizee with
batch normalization and ReLU.
the deviations of parameter updates from the geomet-
ric constraints. Additionally, we compare the devi-
ations of hand-engineered optimizers such as SGD,
Adam, and Lion.
As can be seen in Figure 4, where L2O was meta-
trained on optimizees with Leaky ReLU and ReLU
with batch normalization, the deviations from the ge-
ometric constraints induced by the rescale and scale symme-
tries are mostly larger than those of SGD and Adam.
But most strikingly, the increase in this symmetry
breaking is largest for L2O at the beginning of op-
timization, whereas for Lion, it increases more grad-
ually and achieves higher values later.
Symmetry Breaking Regularization. To get a
deeper insight into how L2O leverages the freedom of
parameter updates, we meta-train with an additional
regularization loss that penalizes the absolute size of
the L2O’s update deviations from the geometric con-
straints on gradients. The performance for various
regularization strengths β is shown in Figure 5.
Interestingly, as regularization increases, the
L2O’s optimization speed significantly drops. This
Figure 5: Performance after the symmetry breaking regu-
larization. Left: Rescale symmetry breaking regularized on
the Leaky ReLU optimizee. Right: Scale symmetry break-
ing regularized on the optimizee with ReLU and batch nor-
malization.
Figure 6: Effects of regularization strength β on the performance of L2O on novel optimizee architectures. Left: L2O meta-
trained on the Leaky ReLU optimizee with regularization against the rescale symmetry breaking and then meta-tested on
5x wider and 2x deeper optimizee network with sigmoid. Middle: L2O meta-trained on the Leaky ReLU optimizee with
regularization against the rescale symmetry breaking and then meta-tested on ReLU with batch normalization. Right: L2O
meta-trained on the sigmoid optimizee with regularization against the translation symmetry breaking and then meta-tested on
ReLU with batch normalization.
observation reiterates the interesting L2O characteristic that its optimization trajectories do not leave the symmetry-induced sets of equivalent parameter configurations the way gradient steps would, but rather take a route more aligned with the level sets of the loss landscape. Such
a search for a better starting point for the next step
might be related to the symmetry teleportation pre-
sented in Zhao et al. (2022), where the authors per-
formed loss-invariant symmetry transformations of
the parameters to find combinations with a steeper
gradient. Given that L2O is meta-trained and its pre-
dicted parameter updates are history-dependent, it is
an interesting topic for future study to analyze this
connection further.
The same observation for the effect of regulariza-
tion can be made for most of the optimizee architec-
tures on which L2O was not meta-trained (Figure 6).
Results for L2O meta-trained on the optimizee with
sigmoid activation function can be found in the Ap-
pendix.
Symmetry Breaking and Performance of Tradi-
tional Optimizers. We can also relate the symme-
try breaking of classical hand-designed optimizers to
their performance. Specifically, by interpolating be-
tween the optimization steps of SGD and Lion, we
analyze how the change in symmetry breaking of this
combined optimizer coincides with the change in its
performance.
Figure 7 shows how the Lion-SGD interpolation
maps onto the deviations from the symmetry con-
straints (left) and the performance (right). $\lambda_L$ refers to the interpolation coefficient in
$$g_{\text{Lion-SGD}} = \lambda_L \cdot g_{\text{Lion}} + (1 - \lambda_L) \cdot g_{\text{SGD}}, \quad (25)$$
where the g's denote the parameter updates of the corre-
sponding optimizers. The optimizee, in this case, is
a feed-forward neural network with 2 hidden layers
of 100 neurons with sigmoid activation function (the
same as in Figure 6 on the left).
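The interpolated update itself is a simple convex combination of the two optimizers' proposed updates; a one-line sketch:

def lion_sgd_update(g_lion, g_sgd, lam):
    """Interpolated Lion-SGD parameter update, cf. Eq. (25)."""
    return lam * g_lion + (1.0 - lam) * g_sgd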
Similarly to the earlier observation that the regu-
larization of L2O symmetry breaking hinders its per-
formance, we see that the increasing symmetry break-
ing of Lion-SGD correlates with an increase in perfor-
mance. This indicates that breaking the strict geomet-
ric constraints is beneficial not only for L2O but also
for more traditional, manually designed optimization
algorithms.
Figure 7: Deviations from the geometric constraints and the
performance of the Lion-SGD optimizer. Left: Translation
symmetry breaking on the sigmoid optimizee from Figure 6
on the left. Right: Training loss on the same optimizee.
4.2 Heavy-Tailed Distribution of
Gradient and Parameter Update
Noise
In this section, we investigate the progression of the
estimated tail-index α of the gradient noise compo-
nents and the parameter update noise components for
different optimizers. With lower values of α indicat-
Figure 8: Heavy-tailedness and update covariance. Left: α
estimates for the gradient and update noise on the Leaky
ReLU optimizee. Right: Largest eigenvalue of the update
covariance on the Leaky ReLU optimizee.
ing a more heavy-tailed distribution of the gradient and update noise, we are particularly interested in
comparing how the suggested parameter updates from
SGD, Adam, and L2O evolve over a training run.
The results for L2O meta-trained on the Leaky
ReLU optimizee are shown in Figure 8 (left).
First, we see that the estimates of α for L2O pa-
rameter updates are much higher than for the gradi-
ents, implying that the distribution of the noise in its
updates is less heavy-tailed than that of the gradient.
This shows that L2O effectively attenuates the heavy-
tail portion of deviations in the gradient estimates on
its input, taking a less jittery optimization trajectory.
Moreover, one can see that except for a few drops of
α at the beginning of training, the noise in the updates
from L2O is generally less heavy-tailed than the up-
dates from Adam or SGD. Given the previous studies
(Simsekli et al. (2019)) on the benefits of stochastic
gradient noise for generalization, future developments
of the L2O methods might want to take this obser-
vation of less heavy-tailed update noise into account
when troubleshooting optimizee generalization gaps.
Since our experiments are performed on a rather sim-
ple task of MNIST classification, we do not encounter
such problems. Results for the other settings can be
found in the Appendix.
Second, since L2O was meta-trained only on
training runs of 200 iterations, we observed that the
loss in the meta-testing on sigmoid and ReLU with
batch normalization optimizees starts to increase after
around the 500th iteration. This inability to general-
ize to longer training sequences closely correlates
with decreasing α estimates for L2O’s parameter up-
dates and gradients on these two optimizee architec-
tures as can be seen in the Appendix. With the Leaky
ReLU optimizee network in Figure 8, the training loss
does not increase as for the other two optimizee archi-
tectures and so the α estimates are kept at around the
same value even after the 500th iteration.
4.3 Update Covariance
As described in Methods (Section 3.3), we want to
investigate the noise (variation) in the mini-batch pa-
rameter updates for different optimizers. We follow
the same setup as in the previous experiments. The
results for update covariance of all considered opti-
mizers on the Leaky ReLU optimizee are shown in
Figure 8 (right). Results for the other optimizees can
be found in the Appendix.
Interestingly, once again, we observe a similar rel-
ative ordering among the considered optimizers. In
this case, the variation in updates across different
samples appears to be the lowest for SGD, followed
by Adam, Lion, and, finally, the largest variation is
observed for L2O.
4.4 Update Histograms
To compare the strategies of different optimization al-
gorithms further, we examine the absolute values of
their updates, as shown in Figure 9.
One can notice that the L2O starts with the largest
updates and then slowly approaches the update distri-
bution of Adam. These large initial updates closely
parallel the rapid symmetry breaking of the learned
optimizer at the beginning of the optimization run.
Such behavior holds also for the other optimizees we
considered (in the Appendix).
5 DISCUSSION
We found that one of the most pronounced features of
learned optimizers is their rapid symmetry breaking
at the beginning of the optimization run. Remarkably,
the good performance of L2O in the initial phase of
training correlates with this behavior very well, as is
also demonstrated by the symmetry-breaking regular-
ization, which heavily hindered the optimizer.
Another aspect is the less heavy-tailed distribution
of L2O updates despite the gradients exhibiting very
heavy-tailed behavior. Together with the high varia-
tion of updates across different samples, as shown by
large maximum eigenvalues of update covariance, this
points to one interesting observation: L2O appears to
act as a stabilizing force in the optimization process.
While the inherent stochasticity and heavy-tailed na-
ture of gradients might lead to erratic updates and
slow convergence, the noise clipping of L2O seems
to mitigate these issues.
Figure 9: Histograms of the absolute values of parameter updates of different optimizers (Adam, SGD, Lion, L2O) on the sigmoid optimizee at iterations 1, 5, 10, 30, 100, and 500.
6 RELATED WORK
Since this paper mainly focuses on the investigation
of the inner workings and behavior of L2O, we restrict
the scope of this section to other similar works that
attempt to analyze L2O.
In Harrison et al. (2022), the authors used dynam-
ical systems tools in a noisy quadratic model to study
the stability of black-box optimizers. Their findings
led them to propose a model they called stabilized
through ample regularization (STAR) learned opti-
mizer. The authors argued that designing similar in-
ductive biases informed by well-grounded theory and
previous analysis is a potentially exciting and promis-
ing future direction. Closely related is the study of in-
stability in the meta-training phase of L2O presented
in Metz et al. (2019).
Similarly to our comparison of L2O with hand-
designed optimizers, Maheswaranathan et al. (2021)
found that L2O has a relatively interpretable behavior.
Specifically, they identified four standard algorithmic
techniques: momentum, gradient clipping, learning
rate schedules, and learning rate adaptation.
In comparison to the above work, which primar-
ily studied the L2O’s hidden state dynamics, we take a
more high-level perspective and look at the op-
timization trajectories as a whole.
Important to our paper is Kunin et al. (2021)
where the authors present a unified theoretical frame-
work to understand the dynamics of neural network
parameters during training with hand-designed opti-
mizers. They based their study on intrinsic symme-
tries embedded in a network’s architecture that are
present for any dataset and that impose stringent geo-
metric constraints on gradients and Hessians.
The connection between symmetries and opti-
mization was also explored in the work of Bamler and
Mandt (2018), where they showed that representation
learning models possess an approximate continuous
symmetry that leads to a slow convergence of gradient
descent. This observation prompted them to introduce
an optimization algorithm called Goldstone Gradient
Descent, which aims to overcome this issue through
continuous symmetry breaking. The core of their al-
gorithm involves alternating between standard gradi-
ent descent and a specialized traversal in the subspace
of symmetry transformations.
7 CONCLUSION
Our investigation of the parameter update dynam-
ics of the learned optimizers revealed several intrigu-
ing ingredients and strategies used by these meta-
learning methods. Furthermore, our comparison with
hand-engineered optimization algorithms has not only
shown clear differences from traditional optimizers
like SGD and Adam, but also illuminated similari-
ties with the recently introduced method known as
Lion. We believe that these findings pave the way
for promising future research directions, where the in-
sights gleaned from these side-by-side examinations
can inform the design of more robust, scalable, and
faster optimizers.
ACKNOWLEDGEMENTS
This work was supported by the Student Summer Re-
search Program 2023 of FIT CTU in Prague.
REFERENCES
Andrychowicz, M., Denil, M., Colmenarejo, S. G., Hoff-
man, M. W., Pfau, D., Schaul, T., Shillingford, B.,
and de Freitas, N. (2016). Learning to learn by gra-
dient descent by gradient descent. In Proceedings of
the 30th NIPS, NIPS’16, page 3988–3996, Red Hook,
NY, USA. Curran Associates Inc.
Bamler, R. and Mandt, S. (2018). Improving optimization
for models with continuous symmetry breaking.
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu,
Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu,
Y., and Le, Q. V. (2023). Symbolic discovery of opti-
mization algorithms.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive sub-
gradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research,
12(61):2121–2159.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B.
(2021). Sharpness-aware minimization for efficiently
improving generalization. In International Confer-
ence on Learning Representations.
Harrison, J., Metz, L., and Sohl-Dickstein, J. (2022). A
closer look at learned optimization: Stability, robust-
ness, and inductive biases. In Koyejo, S., Mohamed,
S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A.,
editors, Advances in Neural Information Processing
Systems, volume 35, pages 3758–3773. Curran Asso-
ciates, Inc.
Hochreiter, S. and Schmidhuber, J. (1997). Flat Minima.
Neural Computation, 9(1):1–42.
Ioffe, S. and Szegedy, C. (2015). Batch normalization:
Accelerating deep network training by reducing in-
ternal covariate shift. In Bach, F. and Blei, D., ed-
itors, Proceedings of the 32nd International Confer-
ence on Machine Learning, volume 37 of Proceedings
of Machine Learning Research, pages 448–456, Lille,
France. PMLR.
Jastrzebski, S., Szymczak, M., Fort, S., Arpit, D., Tabor,
J., Cho*, K., and Geras*, K. (2020). The break-even
point on optimization trajectories of deep neural net-
works. In International Conference on Learning Rep-
resentations.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Kunin, D., Sagastuy-Brena, J., Ganguli, S., Yamins, D. L.,
and Tanaka, H. (2021). Neural mechanics: Symmetry
and broken conservation laws in deep learning dynam-
ics. In ICLR 2021.
Lv, K., Jiang, S., and Li, J. (2017). Learning gradient
descent: Better generalization and longer horizons.
In Proceedings of the 34th International Conference
on Machine Learning - Volume 70, ICML’17, page
2247–2255. JMLR.org.
Maheswaranathan, N., Sussillo, D., Metz, L., Sun, R.,
and Sohl-Dickstein, J. (2021). Reverse engineering
learned optimizers reveals known and novel mecha-
nisms. In Ranzato, M., Beygelzimer, A., Dauphin, Y.,
Liang, P., and Vaughan, J. W., editors, Advances in
Neural Information Processing Systems, volume 34,
pages 19910–19922. Curran Associates, Inc.
Metz, L., Freeman, C. D., Maheswaranathan, N., and Sohl-
Dickstein, J. (2021). Training learned optimizers with
randomly initialized learned optimizers.
Metz, L., Maheswaranathan, N., Freeman, C. D., Poole, B.,
and Sohl-Dickstein, J. (2020). Tasks, stability, archi-
tecture, and compute: Training more effective learned
optimizers, and using them to train themselves.
Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D.,
and Sohl-Dickstein, J. (2019). Understanding and cor-
recting pathologies in the training of learned optimiz-
ers. In Chaudhuri, K. and Salakhutdinov, R., editors,
Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Ma-
chine Learning Research, pages 4556–4565. PMLR.
Mohammadi, M., Mohammadpour, A., and Ogata, H.
(2015). On estimating the tail index and the spec-
tral measure of multivariate α-stable distributions.
Metrika: International Journal for Theoretical and
Applied Statistics, 78(5):549–561.
Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser,
L., Kurach, K., and Martens, J. (2015). Adding gra-
dient noise improves learning for very deep networks.
CoRR, abs/1511.06807.
Polyak, B. (1964). Some methods of speeding up the con-
vergence of iteration methods. USSR Computational
Mathematics and Mathematical Physics, 4(5):1–17.
Simanek, P., Vasata, D., and Kordik, P. (2022). Learning
to optimize with dynamic mode decomposition. In
2022 International Joint Conference on Neural Net-
works (IJCNN). IEEE.
Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A
tail-index analysis of stochastic gradient noise in deep
neural networks. In Chaudhuri, K. and Salakhutdinov,
R., editors, Proceedings of the 36th ICML, volume 97
of Proceedings of Machine Learning Research, pages
5827–5837. PMLR.
Tieleman, T., Hinton, G., et al. (2012). Lecture 6.5-
rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks
for machine learning, 4(2):26–31.
Zhao, B., Dehmamy, N., Walters, R., and Yu, R. (2022).
Symmetry teleportation for accelerated optimization.
In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave,
D., Cho, K., and Oh, A., editors, Advances in Neu-
ral Information Processing Systems, volume 35, pages
16679–16690. Curran Associates, Inc.
APPENDIX
Additional Results
Symmetry Breaking. Figure 10 shows the param-
eter update deviations from the constraints on gradi-
ents induced by translation symmetry, along with the
associated performance after applying the symmetry
breaking regularization.
We see similar trends as for the Leaky ReLU and
batch normalization optimizees. Specifically, L2O
breaks the constraints by a large margin early in the
training run.
Figure 10: Symmetry breaking. Left: Translation symme-
try breaking on the optimizee with sigmoid. Right: Perfor-
mance after the translation symmetry breaking regulariza-
tion on the optimizee with sigmoid.
Heavy-Tailedness and Update Covariance. The α
estimates for the gradient and update noise, as well as
the progression of the largest eigenvalue of the update
covariance, are shown in Figures 11 and 12.
Similarly to the results for the Leaky ReLU op-
timizee in the main part of this paper, we can again
observe that L2O dampens the heavy-tailed gradient
noise into high-variation parameter updates. Also, as
mentioned earlier, we observed that the loss during
L2O meta-testing on sigmoid and batch normalization
optimizees starts to increase after around the 500th it-
eration, and this closely correlates with the decreas-
ing α estimates of L2O gradient and update noise we
see here.
Additionally, as an interesting observation in itself, we found that standardizing the image data has a relatively large impact on how heavy-tailed the gradient and update noise distributions are. Specifically, when we replaced the data normalization to the range [0, 1] with standardization, the α estimates shifted lower by around 0.4. Since we tried to follow the setup from Andrychowicz et al. (2016), all our results were obtained using data normalization.
Figure 11: Heavy-tailedness and update covariance. Left:
Gradient and update noise on the sigmoid optimizee. Right:
Update covariance on the sigmoid optimizee.
Figure 12: Heavy-tailedness and update covariance. Left:
Gradient and update noise on the optimizee with batch nor-
malization and ReLU. Right: Update covariance on the op-
timizee with batch normalization and ReLU.
Update Histograms. Figures 13 and 14 present his-
tograms of the absolute values of parameter updates
for the optimizee with batch normalization and ReLU
and for the optimizee with Leaky ReLU, respectively.
Figure 13: Histograms of the absolute values of parameter updates (Adam, SGD, Lion, L2O) for L2O meta-trained and meta-tested on the optimizee with batch normalization and ReLU, at iterations 1, 5, 10, 30, 100, and 500.
We can again notice that the learned optimizer
gradually changes the shape of its update distribution
from one in which large values are predominant to a
shape similar to that of Adam.
Figure 14: Histograms of the absolute values of parameter updates (Adam, SGD, Lion, L2O) for L2O meta-trained and meta-tested on the Leaky ReLU optimizee, at iterations 1, 5, 10, 30, 100, and 500.