Gradient Clipping in Deep Learning: A Dynamical Systems Perspective
Arunselvan Ramaswamy
Dept. of Mathematics and Computer Science, Karlstad University, 651 88 Karlstad, Sweden
https://orcid.org/0000-0001-7547-8111
Keywords:
Deep Learning, Adaptive Gradient Clipping, Dynamical Systems Perspective, Learning Theory, Supervised Learning.
Abstract:
Neural networks are ubiquitous components of Machine Learning (ML) algorithms. However, training them is challenging due to problems associated with exploding and vanishing loss-gradients. Gradient clipping has been shown to effectively combat both the vanishing and the exploding gradients problems. As the name suggests, gradients are clipped in order to prevent large updates. At the same time, very small neural network weights are updated using larger step-sizes. Although widely used in practice, there is very little theory surrounding clipping. In this paper, we analyze two popular gradient clipping techniques: the classic norm-based gradient clipping method and the adaptive gradient clipping technique. We prove that gradient clipping ensures numerical stability with very high probability. Further, clipping-based stochastic gradient descent converges to a set of neural network weights that minimizes the average scaled training loss in a local sense. The averaging is with respect to the distribution that generated the training data. The scaling is a consequence of gradient clipping. We use tools from the theory of dynamical systems for the presented analysis.
1 INTRODUCTION
The proliferation of complex neural network (NN) architectures, coupled with sophisticated Machine Learning (ML) algorithms and cheap increased computational capacity, has caused a push towards automation in various aspects of society. Supervised learning is an important paradigm of ML, where the aim is to learn an unknown map $f : X \to Y$ using finitely many examples $\{(x_i, f(x_i)) \mid x_i \in X,\ 1 \le i \le N\}$ (LeCun et al., 2015). $X$ is referred to as the input space and $Y$ as the output space. Typically, $X \subseteq \mathbb{R}^d$, for some $d \ge 1$, and $Y$ is either discrete or continuous. When $Y$ is discrete, we are in the setting of classification. When $Y$ is continuous, it is typically a subset of the real space $\mathbb{R}$, and the setting is called regression. The unknown function $f$ is called the target. The aim is to train a predictor in order to approximate $f$. The example data used for training is called the training data, $D := \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y,\ 1 \le i \le N\}$. The standard assumption in ML is that the dataset $D$ is generated by sampling $N$ datapoints from $X \times Y$ in an independent manner using the same joint probability distribution, say $\mu$. The reader is referred to (Bishop and Nasrabadi, 2006) for details.
In deep learning, the predictor used to approximate the target is a deep neural network (DNN). It is a NN with two or more hidden layers. The goal is to find a set of DNN weights so that the prediction errors are minimized, and the resulting DNN is a good approximation of the target $f$. An appropriate loss function is defined and the NN weights are iteratively updated to minimize the loss using the training data $D$. Minimizing the loss minimizes the prediction errors. In this paper, we consider NN training using two important loss functions: the cross-entropy loss (for classification) and the mean squared error (for regression). Stochastic gradient descent is a simple yet popular choice for training a NN (LeCun et al., 2015). It involves the following update step, at time $t$, for the NN weight vector $\theta$:

$$\theta(t+1) = \theta(t) - a(t)\,\frac{1}{M}\sum_{j=1}^{M} \nabla_\theta \ell\big(\theta(t),\, x_{i(j)}(t),\, y_{i(j)}(t)\big), \qquad (1)$$

where $a(t)$ is the step-size or learning rate, $\ell$ is the loss, $\big(x_{i(j)}(t),\, y_{i(j)}(t)\big) \in D$, and $M < \infty$ is the sample-batch size. The sample-batch is sampled uniformly from $D$. For the sake of simplicity in presentation, we let $M = 1$. Also, we use $(x(t), y(t))$ to represent the datapoint sampled at time $t$ for the update. Hence,
(1) becomes

$$\theta(t+1) = \theta(t) - a(t)\, \nabla_\theta \ell\big(\theta(t),\, x(t),\, y(t)\big). \qquad (2)$$
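As a concrete illustration, the following is a minimal NumPy sketch of the single-sample update (2). The helper `grad_loss` is our stand-in for the back-propagated loss-gradient, and the toy least-squares model is our own construction, not part of the paper.

```python
import numpy as np

def sgd_step(theta, x_t, y_t, a_t, grad_loss):
    """One single-sample SGD update, cf. (2).

    grad_loss(theta, x, y) is assumed to return the loss-gradient
    with respect to theta, e.g., computed by back propagation.
    """
    return theta - a_t * grad_loss(theta, x_t, y_t)

# Toy usage: least squares with a linear model, so the gradient
# 2 * (f(theta, x) - y) * grad_theta f(theta, x) has closed form.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
grad = lambda th, x, y: 2.0 * (th @ x - y) * x
for t in range(1, 1001):
    x_t = rng.normal(size=3)
    y_t = x_t @ np.array([1.0, -2.0, 0.5])
    theta = sgd_step(theta, x_t, y_t, 1.0 / t, grad)  # a(t) = 1/t satisfies (A2)
```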
Derivatives in NNs are computed using the back propagation algorithm. In practice, owing to the limits of machine precision, the farther the loss-derivative needs to be backpropagated, the more it vanishes. In effect, some NN weights are never updated. The complementary problem is that of exploding gradients: numerical instability caused by large update values. These issues are very well documented; see (Rehmer and Kroll, 2020), (Schmidhuber, 2015) and (LeCun et al., 2015). Gradient clipping is a popular solution to the exploding gradient problem. In this paper, we explore two methods: the classic norm-based gradient clipping and the adaptive clipping method. Adaptive clipping additionally addresses the vanishing gradients issue, albeit only partially.
The principle behind gradient clipping is to clip the gradient when it grows too large, in order to prevent numerically unstable large updates. In the classic version of the norm-based gradient clipping scheme, the gradient is divided (scaled down) by its norm when the norm exceeds a certain predetermined value. We analyze its stability properties in Section 3.1 and discuss its asymptotic properties in Section 4. In particular, we show that with high probability, the clipped gradient descent is numerically stable and converges to a local minimizer of the loss function, on an average. Although widely used in deep learning solutions and empirically successful, there is very little theory surrounding it. In the past, researchers have studied its performance under idealized conditions. To the best of our knowledge, ours is the first study that analyzes the practical version of the algorithm.
Formally speaking, in the classic method, we divide the gradient update by $\|\nabla_\theta \ell\| / \lambda$, for some $\lambda > 0$, when the gradient norm exceeds a predetermined value. There is a variant of this approach that is less conservative, called the adaptive gradient clipping method. Here, the update is divided by $\|\nabla_\theta \ell\| / (\lambda \|\theta\|)$ when the gradient norm exceeds a certain multiple of the norm of the NN weights. Here, the updates are larger and convergence is faster (Zhang et al., 2019). Although very insightful, again, only an idealized version of this algorithm is analyzed in (Zhang et al., 2019). In this paper, we remedy this. Specifically, we show that numerical stability is not automatic, since the updates are in the order of the NN weights. If stable, then we show that the adaptive variant converges to a local minimizer of the loss function in an averaged sense. The averaging is with respect to the training data distribution.
The organization of this paper is as follows. First, in Section 2, we formally discuss the general gradient descent in the context of deep learning. Next, in Section 3, we introduce the classic and adaptive clipped gradient methods and discuss their numerical stability. Finally, we discuss their asymptotic properties in Section 4.
2 STOCHASTIC GRADIENTS IN DEEP LEARNING
Let us begin with the simple stochastic gradient descent algorithm involved in training any neural network (NN). It is described by the following iteration:

$$\theta(t+1) = \theta(t) - a(t)\, g(t), \qquad (3)$$

where $\theta(t)$ represents the vector of NN weights at time $t$, $a(t)$ is the learning rate or step-size at time $t$, and $g(t)$ is the loss-gradient. In this paper, we consider two loss functions, one that is popular for classification and the other that is popular for regression. In particular, we consider the cross-entropy and the mean squared losses. It must however be noted that the theory presented herein can be generally applied to other loss functions.
In a typical regression application, the output of the NN is a real number. Let us represent the output by $f(\theta, x)$, where $x$ is the input query instance and $\theta$ is the NN weight-vector. The mean squared error is then

$$\ell_r = (f(\theta, x) - y)^2, \qquad (4)$$

where $y$ is the true label of $x$. The loss-gradient is then $2(f(\theta, x) - y)\, \nabla_\theta f(\theta, x)$. In order to obtain $\nabla_\theta f(\theta, x)$, the algorithm performs a backward pass (back propagation) through the NN. The forward pass yields $f(\theta, x)$.
In the $K$-class classification setting, the NN has $K$ neurons (activations) in the output layer. Say the outputs of these activations are $z_i$, $1 \le i \le K$. Then, $p_i := e^{z_i} / \sum_{j=1}^{K} e^{z_j}$ is interpreted as the posterior probability that the true label is $i$. This is called the soft-max classifier, and it is trained using the cross-entropy loss:

$$\ell_c = -\sum_{i=1}^{K} 1_{y=i} \log p_i. \qquad (5)$$
Here, $1_{y=i}$ is the indicator function that takes values 1 or 0, depending on whether the true class-label is $i$ or not. In the traditional setting of classification, exactly one label is associated with each query instance $x$. Suppose the true label is $i$; then the partial
derivative of $\ell_c$ with respect to $z_k$ is given by

$$-\frac{\sum_{j=1}^{K} e^{z_j}}{e^{z_i}} \cdot \frac{1_{k=i}\, e^{z_i} \sum_{j=1}^{K} e^{z_j} - e^{z_i}\, e^{z_k}}{\left( \sum_{j=1}^{K} e^{z_j} \right)^2}. \qquad (6)$$
We can further simplify the above equation to get $-1_{k=i} + e^{z_k} / \sum_{j=1}^{K} e^{z_j} = p_k - 1_{k=i}$. This gradient information is back propagated through the NN in order to update the NN weights.
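As a small sanity check, the following sketch (our construction, assuming only NumPy) computes $p_k - 1_{k=i}$ and verifies it against a central finite-difference approximation of (5):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, i):
    """Cross-entropy loss (5) when the true label is i."""
    return -np.log(softmax(z)[i])

rng = np.random.default_rng(1)
z, i = rng.normal(size=4), 2
analytic = softmax(z) - np.eye(4)[i]       # p_k - 1_{k=i}

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(4)[k], i)
     - cross_entropy(z - eps * np.eye(4)[k], i)) / (2 * eps)
    for k in range(4)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```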
We are now ready to make a couple of assumptions in our analysis.
(A1) All activations in the NN are twice continuously differentiable. Popular examples include sigmoid, tanh and the Gaussian error linear unit.
The input to the NN is a vector $x \in \mathbb{R}^d$, for some $d \ge 1$. This input is first passed through the input layer; this is represented by $\Sigma(\theta_{ip}\, x)$, where $\theta_{ip}$ is a matrix with one row per activation in the input layer and $d$ columns. $\Sigma$ is an operator that takes a vector $z$, of some dimension, as input and outputs a vector of the same dimension such that its $i^{th}$ component equals the activation applied to the $i^{th}$ component of $z$, i.e., $\Sigma(z)_i = \sigma(z_i)$. The vector $\Sigma(\theta_{ip}\, x)$ is then passed through the first hidden layer to get $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$. Here, $\theta_h$ is a matrix of appropriate dimension. It represents the set of NN weights from the first hidden layer. Now, $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$ may be passed through more hidden layers; however, for the sake of simplicity, we assume there are no more hidden layers. This means that $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$ is passed through the output layer, which again looks like $\Sigma(\theta_{op}\, \Sigma(\theta_h\, \Sigma(\theta_{ip}\, x)))$.
In the classification setting, the output of the NN, $f(\theta, x)$, is then a class-label distribution given by $\left( e^{z_i} / \sum_{j=1}^{K} e^{z_j} \right)_{1 \le i \le K}$, where $K$ is the number of classes and $z_i := \Sigma(\theta_{op}\, \Sigma(\theta_h\, \Sigma(\theta_{ip}\, x)))_i$. Note that $\theta$ is the vector of all NN weights $\theta_{ip}$, $\theta_h$ and $\theta_{op}$. In regression, $f(\theta, x) = \sum_i z_i\, \theta_{l_i}$; the output is a linear combination of the activation outputs from the output layer, and $z_i$ is as defined before. Note that $\theta_l$ is a vector that is also a part of the NN weight vector $\theta$.
Passing the input through the various layers of a NN is called the forward pass. It must be noted that the above description of a NN is not complete. In particular, every vector is appended with 1 before passing it through the next layer. In particular, $(1, x)$ is the input. Then, $\big(1, \Sigma(\theta_{ip}\, (1, x)^{\top})\big)$ is the input to the hidden layer, and so on. This is done to incorporate bias information. These biases are trainable as a part of the NN weight vector. Again, for the sake of simplicity in presentation, we omit the biases.
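The forward pass described above can be sketched as follows. The layer sizes, the choice of tanh as the activation $\sigma$, and all function names are our illustrative assumptions; biases are omitted, as in the text.

```python
import numpy as np

def Sigma(z):
    """Componentwise activation: Sigma(z)_i = sigma(z_i)."""
    return np.tanh(z)   # tanh is twice continuously differentiable, cf. (A1)

def forward_classification(theta_ip, theta_h, theta_op, x):
    """Class-label distribution: softmax over the output activations z."""
    z = Sigma(theta_op @ Sigma(theta_h @ Sigma(theta_ip @ x)))
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_regression(theta_ip, theta_h, theta_op, theta_l, x):
    """Regression output: a linear combination of the output activations."""
    z = Sigma(theta_op @ Sigma(theta_h @ Sigma(theta_ip @ x)))
    return theta_l @ z

rng = np.random.default_rng(2)
d, m, K = 5, 8, 3                      # input dimension, layer width, classes
x = rng.normal(size=d)
theta_ip = rng.normal(size=(m, d))     # one row per input-layer activation
theta_h = rng.normal(size=(m, m))
theta_op = rng.normal(size=(K, m))
p = forward_classification(theta_ip, theta_h, theta_op, x)   # sums to 1
```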
Claim 1. Under (A1), $\frac{\partial \ell_r}{\partial \theta_i}$, $\frac{\partial \ell_c}{\partial \theta_i}$, $\frac{\partial^2 \ell_r}{\partial \theta_i \partial \theta_j}$ and $\frac{\partial^2 \ell_c}{\partial \theta_i \partial \theta_j}$ exist and are continuous, where $\theta_i$ and $\theta_j$ are components of $\theta$, and $\ell_r$ and $\ell_c$ are defined in (4) and (5), respectively. Therefore, $\ell_r$, $\ell_c$, $\nabla_\theta \ell_r$ and $\nabla_\theta \ell_c$ are all locally Lipschitz continuous.
Proof. Let us first consider the case of regression, with the loss function being the mean squared error. Since (A1) requires that the activation functions are twice continuously differentiable, we directly get that $f(\theta, x)$ is also twice continuously differentiable, as it is a composition of operations that are themselves twice continuously differentiable. Specifically, this is a direct consequence of $f$ being a repeated composition of activations and linear combinations. Hence, $\frac{\partial f(\theta, x)}{\partial \theta_i}$ and $\frac{\partial^2 f(\theta, x)}{\partial \theta_i \partial \theta_j}$ are continuous, directly implying that $\frac{\partial \ell_r}{\partial \theta_i}$ and $\frac{\partial^2 \ell_r}{\partial \theta_i \partial \theta_j}$ are also continuous.

Now, we move on to classification. Recall from (6) the loss-gradient with respect to $z_k$, $1 \le k \le K$, when the true label is $y = i$: $p_k - 1_{k=i}$.
We therefore get:

$$\frac{\partial \ell_c}{\partial \theta_i} = \sum_{k=1}^{K} \frac{\partial z_k}{\partial \theta_i}\, \frac{\partial \ell_c}{\partial z_k}. \qquad (7)$$
Suppose $\theta_i$ is not used to calculate the output value $z_k$; then clearly $\frac{\partial z_k}{\partial \theta_i} = 0$. For example, $(\theta_{op})_{2,3}$ is only used in the calculation of $z_2$, but not the other $z_j$'s, e.g., $\frac{\partial z_1}{\partial (\theta_{op})_{2,3}} = 0$. Recall that $\theta_{op}$ is a matrix; hence $(\theta_{op})_{2,3}$ is the element in row 2 and column 3. The NN weight vector $\theta$ is obtained by flattening all the matrices and appending them. As before, $\frac{\partial^2 z_k}{\partial \theta_i \partial \theta_j}$ is readily calculated, and is continuous as a consequence of (A1). Since $\frac{\partial \ell_c}{\partial z_k} = p_k - 1_{k=i}$, it is also continuous.
Also,

$$\frac{\partial^2 \ell_c}{\partial \theta_j \partial \theta_i} = \sum_{k=1}^{K} \left[ \frac{\partial^2 z_k}{\partial \theta_j \partial \theta_i}\, \frac{\partial \ell_c}{\partial z_k} + \frac{\partial z_k}{\partial \theta_i} \sum_{l=1}^{K} \frac{\partial z_l}{\partial \theta_j}\, \frac{\partial^2 \ell_c}{\partial z_l \partial z_k} \right]. \qquad (8)$$
We therefore get the required continuity of $\frac{\partial^2 \ell_c}{\partial \theta_j \partial \theta_i}$ and $\frac{\partial \ell_c}{\partial \theta_i}$.
A function $f : \mathbb{R}^d \to \mathbb{R}^m$ is Lipschitz continuous if there exists $L$ such that $\|f(x) - f(y)\| \le L \|x - y\|$ for all $x$ and $y$, where $d, m \ge 1$. It is locally Lipschitz continuous iff for every $x \in \mathbb{R}^d$ there exist $L_x > 0$ and $r_x > 0$ such
that $\|f(x) - f(y)\| \le L_x \|x - y\|$ for every $y \in B_{r_x}(x)$, where $B_{r_x}(x)$ is the ball of radius $r_x$ centered at the point $x$. Suppose $f$ is continuously differentiable; then $f$ is locally Lipschitz continuous (Rudin et al., 1976). Similarly, $\nabla_x f$ is locally Lipschitz continuous when the Hessian is continuous. The second part of the claim, regarding Lipschitzness, is a direct consequence of these.
3 GRADIENT CLIPPING AND NUMERICAL STABILITY
In this section, we introduce the concept of gradient clipping. We delve into the details of two important gradient clipping methods and formally discuss their numerical stability. We begin with a discussion of the classic norm-based gradient clipping that is popular in Deep Learning.
3.1 Gradient Clipping in Deep Learning
Let us return to the popular stochastic gradient descent algorithm given by (3). Suppose assumption (A1) is satisfied; then we get from Claim 1 that $g(t)$ is locally Lipschitz continuous in the $\theta$ coordinate. Recall that the loss-gradient at time $t$ is calculated using the datapoint $(x(t), y(t))$ from the dataset $D$. In the regression setting the loss-gradient is $g(t) \equiv 2(f(\theta(t), x(t)) - y(t))\, \nabla_\theta f(\theta(t), x(t))$, and it is similarly calculated in the classification setting. At time $t$, the datapoint $(x(t), y(t))$ is processed from the given training dataset in order to obtain the loss-gradient $g(t)$. There could be more than one datapoint processed, e.g., in a batch-processing implementation of the learning algorithm. For the sake of simplicity in presentation, we assume that a single datapoint is processed at every point in time in order to calculate the loss-gradient.
As the NN is highly non-linear, the calculated gradient $g(t)$ may be so large that the weight update process is numerically unstable. This is called the exploding gradients problem, and it is common when using the cross-entropy loss function. The non-linearity of a NN is directly proportional to its depth. Depth, however, also means that the loss-gradients may not propagate back into the network: the weights in the input and the initial hidden layers are updated using very small, numerically insignificant gradient values. This is called the vanishing gradient problem, and it is common when using deep network architectures.
The exploding gradient problem is often countered by projecting the iterate after every step onto a predetermined compact set $\mathcal{K}$; see (Boyd and Vandenberghe, 2004) and (Vu and Raich, 2022). The projected scheme is given by

$$\theta(t+1) \leftarrow \theta(t) - a(t)\, g(t), \qquad \theta(t+1) \leftarrow \Pi_{\mathcal{K}}\big(\theta(t+1)\big), \qquad (9)$$

where $\Pi_{\mathcal{K}}$ is the projection operator. It projects $\theta(t+1)$ onto the nearest point in $\mathcal{K}$. Hence, (9) is guaranteed to be numerically stable. However, it does not always converge to the optimum. Gradient clipping, our main focus here, is another popular solution to the exploding gradients problem. Intuitively, the idea is to “clip” large gradients in order to maintain stability. In particular, instead of (3) or (9), the update is given by

$$\theta(t+1) = \theta(t) - a(t)\, \frac{g(t)}{c(t)}, \qquad (10)$$
where $c(t)$ is the clipping value at time $t$. In the classical version of norm-based clipping,

$$c(t) := \frac{\|g(t)\|}{\lambda} \vee 1, \qquad (11)$$

where $\vee$ is the max operator. When $\|g(t)\| > \lambda$, the gradient update is scaled down by $\|g(t)\| / \lambda$; here $\lambda > 0$ is a predetermined value. The NN weights are updated using gradients that are “clipped at $\lambda$” to prevent the updates from exploding. In order to formally show numerical stability under norm-based gradient clipping, we need the following assumption on the learning rate.
(A2) $a(t) > 0$ for all $t \ge 0$, $\sum_{t=0}^{\infty} a(t) = \infty$ and $\sum_{t=0}^{\infty} a(t)^2 < \infty$.
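For concreteness, a minimal sketch of the clipped update (10) with the classic rule (11) might look as follows; `grad_loss` is again an assumed stand-in for back propagation, not a fixed API.

```python
import numpy as np

def classic_clip_factor(g, lam):
    """c(t) = (||g(t)|| / lambda) max 1, cf. (11)."""
    return max(np.linalg.norm(g) / lam, 1.0)

def clipped_sgd_step(theta, x_t, y_t, a_t, lam, grad_loss):
    g = grad_loss(theta, x_t, y_t)
    c = classic_clip_factor(g, lam)
    # The effective update has norm at most a_t * lam.
    return theta - a_t * g / c
```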
We now claim that (10) does not experience “finite-time blowup” with very high probability. By the absence of finite-time blowup we mean that the NN weights, updated according to (10), remain within a sphere of radius $M$ centered around the origin for the entire duration of the experiment. We may think of $M$ as a very large positive number determined by the largest number that can be processed by a computer and the upper-bound on the duration of the experiment.

Lemma 1. Assuming (A1) and (A2), the stochastic gradient descent with the classic norm-based clipping, (10), does not experience finite-time blowup with very high probability.
Proof. Let us rewrite (10) as follows:

$$\theta(s) = \theta(0) - \sum_{t=0}^{s-1} a(t)\, \frac{g(t)}{c(t)}. \qquad (12)$$

Let $T < \infty$ be the duration of the experiment, and let $M > 0$ be a very large arbitrary constant. We need to show that $\|\theta(t)\| < M$ for $0 \le t \le T$ with very high probability. Since $M$ is arbitrary, we show an equivalent variant of this; specifically, we show that $\|\theta(t) - \theta(0)\| < M$ for $1 \le t \le T$ with very high probability. First, we begin by observing that for $s \ge 1$,

$$\|\theta(s)\| \le \|\theta(0)\| + \sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)}. \qquad (13)$$

Next, we define the following random variables for $0 \le s \le T$: $M_s := \|\theta(0)\| + \sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)}$. We also define the following sigma-algebras: $\mathcal{F}_0 := \sigma\langle \theta(0) \rangle$ and $\mathcal{F}_s := \sigma\big\langle \theta(t),\, (x(u), y(u)) : 0 \le t \le s,\ 0 \le u \le s-1 \big\rangle$.
Since the datapoints from the training dataset are independently generated, we get that $E[M_{s+1} \mid \mathcal{F}_s] \ge M_s$ for $0 \le s \le T-1$. Hence, $(M_t, \mathcal{F}_t)_{0 \le t \le T}$ is a sub-Martingale. The reader is referred to (Durrett, 2019) or (Durrett, 2018) for the definitions of a sub-Martingale and a sigma-algebra. Since $c(t) = \frac{\|g(t)\|}{\lambda} \vee 1$, we get that $\frac{\|g(t)\|}{c(t)} \le \lambda$; further, we get that $|M_{t+1} - M_t| \le a(t)\lambda$. It follows from the Hoeffding-Azuma inequality (Bercu et al., 2015) that

$$P(|M_s - M_0| > M) \le 2\, e^{-2M^2 / \left( \sum_{t=0}^{s-1} a(t)^2 \lambda^2 \right)} \qquad (14)$$
for $1 \le s \le T$. Now, since $\sum_{t=0}^{s-1} a(t)^2 \le \sum_{t=0}^{\infty} a(t)^2 =: \gamma < \infty$,

$$P(|M_s - M_0| > M) \le 2\, e^{-2M^2 / \left( \sum_{t=0}^{s-1} a(t)^2 \lambda^2 \right)} \le 2\, e^{-2M^2 / (\gamma \lambda^2)}. \qquad (15)$$
Let us say that $a(t) = 1/t$ is the learning rate; then $\sum_{t \ge 1} \frac{1}{t^2} = \frac{\pi^2}{6}$, i.e., $\gamma = \frac{\pi^2}{6}$. Since $M$ is arbitrary, let us fix it as $2^{10}$. A typical value for $\lambda$ is something like 30. Plugging these values into (15), we get that
$$P(|M_s - M_0| > M) \le 2\, e^{-1416}. \qquad (16)$$
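The exponent in (16) can be recomputed directly; this is a quick arithmetic check under the stated choices, nothing more:

```python
import math

gamma = math.pi ** 2 / 6                 # sum over t >= 1 of 1/t^2
M, lam = 2 ** 10, 30.0
print(2 * M ** 2 / (gamma * lam ** 2))   # ~1416.6, hence the bound 2*exp(-1416)
```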
The RHS of (16) is very close to zero. If we choose $s = T$, then we get that $P(|M_T - M_0| \le M) \ge 1 - 2e^{-1416} \approx 1$. Since the sequence $(M_s)$ is almost surely non-decreasing, $|M_s - M_0| \le |M_T - M_0|$ for all $s \le T$, and we get that $P(|M_s - M_0| \le M \text{ for all } 1 \le s \le T) \approx 1$. We get from (12) that $\|\theta(s) - \theta(0)\| = \left\| \sum_{t=0}^{s-1} a(t)\, \frac{g(t)}{c(t)} \right\|$. The RHS of this equality is further bounded by $\sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)} = |M_s - M_0|$. Hence, we conclude that $\|\theta(s) - \theta(0)\| \le M$ whenever $|M_s - M_0| \le M$. This gives us the required $P\big( \|\theta(s) - \theta(0)\| \le M \text{ for all } 1 \le s \le T \big) \approx 1$.
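For illustration only, the following simulation (a toy least-squares construction of ours, not from the paper) tracks the total movement $\|\theta(s) - \theta(0)\|$ of the clipped iterates; with $a(t) = 1/t$ and $\lambda = 30$ it stays far below $M = 2^{10}$, in line with the lemma:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 30.0, 100000
target = np.array([1.0, -2.0, 0.5])
theta = rng.normal(size=3)
theta0 = theta.copy()
drift = 0.0
for t in range(1, T + 1):
    x = rng.normal(size=3)
    y = x @ target
    g = 2.0 * (theta @ x - y) * x          # mean squared error gradient
    c = max(np.linalg.norm(g) / lam, 1.0)  # classic rule (11)
    theta = theta - (1.0 / t) * g / c
    drift = max(drift, np.linalg.norm(theta - theta0))
# Each step moves theta by at most a(t) * lam; the Hoeffding-Azuma
# argument above makes this quantitative.
print(drift)   # stays far below M = 2**10
```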
3.2 Adaptive Gradient Clipping in Deep Learning
The classic gradient clipping method may slow down the rate of convergence via conservative updates. Adaptive gradient clipping tries to remedy this by modifying $c(t)$ as follows:

$$c(t) = \frac{\|g(t)\|}{\lambda \left[ \|\theta(t)\| \vee \varepsilon \right]} \vee 1. \qquad (17)$$

Like $\lambda$, $\varepsilon$ is also predetermined. Usually, $\varepsilon$ is chosen to be less than 1. Let us suppose that the NN weights are tiny, in that $\|\theta(t)\| \le \varepsilon$. Also, let $\lambda\varepsilon < \|g(t)\| < \lambda$. Then, the order of the update at time $t$ is $\lambda\varepsilon$ in the adaptive clipping scenario, less than the corresponding update when using the classic clipping method. Now, consider the scenario wherein $\|\theta(t)\|$ is very large, in particular $\|\theta(t)\| > 1$. When $\|g(t)\| \ge \lambda \|\theta(t)\|$, the update according to the adaptive clip is in the order of $\lambda \|\theta(t)\|$, while the update recommended by the classic method is in the order of $\lambda$. In the latter case, the adaptive clipping method updates proportional to the norm of the NN weight vector.
The adaptive clipping given by (17) is simplistic. In practice, the updates for different layers of the NN are clipped differently. In order to update $(\theta_{ip})_{i,j}$, the adaptive clip method calculates the scaling factor as

$$\frac{\|g(t)\|}{\lambda \left[ \|\theta_{ip}(t)\|_F \vee \varepsilon \right]} \vee 1, \qquad (18)$$

where $\|\theta_{ip}(t)\|_F$ is the Frobenius norm of the input weight matrix at time $t$. Suppose we flatten $\theta_{ip}(t)$ into a vector; then one can show that the Frobenius norm equals the Euclidean norm of the flattened version (Halmos, 2017). By clipping updates to different layers differently, the weights are updated in a manner that overcomes the vanishing gradient problem. Although qualitatively different, we will use (17) and not (18) in our analysis, since there is no difference in terms of the steps involved. Further, it greatly reduces clutter and improves readability and ease of understanding. A sketch of both forms follows below.
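A sketch of both clipping factors, the simple form (17) and the per-layer form (18) with the Frobenius norm; the parameter and function names are ours:

```python
import numpy as np

def adaptive_clip_factor(g, theta, lam, eps):
    """c(t) = ||g|| / (lambda * (||theta|| max eps)) max 1, cf. (17)."""
    scale = lam * max(np.linalg.norm(theta), eps)
    return max(np.linalg.norm(g) / scale, 1.0)

def per_layer_clip_factor(g_layer, theta_layer, lam, eps):
    """Per-layer variant (18): the layer's weight matrix enters via its
    Frobenius norm, i.e., the Euclidean norm of the flattened matrix."""
    scale = lam * max(np.linalg.norm(theta_layer, ord='fro'), eps)
    return max(np.linalg.norm(g_layer) / scale, 1.0)
```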
Recall that in the classic clipping method (10), the updates are always bounded by $\lambda$. However, the update in the adaptive clipping scheme can be in the order of the NN weights. Formally, we claim the following.

Claim 2. In the adaptive clipping scheme, (17), $\frac{\|g(t)\|}{c(t)} \le K(1 + \|\theta(t)\|)$, where $K > 0$ is an iterate-independent constant.
Proof. Suppose $\frac{\|g(t)\|}{\|\theta(t)\| \vee \varepsilon} \le \lambda$; then $c(t) = 1$. Also,
$\|g(t)\| \le \lambda \left[ \|\theta(t)\| \vee \varepsilon \right]$. Define $K := \lambda\varepsilon \vee \lambda \vee 1$; then we get that $\frac{\|g(t)\|}{c(t)} \le K(1 + \|\theta(t)\|)$.

Now, suppose $\frac{\|g(t)\|}{\|\theta(t)\| \vee \varepsilon} > \lambda$; then $c(t) = \frac{\|g(t)\|}{\lambda \left[ \|\theta(t)\| \vee \varepsilon \right]}$. Further,

$$\frac{g(t)}{c(t)} = \lambda \left[ \|\theta(t)\| \vee \varepsilon \right] \frac{g(t)}{\|g(t)\|}. \qquad (19)$$

We can choose the same $K$ as before in order to get the required linear bound in this case as well.
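A small randomized check of the bound in Claim 2, with $K = \lambda\varepsilon \vee \lambda \vee 1$ as in the proof (the sampling distributions are arbitrary choices of ours):

```python
import numpy as np

lam, eps = 30.0, 0.5
K = max(lam * eps, lam, 1.0)              # K = lambda*eps max lambda max 1
rng = np.random.default_rng(3)
for _ in range(100000):
    theta = rng.normal(size=4) * rng.exponential(10.0)
    g = rng.normal(size=4) * rng.exponential(10.0)
    c = max(np.linalg.norm(g) / (lam * max(np.linalg.norm(theta), eps)), 1.0)
    assert np.linalg.norm(g) / c <= K * (1 + np.linalg.norm(theta)) + 1e-9
```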
We have thus shown that the updates in the adaptive clipping scheme grow linearly as a function of the NN weight vector. Stability is therefore not guaranteed. Additional conditions must be satisfied for the clipped gradient scheme to be numerically stable. There are extensive studies on the stability of systems that grow linearly. The reader is referred to (Ramaswamy and Bhatnagar, 2017), (Ramaswamy and Bhatnagar, 2018) and (Borkar, 2009) for more on this. We assume that sufficient conditions for stability, as discussed in the aforementioned literature, can be verified for the adaptive clipping case. We will, in effect, assume that the adaptive clipping method is stable with very high probability.

(A3) The adaptive clipping gradient descent scheme is numerically stable with very high probability.
4 ASYMPTOTICS OF GRADIENT CLIPPING METHODS
In Section 3, we saw that under mild conditions the clipped gradient descent scheme (10) is numerically stable with very high probability. We also saw that the adaptive clipping scheme is not always numerically stable, on account of aggressive gradient updates. However, in this case we may dip into the theory of linear systems to ensure stability. Given that the gradient descent algorithm with clipping, classic or adaptive, is numerically stable, the question remains: does it converge to the required set of NN weights that minimizes the loss? In this section, we use tools from Dynamical Systems Theory for the convergence analysis of the two gradient clipping methods. Important literature on Dynamical Systems Theory includes (Borkar, 2009), (Aubin and Cellina, 2012) and (Benaïm et al., 2005).
For the convergence analysis, we use the theory from (Benaïm et al., 2005). This theory is based on viewing the clipped gradient descent scheme as a stochastic approximation algorithm, and associating with it an ordinary differential equation. The associated o.d.e. has the same asymptotic properties as the stochastic approximation algorithm. Recall the clipped gradient descent scheme:

$$\theta(t+1) = \theta(t) - a(t)\, \frac{g(t)}{c(t)}. \qquad \text{(recalled from (10))}$$

Here the clipping constant $c(t)$ is either (11) or (17), depending on whether we are in the classic or the adaptive setting. In order to utilize the theory developed in (Benaïm et al., 2005), we rewrite (10) as the stochastic approximation algorithm below:

$$\theta(t+1) = \theta(t) - a(t) \left[ E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right] + M(t+1) \right], \qquad (20)$$
where $M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$, and $\mu$ is the joint probability distribution over $X \times Y$. Recall that the training dataset $D$ is generated by sampling from $\mu$.
First, we show that the objective function in (20) is a Marchaud map. A point-to-set map $H : \mathbb{R}^m \to \{ \text{subsets of } \mathbb{R}^m \}$ is called Marchaud iff it satisfies the following conditions: (a) $H(x)$ is convex and compact; (b) $\sup_{z \in H(x)} \|z\| \le K(1 + \|x\|)$ for a fixed $K > 0$; and (c) if $x_n \to x$ and $z_n \to z$ in $\mathbb{R}^m$ such that $z_n \in H(x_n)$ for all $n \ge 0$, then $z \in H(x)$ (upper semicontinuity property). Although our objective function $H(\theta) := E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right]$ is a standard point-to-point map, we may view it as a trivial point-to-set map such that $H(\theta)$ is the singleton $\left\{ E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\}$.
Lemma 2. Under assumption (A1), the objective function in the clipped gradient descent (20), viewed as the trivial point-to-set map $H : \theta \mapsto \left\{ E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\}$, is Marchaud.
Proof. Since the objective $H(\theta)$ is really a point-to-point map, it is trivially convex and compact. Now, we note that $g \equiv \nabla_\theta \ell_r$ for regression and $g \equiv \nabla_\theta \ell_c$ for classification. We know from Claim 1 that $g$ is continuous. The clipping function $c$ is also continuous; this follows from the continuity of the max operator and from the continuity of $\|g\|$. Hence, $\frac{g}{c}$ is continuous in the $\theta$ variable, keeping $x$ and $y$ fixed. The continuity of $E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right]$ can be shown using the Dominated Convergence Theorem (Rudin et al., 1976) or (Durrett, 2019). The upper semicontinuity property of $H$ boils down to standard continuity, as the range of $H$ only consists of singletons.
It is left to show that there exists $K > 0$ such that $E_{(x,y) \sim \mu}\!\left\| \frac{g}{c} \right\| \le K(1 + \|\theta\|)$. When the clipping function is given by (11), then we know that $\left\| \frac{g}{c} \right\| \le \lambda$. For the adaptive clipping function (17), it follows from Claim 2 that there exists $K > 0$ such that $\left\| \frac{g}{c} \right\| \le K(1 + \|\theta\|)$. Therefore, in both clipping schemes, we get the following:

$$\left\| E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\| \le E_{(x,y) \sim \mu}\!\left\| \frac{g}{c} \right\| \le E_{(x,y) \sim \mu}\big[ K(1 + \|\theta\|) \big] = K(1 + \|\theta\|).$$

We have thus shown that the objective in (20) is a Marchaud map. In other words, the expected clipped loss-gradient is continuous and grows linearly as a function of the NN weights.
$M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$ in (20) is interpreted as the sampling error at time $t$: the error induced when using the sample $(x(t), y(t))$ to calculate the loss-gradient in lieu of the expected clipped loss-gradient. On the sample paths (read “runs of the clipped gradient descent algorithm”) where the condition $\sum_{t \ge 0} a(t) M(t+1) < \infty$ is met, it can be shown that (20) has the desired asymptotic properties. Shortly, we will show that numerical stability of the algorithm implies that the above-mentioned condition holds. Hence, the condition holds with very high probability. Before we can state the precise statement, we need to define the following random variables, $\xi(0) = 0$ and $\xi(t+1) = \sum_{s=0}^{t} a(s) M(s+1)$ for $t \ge 0$, and we need to recall the following filtration: $\mathcal{F}_0 := \sigma\langle \theta(0) \rangle$ and $\mathcal{F}_s := \sigma\big\langle \theta(t),\, (x(u), y(u)) : 0 \le t \le s,\ 0 \le u \le s-1 \big\rangle$ for $s \ge 1$.
Lemma 3. Assuming (A1), (A2) and (A3), we get that $(\xi(t), \mathcal{F}_t)_{t \ge 0}$ is a Martingale sequence such that $\lim_{t \to \infty} \xi(t)$ exists with high probability. Hence, $\sum_{t \ge 0} a(t) M(t+1) < \infty$ with high probability.
Proof. First, observe that $E[\xi(t+1) \mid \mathcal{F}_t] = \xi(t) + E[a(t) M(t+1) \mid \mathcal{F}_t]$. In order to conclude that the given sequence is a Martingale, we need to show that $E[a(t) M(t+1) \mid \mathcal{F}_t] = 0$. Recall that $M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$; hence $E[a(t) M(t+1) \mid \mathcal{F}_t] = a(t) \left( E\!\left[ \frac{g(t)}{c(t)} \,\Big|\, \mathcal{F}_t \right] - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right] \right)$. It, however, follows from the definition of the filtration that $(x(t), y(t))$ is independent of $\mathcal{F}_t$; hence $E\!\left[ \frac{g(t)}{c(t)} \,\Big|\, \mathcal{F}_t \right] = E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$. This yields the required equality to zero.
Now, we show that the Martingale sequence has a limit with high probability. For this, we need to define the associated quadratic variation process: $\langle \xi \rangle(t+1) := (\xi(t+1) - \xi(t)) \odot (\xi(t+1) - \xi(t))$, where $\odot$ is the operator for the element-wise multiplication of vectors and $t \ge 0$. On those sample points where $\sum_{t \ge 1} \langle \xi \rangle(t) < \infty$, the Martingale sequence has a limit (Durrett, 2019). Consider the following:

$$\sum_{t \ge 1} \|\langle \xi \rangle(t)\| \le \sum_{t \ge 0} a(t)^2 \|M(t+1)\|^2, \qquad \sum_{t \ge 0} a(t)^2 \|M(t+1)\|^2 \le \sum_{t \ge 0} a(t)^2\, C\big(1 + \|\theta(t)\|^2\big).$$

The second inequality above follows from Claim 2, for some $C > 0$. Combining the assumption that $\sum_{t \ge 0} a(t)^2 < \infty$ (A2) with Lemma 1 (in the case of classic clipping) or (A3) (in the case of adaptive clipping), we get that $\sum_{t \ge 0} a(t)^2\, C(1 + \|\theta(t)\|^2) < \infty$ with high probability. Therefore, $\sum_{t \ge 1} \langle \xi \rangle(t) < \infty$ with high probability.
In this lemma, we have essentially shown that whenever the clipped gradient descent is numerically stable, the errors that arise due to the use of training samples vanish asymptotically. Since we have already shown stability with high probability, the sampling errors also vanish with the same high probability. Now, we need to analyze the convergence set of the clipped gradient scheme.
The loss-gradient is a function of the NN weights and the training data. In order to state the main result of this paper, we will make this dependence explicit. Specifically, we let $g(t) \equiv \nabla_\theta \ell(\theta(t), x(t), y(t))$; also note that we have dropped the $r$ and $c$ subscripts from the loss-gradient to mask the setting being considered, regression or classification. Similarly, we let the clipping function $c(t) \equiv c(\theta(t), x(t), y(t))$. Therefore, in the re-written stochastic approximation equivalent of the clipped gradient descent, we get that the objective function

$$H(\theta) \equiv E_{(x,y) \sim \mu}\!\left[ \frac{\nabla_\theta \ell(\theta, x, y)}{c(\theta, x, y)} \right]. \qquad (21)$$

Before we state the main result of this paper, we recall the clipped gradient descent algorithm below:

$$\theta(t+1) = \theta(t) - a(t)\big( H(\theta(t)) + M(t+1) \big), \qquad \text{((20) recalled)}$$

where $H$ is redefined in (21) and $M(t+1)$ is as previously defined following (20).
Theorem 1. The classic clipped gradient descent converges with high probability to $\theta(\infty)$ such that

$$E_{(x,y) \sim \mu}\!\left[ \frac{\nabla_\theta \ell(\theta(\infty), x, y)}{c(\theta(\infty), x, y)} \right] = 0, \qquad (22)$$

provided assumptions (A1) and (A2) are satisfied. If we additionally assume (A3), then the adaptive clipped gradient descent also converges with high probability to $\theta(\infty)$ satisfying (22).
Proof. In Lemma 2, we showed that $H$ is a Marchaud map. We also showed that the Martingale noise vanishes asymptotically with high probability; see Lemma 3. We may couple these lemmas with the assumptions made to apply the theory developed in (Benaïm et al., 2005). It lets us conclude the following. When the clipped gradient descent is numerically stable, which we show happens with high probability, the limit of the clipped gradient descent coincides with the limit of an associated solution to the differential inclusion $\dot{\theta}(t) \in -H(\theta(t))$, the limit being taken as $t \to \infty$. Further, this limit has the property that it is an equilibrium of $H$. This means $H(\theta(\infty)) = 0$, given that $\theta(\infty)$ is the limit. Since the clipped gradient schemes, classic and adaptive, are stable with high probability, we get that they converge with high probability to $\theta(\infty)$ satisfying (22).
Colloquially speaking, the clipped gradient descent converges, with high probability, to the set of NN weights with the property that, on an average, the clipped loss-gradient is zero. Since the clipping factor is always positive, we may conclude that the loss-gradient itself is zero on an average. By "average", we mean an expectation taken with respect to the data distribution $\mu$. When the loss-gradient is zero, we can conclude that in most circumstances the algorithm has converged to a local minimum of the training loss function.
5 CONCLUSIONS
In this paper, we considered the method of gradient clipping used to control the exploding gradients problem in deep learning. We discussed the classic and adaptive variants of norm-based clipping. For the classic clipping method, we showed that it is numerically stable with high probability. It further converges to a local minimum of the average loss function. In the case of adaptive clipping, we observed that the updates are in the order of the NN weights. Due to the linear growth of the update as a function of the NN weights, one cannot guarantee stability without additional assumptions. However, we observed that we may dip into the theory of linear dynamical systems in order to ensure stability in the adaptive clipping scheme. Once numerical stability is guaranteed, the adaptive clipping method converges, as before, to a local minimizer of the average training loss. It must be noted that the averaging in both variants is with respect to the distribution that generated the training data.
REFERENCES
Aubin, J.-P. and Cellina, A. (2012). Differential inclu-
sions: set-valued maps and viability theory, volume
264. Springer Science & Business Media.
Benaïm, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348.
Bercu, B., Delyon, B., and Rio, E. (2015). Concentration
inequalities for sums and martingales. Springer.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recog-
nition and machine learning, volume 4. Springer.
Borkar, V. S. (2009). Stochastic approximation: a dynami-
cal systems viewpoint, volume 48. Springer.
Boyd, S. and Vandenberghe, L. (2004). Convex optimiza-
tion. Cambridge university press.
Durrett, R. (2018). Stochastic calculus: a practical intro-
duction. CRC press.
Durrett, R. (2019). Probability: theory and examples, vol-
ume 49. Cambridge university press.
Halmos, P. R. (2017). Finite-dimensional vector spaces.
Courier Dover Publications.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learn-
ing. nature, 521(7553):436–444.
Ramaswamy, A. and Bhatnagar, S. (2017). A generaliza-
tion of the borkar-meyn theorem for stochastic recur-
sive inclusions. Mathematics of Operations Research,
42(3):648–661.
Ramaswamy, A. and Bhatnagar, S. (2018). Stability of
stochastic approximations with “controlled markov”
noise and temporal difference learning. IEEE Trans-
actions on Automatic Control, 64(6):2614–2620.
Rehmer, A. and Kroll, A. (2020). On the vanishing and
exploding gradient problem in gated recurrent units.
IFAC-PapersOnLine, 53(2):1243–1248.
Rudin, W. et al. (1976). Principles of mathematical analy-
sis, volume 3. McGraw-hill New York.
Schmidhuber, J. (2015). Deep learning in neural networks:
An overview. Neural networks, 61:85–117.
Vu, T. and Raich, R. (2022). On asymptotic linear con-
vergence of projected gradient descent for constrained
least squares. IEEE Transactions on Signal Process-
ing, 70:4061–4076.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. (2019). Why
gradient clipping accelerates training: A theoretical
justification for adaptivity. In International Confer-
ence on Learning Representations.