Gradient Clipping in Deep Learning: A Dynamical Systems Perspective
Arunselvan Ramaswamy
Dept. of Mathematics and Computer Science, Karlstad University, 651 88 Karlstad, Sweden
https://orcid.org/0000-0001-7547-8111
Keywords:
Deep Learning, Adaptive Gradient Clipping, Dynamical Systems Perspective, Learning Theory, Supervised Learning.
Abstract:
Neural networks are ubiquitous components of Machine Learning (ML) algorithms. However, training them is challenging due to problems associated with exploding and vanishing loss-gradients. Gradient clipping has been shown to effectively combat both the vanishing and the exploding gradients problems. As the name suggests, gradients are clipped in order to prevent large updates. At the same time, very small neural network weights are updated using larger step-sizes. Although widely used in practice, there is very little theory surrounding clipping. In this paper, we analyze two popular gradient clipping techniques: the classic norm-based gradient clipping method and the adaptive gradient clipping technique. We prove that gradient clipping ensures numerical stability with very high probability. Further, clipping-based stochastic gradient descent converges to a set of neural network weights that minimizes the average scaled training loss in a local sense. The averaging is with respect to the distribution that generated the training data. The scaling is a consequence of gradient clipping. We use tools from the theory of dynamical systems for the presented analysis.
1 INTRODUCTION
The proliferation of complex neural network (NN) architectures, coupled with sophisticated Machine Learning (ML) algorithms and cheap increased computational capacity, has caused a push towards automation in various aspects of society. Supervised learning is an important paradigm of ML, where the aim is to learn an unknown map $f : X \to Y$ using finitely many examples $\{(x_i, f(x_i)) \mid x_i \in X,\ 1 \le i \le N\}$ (LeCun et al., 2015). $X$ is referred to as the input space and $Y$ as the output space. Typically, $X \subseteq \mathbb{R}^d$, for some $d \ge 1$, and $Y$ is either discrete or continuous. When $Y$ is discrete, we are in the setting of classification. When $Y$ is continuous, it is typically a subset of the real space $\mathbb{R}$, and the setting is called regression. The unknown function $f$ is called the target. The aim is to train a predictor in order to approximate $f$. The example data used for training is called the training data, $D := \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y,\ 1 \le i \le N\}$. The standard assumption in ML is that the dataset $D$ is generated by sampling $N$ datapoints from $X \times Y$ in an independent manner using the same joint probability distribution, say $\mu$. The reader is referred to (Bishop and Nasrabadi, 2006) for details.
In deep learning, the predictor used to approximate the target is a deep neural network (DNN). It is a NN with two or more hidden layers. The goal is to find a set of DNN weights so that the prediction errors are minimized, and the resulting DNN is a good approximation of the target $f$. An appropriate loss function is defined and the NN weights are iteratively updated to minimize the loss using the training data $D$. Minimizing the loss minimizes the prediction errors. In this paper, we consider NN training using two important loss functions: the cross-entropy loss (for classification) and the mean squared error (for regression). Stochastic gradient descent is a simple yet popular choice for training a NN (LeCun et al., 2015). It involves the following update step, at time $t$, for the NN weight vector $\theta$:

$$\theta(t+1) = \theta(t) - a(t)\,\frac{1}{M}\sum_{j=1}^{M} \nabla_\theta \ell\big(\theta(t),\, x_{i(j)}(t),\, y_{i(j)}(t)\big), \qquad (1)$$

where $a(t)$ is the step-size or learning rate, $\ell$ is the loss, $\big(x_{i(j)}(t),\, y_{i(j)}(t)\big) \in D$, and $M < \infty$ is the sample-batch size. The sample-batch is sampled uniformly from $D$. For the sake of simplicity in presentation, we let $M = 1$. Also, we use $(x(t), y(t))$ to represent the datapoint sampled at time $t$ for the update. Hence,
(1) becomes

$$\theta(t+1) = \theta(t) - a(t)\, \nabla_\theta \ell\big(\theta(t),\, x(t),\, y(t)\big). \qquad (2)$$
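As a concrete illustration, the following is a minimal NumPy sketch of the single-sample update (2). The helper `grad_loss` is our stand-in for the back-propagated loss-gradient, and the toy least-squares model is our own construction, not part of the paper.

```python
import numpy as np

def sgd_step(theta, x_t, y_t, a_t, grad_loss):
    """One single-sample SGD update, cf. (2).

    grad_loss(theta, x, y) is assumed to return the loss-gradient
    with respect to theta, e.g., computed by back propagation.
    """
    return theta - a_t * grad_loss(theta, x_t, y_t)

# Toy usage: least squares with a linear model, so the gradient
# 2 * (f(theta, x) - y) * grad_theta f(theta, x) has closed form.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
grad = lambda th, x, y: 2.0 * (th @ x - y) * x
for t in range(1, 1001):
    x_t = rng.normal(size=3)
    y_t = x_t @ np.array([1.0, -2.0, 0.5])
    theta = sgd_step(theta, x_t, y_t, 1.0 / t, grad)  # a(t) = 1/t satisfies (A2)
```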
Derivatives in NNs are computed using the back propagation algorithm. In practice, owing to the limits of machine precision, the farther the loss-derivative needs to be backpropagated, the more it vanishes. In effect, some NN weights are never updated. The complementary problem is that of exploding gradients: numerical instability caused by large update values. These issues are very well documented; see (Rehmer and Kroll, 2020), (Schmidhuber, 2015) and (LeCun et al., 2015). Gradient clipping is a popular solution to the exploding gradient problem. In this paper, we explore two methods: the classic norm-based gradient clipping and the adaptive clipping method. Adaptive clipping additionally addresses the vanishing gradients issue, albeit only partially.
The principle behind gradient clipping is to clip the gradient when it grows too large, in order to prevent numerically unstable large updates. In the classic version of the norm-based gradient clipping scheme, the gradient is divided (scaled down) by its norm when the norm exceeds a certain predetermined value. We analyze its stability properties in Section 3.1 and discuss its asymptotic properties in Section 4. In particular, we show that with high probability, the clipped gradient descent is numerically stable and converges to a local minimizer of the loss function, on an average. Although widely used in deep learning solutions and empirically successful, there is very little theory surrounding it. In the past, researchers have studied its performance under idealized conditions. To the best of our knowledge, ours is the first study that analyzes the practical version of the algorithm.
Formally speaking, in the classic method, we divide the gradient update by $\|\nabla_\theta \ell\| / \lambda$, for some $\lambda > 0$, when the gradient norm exceeds a predetermined value. There is a variant of this approach that is less conservative, called the adaptive gradient clipping method. Here, the update is divided by $\|\nabla_\theta \ell\| / (\lambda \|\theta\|)$ when the gradient norm exceeds a certain multiple of the norm of the NN weights. Here, the updates are larger and convergence is faster (Zhang et al., 2019). Although very insightful, again, only an idealized version of this algorithm is analyzed in (Zhang et al., 2019). In this paper, we remedy this. Specifically, we show that numerical stability is not automatic, since the updates are in the order of the NN weights. If stable, then we show that the adaptive variant converges to a local minimizer of the loss function in an averaged sense. The averaging is with respect to the training data distribution.
The organization of this paper is as follows. First, in Section 2, we formally discuss the general gradient descent in the context of deep learning. Next, in Section 3, we introduce the classic and adaptive clipped gradient methods and discuss their numerical stability. Finally, we discuss their asymptotic properties in Section 4.
2 STOCHASTIC GRADIENTS IN DEEP LEARNING
Let us begin with the simple stochastic gradient descent algorithm involved in training any neural network (NN). It is described by the following iteration:

$$\theta(t+1) = \theta(t) - a(t)\, g(t), \qquad (3)$$

where $\theta(t)$ represents the vector of NN weights at time $t$, $a(t)$ is the learning rate or step-size at time $t$, and $g(t)$ is the loss-gradient. In this paper, we consider two loss functions, one that is popular for classification and the other that is popular for regression. In particular, we consider the cross-entropy and the mean squared losses. It must however be noted that the theory presented herein can be generally applied to other loss functions.
In a typical regression application, the output of the NN is a real number. Let us represent the output by $f(\theta, x)$, where $x$ is the input query instance and $\theta$ is the NN weight-vector. The mean squared error is then

$$\ell_r = (f(\theta, x) - y)^2, \qquad (4)$$

where $y$ is the true label of $x$. The loss-gradient is then $2(f(\theta, x) - y)\, \nabla_\theta f(\theta, x)$. In order to obtain $\nabla_\theta f(\theta, x)$, the algorithm performs a backward pass (back propagation) through the NN. The forward pass yields $f(\theta, x)$.
In the $K$-class classification setting, the NN has $K$ neurons (activations) in the output layer. Say the outputs of these activations are $z_i$, $1 \le i \le K$. Then, $p_i := e^{z_i} / \sum_{j=1}^{K} e^{z_j}$ is interpreted as the posterior probability that the true label is $i$. This is called the soft-max classifier, and it is trained using the cross-entropy loss:

$$\ell_c = -\sum_{i=1}^{K} 1_{y=i} \log p_i. \qquad (5)$$
Here, $1_{y=i}$ is the indicator function that takes values 1 or 0, depending on whether the true class-label is $i$ or not. In the traditional setting of classification, exactly one label is associated with each query instance $x$. Suppose the true label is $i$; then the partial
derivative of $\ell_c$ with respect to $z_k$ is given by

$$-\frac{\sum_{j=1}^{K} e^{z_j}}{e^{z_i}} \cdot \frac{1_{k=i}\, e^{z_i} \sum_{j=1}^{K} e^{z_j} - e^{z_i}\, e^{z_k}}{\left( \sum_{j=1}^{K} e^{z_j} \right)^2}. \qquad (6)$$
We can further simplify the above equation to get $-1_{k=i} + e^{z_k} / \sum_{j=1}^{K} e^{z_j} = p_k - 1_{k=i}$. This gradient information is back propagated through the NN in order to update the NN weights.
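As a small sanity check, the following sketch (our construction, assuming only NumPy) computes $p_k - 1_{k=i}$ and verifies it against a central finite-difference approximation of (5):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, i):
    """Cross-entropy loss (5) when the true label is i."""
    return -np.log(softmax(z)[i])

rng = np.random.default_rng(1)
z, i = rng.normal(size=4), 2
analytic = softmax(z) - np.eye(4)[i]       # p_k - 1_{k=i}

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(4)[k], i)
     - cross_entropy(z - eps * np.eye(4)[k], i)) / (2 * eps)
    for k in range(4)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```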
We are now ready to make a couple of assumptions in our analysis.
(A1) All activations in the NN are twice continuously differentiable. Popular examples include sigmoid, tanh and the Gaussian error linear unit.
The input to the NN is a vector $x \in \mathbb{R}^d$, for some $d \ge 1$. This input is first passed through the input layer; this is represented by $\Sigma(\theta_{ip}\, x)$, where $\theta_{ip}$ is a matrix with one row per activation in the input layer and $d$ columns. $\Sigma$ is an operator that takes a vector $z$, of some dimension, as input and outputs a vector of the same dimension such that its $i^{th}$ component equals the activation applied to the $i^{th}$ component of $z$, i.e., $\Sigma(z)_i = \sigma(z_i)$. The vector $\Sigma(\theta_{ip}\, x)$ is then passed through the first hidden layer to get $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$. Here, $\theta_h$ is a matrix of appropriate dimension. It represents the set of NN weights from the first hidden layer. Now, $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$ may be passed through more hidden layers; however, for the sake of simplicity, we assume there are no more hidden layers. This means that $\Sigma(\theta_h\, \Sigma(\theta_{ip}\, x))$ is passed through the output layer, which again looks like $\Sigma(\theta_{op}\, \Sigma(\theta_h\, \Sigma(\theta_{ip}\, x)))$.
In the classification setting, the output of the NN, $f(\theta, x)$, is then a class-label distribution given by $\left( e^{z_i} / \sum_{j=1}^{K} e^{z_j} \right)_{1 \le i \le K}$, where $K$ is the number of classes and $z_i := \Sigma(\theta_{op}\, \Sigma(\theta_h\, \Sigma(\theta_{ip}\, x)))_i$. Note that $\theta$ is the vector of all NN weights $\theta_{ip}$, $\theta_h$ and $\theta_{op}$. In regression, $f(\theta, x) = \sum_i z_i\, \theta_{l_i}$; the output is a linear combination of the activation outputs from the output layer, and $z_i$ is as defined before. Note that $\theta_l$ is a vector that is also a part of the NN weight vector $\theta$.
Passing the input through the various layers of a NN is called the forward pass. It must be noted that the above description of a NN is not complete. In particular, every vector is appended with 1 before passing it through the next layer. In particular, $(1, x)$ is the input. Then, $\big(1, \Sigma(\theta_{ip}\, (1, x)^{\top})\big)$ is the input to the hidden layer, and so on. This is done to incorporate bias information. These biases are trainable as a part of the NN weight vector. Again, for the sake of simplicity in presentation, we omit the biases.
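The forward pass described above can be sketched as follows. The layer sizes, the choice of tanh as the activation $\sigma$, and all function names are our illustrative assumptions; biases are omitted, as in the text.

```python
import numpy as np

def Sigma(z):
    """Componentwise activation: Sigma(z)_i = sigma(z_i)."""
    return np.tanh(z)   # tanh is twice continuously differentiable, cf. (A1)

def forward_classification(theta_ip, theta_h, theta_op, x):
    """Class-label distribution: softmax over the output activations z."""
    z = Sigma(theta_op @ Sigma(theta_h @ Sigma(theta_ip @ x)))
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_regression(theta_ip, theta_h, theta_op, theta_l, x):
    """Regression output: a linear combination of the output activations."""
    z = Sigma(theta_op @ Sigma(theta_h @ Sigma(theta_ip @ x)))
    return theta_l @ z

rng = np.random.default_rng(2)
d, m, K = 5, 8, 3                      # input dimension, layer width, classes
x = rng.normal(size=d)
theta_ip = rng.normal(size=(m, d))     # one row per input-layer activation
theta_h = rng.normal(size=(m, m))
theta_op = rng.normal(size=(K, m))
p = forward_classification(theta_ip, theta_h, theta_op, x)   # sums to 1
```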
Claim 1. Under (A1), $\frac{\partial \ell_r}{\partial \theta_i}$, $\frac{\partial \ell_c}{\partial \theta_i}$, $\frac{\partial^2 \ell_r}{\partial \theta_i \partial \theta_j}$ and $\frac{\partial^2 \ell_c}{\partial \theta_i \partial \theta_j}$ exist and are continuous, where $\theta_i$ and $\theta_j$ are components of $\theta$, and $\ell_r$ and $\ell_c$ are defined in (4) and (5), respectively. Therefore, $\ell_r$, $\ell_c$, $\nabla_\theta \ell_r$ and $\nabla_\theta \ell_c$ are all locally Lipschitz continuous.
Proof. Let us first consider the case of regression, with the loss function being the mean squared error. Since (A1) requires that the activation functions are twice continuously differentiable, we directly get that $f(\theta, x)$ is also twice continuously differentiable, as it is a composition of operations that are themselves twice continuously differentiable. Specifically, this is a direct consequence of $f$ being a repeated composition of activations and linear combinations. Hence, $\frac{\partial f(\theta, x)}{\partial \theta_i}$ and $\frac{\partial^2 f(\theta, x)}{\partial \theta_i \partial \theta_j}$ are continuous, directly implying that $\frac{\partial \ell_r}{\partial \theta_i}$ and $\frac{\partial^2 \ell_r}{\partial \theta_i \partial \theta_j}$ are also continuous.

Now, we move on to classification. Recall from (6) the loss-gradient with respect to $z_k$, $1 \le k \le K$, when the true label is $y = i$: $p_k - 1_{k=i}$.
We therefore get:

$$\frac{\partial \ell_c}{\partial \theta_i} = \sum_{k=1}^{K} \frac{\partial z_k}{\partial \theta_i}\, \frac{\partial \ell_c}{\partial z_k}. \qquad (7)$$
Suppose $\theta_i$ is not used to calculate the output value $z_k$; then clearly $\frac{\partial z_k}{\partial \theta_i} = 0$. For example, $(\theta_{op})_{2,3}$ is only used in the calculation of $z_2$, but not the other $z_j$'s, e.g., $\frac{\partial z_1}{\partial (\theta_{op})_{2,3}} = 0$. Recall that $\theta_{op}$ is a matrix; hence $(\theta_{op})_{2,3}$ is the element in row 2 and column 3. The NN weight vector $\theta$ is obtained by flattening all the matrices and appending them. As before, $\frac{\partial^2 z_k}{\partial \theta_i \partial \theta_j}$ is readily calculated, and is continuous as a consequence of (A1). Since $\frac{\partial \ell_c}{\partial z_k} = p_k - 1_{k=i}$, it is also continuous.
Also,

$$\frac{\partial^2 \ell_c}{\partial \theta_j \partial \theta_i} = \sum_{k=1}^{K} \left[ \frac{\partial^2 z_k}{\partial \theta_j \partial \theta_i}\, \frac{\partial \ell_c}{\partial z_k} + \frac{\partial z_k}{\partial \theta_i} \sum_{l=1}^{K} \frac{\partial z_l}{\partial \theta_j}\, \frac{\partial^2 \ell_c}{\partial z_l \partial z_k} \right]. \qquad (8)$$
We therefore get the required continuity of $\frac{\partial^2 \ell_c}{\partial \theta_j \partial \theta_i}$ and $\frac{\partial \ell_c}{\partial \theta_i}$.
A function $f : \mathbb{R}^d \to \mathbb{R}^m$ is Lipschitz continuous if there exists $L$ such that $\|f(x) - f(y)\| \le L \|x - y\|$ for all $x$ and $y$, where $d, m \ge 1$. It is locally Lipschitz continuous iff for every $x \in \mathbb{R}^d$ there exist $L_x > 0$ and $r_x > 0$ such
that $\|f(x) - f(y)\| \le L_x \|x - y\|$ for every $y \in B_{r_x}(x)$, where $B_{r_x}(x)$ is the ball of radius $r_x$ centered at the point $x$. Suppose $f$ is continuously differentiable; then $f$ is locally Lipschitz continuous (Rudin et al., 1976). Similarly, $\nabla_x f$ is locally Lipschitz continuous when the Hessian is continuous. The second part of the claim, regarding Lipschitzness, is a direct consequence of these.
3 GRADIENT CLIPPING AND NUMERICAL STABILITY
In this section, we introduce the concept of gradient clipping. We delve into the details of two important gradient clipping methods and formally discuss their numerical stability. We begin with a discussion of the classic norm-based gradient clipping that is popular in Deep Learning.
3.1 Gradient Clipping in Deep Learning
Let us return to the popular stochastic gradient descent algorithm given by (3). Suppose assumption (A1) is satisfied; then we get from Claim 1 that $g(t)$ is locally Lipschitz continuous in the $\theta$ coordinate. Recall that the loss-gradient at time $t$ is calculated using the datapoint $(x(t), y(t))$ from the dataset $D$. In the regression setting the loss-gradient is $g(t) \equiv 2(f(\theta(t), x(t)) - y(t))\, \nabla_\theta f(\theta(t), x(t))$, and it is similarly calculated in the classification setting. At time $t$, the datapoint $(x(t), y(t))$ is processed from the given training dataset in order to obtain the loss-gradient $g(t)$. There could be more than one datapoint processed, e.g., in a batch-processing implementation of the learning algorithm. For the sake of simplicity in presentation, we assume that a single datapoint is processed at every point in time in order to calculate the loss-gradient.
As the NN is highly non-linear, the calculated gradient $g(t)$ may be so large that the weight update process is numerically unstable. This is called the exploding gradients problem, and it is common when using the cross-entropy loss function. The non-linearity of a NN is directly proportional to its depth. Depth, however, also means that the loss-gradients may not propagate back into the network: the weights in the input and the initial hidden layers are updated using very small, numerically insignificant gradient values. This is called the vanishing gradient problem, and it is common when using deep network architectures.
The exploding gradient problem is often countered by projecting the iterate after every step onto a predetermined compact set $\mathcal{K}$; see (Boyd and Vandenberghe, 2004) and (Vu and Raich, 2022). The projected scheme is given by

$$\theta(t+1) \leftarrow \theta(t) - a(t)\, g(t), \qquad \theta(t+1) \leftarrow \Pi_{\mathcal{K}}\big(\theta(t+1)\big), \qquad (9)$$

where $\Pi_{\mathcal{K}}$ is the projection operator. It projects $\theta(t+1)$ onto the nearest point in $\mathcal{K}$. Hence, (9) is guaranteed to be numerically stable. However, it does not always converge to the optimum. Gradient clipping, our main focus here, is another popular solution to the exploding gradients problem. Intuitively, the idea is to “clip” large gradients in order to maintain stability. In particular, instead of (3) or (9), the update is given by

$$\theta(t+1) = \theta(t) - a(t)\, \frac{g(t)}{c(t)}, \qquad (10)$$
where $c(t)$ is the clipping value at time $t$. In the classical version of norm-based clipping,

$$c(t) := \frac{\|g(t)\|}{\lambda} \vee 1, \qquad (11)$$

where $\vee$ is the max operator. When $\|g(t)\| > \lambda$, the gradient update is scaled down by $\|g(t)\| / \lambda$; here $\lambda > 0$ is a predetermined value. The NN weights are updated using gradients that are “clipped at $\lambda$” to prevent the updates from exploding. In order to formally show numerical stability under norm-based gradient clipping, we need the following assumption on the learning rate.
(A2) $a(t) > 0$ for all $t \ge 0$, $\sum_{t=0}^{\infty} a(t) = \infty$ and $\sum_{t=0}^{\infty} a(t)^2 < \infty$.
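For concreteness, a minimal sketch of the clipped update (10) with the classic rule (11) might look as follows; `grad_loss` is again an assumed stand-in for back propagation, not a fixed API.

```python
import numpy as np

def classic_clip_factor(g, lam):
    """c(t) = (||g(t)|| / lambda) max 1, cf. (11)."""
    return max(np.linalg.norm(g) / lam, 1.0)

def clipped_sgd_step(theta, x_t, y_t, a_t, lam, grad_loss):
    g = grad_loss(theta, x_t, y_t)
    c = classic_clip_factor(g, lam)
    # The effective update has norm at most a_t * lam.
    return theta - a_t * g / c
```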
We now claim that (10) does not experience “finite-time blowup” with very high probability. By the absence of finite-time blowup we mean that the NN weights, updated according to (10), remain within a sphere of radius $M$ centered around the origin for the entire duration of the experiment. We may think of $M$ as a very large positive number determined by the largest number that can be processed by a computer and the upper-bound on the duration of the experiment.

Lemma 1. Assuming (A1) and (A2), the stochastic gradient descent with the classic norm-based clipping, (10), does not experience finite-time blowup with very high probability.
Proof. Let us rewrite (10) as follows:

$$\theta(s) = \theta(0) - \sum_{t=0}^{s-1} a(t)\, \frac{g(t)}{c(t)}. \qquad (12)$$

Let $T < \infty$ be the duration of the experiment, and let $M > 0$ be a very large arbitrary constant. We need to show that $\|\theta(t)\| < M$ for $0 \le t \le T$ with very high probability. Since $M$ is arbitrary, we show an equivalent variant of this; specifically, we show that $\|\theta(t) - \theta(0)\| < M$ for $1 \le t \le T$ with very high probability. First, we begin by observing that for $s \ge 1$,

$$\|\theta(s)\| \le \|\theta(0)\| + \sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)}. \qquad (13)$$

Next, we define the following random variables for $0 \le s \le T$: $M_s := \|\theta(0)\| + \sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)}$. We also define the following sigma-algebras: $\mathcal{F}_0 := \sigma\langle \theta(0) \rangle$ and $\mathcal{F}_s := \sigma\big\langle \theta(t),\, (x(u), y(u)) : 0 \le t \le s,\ 0 \le u \le s-1 \big\rangle$.
Since the datapoints from the training dataset are independently generated, we get that $E[M_{s+1} \mid \mathcal{F}_s] \ge M_s$ for $0 \le s \le T-1$. Hence, $(M_t, \mathcal{F}_t)_{0 \le t \le T}$ is a sub-Martingale. The reader is referred to (Durrett, 2019) or (Durrett, 2018) for the definitions of a sub-Martingale and a sigma-algebra. Since $c(t) = \frac{\|g(t)\|}{\lambda} \vee 1$, we get that $\frac{\|g(t)\|}{c(t)} \le \lambda$; further, we get that $|M_{t+1} - M_t| \le a(t)\lambda$. It follows from the Hoeffding-Azuma inequality (Bercu et al., 2015) that

$$P(|M_s - M_0| > M) \le 2\, e^{-2M^2 / \left( \sum_{t=0}^{s-1} a(t)^2 \lambda^2 \right)} \qquad (14)$$
for $1 \le s \le T$. Now, since $\sum_{t=0}^{s-1} a(t)^2 \le \sum_{t=0}^{\infty} a(t)^2 =: \gamma < \infty$,

$$P(|M_s - M_0| > M) \le 2\, e^{-2M^2 / \left( \sum_{t=0}^{s-1} a(t)^2 \lambda^2 \right)} \le 2\, e^{-2M^2 / (\gamma \lambda^2)}. \qquad (15)$$
Let us say that $a(t) = 1/t$ is the learning rate; then $\sum_{t \ge 1} \frac{1}{t^2} = \frac{\pi^2}{6}$, i.e., $\gamma = \frac{\pi^2}{6}$. Since $M$ is arbitrary, let us fix it as $2^{10}$. A typical value for $\lambda$ is something like 30. Plugging these values into (15), we get that
$$P(|M_s - M_0| > M) \le 2\, e^{-1416}. \qquad (16)$$
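The exponent in (16) can be recomputed directly; this is a quick arithmetic check under the stated choices, nothing more:

```python
import math

gamma = math.pi ** 2 / 6                 # sum over t >= 1 of 1/t^2
M, lam = 2 ** 10, 30.0
print(2 * M ** 2 / (gamma * lam ** 2))   # ~1416.6, hence the bound 2*exp(-1416)
```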
The RHS of (16) is very close to zero. If we choose $s = T$, then we get that $P(|M_T - M_0| \le M) \ge 1 - 2e^{-1416} \approx 1$. Since the sequence $(M_s)$ is almost surely non-decreasing, $|M_s - M_0| \le |M_T - M_0|$ for all $s \le T$, and we get that $P(|M_s - M_0| \le M \text{ for all } 1 \le s \le T) \approx 1$. We get from (12) that $\|\theta(s) - \theta(0)\| = \left\| \sum_{t=0}^{s-1} a(t)\, \frac{g(t)}{c(t)} \right\|$. The RHS of this equality is further bounded by $\sum_{t=0}^{s-1} a(t)\, \frac{\|g(t)\|}{c(t)} = |M_s - M_0|$. Hence, we conclude that $\|\theta(s) - \theta(0)\| \le M$ whenever $|M_s - M_0| \le M$. This gives us the required $P\big( \|\theta(s) - \theta(0)\| \le M \text{ for all } 1 \le s \le T \big) \approx 1$.
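For illustration only, the following simulation (a toy least-squares construction of ours, not from the paper) tracks the total movement $\|\theta(s) - \theta(0)\|$ of the clipped iterates; with $a(t) = 1/t$ and $\lambda = 30$ it stays far below $M = 2^{10}$, in line with the lemma:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 30.0, 100000
target = np.array([1.0, -2.0, 0.5])
theta = rng.normal(size=3)
theta0 = theta.copy()
drift = 0.0
for t in range(1, T + 1):
    x = rng.normal(size=3)
    y = x @ target
    g = 2.0 * (theta @ x - y) * x          # mean squared error gradient
    c = max(np.linalg.norm(g) / lam, 1.0)  # classic rule (11)
    theta = theta - (1.0 / t) * g / c
    drift = max(drift, np.linalg.norm(theta - theta0))
# Each step moves theta by at most a(t) * lam; the Hoeffding-Azuma
# argument above makes this quantitative.
print(drift)   # stays far below M = 2**10
```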
3.2 Adaptive Gradient Clipping in Deep Learning
The classic gradient clipping method may slow down the rate of convergence via conservative updates. Adaptive gradient clipping tries to remedy this by modifying $c(t)$ as follows:

$$c(t) = \frac{\|g(t)\|}{\lambda \left[ \|\theta(t)\| \vee \varepsilon \right]} \vee 1. \qquad (17)$$

Like $\lambda$, $\varepsilon$ is also predetermined. Usually, $\varepsilon$ is chosen to be less than 1. Let us suppose that the NN weights are tiny, in that $\|\theta(t)\| \le \varepsilon$. Also, let $\lambda\varepsilon < \|g(t)\| < \lambda$. Then, the order of the update at time $t$ is $\lambda\varepsilon$ in the adaptive clipping scenario, less than the corresponding update when using the classic clipping method. Now, consider the scenario wherein $\|\theta(t)\|$ is very large, in particular $\|\theta(t)\| > 1$. When $\|g(t)\| \ge \lambda \|\theta(t)\|$, the update according to the adaptive clip is in the order of $\lambda \|\theta(t)\|$, while the update recommended by the classic method is in the order of $\lambda$. In the latter case, the adaptive clipping method updates proportional to the norm of the NN weight vector.
The adaptive clipping given by (17) is simplistic. In practice, the updates for different layers of the NN are clipped differently. In order to update $(\theta_{ip})_{i,j}$, the adaptive clip method calculates the scaling factor as

$$\frac{\|g(t)\|}{\lambda \left[ \|\theta_{ip}(t)\|_F \vee \varepsilon \right]} \vee 1, \qquad (18)$$

where $\|\theta_{ip}(t)\|_F$ is the Frobenius norm of the input weight matrix at time $t$. Suppose we flatten $\theta_{ip}(t)$ into a vector; then one can show that the Frobenius norm equals the Euclidean norm of the flattened version (Halmos, 2017). By clipping updates to different layers differently, the weights are updated in a manner that overcomes the vanishing gradient problem. Although qualitatively different, we will use (17) and not (18) in our analysis, since there is no difference in terms of the steps involved. Further, it greatly reduces clutter and improves readability and ease of understanding. A sketch of both forms follows below.
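A sketch of both clipping factors, the simple form (17) and the per-layer form (18) with the Frobenius norm; the parameter and function names are ours:

```python
import numpy as np

def adaptive_clip_factor(g, theta, lam, eps):
    """c(t) = ||g|| / (lambda * (||theta|| max eps)) max 1, cf. (17)."""
    scale = lam * max(np.linalg.norm(theta), eps)
    return max(np.linalg.norm(g) / scale, 1.0)

def per_layer_clip_factor(g_layer, theta_layer, lam, eps):
    """Per-layer variant (18): the layer's weight matrix enters via its
    Frobenius norm, i.e., the Euclidean norm of the flattened matrix."""
    scale = lam * max(np.linalg.norm(theta_layer, ord='fro'), eps)
    return max(np.linalg.norm(g_layer) / scale, 1.0)
```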
Recall that in the classic clipping method (10), the updates are always bounded by $\lambda$. However, the update in the adaptive clipping scheme can be in the order of the NN weights. Formally, we claim the following.

Claim 2. In the adaptive clipping scheme, (17), $\frac{\|g(t)\|}{c(t)} \le K(1 + \|\theta(t)\|)$, where $K > 0$ is an iterate-independent constant.
Proof. Suppose $\frac{\|g(t)\|}{\|\theta(t)\| \vee \varepsilon} \le \lambda$; then $c(t) = 1$. Also,
$\|g(t)\| \le \lambda \left[ \|\theta(t)\| \vee \varepsilon \right]$. Define $K := \lambda\varepsilon \vee \lambda \vee 1$; then we get that $\frac{\|g(t)\|}{c(t)} \le K(1 + \|\theta(t)\|)$.

Now, suppose $\frac{\|g(t)\|}{\|\theta(t)\| \vee \varepsilon} > \lambda$; then $c(t) = \frac{\|g(t)\|}{\lambda \left[ \|\theta(t)\| \vee \varepsilon \right]}$. Further,

$$\frac{g(t)}{c(t)} = \lambda \left[ \|\theta(t)\| \vee \varepsilon \right] \frac{g(t)}{\|g(t)\|}. \qquad (19)$$

We can choose the same $K$ as before in order to get the required linear bound in this case as well.
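A small randomized check of the bound in Claim 2, with $K = \lambda\varepsilon \vee \lambda \vee 1$ as in the proof (the sampling distributions are arbitrary choices of ours):

```python
import numpy as np

lam, eps = 30.0, 0.5
K = max(lam * eps, lam, 1.0)              # K = lambda*eps max lambda max 1
rng = np.random.default_rng(3)
for _ in range(100000):
    theta = rng.normal(size=4) * rng.exponential(10.0)
    g = rng.normal(size=4) * rng.exponential(10.0)
    c = max(np.linalg.norm(g) / (lam * max(np.linalg.norm(theta), eps)), 1.0)
    assert np.linalg.norm(g) / c <= K * (1 + np.linalg.norm(theta)) + 1e-9
```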
We have thus shown that the updates in the adaptive clipping scheme grow linearly as a function of the NN weight vector. Stability is therefore not guaranteed. Additional conditions must be satisfied for the clipped gradient scheme to be numerically stable. There are extensive studies on the stability of systems that grow linearly. The reader is referred to (Ramaswamy and Bhatnagar, 2017), (Ramaswamy and Bhatnagar, 2018) and (Borkar, 2009) for more on this. We assume that sufficient conditions for stability, as discussed in the aforementioned literature, can be verified for the adaptive clipping case. We will, in effect, assume that the adaptive clipping method is stable with very high probability.

(A3) The adaptive clipping gradient descent scheme is numerically stable with very high probability.
4 ASYMPTOTICS OF GRADIENT CLIPPING METHODS
In Section 3, we saw that under mild conditions the clipped gradient descent scheme (10) is numerically stable with very high probability. We also saw that the adaptive clipping scheme is not always numerically stable, on account of aggressive gradient updates. However, in this case we may dip into the theory of linear systems to ensure stability. Given that the gradient descent algorithm with clipping, classic or adaptive, is numerically stable, the question remains: does it converge to the required set of NN weights that minimizes the loss? In this section, we use tools from Dynamical Systems Theory for the convergence analysis of the two gradient clipping methods. Important literature on Dynamical Systems Theory includes (Borkar, 2009), (Aubin and Cellina, 2012) and (Benaïm et al., 2005).
For the convergence analysis, we use the theory from (Benaïm et al., 2005). This theory is based on viewing the clipped gradient descent scheme as a stochastic approximation algorithm, and associating with it an ordinary differential equation. The associated o.d.e. has the same asymptotic properties as the stochastic approximation algorithm. Recall the clipped gradient descent scheme:

$$\theta(t+1) = \theta(t) - a(t)\, \frac{g(t)}{c(t)}. \qquad \text{(recalled from (10))}$$

Here the clipping constant $c(t)$ is either (11) or (17), depending on whether we are in the classic or the adaptive setting. In order to utilize the theory developed in (Benaïm et al., 2005), we rewrite (10) as the stochastic approximation algorithm below:

$$\theta(t+1) = \theta(t) - a(t) \left[ E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right] + M(t+1) \right], \qquad (20)$$
where $M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$, and $\mu$ is the joint probability distribution over $X \times Y$. Recall that the training dataset $D$ is generated by sampling from $\mu$.
First, we show that the objective function in (20) is a Marchaud map. A point-to-set map $H : \mathbb{R}^m \to \{ \text{subsets of } \mathbb{R}^m \}$ is called Marchaud iff it satisfies the following conditions: (a) $H(x)$ is convex and compact; (b) $\sup_{z \in H(x)} \|z\| \le K(1 + \|x\|)$ for a fixed $K > 0$; and (c) if $x_n \to x$ and $z_n \to z$ in $\mathbb{R}^m$ such that $z_n \in H(x_n)$ for all $n \ge 0$, then $z \in H(x)$ (upper semicontinuity property). Although our objective function $H(\theta) := E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right]$ is a standard point-to-point map, we may view it as a trivial point-to-set map such that $H(\theta)$ is the singleton $\left\{ E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\}$.
Lemma 2. Under assumption (A1), the objective function in the clipped gradient descent (20), viewed as the trivial point-to-set map $H : \theta \mapsto \left\{ E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\}$, is Marchaud.
Proof. Since the objective $H(\theta)$ is really a point-to-point map, it is trivially convex and compact. Now, we note that $g \equiv \nabla_\theta \ell_r$ for regression and $g \equiv \nabla_\theta \ell_c$ for classification. We know from Claim 1 that $g$ is continuous. The clipping function $c$ is also continuous; this follows from the continuity of the max operator and from the continuity of $\|g\|$. Hence, $\frac{g}{c}$ is continuous in the $\theta$ variable, keeping $x$ and $y$ fixed. The continuity of $E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right]$ can be shown using the Dominated Convergence Theorem (Rudin et al., 1976) or (Durrett, 2019). The upper semicontinuity property of $H$ boils down to standard continuity, as the range of $H$ only consists of singletons.
It is left to show that there exists $K > 0$ such that $E_{(x,y) \sim \mu}\!\left\| \frac{g}{c} \right\| \le K(1 + \|\theta\|)$. When the clipping function is given by (11), then we know that $\left\| \frac{g}{c} \right\| \le \lambda$. For the adaptive clipping function (17), it follows from Claim 2 that there exists $K > 0$ such that $\left\| \frac{g}{c} \right\| \le K(1 + \|\theta\|)$. Therefore, in both clipping schemes, we get the following:

$$\left\| E_{(x,y) \sim \mu}\!\left[ \frac{g}{c} \right] \right\| \le E_{(x,y) \sim \mu}\!\left\| \frac{g}{c} \right\| \le E_{(x,y) \sim \mu}\big[ K(1 + \|\theta\|) \big] = K(1 + \|\theta\|).$$

We have thus shown that the objective in (20) is a Marchaud map. In other words, the expected clipped loss-gradient is continuous and grows linearly as a function of the NN weights.
$M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$ in (20) is interpreted as the sampling error at time $t$: the error induced when using the sample $(x(t), y(t))$ to calculate the loss-gradient in lieu of the expected clipped loss-gradient. On the sample paths (read “runs of the clipped gradient descent algorithm”) where the condition $\sum_{t \ge 0} a(t) M(t+1) < \infty$ is met, it can be shown that (20) has the desired asymptotic properties. Shortly, we will show that numerical stability of the algorithm implies that the above-mentioned condition holds. Hence, the condition holds with very high probability. Before we can state the precise statement, we need to define the following random variables, $\xi(0) = 0$ and $\xi(t+1) = \sum_{s=0}^{t} a(s) M(s+1)$ for $t \ge 0$, and we need to recall the following filtration: $\mathcal{F}_0 := \sigma\langle \theta(0) \rangle$ and $\mathcal{F}_s := \sigma\big\langle \theta(t),\, (x(u), y(u)) : 0 \le t \le s,\ 0 \le u \le s-1 \big\rangle$ for $s \ge 1$.
Lemma 3. Assuming (A1), (A2) and (A3), we get that $(\xi(t), \mathcal{F}_t)_{t \ge 0}$ is a Martingale sequence such that $\lim_{t \to \infty} \xi(t)$ exists with high probability. Hence, $\sum_{t \ge 0} a(t) M(t+1) < \infty$ with high probability.
Proof. First, observe that $E[\xi(t+1) \mid \mathcal{F}_t] = \xi(t) + E[a(t) M(t+1) \mid \mathcal{F}_t]$. In order to conclude that the given sequence is a Martingale, we need to show that $E[a(t) M(t+1) \mid \mathcal{F}_t] = 0$. Recall that $M(t+1) = \frac{g(t)}{c(t)} - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$; hence $E[a(t) M(t+1) \mid \mathcal{F}_t] = a(t) \left( E\!\left[ \frac{g(t)}{c(t)} \,\Big|\, \mathcal{F}_t \right] - E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right] \right)$. It, however, follows from the definition of the filtration that $(x(t), y(t))$ is independent of $\mathcal{F}_t$; hence $E\!\left[ \frac{g(t)}{c(t)} \,\Big|\, \mathcal{F}_t \right] = E_{(x(t), y(t)) \sim \mu}\!\left[ \frac{g(t)}{c(t)} \right]$. This yields the required equality to zero.
Now, we show that the Martingale sequence has a limit with high probability. For this, we need to define the associated quadratic variation process: $\langle \xi \rangle(t+1) := (\xi(t+1) - \xi(t)) \odot (\xi(t+1) - \xi(t))$, where $\odot$ is the operator for the element-wise multiplication of vectors and $t \ge 0$. On those sample points where $\sum_{t \ge 1} \langle \xi \rangle(t) < \infty$, the Martingale sequence has a limit (Durrett, 2019). Consider the following:

$$\sum_{t \ge 1} \|\langle \xi \rangle(t)\| \le \sum_{t \ge 0} a(t)^2 \|M(t+1)\|^2, \qquad \sum_{t \ge 0} a(t)^2 \|M(t+1)\|^2 \le \sum_{t \ge 0} a(t)^2\, C\big(1 + \|\theta(t)\|^2\big).$$

The second inequality above follows from Claim 2, for some $C > 0$. Combining the assumption that $\sum_{t \ge 0} a(t)^2 < \infty$ (A2) with Lemma 1 (in the case of classic clipping) or (A3) (in the case of adaptive clipping), we get that $\sum_{t \ge 0} a(t)^2\, C(1 + \|\theta(t)\|^2) < \infty$ with high probability. Therefore, $\sum_{t \ge 1} \langle \xi \rangle(t) < \infty$ with high probability.
In this lemma, we have essentially shown that whenever the clipped gradient descent is numerically stable, the errors that arise due to the use of training samples vanish asymptotically. Since we have already shown stability with high probability, the sampling errors also vanish with the same high probability. Now, we need to analyze the convergence set of the clipped gradient scheme.
The loss-gradient is a function of the NN weights and the training data. In order to state the main result of this paper, we will make this dependence explicit. Specifically, we let $g(t) \equiv \nabla_\theta \ell(\theta(t), x(t), y(t))$; also note that we have dropped the $r$ and $c$ subscripts from the loss-gradient to mask the setting being considered, regression or classification. Similarly, we let the clipping function $c(t) \equiv c(\theta(t), x(t), y(t))$. Therefore, in the re-written stochastic approximation equivalent of the clipped gradient descent, we get that the objective function

$$H(\theta) \equiv E_{(x,y) \sim \mu}\!\left[ \frac{\nabla_\theta \ell(\theta, x, y)}{c(\theta, x, y)} \right]. \qquad (21)$$

Before we state the main result of this paper, we recall the clipped gradient descent algorithm below:

$$\theta(t+1) = \theta(t) - a(t)\big( H(\theta(t)) + M(t+1) \big), \qquad \text{((20) recalled)}$$

where $H$ is redefined in (21) and $M(t+1)$ is as previously defined following (20).
Theorem 1. The classic clipped gradient descent converges with high probability to $\theta(\infty)$ such that

$$E_{(x,y) \sim \mu}\!\left[ \frac{\nabla_\theta \ell(\theta(\infty), x, y)}{c(\theta(\infty), x, y)} \right] = 0, \qquad (22)$$

provided assumptions (A1) and (A2) are satisfied. If we additionally assume (A3), then the adaptive clipped gradient descent also converges with high probability to $\theta(\infty)$ satisfying (22).
Proof. In Lemma 2, we showed that $H$ is a Marchaud map. We also showed that the Martingale noise vanishes asymptotically with high probability; see Lemma 3. We may couple these lemmas with the assumptions made to apply the theory developed in (Benaïm et al., 2005). It lets us conclude the following. When the clipped gradient descent is numerically stable, which we show happens with high probability, the limit of the clipped gradient descent coincides with the limit of an associated solution to the differential inclusion $\dot{\theta}(t) \in -H(\theta(t))$, the limit being taken as $t \to \infty$. Further, this limit has the property that it is an equilibrium of $H$. This means $H(\theta(\infty)) = 0$, given that $\theta(\infty)$ is the limit. Since the clipped gradient schemes, classic and adaptive, are stable with high probability, we get that they converge with high probability to $\theta(\infty)$ satisfying (22).
Colloquially speaking, the clipped gradient descent converges, with high probability, to the set of NN weights with the property that, on an average, the clipped loss-gradient is zero. Since the clipping factor is always positive, we may conclude that the loss-gradient itself is zero on an average. By "average", we mean an expectation taken with respect to the data distribution $\mu$. When the loss-gradient is zero, we can conclude that in most circumstances the algorithm has converged to a local minimum of the training loss function.
5 CONCLUSIONS
In this paper, we considered the method of gradient clipping used to control the exploding gradients problem in deep learning. We discussed the classic and adaptive variants of norm-based clipping. For the classic clipping method, we showed that it is numerically stable with high probability. It further converges to a local minimum of the average loss function. In the case of adaptive clipping, we observed that the updates are in the order of the NN weights. Due to the linear growth of the update as a function of the NN weights, one cannot guarantee stability without additional assumptions. However, we observed that we may dip into the theory of linear dynamical systems in order to ensure stability in the adaptive clipping scheme. Once numerical stability is guaranteed, the adaptive clipping method converges, as before, to a local minimizer of the average training loss. It must be noted that the averaging in both variants is with respect to the distribution that generated the training data.
REFERENCES
Aubin, J.-P. and Cellina, A. (2012). Differential inclu-
sions: set-valued maps and viability theory, volume
264. Springer Science & Business Media.
Benaïm, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348.
Bercu, B., Delyon, B., and Rio, E. (2015). Concentration
inequalities for sums and martingales. Springer.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recog-
nition and machine learning, volume 4. Springer.
Borkar, V. S. (2009). Stochastic approximation: a dynami-
cal systems viewpoint, volume 48. Springer.
Boyd, S. and Vandenberghe, L. (2004). Convex optimiza-
tion. Cambridge university press.
Durrett, R. (2018). Stochastic calculus: a practical intro-
duction. CRC press.
Durrett, R. (2019). Probability: theory and examples, vol-
ume 49. Cambridge university press.
Halmos, P. R. (2017). Finite-dimensional vector spaces.
Courier Dover Publications.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learn-
ing. nature, 521(7553):436–444.
Ramaswamy, A. and Bhatnagar, S. (2017). A generaliza-
tion of the borkar-meyn theorem for stochastic recur-
sive inclusions. Mathematics of Operations Research,
42(3):648–661.
Ramaswamy, A. and Bhatnagar, S. (2018). Stability of
stochastic approximations with “controlled markov”
noise and temporal difference learning. IEEE Trans-
actions on Automatic Control, 64(6):2614–2620.
Rehmer, A. and Kroll, A. (2020). On the vanishing and
exploding gradient problem in gated recurrent units.
IFAC-PapersOnLine, 53(2):1243–1248.
Rudin, W. et al. (1976). Principles of mathematical analy-
sis, volume 3. McGraw-hill New York.
Schmidhuber, J. (2015). Deep learning in neural networks:
An overview. Neural networks, 61:85–117.
Vu, T. and Raich, R. (2022). On asymptotic linear con-
vergence of projected gradient descent for constrained
least squares. IEEE Transactions on Signal Process-
ing, 70:4061–4076.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. (2019). Why
gradient clipping accelerates training: A theoretical
justification for adaptivity. In International Confer-
ence on Learning Representations.