scape that can contain a local minimum, thereby hindering the overall exploration of the weight space for local minima.
The latter can be a primary cause of overfitting be-
cause a reduction in the probability of discovering lo-
cal minima could potentially force the optimizer to
search and converge within a limited region of the
weight (search) space, which could contain unfavor-
able narrow (sharp) local minima. The optimization
algorithm is therefore prevented from further explor-
ing other regions in the weight space that could have
corresponded to flatter, more favorable local minima,
where the coinciding weights would have been more
generalizable and insensitive to more extensive input
data distributions, as proven by (Cha et al., 2021) and
(He et al., 2019) under mild assumptions. Further-
more, (Keskar et al., 2016) showed that small-batch
SGD consistently converges to flatter, more general-
izable minimizers as compared to large-batch SGD,
which tends to converge to sharp minimizers of the
training and testing functions. Additionally, they attributed this reduction in the generalization gap under small-batch SGD to the inherent noise in the gradient estimation during the weight updates.
Having said that, one can observe an apparent
tradeoff between the variance (noise) and stability of
the gradients. The gradient’s variance-stability trade-
off can be closely linked to the popular exploration-
exploitation tradeoff that occurs in reinforcement
learning systems, where exploration involves move-
ments such as discovery, variation, risk-taking, and
search, while exploitation involves actions such as re-
finement, efficiency, and selection. When searching
for local minima of the loss landscape in the weight
space, the amount of variance in the gradient is analo-
gous to exploration, while the gradient’s stability is
analogous to exploitation. That is because the high
variance in the gradients causes the weight updates
to be noisy, which coincides with oscillations in the
loss function, either due to bouncing off some local
minimum’s basin or skipping over sharp ones. In con-
trast, higher gradient stability implies an exploitative
approach to the local geometry of the loss landscape,
which can be achieved by considering more gradient
statistics (a larger batch size) or by relying on a his-
tory of past gradients, such as incorporating a momen-
tum factor or using a previous fixed mini-batch gradi-
ent direction as an anchor in order to stabilize future
SGD updates. The latter is implemented in algorithms
such as SVRG and SARAH in a double-loop fashion,
with the fixed mini-batch gradient being updated in the outer loop $\frac{1}{k}$ times the number of updates in the inner loop, where $k$ corresponds to the number of individual SGD updates in the inner loop.
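A minimal sketch of this double-loop anchoring scheme, in the spirit of SVRG (our own simplification on a toy least-squares objective; the classic SVRG anchor is the full-batch gradient, refreshed once per outer loop):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)

def grad(w_, idx):
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w_ - yb) / len(idx)

lr, k = 0.01, 20
for _outer in range(50):
    w_anchor = w.copy()
    g_anchor = grad(w_anchor, np.arange(len(X)))   # anchor gradient, refreshed 1/k as often as inner updates
    for _ in range(k):                             # k inner stochastic updates
        idx = rng.integers(len(X), size=10)
        v = grad(w, idx) - grad(w_anchor, idx) + g_anchor   # variance-reduced direction
        w -= lr * v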
One common and straightforward way to pseudo-reduce the amount of variance in the weight updates is to alter the step size. To reduce the noise in the gradients, an extremely small step size is used, or a decaying factor is applied to the step size so that it shrinks continuously as training proceeds, reducing the amount of fluctuation in the weight updates and thereby inducing stabilization (exploitation). In
a similar fashion, employing a large step size would
correspond to a pseudo-increase in the amount of vari-
ance in our weight updates, leading to further ex-
ploration in the weight space. However, picking the
right step size is a labor-intensive, non-trivial prob-
lem, as its appropriate value widely varies from model
to model and task to task and requires tedious manual
hyperparameter tuning depending on multiple factors
such as the data type, the dataset used, the selected
choice of optimization algorithm, and other factors.
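As an illustration of the decay heuristic (the schedule and constants below are arbitrary choices for exposition, not a recommendation from this work), a simple inverse-time schedule looks as follows:

def decayed_lr(step, base_lr=0.1, decay=0.01):
    # 1/t-style decay: step size shrinks continuously as training proceeds
    return base_lr / (1.0 + decay * step)

# e.g., plugged into a plain SGD loop:
#   for t in range(num_steps):
#       w -= decayed_lr(t) * minibatch_grad(w)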
Furthermore, there exists a wealth of research relating flatness to generalizability, such as the works of (Keskar et al., 2016), (He et al., 2019), (Wen et al., 2018), and (Izmailov et al., 2018), which demonstrate the effectiveness of finding flatter local minima of the loss surface in improving the model's generalizability. Hence, one must seek to formulate techniques that either smooth and flatten the loss landscape on the training dataset or lead to convergence towards a flatter local minimum.
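One common flatness proxy, sketched below for concreteness (an illustrative assumption on our part, not a procedure taken from the cited works), measures how much the training loss rises under small random weight perturbations; a flat minimum exhibits only a small rise:

import numpy as np

def sharpness_proxy(loss_fn, w, radius=1e-2, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    rises = []
    for _ in range(trials):
        d = rng.normal(size=w.shape)
        d *= radius / np.linalg.norm(d)   # perturbation on a sphere of fixed radius
        rises.append(loss_fn(w + d) - base)
    return float(np.mean(rises))          # small value -> flat region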
Accordingly, we can condense our primary goal
of achieving model generalization to having the opti-
mizer converge to the best possible generalizable lo-
cal minimum by finding and attracting the flattest one,
because seeking flat minima can achieve better generalizability by maximizing classification margins in both in-domain and out-of-domain settings (Cha et al., 2021).
3 METHODOLOGY
Motivated by the variance-stability ∼ exploration-
exploitation tradeoff as well as the particular findings
of (Lengyel et al., 2021), (Cha et al., 2021), and
(Nar and Sastry, 2018), we propose a novel adaptive
variant of SGD, presented in Algorithm 1, named
Bouncing Gradient Descent (BGD), which aims
to ameliorate SGD’s deficiency of getting trapped
in suboptimal minima by using "unorthodox" ap-
proaches in the weight updates to achieve better
model generalization by attracting flat local minima.
The authors of (Lengyel et al., 2021) established
a strong correlation between the flatness of a loss
surface’s basin and the wideness of the classification
margins associated with it, and (Cha et al., 2021)