
gion without approaching the genuine minimum.
6 CONCLUSION
Our empirical results strongly support the hypothesis that loss functions exhibit a predictable convexity structure, proceeding from initial non-convexity towards final convexity, which enables targeted optimization strategies that outperform conventional methods. Initial weight parameters (small random values) fall into the non-convex region, while a broad neighborhood of the loss minimum is convex. The validity of this hypothesis can be observed in the evolution of the gradient norm as a function of the instantaneous loss: a norm that grows while the loss decreases indicates non-convexity, while a shrinking norm suggests convexity.
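As an illustration of this diagnostic, the following minimal sketch (not taken from the paper; the smoothing window and the 10 % decrease margin are illustrative assumptions) locates the swap point as the peak of a smoothed gradient-norm history:

import numpy as np

def detect_swap_point(grad_norms, window=5):
    """Return the index of the gradient-norm peak, or None if no peak is evident yet.

    grad_norms: sequence of gradient norms recorded during training.
    window:     width of the moving average used to suppress mini-batch noise
                (an illustrative choice, not a value from the paper).
    """
    g = np.asarray(grad_norms, dtype=float)
    if len(g) < window:
        return None                                  # not enough history yet
    kernel = np.ones(window) / window
    smoothed = np.convolve(g, kernel, mode="valid")
    peak = int(np.argmax(smoothed)) + window // 2    # shift back to the raw index scale
    # Report a swap point only if the norm has clearly fallen since the peak;
    # the 10 % margin is an arbitrary illustrative threshold.
    if smoothed[-1] < 0.9 * smoothed.max():
        return peak
    return None                                      # presumably still in the non-convex phase

Smoothing matters in practice because mini-batch gradients make the raw norm sequence noisy; a single spike should not be mistaken for the peak.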
This can be exploited to identify the swap point (the gradient norm peak) between the two regions. An efficient non-convex algorithm such as Adam can then be applied in the initial non-convex phase, and a fast second-order algorithm such as CG, with guaranteed superlinear convergence, can be used in the second phase.
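A possible realization of this two-phase scheme is sketched below in PyTorch. It reuses the hypothetical detect_swap_point helper from the previous sketch; because PyTorch ships no conjugate-gradient optimizer, LBFGS is used here merely as a stand-in for the second (convex) phase, whereas the paper's experiments use CG (Fletcher and Reeves, 1964). The names train_two_phase, loss_fn, and data_loader are assumptions for illustration:

import torch

def train_two_phase(model, loss_fn, data_loader, max_epochs=100, window=5):
    """Adam in the presumed non-convex phase, a second-order method afterwards."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    grad_norms, switched = [], False

    for _ in range(max_epochs):
        for x, y in data_loader:
            def closure():
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                return loss

            optimizer.step(closure)       # every PyTorch optimizer accepts a closure

            # record the global gradient norm for swap-point detection
            total_sq = sum((p.grad.detach() ** 2).sum()
                           for p in model.parameters() if p.grad is not None)
            grad_norms.append(float(torch.sqrt(total_sq)))

        # switch once the gradient-norm peak has been passed (helper from above);
        # LBFGS is only a stand-in here, the paper's second phase uses CG
        if not switched and detect_swap_point(grad_norms, window) is not None:
            optimizer = torch.optim.LBFGS(model.parameters(), max_iter=20,
                                          history_size=10,
                                          line_search_fn="strong_wolfe")
            switched = True
    return grad_norms

The closure-based update is used in both phases because all PyTorch optimizers accept an optional closure, while LBFGS requires one; this keeps the inner loop identical before and after the switch.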
A set of benchmarks has been used to test the validity of the hypothesis and the efficiency of the resulting optimization scheme. Although the benchmarks are relatively small, to remain feasible with the available computing resources, they cover relevant variants of the ViT architecture that can be expected to affect convexity properties: using or omitting the MLP, defining the similarity in the attention mechanism symmetrically or asymmetrically, and representing the value vectors of the embeddings in compressed or uncompressed form (matrices W_v and W_o). A completely different architecture, the convolutional network VGG5, has also been tested.
The results have been surprisingly unambiguous. All variants exhibited the same pattern of the gradient norm increasing towards a swap point and decreasing after it. The final losses with the two-phase algorithm have always been better than those with a single algorithm (Adam). CG alone did not perform well in the initial non-convex phase, which caused a considerable lag so that the convex region was not attained. The same is true with a single exception for CIFAR-100. An analogous behavior can be observed for the performance on the validation set, which has admittedly been relatively poor for CIFAR-100 because of the excessive overdetermination of the given models: the parameter sets seem to have been insufficient for image classification with 100 classes. The top-5 accuracy on this dataset was more acceptable, exceeding 50 %.
Of course, it must be questioned how far this empirical finding can be generalized to arbitrary architectures, in particular to large models. One of the very difficult open questions is the convexity structure of loss functions for arbitrary models, or even for a model class relevant to practice. However, it is essential to note that there is no particular risk in using the two-phase method. Gradient norms can be monitored automatically, and deviations from the hypothesis can be identified. If there is evidence against a single gradient norm peak corresponding to the swap point, a non-convex method can be used to continue as a safe fallback. If the hypothesis is confirmed, there is an almost certain reward in convergence speed and accuracy.
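Such monitoring can be automated; the following crude unimodality check (again an illustrative sketch, with an arbitrary tolerance rather than the paper's criterion) reports whether the smoothed gradient-norm history shows a single rise and fall, and thus whether switching to the second-order phase appears justified:

import numpy as np

def single_peak_evidence(grad_norms, window=5, tolerance=0.1):
    """Crude check that the smoothed gradient-norm history is roughly unimodal:
    it rises to a single peak and falls afterwards. If this fails, keeping the
    non-convex optimizer is the safe choice. Window and tolerance are
    illustrative assumptions, not the paper's criterion."""
    g = np.asarray(grad_norms, dtype=float)
    if len(g) < 2 * window:
        return False                      # too little history to judge
    kernel = np.ones(window) / window
    s = np.convolve(g, kernel, mode="valid")
    peak = int(np.argmax(s))
    # tolerate small fluctuations relative to the peak height
    rises = np.all(np.diff(s[:peak + 1]) >= -tolerance * s.max())
    falls = np.all(np.diff(s[peak:]) <= tolerance * s.max())
    return bool(rises and falls)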
Nevertheless, the next goal of our work is to verify
the hypothesis on a large text-based model.
REFERENCES
Bermeitinger, B., Hrycej, T., Pavone, M., Kath, J., and Handschuh, S. (2024). Reducing the Transformer Architecture to a Minimum. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pages 234–241, Porto, Portugal. SCITEPRESS.

Chen, C., Shen, L., Zou, F., and Liu, W. (2022). Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. J. Mach. Learn. Res., 23(1):229:10411–229:10457.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, page 21, Vienna, Austria.

Ergen, T. and Pilanci, M. (2023). The Convex Landscape of Neural Networks: Characterizing Global Optima and Stationary Points via Lasso Models.

Fletcher, R. and Reeves, C. M. (1964). Function minimization by conjugate gradients. The Computer Journal, 7(2):149–154.

Fotopoulos, G. B., Popovich, P., and Papadopoulos, N. H. (2024). Review Non-convex Optimization Method for Machine Learning.

Hrycej, T., Bermeitinger, B., Cetto, M., and Handschuh, S. (2023). Mathematical Foundations of Data Science. Texts in Computer Science. Springer International Publishing, Cham.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.

Kingma, D. P. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations.