Authors:
Jacob Nielsen, Lukas Galke and Peter Schneider-Kamp
Affiliation:
Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
Keyword(s):
Quantization, Quantization-Aware Training, Graph Neural Networks, Text Classification, Language Models.
Abstract:
Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. Quantization-aware pre-training with ternary weights (1.58 bits per weight) has shown promising results in decoder-only language models and facilitates memory-efficient inference. However, little is known about how quantization-aware training influences the training dynamics beyond such Transformer-based decoder-only language models. Here, we engage in a bottom-up exploration of quantization-aware training, starting with multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other Transformer-based language models: encoder-only and encoder-decoder models. Our results show that, in all of these settings, 1.58-bit training is on par with standard 32/16-bit models, yet we also identify challenges specific to 1.58-bit encoder-decoder models. Our results on decoder-only language models hint at a possible regularization effect introduced by quantization-aware training.
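For readers unfamiliar with ternary quantization-aware training, the sketch below shows one common way it is implemented in PyTorch, following the BitNet b1.58-style recipe of absmean weight quantization with a straight-through estimator. The class and function names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of 1.58-bit (ternary) quantization-aware training.
# Assumes a BitNet b1.58-style absmean quantizer; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1} times a per-tensor absmean scale."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: the forward pass sees quantized weights,
    # while gradients flow to the latent full-precision weights.
    return w + (w_q - w).detach()


class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly during training."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternary_quantize(self.weight), self.bias)


if __name__ == "__main__":
    layer = BitLinear(16, 4)
    x = torch.randn(8, 16)
    out = layer(x)
    out.sum().backward()  # gradients reach the latent fp32 weights
    print(out.shape, layer.weight.grad is not None)
```

In such a setup the latent full-precision weights are kept only during training; at inference time the ternary weights and a single scale per tensor suffice, which is what enables the memory savings described in the abstract.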