Authors:
Vinícius Di Oliveira 1,2; Pedro Brom 2,3; and Li Weigang 2

Affiliations:
1 Secretary of Economy, Brasilia, Federal District, Brazil
2 TransLab, University of Brasilia, Brasilia, Federal District, Brazil
3 Federal Institute of Brasilia, Brasilia, Federal District, Brazil
Keyword(s):
Large Language Models, Statistical Evaluation, Linear Mixed Models, Bootstrap Resampling, Variance Decomposition, Retrieval-Augmented Generation, LLM Evaluation.
Abstract:
Large Language Models (LLMs) have advanced natural language processing across diverse applications, yet their evaluation remains methodologically limited. Standard metrics such as accuracy or BLEU offer aggregate performance snapshots but fail to capture the inherent variability of LLM outputs under changes in prompt phrasing and in decoding parameters such as temperature and top-p. This limitation is particularly critical in high-stakes domains such as legal, fiscal, or healthcare contexts, where output consistency and interpretability are essential. To address this gap, we propose IMMBA: Integrated Mixed Models with Bootstrap Analysis, a statistically principled framework for robust LLM evaluation. IMMBA combines Linear Mixed Models (LMMs) with bootstrap resampling to decompose output variability into fixed effects (e.g., retrieval method, decoding configuration) and random effects (e.g., prompt phrasing), while improving estimation reliability under relaxed distributional assumptions. We validate IMMBA in a Retrieval-Augmented Generation (RAG) scenario involving structured commodity classification under the Mercosur Common Nomenclature (NCM). Our results demonstrate that IMMBA isolates meaningful performance factors and detects significant interaction effects across configurations. By integrating hierarchical modelling and resampling-based inference, IMMBA offers a reproducible and scalable foundation for evaluating LLMs in sensitive, variance-prone settings.
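
To make the LMM-plus-bootstrap idea concrete, the following Python sketch fits a mixed model with retrieval method and temperature as fixed effects and prompt phrasing as a random intercept, then applies a cluster bootstrap over prompts. This is a minimal illustration using statsmodels, not the authors' released implementation; the column names (score, method, temperature, prompt) are hypothetical placeholders for the evaluation data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_lmm(df):
    # Fixed effects: retrieval method, temperature, and their interaction;
    # random intercept for each prompt phrasing.
    model = smf.mixedlm("score ~ method * temperature", df, groups=df["prompt"])
    return model.fit()

def bootstrap_fixed_effects(df, n_boot=1000, seed=0):
    # Cluster bootstrap: resample prompts (the random-effect units) with
    # replacement, relabel duplicates so they stay distinct groups, and refit.
    rng = np.random.default_rng(seed)
    prompts = df["prompt"].unique()
    estimates = []
    for _ in range(n_boot):
        parts = []
        for i, p in enumerate(rng.choice(prompts, size=len(prompts), replace=True)):
            part = df[df["prompt"] == p].copy()
            part["prompt"] = f"boot_{i}"  # keep resampled copies as separate clusters
            parts.append(part)
        boot_df = pd.concat(parts, ignore_index=True)
        estimates.append(fit_lmm(boot_df).fe_params)
    return pd.DataFrame(estimates)

# Percentile confidence intervals for each fixed effect:
# bootstrap_fixed_effects(df).quantile([0.025, 0.975])

Resampling whole prompts, rather than individual rows, respects the hierarchical structure: rows sharing a prompt are correlated, so the prompt is the appropriate exchangeable unit for the bootstrap.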