IMMBA: Integrated Mixed Models with Bootstrap Analysis - A Statistical Framework for Robust LLM Evaluation

Vinícius Di Oliveira, Pedro Brom, Li Weigang

2025

Abstract

Large Language Models (LLMs) have advanced natural language processing across diverse applications, yet their evaluation remains methodologically limited. Standard metrics such as accuracy or BLEU offer aggregate performance snapshots but fail to capture the inherent variability of LLM outputs under prompt changes and decoding parameters like temperature and top-p. This limitation is particularly critical in high-stakes domains, such as legal, fiscal, or healthcare contexts, where output consistency and interpretability are essential. To address this gap, we propose IMMBA: Integrated Mixed Models with Bootstrap Analysis, a statistically principled framework for robust LLM evaluation. IMMBA combines Linear Mixed Models (LMMs) with bootstrap resampling to decompose output variability into fixed effects (e.g., retrieval method, decoding configuration) and random effects (e.g., prompt phrasing), while improving estimation reliability under relaxed distributional assumptions. We validate IMMBA in a Retrieval-Augmented Generation (RAG) scenario involving structured commodity classification under the Mercosur Common Nomenclature (NCM). Our results demonstrate that IMMBA isolates meaningful performance factors and detects significant interaction effects across configurations. By integrating hierarchical modelling and resampling-based inference, IMMBA offers a reproducible and scalable foundation for evaluating LLMs in sensitive, variance-prone settings.
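
As a rough illustration of the resampling idea described in the abstract, the sketch below estimates a retrieval-method contrast while treating prompts as the resampled unit. It is not the authors' implementation and it does not fit a full Linear Mixed Model: it uses a simple per-prompt mean difference as a stand-in for the fixed-effect contrast, and a percentile cluster bootstrap over prompts for the interval. The data are synthetic, and every name here (`simulate_scores`, `method_contrast`, `cluster_bootstrap_ci`, the `"baseline"` and `"rag"` labels, the effect sizes) is a hypothetical choice made for this example.

```python
import random
import statistics


def simulate_scores(n_prompts=20, n_runs=5, seed=0):
    """Build synthetic scores: one dict per prompt, mapping method -> run scores.

    Illustrative data only; the effect sizes below are assumptions.
    """
    rng = random.Random(seed)
    method_effect = {"baseline": 0.0, "rag": 1.0}  # assumed fixed effects
    scores = []
    for _ in range(n_prompts):
        prompt_shift = rng.gauss(0.0, 1.0)  # random effect of prompt phrasing
        scores.append({
            method: [5.0 + effect + prompt_shift + rng.gauss(0.0, 0.5)
                     for _ in range(n_runs)]
            for method, effect in method_effect.items()
        })
    return scores


def method_contrast(scores):
    """Mean per-prompt difference (rag minus baseline), averaging runs first."""
    diffs = [statistics.fmean(p["rag"]) - statistics.fmean(p["baseline"])
             for p in scores]
    return statistics.fmean(diffs)


def cluster_bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=1):
    """Percentile interval from resampling whole prompts with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(scores, k=len(scores))  # prompts are the clusters
        estimates.append(method_contrast(resample))
    estimates.sort()
    low = estimates[int((alpha / 2) * n_boot)]
    high = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


scores = simulate_scores()
estimate = method_contrast(scores)
ci_low, ci_high = cluster_bootstrap_ci(scores)
print(f"rag - baseline: {estimate:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling whole prompts, rather than individual runs, keeps each prompt's runs together, so the interval reflects prompt-to-prompt variability as well as run-level noise. A fuller treatment along the lines the abstract describes would refit a mixed model on each resample instead of taking a mean difference.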


Paper Citation


in Harvard Style

Di Oliveira V., Brom P. and Weigang L. (2025). IMMBA: Integrated Mixed Models with Bootstrap Analysis - A Statistical Framework for Robust LLM Evaluation. In Proceedings of the 21st International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-772-6, SciTePress, pages 92-102. DOI: 10.5220/0013819400003985


in Bibtex Style

@conference{webist25,
author={Vinícius Di Oliveira and Pedro Brom and Li Weigang},
title={IMMBA: Integrated Mixed Models with Bootstrap Analysis - A Statistical Framework for Robust LLM Evaluation},
booktitle={Proceedings of the 21st International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2025},
pages={92-102},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013819400003985},
isbn={978-989-758-772-6},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 21st International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - IMMBA: Integrated Mixed Models with Bootstrap Analysis - A Statistical Framework for Robust LLM Evaluation
SN - 978-989-758-772-6
AU - Di Oliveira V.
AU - Brom P.
AU - Weigang L.
PY - 2025
SP - 92
EP - 102
DO - 10.5220/0013819400003985
PB - SciTePress