
Future work will also explore integration with established psychometric approaches, such as Classical Test Theory, to further refine output reliability assessment.
7 CONCLUSIONS
This study proposed a statistically principled framework for evaluating Large Language Models (LLMs), integrating Linear Mixed Models with bootstrap resampling to decompose performance variability across model configurations, retrieval methods, and linguistic factors.
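To make the pipeline concrete, the sketch below (a minimal illustration, not the authors' released implementation) fits a Linear Mixed Model with a random intercept for prompt phrasing and derives percentile confidence intervals for the fixed effects via a nonparametric cluster bootstrap. The column names score, model, temperature, and prompt_id are illustrative assumptions.

```python
# Minimal sketch: LMM with a random intercept per prompt phrasing,
# plus a cluster bootstrap over prompts for the fixed effects.
# Column names (score, model, temperature, prompt_id) are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_lmm(df: pd.DataFrame):
    # Fixed effects: model configuration and decoding temperature;
    # random intercept: prompt phrasing (the grouping factor).
    return smf.mixedlm("score ~ C(model) + temperature",
                       df, groups=df["prompt_id"]).fit(reml=True)

def cluster_bootstrap(df: pd.DataFrame, n_boot: int = 1000, seed: int = 0):
    rng = np.random.default_rng(seed)
    prompts = df["prompt_id"].unique()
    draws = []
    for _ in range(n_boot):
        # Resample whole prompt clusters with replacement so each
        # bootstrap replicate preserves the hierarchical structure.
        sampled = rng.choice(prompts, size=len(prompts), replace=True)
        boot = pd.concat(
            [df[df["prompt_id"] == p].assign(prompt_id=f"{p}_{i}")
             for i, p in enumerate(sampled)],
            ignore_index=True)
        draws.append(fit_lmm(boot).fe_params)
    # Percentile confidence intervals for each fixed-effect coefficient.
    return pd.DataFrame(draws).quantile([0.025, 0.975])
```

Resampling whole prompt clusters, rather than individual rows, keeps the bootstrap consistent with the grouping structure that the mixed model is designed to capture.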
Empirical results demonstrate that a significant portion of LLM output variability arises from interactions between architectural choices, decoding parameters, and prompt phrasing, which are often obscured by conventional aggregate evaluation metrics. The proposed methodology enables systematic quantification of these effects, providing robust, interpretable insights into model behaviour.
Variance decomposition revealed that fixed effects account for 23.2% of output variability, while prompt phrasing contributes 7.2%, underscoring the need for hierarchical modelling in LLM assessment. Bootstrap-enhanced estimation further improved the reliability of parameter inference, mitigating overconfidence in performance comparisons.
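These shares can be recovered from a fitted model by comparing the variance of the fixed-effects linear predictor with the estimated random-intercept and residual variances, in the spirit of variance-partitioning R-squared measures for mixed models. A hedged sketch, reusing fit_lmm from the listing above and assuming a single random intercept:

```python
# Sketch: proportion of output variance attributable to fixed effects,
# prompt phrasing, and residual noise, from a fitted statsmodels MixedLM.
import numpy as np

def variance_shares(result, df):
    # Variance of the fixed-effects-only predictions (X @ beta_hat).
    var_fixed = np.var(result.predict(df), ddof=1)
    # Random-intercept variance for prompt phrasing, and residual variance.
    var_prompt = float(result.cov_re.iloc[0, 0])
    var_resid = float(result.scale)
    total = var_fixed + var_prompt + var_resid
    return {"fixed": var_fixed / total,
            "prompt": var_prompt / total,
            "residual": var_resid / total}
```

Shares of this kind would correspond to the 23.2% fixed-effect and 7.2% prompt-phrasing contributions reported above, assuming the study used a comparable single-random-intercept decomposition.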
By addressing a key methodological gap in LLM evaluation, this work advances current practice towards more rigorous, reproducible, and interpretable assessment standards. The framework supports both academic research and real-world deployment in high-stakes applications where output precision and reliability are essential.
Future work will explore extensions to multilingual evaluation, domain-adaptive LLMs, and integration with psychometric approaches such as Classical Test Theory, further enhancing the robustness and generalisability of LLM performance assessment.
7.1 Future Work
Building upon the proposed statistical evaluation framework, several avenues for future research are identified.
Firstly, extending the methodology to multilingual LLMs and domain-adaptive architectures would assess its generalisability beyond the structured classification tasks considered in this study. As LLM applications expand to increasingly diverse linguistic and contextual environments, robust, variance-decomposed evaluation will be essential to maintaining performance consistency.
Secondly, integrating additional random effects, such as rater variability or dataset-level heterogeneity, may further refine the attribution of output variability and enhance the precision of model comparisons.
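As one illustration of how such terms could be added, statsmodels accepts extra variance components through the vc_formula argument of MixedLM; the rater and dataset columns below are hypothetical placeholders, and the sketch treats them as components within the prompt grouping (fully crossed designs would need a different formulation):

```python
# Sketch: extending the model with rater and dataset variance components.
# The rater and dataset columns are hypothetical placeholders.
import statsmodels.formula.api as smf

def fit_extended_lmm(df):
    return smf.mixedlm(
        "score ~ C(model) + temperature",
        df,
        groups=df["prompt_id"],   # random intercept per prompt phrasing
        re_formula="1",
        # Additional variance components; statsmodels estimates these
        # within the prompt grouping, with one shared variance per term.
        vc_formula={"rater": "0 + C(rater)",
                    "dataset": "0 + C(dataset)"},
    ).fit(reml=True)
```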
Finally, future work will explore the combination of this framework with established psychometric techniques. Such integration may offer complementary insights into LLM reliability, particularly in high-stakes scenarios where both statistical and cognitive evaluation dimensions are relevant.
ACKNOWLEDGEMENTS
ChatGPT 4o was used in all sections of this work to standardise and improve the writing in British English. This research is partially funded by the Brazilian National Council for Scientific and Technological Development (CNPq).