
longer sequences. This enhanced capability for processing large prompts may explain why Mistral outperforms the other models.
However, we emphasize that this reasoning is speculative; we have not performed targeted ablations to isolate these effects. Our interpretation is a plausible hypothesis grounded in known architectural trade-offs, not a confirmed explanation. Further controlled experiments are needed to validate whether these specific design choices cause the observed performance gap.
Our results demonstrate that LLM selection plays a crucial role in recommender system performance. This factor has been largely overlooked, as the prevailing assumption is that any LLM of a similar parameter size will perform equivalently. While much attention has focused on architecture, prompting strategies, and fine-tuning approaches, the importance of LLM selection within the same parameter class has been underestimated. With numerous options now available at each parameter size, choosing the right model can significantly impact accuracy.
5 CONCLUSION
In our comparative study, we tested five 7-8B parameter LLMs on an identical task with the same architecture and dataset to examine accuracy variations, an aspect often overlooked in benchmarks and research when selecting LLMs of similar size. Our recommender system testing revealed Hit@1 accuracy variations of nearly twenty percentage points, with Mistral outperforming the other models. Despite their comparable parameter counts, these models showed that model selection alone can meaningfully influence accuracy. Through membership inference attacks, we verified that no model benefited from having the dataset in its pre-training corpus, confirming that the observed accuracy gaps represent genuine differences in model capability. We speculated that these variations stem from architectural differences. Future work should isolate the causes by evaluating across additional datasets, quantifying robustness via repeated negative-sampling draws with 95% confidence intervals, and examining the effects of fine-tuning and scaling to larger, reasoning-focused models.
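As a rough illustration of the robustness check proposed above, the sketch below estimates Hit@1 together with an empirical 95% interval over repeated negative-sampling draws. It is a minimal sketch, not the evaluation code used in this study: the `rank_fn` callable (which would wrap the LLM-based ranker), the candidate-pool format, the 19-negative candidate set, and the draw count of 30 are assumptions made only for illustration.

```python
import random
import statistics


def hit_at_1(ranked_items, target):
    """Hit@1: 1.0 if the target item is ranked first, else 0.0."""
    return 1.0 if ranked_items and ranked_items[0] == target else 0.0


def evaluate_draw(test_cases, rank_fn, num_negatives=19, rng=None):
    """Score one negative-sampling draw: for each (target, pool) case,
    sample negatives, rank the candidate set, and record Hit@1."""
    rng = rng or random.Random()
    hits = []
    for target, candidate_pool in test_cases:
        negatives = rng.sample(
            [c for c in candidate_pool if c != target], num_negatives
        )
        ranked = rank_fn([target] + negatives)  # hypothetical LLM-based ranker
        hits.append(hit_at_1(ranked, target))
    return statistics.mean(hits)


def hit_at_1_with_ci(test_cases, rank_fn, num_draws=30, alpha=0.05, seed=0):
    """Repeat the negative-sampling draw num_draws times and report the
    mean Hit@1 with a crude empirical (1 - alpha) interval over the draws."""
    rng = random.Random(seed)
    scores = sorted(
        evaluate_draw(test_cases, rank_fn, rng=random.Random(rng.random()))
        for _ in range(num_draws)
    )
    lo = scores[int((alpha / 2) * (num_draws - 1))]
    hi = scores[int((1 - alpha / 2) * (num_draws - 1))]
    return statistics.mean(scores), (lo, hi)
```

Reporting the interval over draws, rather than a single draw, would make the nearly twenty-point Hit@1 gap directly comparable to the sampling noise it must exceed.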
ACKNOWLEDGEMENTS
This work was conducted with the financial support of
the Research Ireland Centre for Research Training in
Digitally-Enhanced Reality (d-real) under Grant No.
18/CRT/6224. For the purpose of Open Access, the
author has applied a CC BY public copyright licence
to any Author Accepted Manuscript version arising
from this submission.