Authors:
Kavach Dheer, Peter Corcoran and Josephine Griffith
Affiliation:
University of Galway, Ireland
Keyword(s):
Large Language Models, Recommender Systems, Next-Item Prediction, Model Benchmarking, Data Leakage Analysis.
Abstract:
Large language models (LLMs) are rapidly being integrated into recommender systems. New LLMs are released frequently, and many architectures are available at the same parameter size within a given class, giving practitioners many options to choose from. While existing benchmarks evaluate LLM-powered recommender systems on various tasks, none have examined how same-sized LLMs perform as recommender systems under identical experimental conditions. Additionally, these benchmarks do not verify whether the evaluation datasets were part of the LLMs' pre-training data. This research evaluates five open-source 7–8B parameter models (Gemma, Deepseek, Qwen, Llama-3.1, and Mistral) using a fixed A-LLMRec architecture for next-item prediction on the Amazon Luxury-Beauty dataset. We measure top-1 accuracy (Hit@1) and evaluate dataset leakage through reference-model membership-inference attacks to ensure that no model gains an advantage from pre-training exposure. Although all models show negligible dataset leakage rates $(<0.2\%)$, Hit@1 varies by 20 percentage points, from 44\% for Gemma to 64\% for Mistral, despite identical parameter counts and evaluation conditions. These findings demonstrate that selecting the most appropriate LLM is a crucial design decision in LLM-based recommender systems.
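As an illustration of the reported metric (a minimal sketch, not the paper's evaluation code), Hit@1 is simply the fraction of test users whose top-1 recommended item matches the held-out next item; the function name and item identifiers below are hypothetical.

```python
def hit_at_1(predictions, ground_truth):
    """Fraction of users whose top-1 predicted item matches the held-out next item."""
    assert len(predictions) == len(ground_truth)
    hits = sum(1 for pred, true in zip(predictions, ground_truth) if pred == true)
    return hits / len(ground_truth)

# Hypothetical example: 3 of 4 users receive the correct next-item recommendation.
preds = ["itemA", "itemB", "itemC", "itemD"]
truth = ["itemA", "itemB", "itemX", "itemD"]
print(hit_at_1(preds, truth))  # 0.75
```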