Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction

Elias Sandner, Luca Fontana, Kavita Kothari, Andre Henriques, Igor Jakovljevic, Alice Simniceanu, Andreas Wagner, Christian Gütl

2025

Abstract

Systematic reviews provide high-quality evidence but require extensive manual screening, making them time-consuming and costly. Recent advancements in general-purpose large language models (LLMs) have shown potential for automating this process. Unlike traditional machine learning, LLMs can classify studies based on natural language instructions without task-specific training data. This systematic review examines existing approaches that apply LLMs to automate the screening phase. Models used, prompting strategies, and evaluation datasets are analyzed, and the reported performance is compared in terms of sensitivity and workload reduction. While several approaches achieve sensitivity above 95%, none consistently reach the 99% threshold required for replacing human screening. The most effective models use ensemble strategies, calibration techniques, or advanced prompting rather than relying solely on the latest LLMs. However, generalizability remains uncertain due to dataset limitations and the absence of standardized benchmarking. Key challenges in optimizing sensitivity are discussed, and the need for a comprehensive benchmark to enable direct comparison is emphasized. This review provides an overview of LLM-based screening automation, identifying gaps and outlining future directions for improving reliability and applicability in evidence synthesis.
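To make the screening setup concrete, below is a minimal sketch, not code from the paper: it assumes a generic call_llm(prompt) helper standing in for whatever chat-completion API is used, plus an invented inclusion criterion. It illustrates the two ideas in the abstract: zero-shot classification from natural language instructions, and how sensitivity and workload reduction would be computed from the resulting decisions.

# Illustrative sketch of LLM-based title/abstract screening.
# call_llm, CRITERION, and the record labels are placeholders, not
# the method evaluated in any of the reviewed studies.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire up an LLM provider here")

CRITERION = "Include studies evaluating LLMs for systematic-review screening."

def screen(abstract: str) -> bool:
    # Zero-shot classification: the instruction replaces task-specific
    # training data used by traditional machine-learning screeners.
    prompt = (
        f"Inclusion criterion: {CRITERION}\n"
        f"Abstract: {abstract}\n"
        "Answer with exactly INCLUDE or EXCLUDE."
    )
    return call_llm(prompt).strip().upper().startswith("INCLUDE")

def evaluate(predictions: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); workload reduction = the share of
    records the model excludes, i.e. abstracts a human need not read."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fn = sum((not p) and l for p, l in zip(predictions, labels))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    workload_reduction = predictions.count(False) / len(predictions)
    return sensitivity, workload_reduction

The trade-off between the two metrics is the review's central tension: excluding more records increases workload reduction but risks false negatives, which is why the 99% sensitivity threshold for replacing human screening is hard to reach.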

Paper Citation


in Harvard Style

Sandner E., Fontana L., Kothari K., Henriques A., Jakovljevic I., Simniceanu A., Wagner A. and Gütl C. (2025). Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction. In Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA; ISBN 978-989-758-758-0, SciTePress, pages 508-517. DOI: 10.5220/0013562900003967


in BibTeX Style

@conference{data25,
author={Elias Sandner and Luca Fontana and Kavita Kothari and Andre Henriques and Igor Jakovljevic and Alice Simniceanu and Andreas Wagner and Christian Gütl},
title={Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction},
booktitle={Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2025},
pages={508-517},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013562900003967},
isbn={978-989-758-758-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 14th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction
SN - 978-989-758-758-0
AU - Sandner E.
AU - Fontana L.
AU - Kothari K.
AU - Henriques A.
AU - Jakovljevic I.
AU - Simniceanu A.
AU - Wagner A.
AU - Gütl C.
PY - 2025
SP - 508
EP - 517
DO - 10.5220/0013562900003967
PB - SciTePress
ER -