
Clarivate (2025). Web of Science: Advanced search.
Accessed: February 27, 2025.
Cook, D. J., Greengold, N. L., Ellrodt, A. G., and Wein-
garten, S. R. (1997). The relation between systematic
reviews and practice guidelines. Annals of Internal
Medicine, 127(3):210–216.
Elsevier (2025). Embase: Advanced search. Accessed:
February 27, 2025.
Europe PMC (2025). Europe PMC: An archive of life sciences
literature. Accessed: February 27, 2025.
Fox, E. and Shaw, J. (1994). Combination of multiple
searches. NIST special publication SP, pages 243–
243.
Gargari, O. K., Mahmoudi, M. H., Hajisafarali, M., and
Samiee, R. (2024). Enhancing title and abstract
screening for systematic reviews with GPT-3.5 Turbo.
BMJ Evidence-Based Medicine, 29(1):69–70.
Guo, E., Gupta, M., Deng, J., Park, Y.-J., Paget, M., and
Naugler, C. (2024). Automated paper screening for
clinical reviews using large language models: Data
analysis study. Journal of Medical Internet Research,
26:e48996.
Higgins, J. P., Thomas, J., Chandler, J., Cumpston, M.,
Li, T., Page, M. J., and Welch, V. A., editors (2024).
Cochrane Handbook for Systematic Reviews of Interventions.
Cochrane, version 6.5 (updated August 2024)
edition.
Issaiy, M., Ghanaati, H., Kolahi, S., Shakiba, M., Jalali,
A. H., Zarei, D., Kazemian, S., Avanaki, M. A.,
and Firouznia, K. (2024). Methodological insights
into ChatGPT’s screening performance in systematic
reviews. BMC Medical Research Methodology,
24(1):78.
Kanoulas, E., Li, D., Azzopardi, L., and Spijker, R. (2017).
CLEF 2017 technologically assisted reviews in empirical
medicine overview. In CEUR Workshop Proceedings.
CEUR-WS.org.
Kanoulas, E., Li, D., Azzopardi, L., and Spijker, R. (2018).
CLEF 2018 technologically assisted reviews in empirical
medicine overview. In CEUR Workshop Proceedings,
volume 2125.
Kanoulas, E., Li, D., Azzopardi, L., and Spijker, R. (2019).
CLEF 2019 technology assisted reviews in empirical
medicine overview. In CEUR Workshop Proceedings,
volume 2380, page 250.
Khraisha, Q., Put, S., Kappenberg, J., Warraitch, A., and
Hadfield, K. (2024). Can large language models replace
humans in systematic reviews? Evaluating GPT-4’s
efficacy in screening and extracting data from
peer-reviewed and grey literature in multiple lan-
guages. Research Synthesis Methods.
Li, M., Sun, J., and Tan, X. (2024). Evaluating the effective-
ness of large language models in abstract screening: a
comparative analysis. Systematic Reviews, 13(1):219.
McCutcheon, A. L. (1987). Latent class analysis. Sage.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I.,
Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tet-
zlaff, J. M., Akl, E. A., Brennan, S. E., et al. (2021).
The PRISMA 2020 statement: an updated guideline for
reporting systematic reviews. BMJ, 372.
Sandner, E., Gütl, C., Jakovljevic, I., and Wagner, A.
(2024a). Screening automation in systematic reviews:
Analysis of tools and their machine learning capabili-
ties. In dHealth 2024, pages 179–185. IOS Press.
Sandner, E., Hu, B., Simiceanu, A., Fontana, L., Jakovljevic,
I., Henriques, A., Wagner, A., and Gütl, C. (2024b).
Screening automation for systematic reviews: A 5-tier
prompting approach meeting Cochrane’s sensitivity
requirement. In 2024 2nd International Conference on
Foundation and Large Language Models (FLLM),
pages 150–159. IEEE.
Shekelle, P. G., Maglione, M. A., Luoto, J., et al.
(2013). Global Health Evidence Evaluation Frame-
work. Agency for Healthcare Research and Qual-
ity (US), Rockville, MD. Table B.9, NHMRC Evi-
dence Hierarchy: designations of ‘levels of evidence’
according to type of research question (including ex-
planatory notes).
Spillias, S., Tuohy, P., Andreotta, M., Annand-Jones, R.,
Boschetti, F., Cvitanovic, C., Duggan, J., Fulton,
E. A., Karcher, D. B., Paris, C., et al. (2024). Human-AI
collaboration to identify literature for evidence synthesis.
Cell Reports Sustainability, 1(7).
Thomas, J., McDonald, S., Noel-Storr, A., Shemilt, I., El-
liott, J., Mavergames, C., and Marshall, I. J. (2021).
Machine learning reduced workload with minimal risk
of missing studies: development and evaluation of a
randomized controlled trial classifier for Cochrane reviews.
Journal of Clinical Epidemiology, 133:140–151.
Tran, V.-T., Gartlehner, G., Yaacoub, S., Boutron, I.,
Schwingshackl, L., Stadelmaier, J., Sommer, I.,
Aboulayeh, F., Afach, S., Meerpohl, J., et al. (2023).
Sensitivity, specificity and avoidable workload of us-
ing a large language models for title and abstract
screening in systematic reviews and meta-analyses.
medRxiv, pages 2023–12.
Wang, S., Scells, H., Zhuang, S., Potthast, M., Koopman,
B., and Zuccon, G. (2024). Zero-shot generative large
language models for systematic review screening au-
tomation. In European Conference on Information Re-
trieval, pages 403–420. Springer.
Wolters Kluwer (2025). Ovid: Advanced search platform.
Accessed: February 27, 2025.
Zhou, H., Hu, C., Yuan, Y., Cui, Y., Jin, Y., Chen, C., Wu,
H., Yuan, D., Jiang, L., Wu, D., et al. (2024). Large
language model (LLM) for telecommunications: A
comprehensive survey on principles, key techniques,
and opportunities. IEEE Communications Surveys &
Tutorials.
Evaluating Large Language Models for Literature Screening: A Systematic Review of Sensitivity and Workload Reduction