ALISE: An Automated Literature Screening Engine for Research
Hendrik Roth¹ (https://orcid.org/0009-0007-2602-9679) and Carsten Lanquillon² (https://orcid.org/0000-0002-9319-1437)
¹ Artificial Intelligence (M.Sc.), Johannes Kepler University Linz, Austria
² Business Information Systems, Heilbronn University of Applied Sciences, Heilbronn, Germany
Keywords:
Literature Review, Screening, Automation.
Abstract:
The screening process is the most time-consuming part of a literature review. Automating it saves substantial time and makes it easier for researchers to review the literature. Most current approaches do not consider the full text for screening, which can cause relevant papers to be excluded. The Automated LIterature Screening Engine (ALISE) performs full-text screening of the papers retrieved by the literature search based on a given research question. With an average nWSS of 61.87% and a median nWSS of 74.38%, ALISE can save time for reviewers but cannot be used without subsequent human screening. Furthermore, ALISE is sensitive to the given research question(s).
1 INTRODUCTION
A literature review is a widely used research method
intended to provide an overview of previous research,
identify new research opportunities, or draw new con-
clusions from previously unrecognised correlations
(Rowe, 2014; Okoli, 2015). However, conducting a
literature review with an increasing amount of litera-
ture is impractical due to the time-consuming search
and screening process (van Dinter et al., 2021). This
is primarily because extensive screening is required
(van Dinter et al., 2021). Researchers initially retrieve
many publications, e.g., from a keyword search of-
ten numbering in the hundreds or thousands, making
thorough review impractical (Kitchenham and Char-
ters, 2007). Hence, researchers typically rely on ti-
tles and abstracts for preliminary screening, adapted
from established review frameworks (Kitchenham
and Charters, 2007; Page et al., 2021). While this
title and abstract screening saves time, it comes with
limitations. The shortness of titles and abstracts can
lead to the omission of relevant publications and thus to the exclusion of papers that address the researchers' research questions (Blake, 2010; Penning de Vries et al.,
2020; Wang et al., 2020). This problem is reduced
by full-text screening (Penning de Vries et al., 2020).
However, full-text screening becomes more difficult as the literature volume continues to grow. In response, researchers have explored automation to aid literature
reviews, employing machine learning algorithms for
screening and categorization (Noroozi et al., 2023;
van Dinter et al., 2021). Many automated methods
still hinge on title and abstract screening (van Din-
ter et al., 2021), perpetuating the risk of overlooking
relevant literature. Large language models (LLMs)
can effectively comprehend and respond to text-based
queries, even rivaling human performance in some
tasks (Ouyang et al., 2022; Liu et al., 2023). This
makes them suitable for automating the screening of
full-text papers. In particular, chaining LLM calls can achieve better results on a downstream task than using only a single standard prompt (Yu et al., 2023;
Haji et al., 2023). Despite the possible benefits, there
are currently no studies on applying LLM chains to
automated literature reviews. For this reason, this
paper addresses this gap, aiming to develop an auto-
mated full-text literature screening engine based on a
given research question while following established
literature review protocol guidelines like (Kitchen-
ham and Charters, 2007). To reach this goal, the pa-
per seeks to answer the following research question:
How can the full-text screening process of a literature
review be automated using an LLM chain?
2 RELATED WORK
There are several studies on automated screening pro-
cesses for literature reviews. (van Dinter et al., 2021)
provide an overview of automated literature review
approaches, identifying various studies that focus on title and abstract screening. Yet, only a few ap-
proaches focus on full-text screening. This is be-
cause several challenges come with screening over
full-texts, e.g., PDF files have to be converted to ac-
cessible text (Cohen et al., 2010). However, as also
stated by (Portenoy and West, 2020), it is questionable whether the papers returned by these methods are actually relevant or only show strong topic similarities. When only keywords or topics are considered to identify relevant studies, full-text screening seems to perform worse than screening only the abstract and title (Dieste and Padua, 2007). Nevertheless, when defin-
ing a relevant paper for a literature review as a pa-
per that addresses a research question (Templier and Paré, 2015), this conclusion cannot be made because,
logically, a research question of a reviewer may not
necessarily be answered directly by the abstract or ti-
tle, but instead by paragraphs or sentences of a paper
(Blake, 2010; Penning de Vries et al., 2020). Hence,
(La Quatra et al., 2021) use a text summarizer and
correlation calculations to classify if a cited paper
contains relevant information in its full text. (Wilson et al., 2023) compare the effectiveness of regular expression matching and a machine learning classifier trained specifically on human screening categorizations for automated full-text screening. By employing language models as phrase
embeddings, (Alchokr et al., 2022) suggested a differ-
ent method that involved weighting and clustering the
literature according to its relevance. Although the authors' approach assesses only titles and abstracts, they recognized the potential relevance of
this method to full-text analysis, highlighting the need
for more research in this field. In a different study,
(Noroozi et al., 2023) iteratively classified relevant
and irrelevant literature during the systematic search
process using a random forest classifier based on full-
text feature similarity. The goal of this iterative clas-
sification strategy was to enhance the accuracy of the
screening process and improve the selection of per-
tinent publications. To date, no study has used LLMs to automate the screening process while respecting the full text of a paper and a given research question.
3 BACKGROUND
3.1 Conducting a Literature Review
There are several common literature review method-
ologies for various domains. For the information sys-
tems domain, the methodology proposed by (Brocke
et al., 2009) is a frequently used methodological
framework, whereas the framework of (Kitchenham
and Charters, 2007) is often utilized in the software
engineering domain. PRISMA by (Page et al., 2021)
is often applied in the biomedical domain, and the
methodology by (Snyder, 2019) is common for busi-
ness research. However, they all consist of essentially the same general steps; the differences are mainly references to domain-specific journals, quality assessments, or more detailed descriptions of some steps (Templier and Paré, 2015). Therefore, (Templier and Paré, 2015) as well as (Okoli, 2015) modeled general
steps for a literature review based on these common
methodology frameworks. The only difference be-
tween (Okoli, 2015) and (Templier and Paré, 2015)
is that they switch the general steps 5 and 6 and, fur-
thermore, split the screening process into two steps
(an initial title and abstract screening, which is fol-
lowed by full-text screening). The last step of (Okoli,
2015) can be ignored because it is about writing the
review, not conducting the review. Hence, there are 6
general steps for conducting a literature review based
on (Okoli, 2015) and (Templier and Paré, 2015). The
steps are iterative and can lead to refinement of the
previous steps (Brocke et al., 2009; Templier and
Paré, 2015). Figure 1 visualises these six general
steps.
Figure 1: General literature review methodology based on
(Templier and Paré, 2015) and (Okoli, 2015).
Step 1 - The first step consists of defining the problem, including the research question(s) (Okoli, 2015). (Kitchenham and Charters, 2007) noted that each literature review must have a research question to guide the review. Therefore, this step also includes the definition of the general conditions based on the research question(s) and problem conception, such as the definition of the search terms (Brocke et al., 2009).
Step 2 - After the conceptualisation of the prob-
lem and research questions, a search is performed
with the defined search terms and filter criteria with
the goal of obtaining a literature collection from various literature databases (Templier and Paré, 2015).
Step 3 - When papers have been retrieved by the database search, the literature has to be checked for relevance, also called screening, where the goal is to find papers that help answer the defined research question (Templier and Paré, 2015). The screening process typically consists of an initial abstract/title screening to shorten the large volume of retrieved papers, followed by a full-text screening of the reduced paper corpus (Okoli, 2015; Templier and Paré, 2015).
Step 4 - The quality of each relevant paper must be assessed (Brocke et al., 2009; Templier and Paré, 2015; Okoli, 2015). Even if a paper is relevant, it can be of low quality and hence be rejected due to quality standards (Okoli, 2015). There are several techniques to assess the quality of a paper (Templier and Paré, 2015).
Step 5 - With the completion of step 4, the reviewers now have a literature corpus for data extraction related to their research question(s), which then represents the actual findings of the review (Templier and Paré, 2015; Okoli, 2015). The extracted data depends on the study and research question, which in turn determines the extraction method (Templier and Paré, 2015).
Step 6 - The last step is to analyze and synthe-
size the extracted data (Okoli, 2015; Templier and
Paré, 2015). Typical methods are a concept matrix
by (Webster and Watson, 2002) or a table/forest plot
as indicated by (Kitchenham and Charters, 2007).
3.2 Explainability
As (Kitchenham and Charters, 2007) and PRISMA
by (Page et al., 2021) stated in their methodology, the
point of the literature review protocol is to record ev-
erything in such a way that it is comprehensible and
explainable. For this reason, notes should also be
made on relevant papers while screening (Kitchen-
ham and Charters, 2007). By taking notes, re-
searchers can keep track of their thought processes,
criteria, and justifications for including or excluding
specific papers (Okoli, 2015). Most automated meth-
ods do not indicate why a paper is relevant, but just
return a corpus labelled as relevant without justifica-
tion (Portenoy and West, 2020). Thus, ALISE must be able to explain why a paper is relevant, just as is done in manual screening.
4 APPROACH
4.1 Problem Definition
As ALISE aims to be integrated into commonly used literature review methodology processes, the screening process can be described similarly to the general literature review methodology by (Okoli, 2015) or (Templier and Paré, 2015) and thus covers the
methodology processes of (Kitchenham and Char-
ters, 2007), (Brocke et al., 2009), (Snyder, 2019),
and (Page et al., 2021). Given our study’s focus
on full-text screening, we omit the initial abstract
and title screening step. Furthermore, this task can be seen as a classification task that labels each paper as relevant or not relevant (Olorisade et al., 2019), which the following definitions reflect. Typi-
cally, the screening process involves the application
of inclusion and exclusion criteria (Templier and Paré,
2015). Exclusion criteria, used to apply automatic fil-
ters (e.g., language, article type, date), can be applied
during the initial literature database search (Brocke
et al., 2009). Quality-related exclusion criteria are
assessed during the quality assessment step follow-
ing the screening process (Kitchenham and Charters,
2007). For this reason, the inclusion criteria consid-
ered are only from the content perspective as men-
tioned by (Okoli, 2015), which is to review if the pa-
per addresses the specific research question(s) (Tem-
plier and Paré, 2015). With this context, we describe
the screening process as follows:
Let RQ be the given research question. Given an initial set P = {p_1, p_2, ..., p_n} of n papers p_i and RQ, retrieved from an initial search (keyword search, snowballing, etc.), the screening process in a literature review involves checking if a paper addresses the research question and documenting the reasons why a paper is considered relevant. Hereby, for all p_i ∈ P, the relevance labelling function is defined as follows:

\[ f(p_i) = \begin{cases} 1, & \text{if } p_i \text{ addresses } RQ \\ 0, & \text{otherwise} \end{cases} \tag{1} \]
Thus, the relevance labels l_i ∈ {0, 1} resulting from f are used for each paper p_i ∈ P. Additionally, there is a need for the review protocol to capture the reasons why each paper is considered relevant (Kitchenham and Charters, 2007). This is defined as a set R consisting of paper-specific reasons r_i for all p_i ∈ P. Hence, for each paper p_i that addresses the research question, t paper-specific reasons r_i = {reason_1, reason_2, ..., reason_t} are assigned to explain its relevance.
To conclude, the screening process involves evaluating each paper p_i and documenting the corresponding relevance reasons r_i if it is relevant. Hence, the results of the screening process are two sets S = {p_i | l_i = 1, p_i ∈ P} and R = {(p_i, r_i) | l_i = 1, r_i explains the relevance of p_i}. The set S represents the subset of papers from P that are relevant, hence which address the given research question RQ (label l is equal to 1). The set R contains pairs of papers p_i and their corresponding relevance reasons r_i. Consequently, during the automated screening process, each paper p_i of P is examined, and if it is found to address the research question RQ, l is set to 1 and a relevance reason r_i is documented in R. By selecting papers based on the value of f and documenting the relevance reasons in R, the review protocol ensures transparency and provides a record of the justification behind the inclusion of each relevant paper that addresses the research question in the literature review. If there are multiple research questions, this procedure is performed for each research question. As output, S can be used for the quality assessment, which is the next step in the general literature review methodology (Templier and Paré, 2015).
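To make this formalisation concrete, the following sketch outlines the input/output contract of the screening step in Python. The Paper class, the screen function, and the addresses_rq predicate (which plays the role of f and, in ALISE, is realised by the LLM chain described in Section 4.2) are illustrative names only, not ALISE's actual code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Paper:
    identifier: str
    full_text: str

def screen(
    papers: list[Paper],
    rq: str,
    addresses_rq: Callable[[Paper, str], tuple[int, list[str]]],
) -> tuple[list[Paper], dict[str, list[str]]]:
    """Return the relevant subset S and the reason record R for one research question.

    addresses_rq(paper, rq) stands in for the labelling function f: it returns a
    (label, reasons) pair, where label is 1 if and only if the paper addresses rq.
    """
    S: list[Paper] = []            # relevant papers (l_i = 1)
    R: dict[str, list[str]] = {}   # paper identifier -> relevance reasons r_i
    for paper in papers:
        label, reasons = addresses_rq(paper, rq)
        if label == 1:
            S.append(paper)
            R[paper.identifier] = reasons
    return S, R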
4.2 Technical Details
To assess whether a paper addresses a given RQ, we
utilize an LLM chain as described by (Wu et al.,
2022), since it has the potential to outperform vari-
ous classical retriever-reader architectures (Yu et al.,
2023). Our chain is inspired by the generate-read
chain of (Yu et al., 2023) and the multi-hop QA chain
of (Haji et al., 2023) using Flan-t5-XL due to hard-
ware limitations. Thus, the LLM chain with manual
prompt templates first generates the evidence E based on the chunks C, which serves as an answer to RQ, and then generates the final answer A using E as context. Here, the chunks C were created by a straightforward approach. The template length was subtracted from the maximum input token length of 512 to determine the chunk size l_chunk = 512 - l_template.
For each sentence, it was checked whether adding the
sentence to the current chunk would exceed the token
limit in order to avoid truncated sentences. Chunk-
ing by logical sections of the paper also seemed in-
tuitive, but handling long sections was challenging and somewhat arbitrary in some cases, so we chose the simpler and more straightforward approach of cut-
ting right before reaching the token limit. The us-
age of the evidence-answer chain also enables simultaneously obtaining the reason r_i for a paper when conducting MRC, because it generates both the evidence for the answer and the answer itself as a reason. This also determines the labelling function: if the RQ is not answerable by a paper p_i, the LLM chain returns unanswerable. If the answer is not unanswerable, l = f(p_i) = 1; otherwise, l = f(p_i) = 0. When l = 1, the reason r_i can be returned by referring to the extracted pieces of evidence related to that question.
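A minimal sketch of this evidence-answer labelling is given below, using the transformers pipeline directly rather than LangChain for brevity. The prompt wordings, the unanswerable check, and the aggregation of per-chunk results into a single label are assumptions of this sketch, not ALISE's exact behaviour; the actual templates are shown in Figure 3 in the appendix.

from transformers import pipeline

# Flan-t5-XL, the model used by ALISE; any text2text model fits this sketch.
llm = pipeline("text2text-generation", model="google/flan-t5-xl")

# Illustrative templates, not ALISE's exact prompt wording (see Figure 3).
EVIDENCE_PROMPT = (
    "Extract the part of the context that answers the question. "
    "If the question cannot be answered, reply 'unanswerable'.\n"
    "Question: {rq}\nContext: {chunk}\nEvidence:"
)
ANSWER_PROMPT = (
    "Answer the question using only the evidence. "
    "If the evidence does not answer it, reply 'unanswerable'.\n"
    "Question: {rq}\nEvidence: {evidence}\nAnswer:"
)

def label_paper(chunks: list[str], rq: str) -> tuple[int, list[str]]:
    """Return (l_i, r_i): the relevance label and the collected evidence as reasons."""
    reasons = []
    for chunk in chunks:
        evidence = llm(EVIDENCE_PROMPT.format(rq=rq, chunk=chunk))[0]["generated_text"]
        if "unanswerable" in evidence.lower():
            continue
        answer = llm(ANSWER_PROMPT.format(rq=rq, evidence=evidence))[0]["generated_text"]
        if "unanswerable" not in answer.lower():
            reasons.append(f"{evidence} -> {answer}")
    # l = 1 if at least one chunk yields an answer (this aggregation is an assumption).
    return (1, reasons) if reasons else (0, [])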
However, most retrieved papers P from the search are
in PDF format (van Dinter et al., 2021), necessitating
a PDF-to-text conversion before being used as textual
input for the evidence-answer chain. This is a chal-
lenge due to the diverse layouts of scientific texts, in-
cluding multiple columns, different headers and foot-
ers, variable abstract positions, and figures and tables
affecting text flow (Bast and Korzen, 2017). Address-
ing these issues, (Tauchert et al., 2020) employed op-
tical character recognition (OCR) to convert scientific
PDFs into plain text format. While they used OCR-
tesseract, better libraries have emerged, with Grobid
being a notable choice as evaluated by (Miah et al.,
2022). Grobid is also utilized by the Semantic Scholar
Open Research Corpus (Lo et al., 2020), offering both
effectiveness and scalability for handling large vol-
umes of scientific papers. For this reason, we chose
the s2orc json converter of the Semantic Scholar Open
Research Corpus (Lo et al., 2020). Figure 2 visualises
the flow of ALISE.
Figure 2: Implementation of ALISE.
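The preprocessing described in this section, converting the PDF to text and chunking it sentence-wise to l_chunk = 512 - l_template tokens, can be sketched as follows. The code assumes the converter output follows the S2ORC JSON schema (a pdf_parse object with a body_text list of paragraphs) and uses a naive sentence split; both are simplifications, and the helper names are illustrative.

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

def load_full_text(s2orc_json_path: str) -> str:
    """Concatenate the body paragraphs of an s2orc-doc2json output file."""
    with open(s2orc_json_path) as fh:
        doc = json.load(fh)
    paragraphs = doc["pdf_parse"]["body_text"]  # assumed S2ORC schema
    return " ".join(p["text"] for p in paragraphs)

def chunk_text(text: str, template: str, max_input_tokens: int = 512) -> list[str]:
    """Sentence-wise chunking with l_chunk = max_input_tokens - l_template."""
    l_template = len(tokenizer.encode(template))
    l_chunk = max_input_tokens - l_template
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in text.split(". "):  # naive sentence splitter (simplification)
        n_tokens = len(tokenizer.encode(sentence))
        if current and current_len + n_tokens > l_chunk:
            chunks.append(". ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(". ".join(current))
    return chunks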
5 EVALUATION
5.1 Metrics
ALISE's goal is to assist scientists in the screening
process and reduce the time and effort of reviewers.
To evaluate its performance, we follow the precedent
set by other automated screening approaches and evaluate against human performance, like (Cohen et al., 2006; Kusa et al., 2023). We use standard NLP metrics based
on the confusion matrix: true positives (TP) for correctly classified relevant papers, false positives (FP) for pa-
pers ALISE incorrectly labels as relevant, true neg-
atives (TN) for correctly classified irrelevant papers,
and false negatives (FN) for papers ALISE misses.
The evaluation metrics equations are provided below:
\[ \text{Accuracy (Acc)} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \]

\[ \text{Precision (Pr)} = \frac{TP}{TP + FP} \tag{3} \]

\[ \text{Recall (Re)} = \frac{TP}{TP + FN} \tag{4} \]

\[ F1 = 2 \times \frac{Pr \times Re}{Pr + Re} \tag{5} \]

\[ WSS = \frac{TN + FN}{N} - (1.0 - Re) \tag{6} \]
The WSS metric measures the work saved over
sampling (Cohen et al., 2006). It represents the ra-
tio of articles initially identified through a literature
search that researchers can skip reading because they
have already been screened out by ALISE.
\[ nWSS = \frac{TN}{TN + FP} \tag{7} \]
The nWSS metric by (Kusa et al., 2023) is a normalized version of WSS that enables better comparisons between different literature reviews, so that the same reviews do not have to be evaluated as a baseline. Furthermore,
the nWSS is equal to the true negative rate (Kusa
et al., 2023).
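For reference, these metrics can be computed directly from the confusion-matrix counts. The sketch below simply follows Equations (2) to (7); the function name is chosen here for illustration.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Evaluation metrics of Equations (2)-(7) from confusion-matrix counts."""
    n = tp + fp + fn + tn
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return {
        "Acc": (tp + tn) / n,
        "Pr": pr,
        "Re": re,
        "F1": 2 * pr * re / (pr + re),
        "WSS": (tn + fn) / n - (1.0 - re),  # work saved over sampling (Cohen et al., 2006)
        "nWSS": tn / (tn + fp),             # normalised WSS = true negative rate (Kusa et al., 2023)
    }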
5.2 Dataset
The common dataset of (Cohen et al., 2006) for eval-
uating automated screening was not used due to its
limitation on titles and abstracts, whereas ALISE uses
full-texts. In the evaluation of automated screening
approaches, researchers often contend with the chal-
lenge of manual annotation. Some studies evaluate
these approaches based on a single literature review
(Noroozi et al., 2023), while others consider two dif-
ferent reviews (Alchokr et al., 2022). Due to the labor-intensive nature of manual annotation, our selection of three literature reviews from random searches on ACM, IEEE, and SpringerLink followed specific criteria. We considered reviews that were peer-
reviewed, reproducible (yielding consistent search re-
sults with the provided searches), accessible (in lit-
erature databases to which we had access), well-
documented (with relevant papers clearly listed, such
as in a table), and comprehensible (with well-defined
inclusion and exclusion criteria to minimize FP dur-
ing manual annotation). The following reviews met
our specified criteria, while many others were unsuit-
able due to factors such as irreproducibility, inaccessi-
ble databases, or the impracticality of manually down-
loading thousands of papers. Consequently, our eval-
uation baseline comprises three literature reviews: lit-
erature review 1 (LR1) (Jakob, 2022), literature re-
view 2 (LR2) (da Silva Junior et al., 2022), and lit-
erature review 3 (LR3) (Omran and Treude, 2017).
Table 1 provides an overview of the evaluated litera-
ture reviews regarding the number of papers screened
(n) and how many papers are actually relevant.
Table 1: Overview of LRs used for evaluation.
Literature Review n relevant
LR1 (Jakob, 2022) 101 60
LR2 (da Silva Junior et al., 2022) 262 6
LR3 (Omran and Treude, 2017) 232 33
5.3 Setup
Manually downloading all papers from the three selected literature reviews was necessary since there is no open API access available for obtaining full-text content from SpringerLink, IEEE, and the ACM Digital Library, or automating this task would have taken longer than downloading manually. To save time, and considering
that automated downloading was solely for evaluation
purposes and not part of the screening process, we
opted for manual downloads. The automated screen-
ing process ran on an NVIDIA Tesla T4, with no mod-
ifications to the quantization of Flan-t5-XL. Where
the literature reviews had multiple research questions,
one search was performed for each research question,
and duplicate results were removed. Each iteration
took approximately 45 minutes to two hours, result-
ing in a total evaluation time ranging from 2.5 to 6
hours, depending on the number of papers evaluated.
5.4 Results
This section presents the evaluation results for each
literature review.

Table 2: Confusion matrix of all literature reviews.
LR      TP   FP   FN   TN
LR1     57   17    1   23
LR2      6   61    0  192
LR2*    10   57    0  192
LR3     33  197    0    2
LR3*    21   54   12  145
LR3**   53   22   12  145

In the evaluation of the literature review by (Jakob, 2022) (LR1), out of an initial popu-
lation of 101 papers, 98 were evaluated due to limited
full-text access. ALISE achieved 57 TPs, 17 FPs, 1
FNs, and 23 TNs. See table 2 for the confusion ma-
trix values. For the literature review by (da Silva Ju-
nior et al., 2022) (LR2), which initially screened on titles and abstracts, 67 papers were classified as relevant by ALISE. Subsequently, two independent reviewers from the related domain manually reviewed the FPs of the first evaluation of LR2, leading to 10 TPs
and 57 FPs. The final confusion matrix values are
listed in table 2 as LR2*. The third literature review
by (Omran and Treude, 2017) (LR3), initially evalu-
ated with the same research questions, has a gold stan-
dard of 33 relevant papers. However, ALISE classi-
fied 230 papers as relevant out of 232. An error analy-
sis revealed several issues causing this misclassifica-
tion, including sensitivity to certain keywords. Because the original review searched with the keyword "natural language" in several major high-ranked software engineering conferences, all retrieved papers mention natural language, and 226 out of 232 additionally mention "process", causing ALISE to classify nearly all papers as relevant to the first RQ of (Omran and Treude, 2017). Whereas RQs two and three of LR3 result in fewer relevant papers, we identified that Flan-t5-XL also classified NLP algorithms like latent Dirichlet allocation as an NLP library, even though it is not a library used for implementation but an algorithm. The fourth question "If so, how was the choice justified?" makes no
sense when iterating over each question because it is
related to the third question as a follow-up. However, this completely failed evaluation yields two valuable conclusions: follow-up research questions currently cannot be handled when iterating over the questions, and Flan-t5-XL requires more input than just buzzwords, e.g., an example of what an NLP library is or what is covered by natural language processing. The evaluation of LR3 was repeated with
a new research question, yielding 21 TPs and 54 FPs.
The confusion matrix for this evaluation is in table
2 as LR3*. After reevaluation regarding the new re-
search question, we encountered the same issue of
FPs as in the evaluation of LR2. Two independent
reviewers, both research engineers in NLP, followed
the same procedure as in LR2: conducting individ-
ual assessments followed by a final comparison and
discussion. Of the initial 54 FPs, the reviewers iden-
tified 32 as genuinely relevant due to their mention of
NLP libraries used in research implementation. This
significant disparity in the number of relevant papers
not identified by (Omran and Treude, 2017) can be
explained by their screening strategy. This evalua-
tion is referred to as LR3** based on the indepen-
dent reviewers' annotations.

Table 3: Evaluation results of ALISE.
        Acc     Re      Pr      F1     WSS    nWSS
LR1    81.63   98.28   77.03   86.36   22.77   57.50
LR2    76.45  100.00    8.96   16.44   74.13   75.89
LR2*   78.00  100.00   14.93   25.97   74.13   77.11
LR3    15.09  100.00   14.35   25.10    0.87    1.05
LR3*   71.55   63.64   28.00   38.89   31.31   72.86
LR3**  85.34   81.54   70.67   75.71   49.21   86.83

Based on these confusion matrices, table 3 provides a summary of evalu-
ation metrics for all literature reviews. The accuracy
ranges from 15.09% to 85.34%, with perfect recalls
for LR2, LR2*, and LR3. Precision varies between
8.96% (LR2) and 77.03% (LR1). The F1 score ranges
from 16.44% to 86.36%. WSS metrics vary widely,
with some in the intermediate range, while the nWSS has only one outlier, LR3.
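As a consistency check, the LR1 row of table 3 follows directly from its confusion matrix in table 2, with N = 98 evaluated papers:

\[ Re = \frac{57}{57 + 1} = 98.28\%, \quad nWSS = \frac{23}{23 + 17} = 57.50\%, \quad WSS = \frac{23 + 1}{98} - (1 - 0.9828) = 22.77\%. \]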
5.5 Result Analysis
(Cohen et al., 2006) noted that the goal of automated screening tools should be to reach at least 95.00% recall compared to the human baseline with a WSS as high as possible. (Kusa et al., 2023) adopted this goal for the nWSS as well. The evaluation metrics (table 3) show that ALISE can surpass this goal, reaching 98.28% and 100.00% recall for LR1 and LR2. ALISE also reached 100.00% recall for the first evaluation of LR3. Yet, this must be taken with caution because nearly every paper was classified as relevant for LR3 (see table 3). This is also reflected in the low WSS and nWSS of 0.87% and 1.05%, indicating that hardly any manual screening work was saved. In contrast, the nWSS exceeds 72.86% for the majority of literature reviews, which is a strong indication that, in general, ALISE is capable of saving human reviewers a lot of time.
Nevertheless, ALISE cannot be used without manual human evaluation after classification because some FPs occurred in each evaluated literature review; otherwise, the nWSS would have been a perfect 100.00%. Furthermore, the outlier of LR3 and the subsequent evaluation LR3** show the sensitivity to the research question used as input, since there is an improvement of 85.78 percentage points in the nWSS score. Consequently, the results validate the use of ALISE for automated full-text screening, albeit with some limitations. The nWSS also makes it possible
to compare these results with other automated tools. In this study, ALISE has an average nWSS score of 61.87%, which is better than the average nWSS of the best method evaluated by (Kusa et al., 2023), 57.21%. Without evaluation LR3, which is an outlier due to research questions ill-suited as input for the LLM, the average is even 74.04% nWSS, which clearly outperforms the best method reported in (Kusa et al., 2023). Except for model E evaluated by (Kusa et al., 2023), with an average nWSS of 55.50%, all other five evaluated models are below an average of 41.41%. How-
ever, a user of ALISE may likewise not initially define the research question(s) as well as the model requires and would then have to adjust them iteratively, so the average without the outlier should be viewed with caution. For this reason, the average including the outlier and the median of 74.38% are more meaningful. The median of the best model D evaluated by (Kusa et al., 2023) is 60.9%. This provides further strong evidence that ALISE can be used as an automated screening method when its limitations are taken into account.
6 CONCLUSION
ALISE can be used as an automated screening tool
over full-texts for literature reviews. Not only does ALISE classify relevant literature as relevant, it also provides reasons for the review protocol as to why a paper is
relevant. An evaluation of three different literature
reviews was conducted to measure the performance
of ALISE. The highest nWSS score is 86.83%, indi-
cating a large time saving for the reviewers after the
literature search. With an average of 61.87% nWSS
considering all evaluated literature reviews, and a me-
dian of 74.38% nWSS, ALISE can save a lot of time but cannot be used without a subsequent human screening iteration over the literature it classifies as relevant. However, there are some limitations when
using ALISE regarding RQ sensitivity and hardware.
LIMITATIONS
ALISE shares LLM limitations, making it sensitive
to the RQ and the chain prompts. In addition, fast
inference requires a GPU, which makes it costly. Fur-
thermore, the PDF conversion may introduce errors,
potentially affecting the results. Two of the three LRs
evaluated initially screened titles and abstracts before
full-text, introducing the possibility of FNs not identi-
fied by either reviewers or ALISE. Utilizing LRs with
full-text screening from the start could have mitigated
this issue. An inadequately defined research question
can lead to suboptimal results, negating the time sav-
ings and potentially requiring significant refinement.
Even if subsequent iterations were error-free, the cumulative computation time may exceed that of manual screening.
REFERENCES
Alchokr, R., Borkar, M., Thotadarya, S., Saake, G., and
Leich, T. (2022). Supporting systematic literature re-
views using deep-learning-based language models. In
Proceedings of the 1st International Workshop on Nat-
ural Language-Based Software Engineering, NLBSE
’22, page 67–74, New York, NY, USA. Association
for Computing Machinery.
Bast, H. and Korzen, C. (2017). A benchmark and evalua-
tion for text extraction from pdf. In 2017 ACM/IEEE
Joint Conference on Digital Libraries (JCDL), pages
1–10.
Blake, C. (2010). Beyond genes, proteins, and abstracts:
Identifying scientific claims from full-text biomed-
ical articles. Journal of Biomedical Informatics,
43(2):173–189.
Brocke, J. v., Simons, A., Niehaves, B., Riemer, K., Plat-
tfaut, R., and Cleven, A. (2009). Reconstructing the
giant: On the importance of rigour in documenting the
literature search process. In European Conference on
Information Systems.
Cohen, A. M., Hersh, W. R., Peterson, K., and Yen, P.-
Y. (2006). Reducing workload in systematic review
preparation using automated citation classification.
Journal of the American Medical Informatics Associ-
ation, 13(2):206–219.
Cohen, K. B., Johnson, H. L., Verspoor, K., Roeder, C.,
and Hunter, L. E. (2010). The structural and content
aspects of abstracts versus bodies of full text journal
articles are different. BMC bioinformatics, 11(1):492.
da Silva Junior, B. A., Silva, J., Cavalheiro, S., and Foss, L.
(2022). Pattern recognition in computing education:
A systematic review. In Anais do XXXIII Simpósio Brasileiro de Informática na Educação, pages 232–243, Porto Alegre, RS, Brasil. SBC.
Dieste, O. and Padua, A. G. (2007). Developing search
strategies for detecting relevant experiments for sys-
tematic reviews. In First International Symposium
on Empirical Software Engineering and Measurement
(ESEM 2007), pages 215–224.
Haji, S., Suekane, K., Sano, H., and Takagi, T. (2023).
Exploratory inference chain: Exploratorily chaining
multi-hop inferences with large language models for
question-answering. In 2023 IEEE 17th International
Conference on Semantic Computing (ICSC), pages
175–182.
Jakob, D. (2022). Voice controlled devices and older adults
– a systematic literature review. In Gao, Q. and Zhou,
J., editors, Human Aspects of IT for the Aged Pop-
ulation. Design, Interaction and Technology Accep-
tance, pages 175–200, Cham. Springer International
Publishing.
Kitchenham, B. A. and Charters, S. (2007). Guidelines
for performing systematic literature reviews in soft-
ware engineering. Technical Report EBSE-2007-01,
School of Computer Science and Mathematics, Keele
University.
Kusa, W., Lipani, A., Knoth, P., and Hanbury, A. (2023).
An analysis of work saved over sampling in the eval-
uation of automated citation screening in systematic
literature reviews. Intelligent Systems with Applica-
tions, 18:200193.
La Quatra, M., Cagliero, L., and Baralis, E. (2021). Lever-
aging full-text article exploration for citation analysis.
Scientometrics, 126(10):8275–8293.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Comput. Surv., 55(9).
Lo, K., Wang, L. L., Neumann, M., Kinney, R., and Weld,
D. (2020). S2ORC: The semantic scholar open re-
search corpus. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 4969–4983, Online. Association for
Computational Linguistics.
Miah, M. S. U., Sulaiman, J., Sarwar, T. B., Naseer, A.,
Ashraf, F., Zamli, K. Z., and Jose, R. (2022). Sentence
boundary extraction from scientific literature of elec-
tric double layer capacitor domain: Tools and tech-
niques. Applied Sciences, 12(3).
Noroozi, M., Moghaddam, H. R., Shah, A., Charkhgard,
H., Sarkar, S., Das, T. K., and Pohland, T. (2023). An
ai-assisted systematic literature review of the impact
of vehicle automation on energy consumption. IEEE
Transactions on Intelligent Vehicles, pages 1–22.
Okoli, C. (2015). A guide to conducting a standalone sys-
tematic literature review. Commun. Assoc. Inf. Syst.,
37:43.
Olorisade, B. K., Brereton, P., and Andras, P. (2019). The
use of bibliography enriched features for automatic ci-
tation screening. Journal of Biomedical Informatics,
94:103202.
Omran, F. N. A. A. and Treude, C. (2017). Choosing an
nlp library for analyzing software documentation: A
systematic literature review and a series of experi-
ments. In Proceedings of the 14th International Con-
ference on Mining Software Repositories, MSR ’17,
page 187–197. IEEE Press.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K.,
Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
Simens, M., Askell, A., Welinder, P., Christiano, P. F.,
Leike, J., and Lowe, R. (2022). Training language
models to follow instructions with human feedback.
In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho, and A. Oh, editors, Advances in Neural Infor-
mation Processing Systems, volume 35, pages 27730–
27744. Curran Associates, Inc.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron,
I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L.,
Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou,
R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A.,
Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E.,
McDonald, S., McGuinness, L. A., Stewart, L. A.,
Thomas, J., Tricco, A. C., Welch, V. A., Whiting,
P., and Moher, D. (2021). The prisma 2020 state-
ment: An updated guideline for reporting systematic
reviews. Journal of Clinical Epidemiology, 134:178–
189.
Penning de Vries, B. B., van Smeden, M., Rosendaal, F. R.,
and Groenwold, R. H. (2020). Title, abstract, and key-
word searching resulted in poor recovery of articles in
systematic reviews of epidemiologic practice. Journal
of Clinical Epidemiology, 121:55–61.
Portenoy, J. and West, J. D. (2020). Constructing and eval-
uating automated literature review systems. Sciento-
metrics, 125(3):3233–3251.
Rowe, F. (2014). What literature review is not: diversity,
boundaries and recommendations. European Journal
of Information Systems, 23(3):241–255.
Snyder, H. (2019). Literature review as a research method-
ology: An overview and guidelines. Journal of Busi-
ness Research, 104:333–339.
Tauchert, C., Bender, M., Mesbah, N., and Buxmann, P.
(2020). Towards an integrative approach for auto-
mated literature reviews using machine learning. In
Hawaii International Conference on System Sciences.
Templier, M. and Paré, G. (2015). A framework for guiding
and evaluating literature reviews. Commun. Assoc. Inf.
Syst., 37:6.
van Dinter, R., Tekinerdogan, B., and Catal, C. (2021). Au-
tomation of systematic literature reviews: A system-
atic literature review. Information and Software Tech-
nology, 136:106589.
Wang, Z., Nayfeh, T., Tetzlaff, J., O’Blenis, P., and Murad,
M. H. (2020). Error rates of human reviewers during
abstract screening in systematic reviews. PLOS ONE,
15(1):1–8.
Webster, J. and Watson, R. T. (2002). Analyzing the past
to prepare for the future: Writing a literature review.
MIS Quarterly, 26(2):xiii–xxiii.
Wilson, E., Cruz, F., Maclean, D., Ghanawi, J., McCann,
S. K., Brennan, P. M., Liao, J., Sena, E. S., and
Macleod, M. (2023). Screening for in vitro system-
atic reviews: a comparison of screening methods and
training of a machine learning classifier. Clinical Sci-
ence, 137(2):181–193.
Wu, T., Terry, M., and Cai, C. J. (2022). Ai chains: Trans-
parent and controllable human-ai interaction by chain-
ing large language model prompts. In Proceedings of
the 2022 CHI Conference on Human Factors in Com-
puting Systems, CHI ’22, New York, NY, USA. Asso-
ciation for Computing Machinery.
Yu, W., Iter, D., Wang, S., Xu, Y., Ju, M., Sanyal, S., Zhu,
C., Zeng, M., and Jiang, M. (2023). Generate rather
than retrieve: Large language models are strong con-
text generators. In The Eleventh International Confer-
ence on Learning Representations.
APPENDIX
In this appendix, we list some implementation details
of ALISE.
Libraries
Langchain (https://python.langchain.com)
s2orc-doc2json (https://github.com/allenai/s2orc-doc2json)
Transformers
(https://github.com/huggingface/transformers)
Langchain was used for the evidence-answer LLM chain, with Flan-t5-XL loaded via Transformers and the Hugging Face Hub. To transform the PDF papers into strings, we used the s2orc-doc2json converter.
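A minimal sketch of how these libraries fit together is given below, assuming the LangChain 0.0.x API that was current at the time of writing; the prompt texts are placeholders, not the templates of Figure 3.

from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

# Flan-t5-XL served through the transformers pipeline wrapper.
llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-xl",
    task="text2text-generation",
)

# Placeholder templates; the actual evidence/answer templates are shown in Figure 3.
evidence_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["question", "chunk"],
        template="Question: {question}\nContext: {chunk}\nEvidence:",
    ),
)
answer_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["question", "evidence"],
        template="Question: {question}\nEvidence: {evidence}\nAnswer:",
    ),
)

# Two-step evidence-answer chain for one text chunk and one research question.
evidence = evidence_chain.run(question="<research question>", chunk="<text chunk>")
answer = answer_chain.run(question="<research question>", evidence=evidence)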
Prompt Templates
Figure 3 contains the evidence and answer prompt
templates used. These templates performed best
in the evidence-answer chain evaluation with the QASPER dataset.
Figure 3: Prompt templates.