A Comparison of Methods for the Evaluation of
Text Summarization Techniques
Marcello Barbella, Michele Risi and Genoveffa Tortora
Department of Computer Science, University of Salerno, Fisciano (SA), Italy
Keywords:
Automatic Text Summarization Algorithms, Extractive, Abstractive, ROUGE Metric, Bert.
Abstract:
Automatic Text Summarization techniques aim to extract key information from one or more input texts au-
tomatically, producing summaries that preserve the meaning of the content. These techniques are divided into two main families: Extractive and Abstractive, which differ in their operating mode. The former picks up sentences directly from the document text, whilst the latter produces a summary by interpreting the text and rephrasing sentences to incorporate the key information. Therefore, there is the need to evaluate and verify how close a summary is to the original text. The research question is: how to evaluate the quality of the summaries produced
by these techniques? Different metrics and scores have been proposed in the literature (e.g., ROUGE) for the
evaluation of text summarization. Thus, the main purpose of this paper is to investigate in depth the behaviour of the ROUGE metric. In particular, we performed a first experiment to compare the efficiency of the metric in the evaluation of Abstractive versus Extractive Text Summarization algorithms while, in a second one, we compared the obtained scores for two different summary approaches: the single execution of a summarization algorithm versus the multiple execution of different algorithms on the same text. The conclusions lead to the following interesting results: ROUGE does not achieve excellent results, because it has similar performance on both the Abstractive and Extractive algorithms; multiple execution works better than single execution most of the time.
1 INTRODUCTION
Today, a great deal of textual data is available, arising from various sources. Being able to extract knowledge from these sources, especially from long texts, is an increasingly complex challenge for humans. The development of information technology, particularly in the Artificial Intelligence (AI) field, has brought out ever more sophisticated techniques for managing and processing data. New innovative algorithms, inter alia, are constantly proposed for the processing and extraction of the most useful information from texts, even those written by humans.
These techniques, grouped as Automatic Text Summarization, allow the creation of summaries composed of the key concepts of any input text. The algorithms capable of producing a summary from a text are divided into two main families:
Extractive: it selects phrases from the input text,
choosing those that best cover all the key informa-
tion, discarding redundancy;
Abstractive: it tries to elaborate a new text, using different and more appropriate words and a different semantic composition, so as to output a simpler text.
As can be easily deduced from the above, the
Abstractive Automatic Text Summarization (AATS)
techniques are much more interesting than the Extrac-
tive (EATS) ones. The literature proposes different
approaches for both methodologies, exploiting super-
vised and unsupervised algorithms (Dalal and Malik,
2013). The main question is how to evaluate the quality of the summaries produced by these methods. The difficulty of comparing one summary with another requires a metric that is easily comparable and as unbiased as possible.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most widely used evaluation metric for this task.
It is based on the overlapping (expressed as a numeric
value) of words between the system summary and the
human one, without considering their semantic and
syntactic correctness.
The current literature is more focused on finding
new summarization algorithms rather than on those to
evaluate them. This work, instead, compares various algorithms on a standard dataset and with a well-defined methodology. In particular, the goal consists of two distinct research activities: i) evaluate the performance of the ROUGE metric on the results of the Abstractive and Extractive algorithms; ii) test the efficiency of its score on two different summarization methods.
By definition, in order to produce a summary, EATS methods choose parts from the original text, while AATS methods tend to use new words; hence, the former should perform much better, because they may have a greater overlap of words. A first experiment is conducted to verify this hypothesis. A second one, instead, involves a test on two different methodologies: the single execution of a text summarization (TS) algorithm versus a multiple sequential execution on the same text, in order to evaluate, by the chosen metric, which of the two techniques is better when the compression rate increases.
The paper is structured as follows: In Section 2,
the most recent technologies proposed by the litera-
ture on Abstractive and Extractive methods are ex-
plored. Section 3 shows the experiment design, taking
care to ensure its reproducibility. The experiment is conducted on a standard dataset, CNN/Daily-Mail, which is mostly used by researchers to evaluate new summarization techniques. Section 4 discusses
the experiment results, whilst in Section 5 the threats
to validity are discussed. Finally, Section 6 draws the
conclusions as well as outlines some ideas for future
works.
2 STATE OF THE ART
As seen in the previous section, there are two types of
approaches to TS: Extractive and Abstractive. Below, the most common techniques for the two types are explored, trying to understand the best way to evaluate a system-generated summary.
(de Oliveira, 2005) shows, through specific experiments, that for a human the evaluation of summary quality is subjective, because it depends on individual criteria of relevance, comprehensibility and readability.
Unfortunately, the evaluation metrics often offer a statistical approach, counting the overlap of words between the generated summary and the original text, without considering the semantic meaning of what the text conveys. In this work, the
ROUGE metric for the experiment results is consid-
ered (but there are various other evaluation methods
such as Pyramid (Nenkova and Passonneau, 2004)
and SSAS (Vadapalli et al., 2017; Gupta and Gupta,
2019)).
2.1 Extractive Method
The most recent studies, particularly in the Deep Learning (DL) and AI fields, have led to the consolidation of innovative and more sophisticated EATS techniques. Some of the most interesting ones found in the literature are described here, including those based on Neural Networks, Graphs and Semantic approaches.
In (Keyvanpour et al., 2019), the basic steps of a TS task are clearly explained. These include a pre-processing phase, a sentence scoring phase and a final phase of text extraction and summary generation. In order to represent a text for the purpose of text summarization, the literature suggests some methods based on vectors and matrices that represent the features extracted from the text.
Neural Network Approaches. Neural networks are widely used, especially for the generation of complex features starting from the input text. (Suleiman and Awajan, 2019) gives an overview of the most common algorithms used today: the Restricted Boltzmann Machine (Verma and Nidhi, 2017; Rezaei et al., 2019), a particular neural network composed of an input layer and a hidden layer, where connections occur only between neurons of different layers; the Variational Auto-Encoder (Yousefi-Azar and Hamey, 2017), based on a neural network consisting of an encoder, a decoder and a loss function; and the Recurrent Neural Network (Chen and Le Nguyen, 2019), made up of a sequence of hidden layers where, for the TS field, a sequence of words is given as input to the network, while the output is formed by the summary words.
Graph Approaches. Instead, in (Janjanam and Reddy, 2019) an approach for TS using Graphs is introduced. A popular graph-based TS model relies on PageRank, the link-analysis algorithm developed by the founders of Google (Page et al., 1999) and related to the HITS algorithm. Other interesting approaches in the literature use the graph as a semantic network between sentences (Han et al., 2016) or as input for a convolutional network (Yasunaga et al., 2017).
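To make the graph approach concrete, below is a minimal, illustrative sketch of a TextRank-style extractive summarizer that ranks sentences with PageRank over a TF-IDF similarity graph; it assumes the nltk, scikit-learn and networkx packages and is not the exact method of the cited works.

```python
# Minimal sketch of a PageRank-style extractive summarizer (TextRank-like).
# Assumes nltk, scikit-learn and networkx; this illustrates the graph approach,
# it is not the exact method of the cited papers.
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)  # sentence tokenizer model

def graph_summarize(text: str, num_sentences: int = 3) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= num_sentences:
        return text
    # Each sentence becomes a node; edges are weighted by TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(similarity)
    # Rank sentences with PageRank and keep the top ones in document order.
    scores = nx.pagerank(graph)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```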
Semantic Approaches. Many commonly used TS
models are based on statistical methods, which usu-
ally do not consider the semantic or contextual mean-
ing of the analyzed text. But through the Semantic
Role Labeling, sentences in a document are well clus-
tered, facilitating the formation of groups of similar
elements. In (Liu et al., 2018), authors used Seman-
tic Analysis to represent sentences in an Abstractive
mode, generating summaries with promising results.
Instead, in (Khan et al., 2018) it is established that semantic knowledge of the text is also fundamental in the AATS methodology.
2.2 Abstractive Method
In recent years, after the rapid expansion of neural
networks and DL techniques, researchers have built a
solid base for the creation of Abstractive algorithms, which go beyond the classical approaches of Natural Language Processing (NLP). In particular, DL models, like those based on seq2seq and the attention mechanism, have brought the study of AATS to a high level, sometimes obtaining better results than Extractive methods.
Seq2seq Model. It is a popular neural network model
based on the encoder-decoder architecture. Gener-
ally, in the literature, particular types of neural net-
works are chosen as internal components for the en-
coder and decoder, such as the Recurrent Neural Network (RNN) (Nallapati et al., 2016), the Gated Recurrent Unit (GRU) or the Long Short Term Memory (LSTM). The latter are the most used because they capture long-term dependencies, overcoming the vanishing gradient problem.
Transformer Network. In (Vaswani et al., 2017), a new network architecture, called Transformer, is proposed, based exclusively on the attention mechanism. It avoids the use of recurrence, which had been the standard in sequence modeling until then, providing a new approach that allows modeling dependencies regardless of their distance in the input or output sequence.
2.3 ROUGE Metric
The ROUGE measure identifies a set of different in-
dexes to quantify the quality of a system summary.
In the literature, the most used metrics are ROUGE-1, ROUGE-2 and ROUGE-L, based on the different granularity of the analyzed texts. Indeed, ROUGE-1
refers to the overlapping of unigrams between the sys-
tem summary and the reference summary; ROUGE-
2 to the overlapping of bigrams; instead ROUGE-L
measures the longest common word sequence, com-
puted by the Longest Common Subsequence (LCS)
algorithm.
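To make the overlap computation concrete, the following is a minimal sketch of ROUGE-N (recall, precision and F1) based on clipped n-gram counts; real evaluations typically rely on an established implementation, and refinements such as stemming and ROUGE-L are omitted here.

```python
# Minimal sketch of ROUGE-N as clipped n-gram overlap between a candidate
# summary and a reference summary; stemming, stopwords and ROUGE-L are omitted.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: a candidate n-gram counts at most as often as it appears
    # in the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_n("the cat sat on the mat", "a cat was sitting on the mat", n=1))
```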
2.4 Bert
In (Devlin et al., 2018) is introduced Bert (Bidirec-
tional Encoder Representations from Transformers).
Its model architecture consists of a bidirectional mul-
tilayer Transformer encoder, based on the implemen-
tation of (Vaswani et al., 2017).
3 EXPERIMENT DESIGN
3.1 Scope Definition
The proposed experiment has two research questions
(RQs). The first is inspired by a wide study of the literature, which has raised some doubts regarding the most commonly used metric to evaluate the quality of Automatic Text Summarization algorithms. In particular, this work focuses on understanding how well the ROUGE metric score actually evaluates the quality of a summary, both for Extractive and Abstractive methods (as explained above, the Extractive methods should produce better results). The second target is to explore whether it is better to use a multiple execution of TS algorithms instead of a single execution, based on the ROUGE scores. Summing up:
RQ1: How different is the ROUGE score achieved
by the EATS methods compared to the AATS
ones? Can this metric score be representative of
the quality of a summary generated by a text sum-
marization algorithm?
The object of study is the ROUGE score obtained by both the EATS and AATS algorithms. The purpose is to estimate the reliability and efficiency of this metric in both cases.
RQ2: How much does the multiple execution of a summary (the execution in sequence of two TS algorithms on the same text, where the output of the first is used as input for the second) differ from the single execution (where the summary is obtained with a single execution of an algorithm)? Is the ROUGE score relevant for evaluating the two methods?
The object of study is the multiple and single execution of TS algorithms. The purpose is to evaluate the quality of the summaries generated by the two techniques against the ROUGE score.
For both RQs: the Perspective is a researcher's point of view and the Context of the experiment execution is the use of TS algorithms on a standard text dataset.
3.2 Experiment Planning
The planning phase shows in detail the various steps
followed for the experiment.
Context Selection. The experiment was conducted
in off-line mode. The goal for the first RQ is to compare the validity and accuracy of the ROUGE metric for the two types of Automatic Text Summarization approaches. For the second RQ, instead, two TS methodologies are compared to evaluate their efficiency using the same metric.
Hypotheses Formulation. For the statistical analy-
sis of the experiment, two hypotheses are formulated:
the null and the alternative, with the aim of confirming or rejecting one of them. Formal descriptions of both
hypotheses for the two RQs are provided in the fol-
lowing, taking into consideration the ROUGE metric
for the comparison.
RQ1 Null: AATS methods have different performances from the EATS ones. (This is because the Extractive methods use, in the output summary, parts of the original text, which should yield a different overlap ratio of N-grams compared to the Abstractive methods, which use new words in the generated summary and therefore different N-grams).
$H_0 : \mu_{\text{ROUGE Ext}} \neq \mu_{\text{ROUGE Abs}}$ (1)
where µ is the mean and ROUGE the score of each
summary.
RQ1 Alternative: The AATS methods have almost
the same performance as the EATS ones. (This
may show that the ROUGE metric is not suitable for the evaluation of system-generated summaries).
$H_A : \mu_{\text{ROUGE Ext}} = \mu_{\text{ROUGE Abs}}$ (2)
RQ2 Null: Multiple executions of the TS algorithms on the same text lead to a result less than or equal to that of a single execution on the same text.
$H_0 : \mu_{\text{ROUGE Multiple}} \leq \mu_{\text{ROUGE Single}}$ (3)
RQ2 Alternative: Multiple executions of the TS algorithms on the same text lead to better results than those of a single execution on the same text.
$H_A : \mu_{\text{ROUGE Multiple}} > \mu_{\text{ROUGE Single}}$ (4)
where µ is the mean and ROUGE the score of each
summary.
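As an illustration of how such hypotheses can be checked, a paired t-test (the test also used later in the results analysis) can be run on the per-summary scores; the score lists below are purely hypothetical placeholders.

```python
# Illustrative paired t-test on per-summary ROUGE scores; the lists below are
# hypothetical placeholders, not the experimental data.
from scipy import stats

rouge_extractive = [0.21, 0.19, 0.24, 0.20, 0.22]
rouge_abstractive = [0.20, 0.21, 0.23, 0.19, 0.22]

t_stat, p_value = stats.ttest_rel(rouge_extractive, rouge_abstractive)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
# For the one-sided RQ2 hypothesis (multiple > single), the two-sided p-value
# can be halved when the t statistic has the expected sign.
```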
Variable Selection. A fundamental step during experiment planning is the selection of variables. Independent variables are the ones that we can control and modify during the experiment. The dependent variables, instead, measure the effect of the experiment on the different combinations of independent variables. For our RQs:
RQ1 Independent Variables: the EATS and AATS
methods. For each of these, different algorithms
will be executed.
RQ1 Dependent Variables: The ROUGE score for
the output of each algorithm. The results will
be averaged in order to have a single comparable
measure.
RQ2 Independent Variables: the TS techniques, including single and multiple executions of the summary. For each of these, different algorithms will be executed in different combinations.
RQ2 Dependent Variables: The ROUGE score for
the output of each algorithm. The results will
be averaged in order to have a single comparable
measure.
Subjects Selection. For both RQ1 and RQ2, the
sampling of the subjects follows the Simple Random
Sampling model, where the subjects are randomly
chosen from a list of the population. In this case,
therefore, the texts to be summarized are randomly
chosen from the referenced dataset composed of ap-
proximately 287.000 elements. For computational ef-
ficiency, each algorithm is run on a block of 1000
texts. Each of these selected texts is submitted to all
algorithms, in order to have an accurate comparison
of the results on the same input. In the first experi-
ment, a total of 40.000 summaries are executed, so 40
blocks of 1000 texts for each one. In the second ex-
periment, are considered only 1000 texts, due to the
computational complexity and the long time required
to the conclusion of the experiment.
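A minimal sketch of this sampling step is shown below, assuming the Hugging Face datasets package and its cnn_dailymail configuration; the authors' actual data-access code is not specified in the paper.

```python
# Sketch of simple random sampling of 1000 articles from CNN/Daily-Mail,
# assuming the Hugging Face "datasets" package; the authors' actual data-access
# code is not specified in the paper.
import random
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")  # ~287,000 articles
indices = random.sample(range(len(dataset)), k=1000)             # one block of texts
block = dataset.select(indices)

for example in block.select(range(3)):
    article = example["article"]       # input text to be summarized
    reference = example["highlights"]  # human-written reference summary
    print(len(article.split()), len(reference.split()))
```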
Design Type Choice and Tools. In this section, the test methodology is described.
General Design Principle. The choice made is to use randomization and balancing techniques. Tests are performed on random blocks of data. For the balancing design principle, each test will be run on a block of 1000 texts to be analyzed. This allows good performance and valid statistical results for each test.
Standard Design Type. For RQ1, one factor with two treatments is chosen as the Design Type. Indeed, with these experiments, we want to compare the EATS and AATS methods. For RQ2 the same Design Type is chosen, that is one factor with two treatments: in particular, we want to compare the performance of the summary obtained by a single versus a multiple execution. Each experiment uses all the considered algorithms. In particular, for the multiple tasks, we consider the execution of an EATS algorithm followed by the execution of an AATS one (and vice versa). For the execution of the experiments, dedicated software was developed in Python.
Figure 1: Boxplot of ROUGE metric scores computed on 1000 summaries by the TextRank algorithm.
3.3 Operation Phase
The experiment operation phase consists of three
steps:
Preparation. Before conducting the experiment, the correctness of the following must be checked in this phase: the code that extracts the random texts from the dataset, the scoring of the ROUGE metric and the algorithm settings. It is also important to set up the code for collecting the results. All calculated scores are placed in a dataset containing the average results for each block of summaries.
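A minimal sketch of this result-collection step is shown below, using pandas to average the per-summary scores within each block and save them to a CSV file; the column and file names are hypothetical.

```python
# Sketch of collecting per-summary ROUGE scores and saving per-block averages
# to a CSV file; the column and file names are hypothetical.
import pandas as pd

rows = [
    {"algorithm": "TextRank", "block": 0, "rouge1": 0.21},
    {"algorithm": "TextRank", "block": 0, "rouge1": 0.19},
    {"algorithm": "LexRank",  "block": 0, "rouge1": 0.25},
    {"algorithm": "LexRank",  "block": 0, "rouge1": 0.23},
]

scores = pd.DataFrame(rows)
block_averages = scores.groupby(["algorithm", "block"], as_index=False)["rouge1"].mean()
block_averages.to_csv("block_averages.csv", index=False)
print(block_averages)
```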
Execution. The experiment lasted several days, due
to the computation time required to run tests. For
RQ1, the algorithms worked in parallel, grouped ac-
cording to the TS methodology and fed by the same
input texts. For RQ2, a sequential computation of the scores was chosen. Initially, for both RQs, dataset
texts were randomly selected and summarized by var-
ious algorithms. Finally, all the summaries were ana-
lyzed with the ROUGE metric.
Data Validation. Data validation was done by ran-
domly checking some entries and making sure of the
consistency of the CSV file. It was also checked whether the sample's ROUGE scores met the standard expected by the algorithm authors.
4 RESULTS ANALYSIS
In this section, the results of the two experiments are discussed, analyzed and interpreted, highlighted by some graphs, and their statistical validity is finally verified.
4.1 Descriptive Analysis
Starting from the hypothesis of the same distribution for each block of summaries, since texts are randomly chosen from the dataset, some key features of the achieved results are shown below. For this purpose, a random execution of the TextRank algorithm is analyzed. Table 1 shows the mean, median and standard deviation values for the three types of ROUGE metrics, to better understand the subsequent plots.
Table 1: Descriptive statistics.
ROUGE Metric   Mean   Median   Standard deviation
ROUGE-1        0.205  0.194    0.002
ROUGE-2        0.059  0.041    0.002
ROUGE-L        0.204  0.189    0.003
First of all, it’s analyzed the results distribution by
a boxplot and a histogram of randomly results. (Each
result is referred to a set of 1000 summaries, which
differs for algorithm and input text). Figure 1 shows
1000 scores for ROUGE-1, ROUGE-2 and ROUGE-
L, taken from the execution of TextRank algorithm.
Figure 2 shows each ROUGE metric distribution
by three representative histograms. As anticipated
by boxplots, ROUGE-1 and ROUGE-L approximate
quite well the normal distribution. This guarantees the
good distribution of data points along all the observa-
tions, and allows us to consider the mean as a valid
representation measure. So, as highlighted, when be-
low is mentioned the ROUGE metric for the evalua-
tion of the experiments, the scores will be referred to
the ROUGE-1.
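A minimal sketch of how such descriptive statistics and plots can be produced from a block of per-summary scores is shown below; the score array is a randomly generated placeholder, not the experimental data.

```python
# Sketch of the descriptive analysis for one block: mean, median, standard
# deviation, boxplot and histogram; the scores are randomly generated placeholders.
import numpy as np
import matplotlib.pyplot as plt

rouge1_scores = np.random.normal(loc=0.205, scale=0.05, size=1000)

print("mean   =", np.mean(rouge1_scores))
print("median =", np.median(rouge1_scores))
print("std    =", np.std(rouge1_scores))

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
ax_box.boxplot(rouge1_scores, labels=["ROUGE-1"])
ax_hist.hist(rouge1_scores, bins=30)
ax_hist.set_xlabel("ROUGE-1 score")
plt.tight_layout()
plt.show()
```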
4.2 Results
4.2.1 RQ1 Results
The aim of the first research question was to estimate the effectiveness of the ROUGE metric in evaluating TS algorithms. The comparison was made on the results of both the EATS and AATS methods, on texts randomly chosen according to the experiment design guidelines.
For this purpose, 40 blocks have been considered, each composed of 1000 summaries. To represent the results, an average was computed within each block and subsequently an overall average for each algorithm.
Figure 3 reports the average score for the 7 exam-
ined algorithms related to ROUGE metric. The first
four (i.e., TextRank, Lsa, Luhn and LexRank) are Ex-
tractive, and the last three (i.e., GloVe, Word2Vec,
Doc2Vec) are Abstractive. As we can see, the average score is very similar across the algorithms. In general, LexRank is the best-performing algorithm, scoring about 10% better than the others. On the other hand, the Abstractive algorithms tend to have very similar values and all their scores are, even if slightly, below the mean.
Bert is one of the most innovative TS algorithms in
recent years, and it leads to better results. It is avail-
able for both the Extractive and the Abstractive ap-
proaches and results for each of them are very close.
Table 2 summarizes all the algorithms used in this
experiment, reporting the relative mean and standard
deviation, for ROUGE metric.
Figure 2: Histogram showing the data distribution for ROUGE-1, ROUGE-2, ROUGE-L scores using TextRank algorithm.
Table 2: Mean and standard deviation of the ROUGE metric for all the used algorithms.
Extractive    Mean   Standard deviation
TextRank      0.205  0.002
Lsa           0.223  0.004
Luhn          0.220  0.003
LexRank       0.242  0.003
Bert Ext.     0.284  0.099
Abstractive   Mean   Standard deviation
Word2Vec      0.213  0.003
Doc2Vec       0.215  0.002
GloVe         0.213  0.003
Bert Abs.     0.288  0.099
A statistical validity test to confirm or reject the hypothesis has also been conducted. To this end, a t-test was performed on the distribution of the results for each summary, paired for Abstractive and Extractive. The degrees of freedom for this test correspond to the observed population of 40,000 summaries. The test results provide a p-value = 2.2e-16, below the required 0.05 for the experiment's statistical validity. This confirms the alternative hypothesis of equality between the Extractive and Abstractive ROUGE scores.
The tested idea was that Extractive approaches should work much better than Abstractive ones. The results, instead, showed that this assumption is not true. In most cases, both types of algorithms worked similarly. The reasons can be very different, and they confirm the assumption that ROUGE is not a good metric to evaluate TS algorithms.
Figure 3: ROUGE average scores of the experiment conducted on Abstractive and Extractive algorithms.
Indeed, remembering that ROUGE compares a system-generated summary with a human one, and that the score is given by a statistical calculation of the N-gram overlap between the two texts, by this logic the more the summaries use different words, the worse the ROUGE metric will score. But this system does not consider the semantics of sentences. So, the Abstractive methods would be at a strong disadvantage, and it could be deduced that algorithms that take random sentences from the original text work very well.
To reinforce the thesis that ROUGE is not very representative, we can also consider the gold standard, that is, the human-generated summary. It should be the best available summary of a text (and it is the optimal target of our algorithms). But if we consider different human-generated summaries, starting from the same source text, the results can be widely different yet all valid and acceptable. However, if we calculated the ROUGE score between two gold standards, the result would not satisfy us.
ROUGE does not take care of all these consider-
ations and can direct us towards results that do not
represent the key quality of a summary. So, the ex-
periments showed that the Abstractive algorithms per-
form similarly to the Extractive ones, considering the
ROUGE score. This confirms that ROUGE can be
considered a bad method for evaluating summaries
generated by TS algorithms.
4.2.2 RQ2 Results
The aim of the second research question was to analyze the results achieved by single versus multiple summary executions. For each block of summaries, the average ROUGE score is computed both per summary and over the total of compared summaries. Two types of multiple execution were considered (a minimal sketch of this chaining follows the list):
1. Extractive algorithms on Abstractive input.
2. Abstractive algorithms on Extractive input.
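The following is a minimal sketch of this chaining, where the output of a first summarizer is used as the input of a second one; the two summarizer functions are toy placeholders for the actual Extractive and Abstractive implementations.

```python
# Sketch of the "multiple execution" setting: the output of a first summarizer
# is the input of a second one. The two stub functions are toy placeholders
# for concrete Extractive and Abstractive implementations.
from typing import Callable

Summarizer = Callable[[str], str]

def single_execution(text: str, algorithm: Summarizer) -> str:
    return algorithm(text)

def multiple_execution(text: str, first: Summarizer, second: Summarizer) -> str:
    intermediate = first(text)    # e.g., an Extractive algorithm
    return second(intermediate)   # e.g., an Abstractive algorithm

# Toy placeholders: keep the first three sentences / the first forty words.
extractive_stub: Summarizer = lambda t: ". ".join(t.split(". ")[:3])
abstractive_stub: Summarizer = lambda t: " ".join(t.split()[:40])

text = "First sentence. Second sentence. Third sentence. Fourth sentence."
print(multiple_execution(text, extractive_stub, abstractive_stub))
```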
Figure 4 shows the two obtained results. We can
clearly note that multiple executions of the algorithms performed better than single ones almost always. The only exception is Bert, which in both cases obtained the best scores in single execution (confirmed by the average of the ROUGE score shown in the two plots too).
Figure 4: Results comparison for summaries between single computation (blue) and multiple computation (green) for each methodology.
For both types of experiment, a t-test is performed to ensure the statistical validity. This test was conducted considering the differences between the Extractive and Abstractive methodologies (see Fig. 4). For each test, the multiple execution approach of a summary is compared with the single one. The population is made up of 1000 paired summaries.
The results are quite different from one another: the t-test for the Extractive method scored a p-value of 0.4, so there is no statistical validity for this experiment. Instead, the t-test for the Abstractive method obtains a p-value of 0.018, less than the required 0.05 for statistical validity, so in this case we can confirm the alternative hypothesis, that the multiple execution performs better than the single execution. These results are very interesting, and they show that the multiple execution method worked better than the single one for all algorithms, except for Bert (see Fig. 4b).
A possible explanation can be attributed to the compression ratio obtained from multiple algorithm runs: a first execution can remove redundant information, and a second one compresses key concepts into a better-scoring summary. This demonstrates that the compression ratio has led the algorithms to preserve as much information of the original text as possible and to include it in the output summary. On the other hand, Bert produces different results. The average of its achieved score is higher than that of all the other algorithms. Of course, Bert is currently one of the best state-of-the-art approaches in text summarization. It uses a new deep neural network architecture, which is far from the other algorithms tested in this experiment. Unlike what was expected, the single execution of Bert performs over 30% better than the multiple one.
Having a more compressed reference summary can be favoured by the ROUGE score, because the ratio of overlapping N-grams, especially if well selected between two algorithms, can lead to distorted results. In fact, we must consider that, although the ROUGE score is very good for multiple executions, the readability of the summary must also be seriously taken into account. In this experiment, we therefore confirmed the alternative hypothesis: in most cases, multiple runs score better than single ones.
5 THREATS TO VALIDITY
The validity of the results of an experiment can be compromised by various types of threats: to the Conclusion, Internal, Construct and External Validity. In this section we examine these threats for RQ1 (the discussion is extendable to RQ2).
Conclusion Validity. Since all the experiments were successfully concluded with comparable results, there is no threat of low statistical power. The choice of performing a large number of tests for each algorithm, followed by computing the average result of each block of summaries, reduces the threats of violated assumptions of statistical tests and of random heterogeneity of subjects, and also avoids fishing. Finally, the ROUGE metric guarantees the reliability of the measurements.
Internal Validity. The experiment results are all reproducible, so there are no risks of history and maturation threats. An in-depth study of the software packages used for the tests also mitigates the instrumentation threats.
Construct Validity. The use of the ROUGE metric allows a numeric comparison, so it neutralizes the inadequate preoperational explication of constructs. Since we are evaluating a specific measure for the quality of text summaries, the use of a second metric is not possible, so there could be a risk of mono-method bias in case of measurement bias (i.e., any systematic or non-random error that occurs in the data collection of a study). Other threats of this type are not considered because they are related to human behaviors.
External Validity. The random selection of the texts to be summarized reduces the threat of interaction of selection and treatment (that is, having a non-representative population sample). The lack of large computational power does not allow the use of optimal algorithms for the topic. For this reason, a series of algorithms was used to compare their results, reducing the threat of interaction of setting and treatment. Finally, the only threat of interaction of history and treatment can come from new and more powerful TS methods.
6 CONCLUSIONS
The main goal of this paper was to question the validity of the ROUGE evaluation metric for TS algorithms and, afterwards, to try to understand whether a single execution of an algorithm leads to better results than a multiple execution. From our experiments, we deduced that ROUGE is not efficient, and that a multiple execution leads to better results than a single one (even when evaluated by ROUGE). Summing up, a good ROUGE score is not synonymous with good summary quality, if we also consider readability and syntactic correctness.
For future developments, it will be possible to extend the analysis to other algorithms, even if less known. The goal may be to discover new approaches
that can directly evaluate the summary quality, avoid-
ing statistical measurements. One idea could be the
use of NLP algorithms for text comprehension. An-
other scenario may be the evaluation of summaries re-
lated to a specific topic, training different algorithms
with data from a narrow interest field, in order to have
interesting and more accurate results.
REFERENCES
Chen, L. and Le Nguyen, M. (2019). Sentence selective
neural extractive summarization with reinforcement
learning. In 11th Intl. Conf. on Knowl. and Sys. Eng.
(KSE), pages 1–5. IEEE.
Dalal, V. and Malik, L. (2013). A survey of extractive and
abstractive text summarization techniques. In 6th Inlt.
Conf. on Emerging Trends in Eng. and Tech., pages
109–110. IEEE.
de Oliveira, P. C. F. (2005). How to evaluate the ‘goodness’
of summaries automatically. PhD thesis, University
of Surrey.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Gupta, S. and Gupta, S. (2019). Abstractive summarization:
An overview of the state of the art. Expert Systems
with Applications, 121:49–65.
Han, X., Lv, T., Hu, Z., Wang, X., and Wang, C. (2016).
Text summarization using framenet-based semantic
graph model. Sci. Prog., 2016.
Janjanam, P. and Reddy, C. P. (2019). Text summariza-
tion: An essential study. In Intl. Conf. on Compu-
tational Intelligence in Data Science (ICCIDS), pages
1–6. IEEE.
Keyvanpour, M. R., Shirzad, M. B., and Rashidghalam,
H. (2019). Elts: A brief review for extractive learning-based text summarization algorithms. In 5th
Intl. Conf. on Web Research (ICWR), pages 234–239.
IEEE.
Khan, A., Salim, N., Farman, H., Khan, M., Jan, B., Ah-
mad, A., Ahmed, I., and Paul, A. (2018). Abstractive
text summarization based on improved semantic graph
approach. International Journal of Parallel Program-
ming, 46(5):992–1016.
Liu, F., Flanigan, J., Thomson, S., Sadeh, N., and
Smith, N. A. (2018). Toward abstractive summariza-
tion using semantic representations. arXiv preprint
arXiv:1805.10399.
Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B.,
et al. (2016). Abstractive text summarization us-
ing sequence-to-sequence rnns and beyond. arXiv
preprint arXiv:1602.06023.
Nenkova, A. and Passonneau, R. J. (2004). Evaluating con-
tent selection in summarization: The pyramid method.
In Human Lang. Tech. Conf. of the North American
Ch. of the Assoc. for Comput. Ling. (HLT-NAACL),
pages 145–152.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999).
The pagerank citation ranking: Bringing order to the
web. Technical Report 1999-66, Stanford InfoLab.
Rezaei, A., Dami, S., and Daneshjoo, P. (2019). Multi-
document extractive text summarization via deep
learning approach. In 5th Conf. on Knowledge Based
Engineering and Innovation (KBEI), pages 680–685.
IEEE.
Suleiman, D. and Awajan, A. A. (2019). Deep learn-
ing based extractive text summarization: Approaches,
datasets and evaluation measures. In 6th Intll. Conf. on
Social Networks Analysis, Manag. and Sec. (SNAMS),
pages 204–210. IEEE.
Vadapalli, R., Kurisinkel, L. J., Gupta, M., and Varma, V.
(2017). Ssas: semantic similarity for abstractive sum-
marization. In 8th Intl. Joint Conf. on Natural Lang.
Proc., pages 198–203.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Adv. in Neural Inf.
Processing Systems, 30:5998–6008.
Verma, S. and Nidhi, V. (2017). Extractive sum-
marization using deep learning. arXiv preprint
arXiv:1708.04439.
Yasunaga, M., Zhang, R., Meelu, K., Pareek, A., Srini-
vasan, K., and Radev, D. (2017). Graph-based neu-
ral multi-document summarization. arXiv preprint
arXiv:1706.06681.
Yousefi-Azar, M. and Hamey, L. (2017). Text summariza-
tion using unsupervised deep learning. Expert Systems
with Applications, 68:93–105.