Compressing Multi-document Summaries through Sentence
Simplification
Sara Botelho Silveira and António Branco
University of Lisbon, Lisbon, Portugal
Edifício C6, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa,
Campo Grande, 1749-016 Lisboa, Portugal
Keywords:
Multi-document Summarization, Sentence Simplification, Sentence Compression.
Abstract:
Multi-document summarization aims at creating a single summary based on the information conveyed by a
collection of texts. After the candidate sentences have been identified and ordered, it is time to select which
will be included in the summary. In this paper, we propose an approach that uses sentence simplification, both
lexical and syntactic, to help improve the compression step in the summarization process. Simplification is
performed by removing specific sentential constructions conveying information that can be considered to be
less relevant to the general message of the summary. Thus, the rationale is that sentence simplification not
only removes expendable information, but also makes room for further relevant data in a summary.
1 INTRODUCTION
The increasing use of mobile devices has brought concerns about text compression, since such devices provide less space for the same amount of text. Compression must be accurate, and all the information displayed should be essential. Multi-document text summarization seeks to identify the most relevant information in a collection of texts, complying with a compression rate that determines the length of the summary.
Ensuring the compression rate and the informativeness of the summary at the same time is not an easy task. The most common solution cuts the last candidate sentence at the exact word where the compression rate is reached, compromising the fluency and grammaticality of the summary, and thus the quality of the final text. An alternative is to keep the last candidate sentence in full, surpassing the compression rate. Neither of these solutions is optimal. Compromising the compression rate to preserve the quality of the text may add content that is not actually relevant, while compromising the quality of the text can be troublesome for a user wanting to make use of the summary.
Given this, our proposal is to use sentence simpli-
fication to compress the extracted sentences down to
their main content only, so that more information can
fit into the summary, producing a more informative
text. After the summarization process has determined the most significant sentences, sentential structures that are less essential to figure in the summary's short space can be removed.
The rationale behind using sentence simplification
in a summarization context is twofold. On the one
hand, it removes expendable information, generating
a simpler and easier to read text. On the other hand,
it allows the addition of more individual (simplified) sentences to the summary, which otherwise would not have been included. Experiments with human users
(Silveira and Branco, 2012b) have shown that simpli-
fication indeed helps to improve the summaries pro-
duced.
Note that sentence simplification is also referred to in the literature as sentence compression. In this work, the expression "sentence simplification" is used to denote "sentence compression", in order to distinguish it from "compression" itself. We use "compression" to name the step that follows simplification in the summarization process, in which the sentences identified as the most relevant ones are selected, based on a predefined compression rate, thus compressing the initial set of sentences contained in the collection of texts submitted as input.
At this point, consider the following list of sen-
tences that can be part of the summary:
1. EU leaders signed a new treaty to control budgets
on Friday.
2. Only Britain and the Czech Republic opted out
of the pact, signed in Brussels at a summit of EU
leaders.
3. UK Prime Minister David Cameron, who with the
Czechs refused to sign it, said his proposals for cut-
ting red tape and promoting business had been ig-
nored.
4. The countries signed up to a promise to anchor
in their constitutions – if possible – rules to stop their
public deficits and debt spiralling out of control in the
way that led to the eurozone crisis.
5. The treaty must now be ratified by the parliaments
of the signatory countries.
This list contains 105 words. However, a compression rate of 80% of the original text states that the summary must only contain 84 words. As the sum of the words of the first three sentences (57 words) does not reach the desired total number of words for the summary, the fourth sentence is also added. Yet, by adding the fourth sentence, the summary makes up 92 words, so the total number of words defined by the compression rate has been surpassed by 9 words. The first option would be to cut the last nine words of the last sentence. That would produce an incorrect sentence.
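For concreteness, this selection step can be sketched as follows (a minimal illustration only; the function name and the whitespace-based word counting are ours, not part of the summarizer):

def select_until_budget(sentences, budget):
    # Add ordered candidate sentences until the word budget is exceeded.
    # Returns the selected sentences and the number of words by which the
    # budget is overshot (0 if it is met exactly).
    selected, used = [], 0
    for sentence in sentences:
        if used >= budget:
            break
        selected.append(sentence)
        used += len(sentence.split())
    return selected, max(0, used - budget)

# Applied to the five candidate sentences of the running example, this
# yields a selection of four sentences and an overshoot of a few words,
# which simplification is then expected to absorb.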
There are particular constructions that can be removed from these sentences, making room for the inclusion of more relevant information. Appositions, parenthetical phrases and relative clauses are examples of such constructions.
Consider, for instance, the following expressions, which are candidates for removal:
The parenthetical phrase: signed in Brussels at a
summit of EU leaders
The relative clause: who with the Czechs refused
to sign
The parenthetical phrase: if possible
These expressions amount to a total of 18 words. The last sentence, which has not been added to the summary, amounts to 13 words. So, if all these expressions were removed from the sentences, we would be able to include the last sentence in the summary. Otherwise, that sentence would not be included in the final text, despite being relevant to the overall informativeness of the summary.
The summary, in which sentences have been sim-
plified, contains 84 words and is shown below:
EU leaders signed a new treaty to control budgets on
Friday.
Only Britain and the Czech Republic opted out of the
pact.
UK Prime Minister David Cameron said his propos-
als for cutting red tape and promoting business had
been ignored.
The countries signed up to a promise to anchor in
their constitutions rules to stop their public deficits
and debt spiraling out of control in the way that led
to the eurozone crisis.
The treaty must now be ratified by the signatory
countries’ parliaments.
This way, it is possible to produce a comprehensible and fluent summary that contains the maximum of relevant information conveyed by the original collection of texts.
This paper is organized as follows: Section 2 re-
ports the related work; Section 3 overviews the sum-
marization process; Section 4 describes our empirical
approach to simplification; Section 5 details the ex-
periments that combine simplification with summa-
rization; and, finally, in Section 6, conclusions are
drawn.
2 RELATED WORK
Text simplification is a Natural Language Processing (NLP) task that aims at making a text shorter and more readable by simplifying its sentences structurally, while preserving as much as possible the meaning of the original sentences. This task is commonly addressed in two ways: lexical and syntactic simplification. Lexical simplification involves replacing infrequent words by simpler, more common and accessible synonyms. Syntactic simplification, in turn, involves a linguistic analysis of the input texts that produces detailed tree-structure representations, over which transformations can be made (Feng, 2008).
Previous works (Chandrasekar et al. (1996) and
Jing (2000)) have focused on syntactic simplification,
targeting specific types of structures identified using
rules induced through an annotated aligned corpus of
complex and simplified texts.
Jing and McKeown (2000) used simplification in
a single-document summarizer, by performing opera-
tions, based on the analysis of human abstracts, that
remove inessential phrases from the sentences. Blair-
Goldensohn et al. (2004) remove appositives and
relative clauses in a preprocessing phase of a multi-
document summarization process. Another proposal is that of Conroy et al. (2005), who combine a simplification method, which uses shallow parsing to detect lexical cues that trigger phrase eliminations, with an HMM sentence selection approach, to create multi-document summaries.
Closer to our work is that of Siddharthan et al. (2004), in which sentence simplification is applied
together with summarization. However, they used
simplification to improve content selection, that is,
before extracting sentences to be summarized. Their
simplification system is based on syntactic simplifica-
tion performed using hand-crafted rules that specify
relations between simplified sentences.
Zajic et al. (2007) applied sentence compression
techniques to multi-document summarization, using
a parse-and-trim approach to generate headlines for
news stories. Constituents are removed iteratively from the sentence parse tree, using rules that perform lexical simplification (replacing temporal expressions, preposed adjuncts, determiners, conjunctions, and modal verbs) and syntactic simplification (selecting specific phenomena in the parse tree).
A different approach was used by Cohn and Lapata (2009), who experimented with a tree-to-tree transduction method for sentence compression. They trained a model that uses a synchronous tree substitution grammar, which allows local distortions of the tree topology and is used to capture structural mismatches between trees.
Filippova (2010) used a word graph method to create a single simplified sentence from a cluster of similar or related sentences. Considering all the words in these related sentences, a directed word graph is built by linking word A to word B through an adjacency relation. This method was used to avoid redundancy in the summaries produced.
Lloret (2011) proposed a text summarization sys-
tem that combines textual entailment techniques, to
detect and remove information, with term frequency
metrics used to identify the main topics in the collec-
tion of texts. In addition, a word graph method is used to compress and fuse information, in order to produce abstract summaries.
More recently, Wubben et al. (2012) investigated
the usage of a machine translation technique to per-
form sentence simplification. They created a method
for simplifying sentences by using Phrase Based Ma-
chine Translation, along with a re-ranking heuristic
based on dissimilarity. Then, they trained it on a
monolingual parallel corpus, and achieved state-of-
the-art results. Finally, Yoshikawa et al. (2012) pro-
posed new semantic constraints, to perform sentence
compression. These constraints are based on seman-
tic roles, in order to directly capture the relations be-
tween a predicate and its arguments.
3 SUMMARIZATION PROCESS
The system used is an extractive multi-document
summarizer that receives a collection of texts in Por-
tuguese and produces highly informative summaries.
Summarization is performed by means of two
main phases executed in sequence: clustering by sim-
ilarity and clustering by keywords. Aiming to avoid
redundancy, sentences are clustered by similarity, and
only one sentence from each cluster is selected. In turn, the keyword clustering phase seeks to identify the most relevant content within the input texts. The keywords of the input texts are retrieved, and the sentences that are successfully grouped into a keyword cluster are selected to be used in the next step of
the process. Furthermore, each sentence has a score,
which is computed using the tf-idf (term frequency –
inverse document frequency) of each sentence word,
smoothed by the number of words in the sentence.
This score defines the relevance of each sentence and
it is thus used to order all the sentences. Afterwards,
the simplification process detailed in Section 4 is per-
formed, producing the final summary. A detailed de-
scription of this extractive summarization process can
be found in (Silveira and Branco, 2012a).
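The overall shape of this pipeline can be sketched as follows; this is a deliberately naive illustration, in which the similarity clustering, the keyword extraction and the scoring are simple placeholders and not the actual system's implementation:

import math
from collections import Counter

def tfidf_score(sentence, doc_freq, n_docs):
    # tf-idf of each sentence word, smoothed by the number of words
    # in the sentence (cf. Equation 1 in Section 4.2).
    words = sentence.lower().split()
    total = 0.0
    for word, tf in Counter(words).items():
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        total += tf * idf
    return total / max(1, len(words))

def summarize(documents, top_keywords=10):
    # documents: a list of documents, each a list of sentence strings.
    sentences = [s for doc in documents for s in doc]
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents
                       for w in set(" ".join(doc).lower().split()))

    # 1. Similarity clustering: here naively approximated by grouping
    #    sentences with identical word sets and keeping one per group.
    representatives, seen = [], set()
    for s in sentences:
        key = frozenset(s.lower().split())
        if key not in seen:
            seen.add(key)
            representatives.append(s)

    # 2. Keyword clustering: keep sentences sharing at least one of the
    #    most frequent words (stop words ignored for brevity).
    keywords = {w for w, _ in doc_freq.most_common(top_keywords)}
    candidates = [s for s in representatives
                  if keywords & set(s.lower().split())]

    # 3. Order the candidates by the smoothed tf-idf sentence score.
    candidates.sort(key=lambda s: tfidf_score(s, doc_freq, n_docs),
                    reverse=True)
    return candidates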
4 SIMPLIFICATION PROCESS
In this work, simplification is performed together with compression.
Firstly, from the original input list of sentences, a new list is created by selecting one sentence at a time, until the total number of words in the list surpasses the maximum number of words determined by the compression rate.
Afterwards, sentences are simplified by remov-
ing the expendable information in view of the general
summarization purpose. There are a number of struc-
tures that can be seen as containing "elaborative" information about the content already expressed. In this work, five types of structures are targeted:
Appositions: noun phrases that describe, detail or modify their antecedent (also a noun phrase);
Adjectives;
Adverbs or adverb phrases;
Parentheticals: phrases that explain or qualify other information being expressed;
Relative clauses: clauses, introduced by a relative pronoun, that modify a noun phrase.
After all these structures have been obtained by considering the sentence parse tree, they are removed from the original sentence tree. The final step determines whether the new simplified sentence can replace the former sentence, based on a specific criterion that takes into account the sentence score.
Since simplification removes words from the sentences, once it has been performed, new sentences are added to the list so that the maximum number of words of the summary is reached once again. Those newly added sentences are then simplified. This process is repeated while the list keeps changing or the compression rate has not been met.
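This loop can be sketched as follows (a conceptual illustration only; simplify stands for the structure-removal step of Section 4.1 and score for the sentence score of Section 4.2, both assumed as given):

def compress_with_simplification(ordered_sentences, budget, simplify, score):
    # ordered_sentences: candidate sentences already ranked by relevance;
    # simplify(s): returns a simplified version of s (or s unchanged);
    # score(s): sentence relevance score (see Section 4.2).
    summary, remaining = [], list(ordered_sentences)

    def words():
        return sum(len(s.split()) for s in summary)

    changed = True
    while changed and words() < budget:
        changed = False
        # Refill the list until the word budget is surpassed.
        while remaining and words() < budget:
            summary.append(remaining.pop(0))
            changed = True
        # Simplify the sentences in the list, keeping a simplified version
        # only when it scores higher than the original (Section 4.2).
        for i, sentence in enumerate(summary):
            simplified = simplify(sentence)
            if simplified != sentence and score(simplified) > score(sentence):
                summary[i] = simplified
                changed = True
    return summary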
4.1 Structure Identification
In order to perform simplification, a parse tree is cre-
ated for each sentence, using a constituency parser for
Portuguese (Silva et al., 2010). The structures prone to be removed are identified in the tree using Tregex (Levy and Andrew, 2006), a utility for matching patterns in trees. Tregex takes a parse tree and a tree pattern and returns the subtrees of the initial tree whose top node matches the pattern.
After identifying the subtrees representing each structure, these subtrees are replaced by null trees in the original sentence parse tree, removing their content and generating a new tree without the identified structures. The two steps mentioned above are applied to each of the candidate sentences that have been selected to be part of the summary.
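This identify-and-remove mechanism can be illustrated with the following sketch, which uses NLTK trees and a plain Python predicate as stand-ins for the LX-parser output and the Tregex patterns used in the actual system:

from nltk.tree import ParentedTree

def prune_subtrees(tree_str, keep):
    # Delete every subtree for which keep() is False and return the
    # surface string of what remains.
    tree = ParentedTree.fromstring(tree_str)
    doomed = [st.treeposition() for st in tree.subtrees()
              if st is not tree and not keep(st)]
    # Delete deepest/rightmost positions first so that earlier deletions
    # do not invalidate the remaining positions.
    for pos in sorted(doomed, reverse=True):
        parent = tree[pos[:-1]]
        del parent[pos[-1]]
    return " ".join(tree.leaves())

# Illustration: drop a parenthetical PP enclosed in commas (cf. Section 4.1).
def keep(st):
    leaves = st.leaves()
    return not (st.label() == "PP"
                and leaves[:1] == [","] and leaves[-1:] == [","])

sent = ("(S (NP (ART O) (N Parlamento)) (VP (V aprovou) "
        "(PP (PNT ,) (PP (P por) (NP (A ampla) (N maioria))) (PNT ,)) "
        "(NP (ART a) (N proposta))) (PNT .))")
print(prune_subtrees(sent, keep))   # O Parlamento aprovou a proposta .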
Main Clause Selection. First, the main clause of the sentence is identified. In this phase, unlike in the next one, the desired subtree is the one that is kept, while the other subtrees of the main tree are ignored. Consider the following sentence and its parse tree:
No Médio Oriente, apenas Israel saudou a operação.
In the Middle East, only Israel welcomed the operation.

(S
  (S
    (PP (PP (P Em_) (NP (ART o) (N' (N Médio) (N Oriente)))) (PNT ,))
    (S (NP (ADV apenas) (NP (N Israel)))
       (VP (V saudou) (NP (ART a) (N operação)))))
  (PNT .))

Apenas Israel saudou a operação.
Only Israel welcomed the operation.
The expression No Médio Oriente is ignored, since it is not part of the main clause. The original sentence is replaced by the simplified one, to which further simplification rules are applied.
The main clause is obtained by applying the following pattern to the parse tree: [S < (VP , NP)]. The main clause S immediately dominates (<) a verb phrase VP that immediately follows (,) a noun phrase NP. This pattern can retrieve several S nodes contained in a sentence; the main clause selected is the one closest to the tree root node. Note that this pattern does not retrieve, for instance, sentences that are not in SVO order. If the sentence is not in this format, the whole original sentence is used. The main clause is then used to identify the other structures.
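As an illustration, the effect of the pattern [S < (VP , NP)] on the example above can be approximated over NLTK trees as follows (a sketch only; the actual system applies the Tregex pattern itself):

from nltk.tree import Tree

def main_clause(tree):
    # Breadth-first search, so the matching S closest to the root wins;
    # fall back to the whole tree when no clause matches (non-SVO case).
    queue = [tree]
    while queue:
        node = queue.pop(0)
        if isinstance(node, Tree):
            labels = [c.label() if isinstance(c, Tree) else c for c in node]
            if node.label() == "S" and any(a == "NP" and b == "VP"
                                           for a, b in zip(labels, labels[1:])):
                return node
            queue.extend(node)
    return tree

sent = Tree.fromstring(
    "(S (S (PP (PP (P Em_) (NP (ART o) (N' (N Médio) (N Oriente)))) (PNT ,)) "
    "(S (NP (ADV apenas) (NP (N Israel))) "
    "(VP (V saudou) (NP (ART a) (N operação))))) (PNT .))")
print(" ".join(main_clause(sent).leaves()))   # apenas Israel saudou a operação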
Phrase Compression. In this stage, the main clause
obtained previously is used to identify the structures
to be removed. The subtrees of the structures are iden-
tified in the sentence parse tree. As was already men-
tioned, five types of structures are targeted: (i) apposi-
tions, (ii) adjectives, (iii) adverbs, (iv) parentheticals,
and (v) relative clauses. Examples of each of these structures, and the patterns used, are discussed below.
Appositions are noun phrases that describe, detail or modify their antecedent (also a noun phrase). Consider the following sentence:
José Sócrates, primeiro-ministro, e Jaime Gama querem cortar os salários dos seus gabinetes.
José Sócrates, the Prime Minister, and Jaime Gama want to cut the salaries of their offices.

(S (S (S
  (NP (NP (N' (N José) (N Sócrates))
      (NP (PNT ,)
      (NP (N' (ORD primeiro) (N ministro)))
      (PNT ,)))
  (CONJP (CONJ e) (NP (N' (N Jaime) (N Gama)))))
  (VP (V querem) (VP (V cortar)
  (PP (P em_) (NP (ART os) (N' (N salários)
  (PP (P de_) (NP (ART os)
  (N' (POSS seus) (N gabinetes))))))))))
  (PNT .)))

José Sócrates e Jaime Gama querem cortar os salários dos seus gabinetes.
José Sócrates and Jaime Gama want to cut the salaries of their offices.
Appositions are identified using two patterns, differing in the punctuation symbol used to enclose the apposition, which can be either a comma or a dash. The comma apposition pattern is: [NP|AP <<, (PNT < ,) <<- (PNT < ,)]. An apposition subtree has a noun phrase NP or (|) an adjective phrase AP as its top node. Its leftmost (<<,) and rightmost (<<-) descendants are punctuation tokens PNT that immediately dominate (<) a comma (or a dash).
Adjectives qualify nouns or noun phrases, thus being
structures prone to be removed. Consider the follow-
ing sentence:
O palco tem um pilar central, com 50 metros de altura.
The stage has a central pillar, 50 meters high.
(S
(NP (ART O) (N palco))
(VP (V tem)
(NP (ART um) (N’ (N’ (N pilar)
(A central)
)
(PP (PNT ,) (PP (P com)
(NP (CARD 50) (N’ (N metros)
(PP (P de) (NP (N altura))))))))))
(PNT .))
O palco tem um pilar, com 50 metros de altura.
The stage has a pillar, 50 meters high.
Adjectives are identified using the pattern: [A $, N]. The subtree top node is an adjective A, which is a sister of a noun N and immediately follows it ($,).
Adverbs or adverb phrases are treated differently depending on whether they appear in a noun phrase or in a verb phrase, due to the use of adverbs of negation, which precede the verb. The adverbs appearing in a VP are handled with extra care, to avoid removing negative adverbs and thus modifying the meaning of the sentence. Consider the following sentence:
José Sócrates chegou um pouco atrasado ao debate.
José Sócrates arrived a little late to the debate.

(S
  (NP (N' (N José) (N Sócrates)))
  (VP (VP (V chegou)
  (AP (ADVP (ART um) (ADV pouco))
  (A atrasado)))
  (PP (P a_) (NP (ART o) (N debate))))
  (PNT .))

José Sócrates chegou atrasado ao debate.
José Sócrates arrived late to the debate.
The pattern for an adverb occurring in a VP is: [ADV|ADVP , V|VP]. The subtree top node is an adverb ADV or (|) an adverb phrase ADVP, which must immediately follow (,) a verb V or a verb phrase VP. When the adverb appears in a noun phrase, the pattern is instead: [ADV|ADVP $ N|NP]. The subtree top node is an adverb ADV or (|) an adverb phrase ADVP that is a sister ($) of a noun N or (|) a noun phrase NP, occurring before or after it.
Parentheticals are phrases that explain or qualify
other information being expressed. Consider the fol-
lowing sentence:
O Parlamento aprovou, por ampla maioria, a proposta.
The Parliament approved by large majority the proposal.
(S
(NP
(ART O) (N Parlamento))
(VP
(V’ (V aprovou)
(PP
(PNT ,)
(PP (P por)
(NP (N’ (A ampla) (N maioria))))
(PNT ,))
)
(NP
(ART a) (N proposta)))
(PNT .))
O Parlamento aprovou a proposta.
The Parliament approved the proposal.
Parentheticals are identified using several differ-
ent patterns depending on the punctuation symbol
(parenthesis, commas or dashes) that encloses the
phrase. The comma pattern is: [PP|ADV|ADVP|CONJ|CONJP <<, (PNT < ,) <<- (PNT < ,)]. The subtree top node can be either a prepositional phrase PP, or (|) an adverb ADV, an adverb phrase ADVP, a conjunction CONJ, or a conjunctional phrase CONJP. Its leftmost descendant (<<,) is a punctuation token PNT that immediately dominates (<) a comma. In the same way, its rightmost descendant (<<-) is a punctuation token immediately dominating a comma.
Relative clauses are clauses that modify a noun
phrase. They have the same structure as appositions,
differing in the top node. Consider the sentence:
O Parlamento aprovou a proposta, que reduz os venci-
mentos dos deputados.
The Parliament approved the proposal, which reduces
the salaries of deputies.
(S
(NP (ART O) (N Parlamento))
(VP
(V aprovou) (NP (ART a) (N’ (N proposta)
(CP
(PNT ,)
(CP
(NP (REL que))
(S
(VP (V reduz)
(NP (ART os) (N’ (N vencimentos)
(PP (P de_) (NP (ART os)
(N deputados)))))))))
)))
(PNT .))
O Parlamento aprovou a proposta.
The Parliament approved the proposal.
CompressingMulti-documentSummariesthroughSentenceSimplification
115
The pattern is: [CP <<, (PNT < ,) <<- (PNT < ,)]. The subtree top node is a complementizer phrase CP, whose leftmost (<<,) and rightmost (<<-) descendants are punctuation tokens PNT immediately dominating (<) a comma.
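For reference, the five patterns of this section can be approximated by plain predicates over NLTK subtrees, as sketched below; this is a rough Python rendering of the Tregex patterns given in the text, ignoring the dash and parenthesis variants:

from nltk.tree import Tree

def child_labels(parent):
    return [c.label() if isinstance(c, Tree) else None for c in parent]

def comma_enclosed(st, top_labels):
    # Leftmost and rightmost descendants are commas (cf. the PNT < , tests).
    leaves = st.leaves()
    return (st.label() in top_labels
            and len(leaves) >= 2 and leaves[0] == "," and leaves[-1] == ",")

def is_apposition(st):                 # NP|AP <<, (PNT < ,) <<- (PNT < ,)
    return comma_enclosed(st, {"NP", "AP"})

def is_parenthetical(st):              # PP|ADV|ADVP|CONJ|CONJP enclosed in commas
    return comma_enclosed(st, {"PP", "ADV", "ADVP", "CONJ", "CONJP"})

def is_relative_clause(st):            # CP <<, (PNT < ,) <<- (PNT < ,)
    return comma_enclosed(st, {"CP"})

def is_removable_adjective(parent, i): # A $, N
    child = parent[i]
    return (isinstance(child, Tree) and child.label() == "A"
            and i > 0 and child_labels(parent)[i - 1] == "N")

def is_removable_adverb(parent, i):    # ADV|ADVP , V|VP  /  ADV|ADVP $ N|NP
    child = parent[i]
    if not (isinstance(child, Tree) and child.label() in {"ADV", "ADVP"}):
        return False
    labels = child_labels(parent)
    if parent.label().startswith("V"):     # inside a VP: must follow the verb,
        return i > 0 and labels[i - 1] in {"V", "VP"}  # so negation is kept
    return any(l in {"N", "NP"} for j, l in enumerate(labels) if j != i)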
4.2 Sentence Selection
After the structures have been removed from the sen-
tence, it is time to determine if this new simplified
sentence should replace the original one.
Hence, the sentence score is considered. In the
summarization algorithm, the sentence score defines
the sentence relevance to the complete collection of
sentences obtained from the input texts. This score
is computed using the tf-idf metric, which states that
the relevance of a term depends not only on its frequency over the collection of texts, but also on the number of documents in which the term occurs. Equation 1 describes the computation of the sentence score:
\[ \mathrm{score}_S = \frac{\sum_{w \in S} \mathit{tfidf}_w}{\mathit{totalWords}_S} \qquad (1) \]

Hence, score_S of a sentence S measures the relevance of this sentence with respect to the collection of sentences obtained from the input texts.
As words or expressions were removed from the
original sentence to create the new simplified sen-
tence, the score of this simplified sentence must be
computed, considering only the words that it now
contains. After having both sentence scores, the orig-
inal sentence score is compared with the one of its
simplified version. If the simplified sentence score is
higher than the one of the original sentence, the sim-
plified sentence replaces the former one in the sum-
mary.
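This scoring and replacement criterion can be sketched as follows (assuming a precomputed tf-idf value per word, as in Equation 1; the helper names are ours):

def sentence_score(sentence, tfidf):
    # Equation 1: sum of the tf-idf of the sentence words, smoothed by
    # the number of words in the sentence.
    words = sentence.split()
    return sum(tfidf.get(w.lower(), 0.0) for w in words) / max(1, len(words))

def choose(original, simplified, tfidf):
    # Keep the simplified sentence only if its score beats the original's.
    if sentence_score(simplified, tfidf) > sentence_score(original, tfidf):
        return simplified
    return original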
This procedure ensures that simplification indeed
helps to improve the content of the summary, by in-
cluding only the simplified sentences that contribute
to maximize the informativeness of the final sum-
mary.
5 EXPERIMENTS
A prototype of the simplification approach detailed
above was developed and tested. In this experiment,
we used CSTNews (Aleixo and Pardo, 2008), an annotated corpus composed of texts in Portuguese. It contains 50 sets of news texts from several domains, for a total of 140 documents, 2,247 sentences, and 47,428 words. Each set contains, on average, three documents that address the same subject. The texts were retrieved from five Brazilian newspapers. In addition, for each set of texts, the corpus contains a manually built summary, the so-called ideal summary, that is used to assess the quality of the automatic summaries. There are 50 ideal summaries, containing a
total of 6,859 words. Each ideal summary has on av-
erage 137 words, resulting in an average compression
rate of 85%.
5.1 Simplification Statistics
The simplification process handles several types of
syntactic structures that can be removed from a sen-
tence, by executing the simplification algorithm de-
tailed in Section 4. Occurrences of these construc-
tions in the complete corpus were detected and the
total number of words they contain was obtained.
Statistics of the targeted structures occurring in the
complete corpus are shown in Table 1.
Table 1: Corpus targeted structures (totals).
Phrase type Total Total words
Appositions 362 1,795
Parentheticals 639 3,272
Relative clauses 178 1,446
Adjectives 999 1,199
Adverbs 288 411
Total 2,466 8,123
Note that the numbers in this table were obtained
by running the simplification process over all the sen-
tences of the corpus. However, there can be more structures in the sentences than are counted here. The reasons are twofold: (i) while retrieving the main clause of each sentence, adjunct expressions may be ignored, and (ii) structures can be nested inside other structures. When obtaining, for instance, a relative clause, there can be an adjective or an adverb inside it that will not be taken into account, since only the larger clause is counted.
While retrieving the main clauses, 1,200 sentences that include adjunct expressions were found, containing 12,481 words that can be removed. Considering
all the words in the structures – added to the number
of words removed when selecting the main clauses –,
we retrieve 44% of the words of all texts, a total of
20,694 words. Without considering the main clauses,
these structures make up 17% of all texts. Thus, there
is a profusion of material to work with.
Table 2 displays the number of structures found, taking into account only the sentences that are considered when executing the simplification algorithm, that is, after applying the compression step.

Table 2: Simplification structures to be removed.
Phrase type Total Total words
Appositions 77 400
Parentheticals 145 811
Relative clauses 26 273
Adjectives 178 226
Adverbs 28 39
Total 454 1,749

A total of 15,983 words, from the 731 sentences, were considered in this simplification process. In this part of the corpus, the first compression step (main clause selection) modifies 197 sentences by removing 3,323 words. This corresponds to an initial compression of 20%, that is, it makes room in the summaries for an additional 158 sentences of average size (21 words).
Additionally, with the removal of the words iden-
tified in the structures, the compression can be up to
11%. Adding the compression attained with the main clause selection (20%) to the one achieved by removing the identified structures (11%), we are able to remove a total of 5,072 words from the original texts. Considering an average-sized sentence, this number of words corresponds to, on average, 241 sentences of novel information that can be included in the summary.
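For reference, these sentence counts follow from dividing the removed word totals by the average sentence length of 21 words assumed in the text:

\[ 3{,}323 + 1{,}749 = 5{,}072, \qquad 3{,}323 / 21 \approx 158, \qquad 5{,}072 / 21 \approx 241. \]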
5.2 Comparative Results
For each set of the CSTNews corpus, two types
of summaries were created using our summariza-
tion system: simplified summaries and non-simplified
summaries. The simplified summaries are built by in-
cluding the simplification module in the summariza-
tion procedure, and the non-simplified summaries are
created without running that module. An example
of both a simplified (Figure 1) and a non-simplified
summary (Figure 2), for the same set of texts, is
presented. As a baseline, we used GistSumm (Pardo et al., 2003), a single-document summarizer that also performs multi-document summarization, the only one available for Portuguese.
In this experiment, we used a compression rate of
85%, since this is the average compression rate of the
ideal summaries, which means that the summary con-
tains 15% of the words contained in the set of texts to
be summarized.
Afterwards, ROUGE (Lin, 2004) was used to automatically compute the precision, recall and F-measure (the harmonic mean of precision and recall, weighting both equally) metrics. Four ROUGE metrics were computed: ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-S. Although it is the most common one, ROUGE-1 may not be very suitable to the task at stake, since it requires consecutive matches between the summaries being compared. Hence, we also computed the other three metrics. ROUGE-2 computes 2-gram co-occurrences, allowing some gaps between the sentences. ROUGE-L identifies the common subsequences between two sequences, not requiring consecutive matches. Finally, ROUGE-S computes any pair of words in their sentence order, allowing for arbitrary gaps. Results are shown in Table 3.
Table 3: ROUGE metrics for the automatic summaries.
ROUGE-1
Recall Precision F-measure
GistSumm 0.4216 0.4776 0.4425
Non-Simplified 0.5400 0.5199 0.5231
Simplified 0.5768 0.5101 0.5357
ROUGE-2
Recall Precision F-measure
GistSumm 0.2089 0.2393 0.2204
Non-Simplified 0.3031 0.2953 0.2948
Simplified 0.3535 0.3126 0.3283
ROUGE-L
Recall Precision F-measure
GistSumm 0.3832 0.4339 0.4017
Non-Simplified 0.5024 0.4836 0.4866
Simplified 0.5467 0.4832 0.5078
ROUGE-SU
Recall Precision F-measure
GistSumm 0.1760 0.2239 0.1801
Non-Simplified 0.2822 0.2651 0.2594
Simplified 0.3171 0.2499 0.2683
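As a side note for reproducibility, ROUGE-1, ROUGE-2 and ROUGE-L (though not ROUGE-S) can nowadays also be computed with the rouge-score Python package; the snippet below is merely an illustration with made-up strings, not the evaluation setup of this paper, which relied on the original ROUGE package (Lin, 2004):

from rouge_score import rouge_scorer

# Hypothetical strings standing in for an ideal and an automatic summary.
ideal_summary = "EU leaders signed a new treaty to control budgets on Friday."
automatic_summary = "EU leaders signed a treaty to control budgets."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(ideal_summary, automatic_summary)
for name, s in scores.items():
    print(name, round(s.precision, 4), round(s.recall, 4), round(s.fmeasure, 4))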
Some conclusions can be drawn from these results. The complete summarization process has an overall better performance than the baseline, since it outperforms the baseline for all ROUGE variants and all the metrics considered (precision, recall and f-measure). Yet, note that the precision values of the simplified summaries are, in general, lower than those of the non-simplified summaries. In contrast, recall values are, on average, 3 points higher.
On the one hand, the precision values are lower. Intuitively, precision should decrease, since fewer in-sequence matches are found in the simplified summaries. These values are penalized because structures within a sentence are removed, which explains the higher precision values of the non-simplified summaries, when compared to those of the simplified summaries, for most of the metrics under study.
On the other hand, the recall values obtained are very encouraging. They mean that there is a large overlap between the words in the simplified summaries and the words in the ideal summaries. So, retrieving the most relevant information in a sentence by discarding the less relevant data ensures that the summary indeed contains the most important content conveyed. This is a direct result of the simplification process, since new information is added to the summary after sentence simplification has been carried out.

The water level at the river Severn reached 10.4 meters at some point, nearly going over the defense barriers, which are 10.7 meters in height, according to the BBC on Monday, the 23rd.
The rain that has been hitting the United Kingdom has covered roads and millions of people are without electricity and water due to the worst flood the country has seen in the last 60 years.
The worst floods in the last 60 years in the United Kingdom have left thousands of British people homeless.
The United Kingdom's two largest rivers are threatening to overflow this Monday.
Figure 1: Example of a simplified summary (translated from Portuguese) – in its original version, this summary contains 93 words.

The torrential rain that hit the United Kingdom has covered roads and thousands of people are without electricity and drinking water due to the worst flood the country has seen in 60 years, according to the BBC television network on Monday, the 23rd.
The United Kingdom's largest rivers, the Severn and the Thames, are threatening to overflow this Monday, further aggravating the situation in the central and southern areas of Britain, which have been punished by floods since last Friday.
Figure 2: Example of a non-simplified summary (translated from Portuguese) – in its original version, this summary contains 82 words.
Finally, the f-measure values of the simplified summaries are higher than both the GistSumm ones and the non-simplified ones for all ROUGE metrics, reflecting the better summaries produced when the simplification process is applied. Note that ROUGE metrics other than ROUGE-1 allow for gaps between the sentences, thus better reflecting the kind of transformations performed by the simplification process. The differences between the precision values are very slight and do not influence the final f-measure results, which are clearly driven by the high recall values.
In conclusion, despite removing information from
the sentences, the summarization process combined
with the simplification approach presented produces
highly informative summaries. The combination of
these two procedures seeks to create summaries con-
taining not only the most relevant information present
in the input texts, but also as much of this information
as possible. Moreover, we can conclude that the types of structures that our simplification process removes indeed convey information that is additional to the overall comprehension of the sentence, and their removal makes room in the summary for more novel information.
Thus, our compression strategy produces high recall summaries. At the same time, the simplified summaries contain the most relevant information with respect to the ideal summaries, achieving our goal of including as much information as possible in each summary.
6 FINAL REMARKS
This paper presents an approach that performs syntac-
tic simplification. The rules that make up this simpli-
fication approach are fully detailed.
The approach that combines summarization with simplification proved to be effective when the complete procedure was evaluated. However, the approach presented modifies the sentences in such a way that currently available automatic metrics are not entirely fair when evaluating summaries containing simplified sentences. Thus, a human evaluation is needed to assess the effective gain of simplification.
Even so, recall results are very promising and indicate that simplification allows for the inclusion of further sentences containing relevant information. Although the precision values are lower than expected, they do not impact the final f-measure results, which indicate that summarization followed by simplification can produce highly informative summaries.
REFERENCES
Aleixo, P. and Pardo, T. A. S. (2008). CSTNews: Um córpus de textos jornalísticos anotados segundo a teoria discursiva multidocumento CST (cross-document structure theory). Technical report, Universidade de São Paulo.
Blair-Goldensohn, S., Evans, D., Hatzivassiloglou, V., McKeown, K., Nenkova, A., Passonneau, R., Schiffman, B., Schlaikjer, A., Siddharthan, A., and Siegelman, S. (2004). Columbia University at DUC 2004. In Proceedings of the 2004 Document Understanding Conference (DUC 2004), HLT/NAACL 2004, pages 23–30, Boston, Massachusetts.
Chandrasekar, R., Doran, C., and Srinivas, B. (1996). Motivations and methods for text simplification. In Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING '96), pages 1041–1044.
Cohn, T. and Lapata, M. (2009). Sentence compression as
tree transduction. J. Artif. Intell. Res. (JAIR), 34:637–
674.
Conroy, J., Schlesinger, J., and Stewart, J. (2005). Classy
query-based multidocument summarization. In Pro-
ceedings of 2005 Document Understanding Confer-
ence, Vancouver, BC.
Feng, L. (2008). Text simplification: A survey. Technical
report, The City University of New York.
Filippova, K. (2010). Multi-sentence compression: finding
shortest paths in word graphs. In Proceedings of the
23rd International Conference on Computational Lin-
guistics, COLING ’10, pages 322–330, Stroudsburg,
PA, USA. ACL.
Jing, H. (2000). Sentence reduction for automatic text sum-
marization. In Proceedings of the sixth conference on
Applied natural language processing, pages 310–315,
Morristown, NJ, USA. Association for Computational
Linguistics.
Jing, H. and McKeown, K. R. (2000). Cut and paste based
text summarization. In Proceedings of the 1st North
American chapter of the Association for Computa-
tional Linguistics conference, NAACL 2000, pages
178–185, Stroudsburg, PA, USA. ACL.
Levy, R. and Andrew, G. (2006). Tregex and Tsurgeon:
Tools for querying and manipulating tree data struc-
tures. In Proceedings of the 5th Language Resources
and Evaluation Conference (LREC).
Lin, C.-Y. (2004). Rouge: A package for automatic evalu-
ation of summaries. In Text Summarization Branches
Out: Proceedings of the ACL-04 Workshop, pages 74–
81, Barcelona, Spain. ACL.
Lloret, E. (2011). Text Summarisation based on Human
Language Technologies and its Applications. PhD the-
sis, Universidad de Alicante.
Pardo, T. A. S., Rino, L. H. M., and das Graças V. Nunes, M. (2003). GistSumm: A summarization tool based on a new extractive method. In PROPOR, Lecture Notes in Computer Science, pages 210–218. Springer.
Siddharthan, A., Nenkova, A., and McKeown, K. (2004).
Syntactic simplification for improving content selec-
tion in multi-document summarization. In COLING
’04: Proceedings of the 20th international conference
on Computational Linguistics, page 896, Morristown,
NJ, USA. ACL.
Silva, J., Branco, A., Castro, S., and Reis, R. (2010). Out-of-the-box robust parsing of Portuguese. In Proceedings of the 9th Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR), pages 75–85.
Silveira, S. B. and Branco, A. (2012a). Combining a dou-
ble clustering approach with sentence simplification
to produce highly informative multi-document sum-
maries. In IRI 2012: 14th International Conference
on Artificial Intelligence, pages 482–489, Las Vegas,
USA.
Silveira, S. B. and Branco, A. (2012b). Enhancing multi-
document summaries with sentence simplification. In
ICAI 2012: International Conference on Artificial In-
telligence, Las Vegas, USA.
Wubben, S., van den Bosch, A., and Krahmer, E. (2012). Sentence simplification by monolingual machine translation. In ACL, The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers, pages 1015–1024. The Association for Computer Linguistics.
Yoshikawa, K., Iida, R., Hirao, T., and Okumura, M. (2012).
Sentence compression with semantic role constraints.
In ACL The 50th Annual Meeting of the Associa-
tion for Computational Linguistics, Proceedings of the
Conference, July 8-14, 2012, Jeju Island, Korea - Vol-
ume 2: Short Papers, pages 349–353. The Association
for Computer Linguistics.
Zajic, D., Dorr, B. J., Lin, J., and Schwartz, R. (2007).
Multi-candidate reduction: Sentence compression as a
tool for document summarization tasks. Inf. Process.
Manage., 43(6):1549–1570.
CompressingMulti-documentSummariesthroughSentenceSimplification
119