Automatic Summarization Based on Sentence

Morpho-Syntactic Structure: Narrative Sentences

Compression

Mehdi Yousﬁ-Monod and Violaine Prince

LIRMM, UMR 5506, 161 rue Ada, 34392 Montpellier Cedex 5, France

Abstract. We propose an automated text summarization through sentence com-

pression. Our approach uses constituent syntactic function and position in the

sentence syntactic tree. We ﬁrst deﬁne the idea of a constituent as well as its role

as an information provider, before analyzing contents and discourse consistency

losses caused by deleting such a constituent. We explain why our method works

best with narrative texts. With a rule-based system using SYGFRAN’s morpho-

syntactic analysis for French [1], we select removable constituents. Our results

are satisfactory at the sentence level but less effective at the whole text level, a

situation we explain by describing the difference of impact between constituents

and relations.

1 Introduction

The amount of information available on the Web or in some compagnies, administra-

tions and laboratories is always increasing, thus hardening information retrieval on such

resources. Automatic summarization, aiming at considerably reducing the size of such

data, appears to be a good solution to ease this search. It does so by introducing a

smaller but relevant text, and thus shortens the choice duration of a request, concerning

text relevance acceptance.

The main idea of our research is to ﬁnd text contraction bounds by sentence com-

pression without major content loss. Its originality relies on constituents syntactic func-

tion and position in the syntactic tree, to select deletable constituents.

In next section, we enumerate the main automatic summarization approaches, then

we compare those working at a ﬁner granularity level (section 2); we then outline our

sentence compression method (section 3); we illustrate its effectiveness with experi-

menting a prototype application-applied to story/short novel texts (section 4); and ﬁ-

nally we discuss about the results of this experiment and draw some perspectives (sec-

tion 5).

2 Summarization by sentence compression

In this article, we only focus on sentence compression.

[2] tackles the sentence compression problem by using a noisy-channel model con-

sisting in making the following assumption: “We look at a long string and imagine

Yousﬁ-Monod M. and Prince V. (2005).

Automatic Summarization Based on Sentence Morpho-Syntactic Structure: Narrative Sentences Compression.

In Proceedings of the 2nd International Workshop on Natural Language Understanding and Cognitive Science, pages 161-167

DOI: 10.5220/0002570201610167

 SciTePress

that it was originally a short string, and then someone added some additional, optional

text to it. Compression is a matter of identifying the original short string.”. The aim is

then to locate this optional text and to remove it. To do so, the authors use a Bayesian

probabilistic model trained on a corpus composed by documents with their summary.

[3] focuses on detecting and removing relative clauses which are preceeded by

clauses like NP

P rep NP

, where NP

and N P

are noun phrases and P rep is

a preposition. The purpose is to correctly attach the relative referent by choosing a wide

or local attachment.

These two approaches based on textual units shorter than sentences do not take into

account the sentences constituents syntactic function and position in the syntactic tree.

In fact, function and position are naturally useful to help choosing the constituents to

be removed. Moreover, such a technique is easily checked by human examination.

3 Compression by pruning the syntactic tree

The starting point of our approach was the insight that the sentence constituents syn-

tactic function and position in the syntactic tree, plays a weighty role in the con-

stituents importance for text understanding. This insight comes from logical gram-

matical analysis. Indeed, some adjective phrases, adverbials, etc, are not systematically

needed to understand the main sentence meaning,

This approach needs a sentence morpho-syntactic analysis tool (section 3.1) and a

survey on constituents importance relative to their syntactic function and position in the

syntactic tree (section 3.2). We present our system architecture in the section 3.3.

3.1 The morpho-syntactic analyser

Since our working language is French, our experiments have been run on this language.

However, the same methods can be easily transposable to English or other languages

for which syntactic parsers have been developed.

We use the French morpho-syntactic parser called SYGFRAN, based on the op-

erational system SYGMART, both deﬁned in [1]. SYGFRAN uses a transformation

rules set of structured elements, based on French grammar rules. It transforms a sen-

tence (raw text) in a syntactic tree (structured element) enriched with information about

constituents. This parser has the following advantages: the fastness: the analysis com-

plexity is O(k ∗ n ∗ log

(n)) where k is the rules number and n the text length. the

robustness: SYGFRAN manages to produce a correct structure for at least 30% of the

different cases of French sentences syntaxes, for other cases, SYGFRAN provides a

partial but workable analysis. the production of a syntactic tree: much of the existing

syntactic analysis systems only achieve a basic linear tagging and those providing a tree

are not robust enough relatively to the body of existing syntactic constructions.

SYGFRAN takes a raw text input and produces a bracketed structure, correspond-

ing to the morpho-syntactic tree of each text sentence, in which many variables are

acquainted on the different constituents natures, syntactic functions, canonical forms,

grammatical categories, tense, gender, number, etc.

161

3.2 Function and Position

The constituents deletion test is addressed by many French grammar works to help

in attaching a syntactic function to a constituent. The test is validated if the resulting

sentence remains grammatically consistent. However, linguistic texts dealing with the

constituent importance in the sentence according to their syntactic function are rather

uncommon. Some recommendations are provided by linguists, but there is no funda-

mental rule.

So we have proceeded in the following way. We have considered these recom-

mendations as working assumptions and we have tried to support them empirically.

Mel’

cuk, in his contemporary French analysis, speaks about syntactic functions known

as governement (in the aftermath of Chomsky’s works). Constituents are said to be

governors, if they are mandatory to grammatical coherence and to sentence semantics.

The sentence subject and its verbal group are governors in a grammatical coherence

viewpoint.

We have noted three non-governor constituent categories likely to be deleted, ac-

cording to their syntactic function and their position: adverbials, epithets and apposi-

tions. As we can see, they have a medium granularity level. Appositions, when trans-

formed in relative clauses (noun complement) get a wider granularity level, thus in-

creasing the ﬁnal compression ratio.

Adverbials. We have noticed that the most important adverbials where temporal and

purpose ones. They do answer the questions we deem the most important namely

“When ?” and “In which purpose ?” In the case where a location adverbial is present

after the verb “to be”, deletion cannot be done. ”to be” is a particular verb, and must be

cautiously dealt with.

However, if several location adverbials are consecutive, all but one can be deleted

without major content loss : “John is

in the car, in the car park, near to the sweet shop.”

At last, adverbials located in interrogative sentences appear to be extremely important

since they do issue the question.

Epithets. Adjectives, adjective phrases and some relative clauses (noun complement)

have an epithet function. In a way similar to adverbials, when an epithet si located after

the verb “to be”, and more generally after a stative verb, its importance considerably

increases, making deletion impossible.

Also, we have noticed that when the epithet is located in a noun phrase in which the

determiner is a deﬁnite article, then its deletion is difﬁcult. The reason is the deﬁnite

article is used to speak about a speciﬁc entity and , thus the noun epithet allows to dif-

ferenciate this entity from others.

Appositions. Apposition may be of different types and might appear asa noun phrase, a

pronoun, a relative clause, a present participle clause, a past participle clause, an inﬁni-

tive clause.In the ﬁrst three cases, constituents can be easily deleted. Participle clauses

can be deleted too, but with a more important content loss. In the latter case, deleting the

clause appears to be more difﬁcult, because the inﬁnitive clause systematically provides

an important information completing the subject.

162

3.3 Architecture

Our system architecture is outlined in ﬁgure 1. It relies on all considerations provided

in the preceding section about the importance of constituents in a sentence. It is based

on a parser output in the form of syntactic trees, and produces as an output, a text

coloration of the deletable segments according to constituents hierarchy. The way the

system works is: source text is fed to SYGFRAN, which in turn produces syntactic trees.

Then, the textual segment selection/coloration module uses the following information to

accomplish the selection: the source text, syntactic trees and variables/values provided

by SYGFRAN, the size/loss ratio threshold not to exceed, being provided by the user or

deﬁned by the application type and last, the constituents selection rules set, to achieve

the different constituents selection iterations until the size/loss ratio is satisﬁed. The

selected constituents are then deleted.

SYGFRAN

syntactic analysis

Source text

Syntactic trees and

associated variables

Contracted text

Colored text

If CC and ...

If epithet and ...

If apposition and ...

Constituents

selection rules

User or

application

Constituents selection

iterations until

size/ loss

ratio satisfaction

Selection module

iterations

trees and

variables

use

rules

application

Fig.1. From the source text to the compressed text: our sentence compression system

163

4 Experiment

We have have implemented a part of our theory in a computer program to assess the

effectiveness of such an approach. We have deﬁned a system using basic rules, based

on our experimental survey’s results (section 3.2).

Our current prototype only performs one iteration. The ﬁrst step consists in coloring

deletable constituents. A color is assigned to each constituent type. So it is easy to assess

rules quality on the processed text before actually deleting these constituents.

In the second step, colored segments are deleted to produce the summary. The cho-

sen textis a French Haitian story. We have chosen a French text because the current rules

set of SYGFRAN allows it to analyze only French sentences. The reason of choosing

this story is that SYGFRAN produces a correct syntax for all the sentences of this text

and because it is a well-sized, good representative of what is a narrative text. The col-

oration result of a story part is presented in the ﬁgures 2 (the orignial French version)

and 3 (the English translated one).

5 Discussion

With our current rule set, our method has allown us to delete approximatively 34% of the

full text. We can note a light discursive content and coherence loss, which is more than

satisfactory relatively to current automatic summarizers. Moreover, the grammatical

consistency is preserved. We think our rules can be more reﬁned, but there is a lack

of linguistic information in this domain. For this text, SYGFRAN provides us correct

syntactic trees, but variable values are not systematically true and full. For adverbials,

SYGFRAN only speciﬁes the object semantics for the temporal and locative ones. The

other somehow lack semantic information.

Selecting rules of deletable constituents can be more reﬁned according to con-

stituent function and especially to text types. Concerning this subject, we project to

carry out experiments on more texts dealing with more different types. However, sen-

tence compression is not sufﬁcient to produce a summary of a satisfying size in most

application cases. As we have already seen, compression greatly depends on the text

type. So we consider our intra-sentential approach as one of the tasks to perfom in

the automatic summary production, in complement with other approaches working at a

granularity level at least as big as sentences.

6 Conclusion

Current automatic summarization approaches use information such as term frequency,

lexical relations, POS tags, probabilistical learning engines, texts rhetorical structure,

however, none of them use both constituents syntactic function and position in the

syntactic tree as our is able. Our approach has started by a survey on the sentence

constituents importance. The deletion criterion evaluates the contents and coherence

loss generated by constituents deletion. The selection criterion is based on constituents

syntactic function and position in the syntactic tree. Narrative texts (novels, stories, ...)

appeared to be the most suitable for such an approach. We have modeled a sentence

164

Au bout d’un moment elle bougea et marmonna: “Quelle sorte de nuit est-ce donc pour

durer si longtemps ?” Mais elle se rendormit

parce qu’il faisait aussi noir qu’au cœur

de la nuit dans la maison. Finalement elle se r

eveilla en sursaut et se mit

a chercher ses

etements.

Courant de tous c

es, elle arracha ce que Maui avait fourr

e dans les fentes.

Mais c’

etait le jour! Le grand jour! Le soleil

etait d

a haut

dans le ciel ! Elle s’empara

d’un morceau de tapa

pour se couvrir et se sauva de la maison, en pleurant

la pens

d’avoir

ainsi tromp

ee par ses propres enfants . Sa m

ere partie, Maui bondit pr

es du

store qui se balanc¸ait encore de son passage et regarda par l’ouverture. Il vit qu’elle

etait

a loin

, sur la premi

ere pente de la montagne. Puis elle s’arr

eta, saisit

pleines mains

un arbuste de tiare Tahiti, le souleva d’un coup: un trou apparut, elle s’y engouffra et

remit le buisson en place

comme avant.

Maui jaillit

de la maison aussi vite qu’il put, escalada la pente abrupte, tr

ebuchant et

tombant sur les mains car il gardait les yeux ﬁx

es sur l’arbuste de tiare. Il l’atteignit

ﬁnalement, le souleva et d

ecouvrit une

::::

belle

caverne

::::::::

spacieuse

:::

qui

:::::::::

s’enfonc¸ait

::::

dans

::::::::

montagne

egende :

compl

ements circonstanciels, proposition au g

erondif,

:::::::::::::::::

propositions

relatives,

::::::::::::::::

groupes

adjectivaux.

Fig.2. Our text coloration/compression, original French version

After a moment, she stirred and muttered; ”what type of a night it is to be so long” ?

But she went back to sleep

because it was as dark in the house as in the core of the night

. Finally she woke up with a start and began to look for her clothes. Running everywhere

she tore up what Maui had slipped into the holes. It was day ! The full bright day ! The

sun was already high

up in the sky! She took a piece of tapa to cover herself and ﬂed

from home,weeping at the thought that she had been so deceived by her own children.

His mother gone Maui jumped close to the window shade that was still moving after

her and looked through the openning. He saw that she was already far away ,on the ﬁrst

slope of the mountain. Then she stopped, grabbed a Tahiti tiara bushtree with her whole

arms and lifted it up completely : a hole appeared, and she rushed in and then put the

bushtree back

like before.

Maui sprang up

from the house as quickly as possible, climbed up the abrupt slope ,

stumbling and falling on his hand, because his eyes were kept on the tiara bushtree.

He ﬁnally reached it, lifted it up and found a

:::::::

beautiful

:::::::

spacious

cave

that

:::::

went

::::

deep

::::

under

:::

the

::::::::

mountain

Legend :

adverbials, gerund clause,

::::::::::::

relative

clause,

::::::::::::

adjective

group.

Fig.3. Our text coloration/compression, English translated version

165

compression system based on constituents deletion. The creation of a rule system based

on our model has allowed us to assess the feasibility of such an approach. We ﬁrst

have colored the constituents according to selection rules, in order to judge the rele-

vance of each rule. Our method managed to delete approximatively 34% of the test text,

while preserving a good grammatical coherence. We thus conclude that our compres-

sion could be useful when used as one of the tasks of a wider automatic summarization

process, either as a ﬁrst-phase running summarization, or as a post-phase, after having

removed larger chunks of text. We plan to augment accuracy of text sentences prun-

ing by running our system on important narrative text corpora, ﬁnd heuristics for wider

portions of text deletion based on rethorical information use text types or domains to

introduce speciﬁc summary rules (scientiﬁc articles in which titles might help to delete

wide portions of text). All this, naturally, will be sorted out and put into a more sophis-

ticated system to provide a better set-up for summarization by compression.

References

1. Chauch

e, J.: Un outil multidimensionnel de l’analyse du discours. In: Coling’84, Standford

University, California (1984) 11–15

2. Knight, K., Marcu, D.: Summarization beyond sentence extraction: a probabilistic approach

to sentence compression. Artiﬁcial Intelligence archive 139(1) (2002) 91–107

3. Siddharthan, A.: Resolving relative clause attachment ambiguities using machine learning

techniques and wordnet hierarchies. In: 5th National Colloquium for Computational Linguis-

tics in the UK (CLUK 2002). (2002) 45–49

166