Efficient and Effective Single-Document Summarizations and a
Word-Embedding Measurement of Quality
Liqun Shao, Hao Zhang, Ming Jia and Jie Wang
Department of Computer Science, University of Massachusetts, Lowell, MA, U.S.A.
Keywords:
Single-Document Summarizations, Keyword Ranking, Topic Clustering, Word Embedding, Softplus Function, Semantic Similarity, Summarization Evaluation, Realtime.
Abstract:
Our task is to generate an effective summary for a given document with specific realtime requirements. We use
the softplus function to enhance keyword rankings to favor important sentences, based on which we present a
number of summarization algorithms using various keyword extraction and topic clustering methods. We show
that our algorithms meet the realtime requirements and yield the best ROUGE recall scores on DUC-02 over
all previously-known algorithms. To evaluate the quality of summaries without human-generated benchmarks,
we define a measure called WESM based on word-embedding using Word Mover’s Distance. We show that
the orderings of the ROUGE and WESM scores of our algorithms are highly comparable, suggesting that
WESM may serve as a viable alternative for measuring the quality of a summary.
1 INTRODUCTION
Text summarization algorithms have been studied in-
tensively and extensively. An effective summary must
be human-readable and convey the central meanings
of the original document within a given length bound-
ary. The common approach of unsupervised summa-
rization algorithms extracts sentences based on im-
portance rankings (e.g., see (DUC, 2002; Mihalcea
and Tarau, 2004; Rose et al., 2010; Lin and Bilmes,
2011; Parveen et al., 2015)), where a keyword may
also be a phrase. A sentence with a larger num-
ber of keywords of higher ranking scores is consid-
ered more important for extraction. Supervised algo-
rithms include CNN and RNN models for generating
extractive and abstractive summaries (e.g., see (Rush
et al., 2015; Nallapati et al., 2016; Cheng and Lapata,
2016)).
We were asked to construct a general-purpose
text-automation tool to produce, among other things,
an effective summary for a given document with the
following realtime requirements: Generate a sum-
mary instantly for a document of up to 2,000 words,
under 1 second for a document of slightly over 5,000
words, and under 3 seconds for a very long document
of around 10,000 words. Moreover, we need to deal
with documents of arbitrary topics without knowing
what the topics are in advance. After investigating
all existing summarization algorithms, we conclude
that unsupervised single-document summarization al-
gorithms would be the best approach to meeting our
requirements.
We use topic clusterings to obtain a good topic
coverage in the summary when extracting key sen-
tences. In particular, we first determine which topic
a sentence belongs to, and then extract key sentences
to cover as many topics as possible within the given
length boundary.
Human judgement is the best evaluation of the
quality of a summarization algorithm. It is a stan-
dard practice to run an algorithm over DUC data and
compute the ROUGE recall scores with a set of DUC
benchmarks, which are human-generated summaries
for articles of a moderate size. DUC-02 (DUC, 2002),
in particular, is a small set of benchmarks for single-
document summarizations. When dealing with a large
number of documents of unknown topics and various
sizes, human judgement may be impractical, and so
we would like to have an alternative mechanism of
measurement without human involvement. Ideally,
this mechanism should preserve the same ordering as
ROUGE over DUC data; namely, if S_1 and S_2 are two summaries of the same DUC document produced by two algorithms, and the ROUGE score of S_1 is higher than that of S_2, then it should also be the case under the new measure.
Louis and Nenkova (Louis and Nenkova, 2009)
devised an unsupervised method to evaluate sum-
marization without human models using common
similarity measures: Kullback-Leibler divergence,
Jensen-Shannon divergence, and cosine similarity.
These measures, as well as the information-theoretic
similarity measure (Aslam and Frost, 2003), are
meant to measure lexical similarities, which are un-
suitable for measuring semantic similarities.
Word embeddings such as Word2Vec can be
used to fill this void and we devise WESM (Word-
Embedding Similarity Measure) based on Word
Mover’s Distance (WMD) (Kusner et al., 2015) to
measure word-embedding similarity of the summary
and the original document. WESM is meant to eval-
uate summaries for new datasets when no human-
generated benchmarks are available. WESM has an
advantage that it can measure the semantic similar-
ity of documents. We show that WESM correlates
well with ROUGE on DUC-02. Thus, WESM may
be used as an alternative summarization evaluation
method when benchmarks are unavailable.
The major contributions of this paper are summa-
rized below:
1. We present a number of summarization algo-
rithms using topic clustering methods and en-
hanced keyword rankings by the softplus func-
tion, and show that they meet the realtime require-
ments and outperform all the previously-known
summarization algorithms under the ROUGE
measures over DUC-02.
2. We propose a new mechanism WESM as an al-
ternative measurement of summary quality when
human-generated benchmarks are unavailable.
The rest of the paper is organized as follows: We
survey in Section 2 unsupervised single-document
summarization algorithms. We present in Section 3
the details of our summarization algorithms and de-
scribe WESM in Section 4. We report the results of
extensive experiments in Section 5 and conclude the
paper in Section 6.
2 EARLY WORK
Early work on single-document summarizations can be described in the following three categories: keyword extractions, coverage and diversity optimizations, and topic clusterings.
2.1 Keyword Extractions
To identify keywords in a document over a corpus of
documents, the measure of term-frequency-inverse-
document-frequency (TF-IDF) (Salton and Buckley,
1987) is often used. When document corpora are
unavailable, the measure of word co-occurrences
(WCO) can produce a comparable performance to
TF-IDF over a large corpus of documents (Matsuo
and Ishizuka, 2003). The methods of TextRank (Mi-
halcea and Tarau, 2004) and RAKE (Rapid Auto-
matic Keyword Extraction) (Rose et al., 2010) further
refine the WCO method from different perspectives,
which are also sufficiently fast to become candidates
for meeting the realtime requirements.
TextRank computes the rank of a word in an undi-
rected, weighted word-graph using a slightly modi-
fied PageRank algorithm (Brin and Page, 1998). To
construct a word-graph for a given document, first re-
move stop words and represent each remaining word
as a node, then link two words if they both appear in
a sliding window of a small size. Finally, assign the
number of co-occurrences of the endpoints of an edge
as a weight to the edge.
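To make the construction concrete, the following is a minimal Python sketch of a TextRank-style word-graph and ranking; the tokenization, stopword filtering, and window handling are simplified assumptions and not the exact setup used in our experiments.

import networkx as nx

def textrank_keywords(words, window=2, stopwords=frozenset()):
    # Minimal sketch of TextRank keyword ranking (illustrative only).
    tokens = [w.lower() for w in words if w.lower() not in stopwords]
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    # Link two words if they co-occur within a small sliding window,
    # weighting the edge by the number of co-occurrences.
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            u, v = tokens[i], tokens[j]
            if u == v:
                continue
            weight = graph[u][v]["weight"] + 1 if graph.has_edge(u, v) else 1
            graph.add_edge(u, v, weight=weight)
    # A weighted PageRank over the graph yields the keyword scores.
    return nx.pagerank(graph, weight="weight")

scores = textrank_keywords(
    "hurricane gilbert swept toward jamaica with hurricane force winds".split())
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])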
RAKE first removes stop words using a stoplist,
and then generates words (including phrases) using a
set of word delimiters and a set of phrase delimiters.
For each remaining word w, the degree of w is the frequency of w plus the number of co-occurrences of consecutive word pairs ww′ and w′′w in the document, where w′ and w′′ are remaining words. The score of w is the degree of w divided by the frequency of w.
We note that the quality of RAKE also depends on
a properly-chosen stoplist, which is language depen-
dent.
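The following short sketch implements this degree-over-frequency scoring as described above; the stoplist and tokenization are placeholders, and the RAKE implementation we actually use also handles phrase candidates.

from collections import Counter

def rake_word_scores(words, stopwords=frozenset()):
    # Sketch of the word scoring described above (illustrative only).
    content = [w.lower() if w.lower() not in stopwords else None for w in words]
    freq = Counter(w for w in content if w)
    cooc = Counter()
    # Count co-occurrences of consecutive remaining-word pairs ww' and w''w.
    for a, b in zip(content, content[1:]):
        if a and b:
            cooc[a] += 1
            cooc[b] += 1
    # degree(w) = freq(w) + co-occurrences; score(w) = degree(w) / freq(w).
    return {w: (freq[w] + cooc[w]) / freq[w] for w in freq}

print(rake_word_scores(
    "hurricane gilbert swept toward jamaica with strong winds".split(),
    stopwords={"with", "toward"}))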
2.2 Coverage and Diversity
Optimization
The general framework of selecting sentences gives
rise to optimization problems with objective functions
being monotone submodular (Lin and Bilmes, 2011)
to promote coverage and diversity. Among them is an
objective function in the form of L(S) + λR(S) with a summary S and a coefficient λ ≥ 0, where L(S) measures the coverage of the summary and R(S) rewards diversity. We use SubmodularF to denote the algorithm computing this objective function. SubmodularF uses TF-IDF values of words in sentences to compute the cosine similarity of two sentences. While it is NP-hard to maximize a submodular objective function subject to a summary length constraint, the submodularity allows a greedy approximation with a proven approximation ratio of 1 − 1/e.
SubmodularF needs labeled data to train the pa-
rameters in the objective function to achieve a bet-
ter summary and it is intended to work on multiple-
document summarizations. While it is possible to
work on a single document without a corpus, we note
that the greedy algorithm has at least a quadratic-
time complexity and it produces a summary with low
ROUGE scores over DUC-02 (see Section 2.4), and
so it would not be a good candidate to meet our needs.
This also applies to a generalized objective function
consisting of a submodular component and a non-
submodular component (Dasgupta et al., 2013).
2.3 Topic Clusterings
Two unsupervised approaches to topic clusterings for
a given document have been investigated. One is
TextTiling (Hearst, 1997) and the other is LDA (La-
tent Dirichlet Allocation) (Blei et al., 2003). Text-
Tiling represents a topic as a set of consecutive para-
graphs in the document. It merges adjacent para-
graphs that belong to the same topic. TextTiling iden-
tifies major topic-shifts based on patterns of lexical
co-occurrences and distributions. LDA computes for
each word a distribution under a pre-determined num-
ber of topics. LDA is a computation-heavy algorithm
that incurs a runtime too high to meet our realtime
requirements. TextTiling, by contrast, runs in nearly linear time and meets our efficiency requirements.
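Since the TCTT scheme described later relies on TextTiling, here is a minimal sketch of how a document can be segmented with the NLTK implementation we use; the file name is a placeholder, and TextTiling expects blank-line paragraph breaks and may not segment very short texts.

import nltk
from nltk.tokenize.texttiling import TextTilingTokenizer

nltk.download("stopwords", quiet=True)  # TextTiling uses the NLTK stopword list

def texttiling_segments(document_text):
    # Split a document into multi-paragraph topic segments.
    return TextTilingTokenizer().tokenize(document_text)

# Illustrative usage (the path is a placeholder):
# segments = texttiling_segments(open("article.txt").read())
# Each element of segments is one topic segment.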
2.4 Other Algorithms
Following the general framework of selecting sen-
tences to meet the requirements of topic coverage and
diversity, a number of unsupervised single-document
summarization algorithms have been devised. The
most notable is CP3 (Parveen et al., 2016), which produces the best ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-SU4 (R-SU4) scores on DUC-02 among all early algorithms, including Lead (Parveen et al., 2015), DUC-02 Best, TextRank, LREG (Cheng and Lapata, 2016), Mead (Radev et al., 2004), ILP_phrase (Woodsend and Lapata, 2010), URANK (Wan, 2010), UniformLink (Wan and Xiao, 2010), Egraph + Coherence (Parveen and Strube, 2015), Tgraph + Coherence (Topical Coherence for Graph-based Extractive Summarization) (Parveen et al., 2015), NN-SE (Cheng and Lapata, 2016), and SubmodularF.
CP3 maximizes importance, non-redundancy, and
pattern-based coherence of sentences to generate a co-
herent summary using ILP. It computes the ranks of
selected sentences for the summary by the Hubs and
Authorities algorithm (HITS) (Kleinberg, 1999), and
ensures that each selected sentence has unique infor-
mation. It then uses mined patterns to extract sen-
tences if the connectivity among nodes in the projec-
tion graph matches the connectivity among nodes in
a coherence pattern. Because of space limitation, we
omit the descriptions of the other algorithms.
Table 1 shows the comparison results, where the results for SubmodularF are obtained using the best parameters trained on DUC-03 (Lin and Bilmes, 2011). Thus, to demonstrate the effectiveness of our algorithms, we will compare our algorithms with only CP3 over DUC-02.
Table 1: ROUGE scores (%) on DUC-02 data.
Methods R-1 R-2 R-SU4
Lead 45.9 18.0 20.1
DUC 2002 Best 48.0 22.8
TextRank 47.0 19.5 21.7
LREG 43.8 20.7
Mead 44.5 20.0 21.0
ILP_phrase 45.4 21.3
URANK 48.5 21.5
UniformLink 47.1 20.1
Egraph + Coh. 48.5 23.0 25.3
Tgraph + Coh. 48.1 24.3 24.2
NN-SE 47.4 23.0
SubmodularF 39.6 16.9 17.8
CP3 49.0 24.7 25.8
Solving ILP, however, is time consuming even on documents of a moderate size, for ILP is NP-hard. Thus, CP3 does not meet the requirements of time efficiency. We will need to investigate new methods.
3 OUR METHODS
We use TextRank and RAKE to obtain initial rank-
ing scores of keywords, and use the softplus function
(Glorot et al., 2011)
sp(x) = ln(1 + e^x)    (1)
to enhance keyword rankings to favor sentences that
are more important.
3.1 Softplus Ranking
Assume that after filtering, a sentence s consists of k keywords w_1, ..., w_k, and w_i has a ranking score r_i produced by TextRank or RAKE. We can rank s using one of the following two methods:

Rank(s) = Σ_{i=1}^{k} r_i    (2)

Rank_sp(s) = Σ_{i=1}^{k} sp(r_i)    (3)

Let DTRank (Direct TextRank) and ETRank (Enhanced TextRank) denote the methods of ranking sentences using, respectively, Rank(s) and Rank_sp(s) over TextRank keyword rankings, and DRAKE (Direct RAKE) and ERAKE (Enhanced RAKE) denote the methods of ranking sentences using, respectively, Rank(s) and Rank_sp(s) over RAKE keyword rankings.
The softplus function is helpful because when x is a small positive number, sp(x) increases the value of x significantly (see Figure 1), and when x is large, sp(x) ≈ x.

Figure 1: Softplus function ln(1 + e^x).

In particular, given two sentences s_1 and s_2, suppose that s_1 has a few keywords with high rankings and the rest of the keywords with low rankings, while s_2 has medium rankings for almost all the keywords. In this case, we would consider s_1 more important than s_2. However, we may end up with Rank(s_1) < Rank(s_2). To illustrate this using a numerical example, assume that s_1 and s_2 each consist of 5 keywords, with original scores (sc) and softplus scores (sp) given in Table 2.
Table 2: Numerical examples with given sc and sp scores.

s_1    w_11   w_12   w_13   w_14   w_15   Rank
sc     2.6    2.2    2.1    0.3    0.2    7.4
sp     2.67   2.31   2.22   0.85   0.80   8.84

s_2    w_21   w_22   w_23   w_24   w_25   Rank
sc     1.6    1.5    1.5    1.5    1.4    7.5
sp     1.78   1.70   1.70   1.70   1.62   8.51
Sentence s_1 is more important than s_2 because it contains three keywords of much higher ranking scores than those of s_2. However, s_2 will be selected without using softplus. After using softplus, s_1 is selected as it should be.
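As a quick check, the following short Python sketch reproduces the numbers in Table 2 (up to rounding) using Equations (2) and (3).

import math

def softplus(x):
    return math.log(1.0 + math.exp(x))

def rank(scores):     # Rank(s), Equation (2)
    return sum(scores)

def rank_sp(scores):  # Rank_sp(s), Equation (3)
    return sum(softplus(r) for r in scores)

s1 = [2.6, 2.2, 2.1, 0.3, 0.2]
s2 = [1.6, 1.5, 1.5, 1.5, 1.4]
print(rank(s1), rank(s2))        # 7.4 < 7.5 (up to floating point): s_2 would be picked
print(rank_sp(s1), rank_sp(s2))  # about 8.84 > 8.51: softplus favors s_1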
For a real-life example, consider the following two sentences from an article in DUC-02:

s_1: Hurricane Gilbert swept toward Jamaica yesterday with 100-mile-an-hour winds, and officials issued warnings to residents on the southern coasts of the Dominican Republic, Haiti and Cuba.

s_2: Forecasters said the hurricane was gaining strength as it passed over the ocean and would dump heavy rain on the Dominican Republic and Haiti as it moved south of Hispaniola, the Caribbean island they share, and headed west.

We consider s_1 more important as it specifies the name, strength, and direction of the hurricane, the places affected, and the official warnings. Using TextRank to compute keyword scores, we have Rank(s_1) = 1.538 < Rank(s_2) = 1.603, which returns the less important sentence s_2. After computing softplus, we have Rank_sp(s_1) = 8.430 > Rank_sp(s_2) = 7.773; the more important sentence s_1 is selected.
Note that not any exponential function would do
the trick. What we want is a function to return roughly
the same value as the input when the input is large,
and a significantly larger value than the input when
the input is much less than 1. The softplus function
meets this requirement.
3.2 Topic Clustering Schemes
We consider four topic clustering schemes: TCS,
TCP, TCTT, and TCLDA.
1. TCS selects sentences without checking topics.
2. TCP treats each paragraph as a separate topic.
3. TCTT partitions a document into a set of multi-
paragraph segments using TextTiling.
4. TCLDA computes a topic distribution for each word using LDA. We set the number of topics from 5 to 8 depending on the length of the document. Assume that a document contains K topics (5 ≤ K ≤ 8) and that topic j consists of k_j words w_{1j}, ..., w_{k_j j}, where 1 ≤ j ≤ K and w_{ij} has a probability p_{ij} > 0. For a document with n sentences s_1, ..., s_n, we use the following maximization (a sketch of which is given after this list) to determine which topic t_z the sentence s_z belongs to (1 ≤ t_z ≤ K):

   t_z = argmax_{1 ≤ j ≤ K} Σ_{i: w_{ij} ∈ s_z} p_{ij}    (4)
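The following is a small sketch of the assignment in Equation (4); the topic-word probabilities below are illustrative placeholders rather than an actual LDA output.

def assign_topic(sentence_words, topic_word_probs):
    # Equation (4): pick the topic whose words the sentence covers with the
    # largest total probability. topic_word_probs maps topic j to {word: p_ij}.
    def mass(j):
        probs = topic_word_probs[j]
        return sum(probs[w] for w in sentence_words if w in probs)
    return max(topic_word_probs, key=mass)

topics = {0: {"hurricane": 0.12, "winds": 0.08},
          1: {"rain": 0.10, "forecasters": 0.05}}
print(assign_topic(["hurricane", "winds", "swept"], topics))  # prints 0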
3.3 Summarization Algorithms
The length of a summary may be specified by users,
either as a number of words or as a percentage of the
number of characters of the original document. By a
“30% summary” we mean that the number of charac-
ters of the summary does not exceed 30% of that of
the original document.
Let L be the summary length (the total number of characters) specified by the user and S a summary. If S consists of m sentences s_1, ..., s_m, and the number of characters of s_i is ℓ_i, then the following inequality must hold:

Σ_{i=1}^{m} ℓ_i ≤ L.
Depending on which sentence-ranking algorithm and which topic-clustering scheme we use, we have eight combinations using ETRank and ERAKE, and eight combinations using DTRank and DRAKE, shown in Table 3. For example, ET3Rank (Enhanced TextTiling TRank) means using Rank_sp(s) to rank sentences and TextTiling to compute topic clusterings, and T2RAKE (TextTiling RAKE) means using Rank(s) to rank sentences over RAKE keywords and TextTiling to compute topic clusterings.
Table 3: Description of all the Algorithms with different
sentence-ranking (S-R) and topic-clustering (T-C) schemes.
Methods S-R T-C
ESTRank ETRank TCS
EPTRank ETRank TCP
ET3Rank ETRank TCTT
ELDATRank ETRank TCLDA
ESRAKE ERAKE TCS
EPRAKE ERAKE TCP
ET2RAKE ERAKE TCTT
ELDARAKE ERAKE TCLDA
STRank DTRank TCS
PTRank DTRank TCP
T3Rank DTRank TCTT
LDATRank DTRank TCLDA
SRAKE DRAKE TCS
PRAKE DRAKE TCP
T2RAKE DRAKE TCTT
LDARAKE DRAKE TCLDA
All algorithms use the following procedure for selecting sentences (a sketch of the selection loop is given after the list):
1. Preprocessing phase
(a) Identify keywords and compute the ranking of
each keyword.
(b) Compute the ranking of each sentence.
2. Sentence selection phase
(a) Sort the sentences in descending order of their
ranking scores.
(b) Select sentences one at a time, from the highest score to the lowest. For each selected sentence s, check whether its topic belongs to the known-topic set (KTS) under the underlying topic clustering scheme, where KTS is the set of topics of the sentences placed in the summary so far. If the topic of s is in KTS, discard s; otherwise, place s into the summary and its topic into KTS.
(c) Continue this procedure until the summary reaches its length constraint.
(d) If the number of topics contained in KTS is equal to the number of topics in the document, empty KTS and repeat the procedure from Step 1.
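The following is a minimal sketch of this selection loop; the function and argument names are illustrative, with ranked_sentences sorted by descending sentence rank, topic_of giving a sentence's topic under the chosen scheme, and length_bound being the character budget L.

def select_sentences(ranked_sentences, topic_of, length_bound):
    # Sketch of the sentence selection phase (illustrative only).
    num_topics = len({topic_of(s) for s in ranked_sentences})
    summary, used, kts = [], 0, set()
    progress = True
    while progress:
        progress = False
        for s in ranked_sentences:
            if s in summary or topic_of(s) in kts or used + len(s) > length_bound:
                continue
            summary.append(s)            # keep the sentence and record its topic
            used += len(s)
            kts.add(topic_of(s))
            progress = True
            if len(kts) == num_topics:   # all topics covered: empty KTS, rescan
                kts.clear()
                break
    return summary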
Figure 2 shows an example of a 30% summary generated by ET3Rank on an article in NewsIR-16.
4 A WORD-EMBEDDING
MEASUREMENT OF QUALITY
Word2vec (Mikolov et al., 2013a; Mikolov et al.,
2013b) is an NN model that learns a vector represen-
tation for each word contained in a corpus of docu-
ments. The model consists of an input layer, a projec-
tion layer, and an output layer to predict nearby words
in the context. In particular, a sequence of T words w_1, ..., w_T is used to train a Word2Vec model by maximizing the probability of neighboring words:

(1/T) Σ_{t=1}^{T} Σ_{j ∈ b(t)} log p(w_j | w_t)    (5)

where b(t) = [t − c, t + c] is the set of indices of the center word w_t's neighboring words, c is the size of the training context, and p(w_j | w_t) is defined by the softmax function. Word2Vec can learn complex word relationships if it is trained on a very large dataset.
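As an illustration, a skip-gram model can be trained with gensim as sketched below; the toy corpus is a placeholder, and the dimensionality argument is omitted because its name differs across gensim versions (size in 2.x, vector_size in 4.x).

from gensim.models import Word2Vec

# Toy corpus for illustration only; the model used in this paper is trained
# on English Wikipedia instead.
corpus = [["hurricane", "gilbert", "swept", "toward", "jamaica"],
          ["forecasters", "said", "the", "hurricane", "was", "gaining", "strength"]]

model = Word2Vec(corpus, sg=1, window=5, min_count=1)  # sg=1: skip-gram objective (5)
print(model.wv.most_similar("hurricane", topn=2))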
4.1 Word Mover’s Distance
Word Mover’s Distance (WMD) (Kusner et al., 2015)
uses Word2Vec as a word embedding representation
method. It measures the dissimilarity between two
documents and calculates the minimum cumulative
distance to “travel” from the embedded words of one
document to the other. Although two documents may
not share any words in common, WMD can still measure their semantic similarity by considering their word embeddings, while bag-of-words or TF-IDF methods only measure similarity by the appearance of words. A smaller value of WMD indicates that the two documents are more similar.
4.2 A Word-embedding Similarity
Measure
Based on WMD's ability to measure the semantic similarity of documents, we propose a summarization evaluation measure WESM (Word-Embedding Similarity Measure). Given two documents D_1 and D_2, let WMD(D_1, D_2) denote the distance between D_1 and D_2. Given a document D, assume that it consists of ℓ paragraphs P_1, ..., P_ℓ. Let S be a summary of D. We compare the word-embedding similarity of the summary S with D using WESM(S, D) as follows:

WESM(S, D) = (1/ℓ) Σ_{i=1}^{ℓ} 1 / (1 + WMD(S, P_i))    (6)

The value of WESM(S, D) is between 0 and 1. Under this measure, the higher the WESM(S, D) value, the more similar S is to D.

Figure 2: An example of a 30% summary of an article in NewsIR-16 by ET3Rank, where the original document is on the left and the summary is on the right.
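A minimal sketch of Equation (6) using gensim's WMD implementation is given below; the vector file path and variable names are placeholders, and wmdistance requires an additional earth-mover's-distance package.

from gensim.models import KeyedVectors

def wesm(summary_tokens, paragraph_token_lists, wv):
    # Equation (6): average 1 / (1 + WMD(S, P_i)) over the paragraphs of D.
    sims = [1.0 / (1.0 + wv.wmdistance(summary_tokens, p))
            for p in paragraph_token_lists]
    return sum(sims) / len(sims)

# Illustrative usage (the path and variables are placeholders):
# wv = KeyedVectors.load_word2vec_format("enwiki-vectors.bin", binary=True)
# score = wesm(summary_text.split(), [p.split() for p in document_paragraphs])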
5 NUMERICAL ANALYSIS
We evaluate the qualities of summarizations using
the DUC-02 dataset (DUC, 2002) and the NewsIR-16
dataset (Corney et al., 2016). DUC-02 consists of 60
reference sets, each of which consists of a number of
documents, single-document summary benchmarks,
and multi-document abstracts/extracts. The common
ROUGE recall measures of ROUGE-1, ROUGE-2,
and ROUGE-SU4 are used to compare the quality of
summarization algorithms over DUC data. NewsIR-
16 consists of 1 million articles from English news
media sites and blogs.
We use various software packages to implement TextRank (with window size = 2) (MIT, 2014), RAKE (MIT, 2015), TextTiling (Boutsioukis, 2016), and LDA and Word2Vec (Rehurek, 2017).

We use an existing Word2Vec model trained on English Wikipedia (Foundation, 2017), which consists of 3.75 million articles formatted in XML. We chose this dataset for its large size and the diverse topics it covers.
5.1 ROUGE Evaluations over DUC-02
As mentioned before, we use CP3 to represent all previously known algorithms for the purpose of comparing the quality of summaries, as CP3 produces the best results among them.

Among all the algorithms we devise, we present only those with at least one ROUGE recall score better than or equal to the corresponding score of CP3, identified in bold (see Table 4). Also shown in the table is the average of the three ROUGE scores (R-AVG). We can see that ET3Rank is the winner, followed by T2RAKE; both are superior to CP3. Moreover, ET2RAKE offers the highest ROUGE-1 score of 49.3.
Table 4: ROUGE scores (%) on DUC-02 data.
Methods R-1 R-2 R-SU4 R-AVG
CP3 49.0 24.7 25.8 33.17
ET3Rank 49.2 25.6 27.5 34.10
ESRAKE 49.0 23.6 26.1 32.90
ET2RAKE 49.3 21.4 24.5 31.73
PRAKE 49.0 24.5 25.3 32.93
T2RAKE 49.1 25.4 25.8 33.43
5.2 WESM Evaluations over DUC-02
and NewsIR-16
Table 5 shows the evaluation results on DUC-02 and
NewsIR-16 using WESM based on the Word2Vec
model trained on English Wikipedia. The first number
in the third row is the average score on all benchmark
summaries in DUC-02. For the rest of the rows, each
number is the average score of summaries produced
by the corresponding algorithm for all documents in
DUC-02 and NewsIR-16. The size constraint of a
summary on DUC-02 for each document is the same
as that of the corresponding DUC-02 summary bench-
mark.
For NewsIR-16, we select at random 1,000 documents and remove the title, references, and other unrelated content from each article. Based on our observation that a 30% length bound allows for a good summary, we compute 30% summaries of these articles using each algorithm.
Table 5: Scores (%) over DUC-02 and NewsIR-16 under WESM with Word2Vec trained on English Wikipedia.
Datasets DUC-02 NewsIR-16
Benchmarks 3.021
ET3Rank 3.382 2.002
ESRAKE 3.175 1.956
ET2RAKE 3.148 1.923
PRAKE 3.150 1.970
T2RAKE 3.247 1.990
It is expected that the scores of our algorithms are better than the score of the benchmarks under each measure, because the benchmarks often use words that are not in the original documents and hence have smaller similarities.
5.3 Normalized L_1-norm
We would like to determine whether WESM is a viable measure. From our experiments, we know that the all-around best algorithm ET3Rank, the second best algorithm T2RAKE, and ET2RAKE occupy the same positions under R-AVG over DUC-02 and under WESM over both DUC-02 and NewsIR-16 (see Table 6); ESRAKE and PRAKE occupy the same positions under R-AVG over DUC-02 and under WESM over NewsIR-16, while ESRAKE and PRAKE each differ by only one place between R-AVG and WESM over DUC-02.
Table 6: Orderings of R-AVG scores over DUC-02 and WESM scores over DUC-02 and NewsIR-16.

Methods      R-AVG     WESM
             DUC-02    DUC-02   NewsIR-16
ET3Rank      1         1        1
ESRAKE       4         3        4
ET2RAKE      5         5        5
PRAKE        3         4        3
T2RAKE       2         2        2
             O_1       O_2      O_3
Next, we compare the ordering of the R-AVG scores and the WESM scores over DUC-02. For this purpose, we use the normalized L_1-norm to compare the distance of two orderings. Let X = (x_1, x_2, ..., x_k) be a sequence of k objects, where each x_i has two values a_i and b_i such that a_1, a_2, ..., a_k and b_1, b_2, ..., b_k are, respectively, permutations of 1, 2, ..., k. Let

D_k = Σ_{i=1}^{k} |(k − i + 1) − i|,

which is the maximum distance two permutations can possibly have. Then the normalized L_1-norm of A = (a_1, a_2, ..., a_k) and B = (b_1, b_2, ..., b_k) is defined by

||A, B||_1 = (1/D_k) Σ_{i=1}^{k} |a_i − b_i|.
Table 6 shows the orderings of the R-AVG scores over DUC-02 and the WESM scores over DUC-02 and NewsIR-16 (from Tables 4 and 5).

It is straightforward to see that D_5 = 12, ||O_1, O_2||_1 = ||O_2, O_3||_1 = 2/12 = 1/6, and ||O_1, O_3||_1 = 0. This indicates that WESM and ROUGE are highly comparable over DUC-02 and NewsIR-16, and that the orderings of WESM on different datasets, while exhibiting a larger spread, are still similar.
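This calculation can be verified with a few lines of Python; the orderings below are copied from Table 6.

def normalized_l1(a, b):
    # Normalized L_1-norm of two rank orderings (permutations of 1..k).
    k = len(a)
    d_k = sum(abs((k - i + 1) - i) for i in range(1, k + 1))  # maximum distance D_k
    return sum(abs(x - y) for x, y in zip(a, b)) / d_k

o1 = [1, 4, 5, 3, 2]  # R-AVG over DUC-02 (Table 6)
o2 = [1, 3, 5, 4, 2]  # WESM over DUC-02
o3 = [1, 4, 5, 3, 2]  # WESM over NewsIR-16
print(normalized_l1(o1, o2), normalized_l1(o2, o3), normalized_l1(o1, o3))
# prints 1/6, 1/6, and 0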
5.4 Runtime Analysis
We carried out runtime analysis through experiments
on a computer with a 3.5 GHz Intel Xeon CPU E5-
1620 v3. We used a Python implementation of our
summarization algorithms. Since DUC-02 documents are short, all but the LDA-based algorithms run in about the same time. To obtain a finer distinction, we ran our experiments on NewsIR-16. Since the average size of NewsIR-16 articles is 405 words, we selected at random a number of articles from NewsIR-16 and merged them to generate a new article. For each size from around 500 to around 10,000 words, with increments of 500 words, we selected at random 100 such articles and computed the average runtime of different algorithms to produce a 30% summary (see Figure 3). We note that the runtime of each of our algorithms is incurred mainly in the preprocessing phase; the size of the summary in the sentence selection phase only causes minor fluctuations in computation time, and so it suffices to compare the runtimes for producing 30% summaries.
Figure 3: Runtime analysis, where the unit on the x-axis is
100 words and the unit of the y-axis is seconds.
We can see from Figure 3 that ESRAKE and PRAKE incur about the same linear time and are extremely fast. Also, ET3Rank, ET2RAKE, and T2RAKE incur about the same time. While their time is higher because of the use of TextTiling, it is close to linear and meets the realtime requirements. For example, for a document of up to 3,000 words, over 3,000 but less than 5,500 words, and 10,000 words, respectively, the runtime of ET3Rank is under 0.5, 1, and 2.75 seconds.

The runtime of SubmodularF is acceptable for documents of moderate sizes (not shown in the paper), but for a document of about 10,000 words the runtime is close to 4 seconds. The runtime of the LDA-based algorithms is much higher. For example, LDARAKE incurs about 16 seconds for a document of about 2,000 words, about 41 seconds for a document of about 5,000 words, and about 79 seconds for a document of about 10,000 words.
6 CONCLUSIONS
We presented a number of unsupervised single-
document summarization algorithms for generating
effective summaries in realtime and a new measure
based on word-embedding similarities to evaluate the
quality of a summary. We showed that ET3Rank is the
best all-around algorithm. A web-based summariza-
tion tool using ET3Rank and T2RAKE will be made
available to the public.
To obtain better topic clusterings efficiently, we plan to extend TextTiling to non-consecutive paragraphs. To obtain a better under-
standing of word-embedding similarity measures, we
plan to compare WESM with human evaluation and
other unsupervised methods including those devised
by Louis and Nenkova (Louis and Nenkova, 2009).
We also plan to explore new ways to measure sum-
mary qualities without human-generated benchmarks.
ACKNOWLEDGEMENTS
We thank the members of the Text Automation Lab
at UMass Lowell for their support and fruitful discus-
sions.
REFERENCES
Aslam, J. A. and Frost, M. (2003). An information-theoretic
measure for document similarity. In SIGIR ’03: Pro-
ceedings of the 26th annual international ACM SI-
GIR conference on Research and development in in-
formaion retrieval, pages 449–450, New York, NY,
USA. ACM.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of machine Learning re-
search, 3(Jan):993–1022.
Boutsioukis, G. (2016). Natural language toolkit: Texttil-
ing.
Brin, S. and Page, L. (1998). The anatomy of a large-scale
hypertextual Web search engine. Computer Networks
and ISDN Systems, 30:107–117.
Cheng, J. and Lapata, M. (2016). Neural summariza-
tion by extracting sentences and words. CoRR,
abs/1603.07252.
Corney, D., Albakour, D., Martinez, M., and Moussa, S.
(2016). What do a million news articles look like?
In Proceedings of the First International Workshop
on Recent Trends in News Information Retrieval co-
located with 38th European Conference on Informa-
tion Retrieval (ECIR 2016), Padua, Italy, March 20,
2016., pages 42–47.
Dasgupta, A., Kumar, R., and Ravi, S. (2013). Summariza-
tion through submodularity and dispersion. In ACL
(1), pages 1014–1022. The Association for Computer
Linguistics.
DUC (2002). Document understanding conference 2002.
Foundation, W. (2017). Wikimedia downloads.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse
rectifier neural networks. In Aistats, volume 15, page
275.
Hearst, M. A. (1997). Texttiling: Segmenting text into
multi-paragraph subtopic passages. Computational
linguistics, 23(1):33–64.
Kleinberg, J. M. (1999). Authoritative sources in a hy-
perlinked environment. Journal of the ACM (JACM),
46(5):604–632.
Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger,
K. Q. (2015). From word embeddings to document
distances. In Proceedings of the 32nd International
Conference on Machine Learning (ICML 2015), pages
957–966.
Lin, H. and Bilmes, J. A. (2011). A class of submodular
functions for document summarization. In Lin, D.,
Matsumoto, Y., and Mihalcea, R., editors, ACL, pages
510–520. The Association for Computer Linguistics.
Louis, A. and Nenkova, A. (2009). Automatically evalu-
ating content selection in summarization without hu-
man models. In Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Process-
ing: Volume 1 - Volume 1, EMNLP ’09, pages 306–
314, Stroudsburg, PA, USA. Association for Compu-
tational Linguistics.
Matsuo, Y. and Ishizuka, M. (2003). Keyword extraction
from a single document using word co-occurrence sta-
tistical information. In Proceedings of the Sixteenth
International Florida Artificial Intelligence Research
Society Conference, pages 392–396. AAAI Press.
Mihalcea, R. and Tarau, P. (2004). Textrank: Bringing or-
der into texts. In Proceedings of EMNLP-04 and the
2004 Conference on Empirical Methods in Natural
Language Processing.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013b). Distributed representations of words
and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–
3119.
MIT (2014). Textrank implementation in python.
MIT (2015). A python implementation of the rapid auto-
matic keyword extraction.
Nallapati, R., Zhou, B., dos Santos, C. N., Gülçehre, Ç., and Xiang, B. (2016). Abstractive text summarization using
sequence-to-sequence rnns and beyond. In CoNLL,
pages 280–290. ACL.
Parveen, D., Mesgar, M., and Strube, M. (2016). Gener-
ating coherent summaries of scientific articles using
coherence patterns. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural Language
Processing, pages 772–783.
Parveen, D., Ramsl, H.-M., and Strube, M. (2015). Topical
coherence for graph-based extractive summarization.
In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D.,
and Marton, Y., editors, EMNLP, pages 1949–1954.
The Association for Computational Linguistics.
Parveen, D. and Strube, M. (2015). Integrating importance,
non-redundancy and coherence in graph-based extrac-
tive summarization. In Yang, Q. and Wooldridge, M.,
editors, IJCAI, pages 1298–1304. AAAI Press.
Radev, D. R., Allison, T., Blair-Goldensohn, S., Blitzer, J.,
Celebi, A., Dimitrov, S., Drabek, E., Hakim, A., Lam,
W., Liu, D., et al. (2004). Mead-a platform for multi-
document multilingual text summarization. In LREC.
Rehurek, R. (2017). gensim 2.0.0.
Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010).
Automatic keyword extraction from individual docu-
ments. In Berry, M. W. and Kogan, J., editors, Text
Mining. Applications and Theory, pages 1–20. John
Wiley and Sons, Ltd.
Rush, A. M., Chopra, S., and Weston, J. (2015). A neural at-
tention model for abstractive sentence summarization.
CoRR, abs/1509.00685.
Salton, G. and Buckley, C. (1987). Term weighting ap-
proaches in automatic text retrieval. Technical report,
Cornell University, Ithaca, NY, USA.
Wan, X. (2010). Towards a unified approach to simultane-
ous single-document and multi-document summariza-
tions. In Proceedings of the 23rd international confer-
ence on computational linguistics, pages 1137–1145.
Association for Computational Linguistics.
Wan, X. and Xiao, J. (2010). Exploiting neighborhood
knowledge for single document summarization and
keyphrase extraction. ACM Trans. Inf. Syst., 28(2).
Woodsend, K. and Lapata, M. (2010). Automatic genera-
tion of story highlights. In Proceedings of the 48th
Annual Meeting of the Association for Computational
Linguistics, pages 565–574. Association for Compu-
tational Linguistics.