
 
relative responsiveness (5-point scale, 1 = worst, 5 = best) of the summaries was also assessed by measuring the amount of information that helps the user successfully retrieve information. This measure seems to coincide with our usefulness measure, which is assessed in our experiment through questions 1, 2, and 5. The average responsiveness in DUC 2005 was 48% for automatic summaries, against 93% for the reference ones (Hachey et al., 2005).
Looking at our data accordingly, usefulness averaged 72% across those three questions. Certainly, this is quite a simplistic comparison, given the profound differences between the two assessments. However, the overall design of the experiment is quite meaningful when compared to the DUC ones. To make it statistically significant, we must invest in its robustness (e.g., by increasing the number of Web users and search engine answers).
6 FINAL REMARKS 
The reported results show a significant proximity between ExtraWeb and Google. This means that ExtraWeb may also help users decide whether to retrieve documents, despite their relatively low score (68%) on full satisfaction with the results of the emulated search task. Although the experiment was not intended to control either the homogeneity of the judging population or its subjectivity in accomplishing the demanded task, the analysis of the judges' scores shows that the overall judgment was quite consistent. However, the same extrinsic task-oriented evaluation may yield different results when a larger number of both judges and retrieved documents is taken into account. Users assessing the same task and the same set of search engine results would not necessarily respond in the same way, and some of them might read the extracts more carefully than others; as a consequence, their judgments could be more accurate. This is very likely to become evident when scaling up the type of assessment reported in this paper.
Another important point is that ExtraWeb is domain-independent. However, it depends on keywords previously marked up in the HTML, which are usually supplied by the documents' authors. The alternative would be to generate a keyword list through statistical methods such as Luhn's. However, this would not yield keywords as expressive as the authored ones.
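The statistical alternative mentioned above can be illustrated with a minimal Luhn-style sketch: content words are ranked by raw frequency after stop-word removal, and the most frequent ones are taken as keywords. The stop-word list and function name below are illustrative assumptions, not part of ExtraWeb (which, for Portuguese documents, would need a fuller Portuguese stop-word list).

```python
import re
from collections import Counter

# Illustrative (hypothetical) stop-word list; a real system would use
# a much larger one, in the language of the target documents.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}

def luhn_keywords(text, top_n=5):
    """Luhn-style keyword extraction: frequent content words
    (after stop-word filtering) are taken as keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

sample = ("Search engines return extracts of web documents. "
          "Extracts help users judge documents before retrieving "
          "the documents themselves.")
print(luhn_keywords(sample, top_n=3))
```

As the paper notes, such frequency-based keywords tend to be less expressive than author-supplied ones: they capture repetition, not intent.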
Future work shall address both improving the enrichment of the ontology and assessing the system more broadly, in a distributed environment and in real time. It will most probably also be relevant to reproduce quality questions similar to those used in the most recent DUCs.
REFERENCES 
Amitay, E., 2001. What lays in the layout: Using anchor-paragraph arrangements to extract descriptions of Web documents. PhD Thesis. Department of Computing, Macquarie University.
Barros, F. A., Gonçalves, P. F., Santos, T. L. V. L., 1998. 
Providing Context to Web Searches: The Use of 
Ontologies to Enhance Search Engine's Accuracy. 
Journal of the Brazilian Computer Society, 5(2):45-55. 
Chirita, P. A., Nejdl, W., Paiu, R., Kohlschütter, C., 2005. 
Using ODP meta-data to personalize search. In the 
Proc. of the 28th Annual International ACM SIGIR 
Conference on Research and Development in 
Information Retrieval, pp. 178-185. 
Conklin, J., 1987. Hypertext: An Introduction and Survey. 
IEEE Computer, 20(9), pp.17-41. 
Dorr, B., Monz, C., President, S., Schwartz, R.,  Zajic, D., 
2005. A Methodology for Extrinsic Evaluation of Text 
Summarization: Does ROUGE Correlate? In the Proc. 
of the ACL Workshop on Intrinsic and Extrinsic 
Evaluation Measures for Machine Translation and/or 
Summarization, pp. 1-8. 
Edmundson, H. P., 1969. New Methods in Automatic 
Extracting. Journal of the ACM, 16(2):264-285. 
Greghi, J. G., Martins, R. T., Nunes, M. G. V., 2002. Diadorim: a lexical database for Brazilian Portuguese. In the Proc. of the Third International Conference on Language Resources and Evaluation, 4:1346-1350.
Griesbaum, J., 2004. Evaluation of three German search 
engines: Altavista.de, Google.de and Lycos.de. 
Information Research, 9(4), paper 189. 
Hachey, B., Murray, G., Reitter, D., 2005. Embra System at DUC 2005: Query-oriented multi-document summarization with a very large latent semantic space. Document Understanding Conference 2005, Vancouver, British Columbia, Canada.
Haveliwala, T. H., 2002. Topic-sensitive PageRank. In the 
Proc. of the Eleventh International World Wide Web 
Conference, Honolulu, Hawaii. 
Inktomi-Corp., 2003. Web search relevance test. Veritest. Available at http://www.veritest.com/clients/reports/inktomi/inktomi_Web_search_test.pdf [March 2006].
Jansen, B. J., Spink, A., Saracevic, T., 2000. Real life, real 
users, and real needs: a study and analysis of user 
queries on the web. Information Processing and 
Management, 36(2):207-227.  
Lewis, J. R., 1995. Computer Usability Satisfaction 
Questionnaires: Psychometric Evaluation and 
Instructions for Use. International Journal of Human-
Computer Interaction, 7(1):57-78. 
Liang, S. F., Devlin, S., Tait, J., 2004. Feature Selection 
for Summarising: The Sunderland DUC 2004 