A Domain Independent Double Layered Approach to Keyphrase Generation

Dario De Nart, Carlo Tasso

2014

Abstract

The annotation of documents and web pages with semantic metadata is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of available text data is too large for an extensive manual process. Automatic keyphrase generation, a complex task involving Natural Language Processing and Knowledge Engineering, can significantly support this activity. Several different strategies have been proposed over the years, but most of them require extensive training data, which are not always available, suffer from high ambiguity and differences in writing style, are highly domain-specific, and often rely on well-structured knowledge that is very hard to acquire and encode. To overcome these limitations, we propose in this paper an innovative domain-independent approach that consists of an unsupervised keyphrase extraction phase and a subsequent keyphrase inference phase based on loosely structured, collaborative knowledge such as Wikipedia, Wordnik, and Urban Dictionary. This double layered approach allows us to generate keyphrases that both describe and classify the text.
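
The two layers named in the abstract can be illustrated with a toy sketch. This is not the authors' system: the frequency-based n-gram scoring, the `STOPWORDS` list, the function names, and the small `knowledge` dictionary (a stand-in for collaborative sources such as Wikipedia or Wordnik) are all assumptions made for illustration. The first function extracts phrases that occur in the text; the second infers related concepts that need not occur in it.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with"}


def extract_keyphrases(text, top_n=3, max_len=2):
    """Layer 1 (unsupervised extraction): rank stopword-free n-grams by frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(tok in STOPWORDS for tok in gram):
                continue
            # Weight each occurrence by phrase length so multi-word phrases can win.
            scores[" ".join(gram)] += n
    # Sort by descending score, then alphabetically, for a deterministic result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_n]]


def infer_keyphrases(extracted, knowledge, top_n=2):
    """Layer 2 (inference): vote for concepts the knowledge source links to the extracted phrases."""
    votes = Counter()
    for phrase in extracted:
        for concept in knowledge.get(phrase, ()):
            if concept not in extracted:
                votes[concept] += 1
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_n]]


def generate_keyphrases(text, knowledge):
    """Run both layers: extracted phrases describe the text, inferred ones classify it."""
    extracted = extract_keyphrases(text)
    return extracted, infer_keyphrases(extracted, knowledge)


# Hypothetical knowledge base mapping phrases to related concepts.
knowledge = {
    "neural networks": ["machine learning", "artificial intelligence"],
    "networks": ["machine learning"],
    "neural": ["neuroscience"],
}
text = ("Neural networks learn features. Neural networks need data. "
        "Deep learning uses neural networks.")
extracted, inferred = generate_keyphrases(text, knowledge)
print(extracted, inferred)
```

In this sketch the inferred phrases never appear in the input text, which mirrors the abstract's distinction between keyphrases that describe a document and keyphrases that classify it.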



Paper Citation


in Harvard Style

De Nart D. and Tasso C. (2014). A Domain Independent Double Layered Approach to Keyphrase Generation. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 305-312. DOI: 10.5220/0004855303050312


in Bibtex Style

@conference{webist14,
author={Dario De Nart and Carlo Tasso},
title={A Domain Independent Double Layered Approach to Keyphrase Generation},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={305-312},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004855303050312},
isbn={978-989-758-024-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A Domain Independent Double Layered Approach to Keyphrase Generation
SN - 978-989-758-024-6
AU - De Nart D.
AU - Tasso C.
PY - 2014
SP - 305
EP - 312
DO - 10.5220/0004855303050312