Informativeness-based Keyword Extraction from Short Documents

Mika Timonen, Timo Toivanen, Yue Teng, Chao Chen, Liang He

2012

Abstract

With the rise of user-created content on the Internet, the focus of text mining has shifted. Twitter messages and product descriptions are examples of new corpora available for text mining. Keyword extraction, user modeling, and text categorization are all areas that aim to utilize this new data. However, because the documents within these corpora are considerably shorter than traditional ones, such as news articles, they also pose new challenges. In this paper, we focus on keyword extraction from documents such as event and product descriptions and movie plot lines, which often contain 30 to 60 words. We propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and three levels of word evaluation to address the challenges of short documents. We evaluate the performance of our approach using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments, IKE outperforms the competing methods.



Paper Citation


in Harvard Style

Timonen M., Toivanen T., Teng Y., Chen C. and He L. (2012). Informativeness-based Keyword Extraction from Short Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012), ISBN 978-989-8565-29-7, pages 411-421. DOI: 10.5220/0004130704110421


in Bibtex Style

@conference{sstm12,
author={Mika Timonen and Timo Toivanen and Yue Teng and Chao Chen and Liang He},
title={Informativeness-based Keyword Extraction from Short Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)},
year={2012},
pages={411-421},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004130704110421},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)
TI - Informativeness-based Keyword Extraction from Short Documents
SN - 978-989-8565-29-7
AU - Timonen M.
AU - Toivanen T.
AU - Teng Y.
AU - Chen C.
AU - He L.
PY - 2012
SP - 411
EP - 421
DO - 10.5220/0004130704110421