loading
Documents

Research.Publish.Connect.

Paper

Paper Unlock

Authors: Niraj Kumar ; Venkata Vinay Babu Vemula ; Kannan Srinathan and Vasudeva Varma

Affiliation: International Institute of Information Technology, India

ISBN: 978-989-8425-28-7

Keyword(s): Document clustering, Group-average agglomerative clustering, Community detection, Similarity measure, N-gram, Wikipedia based additional knowledge.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Information Extraction ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Symbolic Systems

Abstract: This paper provides a solution to the issue: “How can we use Wikipedia based concepts in document clustering with lesser human involvement, accompanied by effective improvements in result?” In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia based additional knowledge for GAAC based document clustering. The importance of N-grams in a document depends on a many features including, but not limited to: frequency, position of their occurrence in a sentence and the position of the sentence in which they occur, in the document. First, we introduce a new similarity measure, which takes the weighted N-gram importance into account, in the calculation of similarity measure while performing document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base both, to remove noisy entries from the extracted N-grams and to reduce the information gap betwee n N-grams that are conceptually-related, which do not have a match owing to differences in writing scheme or strategies. Our experimental results on the publicly available text dataset clearly show that our devised system has a significant improvement in performance over bag-of-words based state-of-the-art systems in this area. (More)

PDF ImageFull Text

Download
Sign In Guest: Register as new SciTePress user now for free.

Sign In SciTePress user: please login.

PDF ImageMy Papers

You are not signed in, therefore limits apply to your IP address 3.85.92.139

In the current month:
Recent papers: 100 available of 100 total
2+ years older papers: 200 available of 200 total

Paper citation in several formats:
Kumar, N.; Vinay Babu Vemula, V.; Srinathan, K. and Varma, V. (2010). EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING .In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 182-187. DOI: 10.5220/0003081201820187

@conference{kdir10,
author={Niraj Kumar. and Venkata Vinay Babu Vemula. and Kannan Srinathan. and Vasudeva Varma.},
title={EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING },
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={182-187},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003081201820187},
isbn={978-989-8425-28-7},
}

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
SN - 978-989-8425-28-7
AU - Kumar, N.
AU - Vinay Babu Vemula, V.
AU - Srinathan, K.
AU - Varma, V.
PY - 2010
SP - 182
EP - 187
DO - 10.5220/0003081201820187

Login or register to post comments.

Comments on this Paper: Be the first to review this paper.