Improving Document Clustering Performance: The Use of an Automatically Generated Ontology to Augment Document Representations

Stephen Bradshaw, Colm O'Riordan, Daragh Bradshaw

2017

Abstract

Clustering documents is a common task in a range of information retrieval systems and applications. Many approaches for improving the clustering process have been proposed. One approach is the use of an ontology to better inform the classifier of word context by expanding the items to be clustered. WordNet is commonly cited as an appropriate source from which to draw the additional terms; however, it may not be sufficient to achieve strong performance. We have two aims in this paper. First, we show that the use of WordNet may lead to suboptimal performance. This problem may be accentuated when a document set has been drawn from comments made in social forums, due to the unstructured nature of online conversations compared to standard document sets. Second, we propose a novel method which involves constructing a bespoke ontology that facilitates better clustering. We present a study of clustering applied to a sample of threads from a social forum and investigate the effectiveness of the application of these methods.
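As a rough illustration of the general idea of ontology-based expansion (not the authors' method, which the abstract does not detail), the sketch below augments each document's tokens with related terms from a small lookup table and compares bag-of-words cosine similarity before and after. The `TOY_ONTOLOGY` table, the sample documents, and the helper names `expand` and `cosine` are all hypothetical and chosen only for this example.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy ontology (term -> related terms); purely illustrative,
# not derived from WordNet or from the paper's generated ontology.
TOY_ONTOLOGY = {
    "gpu": ["graphics", "hardware"],
    "cpu": ["processor", "hardware"],
    "striker": ["football", "goal"],
    "keeper": ["football", "goal"],
}


def expand(tokens, ontology):
    """Return the tokens plus any related terms the ontology lists for them."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(ontology.get(tok, []))
    return expanded


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Short, forum-like sample documents (assumed for illustration).
docs = [
    "my gpu overheats".split(),
    "the cpu is slow".split(),
    "striker scored twice".split(),
    "keeper saved it".split(),
]

raw = [Counter(d) for d in docs]                          # original representation
aug = [Counter(expand(d, TOY_ONTOLOGY)) for d in docs]    # augmented representation

if __name__ == "__main__":
    print("raw similarity, docs 0 and 1:", cosine(raw[0], raw[1]))
    print("augmented similarity, docs 0 and 1:", cosine(aug[0], aug[1]))
    print("augmented similarity, docs 0 and 2:", cosine(aug[0], aug[2]))
```

In this toy setting, the two hardware posts share no surface terms, so their raw similarity is zero; after expansion they overlap on an added term while remaining dissimilar to the football posts. Any distance-based clustering algorithm could then be run on the augmented vectors.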


Paper Citation


in Harvard Style

Bradshaw S., O'Riordan C. and Bradshaw D. (2017). Improving Document Clustering Performance: The Use of an Automatically Generated Ontology to Augment Document Representations. In Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, ISBN 978-989-758-271-4, pages 215-223. DOI: 10.5220/0006500202150223


in Bibtex Style

@conference{kdir17,
author={Stephen Bradshaw and Colm O'Riordan and Daragh Bradshaw},
title={Improving Document Clustering Performance: The Use of an Automatically Generated Ontology to Augment Document Representations},
booktitle={Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2017},
pages={215-223},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006500202150223},
isbn={978-989-758-271-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - Improving Document Clustering Performance: The Use of an Automatically Generated Ontology to Augment Document Representations
SN - 978-989-758-271-4
AU - Bradshaw S.
AU - O'Riordan C.
AU - Bradshaw D.
PY - 2017
SP - 215
EP - 223
DO - 10.5220/0006500202150223