Author: Mika Timonen

Affiliation: VTT Technical Research Centre of Finland and University of Helsinki, Finland

ISBN: 978-989-8565-29-7

Keyword(s): Category Distribution, Feature Weighting, Short Document Categorization, SVM, Text Categorization.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Clustering and Classification Methods ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Mining Text and Semi-Structured Data ; Symbolic Systems

Abstract: Categorization of very short documents has become an important research topic in the field of text mining. Twitter status updates and market research data form an interesting corpus of documents that are in most cases less than 20 words long. Short documents have one major characteristic that differentiates them from traditional longer documents: each word usually occurs only once per document. This is called the TF=1 challenge. In this paper we conduct a comprehensive performance comparison of the current feature weighting and categorization approaches using corpora of very short documents. In addition, we propose a novel feature weighting approach called Fragment Length Weighted Category Distribution that takes the challenges of short documents into consideration. The proposed approach is based on previous work on Bi-Normal Separation and on short document categorization using a Naive Bayes classifier. We compare the performance of the proposed approach against several traditional approaches including Chi-Squared, Mutual Information, Term Frequency-Inverse Document Frequency and Residual Inverse Document Frequency. We also compare the performance of a Support Vector Machine classifier against other classification approaches such as k-Nearest Neighbors and Naive Bayes classifiers.
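The TF=1 challenge mentioned in the abstract can be illustrated with a minimal sketch. The toy corpus, the function names `idf` and `tf_idf`, and the plain logarithmic IDF variant below are all assumptions made for illustration; none of them come from the paper, and this is not the proposed Fragment Length Weighted Category Distribution method. The sketch only shows that when every term occurs once per document, a TF-IDF weight carries no more information than the IDF weight alone.

```python
import math
from collections import Counter

# Hypothetical toy corpus of very short documents (not data from the paper).
docs = [
    "great battery life",
    "battery died fast",
    "great screen",
    "screen cracked fast",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Plain logarithmic inverse document frequency (one common variant)."""
    return math.log(n_docs / df[term])


def tf_idf(doc):
    """TF-IDF weights for one tokenized document, using raw term counts."""
    tf = Counter(doc)
    return {term: tf[term] * idf(term) for term in tf}


weights = [tf_idf(doc) for doc in tokenized]

# Largest within-document term count observed anywhere in the corpus.
max_tf = max(max(Counter(doc).values()) for doc in tokenized)

# True when every TF-IDF weight equals the term's IDF, i.e. TF adds nothing.
tf_is_uninformative = all(
    math.isclose(w[term], idf(term)) for w in weights for term in w
)
```

In this corpus no term repeats within a document, so the term-frequency factor is always 1 and the weighting depends only on how terms are distributed across documents. That is the situation that motivates weighting schemes built on document or category distributions rather than within-document counts.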

CC BY-NC-ND 4.0


Paper citation in several formats:
Timonen, M. (2012). Categorization of Very Short Documents. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 5-16. DOI: 10.5220/0004108300050016

@conference{kdir12,
author={Mika Timonen},
title={Categorization of Very Short Documents},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={5-16},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004108300050016},
isbn={978-989-8565-29-7},
}

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Categorization of Very Short Documents
SN - 978-989-8565-29-7
AU - Timonen, M.
PY - 2012
SP - 5
EP - 16
DO - 10.5220/0004108300050016
