Authors: Tomasz Walkowiak 1 and Piotr Malak 2

Affiliations: 1 Wroclaw University of Science and Technology, Poland; 2 University of Wroclaw, Poland

ISBN: 978-989-758-275-2

Keyword(s): NLP, Polish, Text Classification, Feature Selection, Weighting Schema, Supervised Machine Learning.

Abstract: The paper presents the preparation, conduct and results of an evaluation of the efficiency of text classification (TC) methods for Polish. Polish is an inflectional language with complex morphology, so proper text preprocessing is strongly needed to guarantee reliable TC. Based on the authors' practical experience from earlier TC, IR and general NLP experiments, a set of preprocessing rules was applied. A feature-document matrix was also designed with respect to the most promising features selected. About 216 experiments were conducted on an exemplar corpus in a subject (topic) classification task, with different preprocessing, weighting and filtering (for dimensionality reduction) schemes and classifiers. The results show no substantial increase in accuracy from most classical preprocessing steps when the corpus is large (at least 1000 exemplars per class). The highest impact the authors were able to obtain concerned the system costs of TC processes, not TC accuracy.

