Authors:
Tomasz Walkowiak 1 and Piotr Malak 2
Affiliations:
1 Wroclaw University of Science and Technology, Poland; 2 University of Wroclaw, Poland
Keyword(s):
NLP, Polish, Text Classification, Feature Selection, Weighting Schema, Supervised Machine Learning.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Computational Intelligence; Data Mining; Databases and Information Systems Integration; Enterprise Information Systems; Evolutionary Computing; Knowledge Discovery and Information Retrieval; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Machine Learning; Natural Language Processing; Pattern Recognition; Sensor Networks; Signal Processing; Soft Computing; Symbolic Systems
Abstract:
The paper presents the preparation, execution, and results of an evaluation of the efficiency of text
classification (TC) methods for Polish. Polish is an inflectional language with complex morphology, so
proper text preprocessing is needed to guarantee reliable TC. A set of preprocessing rules was applied,
based on the authors' practical experience from earlier TC, IR, and general NLP experiments. The
feature-document matrix was also designed around the most promising selected features. About 216
experiments were conducted on an exemplar corpus in a subject (topic) classification task, with different
preprocessing, weighting, and filtering (for dimensionality reduction) schemes and different classifiers.
The results show no substantial increase in accuracy from most classical preprocessing steps when the
corpus is large (at least 1000 exemplars per class). The largest impact the authors were able to obtain
concerned the system costs of TC processes, not TC accuracy.
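
The weighting and filtering steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific schemes, so TF-IDF weighting and a document-frequency threshold are assumed here purely for illustration. The function name `build_matrix`, the parameter `min_df`, and the toy documents are hypothetical, and whitespace tokenization stands in for the morphological preprocessing that Polish would normally need.

```python
import math
from collections import Counter


def build_matrix(docs, min_df=2):
    """Return (vocabulary, rows) of a TF-IDF weighted feature-document matrix.

    Assumptions (not from the paper): whitespace tokenization, TF-IDF
    weighting, and filtering by minimum document frequency.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Filtering: drop terms seen in fewer than min_df documents,
    # which reduces the number of dimensions.
    vocab = sorted(term for term, count in df.items() if count >= min_df)

    # Weighting: inverse document frequency for each kept term.
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        rows.append([tf[term] * idf[term] for term in vocab])
    return vocab, rows


# Toy documents, invented for this example only.
docs = [
    "sport mecz wynik",
    "sport liga mecz",
    "gospodarka rynek wynik",
]
vocab, rows = build_matrix(docs, min_df=2)
```

Raising `min_df` shrinks the vocabulary and therefore the matrix, which is one way a filtering scheme can lower processing cost without necessarily changing accuracy.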