Authors: Sanzhar Aubakirov 1 ; Paulo Trigo 2 and Darhan Ahmed-Zaki 1

Affiliations: 1 Department of Computer Science, Al-Farabi Kazakh National University, Almaty, Kazakhstan ; 2 Instituto Superior de Engenharia de Lisboa, Biosystems and Integrative Sciences Institute Agent and Systems Modeling, Lisbon, Portugal

ISBN: 978-989-758-318-6

Keyword(s): Distributed Computing, Text Processing, Machine Learning, Hyperparameters Optimization.

Related Ontology Subjects/Areas/Topics: Business Analytics ; Data Engineering ; Data Management and Quality ; Text Analytics

Abstract: In this paper, we propose an optimization workflow to predict classifier accuracy based on the exploration of the space composed of different data features and the configurations of the classification algorithms. The overall process is described considering the text classification problem. We take three main features that affect text classification and therefore the accuracy of classifiers. The first feature considers the words that comprise the input text; here we use the N-gram concept with different N values. The second feature considers the adoption of textual pre-processing steps such as stop-word filtering and stemming. The third feature considers the hyperparameters of the classification algorithms. We take the well-known classifiers K-Nearest Neighbors (KNN) and Naive Bayes (NB), where K (from KNN) and the a priori probabilities (from NB) are hyperparameters that influence accuracy. As a result, we explore the feature space (correlation among textual and classifier aspects) and present an approximation model that is able to predict classifier accuracy.
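
The exploration described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It enumerates combinations of N (N-gram size), stop-word filtering on or off, and K (for KNN), and scores each combination by leave-one-out accuracy. The toy corpus, the stop-word list, the Jaccard similarity, and all function names are assumptions made for illustration; stemming, Naive Bayes, and the approximation model that predicts accuracy are left out.

```python
from collections import Counter
from itertools import product

# Toy corpus, invented purely for illustration (not from the paper).
DOCS = [
    ("the match ended with a late goal", "sport"),
    ("the team won the final match", "sport"),
    ("a late goal decided the final", "sport"),
    ("the bank raised the interest rate", "finance"),
    ("the market fell after the rate decision", "finance"),
    ("the bank reported a market loss", "finance"),
]
STOP_WORDS = {"the", "a", "with", "after"}  # hypothetical stop-word list


def ngrams(text, n, drop_stop_words):
    """Return the set of word n-grams of `text`, optionally after stop-word filtering."""
    tokens = text.split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(query, train, k):
    """Majority label among the k training items most similar to `query`."""
    ranked = sorted(train, key=lambda item: jaccard(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(n, drop_stop_words, k):
    """Leave-one-out accuracy of KNN for one (N, filtering, K) configuration."""
    feats = [(ngrams(text, n, drop_stop_words), label) for text, label in DOCS]
    hits = 0
    for i, (query, label) in enumerate(feats):
        train = feats[:i] + feats[i + 1:]  # hold out the i-th document
        hits += knn_predict(query, train, k) == label
    return hits / len(feats)


def explore(n_values=(1, 2), stop_options=(False, True), k_values=(1, 3)):
    """Evaluate every (N, stop-word filtering, K) combination in the grid."""
    return {
        cfg: loo_accuracy(*cfg)
        for cfg in product(n_values, stop_options, k_values)
    }


if __name__ == "__main__":
    for cfg, acc in sorted(explore().items()):
        print(cfg, round(acc, 2))
```

The resulting mapping from configuration to measured accuracy is the kind of data on which an approximation model could then be fitted; an exhaustive grid is used here only because the toy space is tiny.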

