Authors:
Asmaa Mountassir; Houda Benbrahim and Ilham Berrada
Affiliation:
ENSIAS and Mohamed 5 University, Morocco
Keyword(s):
Sentiment Analysis, Opinion Mining, Unbalanced Data Sets, Machine Learning, Text Classification, Natural Language Processing, Arabic Language.
Related Ontology Subjects/Areas/Topics:
AI Programming; Applications; Applications and Case-studies; Artificial Intelligence; Knowledge Engineering and Ontology Development; Knowledge Reengineering; Knowledge Representation; Knowledge-Based Systems; Natural Language Processing; Pattern Recognition; Symbolic Systems
Abstract:
Sentiment analysis is a research area that focuses on processing and analysing opinions available on the web. This paper deals with the problem of unbalanced data sets in supervised sentiment classification. We propose three methods for under-sampling the majority-class documents, namely Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of the proposed methods with that of common random under-sampling. For classification, we use three standard classifiers: Naïve Bayes, Support Vector Machines and k-Nearest Neighbours. The experiments are carried out on two Arabic data sets that we built and labelled manually. We show that the results obtained on the first data set, which is slightly skewed, are better than those obtained on the second, which is highly skewed. The results also show that the proposed techniques can be relied upon and are typically competitive with random under-sampling.
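For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of random under-sampling only. It is not the authors' code, and it does not implement Remove Similar, Remove Farthest or Remove by Clustering, whose details are not given here. The function name `random_undersample`, the `seed` parameter, the choice to shrink every class to the size of the smallest one, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(docs, labels, seed=0):
    """Randomly drop documents so every class matches the smallest class size.

    Hypothetical helper for illustration; balancing to the minority size
    is an assumption, not a detail taken from the paper.
    """
    rng = random.Random(seed)

    # Group documents by their class label.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    # The minority class size is the target size for every class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        # Sample without replacement from classes larger than the target.
        kept = rng.sample(items, target) if len(items) > target else list(items)
        balanced.extend((doc, label) for doc in kept)

    # Shuffle so the classes are not stored in contiguous blocks.
    rng.shuffle(balanced)
    return [d for d, _ in balanced], [l for _, l in balanced]


# Toy skewed data set: 8 positive documents, 2 negative ones (made up).
docs = [f"doc{i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2

X, y = random_undersample(docs, labels)
counts = Counter(y)
```

After the call, `counts` holds the same number of documents for each class, and all minority-class documents are retained. Which majority-class documents survive depends on the seed, which is the source of variance that informed selection strategies try to reduce.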