
 
presents the minority class. 
Our experiments consist in balancing the two 
classes of each data set using the four studied 
under-sampling methods, i.e. RR, RS, RF and RC. 
We then evaluate the performance of the three 
classifiers on the balanced data sets.  
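As an illustration of the balancing step, the following minimal sketch reduces the majority class by random removal until both classes have the same size. It is not the implementation used in our experiments: the function name `random_undersample`, the toy documents and the labels are hypothetical, and the sketch covers only random removal, not the selection criteria of the other methods.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop items so that every class has the size of the smallest class."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # The minority class size is the target size for every class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for y, items in by_class.items():
        # Larger classes are sampled without replacement; the minority class is kept whole.
        kept = rng.sample(items, target) if len(items) > target else list(items)
        balanced.extend((x, y) for x in kept)

    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


# Hypothetical toy data: 8 negative documents, 2 positive documents.
docs = [f"doc{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2
X, y = random_undersample(docs, labels)
print(Counter(y))
```

The balanced lists `X` and `y` can then be passed to any classifier in place of the original data set.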
Our results show that the performance obtained on 
DSMR is better than that obtained on DSPo. This 
suggests that the more unbalanced the data set is, 
the worse the results are.  
Comparing the under-sampling methods, we can 
say that, in general, the four methods give similar 
results, but in most cases RR yields the best 
results. RF is not recommended for NB; it is 
better suited to SVM. For kNN, we do not 
recommend using RS. 
As future work, we plan to perform the 
same experiments on unbalanced data sets that are 
more homogeneous so as to validate our hypothesis 
about the impact of heterogeneity on the 
performance of the proposed techniques. We will 
also study the effectiveness of the four 
under-sampling methods when the majority class 
size is decreased progressively. On the one hand, 
we aim to see whether a 50%-50% balance is 
necessary to obtain the best results. On the other 
hand, we aim to observe how our classifiers 
behave, with the different under-sampling methods, 
at each step of the majority class reduction. 
Finally, we also plan to study feature selection 
techniques on unbalanced data sets 
of SA. 
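The progressive reduction of the majority class described above could be organised as in the following sketch. It is only an assumption about how such an experiment might be set up: the function names, the evenly spaced schedule, the number of steps and the class sizes (800 and 200) are all hypothetical, and random removal stands in for whichever under-sampling method is being studied.

```python
import random


def majority_size_schedule(n_majority, n_minority, steps):
    """Return steps + 1 majority-class sizes, from the full size down to the minority size."""
    if steps < 1 or n_majority < n_minority:
        raise ValueError("need steps >= 1 and n_majority >= n_minority")
    gap = n_majority - n_minority
    # Evenly spaced sizes; the last one corresponds to a 50%-50% balance.
    return [n_majority - round(gap * k / steps) for k in range(steps + 1)]


def shrink_majority(majority, size, seed=0):
    """Keep `size` majority-class items, chosen at random without replacement."""
    return random.Random(seed).sample(majority, size)


# Hypothetical class sizes, chosen only for illustration.
n_minority = 200
sizes = majority_size_schedule(n_majority=800, n_minority=n_minority, steps=4)
for size in sizes:
    share = size / (size + n_minority)
    print(f"majority size {size}: majority share {share:.0%}")
```

Training and evaluating the classifiers at each size in `sizes` would show whether the best results are reached before the 50%-50% point.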