Estimating Sentiment via Probability and Information Theory

Kevin Labille, Sultan Alfarhood, Susan Gauch

2016

Abstract

Opinion detection and opinion analysis are challenging but important tasks. Such sentiment analysis can be done using traditional supervised learning methods such as naive Bayes classification and support vector machines (SVM), or using unsupervised approaches based on a lexicon. Because lexicon-based sentiment analysis methods rely on an opinion dictionary, that is, a list of opinion-bearing or sentiment words, sentiment lexicons play a key role. Our work focuses on the task of generating such a lexicon. We propose several novel methods to automatically generate a general-purpose sentiment lexicon using a corpus-based approach. While most existing methods generate a lexicon using a list of seed sentiment words and a domain corpus, our work differs by generating a lexicon from scratch using probabilistic and information-theoretic text mining techniques on a large, diverse corpus. We conclude by presenting an ensemble method that combines the two approaches. We evaluate and demonstrate the effectiveness of our methods by using the automatically generated lexicons for sentiment analysis. Our best single lexicon achieves an accuracy of 87.60% and the ensemble approach achieves 88.75%, both statistically significant improvements over the 81.60% obtained with a widely used sentiment lexicon.
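
To make the role of a sentiment lexicon concrete, the following is a minimal sketch of lexicon-based classification in general, not the authors' method. The lexicon entries, their scores, the function names, and the zero threshold are all hypothetical and chosen only for illustration; the abstract does not specify how the generated lexicons are applied.

```python
import re

# Hypothetical toy lexicon mapping a word to a polarity score.
# These entries are illustrative and are NOT taken from the paper.
TOY_LEXICON = {
    "great": 1.0,
    "love": 0.8,
    "good": 0.5,
    "bad": -0.5,
    "poor": -0.7,
    "terrible": -1.0,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_text(text, lexicon):
    """Sum the polarity scores of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def classify(text, lexicon, threshold=0.0):
    """Label text 'positive' if its total score exceeds the threshold
    (an assumed decision rule), otherwise 'negative'."""
    return "positive" if score_text(text, lexicon) > threshold else "negative"
```

Under this simple rule, a text containing no lexicon words scores 0 and falls on the negative side of the threshold; a real system would likely treat that case separately.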



Paper Citation


in Harvard Style

Labille K., Alfarhood S. and Gauch S. (2016). Estimating Sentiment via Probability and Information Theory. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 121-129. DOI: 10.5220/0006072101210129


in Bibtex Style

@conference{kdir16,
author={Kevin Labille and Sultan Alfarhood and Susan Gauch},
title={Estimating Sentiment via Probability and Information Theory},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={121-129},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006072101210129},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Estimating Sentiment via Probability and Information Theory
SN - 978-989-758-203-5
AU - Labille K.
AU - Alfarhood S.
AU - Gauch S.
PY - 2016
SP - 121
EP - 129
DO - 10.5220/0006072101210129