Web Content Classification based on Topic and Sentiment Analysis of Text

Shuhua Liu, Thomas Forss

2014

Abstract

Automatic classification of web content has been studied extensively, using different learning methods and tools and investigating different datasets to serve different purposes. Most studies have made use of the content and structural features of web pages. However, previous experience has shown that certain groups of web pages, such as those that contain hatred and violence, are much harder to classify with good accuracy even when both content and structural features are taken into consideration. In this study we present a new approach for automatically classifying web pages into pre-defined topic categories. We apply text summarization and sentiment analysis techniques to extract topic and sentiment indicators of web pages. We then build classifiers based on combined topic and sentiment features. A large number of experiments were carried out. Our results suggest that incorporating the sentiment dimension can indeed bring much added value to web content classification. Classifiers based on topic similarity alone did not perform well, but when topic similarity and sentiment features are combined, the classification model performance is significantly improved for many web categories. Our study offers valuable insights and inputs to the development of web detection systems and Internet safety solutions.
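
The idea of combining a topic similarity score with sentiment indicators into one feature vector can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tiny sentiment lexicon, the function names, and the use of raw term counts with cosine similarity (in place of the summarization-based topic extraction described above) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy sentiment lexicon (an assumption, not taken from the paper).
POSITIVE = {"good", "love", "peace", "happy"}
NEGATIVE = {"hate", "kill", "attack", "violent"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_similarity(page_text, category_terms):
    """Similarity between a page and a list of terms describing a category."""
    return cosine(Counter(tokenize(page_text)), Counter(category_terms))


def sentiment_features(page_text):
    """Share of tokens found in the positive and negative lexicons."""
    tokens = tokenize(page_text)
    n = len(tokens) or 1  # avoid division by zero on empty pages
    pos = sum(t in POSITIVE for t in tokens) / n
    neg = sum(t in NEGATIVE for t in tokens) / n
    return [pos, neg]


def combined_features(page_text, category_terms):
    """Feature vector: [topic similarity, positive share, negative share]."""
    return [topic_similarity(page_text, category_terms)] + sentiment_features(page_text)


features = combined_features("They hate and attack people", ["hate", "attack", "violence"])
```

A vector of this kind could then be passed to any standard classifier. Keeping the topic score and the sentiment scores as separate dimensions lets the learner weigh them differently for each category.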



Paper Citation


in Harvard Style

Liu S. and Forss T. (2014). Web Content Classification based on Topic and Sentiment Analysis of Text. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 300-307. DOI: 10.5220/0005101803000307


in Bibtex Style

@conference{kdir14,
author={Shuhua Liu and Thomas Forss},
title={Web Content Classification based on Topic and Sentiment Analysis of Text},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={300-307},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005101803000307},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Web Content Classification based on Topic and Sentiment Analysis of Text
SN - 978-989-758-048-2
AU - Liu S.
AU - Forss T.
PY - 2014
SP - 300
EP - 307
DO - 10.5220/0005101803000307