Temporal-based Feature Selection and Transfer Learning for Text Categorization

Fumiyo Fukumoto, Yoshimi Suzuki

2015

Abstract

This paper addresses the text categorization problem in which the training data may come from a different time period than the test data. We present a method for text categorization that minimizes the impact of such temporal effects. Like much previous work on text categorization, we use feature selection. We select two types of informative terms according to corpus statistics. The first is temporal-independent terms, which are salient across the full temporal range of the training documents. The second is temporal-dependent terms, which are important for a specific time period. To the training documents represented by these independent and dependent terms, we apply boosting-based transfer learning to learn an accurate model for timeline adaptation. Results on Japanese data show that the method is comparable to the current state-of-the-art biased-SVM method: the macro-averaged F-score obtained by our method was 0.688, and that of biased-SVM was 0.671. Moreover, we found that the method is especially effective when the creation time period of the test data differs greatly from that of the training data.
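
The split into temporal-independent and temporal-dependent terms can be illustrated with a minimal sketch. The abstract does not name the corpus statistic used, so the sketch assumes a simple stand-in: a term counts as salient in a period when it occurs in at least `min_ratio` of that period's documents. The function name `split_terms`, the `min_ratio` parameter, and the toy documents are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def split_terms(docs, min_ratio=0.75):
    """Split terms into temporal-independent and temporal-dependent sets.

    docs: iterable of (period, tokens) pairs.
    Assumption (not from the paper): a term is "salient" in a period if it
    appears in at least min_ratio of that period's documents.
    """
    docs_per_period = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))  # doc_freq[term][period]

    for period, tokens in docs:
        docs_per_period[period] += 1
        for term in set(tokens):  # count each term once per document
            doc_freq[term][period] += 1

    independent = set()  # salient in every period
    dependent = {}       # term -> set of periods in which it is salient
    for term, counts in doc_freq.items():
        salient = {
            p for p, n in counts.items()
            if n / docs_per_period[p] >= min_ratio
        }
        if len(salient) == len(docs_per_period):
            independent.add(term)
        elif salient:
            dependent[term] = salient
    return independent, dependent


# Toy corpus: two periods, two documents each (hypothetical data).
toy_docs = [
    (2001, ["election", "vote", "party"]),
    (2001, ["election", "party", "scandal"]),
    (2005, ["election", "vote", "reform"]),
    (2005, ["election", "reform", "party"]),
]
independent_terms, dependent_terms = split_terms(toy_docs)
```

On this toy corpus, "election" is salient in both periods and so lands in the independent set, while "party" and "reform" are each salient in only one period. In the method described above, documents represented by both kinds of terms would then be passed to the boosting-based transfer learner, which this sketch does not cover.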


Paper Citation


in Harvard Style

Fukumoto F. and Suzuki Y. (2015). Temporal-based Feature Selection and Transfer Learning for Text Categorization. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015), ISBN 978-989-758-158-8, pages 17-26. DOI: 10.5220/0005593100170026


in Bibtex Style

@conference{kdir15,
author={Fumiyo Fukumoto and Yoshimi Suzuki},
title={Temporal-based Feature Selection and Transfer Learning for Text Categorization},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={17-26},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005593100170026},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - Temporal-based Feature Selection and Transfer Learning for Text Categorization
SN - 978-989-758-158-8
AU - Fukumoto F.
AU - Suzuki Y.
PY - 2015
SP - 17
EP - 26
DO - 10.5220/0005593100170026