Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction

Been-Chian Chien

2014

Abstract

Data reduction is an important research topic for analyzing massive data efficiently and effectively in the era of big data. Dimension reduction is usually accomplished by feature selection, feature clustering, or algebraic transformation. This paper proposes a novel approach for reducing high-dimensional data. The main idea of the proposed scheme is to combine data clustering and feature selection to transform high-dimensional data into lower dimensions. The incremental clustering algorithm in the scheme is used to handle the number of dimensions, and a relative discriminant variable is designed for selecting significant features. Finally, a simple inner product operation is applied to transform the original high-dimensional data into low-dimensional data. The reduction approach is evaluated on the problem of document categorization. The experimental results show that the reduced data achieve high classification accuracy on most datasets. For some datasets, the reduced data achieve higher classification accuracy than the original data.
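
The abstract does not give the algorithmic details, so the following is only a minimal sketch of the general idea (cluster the features, then project each document onto the clusters with an inner product), not the paper's method. Everything in it is an assumption made for illustration: the per-feature class profiles, the cosine similarity threshold, the single-pass leader-style clustering, and the hard 0/1 membership weights. The relative discriminant variable used for feature selection is not defined in the abstract and is therefore left out.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def incremental_cluster(profiles, threshold):
    """Single-pass clustering (an assumed stand-in, not the paper's algorithm).

    Each feature joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster. The number of
    clusters, and hence of reduced dimensions, is not fixed in advance.
    """
    centroids, sizes, labels = [], [], []
    for p in profiles:
        best, best_sim = -1, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(p, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best < 0:
            # No cluster is similar enough: open a new one.
            centroids.append(list(p))
            sizes.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the chosen cluster.
            n = sizes[best]
            centroids[best] = [(m * n + x) / (n + 1)
                               for m, x in zip(centroids[best], p)]
            sizes[best] = n + 1
        labels.append(best)
    return labels

def reduce_rows(X, labels):
    """Project each row of X onto the feature clusters via inner products."""
    k = max(labels) + 1
    # W[j][c] is 1.0 when feature j belongs to cluster c (hard membership, assumed).
    W = [[1.0 if labels[j] == c else 0.0 for c in range(k)]
         for j in range(len(labels))]
    return [[sum(row[j] * W[j][c] for j in range(len(row))) for c in range(k)]
            for row in X]

# Hypothetical toy data: 4 term features described by their distribution over 2 classes.
profiles = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = incremental_cluster(profiles, threshold=0.9)

# Hypothetical document-term counts: 3 documents x 4 terms.
X = [[2, 1, 0, 0],
     [0, 0, 3, 1],
     [1, 0, 0, 2]]
reduced = reduce_rows(X, labels)
print(labels)   # cluster index of each feature
print(reduced)  # 3 documents x (number of clusters) matrix
```

In this toy run the four term features collapse into two clusters, so each document is represented by two values instead of four. A real weighting scheme would replace the 0/1 memberships used here.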


Paper Citation


in Harvard Style

Chien B. (2014). Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction. In Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-035-2, pages 72-77. DOI: 10.5220/0005093300720077


in Bibtex Style

@conference{data14,
author={Been-Chian Chien},
title={Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction},
booktitle={Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2014},
pages={72-77},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005093300720077},
isbn={978-989-758-035-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - Incorporating Feature Selection and Clustering Approaches for High-Dimensional Data Reduction
SN - 978-989-758-035-2
AU - Chien B.
PY - 2014
SP - 72
EP - 77
DO - 10.5220/0005093300720077