ARTIFICIAL DATA GENERATION FOR ONE-CLASS CLASSIFICATION - A Case Study of Dimensionality Reduction for Text and Biological Data

Santiago D. Villalba, Pádraig Cunningham

2009

Abstract

Artificial negatives have been employed in a variety of machine learning contexts to overcome data availability problems. In this paper we explore the use of artificial negatives for dimension reduction in one-class classification, that is, classification problems where only positive examples are available for training. We present four different strategies for generating artificial negatives and show that two of these strategies are very effective for discovering discriminating projections of the data, i.e., low-dimensional projections that discriminate between positive and real negative examples. The paper concludes with an assessment of the selection bias of this approach to dimension reduction for one-class classification.
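
As a rough illustration of what generating artificial negatives can look like, the sketch below implements two generic generators in Python with NumPy. The abstract does not name the paper's four strategies, so these two (uniform sampling in the bounding box of the positives, and independent shuffling of each feature column) are assumptions chosen for illustration, not the authors' methods. The function names and the toy data are hypothetical.

```python
import numpy as np


def uniform_box_negatives(X, n, rng):
    """Sample n points uniformly from the axis-aligned bounding box of X.

    Illustrative assumption, not necessarily one of the paper's strategies.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=(n, X.shape[1]))


def marginal_shuffle_negatives(X, rng):
    """Shuffle every feature column independently.

    Each feature keeps its marginal distribution, but dependencies between
    features are destroyed. Illustrative assumption as well.
    """
    return np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )


rng = np.random.default_rng(0)

# Toy positives (hypothetical data): two strongly correlated features.
z = rng.normal(size=500)
X_pos = np.column_stack([z, z + 0.1 * rng.normal(size=500)])

X_unif = uniform_box_negatives(X_pos, 500, rng)
X_shuf = marginal_shuffle_negatives(X_pos, rng)

# Pool positives with artificial negatives into a labelled two-class set.
X = np.vstack([X_pos, X_shuf])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_shuf))])
```

The labelled pair `X`, `y` turns the one-class problem into a two-class one, so any supervised dimension reduction method could in principle be fitted to it. Whether the resulting projection also separates positives from real negatives is the question the paper evaluates.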


Paper Citation


in Harvard Style

Villalba, S. D. and Cunningham, P. (2009). ARTIFICIAL DATA GENERATION FOR ONE-CLASS CLASSIFICATION - A Case Study of Dimensionality Reduction for Text and Biological Data. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 202-210. DOI: 10.5220/0002310202020210


in Bibtex Style

@conference{kdir09,
author={Santiago D. Villalba and Pádraig Cunningham},
title={ARTIFICIAL DATA GENERATION FOR ONE-CLASS CLASSIFICATION - A Case Study of Dimensionality Reduction for Text and Biological Data},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={202-210},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002310202020210},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - ARTIFICIAL DATA GENERATION FOR ONE-CLASS CLASSIFICATION - A Case Study of Dimensionality Reduction for Text and Biological Data
SN - 978-989-674-011-5
AU - Villalba, Santiago D.
AU - Cunningham, Pádraig
PY - 2009
SP - 202
EP - 210
DO - 10.5220/0002310202020210