Learning on Vertically Partitioned Data based on Chi-square Feature Selection and Naive Bayes Classification

Verónica Bolón-Canedo, Diego Peteiro-Barral, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas, Noelia Sánchez-Maroño

2014

Abstract

In the last few years, distributed learning has received much attention due to the explosion of large databases, which in some cases are distributed across different nodes. However, the great majority of current feature selection and classification algorithms are designed for centralized learning, i.e. they use the whole dataset at once. In this paper, a new approach for learning on vertically partitioned data is presented, which covers both feature selection and classification. The approach splits the data by features, and then uses the χ² filter and the naive Bayes classifier to learn at each node. Finally, a merging procedure is performed, which updates the learned model in an incremental fashion. The experimental results on five representative datasets show that the execution time is shortened considerably, whereas the classification performance is maintained as the number of nodes increases.
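The pipeline outlined in the abstract (split by features, χ² filtering and naive Bayes learning at each node, incremental merging) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the merging procedure, so the sketch assumes that every node has access to the class labels, that all features are discrete, and that merging amounts to adding each node's per-feature count tables to a shared model. The function names, the toy data, and the threshold of 3.84 (the 0.05 critical value of χ² with one degree of freedom) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def chi_square(column, labels):
    """Pearson chi-square statistic between one discrete feature and the class."""
    n = len(labels)
    observed = Counter(zip(column, labels))
    value_totals = Counter(column)
    class_totals = Counter(labels)
    stat = 0.0
    for value, n_value in value_totals.items():
        for cls, n_cls in class_totals.items():
            expected = n_value * n_cls / n
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


def train_node(features, labels, threshold):
    """Learn at one node: keep the features whose chi-square score reaches the
    threshold and store naive Bayes count tables for them."""
    model = {}
    for name, column in features.items():
        if chi_square(column, labels) >= threshold:
            model[name] = {
                "counts": Counter(zip(column, labels)),
                "values": set(column),
            }
    return model


def merge(global_model, node_model):
    """Assumed merging step: since naive Bayes treats features independently,
    a node's tables can be added to the shared model one node at a time."""
    global_model.update(node_model)
    return global_model


def predict(model, class_totals, instance):
    """Naive Bayes prediction with Laplace smoothing over the merged tables."""
    n = sum(class_totals.values())
    scores = {}
    for cls, n_cls in class_totals.items():
        score = math.log(n_cls / n)
        for name, table in model.items():
            count = table["counts"][(instance[name], cls)]
            score += math.log((count + 1) / (n_cls + len(table["values"])))
        scores[cls] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): eight instances, three features split over two nodes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
node_a = {"f1": list("aaaabbbb"), "f2": list("xyxyxyxy")}
node_b = {"f3": list("ppppqqqp")}
THRESHOLD = 3.84  # illustrative cut-off, not taken from the paper

model = {}
for node in (node_a, node_b):
    merge(model, train_node(node, labels, THRESHOLD))

class_totals = Counter(labels)
print(sorted(model))
print(predict(model, class_totals, {"f1": "b", "f3": "q"}))
```

In this toy run, feature `f2` is independent of the class and is filtered out at its node, so only `f1` and `f3` reach the merged model. Because each node works only on its own columns, the per-node work could run in parallel, which is consistent with the shorter execution times the abstract reports.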



Paper Citation


in Harvard Style

Bolón-Canedo V., Peteiro-Barral D., Alonso-Betanzos A., Guijarro-Berdiñas B. and Sánchez-Maroño N. (2014). Learning on Vertically Partitioned Data based on Chi-square Feature Selection and Naive Bayes Classification. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-758-015-4, pages 350-357. DOI: 10.5220/0004759503500357


in Bibtex Style

@conference{icaart14,
author={Verónica Bolón-Canedo and Diego Peteiro-Barral and Amparo Alonso-Betanzos and Bertha Guijarro-Berdiñas and Noelia Sánchez-Maroño},
title={Learning on Vertically Partitioned Data based on Chi-square Feature Selection and Naive Bayes Classification},
booktitle={Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2014},
pages={350-357},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004759503500357},
isbn={978-989-758-015-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - Learning on Vertically Partitioned Data based on Chi-square Feature Selection and Naive Bayes Classification
SN - 978-989-758-015-4
AU - Bolón-Canedo V.
AU - Peteiro-Barral D.
AU - Alonso-Betanzos A.
AU - Guijarro-Berdiñas B.
AU - Sánchez-Maroño N.
PY - 2014
SP - 350
EP - 357
DO - 10.5220/0004759503500357