DEALING WITH “VERY LARGE” DATASETS - An Overview of a Promising Research Line: Distributed Learning

Diego Peteiro-Barral, Bertha Guijarro-Berdiñas, Beatriz Pérez-Sánchez

Abstract

Traditionally, a bottleneck preventing the development of more intelligent systems was the limited amount of data available. Nowadays, however, in many domains of machine learning the datasets are so large that the limiting factor is the inability of learning algorithms to process all the data in a reasonable time. To handle this problem, a new field in machine learning has emerged: large-scale learning, where learning is limited by computational resources rather than by the availability of data. Moreover, in many real applications, “very large” datasets are naturally distributed, and it is necessary to learn locally at each of the workstations where the data are generated. However, the great majority of well-known learning algorithms do not provide an admissible solution to both problems: learning from “very large” datasets and learning from distributed data. In this context, distributed learning seems to be a promising line of research for dealing with both situations, since “very large” concentrated datasets can also be partitioned among several workstations. This paper provides some background on distributed environments as well as an overview of distributed learning for dealing with “very large” datasets.



Paper Citation


in Harvard Style

Peteiro-Barral D., Guijarro-Berdiñas B. and Pérez-Sánchez B. (2011). DEALING WITH “VERY LARGE” DATASETS - An Overview of a Promising Research Line: Distributed Learning. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8425-40-9, pages 476-481. DOI: 10.5220/0003288804760481


in Bibtex Style

@conference{icaart11,
author={Diego Peteiro-Barral and Bertha Guijarro-Berdiñas and Beatriz Pérez-Sánchez},
title={DEALING WITH “VERY LARGE” DATASETS - An Overview of a Promising Research Line: Distributed Learning},
booktitle={Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2011},
pages={476-481},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003288804760481},
isbn={978-989-8425-40-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - DEALING WITH “VERY LARGE” DATASETS - An Overview of a Promising Research Line: Distributed Learning
SN - 978-989-8425-40-9
AU - Peteiro-Barral D.
AU - Guijarro-Berdiñas B.
AU - Pérez-Sánchez B.
PY - 2011
SP - 476
EP - 481
DO - 10.5220/0003288804760481