Towards a Synthetic Data Generator for Matching Decision Trees

Taoxin Peng, Florian Hanke

2016

Abstract

Real-world data is commonly used to evaluate or teach data mining techniques. However, using real-world data for such purposes has disadvantages. Firstly, real-world data in most domains is difficult to obtain for budgetary, technical, or ethical reasons. Secondly, the use of many real-world data sets is restricted, and in the case of data mining, such data sets either do not contain specific patterns that are easy to mine for teaching purposes, or the data needs special preparation and the algorithm needs very specific settings in order to find patterns in it. A possible solution is to generate synthetic “meaningful data” (data with intrinsic patterns). This paper presents a framework for such a data generator, which is able to generate datasets with intrinsic patterns, such as decision trees. A preliminary run of the prototype shows that generating such “meaningful data” is possible. The proposed approach could also be extended to generate synthetic data with other intrinsic patterns.
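
The notion of data with an intrinsic decision-tree pattern can be illustrated with a minimal sketch. This is not the framework presented in the paper: the tree, the attribute names, the attribute domains, and the functions `classify` and `generate` are all hypothetical, and uniform random sampling of attribute values is an assumption made only for illustration. The sketch samples rows at random and labels each one by walking a predefined tree, so a decision tree learner run on the output should be able to recover a tree of the same shape.

```python
import random

# Hypothetical target tree (not from the paper): internal nodes test one
# categorical attribute; leaves are plain string class labels.
TREE = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "windy", "branches": {"yes": "stay", "no": "play"}},
        "rainy": "stay",
        "overcast": "play",
    },
}

# Assumed attribute domains; every value tested in TREE must appear here.
DOMAINS = {"outlook": ["sunny", "rainy", "overcast"], "windy": ["yes", "no"]}


def classify(tree, row):
    """Walk the tree until a leaf (a string label) is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][row[node["attr"]]]
    return node


def generate(tree, domains, n_rows, seed=0):
    """Sample attribute values uniformly, then label each row with the tree."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {name: rng.choice(values) for name, values in domains.items()}
        row["label"] = classify(tree, row)
        rows.append(row)
    return rows


data = generate(TREE, DOMAINS, n_rows=200)
```

Because every label is derived from the tree, the generated data set contains no noise; a more realistic generator would presumably also need to control class balance and add noise, which this sketch does not attempt.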


Paper Citation


in Harvard Style

Peng T. and Hanke F. (2016). Towards a Synthetic Data Generator for Matching Decision Trees. In Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-187-8, pages 135-141. DOI: 10.5220/0005829001350141


in Bibtex Style

@conference{iceis16,
author={Taoxin Peng and Florian Hanke},
title={Towards a Synthetic Data Generator for Matching Decision Trees},
booktitle={Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2016},
pages={135-141},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005829001350141},
isbn={978-989-758-187-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Towards a Synthetic Data Generator for Matching Decision Trees
SN - 978-989-758-187-8
AU - Peng T.
AU - Hanke F.
PY - 2016
SP - 135
EP - 141
DO - 10.5220/0005829001350141