Exploring Text Classification Configurations - A Bottom-up Approach to Customize Text Classifiers based on the Visualization of Performance

Alejandro Gabriel Villanueva Zacarias, Laura Kassner, Bernhard Mitschang

2017

Abstract

Automated Text Classification (ATC) is an important technique to support industry expert workers, e.g., in product quality assessment based on part failure reports. To be useful, ATC classifiers must entail reasonable costs for a given accuracy level and processing time. However, there is little clarity on how to customize the constituent elements of a classifier for this purpose. In this paper, we highlight the need to configure an ATC classifier according to the properties of the algorithm and the dataset at hand. In this context, we make three contributions: (1) the notion of ATC Configuration to arrange the relevant design choices for building an ATC classifier, (2) a Feature Selection technique named Smart Feature Selection, and (3) a visualization technique, called ATCC Performance Cube, to translate the technical configuration aspects into a performance visualization. With the help of this Cube, business decision-makers can easily understand the performance and cost variability of different ATC Configurations in their specific application scenarios.



Paper Citation


in Harvard Style

Villanueva Zacarias A., Kassner L. and Mitschang B. (2017). Exploring Text Classification Configurations - A Bottom-up Approach to Customize Text Classifiers based on the Visualization of Performance. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 504-511. DOI: 10.5220/0006309705040511


in Bibtex Style

@conference{iceis17,
author={Alejandro Gabriel Villanueva Zacarias and Laura Kassner and Bernhard Mitschang},
title={Exploring Text Classification Configurations - A Bottom-up Approach to Customize Text Classifiers based on the Visualization of Performance},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={504-511},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006309705040511},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Exploring Text Classification Configurations - A Bottom-up Approach to Customize Text Classifiers based on the Visualization of Performance
SN - 978-989-758-247-9
AU - Villanueva Zacarias A.
AU - Kassner L.
AU - Mitschang B.
PY - 2017
SP - 504
EP - 511
DO - 10.5220/0006309705040511