Croujaction - A Novel Approach to Text-based Job Name Clustering with Correlation Analysis

Zunhe Liu, Yan Liu, Xiao Yang, Shengyu Guo, Buyang Cao

2015

Abstract

Job name clustering is becoming increasingly important for anomaly detection and for the analysis of cloud performance. Unlike ordinary free text, a job name is a sequence of characters or tokens, which makes clustering based on job name text challenging. In this paper we analyze the correlation between columns and use user-job correlation to improve the classic TF-IDF algorithm. We optimize word tokenization and feature set generation, and we use hierarchical clustering methods in our experiments. Finally, we develop a module, evaluate the performance of the optimized algorithm, and deliver it as a product to a prestigious e-commerce company.
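
The pipeline named in the abstract (tokenizing job names, TF-IDF weighting, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization rule, the smoothed IDF formula, the average-linkage merge rule, the 0.3 similarity threshold, and the sample job names are all assumptions made for illustration. The user-job correlation weighting is omitted because the abstract does not specify how it is computed.

```python
import math
import re
from collections import Counter


def tokenize(job_name):
    """Split on non-alphanumeric characters and drop pure-digit tokens (assumed rule)."""
    tokens = re.split(r"[^A-Za-z0-9]+", job_name.lower())
    return [t for t in tokens if t and not t.isdigit()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: token -> weight) per tokenized job name."""
    n = len(docs)
    # Document frequency: number of job names that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF, ln((1 + n) / (1 + df)) + 1, chosen here as an assumption.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def agglomerative(vectors, threshold=0.3):
    """Average-linkage agglomerative clustering over cosine similarity.

    Merging stops once the most similar pair of clusters falls below `threshold`.
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical job names, invented for this example.
job_names = [
    "etl_daily_orders_20150101",
    "etl_daily_orders_20150102",
    "report-weekly-sales-01",
    "report-weekly-sales-02",
]
clusters = agglomerative(tfidf_vectors([tokenize(name) for name in job_names]))
```

Dropping pure-digit tokens reflects the observation that job names often differ only in date or sequence suffixes; whether the paper's tokenizer does the same is not stated in the abstract.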



Paper Citation


in Harvard Style

Liu Z., Liu Y., Yang X., Guo S. and Cao B. (2015). Croujaction - A Novel Approach to Text-based Job Name Clustering with Correlation Analysis. In Proceedings of the International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES, ISBN 978-989-758-075-8, pages 199-204. DOI: 10.5220/0005271601990204


in Bibtex Style

@conference{icores15,
author={Zunhe Liu and Yan Liu and Xiao Yang and Shengyu Guo and Buyang Cao},
title={Croujaction - A Novel Approach to Text-based Job Name Clustering with Correlation Analysis},
booktitle={Proceedings of the International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES},
year={2015},
pages={199-204},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005271601990204},
isbn={978-989-758-075-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Operations Research and Enterprise Systems - Volume 1: ICORES
TI - Croujaction - A Novel Approach to Text-based Job Name Clustering with Correlation Analysis
SN - 978-989-758-075-8
AU - Liu Z.
AU - Liu Y.
AU - Yang X.
AU - Guo S.
AU - Cao B.
PY - 2015
SP - 199
EP - 204
DO - 10.5220/0005271601990204