CONCEPT-BASED CLUSTERING FOR OPEN-SOURCED SOFTWARE(OSS) DEVELOPMENT FORUM THREADS

Jonathan Jason C. King Li, Masanori Akiyoshi, Masaki Samejima, Norihisa Komoda

Abstract

Open-source software (OSS) development depends on Internet forums for communication among developers. However, a typical program has related modules, and these relationships are hard to express in a forum. Developers already report related modules by hand, but this practice is impractical because of human inaccuracy. Our approach uses Concept-Based Document Similarity, which analyzes the semantic value of a word or phrase at the sentence, document, and corpus levels, to measure similarity between documents. We then created a novel clustering algorithm that needs no threshold values and stops clustering once the clusters are correctly formed. The method was first tested on newspaper articles to check its effectiveness and then applied to a collection of Bugzilla threads. The newspaper results showed that the clustering process works, but the results for the Bugzilla threads, where the comment content does not clearly reveal the thread topic, indicate that elements other than thread content are needed to establish similarity. Future work will use other thread elements to cluster similar threads.
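
The abstract does not give the details of the similarity measure or of the clustering algorithm, so the following is only a minimal sketch of the general idea of clustering documents without a tuned threshold. It is not the authors' method. Two stand-ins are assumed: plain word-count cosine similarity replaces Concept-Based Document Similarity, and nearest-neighbor linking (each document joins the cluster of its most similar document) replaces the paper's clustering algorithm. All function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Lower-cased word counts (assumed stand-in for concept-based features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def nearest_neighbor_clusters(docs):
    """Link every document to its most similar other document and return
    the connected components as sorted lists of document indices.
    No similarity threshold is tuned; a document that shares no words
    with any other document stays in a cluster of its own."""
    vectors = [term_vector(d) for d in docs]
    n = len(docs)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best, best_sim = None, 0.0
        for j in range(n):
            if j != i:
                s = cosine(vectors[i], vectors[j])
                if s > best_sim:
                    best, best_sim = j, s
        if best is not None:
            parent[find(i)] = find(best)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    threads = [
        "crash when opening bookmark menu",
        "bookmark menu crash on startup",
        "printing page cuts off margins",
        "margins wrong when printing page",
    ]
    print(nearest_neighbor_clusters(threads))
```

Nearest-neighbor linking can chain loosely related documents into one large cluster, which is one reason a purpose-built stopping rule such as the one the abstract describes may be preferable.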


Paper Citation


in Harvard Style

Jason C. King Li J., Akiyoshi M., Samejima M. and Komoda N. (2011). CONCEPT-BASED CLUSTERING FOR OPEN-SOURCED SOFTWARE(OSS) DEVELOPMENT FORUM THREADS. In Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011) ISBN 978-989-8425-51-5, pages 690-695. DOI: 10.5220/0003478206900695


in Bibtex Style

@conference{wtm11,
author={Jonathan Jason C. King Li and Masanori Akiyoshi and Masaki Samejima and Norihisa Komoda},
title={CONCEPT-BASED CLUSTERING FOR OPEN-SOURCED SOFTWARE(OSS) DEVELOPMENT FORUM THREADS},
booktitle={Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011)},
year={2011},
pages={690-695},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003478206900695},
isbn={978-989-8425-51-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WTM, (WEBIST 2011)
TI - CONCEPT-BASED CLUSTERING FOR OPEN-SOURCED SOFTWARE(OSS) DEVELOPMENT FORUM THREADS
SN - 978-989-8425-51-5
AU - Jason C. King Li J.
AU - Akiyoshi M.
AU - Samejima M.
AU - Komoda N.
PY - 2011
SP - 690
EP - 695
DO - 10.5220/0003478206900695