Constrained Agglomerative Hierarchical Software Clustering with Hard and Soft Constraints

Chun Yong Chong, Sai Peck Lee

2015

Abstract

Although agglomerative hierarchical software clustering has been widely used in reverse engineering to recover a high-level abstraction of a software system when resources are limited, little work in this research context has integrated pair-wise constraints, such as must-link and cannot-link constraints, to further improve clustering quality. Pair-wise constraints, derived from experts or software developers, provide a means to indicate whether a pair of software components belongs to the same functional group. In this paper, a constrained agglomerative hierarchical clustering algorithm is proposed to maximize the fulfilment of must-link and cannot-link constraints in a unique manner. Two experiments using real-world software systems are performed to evaluate the effectiveness of the proposed algorithm. The evaluation results show that the proposed algorithm is capable of handling constraints to improve the quality of clustering and ultimately provides a better understanding of the analyzed software system.
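
To make the idea of pair-wise constraints concrete, the following is a minimal sketch of how must-link and cannot-link pairs could steer an agglomerative merge loop. It is not the algorithm proposed in the paper: the function name `constrained_agglomerative`, the single-linkage criterion, the plain distance-matrix input, and the choice to treat every constraint as hard are all assumptions made for illustration.

```python
from itertools import combinations


def constrained_agglomerative(dist, must_link, cannot_link, n_clusters):
    """Single-linkage agglomerative clustering with pair-wise constraints.

    Hypothetical helper for illustration only.
    dist: symmetric matrix (list of lists) of pairwise distances.
    must_link / cannot_link: iterables of (i, j) index pairs.
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    clusters = [{i} for i in range(len(dist))]
    cannot = {frozenset(pair) for pair in cannot_link}

    def find(i):
        # Locate the cluster that currently holds component i.
        return next(c for c in clusters if i in c)

    def violates(a, b):
        # True if merging a and b would join a cannot-link pair.
        return any(frozenset((x, y)) in cannot for x in a for y in b)

    def merge(a, b):
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    # Step 1: honour every must-link pair before any distance-based merge.
    for i, j in must_link:
        a, b = find(i), find(j)
        if a is not b:
            if violates(a, b):
                raise ValueError(f"must-link ({i}, {j}) conflicts with a cannot-link")
            merge(a, b)

    # Step 2: repeatedly merge the closest pair that no cannot-link forbids.
    while len(clusters) > n_clusters:
        candidates = [
            (min(dist[x][y] for x in a for y in b), ia, ib)
            for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
            if not violates(a, b)
        ]
        if not candidates:
            break  # every remaining merge is forbidden
        _, ia, ib = min(candidates)
        merge(clusters[ia], clusters[ib])

    return sorted(sorted(c) for c in clusters)


# Four hypothetical components placed on a line at positions 0, 1, 2 and 10.
points = [0, 1, 2, 10]
dist = [[abs(p - q) for q in points] for p in points]

print(constrained_agglomerative(dist, [], [], 2))
# → [[0, 1, 2], [3]]
print(constrained_agglomerative(dist, [(2, 3)], [(1, 2)], 2))
# → [[0, 1], [2, 3]]
```

In the second call, the must-link pair forces components 2 and 3 together despite their distance, and the cannot-link pair keeps component 1 out of that cluster. A soft-constraint variant would instead penalize violations rather than forbid them outright; that is not shown here.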



Paper Citation


in Harvard Style

Chong C. and Lee S. (2015). Constrained Agglomerative Hierarchical Software Clustering with Hard and Soft Constraints. In Proceedings of the 10th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE, ISBN 978-989-758-100-7, pages 177-188. DOI: 10.5220/0005344001770188


in Bibtex Style

@conference{enase15,
author={Chun Yong Chong and Sai Peck Lee},
title={Constrained Agglomerative Hierarchical Software Clustering with Hard and Soft Constraints},
booktitle={Proceedings of the 10th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE},
year={2015},
pages={177-188},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005344001770188},
isbn={978-989-758-100-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE
TI - Constrained Agglomerative Hierarchical Software Clustering with Hard and Soft Constraints
SN - 978-989-758-100-7
AU - Chong C.
AU - Lee S.
PY - 2015
SP - 177
EP - 188
DO - 10.5220/0005344001770188