Similarity of Software Libraries: A Tag-based Classification Approach

Maximilian Auch, Maximilian Balluff, Peter Mandl, Christian Wolff

2021

Abstract

The number of software libraries has increased over time, so grouping them into classes according to their functionality simplifies repository management and analysis. Given the large number of libraries, this categorization needs to be automated. Using a crawled dataset of Java software libraries from Apache Maven repositories, together with tags and categories from the indexing platform MvnRepository.com, we show how the data is structured and point out a class imbalance. We introduce a class mapping that maps the libraries from very specific, technical classes into more generic classes. Using this mapping, we investigate supervised machine learning techniques that classify the software libraries in the dataset based on their available tags. We show that a tag-based approach using neural networks can classify libraries with an accuracy of 97.46%. Overall, we found neural networks and naïve Bayes more suitable for this use case than logistic regression or random forest.
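
To make the tag-based idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multinomial naïve Bayes with Laplace smoothing over tag occurrences, preceded by a class mapping step. The tags, the specific and generic class names, and the helper names (`CLASS_MAPPING`, `train`, `predict`) are all hypothetical and chosen only for illustration; they are not taken from the paper's dataset.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mapping from specific, technical classes to generic classes.
CLASS_MAPPING = {
    "JSON Libraries": "Data Formats",
    "XML Processing": "Data Formats",
    "Logging Frameworks": "Logging",
    "Logging Bridges": "Logging",
    "Testing Frameworks": "Testing",
    "Assertion Libraries": "Testing",
}

# Hypothetical toy samples: (tags of a library, specific class).
RAW_SAMPLES = [
    (["json", "parser", "serialization"], "JSON Libraries"),
    (["xml", "parser"], "XML Processing"),
    (["logging", "appender", "log4j"], "Logging Frameworks"),
    (["logging", "slf4j"], "Logging Bridges"),
    (["testing", "junit", "mock"], "Testing Frameworks"),
    (["testing", "assertions"], "Assertion Libraries"),
]


def train(samples):
    """Count classes and per-class tag occurrences for naive Bayes."""
    class_counts = Counter()
    tag_counts = defaultdict(Counter)
    vocab = set()
    for tags, label in samples:
        class_counts[label] += 1
        for tag in tags:
            tag_counts[label][tag] += 1
            vocab.add(tag)
    return class_counts, tag_counts, vocab


def predict(model, tags):
    """Return the class with the highest log posterior for the given tags."""
    class_counts, tag_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(tag_counts[label].values()) + len(vocab)
        for tag in tags:
            if tag in vocab:  # tags never seen in training are ignored
                # Laplace-smoothed log likelihood of the tag given the class
                score += math.log((tag_counts[label][tag] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Apply the class mapping first, then train on the generic classes.
mapped_samples = [(tags, CLASS_MAPPING[label]) for tags, label in RAW_SAMPLES]
model = train(mapped_samples)

print(predict(model, ["json", "serialization"]))
print(predict(model, ["logging", "log4j"]))
```

On this toy data the two queries fall into the generic classes "Data Formats" and "Logging". A real setup would additionally need to handle the class imbalance noted in the abstract and to evaluate on held-out libraries.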


Paper Citation


in Harvard Style

Auch M., Balluff M., Mandl P. and Wolff C. (2021). Similarity of Software Libraries: A Tag-based Classification Approach. In Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-521-0, pages 17-28. DOI: 10.5220/0010521600170028


in Bibtex Style

@conference{data21,
author={Maximilian Auch and Maximilian Balluff and Peter Mandl and Christian Wolff},
title={Similarity of Software Libraries: A Tag-based Classification Approach},
booktitle={Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2021},
pages={17-28},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010521600170028},
isbn={978-989-758-521-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Similarity of Software Libraries: A Tag-based Classification Approach
SN - 978-989-758-521-0
AU - Auch M.
AU - Balluff M.
AU - Mandl P.
AU - Wolff C.
PY - 2021
SP - 17
EP - 28
DO - 10.5220/0010521600170028