Authors:
Claudio Curto (1), Daniela Giordano (1) and Daniel Indelicato (2)
Affiliations:
(1) Department of Electrical, Electronic and Computer Engineering (DIEEI), University of Catania, Catania, Italy
(2) EtnaHitech S.c.p.A., Darwin Technologies S.r.l., Catania, Italy
Keyword(s):
Vulnerable Code Datasets, Vulnerability Detection, Deep Learning, Data Analysis.
Abstract:
Recent years have witnessed growing interest in applying deep learning techniques to software security assessment, particularly for detecting vulnerability patterns in human-generated source code. Despite advances, the effectiveness of deep learning models is often hindered by limitations in the datasets used for training. This study conducts a comprehensive evaluation of one widely used and two recently released C/C++ real-world vulnerable code datasets to assess their impact on the performance of transformer-based models, focusing on generalization across unseen projects, unseen vulnerability types and diverse data distributions. In addition, we analyze the effects of aggregating datasets and compare the results with previous experiments. Experimental results demonstrate that combining datasets significantly improves model generalization across varied distributions, highlighting the importance of diverse, high-quality data for enhancing vulnerability detection in source code.