Addressing the C/C++ Vulnerability Datasets Limitation: The Good, the Bad and the Ugly

Claudio Curto¹ᵃ, Daniela Giordano¹ and Daniel Gustav Indelicato²
¹Department of Electrical Electronic and Computer Engineering (DIEEI), University of Catania, Catania, Italy
²EtnaHitech S.c.p.A., Darwin Technologies S.r.l., Catania, Italy
ᵃ https://orcid.org/0009-0006-6516-7671
Keywords:
Vulnerable Code Datasets, Vulnerability Detection, Deep Learning, Data Analysis.
Abstract:
Recent years have witnessed growing interest in applying deep learning techniques to software security assess-
ment, particularly for detecting vulnerability patterns in human-generated source code. Despite advances, the
effectiveness of deep learning models is often hindered by limitations in the datasets used for training. This
study conducts a comprehensive evaluation of one widely used and two recently released C/C++ real-world
vulnerable code datasets to assess their impact on the performance of transformer-based models, focusing on
generalization across unseen projects, unseen vulnerability types and diverse data distributions. In addition,
we analyze the effects of aggregating datasets and compare the results with previous experiments. Experi-
mental results demonstrate that combining datasets significantly improves model generalization across varied
distributions, highlighting the importance of diverse, high-quality data for enhancing vulnerability detection
in source code.
1 INTRODUCTION
Software security has become a critical priority in
an era where vulnerabilities in source code can lead
to severe consequences, including data breaches, ser-
vice disruptions, and compromised system integrity.
The need for effective and scalable methods to
detect vulnerabilities has driven extensive research
into automated solutions, particularly those leverag-
ing deep learning techniques (Curto et al., 2024a).
Transformer-based models, such as RoBERTa and
its variants (Feng et al., 2020)(Guo et al., 2020),
have shown substantial promise in identifying intri-
cate vulnerability patterns in source code (Mamede
et al., 2022), (Fu and Tantithamthavorn, 2022), (Curto
et al., 2024b). These models, trained on large-
scale datasets, offer the potential to surpass traditional
rule-based methods in both scalability and precision.
However, their reliability and effectiveness remain
heavily dependent on the quality and diversity of the
datasets used for training.
Recent advancements in software vulnerability
prediction (SVP) have demonstrated that high-quality
datasets are essential for training effective machine
learning models. Studies have highlighted signifi-
cant challenges in dataset quality, including inaccu-
rate labels, data duplication, and limited diversity in
vulnerability types (Croft et al., 2023). For instance,
Croft et al. revealed that up to 71% of labels in real-
world datasets are inaccurate, which undermines the
performance and generalization of models trained on
such data. Furthermore, while synthetic datasets like
Juliet¹ provide an additional source of training data,
they often lack the complexity and variability of real-
world vulnerabilities. This has led to an increased fo-
cus on real-world datasets (Fan et al., 2020), (Chen
et al., 2023), (Ni et al., 2024), which, despite their
inherent imperfections, offer a more realistic founda-
tion for training models capable of detecting vulnera-
bilities in diverse codebases.
One of the main limitations in applying large lan-
guage models (LLMs) to vulnerability detection is the
significant data requirements for training. Beyond the
volume of data, the origin of the data, whether syn-
thetic or real-world, plays a critical role in shaping
model performance (Chakraborty et al., 2022b). Real-
world datasets, derived from actual software projects,
are particularly valuable for evaluating the general-
ization capabilities of models. However, achieving
robust generalization remains a significant challenge.
¹ https://samate.nist.gov/SARD/test-suites/112
Generalization in this context refers to a model’s abil-
ity to detect vulnerabilities across unseen projects,
novel vulnerability types (e.g., new Common Weakness Enumeration categories, CWEs), and varied data distribu-
tions. High performance on specific datasets often
masks the inability of models to generalize effec-
tively, an issue that has been underexplored in the lit-
erature.
This study bridges existing gaps by conducting
an in-depth evaluation of three widely-used C/C++
vulnerable code datasets. Using a RoBERTa-based
model, we examine its generalization capabilities
across three key scenarios: unseen projects, novel
vulnerability types, and varying data distributions.
Furthermore, we analyze the effects of aggregating
these datasets into a unified collection to assess its im-
pact on model performance.
Our results reveal that dataset aggregation signifi-
cantly improves model generalization, particularly in
challenging cases involving unseen projects and pre-
viously unencountered vulnerabilities. This finding
emphasizes the value of combining datasets to miti-
gate biases and enhance the robustness of deep learn-
ing models in software vulnerability detection.
By addressing dataset limitations and exploring
their role in transformer-based models, this work
underscores the importance of diverse, high-quality
datasets as a foundation for advancing automated
software security assessment. The insights presented
aim to guide researchers and practitioners in develop-
ing scalable and reliable solutions for software vul-
nerability detection.
2 BACKGROUND
2.1 Transformer Models for
Vulnerability Detection
Ensuring software security heavily relies on detecting
vulnerabilities, traditionally approached with rule-
based methods. However, these methods often strug-
gle with scalability and precision. Recent advance-
ments in deep learning, particularly transformer-
based models, have automated vulnerability detection
across large codebases (Jiang et al., 2025). Models
like CodeBERT (Feng et al., 2020) and other code-
specific transformers (Xu et al., 2022), (Chakraborty
et al., 2022a), (Rozière et al., 2023) have proven ef-
fective in identifying complex vulnerability patterns
by understanding both syntax and semantics of pro-
gramming languages. Their ability to capture long-
range dependencies has positioned them at the fore-
front of the field.
However, their success depends critically on the
quality and diversity of training datasets. Issues like
data imbalance, inaccurate labels, and lack of diver-
sity can hinder generalization, limiting performance
on new codebases or unseen vulnerabilities. This
challenge remains a core issue in software security
research (Chen et al., 2023). Our study explores
how combining multiple datasets and enhancing vul-
nerability diversity can improve the generalization of
transformer models, with a particular focus on the
MITRE Top 25 CWEs and new vulnerability cate-
gories.
2.2 Vulnerability Datasets Quality
The quality of vulnerability datasets is a crucial, of-
ten overlooked, factor in the effectiveness of machine
learning models for software security (Croft et al.,
2023). While transformer-based models show high
performance on specific datasets, their ability to gen-
eralize to diverse, unseen data remains a challenge
(Chen et al., 2023). Most studies focus on perfor-
mance metrics within specific datasets (Zhou et al.,
2019), neglecting how models handle real-world data
distributions.
This emphasis on dataset-specific performance
can give a false sense of model effectiveness. High ac-
curacy on a particular dataset may mask weaknesses
when confronted with new vulnerabilities or projects.
This issue is exacerbated when datasets have biases
in the types of vulnerabilities, code structure, or pro-
gramming languages represented.
To address these challenges, it is essential to focus
on both dataset diversity and generalization, ensuring
that vulnerability datasets are not only large and com-
prehensive but also varied enough to represent real-
world applications. This diversity is crucial for train-
ing models capable of detecting vulnerabilities across
different codebases, vulnerability types, and project
configurations.
3 RELATED WORKS
The use of deep learning techniques for software vul-
nerability detection has been an active area of re-
search, with a particular focus on applying both Graph
Neural Network (GNN) (Cheng et al., 2021), (Nguyen
et al., 2022), (Hin et al., 2022) and transformer-based
models (Curto et al., 2024b), (Fu and Tantithamtha-
vorn, 2022) to identify vulnerability patterns in source
code. Several studies have explored the effectiveness
of learning-based approaches for vulnerability pre-
diction, demonstrating improvements over traditional
rule-based methods in scalability and accuracy. For
instance, recent advancements in software vulnera-
bility prediction (SVP) models have leveraged large-
scale datasets to train deep neural networks, achiev-
ing state-of-the-art results in predicting security flaws
(Curto et al., 2024a). However, the reliability of these
models is often compromised by data quality issues,
such as inaccurate labels, data duplication, and in-
consistency, which can significantly affect model per-
formance (Croft et al., 2023). The importance of
dataset diversity and high-quality data was also em-
phasized by other works (Zheng et al., 2021), which
suggested that aggregating multiple datasets could al-
leviate data bias and improve model robustness and
generalization. Recently, (Chakraborty et al., 2024)
addressed critical limitations in existing datasets used
for evaluating deep learning-based vulnerability de-
tection models. They introduce the RealVul dataset,
designed to represent realistic usage scenarios by
including complete codebases rather than focusing
solely on isolated snippets from fixing commits, as
seen in prior datasets like BigVul and SARD (Na-
tional Institute of Standards and Technology, 2025).
They highlight significant discrepancies between the
performance of deep learning models on synthetic or
limited datasets and their practical application in real-
world scenarios. Furthermore, they identify overfit-
ting as a primary issue and propose an augmentation
technique to improve generalization. This study un-
derscores the necessity of realistic datasets for evalu-
ating and improving the robustness of vulnerability
detection models. Our study extends this body of
work by evaluating the impact of real-world vulner-
ability datasets on transformer-based models, exam-
ining the importance of diverse data distributions for
improving vulnerability detection and model generalization through various experimental setups.
4 METHODOLOGY
4.1 Datasets
This section offers a comprehensive overview of the
analyzed datasets, detailing their key characteristics.
4.1.1 BigVul
BigVul (Fan et al., 2020) is a large C/C++ vulner-
ability dataset, collected from open-source GitHub
projects. The BigVul dataset contains the details of CVE
entries from 2002 to 2019 and covers 358 different
projects that are linked to 4,432 unique code commits.
The 4,432 code commits contain the code fixes for
3,754 vulnerabilities in 91 Common Weakness Enu-
meration (CWE) types. We obtained the current ver-
sion of BigVul from the official GitHub repository².
4.1.2 DiverseVul
DiverseVul (Chen et al., 2023) is another large C/C++
vulnerability dataset obtained by the following pro-
cedures: crawling of security issue websites, col-
lection of vulnerability reports, extraction of vulner-
ability fixing commits, and, for each vulnerability,
cloning of the project and extraction of vulnerable
and non-vulnerable source code. The result is a to-
tal of 16,109 vulnerable functions and 311,560 non-
vulnerable functions, extracted from 7,514 commits
and covering 150 CWEs. DiverseVul is one of the
largest vulnerability codebases, built to provide more
quality data for the training of large deep learning
models. We obtained the dataset from the author’s
GitHub repository³. As for BigVul, we performed a cleaning procedure on the data. Each dataset entry provides the CWE feature as a list of zero, one, or more CWE IDs. For simplicity, we removed all entries with more than one CWE ID, converted the remaining single-element lists to plain strings, and discarded all entries without a CWE ID (represented as '[]' in the table). As a result, we removed 45,922 entries. The resulting dataset is composed of 16,109 vulnerable functions and 265,638 non-vulnerable functions.
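As a concrete illustration of this cleaning step, the following is a minimal sketch in Python/pandas; the `cwe` column name and the list serialization format are assumptions, not details taken from the DiverseVul release.

```python
import ast
import pandas as pd

def clean_cwe_entries(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only entries with exactly one CWE ID and store that ID as a plain string."""
    # The hypothetical 'cwe' column may be serialized as text such as "['CWE-119']" or "[]".
    cwe_lists = df["cwe"].apply(
        lambda v: ast.literal_eval(v) if isinstance(v, str) else v
    )
    # Drop entries with zero CWE IDs (i.e., '[]') or with more than one CWE ID.
    mask = cwe_lists.apply(lambda ids: len(ids) == 1)
    cleaned = df[mask].copy()
    cleaned["cwe"] = cwe_lists[mask].apply(lambda ids: ids[0])
    return cleaned
```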
4.1.3 MegaVul
MegaVul (Ni et al., 2024) is a comprehensive and
continually updated dataset developed to support vul-
nerability detection in C/C++ software. MegaVul
aggregates data from the Common Vulnerabilities
and Exposures (CVE) database and associated Git-
based repositories to address the limitations of previ-
ous datasets, including incomplete data, outdated en-
tries, and limited diversity. The dataset encompasses
17,380 vulnerabilities extracted from 992 repositories
and spans 169 vulnerability types, covering the pe-
riod from January 2006 to October 2023. Unlike prior
datasets, MegaVul is designed for continuous updates,
ensuring its relevance in vulnerability research. We
obtained the dataset from the author's GitHub repository⁴.
² https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset
³ https://github.com/wagner-group/diversevul
⁴ https://github.com/Icyrockton/MegaVul
Table 1: Characteristics of the most used vulnerability C/C++ source code datasets compared with our merged dataset.

Dataset        | Language | Origin                                             | Functions | Vulnerable Functions | Vul/notVul ratio | Last update
BigVul         | C/C++    | CVE database (MITRE, 2025)                         | 188,636   | 10,900               | 0.0578           | 2019
DiverseVul     | C/C++    | Security issue websites (Snyk, 2025)(RedHat, 2025) | 281,747   | 16,109               | 0.0572           | 2023
Megavul        | C/C++    | NVD database (NIST, 2025)                          | 353,873   | 17,975               | 0.0508           | 2023
Merged dataset | C/C++    | BigVul + DiverseVul + Megavul                      | 599,991   | 33,585               | 0.0559           | 2023
[Figure 1: Different projects distributions among all the benchmark datasets; panels: (a) BigVul, (b) DiverseVul, (c) MegaVul.]
4.1.4 Merged Dataset
To perform comprehensive experiments, we con-
structed a final dataset by merging BigVul, Diverse-
Vul, and MegaVul, followed by an additional dedupli-
cation process. This deduplication was carried out us-
ing the SHA256 hashing algorithm, which computed
unique hashes for each code snippet. Duplicate en-
tries were identified and removed by eliminating re-
peated hash values. The resulting dataset comprises
approximately 600,000 functions, including a total of
33,585 labeled as vulnerable.
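A minimal sketch of this deduplication step is shown below, assuming the merged data sits in a pandas DataFrame whose `func` column holds the function source code (the column name is hypothetical):

```python
import hashlib
import pandas as pd

def deduplicate_functions(df: pd.DataFrame, code_column: str = "func") -> pd.DataFrame:
    """Drop duplicate code snippets by comparing SHA-256 hashes of the source text."""
    hashes = df[code_column].apply(
        lambda code: hashlib.sha256(code.encode("utf-8")).hexdigest()
    )
    # Keep only the first occurrence of each unique hash value.
    return df.loc[~hashes.duplicated()].copy()

# Hypothetical usage: concatenate the three datasets, then remove repeated snippets.
# merged = deduplicate_functions(pd.concat([bigvul, diversevul, megavul], ignore_index=True))
```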
Table 1 reports the characteristics of the previously mentioned datasets.
4.2 Model
We used a RoBERTa-like model designed for work-
ing with source code, namely CodeBERTa (Hugging-
Face, 2024), trained on the CodeSearchNet dataset
(Husain et al., 2019) from GitHub. Since the tok-
enizer is optimized for code rather than natural lan-
guage, it represents the data more efficiently, reduc-
ing the length of the sequences by 33% to 50% com-
pared to standard tokenizers such as those for GPT-2
or RoBERTa. The model itself is relatively compact,
with six layers and 84 million parameters, similar in
size to DistilBERT. It was trained from scratch on the
entire dataset for five epochs.
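For reference, the checkpoint named on the model card can be loaded for binary classification with the HuggingFace transformers library as in the minimal sketch below; the classification head is freshly initialized for fine-tuning, and the example C function is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "huggingface/CodeBERTa-small-v1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Two labels: vulnerable vs. non-vulnerable function.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a C function; sequences are truncated or padded to a fixed length of 512 tokens.
code = "int copy(char *dst, const char *src) { strcpy(dst, src); return 0; }"
inputs = tokenizer(code, truncation=True, padding="max_length",
                   max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 2)
```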
4.3 Evaluation Metrics
Considering the high imbalance in all three
datasets, the Area Under the Receiver Operating
Characteristic Curve (AUC) is preferred over accu-
racy, providing a more comprehensive and nuanced
evaluation of a model’s performance. Alongside
AUC, precision, recall, and F1-score are used to
assess specific aspects of performance.
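These metrics can be computed as in the following sketch with scikit-learn, assuming `y_true` holds the ground-truth labels and `y_score` the predicted probabilities for the vulnerable class (variable names and the 0.5 threshold are illustrative assumptions):

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score,
)

def evaluate(y_true, y_score, threshold: float = 0.5) -> dict:
    """Compute AUC from raw scores and the remaining metrics from thresholded predictions."""
    y_pred = [int(s >= threshold) for s in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
    }
```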
5 EXPERIMENTS SETTING
The experiments were conducted to evaluate vulnera-
bility detection in source code as a binary classifica-
tion task across various generalization scenarios:
Random Split Benchmarking: Training, valida-
tion, and testing were performed on randomly
split subsets of the dataset to establish baseline
performance.
Project Generalization: Models were trained and
validated on 90% of the source code from specific
projects and tested on entirely different projects
not included in the training set. This setup examines the model's ability to generalize across unseen projects, as prior studies indicate performance drops in such scenarios (minimal sketches of this split and of the CWE-based split follow this list).
Vulnerability Type Generalization: Training and
validation were conducted on functions associated
with vulnerabilities outside the MITRE Top 25
most dangerous CWEs. Testing was performed on
data within the Top 25, focusing on how dataset
vulnerability distributions impact model perfor-
mance when detecting high-priority unseen vul-
nerabilities.
Cross-Dataset Evaluation: Training and validation were performed on one dataset, followed by testing on a combined dataset from the other datasets included in the study.

[Figure 2: Different CWE IDs distributions among all the benchmark datasets; panels: (a) BigVul, (b) DiverseVul, (c) MegaVul.]
[Figure 3: Representation of the number of functions belonging to the MITRE top 25 most dangerous CWEs; panels: (a) BigVul, (b) DiverseVul, (c) MegaVul.]
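The project-wise and CWE-based splits described in the list above can be sketched as follows; this is a minimal illustration in which the `project` and `cwe` column names, the list of held-out projects, and the internal train/validation proportion are assumptions rather than details taken from the experimental setup.

```python
import pandas as pd

def project_generalization_split(df: pd.DataFrame, held_out_projects, seed: int = 42):
    """Train/validate on 90% of the code from the remaining projects; test on unseen projects."""
    test = df[df["project"].isin(held_out_projects)]
    rest = df[~df["project"].isin(held_out_projects)].sample(frac=0.9, random_state=seed)
    n_val = int(0.1 * len(rest))  # validation share within the 90% is hypothetical
    return rest.iloc[n_val:], rest.iloc[:n_val], test

def cwe_generalization_split(df: pd.DataFrame, top25_cwes, seed: int = 42):
    """Train/validate outside the MITRE Top 25 CWEs; test on functions labeled with them."""
    test = df[df["cwe"].isin(top25_cwes)]
    rest = df[~df["cwe"].isin(top25_cwes)].sample(frac=1.0, random_state=seed)
    n_val = int(0.1 * len(rest))
    return rest.iloc[n_val:], rest.iloc[:n_val], test
```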
These settings assess model performance under
diverse and realistic conditions, emphasizing gen-
eralization across projects, vulnerability types, and
datasets. For the hyperparameters, we chose a learning rate of 2×10⁻⁵, with AdamW as the optimizer and a sequence length of 512 tokens for every model. Training was performed on an NVIDIA RTX A6000 GPU.
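Under these settings, fine-tuning can be driven by the transformers Trainer as in the sketch below; only the learning rate, the AdamW optimizer (the Trainer default), and the 512-token sequence length come from the setup above, while the batch size, epoch count, and dataset objects are hypothetical placeholders.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="codeberta-vuln",
    learning_rate=2e-5,               # 2x10^-5, as in the experimental setup
    per_device_train_batch_size=16,   # hypothetical; not specified here
    num_train_epochs=5,               # hypothetical; not specified here
)

trainer = Trainer(
    model=model,                  # the CodeBERTa classifier from Section 4.2
    args=training_args,
    train_dataset=train_dataset,  # tokenized training split (hypothetical object)
    eval_dataset=val_dataset,     # tokenized validation split (hypothetical object)
)
trainer.train()
```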
6 RESULTS AND DISCUSSION
The experimental results obtained from the three
state-of-the-art datasets are presented in Table 2. Ta-
ble 3 summarizes the performance achieved on the
merged dataset across the three analyzed scenarios.
Finally, Table 4 illustrates the model’s generalization
capability when trained on the merged dataset and
evaluated on previously unseen data from the three
original datasets.
6.1 The Good
CodeBERTa has shown excellent performance under
favorable conditions, particularly when trained and
tested on random splits of the same dataset. This sce-
nario, often used as a baseline, allows the model to
benefit from shared patterns and characteristics within
the data. For example, on the BigVul dataset, the
model achieved an impressive AUC of 92.24, coupled
with a Precision of 81.22 and an F1-score of 83.39.
These results demonstrate the model’s capacity to ef-
fectively detect vulnerabilities when the distribution
of the training and test data is consistent. Even on a
more challenging dataset like Megavul, where diver-
sity and noise are higher, the model maintained solid
performance, with an AUC of 71.38 and an F1-score
of 49.81, showing its robustness under less ideal con-
ditions.
Beyond single-dataset experiments, combining
the three datasets (BigVul, DiverseVul, and Megavul)
into a unified training set further enhanced the
model’s performance, particularly on datasets that ini-
tially posed challenges. The DiverseVul dataset is
a notable example: by merging data and removing
duplicates, the model’s F1-score increased by 6.18
points, rising from 28.5 to 34.68. This improvement
highlights the potential benefits of addressing data
sparsity and increasing the diversity of training sam-
ples. Combining datasets, therefore, emerges as a
promising strategy for handling underrepresented vul-
nerabilities and improving overall model robustness.
Table 2: CodeBERTa performance on the different tested use cases. The values in parentheses indicate the metrics' differences with respect to the "Random Splits" case, i.e., the case in which the model is tested on a random split of the base dataset.

Dataset    | AUC            | Precision      | Recall         | F1-score       | FPR          | FNR

Random Splits
BigVul     | 92.24          | 81.22          | 85.69          | 83.39          | 1.22         | 14.31
DiverseVul | 66.28          | 21.65          | 41.71          | 28.5           | 9.16         | 58.29
Megavul    | 71.38          | 56.4           | 44.61          | 49.81          | 1.85         | 55.39

Unseen Projects
BigVul     | 86 (-6.24)     | 88.7 (+7.48)   | 72.65 (-13.04) | 79.88 (-3.51)  | 0.64 (-0.58) | 27.35 (+13.04)
DiverseVul | 53.39 (-12.67) | 15.71 (-9.02)  | 10.79 (-27.32) | 12.79 (-17.21) | 4.01 (-1.99) | 8.35 (-53.54)
Megavul    | 58.03 (-13.35) | 39.46 (-16.94) | 18.54 (-26.07) | 25.22 (-24.59) | 2.47 (+0.62) | 81.46 (+26.07)

Unseen CWEs
BigVul     | 88.45 (-3.79)  | 84.4 (+3.18)   | 77.86 (-7.83)  | 81 (-2.39)     | 0.97 (-0.25) | 22.14 (+7.83)
DiverseVul | 56.59 (-9.69)  | 17.44 (-4.21)  | 18.67 (-24.04) | 18.03 (-10.47) | 5.5 (-3.66)  | 81.33 (+23.04)
Megavul    | 59.4 (-11.98)  | 29.68 (-26.72) | 21.77 (-22.84) | 25.12 (-24.69) | 2.96 (+1.11) | 78.23 (+22.84)

Unseen Datasets
BigVul     | 53.94 (-38.3)  | 39.47 (-41.75) | 8.63 (-77.06)  | 14.16 (-69.23) | 0.75 (-0.47) | 91.37 (+77.06)
DiverseVul | 69.23 (+2.95)  | 53.85 (+32.2)  | 40.42 (-1.29)  | 46.18 (+17.68) | 1.95 (-7.21) | 59.58 (-7.21)
Megavul    | 72.58 (+1.2)   | 30.75 (-25.65) | 52.35 (+7.74)  | 38.75 (-11.06) | 7.18 (+5.33) | 47.65 (-7.74)
Table 3: CodeBERTa performance when trained with all the data on three test cases.

Setting         | AUC            | Precision      | Recall         | F1-score       | FPR          | FNR
Standard splits | 80.16          | 46.12          | 64.8           | 53.89          | 4.49         | 35.2
Unseen Projects | 58.9 (-21.26)  | 35.1 (-11.02)  | 20.84 (-43.96) | 26.16 (-27.73) | 3.03 (-1.46) | 79.16 (+43.96)
Unseen CWEs     | 66.18 (-13.98) | 32.18 (-13.94) | 37.15 (-27.65) | 34.48 (-19.41) | 4.8 (+0.31)  | 62.85 (+27.65)
6.2 The Bad
Despite these strengths, the model exhibited signif-
icant weaknesses when tested on unseen projects.
These tests aimed to evaluate CodeBERTa’s ability
to generalize beyond the specific patterns and styles
of the training data. The results, however, revealed
limitations. On the BigVul dataset, the F1-score
decreased by 3.51 points, dropping from 83.39 to
79.88. While the decline may seem moderate, it re-
flects a diminished ability to handle new coding styles
or project-specific conventions. For DiverseVul, the
drop was much more severe: the F1-score fell by
17.21 points, plunging from 28.5 to 12.79. This sug-
gests that the model struggles to adapt to the unique
characteristics of different projects, which can vary
widely in structure and context.
Testing on vulnerabilities with unseen CWEs re-
vealed a similar trend. Vulnerabilities in this set-
ting often involve patterns or characteristics that the
model has not encountered before. Consequently, the
performance declined sharply across metrics. In the
Megavul dataset, for instance, the AUC dropped to
59.4, while the F1-score decreased by 24.69 points.
These findings underline the model’s limited capac-
ity to extrapolate from known vulnerabilities to new
or rare ones, which poses a significant challenge for
real-world applications where unknown vulnerability
types frequently arise.
6.3 The Ugly
The most critical shortcomings of CodeBERTa’s
performance were exposed in cross-dataset testing,
where the model was trained on one dataset and tested
on entirely unseen datasets. This scenario simulates
the real-world challenge of deploying a vulnerability
detection system trained on one context to operate ef-
fectively in a completely different one. The results
were stark. When trained on BigVul and tested on the combined data from the other datasets, the F1-score plummeted by 69.23 points, dropping from 83.39 to a dismal 14.16. Sim-
ilarly, for DiverseVul, although some metrics showed
minor improvements, key indicators like Recall re-
mained critically low, indicating that the model could
not adequately identify vulnerabilities in this setting.
These results point to a deeper issue: the strong in-
fluence of dataset-specific biases. Each dataset carries
unique characteristics, ranging from labeling strategies to the types of vulnerabilities represented, that the model internalizes during training. While this
specificity can boost performance within a dataset,
it severely limits generalization across datasets. For
instance, while the combined dataset approach im-
proved the model’s performance on DiverseVul, it
had only marginal benefits—or even detrimental ef-
fects—for other datasets like Megavul, where the F1-
score saw a slight decline.
This variability underscores the challenges of har-
monizing diverse datasets, which often suffer from in-
consistencies, noise, and differing definitions of what
constitutes a vulnerability. The poor cross-dataset
performance also raises concerns about real-world ap-
plicability, where systems must handle diverse and
unseen data without the luxury of retraining or fine-
tuning on every new context.
Table 4: CodeBERTa results when tested on unseen data from the three datasets.

Train/Val Dataset | Test Dataset | Previous F1-score | New F1-score | Diff.
Big+Diverse+Mega  | BigVul       | 83.39             | 83.11        | -0.28
Big+Diverse+Mega  | DiverseVul   | 28.5              | 34.68        | +6.18
Big+Diverse+Mega  | MegaVul      | 49.81             | 49.43        | -0.38
6.4 Discussion
Overall, the evaluation of CodeBERTa reveals that
the model excels in scenarios where training and
testing datasets share similar characteristics, particu-
larly in terms of the same project or similar vulner-
ability distributions. However, significant challenges
arise when faced with unseen projects, vulnerability
types, or datasets with varied distributions. The re-
sults emphasize the importance of high-quality, di-
verse datasets to improve the generalizability of deep
learning models in vulnerability detection tasks.
While combining datasets leads to improved per-
formance, the limitations observed in cross-dataset
testing highlight the need for more sophisticated data
aggregation techniques. Current methods, which pri-
marily focus on concatenation and deduplication, fail
to fully address inconsistencies in labeling, represen-
tation, and the types of vulnerabilities covered. Tech-
niques such as data augmentation, domain adapta-
tion, and representation learning could help harmo-
nize diverse datasets and reduce dataset-specific bi-
ases. Such techniques are critical for developing mod-
els capable of performing robustly in real-world ap-
plications, where the data are heterogeneous, noisy,
and constantly evolving. Future research should focus
on integrating these approaches into the data prepara-
tion pipeline to enhance the generalization capabili-
ties of transformer-based models in vulnerability de-
tection.
7 CONCLUSIONS
This study assessed the impact of dataset quality and
diversity on machine learning models for vulnerabil-
ity detection in source code. By evaluating model
performance across unseen projects, new vulnerabil-
ity types (CWEs), and diverse dataset distributions,
we demonstrated that data quality is crucial for the
reliability and generalization of these models. The
results show that while high-quality datasets enable
strong performance in controlled environments, sig-
nificant declines occur in generalization scenarios, es-
pecially when models encounter new projects or vul-
nerabilities. This highlights the limitations of current
datasets, which lack the diversity needed to represent
real-world complexity. A key finding is that aggregat-
ing datasets enhances model performance. Combin-
ing multiple real-world datasets and removing dupli-
cates improved generalization, showing that diverse
data distributions mitigate biases. However, simply
increasing dataset size is not enough—varied and rep-
resentative examples are crucial to adapting to differ-
ent coding styles and vulnerability patterns. Chal-
lenges such as imbalanced vulnerability types, inac-
curate labels, and limited real-world variability per-
sist. Overcoming these issues requires better dataset
curation, improved labeling, and the inclusion of a
broader range of vulnerabilities. While synthetic
datasets can be useful, they must be designed to re-
flect real-world complexities. Ultimately, the future
of automated vulnerability detection depends more on
improving data quality than refining model architec-
tures. Focusing on diverse, accurate, and representa-
tive datasets will lead to more reliable and generaliz-
able solutions, advancing software security.
REFERENCES
Chakraborty, P., Arumugam, K. K., Alfadel, M., Nagappan,
M., and McIntosh, S. (2024). Revisiting the Perfor-
mance of Deep Learning-Based Vulnerability Detec-
tion on Realistic Datasets. IEEE Transactions on Software Engineering, 50(8):2163–2177.
Chakraborty, S., Ahmed, T., Ding, Y., Devanbu, P. T., and
Ray, B. (2022a). NatGen: Generative pre-training by
“naturalizing” source code. In Proceedings of the 30th
ACM Joint European Software Engineering Confer-
ence and Symposium on the Foundations of Software
Engineering, pages 18–30. ACM.
Chakraborty, S., Krishna, R., Ding, Y., and Ray, B. (2022b).
Deep learning based vulnerability detection: Are we there yet? IEEE Transactions on Software Engineering, 48(9):3280–3296.
Chen, Y., Ding, Z., Alowain, L., Chen, X., and Wagner, D.
(2023). DiverseVul: A New Vulnerable Source Code
Dataset for Deep Learning Based Vulnerability Detec-
tion. In Proceedings of the 26th International Sympo-
sium on Research in Attacks, Intrusions and Defenses,
pages 654–668. ACM.
Cheng, X., Wang, H., Hua, J., Xu, G., and Sui, Y.
(2021). DeepWukong: Statically Detecting Software
Vulnerabilities Using Deep Graph Neural Network.
ACM Transactions on Software Engineering and Methodology, 30(3):38:1–38:33.
Croft, R., Babar, M. A., and Kholoosi, M. M. (2023). Data
Quality for Software Vulnerability Datasets. In 2023
IEEE/ACM 45th International Conference on Soft-
ware Engineering (ICSE), pages 121–133.
Curto, C., Giordano, D., Indelicato, D. G., and Patatu, V.
(2024a). Can a Llama Be a Watchdog? Exploring
Llama 3 and Code Llama for Static Application Secu-
rity Testing. In 2024 IEEE International Conference
on Cyber Security and Resilience (CSR), pages 395–
400.
Curto, C., Giordano, D., Palazzo, S., and Indelicato,
D. (2024b). MultiVD: A Transformer-based Mul-
titask Approach for Software Vulnerability Detec-
tion. In Proceedings of the 21st International Confer-
ence on Security and Cryptography, pages 416–423.
SCITEPRESS - Science and Technology Publications.
Fan, J., Li, Y., Wang, S., and Nguyen, T. N. (2020). A C/C++ code vulnerability dataset with code changes and CVE summaries. In IEEE/ACM 17th International
Conference on Mining Software Repositories (MSR),
pages 508–512. ACM.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M.,
Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M.
(2020). CodeBERT: A pre-trained model for programming and natural languages. Findings of EMNLP.
Fu, M. and Tantithamthavorn, C. (2022). Linevul: A
transformer-based line-level vulnerability prediction.
In 2022 IEEE/ACM 19th International Conference on
Mining Software Repositories (MSR). IEEE.
Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou,
L., Duan, N., Svyatkovskiy, A., Fu, S., Tufano, M.,
Deng, S. K., Clement, C., Drain, D., Sundaresan, N.,
Yin, J., Jiang, D., and Zhou, M. (2020). GraphCode-
BERT: Pre-training Code Representations with Data
Flow.
Hin, D., Kan, A., Chen, H., and Babar, M. A. (2022).
LineVD: Statement-level vulnerability detection us-
ing graph neural networks. In Proceedings of the 19th
International Conference on Mining Software Reposi-
tories, pages 596–607. ACM.
HuggingFace (2024). CodeBERTa. https://huggingface.co/huggingface/CodeBERTa-small-v1. Accessed: 2024-09.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and
Brockschmidt, M. (2019). CodeSearchNet Chal-
lenge: Evaluating the State of Semantic Code Search.
arXiv:1909.09436 [cs, stat].
Jiang, X., Wu, L., Sun, S., Li, J., Xue, J., Wang, Y., Wu,
T., and Liu, M. (2025). Investigating Large Language
Models for Code Vulnerability Detection: An Experi-
mental Study.
Mamede, C., Pinconschi, E., Abreu, R., and Campos, J.
(2022). Exploring transformers for multi-label clas-
sification of Java vulnerabilities. In 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), pages 43–52.
MITRE (2025). CVE published by year. https://www.cvedetails.com/browse-by-date.php. Accessed: 2025-01.
National Institute of Standards and Technology (2025). NIST Software Assurance Reference Dataset. https://samate.nist.gov/SARD. Accessed: 2025-01.
Nguyen, V.-A., Nguyen, D. Q., Nguyen, V., Le, T., Tran,
Q. H., and Phung, D. (2022). ReGVD: Revisit-
ing graph neural networks for vulnerability detec-
tion. In Proceedings of the ACM/IEEE 44th Interna-
tional Conference on Software Engineering: Compan-
ion Proceedings, ICSE ’22, pages 178–182. Associa-
tion for Computing Machinery.
Ni, C., Shen, L., Yang, X., Zhu, Y., and Wang, S. (2024).
MegaVul: A C/C++ Vulnerability Dataset with Com-
prehensive Code Representations. In 2024 IEEE/ACM
21st International Conference on Mining Software
Repositories (MSR), pages 738–742.
NIST (2025). NVD database. https://nvd.nist.gov/. Ac-
cessed: 2025-01.
RedHat (2025). Red Hat Bugzilla website. https://bugzilla.redhat.com/. Accessed: 2025-01.
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez, A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., and Synnaeve, G. (2023). Code Llama: Open Foundation Models for Code.
Snyk (2025). Snyk website. https://snyk.io/. Accessed:
2025-01.
Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J.
(2022). A systematic evaluation of large language
models of code. In Proceedings of the 6th ACM
SIGPLAN International Symposium on Machine Pro-
gramming, MAPS 2022, pages 1–10. Association for
Computing Machinery.
Zheng, Y., Pujar, S., Lewis, B., Buratti, L., Epstein, E.,
Yang, B., Laredo, J., Morari, A., and Su, Z. (2021).
D2A: A Dataset Built for AI-Based Vulnerability De-
tection Methods Using Differential Analysis. In 2021
IEEE/ACM 43rd International Conference on Soft-
ware Engineering: Software Engineering in Practice
(ICSE-SEIP), pages 111–120.
Zhou, Y., Liu, S., Siow, J., Du, X., and Liu, Y. (2019). De-
vign: Effective Vulnerability Identification by Learn-
ing Comprehensive Program Semantics via Graph
Neural Networks. In Advances in Neural Information
Processing Systems, volume 32. Curran Associates,
Inc.