Authors:
Lucas B. Germano; Lincoln Q. Vieira; Ronaldo Goldschmidt; Julio Cesar Duarte and Ricardo Choren
Affiliation:
Military Institute of Engineering, Brazil
Keyword(s):
Data Preprocessing, Deep Learning, Large Language Models, Synthetic Vulnerability Dataset, Vulnerability Detection.
Abstract:
Software security ensures data privacy and system reliability. Vulnerabilities introduced during the development cycle can lead to privilege escalation, data exfiltration, or denial-of-service attacks. Static code analyzers, based on predefined rules, often fail to detect errors beyond those patterns and suffer from high false positive rates, while rule creation remains labor-intensive. Machine learning offers a flexible alternative, leveraging extensive datasets of real and synthetic vulnerability samples. This study examines the impact of bias in synthetic datasets on model training. Using CodeBERT for C/C++ vulnerability classification, we compare models trained on biased and unbiased data, incorporating commonly overlooked preprocessing steps to remove biases. Results show that the unbiased model achieves 98.5% accuracy, compared to 63.0% for the biased model, underscoring the critical need to address dataset biases in training.
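As a hedged illustration of the kind of bias-removing preprocessing the abstract alludes to: synthetic vulnerability suites often contain annotation comments near the flawed lines, which a model can learn as a shortcut instead of learning the vulnerability itself. The sketch below (an assumption about the preprocessing, not the authors' exact pipeline) strips C/C++ comments from a sample before it reaches the classifier. The regexes are a simplification and do not handle comment markers inside string literals.

```python
import re

def strip_comments(code: str) -> str:
    """Remove C/C++ block and line comments, a common source of label
    leakage in synthetic vulnerability datasets (e.g. comments that
    explicitly mark the flawed line)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # /* ... */ blocks
    code = re.sub(r"//[^\n]*", "", code)                    # // line comments
    # Drop lines left empty after comment removal.
    return "\n".join(line for line in code.splitlines() if line.strip())

sample = """int main() {
    char buf[8];
    /* POTENTIAL FLAW: no bounds check */
    strcpy(buf, input);  // overflow here
    return 0;
}"""
print(strip_comments(sample))
```

After this step, the cleaned sample retains the code the model should learn from (`strcpy` into a fixed-size buffer) while the telltale annotations are gone, so the label can no longer be predicted from comment text alone.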