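Title: Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection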

Authors: Lucas B. Germano; Lincoln Q. Vieira; Ronaldo Goldschmidt; Julio Cesar Duarte and Ricardo Choren

Affiliation: Military Institute of Engineering, Brazil

Keyword(s): Data Preprocessing, Deep Learning, Large Language Models, Synthetic Vulnerability Dataset, Vulnerability Detection.

Abstract: Software security ensures data privacy and system reliability. Vulnerabilities in the development cycle can lead to privilege escalation, causing data exfiltration or denial-of-service attacks. Static code analyzers, based on predefined rules, often fail to detect errors beyond these patterns and suffer from high false-positive rates, making rule creation labor-intensive. Machine learning offers a flexible alternative, which can use extensive datasets of real and synthetic vulnerability data. This study examines the impact of bias in synthetic datasets on model training. Using CodeBERT for C/C++ vulnerability classification, we compare models trained on biased and unbiased data, incorporating overlooked preprocessing steps to remove biases. Results show that the unbiased model achieves 98.5% accuracy, compared to 63.0% for the biased model, emphasizing the critical need to address dataset biases in training.

License: CC BY-NC-ND 4.0

Paper citation in several formats:
Germano, L. B., Vieira, L. Q., Goldschmidt, R., Duarte, J. C. and Choren, R. (2025). Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5; ISSN 2184-433X, SciTePress, pages 504-511. DOI: 10.5220/0013156800003890

@conference{icaart25,
author={Lucas B. Germano and Lincoln Q. Vieira and Ronaldo Goldschmidt and Julio Cesar Duarte and Ricardo Choren},
title={Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={504-511},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013156800003890},
isbn={978-989-758-737-5},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Evaluating Biased Synthetic Data Effects on Large Language Model-Based Software Vulnerability Detection
SN - 978-989-758-737-5
IS - 2184-433X
AU - Germano, L.
AU - Vieira, L.
AU - Goldschmidt, R.
AU - Duarte, J.
AU - Choren, R.
PY - 2025
SP - 504
EP - 511
DO - 10.5220/0013156800003890
PB - SciTePress