Efficient Source Code Authorship Attribution Using Code Stylometry Embeddings

David Álvarez-Fidalgo, Francisco Ortin

2025

Abstract

Source code authorship attribution or identification is used in the fields of cybersecurity, forensic investigations, and intellectual property protection. Code stylometry reveals differences in programming styles, such as variable naming conventions, comments, and control structures. Authorship verification, which differs from attribution, determines whether two code samples were written by the same author, often using code stylometry to distinguish between programmers. In this paper, we explore the benefits of using CLAVE, a contrastive learning-based authorship verification model, for Python authorship attribution with minimal training data. We develop an attribution system utilizing CLAVE stylometry embeddings and train an SVM classifier with just six Python source files per programmer, achieving 0.923 accuracy for 85 programmers, outperforming state-of-the-art deep learning models for Python authorship attribution. Our approach enhances CLAVE’s performance for authorship attribution by reducing the classification error by 45.4%. Additionally, the proposed method requires significantly lower CPU and memory resources than deep learning classifiers, making it suitable for resource-constrained environments and enabling rapid retraining when new programmers or code samples are introduced. These findings show that CLAVE stylometric representations provide an efficient, scalable, and high-performance solution for Python source code authorship attribution.
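The pipeline described in the abstract (a fixed-length stylometry embedding per source file, then an SVM trained on six files per programmer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the CLAVE encoder is not reproduced here, so synthetic vectors stand in for its embeddings, and the embedding dimension, noise level, number of test files, linear kernel, and feature scaling are all assumptions chosen for the example. Only the 85 programmers and six training files per programmer come from the abstract.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_AUTHORS = 85     # from the abstract
TRAIN_FILES = 6    # from the abstract: six source files per programmer
TEST_FILES = 2     # hypothetical held-out files per programmer
EMBED_DIM = 64     # hypothetical; the real CLAVE embedding size is not stated here


def fake_embeddings(rng, centroids, files_per_author, noise=0.3):
    """Stand-in for CLAVE: one noisy vector per file around an author centroid."""
    labels = np.repeat(np.arange(len(centroids)), files_per_author)
    jitter = noise * rng.standard_normal((len(labels), centroids.shape[1]))
    return centroids[labels] + jitter, labels


rng = np.random.default_rng(0)
# Each synthetic "author" gets a random style centroid in embedding space.
centroids = rng.standard_normal((N_AUTHORS, EMBED_DIM))
X_train, y_train = fake_embeddings(rng, centroids, TRAIN_FILES)
X_test, y_test = fake_embeddings(rng, centroids, TEST_FILES)

# Assumed classifier setup: standardized features and a linear-kernel SVM.
classifier = make_pipeline(StandardScaler(), SVC(kernel="linear"))
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
print(f"accuracy on synthetic embeddings: {accuracy:.3f}")
```

Because the embedding model stays frozen and only the small classifier is fitted, adding a programmer or new samples in a setup like this means refitting the SVM on a few hundred vectors, which is consistent with the low-resource, rapid-retraining claim in the abstract. The accuracy printed here reflects only the synthetic data and says nothing about the paper's reported results.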

Paper Citation


in Harvard Style

Álvarez-Fidalgo D. and Ortin F. (2025). Efficient Source Code Authorship Attribution Using Code Stylometry Embeddings. In Proceedings of the 20th International Conference on Software Technologies - Volume 1: ICSOFT; ISBN 978-989-758-757-3, SciTePress, pages 167-177. DOI: 10.5220/0013559800003964


in Bibtex Style

@conference{icsoft25,
author={David Álvarez-Fidalgo and Francisco Ortin},
title={Efficient Source Code Authorship Attribution Using Code Stylometry Embeddings},
booktitle={Proceedings of the 20th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2025},
pages={167-177},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013559800003964},
isbn={978-989-758-757-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Conference on Software Technologies - Volume 1: ICSOFT
TI - Efficient Source Code Authorship Attribution Using Code Stylometry Embeddings
SN - 978-989-758-757-3
AU - Álvarez-Fidalgo D.
AU - Ortin F.
PY - 2025
SP - 167
EP - 177
DO - 10.5220/0013559800003964
PB - SciTePress