Authors:
David Álvarez-Fidalgo (1) and Francisco Ortin (1, 2)
Affiliations:
(1) Computer Science Department, University of Oviedo, c/Calvo Sotelo 18, Oviedo, Spain
(2) Computer Science Department, Munster Technological University, Rossa Avenue, Bishopstown, Cork, Ireland
Keyword(s):
Source Code Authorship Attribution, Code Stylometry Embeddings, CLAVE, Machine Learning.
Abstract:
Source code authorship attribution or identification is used in the fields of cybersecurity, forensic investigations, and intellectual property protection. Code stylometry reveals differences in programming styles, such as variable naming conventions, comments, and control structures. Authorship verification, which differs from attribution, determines whether two code samples were written by the same author, often using code stylometry to distinguish between programmers. In this paper, we explore the benefits of using CLAVE, a contrastive learning-based authorship verification model, for Python authorship attribution with minimal training data. We develop an attribution system utilizing CLAVE stylometry embeddings and train an SVM classifier with just six Python source files per programmer, achieving 0.923 accuracy for 85 programmers, outperforming state-of-the-art deep learning models for Python authorship attribution. Our approach enhances CLAVE’s performance for authorship attribution by reducing the classification error by 45.4%. Additionally, the proposed method requires significantly lower CPU and memory resources than deep learning classifiers, making it suitable for resource-constrained environments and enabling rapid retraining when new programmers or code samples are introduced. These findings show that CLAVE stylometric representations provide an efficient, scalable, and high-performance solution for Python source code authorship attribution.
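The relationship between the two figures quoted above (0.923 accuracy and a 45.4% error reduction) can be made concrete with a short calculation. This is only a back-of-envelope sketch: it assumes that the 45.4% is a relative reduction, measured against CLAVE's own attribution error in the same 85-programmer setting. The implied baseline it prints is derived under that assumption and is not a figure taken from the paper; the helper name `baseline_error` is hypothetical.

```python
def baseline_error(new_error: float, relative_reduction: float) -> float:
    """Return the error before an improvement, given the error after it
    and the relative reduction (new = old * (1 - reduction))."""
    return new_error / (1.0 - relative_reduction)


# Figures quoted in the abstract.
accuracy = 0.923
relative_reduction = 0.454

# Classification error of the SVM trained on CLAVE embeddings.
new_error = 1.0 - accuracy

# ASSUMPTION: the reduction is relative to CLAVE's own attribution error
# on the same task; the two values below are implied, not reported.
implied_baseline_error = baseline_error(new_error, relative_reduction)
implied_baseline_accuracy = 1.0 - implied_baseline_error

print(f"error with SVM on embeddings: {new_error:.3f}")
print(f"implied baseline error:       {implied_baseline_error:.3f}")
print(f"implied baseline accuracy:    {implied_baseline_accuracy:.3f}")
```

Under this reading, a 45.4% reduction means the remaining error is 54.6% of the baseline error, which is why the function divides by `1 - relative_reduction` rather than subtracting percentage points.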