SUPPORTING THE CYBERCRIME INVESTIGATION PROCESS: EFFECTIVE DISCRIMINATION OF SOURCE CODE AUTHORS BASED ON BYTE-LEVEL INFORMATION

Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis

Abstract

Source code authorship analysis is the field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. The identification is usually based on other undisputed program samples from the same author. Such a method can be of major benefit in several cases, such as tracing the source of code left on a system after a cyber attack, resolving authorship disputes, and proving authorship in court. In this paper, we present an approach based on byte-level n-gram profiles that extends a method successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure that is less complicated than the algorithm used in text authorship attribution and seems more suitable for source code identification, since it is better able to deal with very small training sets. Experiments were performed on two data sets, one with programs written in C++ and the other with programs written in Java. Unlike the traditional language-dependent metrics used in previous studies, our approach can be applied to any programming language at no additional cost. The reported accuracy rates are much better than the best previously reported results for the same data sets.
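As a rough illustration of the idea, the sketch below builds a byte-level n-gram profile for each author and attributes an unknown program to the author whose profile it overlaps most. The abstract does not define the simplified profile or the similarity measure, so the choices here are assumptions: the profile is taken to be the set of the most frequent byte n-grams, and similarity is taken to be the size of the intersection of two profiles. The function names, the n-gram length `n=3`, and `profile_size=1500` are hypothetical values chosen for the example, not parameters taken from the paper.

```python
from collections import Counter
from typing import Dict, Set


def byte_ngram_profile(source: bytes, n: int = 3, profile_size: int = 1500) -> Set[bytes]:
    """Return the set of the `profile_size` most frequent byte n-grams.

    Assumption: the "simplified profile" keeps only which n-grams occur
    most often, not their frequencies.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def profile_similarity(a: Set[bytes], b: Set[bytes]) -> int:
    """Assumed similarity measure: number of n-grams shared by both profiles."""
    return len(a & b)


def attribute(unknown: bytes, training: Dict[str, bytes],
              n: int = 3, profile_size: int = 1500) -> str:
    """Return the candidate author whose profile best matches `unknown`.

    `training` maps an author name to the concatenated bytes of that
    author's undisputed program samples.
    """
    target = byte_ngram_profile(unknown, n, profile_size)
    scores = {
        author: profile_similarity(target, byte_ngram_profile(code, n, profile_size))
        for author, code in training.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy samples, invented for the example.
    training = {
        "alice": b"for (int i = 0; i < n; ++i) { total += values[i]; }\n",
        "bob": b"int i=0;\nwhile(i<n)\n{\n  total=total+values[i];\n  i=i+1;\n}\n",
    }
    unknown = b"for (int j = 0; j < m; ++j) { sum += data[j]; }\n"
    print(attribute(unknown, training))
```

Because the profile works on raw bytes, nothing in the sketch depends on the programming language of the samples, which matches the language-independence claimed in the abstract. It also captures layout habits such as spacing and brace placement along with identifiers and keywords.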



Paper Citation


in Harvard Style

Frantzeskou G., Stamatatos E. and Gritzalis S. (2005). SUPPORTING THE CYBERCRIME INVESTIGATION PROCESS: EFFECTIVE DISCRIMINATION OF SOURCE CODE AUTHORS BASED ON BYTE-LEVEL INFORMATION. In Proceedings of the Second International Conference on e-Business and Telecommunication Networks - Volume 1: ICETE, ISBN 972-8865-32-5, pages 283-290. DOI: 10.5220/0001414902830290


in Bibtex Style

@conference{icete05,
author={Georgia Frantzeskou and Efstathios Stamatatos and Stefanos Gritzalis},
title={SUPPORTING THE CYBERCRIME INVESTIGATION PROCESS: EFFECTIVE DISCRIMINATION OF SOURCE CODE AUTHORS BASED ON BYTE-LEVEL INFORMATION},
booktitle={Proceedings of the Second International Conference on e-Business and Telecommunication Networks - Volume 1: ICETE},
year={2005},
pages={283-290},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001414902830290},
isbn={972-8865-32-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Second International Conference on e-Business and Telecommunication Networks - Volume 1: ICETE
TI - SUPPORTING THE CYBERCRIME INVESTIGATION PROCESS: EFFECTIVE DISCRIMINATION OF SOURCE CODE AUTHORS BASED ON BYTE-LEVEL INFORMATION
SN - 972-8865-32-5
AU - Frantzeskou G.
AU - Stamatatos E.
AU - Gritzalis S.
PY - 2005
SP - 283
EP - 290
DO - 10.5220/0001414902830290