On the Use of Discretized Source Code Metrics for Author Identification
Conference proceeding   Open access

Maxim Shevertalov, Jay Kothari, Edward Stehle and Spiros Mancoridis
1ST INTERNATIONAL SYMPOSIUM ON SEARCH BASED SOFTWARE ENGINEERING, PROCEEDINGS
01 Jan 2009
url
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.2833

Abstract

Intellectual property infringement and plagiarism litigation involving source code would be more easily resolved using code authorship identification tools. Previous efforts in this area have demonstrated the potential of determining the authorship of a disputed piece of source code automatically. This was achieved by using source code metrics to build a database of developer profiles, thus characterizing a population of developers. These profiles were then used to determine the likelihood that the unidentified source code was authored by a given developer. In this paper, we evaluate the effect of discretizing source code metrics for use in building developer profiles. It is well known that machine learning techniques perform better with categorical variables than with continuous ones. We present a genetic algorithm that discretizes metrics to improve source-code-to-author classification. We evaluate the approach with a case study involving 20 open source developers and over 750,000 lines of Java source code.

Metrics

10 Record Views
28 citations in Scopus

Details

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas
Computer Science, Software Engineering
Engineering, Electrical & Electronic