Conference proceeding
Exploring Paraphrasing Techniques on Formal Language for Generating Semantics Preserving Source Code Transformations
2020 IEEE 14TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC 2020), pp 242-248
01 Jan 2020
Abstract
Automatically identifying and generating content that is semantically equivalent to a word, phrase, or sentence is an important part of natural language processing (NLP). Research on paraphrases in NLP has so far focused exclusively on textual data, but it has significant potential when applied to formal languages such as source code. In this paper, we present a novel technique for generating source code transformations via the use of paraphrases. We explore how to extract and validate source code paraphrases. The transformations can be used for stylometry tasks and processes like refactoring. A machine learning method of identifying valid transformations has the advantage of avoiding the generation of transformations by hand and is likely to yield more valid transformations. Our data set comprises 27,300 C++ source code files, consisting of 273 topics each with 10 parallel files. This generates approximately 152,000 paraphrases. Of these paraphrases, 11% yield valid code transformations. We then train a random forest classifier that can identify valid transformations with 83% accuracy. We also discuss some of the observed relationships between linked paraphrase transformations, and we depict the relationships that emerge between alternative equivalent code transformations in a graph formalism.
Details
- Title
- Exploring Paraphrasing Techniques on Formal Language for Generating Semantics Preserving Source Code Transformations
- Creators
- Aviel J. Stein - Drexel University
- Levi Kapllani - Drexel Univ, Coll Comp & Informat, Philadelphia, PA 19144 USA
- Spiros Mancoridis - New York University
- Rachel Greenstadt - New York University
- IEEE
- Publication Details
- 2020 IEEE 14TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC 2020), pp 242-248
- Series
- IEEE International Conference on Semantic Computing
- Publisher
- IEEE
- Number of pages
- 7
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Computer Science
- Web of Science ID
- WOS:000565450400042
- Scopus ID
- 2-s2.0-85083451289
- Other Identifier
- 991019167448004721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Collaboration types
- Domestic collaboration
- Web of Science research areas
- Computer Science, Artificial Intelligence