Preprocessing Source Code Comments for Linguistic Models

Sergey Matskevich; Colin S Gordon

doi:10.48550/arxiv.2208.11235

Preprint

Open access

arXiv (Cornell University)

26 Aug 2022

DOI: https://doi.org/10.48550/arxiv.2208.11235

Preprint (Author's original); arXiv.org - Non-exclusive license to distribute, Open

Abstract

Computer Science - Learning

Computer Science - Software Engineering

Comments are an important part of source code and a primary source of documentation. This has driven interest in using large bodies of comments to train or evaluate tools that consume or produce them, such as generating oracles or even code from comments, or automatically generating code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires some extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open-source projects on GitHub and 8422 projects from the SriLab dataset, and the impact that naïve vs. in-depth filtering can have on the use of existing comments for training and evaluation of systems that generate comments.

Details

Title: Preprocessing Source Code Comments for Linguistic Models
Creators: Sergey Matskevich; Colin S Gordon
Publication Details: arXiv (Cornell University)
Resource Type: Preprint
Language: English
Academic Unit: Computer Science (Computing)
Other Identifier: 991021868725604721
