Automated Feature Extraction from Version Control Artifacts in GitHub Repositories

Maitreya Vaghulade, Urav Dalal, Sean Fargose, Devang Shah, Kush Maniar, Kiran Bhowmick, Meera Narvekar

Abstract

Managing and tracking implemented features in large-scale open-source projects with numerous contributors is a challenging task. This research proposes an automated system that extracts features from version control artifacts. A substantial dataset of popular open-source GitHub repositories is collected for development and evaluation. To ensure complete and current information, the system uses Selenium to scrape commits, release notes, README files, and closed or merged pull requests. The scraped data is preprocessed by splitting it into manageable chunks. Each chunk is then summarized using BART (Bidirectional and Auto-Regressive Transformers) and BERT (Bidirectional Encoder Representations from Transformers), two state-of-the-art large language models, to extract features from the scraped artifacts automatically. The text of each chunk is corrected using GPT-4 (Generative Pre-trained Transformer 4), and the corrected chunks are combined into a thorough synopsis. The method aims to reduce the workload of manual feature tracking, streamline contribution management, improve project visibility, ease GitHub adoption, and foster productive contributor interactions. With feature extraction automated, developers can spend more time on coding and less on extensive documentation, while GitHub repositories receive well-structured and informative feature updates. The automated feature extraction can also be integrated into the CI/CD pipeline, enabling continuous monitoring and documentation of implemented features throughout the software development lifecycle.
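
The chunk, summarize, and combine flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the word-based chunk size, and the `first_sentence` placeholder are assumptions made for illustration. In the described system, a BART or BERT summarizer would be passed in place of the placeholder, and a GPT-4 correction step would run before the summaries are combined.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_artifacts(artifacts: List[str],
                        summarize: Callable[[str], str],
                        max_words: int = 200) -> str:
    """Join scraped artifacts, summarize each chunk, and combine the results.

    `summarize` is a hypothetical hook: any callable mapping a chunk of text
    to a shorter string (for example, a wrapper around a BART model).
    """
    combined = "\n".join(artifacts)
    summaries = [summarize(chunk) for chunk in chunk_text(combined, max_words)]
    return "\n".join(summaries)


def first_sentence(chunk: str) -> str:
    """Placeholder summarizer (assumption): keep text up to the first period."""
    head, _, _ = chunk.partition(".")
    return head.strip() + "."


if __name__ == "__main__":
    # Illustrative inputs standing in for scraped commits and release notes.
    notes = ["Add login. Fix bug.", "Release v2. Docs."]
    print(summarize_artifacts(notes, first_sentence, max_words=4))
```

Keeping the summarizer as a parameter separates the chunking logic from any particular model, so the same flow can be exercised with a trivial stand-in before a real model is attached.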
