How to Filter GitHub Repositories for Your MSR Paper

A Mining Software Repositories (MSR) study often involves identifying a set of GitHub repositories based on a set of filters. The obtained repositories then serve as subject systems for mining a specific aspect (for example, source code, number of issues, or other repository metrics). How you define your sampling criteria is a matter of sampling theory, and what you do with your subject systems depends on the problem you are solving and your solution. The step in between, i.e., identifying and downloading GitHub repositories, is the focus of this post.

Let's say you would like to identify repositories with the following constraints:

  • the primary programming language should be Java,
  • must have been modified at least once in the last year (to avoid analyzing inactive repositories),
  • must have more than 10,000 lines of code,
  • must have at least ten stars, and
  • must have good code quality (yes, this is vague; bear with me for now).

GHTorrent made it significantly easier to work on this problem; especially when your criteria involve repository metadata (such as the number of stars), you can write an SQL query to obtain a set of repositories. However, you either need a local installation of the GHTorrent database or access to the APIs offered by the GHTorrent platform. Another issue is that the information may not be up to date, depending on when the platform last crawled GitHub.
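For instance, a query over a local GHTorrent dump could look like the sketch below. The table and column names (projects, watchers, repo_id, and so on) as well as the connection details are assumptions based on the GHTorrent relational schema; verify them against the dump you have installed.

```python
# Sketch: selecting candidate Java repositories from a local GHTorrent MySQL dump.
# Schema details (projects, watchers, repo_id, deleted, updated_at) are assumptions;
# check them against the GHTorrent schema documentation for your dump version.
import mysql.connector

QUERY = """
SELECT p.id, p.url, COUNT(w.user_id) AS stars
FROM projects p
JOIN watchers w ON w.repo_id = p.id
WHERE p.language = 'Java'
  AND p.deleted = 0
  AND p.updated_at >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY p.id, p.url
HAVING stars >= 10;
"""

conn = mysql.connector.connect(host="localhost", user="ghtorrent",
                               password="ghtorrent", database="ghtorrent")
cursor = conn.cursor()
cursor.execute(QUERY)
for repo_id, url, stars in cursor:
    print(repo_id, url, stars)
conn.close()
```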

As the MSR community grew, it started striving to filter repositories based on their quality. RepoReapers was an attempt to collect quality aspects (such as documentation, architecture, unit testing, license, and issues). Again, due to its static nature, the dataset is prone to providing obsolete information.

GitHub Search, offered by the SEART lab, also provides an elegant interface to search GitHub repositories. You can specify your search criteria using the numerous parameters supported by GitHub's advanced search.

Then enters the GitHub GraphQL API. You can query GitHub directly using the API and get the filtered set of repositories, and the information is up to date. However, there is a catch. MSR researchers typically want thousands of repositories as their subject systems, and to get them, they need to search using rather broad criteria. Even with a GitHub API token, the search results are often too large to retrieve in one shot, and you will quickly exhaust your API rate limit.
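As an illustration, the sketch below fetches one page of results for the criteria listed earlier via the GraphQL search API (the token and the pushed date are placeholders). A complete harvest would have to paginate over endCursor until hasNextPage is false, which is exactly where the rate limit starts to hurt.

```python
# Sketch: one page of GitHub's GraphQL repository search.
# GITHUB_TOKEN and the pushed:>= date are placeholders you need to fill in.
import requests

GITHUB_TOKEN = "ghp_..."  # a personal access token with public repository read access

GRAPHQL_QUERY = """
query ($cursor: String) {
  search(query: "language:Java stars:>=10 pushed:>=2024-01-01",
         type: REPOSITORY, first: 100, after: $cursor) {
    repositoryCount
    pageInfo { endCursor hasNextPage }
    nodes { ... on Repository { nameWithOwner stargazerCount pushedAt } }
  }
}
"""

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": GRAPHQL_QUERY, "variables": {"cursor": None}},
    headers={"Authorization": f"Bearer {GITHUB_TOKEN}"},
)
search_results = response.json()["data"]["search"]
print("total matches:", search_results["repositoryCount"])
for node in search_results["nodes"]:
    print(node["nameWithOwner"], node["stargazerCount"], node["pushedAt"])
```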

To overcome the above challenge, you may use the searchgithubrepo package. With this package, you only need to call one method, searchrepo, with the desired parameters (such as the minimum number of stars and the programming language), as shown in the sketch below. The package creates a text file containing all the GitHub repositories modified on or after the specified start date and satisfying the other conditions.
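Only the method name searchrepo comes from the description above; the import path and parameter names in the sketch are assumptions, so check the package documentation for the exact interface.

```python
# Sketch of invoking the searchgithubrepo package.
# The import path and parameter names below are assumptions; only the method
# name (searchrepo) is taken from the description above.
from searchgithubrepo import searchrepo

searchrepo(language="Java",          # primary programming language
           min_stars=10,             # minimum number of stars
           start_date="2024-01-01",  # modified on or after this date
           token="ghp_...",          # GitHub API token (placeholder)
           out_file="repos.txt")     # output text file with matching repositories
```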

The above package solves the problem only partially. GitHub does not support searching based on the number of lines of code in a repository, so we need to rely on another source for this information. Another issue is that we would like to avoid selecting poor-quality repositories; relying on GitHub stars partially addresses this, but it is not foolproof.

To solve both issues, we can use QScored. QScored is an open platform where code quality information (smells and metrics) of more than 200 thousand open-source repositories is available. QScored assigns a weighted quality score based on the smells detected at various granularities (the higher the score, the poorer the quality). QScored also offers search APIs. The program sketched below takes the file generated by the searchgithubrepo package and checks each project's size in lines of code as well as its quality score.
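The sketch assumes a QScored search endpoint that accepts a repository name and an API key and returns lines of code and a quality score; the endpoint URL, header name, response fields, and thresholds are all placeholders, so adapt them to the actual QScored API documentation.

```python
# Sketch: filtering repositories by lines of code and QScored quality score.
# The QScored endpoint, API-key header, and response fields are placeholders
# (assumptions); consult the QScored API documentation for the real interface.
import requests

QSCORED_API_KEY = "..."                                 # placeholder API key
QSCORED_SEARCH_URL = "https://qscored.com/api/search"   # placeholder endpoint

MIN_LOC = 10_000          # at least 10,000 lines of code
MAX_QUALITY_SCORE = 5.0   # placeholder cut-off; higher score means poorer quality

selected = []
with open("repos.txt") as repo_file:          # file produced by searchgithubrepo
    for line in repo_file:
        repo = line.strip()
        if not repo:
            continue
        response = requests.get(
            QSCORED_SEARCH_URL,
            params={"repository": repo},
            headers={"X-API-Key": QSCORED_API_KEY},   # placeholder header name
        )
        if response.status_code != 200:
            continue                          # repository not indexed by QScored
        info = response.json()
        loc = info.get("loc", 0)
        score = info.get("quality_score", float("inf"))
        if loc >= MIN_LOC and score <= MAX_QUALITY_SCORE:
            selected.append(repo)

with open("selected_repos.txt", "w") as out_file:
    out_file.write("\n".join(selected))
```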

That's it for now. Happy mining :)