ICSE 2019: (Basic) Analysis of Accepted Papers

Mon 20 May 2019

by Tushar

tagged ICSE, ICSE2019, Web-Scrapping

The wordle above is generated from the titles of the 109 papers accepted to the ICSE 2019 research track.

ICSE 2019 is just one week away, and many SE researchers will be flying to Montreal this coming weekend to take part in what is possibly the biggest gathering of SE researchers worldwide. While looking at the accepted papers of the research track, I was intrigued by a few observations, such as some universities and authors contributing a significant share of the papers. To understand this better, I carried out some basic analysis of the accepted papers over the last weekend.

I will first describe how I extracted and curated the data, and then present what I observed in it.

Data extraction and cleaning

First, I took the HTML code of the accepted papers page from the ICSE website and saved it locally. Then, I wrote a small Python script using BeautifulSoup to extract the required metadata. You may find the script below.

The result of the script was far from clean. For instance, there were encoding problems in author names. Similarly, there were many variations in university names (such as 'College of William and Mary', 'College of William & Mary', and 'The College of William and Mary'). I manually fixed the encoding problems (to some extent) and normalized the organization names.
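I did the normalization by hand, but a first automated pass could catch the simplest variants. The following is only a minimal sketch under my own assumptions: the function name `normalize_org` and its three rules (treat "&" as "and", collapse whitespace, drop a leading "The") are illustrative and are not part of the original script.

```python
import re


def normalize_org(name: str) -> str:
    """Collapse a few common spelling variants of an organization name.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the manual normalization that was actually applied.
    """
    cleaned = name.strip()
    cleaned = cleaned.replace("&", " and ")          # 'William & Mary' -> 'William and Mary'
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse repeated whitespace
    cleaned = re.sub(r"^the\s+", "", cleaned, flags=re.IGNORECASE)  # drop leading 'The'
    return cleaned


# The three variants mentioned above collapse to a single name.
variants = [
    "College of William and Mary",
    "College of William & Mary",
    "The College of William and Mary",
]
normalized = {normalize_org(v) for v in variants}
print(normalized)
```

Rules like these only handle mechanical differences; abbreviations and campus names would still need a manual look.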

To assign a type to each organization, I first used a heuristic that checks for the word 'university' in the organization name and tags the organization as "Academic" if it is present and as "Industry" otherwise. Wherever more than one affiliation was mentioned for an author, I picked only the first one to keep the analysis simple. Later, I manually checked each non-academic tag to fix wrong assignments.
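The heuristic is simple enough to sketch in a few lines. The names `first_affiliation` and `org_type`, the ";" separator between affiliations, and the "NA" tag for an empty affiliation are assumptions made for this illustration.

```python
def first_affiliation(affiliations: str, sep: str = ";") -> str:
    """Keep only the first affiliation (the separator is an assumption)."""
    return affiliations.split(sep)[0].strip()


def org_type(org: str) -> str:
    """Tag an organization using the 'university' substring heuristic."""
    if not org.strip():
        return "NA"  # assumed tag for a missing affiliation
    return "Academic" if "university" in org.lower() else "Industry"


# Placeholder affiliation strings, not real data.
print(org_type(first_affiliation("Example University; Example Labs")))  # Academic
print(org_type("Example Labs"))                                          # Industry
# An academic organization without 'university' in its name is mis-tagged,
# which is why every non-academic tag needs a manual check.
print(org_type("College of William and Mary"))                           # Industry
```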

I also wanted to know whether a paper is the result of a collaboration between academia and industry. I used MS Excel functions to check whether the affiliation types associated with a paper include both industry and academia or only one of them, and assigned the tag PA (pure academic), PI (pure industry), or Y (collaborative) accordingly.
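I did this step in Excel, but the same logic can be written in Python. This is a sketch rather than the formulas I used: the function name `collaboration_tag` is hypothetical, and ignoring authors without an affiliation ("NA") is an assumption.

```python
def collaboration_tag(author_types):
    """Return 'Y', 'PA', or 'PI' for the author organization types of one paper.

    Assumption: authors tagged 'NA' (no affiliation) are ignored; a paper
    with no known affiliation at all gets 'NA'.
    """
    kinds = set(author_types) - {"NA"}
    has_academic = "Academic" in kinds
    has_industry = "Industry" in kinds
    if has_academic and has_industry:
        return "Y"   # collaborative
    if has_academic:
        return "PA"  # pure academic
    if has_industry:
        return "PI"  # pure industry
    return "NA"


print(collaboration_tag(["Academic", "Industry", "Academic"]))  # Y
print(collaboration_tag(["Academic", "NA"]))                    # PA
print(collaboration_tag(["Industry"]))                          # PI
```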

The resulting cleaned CSV looks like the following.

You may download the CSV file.

Results

First, let us look at the top contributing organizations. The following figure presents the 14 organizations that have more than three unique papers each.
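Counting unique papers matters here because several authors of the same paper often share an organization, and that paper should count only once for it. A minimal sketch of the counting, assuming the cleaned data is available as (paper title, organization) pairs with one pair per author; the rows below are placeholders, not the real data.

```python
from collections import Counter


def papers_per_org(rows):
    """Count unique papers per organization from (title, organization) pairs."""
    unique_pairs = set(rows)  # one entry per (paper, organization)
    return Counter(org for _, org in unique_pairs)


# Placeholder rows: two authors of 'Paper A' share 'Org X'.
rows = [
    ("Paper A", "Org X"),
    ("Paper A", "Org X"),
    ("Paper A", "Org Y"),
    ("Paper B", "Org X"),
]
counts = papers_per_org(rows)
print(counts["Org X"], counts["Org Y"])  # 2 1

# Organizations with more than three unique papers (none in this toy input).
top_orgs = [org for org, n in counts.items() if n > 3]
print(top_orgs)
```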

Not surprisingly, most of these organizations are academic, Microsoft being the exception. This brings us to the next observation, which concerns the distribution of authors by organization type. The following figure shows that close to 82% of authors belong to academia. A significant number of authors did not mention any affiliation (resulting in a high number of NAs).

Further, it is interesting to see what this distribution of authors means for collaboration. The following figure shows the number of papers written by authors from both worlds, and the number written by purely academic or purely industrial authors.

The figure shows that 21 papers have authors from both academia and industry. Not surprisingly, the majority of the papers have only academic authors. Two papers are written by authors from industry alone: one each from Google and Microsoft.

The final observation concerns the top contributing authors. The following figure shows the authors who have more than two accepted papers in the research track.

That's it for now. If you would like to play with the data, you may download the cleaned CSV from here. If you are coming to Montreal to attend ICSE, see you soon :-)
