MSR Mining Challenge 2009

May 16-17, 2009

Special track within MSR 2009,
6th IEEE Working Conference on Mining Software Repositories  

Co-located with ICSE 2009,
IEEE International Conference on Software Engineering


Christian Bird (chair)
Univ. of California, Davis, USA
Katsuro Inoue
University of Osaka, Japan
Michael W. Godfrey
University of Waterloo, Canada
Jim Whitehead
Univ. of California, Santa Cruz, USA


Israel Herraiz
(Universidad Rey Juan Carlos, Spain)
Emily Hill
(University of Delaware, USA)
Abram Hindle
(University of Waterloo, Canada)
Reid Holmes
(University of Calgary, Canada)
Rahul Premraj
(Saarland University, Germany)
Peter Rigby
(University of Victoria, Canada)


Co-located with ICSE 2009,
Vancouver, Canada

NOTE: New, later submission dates have been announced!


Since 2006, the IEEE Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge. The MSR Mining Challenge brings together researchers and practitioners who are interested in applying, comparing, and challenging their mining tools and approaches on software repositories for open source projects. Unlike previous years, which examined a single project or multiple projects in isolation, this year's MSR challenge examines the GNOME Desktop Suite of projects. The emphasis this year is on how the projects are related and how they interact.

There will be two challenge tracks: #1: general and #2: prediction. The winner of each track will be given the MSR 2009 Challenge Award.

Challenge #1: General

In this category you can demonstrate the usefulness of your mining tools. The main task is to find interesting insights by analyzing the software repositories of the projects within the GNOME Desktop Suite. GNOME is very mature, is composed of a number of individual projects (nautilus, epiphany, evolution, etc.), and provides plenty of input for mining tools. The idea of this track is that tools can be used to examine a family of projects that are related and similar in nature. It is recommended (though not required) that tools examine multiple projects within the GNOME ecosystem. Examples include examining API usage across all projects, training a predictive model on one project and assessing its accuracy on another, or examining how developers' activity spans multiple projects.

Participation is straightforward:

  1. Select your mining area (one of bug analysis, change analysis, architecture and design, process analysis, team structure, etc.).
  2. Get project data for multiple GNOME projects.
  3. Formulate your mining questions.
  4. Use your mining tool(s) to answer them.
  5. Write up and submit your 4-page challenge report.

The challenge report should describe the results of your work and cover the following aspects: questions addressed, input data, approach and tools used, derived results and interpretation of them, and conclusions. Keep in mind that the report will be evaluated by a jury. Reports must be at most 4 pages long and in the ICSE format.

Submissions will be made via EasyChair. Each report will undergo a thorough review, and accepted challenge reports will be published as part of the MSR 2009 proceedings. Authors of selected papers will be invited to give a presentation at the MSR conference in the MSR Challenge track.


Feel free to use any data source for the Mining Challenge. For your convenience, we provide repository logs, mirrored repositories, bugzilla database dumps, and various other forms of data at msrchallengedata.html.

Challenge #2: Predict

This year, the prediction track of the MSR Mining Challenge involves predicting the code growth (in raw source code lines) of each project between February 1st and April 30th, 2009 (both days included). Your job is to predict the change in code size, measured in lines in source code files, using any resources available.

Project     Lines of source added from 2009/2/1 to 2009/4/30
epiphany    2,023
nautilus    3,112
evolution   720

Participation is as follows:

  1. Pick a team name, e.g., WICKED WARTHOGS.
  2. Come up with predictions for code growth based on some criteria or prediction model. A very simple model, for instance, would be the amount of growth in the past three months.
  3. Annotate the corresponding files with your predictions.

The prediction is on a per project basis. Thus for each project in projects.txt, you need to predict the growth in number of source code lines. Your submission should be a text file with each line containing a project name followed by the change in number of source lines as in challengeexample.txt.
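As a rough illustration of the very simple model mentioned above and of the submission format, the following Python sketch predicts each project's growth as its growth over the previous three months and renders one line per project. The line counts, the function names, and the single-space separator are assumptions made for illustration only; challengeexample.txt remains the authoritative format.

```python
def naive_growth_predictions(size_three_months_ago, size_now):
    """Predict next-period growth as the growth seen over the last three months.

    Both arguments map project name -> raw source line count (hypothetical inputs).
    """
    return {
        project: size_now[project] - size_three_months_ago[project]
        for project in size_now
    }


def format_submission(predictions):
    """Render one 'project growth' line per project (the separator is an assumption)."""
    return "\n".join(
        f"{name} {growth}" for name, growth in sorted(predictions.items())
    )


# Made-up line counts, for illustration only.
earlier = {"epiphany": 100000, "nautilus": 150000}
current = {"epiphany": 101500, "nautilus": 149200}

predictions = naive_growth_predictions(earlier, current)
submission_text = format_submission(predictions)
print(submission_text)
```

Note that a prediction can be negative if a project shrank over the reference period.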

Each submission will be scored in the following way. For each project, the difference between the submitted growth and the actual growth will be calculated and then normalized by the size of the project as of February 1st. Thus, if zenity is 2000 lines on February 1st and 2500 lines on April 30th and your prediction is 300 lines, then the value would be (300 - 500) / 2000, or -0.1. The score for each submission is the sum of the squares of these values across all of the projects. A perfect prediction submission would have a score of 0. Lower scores indicate better predictions than higher scores. Using sums of squares rather than simple sums rewards predictions that are more consistent in their accuracy.
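The scoring rule described above can be sketched in Python as follows. This is not the official scoring tool. The function name and the dictionary-based inputs are assumptions, and the sketch assumes a prediction exists for every project.

```python
def submission_score(predicted_growth, actual_growth, size_on_feb_1):
    """Sum of squared, size-normalized prediction errors; lower is better.

    Each argument maps project name -> number of source lines.
    Assumes every project in actual_growth also has a prediction.
    """
    total = 0.0
    for project, actual in actual_growth.items():
        # Difference between submitted and actual growth, normalized by
        # the size of the project as of February 1st.
        error = (predicted_growth[project] - actual) / size_on_feb_1[project]
        total += error ** 2
    return total


# The zenity example from the text: 2000 lines on February 1st,
# 2500 lines on April 30th (actual growth 500), predicted growth 300.
zenity_score = submission_score(
    {"zenity": 300}, {"zenity": 500}, {"zenity": 2000}
)
print(zenity_score)  # the per-project value is -0.1, so the score is about 0.01
```

Because each value is squared before summing, one badly missed project costs more than several small misses of the same total magnitude.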

Obviously, the team with the best predictions will win. However, to increase the competition, we will organize a set of "benchmark" predictions.

Code Growth Prediction

The predictions for code growth should be made at the project level. We will only examine source code (not makefiles, READMEs, documentation, etc.) contained in each project repository as it exists at 12:01 a.m. on February 1st and 11:59 p.m. on April 30th, 2009. Source code files are determined by extension (as defined below), and all lines in a source file will be counted regardless of their content. For the challenge we will consider selected projects within the core GNOME desktop suite. A complete list of the projects is in the file projects.txt. We will provide the tool for officially counting raw source lines in the near future.

To calculate source code lines for a project, only include the source files that reside in the trunk of the repository (some .c files, for example, may be generated during the configure or make stages, and we do not include those). Do not include files from the branches or tags directories of the repository. We define source code files as those with the extensions c, cc, cpp, cs, glade, h, java, pl, py, and tcl. In addition, any file that has one of those extensions followed by .in or .template is also considered a source file. A simple way to calculate the number of source lines is to execute the following command at the root of the tree:

find . -regextype posix-extended -type f -regex ".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?$" | xargs wc -l | tail -n 1
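For those who prefer a script, the same count can be approximated in Python. This is a sketch under stated assumptions, not the official counting tool: the function name is hypothetical, the root passed in is assumed to be a checkout of the trunk only, and lines are counted as newline characters, which is how wc -l counts them.

```python
import os
import re
import tempfile

# Same extensions as the find command above, with optional .template or .in suffix.
SOURCE_FILE = re.compile(
    r".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?"
)


def count_source_lines(root):
    """Count newline characters in all source files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror "find -type f": regular files only, no symbolic links.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if SOURCE_FILE.fullmatch(name):
                with open(path, "rb") as handle:
                    total += handle.read().count(b"\n")
    return total


# Small self-contained demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "main.c"), "w") as f:
        f.write("int main(void) {\n    return 0;\n}\n")  # 3 lines, counted
    with open(os.path.join(tmp, "config.h.in"), "w") as f:
        f.write("#undef VERSION\n")  # 1 line, counted
    with open(os.path.join(tmp, "README"), "w") as f:
        f.write("not a source file\n")  # ignored
    demo_total = count_source_lines(tmp)

print(demo_total)
```

Summing in a single process also avoids depending on the last "total" line of wc output, which may cover only part of the files if xargs splits a very long file list across several wc invocations.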

