Christian Bird (chair)
Univ. of California, Davis, USA
University of Osaka, Japan
Michael W. Godfrey
University of Waterloo, Canada
Univ. of California, Santa Cruz, USA
(Universidad Rey Juan Carlos, Spain)
(University of Delaware, USA)
(University of Waterloo, Canada)
(University of Calgary, Canada)
(Saarland University, Germany)
(University of Victoria, Canada)
Co-located with ICSE 2009.
Since 2006 the IEEE Working Conference on Mining Software
Repositories (MSR) has hosted a mining challenge. The MSR Mining Challenge
brings together researchers and practitioners who are interested in applying,
comparing, and challenging their mining tools and approaches on software
repositories for open source projects. Unlike previous years that have examined
a single project or multiple projects in isolation, this year the MSR challenge
involves examining the GNOME Desktop Suite of projects. The emphasis this year
is on how the projects are related and how they interact.
There will be two challenge tracks: #1: general and #2: prediction. The winner of
each track will be given the MSR 2009 Challenge Award.
In this category you can demonstrate the usefulness of your mining tools. The main task
is to find interesting insights by analyzing the software repositories of the projects
within the GNOME Desktop Suite. GNOME is a mature suite composed of many
individual projects (nautilus, epiphany, evolution, etc.) and provides ample input for
mining tools. The idea of this track is that tools can be used to examine a family of
related and similar projects. It is recommended (though not required)
that tools examine multiple projects within the GNOME ecosystem, for instance by examining
API usage across all projects, training a predictive model on one project and assessing
its accuracy on another, or examining how developers' activity spans multiple projects.
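As an illustration of the cross-project idea, here is a deliberately naive sketch of training a predictor on one project and testing it on another. All module data below is made up for illustration; real entries would come from the GNOME repositories and bug databases.

```python
# Hypothetical sketch: learn a churn threshold on one project's modules
# and check how well it separates buggy from clean modules on another.

def train_threshold(modules):
    """Midpoint between the mean churn of buggy and clean modules --
    a deliberately naive one-feature model."""
    buggy = [m["churn"] for m in modules if m["buggy"]]
    clean = [m["churn"] for m in modules if not m["buggy"]]
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(buggy) + mean(clean)) / 2

def accuracy(threshold, modules):
    """Fraction of modules where (churn > threshold) matches bugginess."""
    hits = sum((m["churn"] > threshold) == m["buggy"] for m in modules)
    return hits / len(modules)

# Synthetic "training" and "test" data standing in for two projects.
nautilus = [{"churn": 120, "buggy": True},  {"churn": 15, "buggy": False},
            {"churn": 90,  "buggy": True},  {"churn": 30, "buggy": False}]
epiphany = [{"churn": 200, "buggy": True},  {"churn": 10, "buggy": False},
            {"churn": 60,  "buggy": True},  {"churn": 25, "buggy": False}]

t = train_threshold(nautilus)
print(f"threshold={t:.2f}, cross-project accuracy={accuracy(t, epiphany):.2f}")
```

A real entry would of course use richer features and a proper model; the point is only the train-on-one, test-on-another setup.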
Participation is straightforward:
- Select your mining area (one of bug analysis, change analysis, architecture and design, process analysis, team structure, etc.).
- Get project data for multiple GNOME projects.
- Formulate your mining questions.
- Use your mining tool(s) to answer them.
- Write up and submit your 4-page challenge report.
The challenge report should describe the results of your work
and cover the following aspects: questions addressed, input data,
approach and tools used, derived results and their interpretation,
and conclusions. Keep in mind that the report will be evaluated by
a jury. Reports must be at most 4 pages long
and in the ICSE proceedings format.
Submission will be via EasyChair (http://www.easychair.org/conferences/?conf=msrchallenge2009).
Each report will undergo a thorough review, and accepted challenge
reports will be published as part of the MSR 2009 proceedings. Authors
of selected papers will be invited to give a presentation at the MSR
conference in the MSR Challenge track.
Feel free to use any data source for the Mining
Challenge. For your convenience, we provide repository logs, mirrored
repositories, bugzilla database dumps, and various other forms of data.
This year, the MSR Mining Challenge prediction track will involve
predicting the code growth (in terms of raw source code lines) of each project
that will occur between February 1st and April 30th, 2009 (both days included).
Your job is to predict the change in size of code, in terms of lines in source code files, using all available data.
Project | Lines of source added from 2009/2/1 to 2009/4/30
Participation is as follows:
- Pick a team name, e.g., WICKED WARTHOGS.
- Come up with predictions for code growth based on some criteria or prediction model. A very simple model, for instance,
would be the amount of growth in the past three months.
- Annotate the corresponding files with your predictions.
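The "growth in the past three months" baseline mentioned above can be sketched as follows. The line counts, dates, and the output file name are hypothetical placeholders, not official data:

```python
# Minimal baseline predictor: assume each project grows over
# Feb-Apr 2009 by as many lines as it grew in the previous three
# months. The history below is invented for illustration.

# project -> (lines on 2008/11/1, lines on 2009/2/1), hypothetical values
history = {
    "zenity":   (1800, 2000),
    "nautilus": (95000, 97000),
}

def predict_growth(history):
    """Predicted growth = growth observed over the previous window."""
    return {name: after - before for name, (before, after) in history.items()}

# Write one "project growth" pair per line, as in challengeexample.txt.
with open("prediction.txt", "w") as f:
    for project, growth in sorted(predict_growth(history).items()):
        f.write(f"{project} {growth}\n")
```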
The prediction is on a per project basis. Thus for each project in projects.txt,
you need to predict the growth
in number of source code lines. Your submission should be a text file
with each line containing a project name followed by the change in
number of source lines as in challengeexample.txt.
Each submission will be scored in the following way. For each project,
the difference between the submitted growth and the actual growth will
be calculated and then normalized by
the size of the project as of February 1st. Thus, if zenity is 2000
lines on February 1st and 2500 lines on April 30th and your prediction
is 300 lines, then the value would be
(300 - 500) / 2000, or -0.1. The score for each submission is the sum
of the squares of these values across all of the projects. A perfect
prediction submission would have a score
of 0. Lower scores indicate better predictions than higher scores.
Using sums of squares rather than simple sums rewards predictions that
are more consistent in their accuracy.
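The scoring rule above can be sketched in a few lines. The zenity numbers come from the example in the text; the function and argument names are my own:

```python
# Sketch of the scoring rule: per-project error is
# (predicted growth - actual growth) / size on February 1st,
# and a submission's score is the sum of squared errors (lower is better).

def score(predictions, actuals, base_sizes):
    total = 0.0
    for project, predicted in predictions.items():
        error = (predicted - actuals[project]) / base_sizes[project]
        total += error ** 2
    return total

# The zenity example from the text: 2000 lines on Feb 1st, 2500 on
# Apr 30th (actual growth 500), predicted growth 300.
s = score({"zenity": 300}, {"zenity": 500}, {"zenity": 2000})
print(s)  # ((300 - 500) / 2000) ** 2 = 0.01
```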
Obviously, the team with the best predictions will win. However, to
increase the competition, we will organize a set of "benchmark" predictions.
Code Growth Prediction
The predictions for code growth should be made at the
project level. We will only examine source code (not makefiles,
readmes, documentation, etc.) contained in each project repository as
it exists at 12:01 a.m. on February 1st and 11:59 p.m. on April 30th,
2009. Source code files are determined by extension (as defined below)
and all lines in a source file will be counted regardless of their
content. For the challenge we will consider selected projects within
the core GNOME desktop suite. A complete list of the projects is in the
file projects.txt. We will provide
the tool for officially counting raw source lines in the near future.
To calculate source code lines for a project, only include the source
files that reside in the trunk of the repository (some .c files, for
example, may be generated
during the configure or make stages and we do not include those). Do
not include files from the branches or tags directories of the
repository. We define source code files as those with the extensions c,
cc, cpp, cs, glade,
h, java, pl, py, and tcl. In addition, any file that has one of those
extensions followed by .in or .template are also considered source
files. A simple
way to calculate the number of source lines is to execute the following
command at the root of the tree:

find . -regextype posix-extended -type f -regex ".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?$" | xargs wc -l | tail -n 1
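For a rough cross-check of that shell command, the same count can be sketched in Python (this is my own helper, not the official counting tool the organizers will provide):

```python
# Walk a checked-out trunk and count lines in files whose names match
# the listed source extensions, optionally followed by .in or .template.
import os
import re

SOURCE_RE = re.compile(
    r".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?$")

def count_source_lines(root):
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if SOURCE_RE.match(name):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    total += sum(1 for _ in f)  # one count per line
    return total
```

Note that, like wc -l, this counts every line of a matching file regardless of content; run it on the trunk only, excluding branches and tags.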
Do I need to give a presentation at the MSR conference?
For challenge #1, the jury will select finalists that are expected to
give a short presentation at the conference. Then the audience will
select a winner. For challenge #2, there is no presentation at the
conference. The winners will be determined with statistical methods
(correlation analysis) and announced at the conference.
Does the challenge report have to be four
pages? No, of course you can submit fewer than four
pages. The page limit was set to ease the presentation of
space-intensive results such as visualizations.
Wow, the data set is soooo big! My tool won't finish in
time. What can I do? Just run your tool on a subset
of the projects. For instance, you could examine only the nautilus file manager and the epiphany web browser.
Especially when you are doing
visualizations, it is almost impossible to show everything.
Predicting code growth? But, I have no clue how to build prediction models. That's the fun thing about this
category: you don't need to build sophisticated models. Of course, some
people will, but others will just build simple predictors. In the end,
we will see (a) whether we can predict future development events and
(b) who does it best.
My cat is a visionary...can I submit its predictions or is the
challenge #2 only for tools? Of course, go ahead and
submit its predictions as a benchmark. However, your cat will compete out of
competition: only predictions generated by tools or by humans in
a systematic way are eligible to win challenge #2.
For the challenge #2-predict, is it acceptable if our team
submits more than one prediction file? Only one submission per team (or person) is allowed.