Data set name | Size (MB) | Data matrix type | Num. ex. (tr/val/te) | Num. feat.
ADA   |  0.6 | Non sparse    |  4147 /  415 /  41471 |    48
GINA  | 19.4 | Non sparse    |  3153 /  315 /  31532 |   970
HIVA  |  7.6 | Non sparse    |  3845 /  384 /  38449 |  1617
NOVA  |  2.3 | Sparse binary |  1754 /  175 /  17537 | 16969
SYLVA | 15.6 | Non sparse    | 13086 / 1308 / 130858 |   216
Is there a limit to the number
of submissions?
You can make as many submissions as you want (although no more than
5 per day, so as not to overload our system). However, only your FIVE last valid submissions will be used for the final
ranking if you use your own models. Valid submissions
include results on all the datasets.
Is there
code I can use to perform the challenge tasks?
Yes: we provide a Matlab package called CLOP
(Challenge Learning Object Package), which is based on the interface of
the Spider
package developed at the Max Planck Institute. It contains preprocessing
and learning machine "objects", and examples of how
to apply them to the challenge data. We would like you to use the
provided package because it will allow us to compare in a more controlled
manner the various model selection strategies used by the participants.
However, this is not a requirement for participating in the challenge.
I don't understand what you mean by "performance
prediction" challenge.
When you participate in the challenge, you will be asked to provide
prediction results on test data. Since the five tasks of the challenge
are all two-class classification problems, you will be asked to provide
class labels for the test examples. The performance of your method will
be computed by comparing your predictions with the actual class labels of
the examples and computing a prediction score. The prediction score chosen
is the balanced error rate (the average of the error rates of the two classes.)
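For concreteness, here is a minimal Python sketch (our own illustration, not part of the provided CLOP package) of how the balanced error rate can be computed from true and predicted labels coded as +1/-1:

# Minimal illustrative sketch: the balanced error rate (BER) is the
# average of the error rates on the two classes.
def balanced_error_rate(y_true, y_pred):
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == -1]
    err_pos = sum(1 for t, p in pos if p != t) / len(pos)  # error rate on the positive class
    err_neg = sum(1 for t, p in neg if p != t) / len(neg)  # error rate on the negative class
    return 0.5 * (err_pos + err_neg)

# Example with 4 positive and 4 negative test examples:
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, 1, 1]
print(balanced_error_rate(y_true, y_pred))  # 0.5 * (1/4 + 2/4) = 0.375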
The objective of the challenge is to predict the "generalization
error" of your predictive model. Such generalization error cannot be
exactly determined, but we shall estimate it from your performance on test
data. This is why we reserved large test sets. In other words, the job
of the challenge participants is not only to devise good predictive models,
but also to devise good estimators of performance prediction (based
on training data only.)
Why should I care about "performance prediction"?
In practical applications, before the final model is deployed in
the field, it is important to provide an accurate estimate of how well
it will perform on new unseen data (the so-called "generalization error".)
Will the model meet the specifications and be good enough for the application?
Should one invest more time and resources to collect additional data
or develop more sophisticated models? In many applications, training
data are scarce. Reserving a large chunk of the training data to evaluate
the model's performance is not a satisfactory solution because data are needed
to train the model. Most people resort to using some form of cross-validation.
Eventually, all the data are used to produce the final model. But is cross-validation
the best solution? If yes, what is the best way of splitting the data to perform
cross-validation? Wouldn't we be better off training with all the available
data and relying on theoretical predictions of performance?
Are there perhaps hybrid methods providing more accurate performance estimators?
Think, for instance, that performance as a function of a given hyper-parameter
could be modeled by a parametric curve fitted by cross-validation, as in the sketch below.
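As an illustration of that last idea, here is a hedged Python sketch (the quadratic model, the hyper-parameter values, and the BER numbers are all our own illustrative choices, not a prescribed method): cross-validated BER estimates at a few hyper-parameter settings are fitted with a parametric curve whose minimum serves both to select the hyper-parameter and to predict performance.

import numpy as np

# Hypothetical cross-validation BER estimates at a few values of a
# regularization hyper-parameter C (illustrative numbers only).
log_C  = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0])
cv_ber = np.array([0.32, 0.27, 0.24, 0.23, 0.25, 0.30])

# Fit a quadratic curve BER(log C) ~ a*(log C)^2 + b*(log C) + c.
a, b, c = np.polyfit(log_C, cv_ber, deg=2)

# The minimum of the fitted parabola gives a smoothed hyper-parameter
# choice and a smoothed performance prediction.
log_C_star = -b / (2 * a)
ber_star = a * log_C_star**2 + b * log_C_star + c
print("selected log C:", log_C_star)
print("predicted BER:", ber_star)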
What is the scoring
method?
The competitors are asked to provide a guess of how well their
model will perform on the test set (the first value in the .guess
file.) The challenge entries are judged on this guess. The scoring is based
on:

E = testBER + deltaBER * (1 - exp(-deltaBER/sigma))

where

deltaBER = abs(predictedBER - testBER)

Here, predictedBER is your guess of the balanced error rate you will obtain on the test set, based on the available training data, and testBER is the BER computed from your predictions on test data. Because testBER is not the expected value of the BER we would like to use in this score, but rather an estimate from a (large) test set, we limit the influence of deltaBER in the region of uncertainty where deltaBER is commensurate with sigma, the error bar on the actual BER as estimated by testBER. Sigma is computed according to the formula:

sigma = (1/2) sqrt[ p1(1-p1)/m1 + p2(1-p2)/m2 ]

where p1 (resp. p2) is the error rate on examples of the first class (resp. of the second class), and m1 (resp. m2) is the number of examples of the first class (resp. of the second class). Note that during the development period, the validation set was substituted for the test set. The validation set was much smaller than the test set, so it could not be relied upon to give you an accurate predictedBER.
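The following minimal Python sketch (our own re-implementation for illustration, not the official scoring code) puts these formulas together for a hypothetical submission:

import math

def challenge_score(predicted_ber, test_ber, p1, p2, m1, m2):
    # E = testBER + deltaBER * (1 - exp(-deltaBER / sigma)), as defined above.
    delta_ber = abs(predicted_ber - test_ber)
    # sigma: error bar on the BER estimated from m1/m2 test examples per class.
    sigma = 0.5 * math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
    return test_ber + delta_ber * (1 - math.exp(-delta_ber / sigma))

# Hypothetical numbers: a 20% test BER guessed at 18%, with per-class
# error rates of 0.15 and 0.25 on 5000 test examples of each class.
print(challenge_score(predicted_ber=0.18, test_ber=0.20,
                      p1=0.15, p2=0.25, m1=5000, m2=5000))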
Will the results be published?
Yes, the results of the challenge will be published at
the WCCI 2006 conference.
In addition, you can submit a paper to that conference and participate
in the workshop
on model selection where the results of the challenge will be presented
(deadline January 31, 2006, EXTENDED TO FEBRUARY
15th). Finally, the best papers will be invited to a special issue
of the journal JMLR.
Can I use an alias or a funky
email not to reveal my identity?
To enter the final ranking, we require participants to identify themselves
by their real name. You cannot win the competition if you use an alias.
However, you may use an alias instead of your real name during the development
period, to make development entries that do not include results on test
data. You must always provide a valid email address. Since the system
identifies you by email, please always use the same address. Your
email will only be used by the administrators to contact you with information
that affects the challenge. It will not be visible to others during the
challenge.
Do I need to let you know
what my method is?
Disclosing information about your method is optional. If you want
to submit your model(s), follow the instructions and
submit them with your results.
Can I or my group make multiple
submissions?
Multiple submissions by the same person (uniquely and
properly identified by name) are permitted, provided that the following conditions
are met:
- For each final submission, results on ALL the data sets
are provided.
- No more than five final submissions are entered per
person (if a larger number of submissions are made, the last 5 fulfilling
the criteria of final submissions will be considered for the final ranking
and selecting the winner).
Can I use a robot to make
submissions?
Robot submissions are not explicitly forbidden. However, we require
that the total number of submissions per 24 hours from the same origin
does not exceed 5. Please be courteous; otherwise we run the risk of overloading
the server and will need to take more drastic measures.
Can I make a submission
with mixed methods?
Mixed submissions containing results of different methods on the
various datasets are permitted. Choosing the methods on the basis of the
validation set results is permitted.
What is the difference between
a development submission and a final submission?
A final submission consists of classification results on ALL the
datasets provided for the five tasks. Partial "development" submissions
(including results only on a subset of the data or only on the validation
set) may also optionally be entered to get feed-back, but they will not
be considered for the final ranking. The organizers will compute validation
set scores right away and publish them on-line. The test set results and
the competition winner(s) will be disclosed only after the closing deadline.
A development submission may include results on a subset of the
datasets. There are no limits on the number of development submissions,
except that we request that no more than five submissions per day
be made to avoid overloading the system. All final submissions should
include classification results on ALL the datasets for the five tasks
(that is training, validation and test set, a total of 15 files) and
optionally the corresponding confidence values (15 files) and performance
guesses (5 files). There is a limit of 5 final submissions. If more than
5 submissions fulfilling the criterion of a valid final submission are
entered, the last 5 only will be taken into account in the final ranking.
Therefore, you may enter final submissions even during development, but
only the last five will be used for the final ranking.
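As a convenience, here is a small hedged Python sketch of a completeness check for a final submission; the <dataset>_<split>.resu file naming pattern is only an assumption for illustration, so follow the official submission instructions for the actual file names and formats.

import os

DATASETS = ["ada", "gina", "hiva", "nova", "sylva"]
SPLITS = ["train", "valid", "test"]

def missing_files(submission_dir):
    # Return the expected classification result files (15 in total, one per
    # dataset and split) that are not present in submission_dir.
    expected = [f"{d}_{s}.resu" for d in DATASETS for s in SPLITS]
    return [f for f in expected
            if not os.path.exists(os.path.join(submission_dir, f))]

missing = missing_files("my_submission")
if missing:
    print("Not a valid final submission; missing:", missing)
else:
    print("All 15 classification result files are present.")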
Why should I make development
submissions?
Development submissions are not mandatory. However, they can help
you in a number of ways:
- To get familiar with the submission procedure and make sure everything
runs smoothly before the rush of the deadline.
- To evaluate your method on an independent test set and compare
the results with those of others.
- To get alerted by email if we make changes or become aware of
a problem of general interest to the participants.
How does the "performance prediction" challenge
connect with "model selection"?
The performance prediction challenge offers a platform to test
model selection strategies in two ways:
(1) Accurate performance predictions make good model ranking criteria,
so you may want to devise a method to predict performances and use it
for model selection.
(2) A set of models (called challenge learning
objects) using a standard interface is available; this is an opportunity
to demonstrate cleverness in choosing the best model and best hyper-parameter
setting rather than using your own model.
Can I attend the model selection
workshop if I did not participate in the challenge?
Yes. You can even submit a paper for presentation on the themes of the workshop.
Should I use the models provided for the
challenge?
You can use your own model(s).
Why do you want people to use "challenge
learning objects"?
The analysis of the results of the challenge will allow us to
draw stronger conclusions on model selection methods if the competitors
use both the same datasets and the same set of models.
Why do you hide the identity
of the datasets?
The datasets were prepared using publicly available data. People
who have been exposed to the chosen data or similar ones may be at an advantage
if they know something about the data. We think it is fairer to put
everyone on the same starting line.
In real life, domain knowledge is of great importance to solve
a problem. Yet, some methods have proved to work well on a variety of
problems, without domain knowledge. This benchmark is about method genericity,
not about domain knowledge.
The identity of the datasets and the preprocessing will be disclosed
at the end of the challenge.
Why did you split the data
into training, validation, and test set?
The validation set that we reserved could rather be called "development
test set". It allows participants to assess their performance relative
to other participants' performance during the development period. The
performances on the test set will remain confidential until the closing
deadline. This prevents participants from tuning their method using the
test set, but it allows them to get some feed-back during the development
period.
The participants are free to do what they want with the training
data, including splitting it again to perform cross-validation.
What motivates the proportion
of the data split?
The proportions training/validation/test are 10/1/100. The validation
set size is purposely small. Hence, using the validation set performance
as your performance prediction is probably not a good idea. The training
set is ten times larger than the validation set, to encourage participants
to devise strategies of cross-validation or other ways of using the training
data to make performance predictions. The test set is 100 times larger
than the validation set. Thus, the error bar of our estimate of your
"generalization performance" based on test data predictions will be
approximately an order of magnitude smaller than the validation error
bar. This will make it possible to assess how well the participants could
predict their "generalization error".
Are the training, validation,
and test set distributed differently?
We shuffled the examples randomly before splitting the
data. We respected approximately the same proportion of positive and negative
examples in each subset. This should ensure that the distributions of examples
in the three subsets are similar.
Is it allowed to use
the validation and test sets as extra unlabelled training data?
Yes.
Are the results on NOVA and HIVA correctly reported on the
web page?
The datasets NOVA and HIVA are strongly biased: they contain
only a small fraction of examples of the positive class. Classifiers that
minimize the error rate, rather than the balanced error rate (BER), will tend
to systematically predict the negative class. This may yield a reasonable
error rate, but a BER of about 50%. However, the AUC may be very good
if the classifier orders the scores in a meaningful way.
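For a concrete (hypothetical) illustration, suppose only 5% of the test examples are positive and a classifier always predicts the negative class; the error rate looks good while the BER is 50%:

# Hypothetical imbalanced test set: 5% positives, 95% negatives,
# scored for a classifier that always predicts the negative class.
n_pos, n_neg = 50, 950
errors_pos = n_pos   # every positive example is misclassified
errors_neg = 0       # every negative example is correct

error_rate = (errors_pos + errors_neg) / (n_pos + n_neg)   # 0.05, looks good
ber = 0.5 * (errors_pos / n_pos + errors_neg / n_neg)      # 0.5, reveals the problem
print(error_rate, ber)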
Can I get the labels
of the validation set to train my classifier?
It has been argued that by making sufficiently many development
submissions, participants could guess the validation set labels and
obtain an unfair advantage. One month before the challenge is over, we
will make the validation set labels available to the participants so
they can use them to make their final submissions. THE VALIDATION SET LABELS ARE AVAILABLE NOW.
Will the organizers enter
the competition?
The winner of the challenge may not be one of the challenge organizers.
However, other workshop organizers who did not participate in the organization
of the challenge may enter the competition. The challenge organizers
will enter development submissions from time to time to challenge others,
under the name "Reference". Reference entries are shown for information
only and are not part of the competition.
Can a participant
give an arbitrarily hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE
PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY
OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN
NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS,
PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.
Who can I ask for
more help?
For all other questions, email modelselect@clopinet.com.
Last updated: March 26, 2006.