|Data set name||Size (MB)||Data matrix type||Num. ex. (tr/val/te)||Num. feat.|
|ADA||0.6||Non sparse||4147 / 415 / 41471||48|
|GINA||19.4||Non sparse||3153 / 315 / 31532||970|
|HIVA||7.6||Non sparse||3845 / 384 / 38449||1617|
|NOVA||2.3||Sparse binary||1754 / 175 / 17537||16969|
|SYLVA||15.6||Non sparse||13086 / 1308 / 130858||216|
Is there a limit to the number of submissions?
You can make as many submissions as you want (but no more than 5 per day, so as not to overload our system). However, if you use your own models, only your FIVE last valid submissions will be used for the final ranking. Valid submissions include results on all datasets.
Is there code I can use to perform the challenge tasks?
Yes: We provide a Matlab package called CLOP (Challenge Learning Object Package), which is based on the interface of the Spider package developed at the Max Planck Institute. It contains preprocessing and learning machine "objects", and examples of how to apply them to the challenge data. We would like you to use this package because it will allow us to compare the various model selection strategies used by the participants in a more controlled manner. However, this is not a requirement to participate in the challenge.
I don't understand what you mean by "performance prediction".
When you participate in the challenge, you will be asked to provide prediction results on test data. Since the five tasks of the challenge are all two-class classification problems, you will be asked to provide class labels for the test examples. The performance of your method will be computed by comparing your predictions with the actual class labels of the examples and computing a prediction score. The prediction score chosen is the balanced error rate (the average of the error rates of the two classes). The objective of the challenge is to predict the "generalization error" of your predictive model. This generalization error cannot be determined exactly, but we shall estimate it from your performance on test data. This is why we reserved large test sets. In other words, the job of the challenge participants is not only to devise good predictive models, but also to devise good estimators of their performance (based on training data only).
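As an illustration of the balanced error rate described above, here is a minimal Python sketch. The function name and the +1/-1 label encoding are assumptions made for this example; they are not taken from the challenge software.

```python
def balanced_error_rate(y_true, y_pred, positive=1, negative=-1):
    """Average of the two per-class error rates (two-class problems only)."""
    pos_total = sum(1 for t in y_true if t == positive)
    neg_total = sum(1 for t in y_true if t == negative)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present in y_true")
    # Count misclassified examples separately for each true class.
    pos_errors = sum(1 for t, p in zip(y_true, y_pred)
                     if t == positive and p != positive)
    neg_errors = sum(1 for t, p in zip(y_true, y_pred)
                     if t == negative and p != negative)
    return 0.5 * (pos_errors / pos_total + neg_errors / neg_total)


# Toy labels (made up): 2 positive and 4 negative examples.
y_true = [1, 1, -1, -1, -1, -1]
y_pred = [1, -1, -1, -1, -1, 1]
# One of two positives and one of four negatives are wrong.
print(balanced_error_rate(y_true, y_pred))
```

Because each class contributes equally to the average, mistakes on a rare class weigh as much as mistakes on a frequent one.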
Why should I care about "performance prediction"?
In practical applications, before the final model is deployed in the field, it is important to provide an accurate estimate of how well it will perform on new unseen data (the so-called "generalization error"). Will the model meet the specifications and be good enough for the application? Should one invest more time and resources to collect additional data or develop more sophisticated models? In many applications, training data are scarce. Reserving a large chunk of the training data to evaluate the model's performance is not a satisfactory solution because the data are needed to train the model. Most people resort to some form of cross-validation. Eventually, all the data are used to produce the final model. But is cross-validation the best solution? If yes, what is the best way of splitting data to perform cross-validation? Wouldn't we be better off training with all the available data and relying on theoretical predictions of performance? Are there perhaps hybrid methods providing more accurate performance estimators? Consider, for instance, that performance as a function of a given hyper-parameter could be modeled by a parametric curve fitted by cross-validation.
What is the scoring method?
The competitors are asked to provide a guess of how well their model will perform on the test set (the first value in the .guess file.) The challenge entries are judged on this guess. The scoring is based on:
E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)),
where
deltaBER = abs(predictedBER - testBER).
Here, predictedBER is your guess of the balanced error rate you will obtain on the test set, based on the available training data; testBER is the BER computed from your predictions on test data. Because testBER is not the expected value of the BER that we would like to use in this score, but rather an estimate from a (large) test set, we limit the influence of deltaBER in the region of uncertainty where deltaBER is commensurate with sigma, the error bar on the actual BER as estimated by testBER. Sigma is computed according to the formula:
sigma = (1/2) sqrt[ p1(1-p1)/m1 + p2(1-p2)/m2 ],
where p1 (resp. p2) is the fraction of errors on examples of the first class (resp. the second class) and m1 (resp. m2) is the number of examples of the first class (resp. the second class). Note that during the development period, the validation set was substituted for the test set. The validation set was much smaller than the test set, so it could not be relied upon to give you an accurate predictedBER.
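The scoring formulas above can be sketched in Python as follows. This is only an illustration, not the organizers' scoring code: the function names are made up, and the handling of sigma = 0 (taking the limit of the formula) is an assumption.

```python
import math


def ber_error_bar(p1, m1, p2, m2):
    """sigma = (1/2) sqrt(p1(1-p1)/m1 + p2(1-p2)/m2)."""
    return 0.5 * math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)


def challenge_score(predicted_ber, test_ber, sigma):
    """E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)); lower is better."""
    delta_ber = abs(predicted_ber - test_ber)
    if sigma == 0:
        # Assumption: with no uncertainty, use the limit E = testBER + deltaBER.
        return test_ber + delta_ber
    return test_ber + delta_ber * (1 - math.exp(-delta_ber / sigma))


# Hypothetical per-class error fractions and class sizes.
p1, m1 = 0.2, 1000
p2, m2 = 0.1, 9000
test_ber = 0.5 * (p1 + p2)
sigma = ber_error_bar(p1, m1, p2, m2)

for guess in (test_ber, test_ber + sigma, test_ber + 10 * sigma):
    print(f"guess={guess:.4f}  E={challenge_score(guess, test_ber, sigma):.4f}")
```

A guess within about one sigma of testBER is penalized only lightly, while a guess that is off by many sigmas is penalized by nearly the full deltaBER.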
Will the results be published?
Yes, the results of the challenge will be published at the WCCI 2006 conference. In addition, you can submit a paper to that conference and participate in the workshop on model selection where the results of the challenge will be presented (deadline January 31, 2006. EXTENDED TO FEBRUARY 15th). Finally, the best papers will be invited to a special issue of the journal JMLR.
Can I use an alias or a funky email not to reveal my identity?
To enter the final ranking, we require participants to identify themselves by their real name. You cannot win the competition if you use an alias. However, you may use an alias instead of your real name during the development period, to make development entries that do not include results on test data. You must always provide a valid email. Since the system identifies you by email, please always use the same email. Your email will only be used by the administrators to contact you with information that affects the challenge. It will not be visible to others during the challenge.
Do I need to let you know what my method is?
Disclosing information about your method is optional. If you want to submit your model, follow the instructions and submit it with your results.
Can I or my group make multiple submissions?
Multiple submissions by the same person (uniquely and properly identified by name) are permitted, provided that the following conditions are met:
- For each final submission, results on ALL the data sets are provided.
- No more than five final submissions are entered per person (if a larger number of submissions are made, the last 5 fulfilling the criteria of final submissions will be considered for the final ranking and for selecting the winner).
Can I use a robot to make submissions?
Robot submissions are not explicitly forbidden. However, we require that the total number of submissions per 24 hours from the same origin does not exceed 5. Please be courteous; otherwise we risk overloading the server and will need to take more drastic measures.
Can I make a submission with mixed methods?
Mixed submissions containing results of different methods on the various datasets are permitted. Choosing the methods on the basis of the validation set results is permitted.
What is the difference between a development submission and a final submission?
A final submission consists of classification results on ALL the datasets provided for the five tasks. Partial "development" submissions (including results only on a subset of the data or only on the validation set) may also optionally be entered to get feedback, but they will not be considered for the final ranking. The organizers will compute validation set scores right away and publish them on-line. The test set results and the competition winner(s) will be disclosed only after the closing deadline.
A development submission may include results on a subset of the datasets. There are no limits on the number of development submissions, except that we request that no more than five submissions per day be made to avoid overloading the system. All final submissions should include classification results on ALL the datasets for the five tasks (that is, training, validation and test set, a total of 15 files) and optionally the corresponding confidence values (15 files) and performance guesses (5 files). There is a limit of 5 final submissions. If more than 5 submissions fulfilling the criterion of a valid final submission are entered, only the last 5 will be taken into account in the final ranking. Therefore, you may enter final submissions even during development, but only the last five will be used for the final ranking.
Why should I make development submissions?
Development submissions are not mandatory. However, they can help you in a number of ways:
- To get familiar with the submission procedure and make sure everything runs smoothly before the rush of the deadline.
- To evaluate your method on an independent test set and compare the results with those of others.
- To get alerted by email if we make changes or become aware of a problem of general interest to the participants.
How does the "performance prediction" challenge connect with "model selection"?
The performance prediction challenge offers a platform to test model selection strategies in two ways:
(1) Accurate performance predictions make good model ranking criteria, so you may want to devise a method to predict performances and use it for model selection.
(2) A set of models (called challenge learning objects) using a standard interface are available; this is an opportunity to demonstrate cleverness in choosing the best model and best hyper-parameter setting rather than using your own model.
Can I attend the model selection workshop if I did not participate in the challenge?
Yes. You can even submit a paper for presentation on the themes of the workshop.
Should I use the models provided for the challenge?
You can use your own model(s).
Why do you want people to use "challenge learning objects"?
The analysis of the results of the challenge will allow us to draw stronger conclusions on model selection methods if the competitors use both the same datasets and the same set of models.
Why do you hide the identity of the datasets?
The datasets were prepared using publicly available data. People who have been exposed to the chosen data or similar data would be at an advantage if they knew something about the data. We think it is fairer to put everyone on the same starting line.
In real life, domain knowledge is of great importance to solve a problem. Yet, some methods have proved to work well on a variety of problems, without domain knowledge. This benchmark is about method genericity, not about domain knowledge.
The identity of the datasets and the preprocessing will be disclosed at the end of the challenge.
Why did you split the data into training, validation, and test set?
The validation set that we reserved could rather be called a "development test set". It allows participants to assess their performance relative to other participants' performance during the development period. The performances on the test set will remain confidential until the closing deadline. This prevents participants from tuning their method using the test set, but it allows them to get some feedback during the development period.
The participants are free to do what they want with the training data, including splitting it again to perform cross-validation.
What motivates the proportion of the data split?
The proportions training/validation/test are 10/1/100. The validation set size is purposely small. Hence, using the validation set performance as your performance prediction is probably not a good idea. The training set is ten times larger than the validation set, to encourage participants to devise strategies of cross-validation or other ways of using the training data to make performance predictions. The test set is 100 times larger than the validation set. Thus, the error bar of our estimate of your "generalization performance" based on test data predictions will be approximately an order of magnitude smaller than the validation error bar. This will make it possible to assess how well the participants could predict their "generalization error".
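The "order of magnitude" claim can be checked with a short Python sketch. It assumes that the error bar shrinks as one over the square root of the number of examples, which is what the sigma formula gives when the per-class error rates and class proportions are the same on both sets. The dataset sizes are taken from the table at the top of this page; the variable names are illustrative.

```python
import math

# (training, validation, test) set sizes from the table of datasets.
sizes = {
    "ADA": (4147, 415, 41471),
    "GINA": (3153, 315, 31532),
    "HIVA": (3845, 384, 38449),
    "NOVA": (1754, 175, 17537),
    "SYLVA": (13086, 1308, 130858),
}

# Assumption: error bar proportional to 1/sqrt(number of examples).
shrink_factors = {
    name: math.sqrt(n_test / n_val)
    for name, (n_train, n_val, n_test) in sizes.items()
}

for name, factor in shrink_factors.items():
    print(f"{name}: test error bar is about {factor:.1f}x smaller than the validation one")
```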
Are the training, validation, and test set distributed differently?
We shuffled the examples randomly before splitting the data. We respected approximately the same proportion of positive and negative examples in each subset. This should ensure that the distributions of examples in the three subsets are similar.
Is it allowed to use the validation and test sets as extra unlabelled training data?
Are the results on NOVA and HIVA correctly reported on the web page?
The datasets NOVA and HIVA are strongly biased: they contain only a small fraction of examples of the positive class. Classifiers that minimize the error rate, rather than the balanced error rate (BER), will tend to systematically predict the negative class. This may yield a reasonable error rate, but a BER of about 50%. However, the AUC may be very good if the classifier orders the scores in a meaningful way.
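To see why a classifier that always predicts the negative class gets a low error rate but a BER of 50%, consider this small worked example in Python. The class counts are hypothetical and are not the actual NOVA or HIVA class proportions.

```python
# Hypothetical class counts for a strongly imbalanced two-class test set.
n_pos, n_neg = 30, 970

# A classifier that always answers "negative" misclassifies every positive
# example and no negative example.
pos_error_rate = 1.0
neg_error_rate = 0.0

# Plain error rate: per-class error rates weighted by class size.
error_rate = (pos_error_rate * n_pos + neg_error_rate * n_neg) / (n_pos + n_neg)
# Balanced error rate: unweighted average of the two per-class error rates.
ber = 0.5 * (pos_error_rate + neg_error_rate)

print(f"error rate = {error_rate:.3f}, BER = {ber:.3f}")
```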
Can I get the labels of the validation set to train my classifier?
It has been argued that by making sufficiently many development submissions, participants could guess the validation set labels and obtain an unfair advantage. One month before the challenge is over, we will make the validation set labels available to the participants so they can use them to make their final submissions. THE VALIDATION SET LABELS ARE AVAILABLE NOW.
Will the organizers enter the competition?
The winner of the challenge may not be one of the challenge organizers. However, other workshop organizers who did not participate in the organization of the challenge may enter the competition. The challenge organizers will enter development submissions from time to time to challenge others, under the name "Reference". Reference entries are shown for information only and are not part of the competition.
Can a participant give an arbitrary hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.
Who can I ask for more help?
For all other questions, email email@example.com.
Last updated: March 26, 2006.