
Performance Prediction Challenge FAQ

*** March 1st, 2006: the challenge is over!  ***

What is the goal of the challenge?
The goal is to provide the best possible predictive models for the five tasks of the challenge AND predict how these models will perform on test data.

Is there a prize?

Yes: A cash award and a certificate will be conferred to the winner during the award banquet of WCCI 2006 (see the workshop page). A special prize will be awarded for the best "bonus entry", if it is not the winning entry. In addition, deserving challenge participants who need financial support to attend the workshop may send a request to modelselect@clopinet.com.

What is the schedule of the challenge?
The challenge started September 30th, 2005 and ended March 1st, 2006. See the exact schedule for the intermediate milestones, the submission of papers, and the workshop.

How do I participate in the challenge?
Participation in the challenge is free and open to everyone. The datasets may be downloaded from the table below, and the results can be submitted on-line. During the development period, results on the validation set can be submitted on-line for immediate performance feedback. At any time before the end of the challenge, competitors can return predictions on the final test set, but the performances on the test set will not be revealed until the challenge is over.

What are the datasets?
Five 2-class classification problems:
Data set  Size (MB)  Data matrix type  Num. ex. (tr/val/te)   Num. feat.
ADA       0.6        Non sparse        4147 / 415 / 41471     48
GINA      19.4       Non sparse        3153 / 315 / 31532     970
HIVA      7.6        Non sparse        3845 / 384 / 38449     1617
NOVA      2.3        Sparse binary     1754 / 175 / 17537     16969
SYLVA     15.6       Non sparse        13086 / 1308 / 130858  216
Each dataset is split into a training set, a validation set (or development test set), and a final test set. All the data (but only the training labels) have been available since the start of the challenge. One month before the end of the challenge, the development period ended and the target values of the validation set were revealed. VALIDATION LABELS: HERE THEY ARE!
The data are also available from the website of the challenge.

What is the data format?
All the data sets are in the same format and include 5 files in text format:
dataname.param: Parameters and statistics about the data
dataname_train.data: Training set (a sparse or a regular matrix, patterns in lines, features in columns).
dataname_valid.data: Development test set or "validation set".
dataname_test.data: Test set.
dataname_train.labels: Labels (truth values of the classes) for training examples.

The matrix data formats used are (in all cases, each line represents a pattern):
- For regular matrices: a space-delimited file with a new-line character at the end of each line.
- For sparse matrices with binary values: each line of the matrix is a space-delimited list of the indices of the non-zero values, followed by a new-line character.
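As an illustration (not the official challenge code), a minimal Python reader for the two matrix formats might look as follows. The function names are ours, and whether the sparse indices are 1-based is an assumption; check the dataname.param file.

```python
import numpy as np

def read_regular(path):
    # Dense format: one space-delimited pattern per line.
    return np.loadtxt(path, ndmin=2)

def read_sparse_binary(path, num_features, one_based=True):
    # Sparse binary format: each line lists the indices of the non-zero values.
    # Whether indices are 1-based is an assumption; adjust via `one_based`.
    rows = []
    with open(path) as f:
        for line in f:
            row = np.zeros(num_features)
            idx = np.array([int(tok) for tok in line.split()], dtype=int)
            if one_based:
                idx = idx - 1
            row[idx] = 1.0
            rows.append(row)
    return np.array(rows)
```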
Is there code to help me read the data and format the results?

Yes: Matlab code for that purpose is provided with the sample code. We also include a simple classifier called "zarbi".

Is there a limit to the number of submissions?
You can make as many submissions as you want (though no more than 5 per day, to avoid overloading our system). However, only your FIVE last valid submissions will be used for the final ranking, if you use your own models. Valid submissions include all results on all datasets.

Is there code I can use to perform the challenge tasks?
Yes: We provide a Matlab package called CLOP (Challenge Learning Object Package), which is based on the interface of the Spider package developed at the Max Planck Institute. It contains preprocessing and learning machine "objects", and examples of how to apply them to the challenge data. We would like you to use this provided package because it will allow us to compare in a more controlled manner the various model selection strategies used by the participants. However, this is not a requirement for participating in the challenge.

I don't understand what you mean by "performance prediction" challenge.
When you participate in the challenge, you will be asked to provide prediction results on test data. Since the five tasks of the challenge are all two-class classification problems, you will be asked to provide class labels for the test examples. The performance of your method will be computed by comparing your predictions with the actual class labels of the examples and computing a prediction score. The prediction score chosen is the balanced error rate (BER), the average of the error rates of the two classes. The objective of the challenge is to predict the "generalization error" of your predictive model. This generalization error cannot be determined exactly, but we estimate it from your performance on test data; this is why we reserved large test sets. In other words, the job of the challenge participants is not only to devise good predictive models, but also to devise good estimators of their performance (based on training data only).

Why should I care about "performance prediction"?

In practical applications, before the final model is deployed in the field, it is important to provide an accurate estimate of how well it will perform on new, unseen data (the so-called "generalization error"). Will the model meet the specifications and be good enough for the application? Should one invest more time and resources to collect additional data or develop more sophisticated models? In many applications, training data are scarce. Reserving a large chunk of the training data to evaluate the model performance is not a satisfactory solution because those data are needed to train the model. Most people resort to some form of cross-validation; eventually, all the data are used to produce the final model. But is cross-validation the best solution? If so, what is the best way of splitting the data to perform cross-validation? Wouldn't we be better off training with all the available data and relying on theoretical predictions of performance? Are there perhaps hybrid methods providing more accurate performance estimators? Think, for instance, of modeling performance as a function of a given hyper-parameter with a parametric curve fitted by cross-validation.

What is the scoring method?
The competitors are asked to provide a guess of how well their model will perform on the test set (the first value in the .guess file.) The challenge entries are judged on this guess. The scoring is based on:

E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)), where
deltaBER = abs(predictedBER - testBER)
Here, predictedBER is your guess of the balanced error rate you will obtain on the test set, based on the available training data; testBER is the BER computed from your predictions on test data. Because testBER is not the expected value of the BER we would like to use in this score, but rather an estimate from a (large) test set, we limit the influence of deltaBER in the region of uncertainty where deltaBER is commensurate with sigma, the error bar on the actual BER as estimated by testBER. The sigma is computed according to the formula:
sigma = (1/2) sqrt[ (p1(1-p1)/m1) + (p2(1-p2)/m2) ]
where p1 (resp. p2) is the fraction of errors on examples of the first class (resp. second class) and m1 (resp. m2) is the number of examples of the first class (resp. second class). Note that during the development period, the validation set was substituted for the test set. The validation set is much smaller than the test set, so it could not be relied upon to give you an accurate predictedBER.
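As a sketch (not the official scoring code), the two formulas above can be computed as follows; the function names are ours:

```python
import math

def challenge_score(predicted_ber, test_ber, sigma):
    # E = testBER + deltaBER * (1 - exp(-deltaBER/sigma))
    delta = abs(predicted_ber - test_ber)
    return test_ber + delta * (1 - math.exp(-delta / sigma))

def ber_error_bar(p1, m1, p2, m2):
    # sigma = 1/2 * sqrt( p1(1-p1)/m1 + p2(1-p2)/m2 )
    return 0.5 * math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
```

Note that a perfect guess (predictedBER == testBER) gives E = testBER, while a guess off by much more than sigma adds essentially the full deltaBER penalty.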

The area under the ROC curve (AUC) will also be computed, if the participants provide classification confidences (the .conf files) in addition to class label predictions (the compulsory .resu files). But the relative strength of classifiers will be judged only on the BER. If you report the error bar on your predicted BER (the second value in the .guess file), this value will be used for research purpose, but not used to score your entry. Other statistics may also be computed and will be reported (e.g. performances using other loss functions) but will not be used towards determining the winner.

How will you create a global ranking?
Two rankings are used to score the participants:
- One ranking is based on the average test score E over all datasets.
- One ranking is based on the average rank of the participants over all 5 datasets, each dataset being ranked by test score E. This prevents overweighting the datasets with the largest error rates.

How should I format and submit my results?
The results on each dataset should be formatted in 7 ASCII files:
dataname_train.resu: +-1 classifier outputs for training examples (mandatory for final submissions).
dataname_valid.resu: +-1 classifier outputs for validation examples (mandatory for development and final submissions).
dataname_test.resu: +-1 classifier outputs for test examples (mandatory for final submissions).
dataname_train.conf: Confidence values for training examples (optional).
dataname_valid.conf: Confidence values  for validation examples (optional).
dataname_test.conf: Confidence values for test examples (optional).
dataname.guess: Your best guess of the BER your model will achieve on test data (one decimal value between 0 and 1). If no guess is provided, a value of 1 will be assumed. Optionally, the .guess file may contain a second value separated by a new line, representing an error bar on the guess BER.
If you used a CLOP model, the model used to generate the results may also be provided:
dataname.mat: Model saved in Matlab format.

Format for classifier outputs:
- All .resu files should have one +-1 integer value per line indicating the prediction for the various patterns.
- All .conf files should have one positive decimal value per line indicating classification confidence. The confidence values can be the absolute discriminant values; they do not need to be normalized to look like probabilities. Optionally, they can be normalized between 0 and 1 to be interpreted as abs(P(y=1|x)-P(y=-1|x)). They will be used to compute ROC curves, the Area Under such Curves (AUC), and other performance metrics such as the negative cross-entropy.
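A minimal Python sketch of writers for these file formats (the function names are ours, not part of the challenge kit; only the line formats above are from the rules):

```python
def write_resu(path, predictions):
    # One +-1 integer prediction per line.
    with open(path, "w") as f:
        for y in predictions:
            assert y in (-1, 1)
            f.write(f"{y}\n")

def write_conf(path, confidences):
    # One positive decimal confidence value per line.
    with open(path, "w") as f:
        for c in confidences:
            f.write(f"{c:.6f}\n")

def write_guess(path, ber_guess, error_bar=None):
    # First line: guessed test BER in [0, 1]; optional second line: error bar.
    with open(path, "w") as f:
        f.write(f"{ber_guess}\n")
        if error_bar is not None:
            f.write(f"{error_bar}\n")
```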

Matlab code to format the data and models is provided with the sample code.

Create a .zip or .tar.gz archive with your files and name the archive after your submission. You may want to check the example submission files zarbi.zip and zarbi.tar.gz.

Submit the results on-line. In case of any problem, contact the challenge web site administrator.

Will the results be published?
Yes, the results of the challenge will be published at the WCCI 2006 conference. In addition, you can submit a paper to that conference and participate in the workshop on model selection where the results of the challenge will be presented (deadline January 31, 2006, EXTENDED TO FEBRUARY 15th). Finally, the best papers will be invited to a special issue of the journal JMLR.

Can I use an alias or a funky email not to reveal my identity?
To enter the final ranking, we require participants to identify themselves by their real name. You cannot win the competition if you use an alias. However, you may use an alias instead of your real name during the development period, to make development entries that do not include results on test data. You must always provide a valid email. Since the system identifies you by email, please always use the same one. Your email will only be used by the administrators to contact you with information that affects the challenge. It will not be visible to others during the challenge.

Do I need to let you know what my method is?
Disclosing information about your method is optional. If you want to submit your model, follow the instructions and submit it with your results.

Can I or my group make multiple submissions?
Multiple submissions by the same person (uniquely and properly identified by name) are permitted, provided that the following conditions are met:
- For each final submission, results on ALL the data sets are provided.
- At most five final submissions are entered per person (if more are made, the last 5 fulfilling the criteria of final submissions will be considered for the final ranking and for selecting the winner).

Can I use a robot to make submissions?
Robot submissions are not explicitly forbidden. However, we require that the total number of submissions per 24 hours from the same origin does not exceed 5. Please be courteous; otherwise we risk overloading the server and will need to take more drastic measures.

Can I make a submission with mixed methods?
Mixed submissions containing results of different methods on the various datasets are permitted. Choosing the methods on the basis of the validation set results is permitted.

What is the difference between a development submission and a final submission?
A final submission consists of classification results on ALL the datasets provided for the five tasks. Partial "development" submissions (including results only on a subset of the data or only on the validation set) may also optionally be entered to get feedback, but they will not be considered for the final ranking. The organizers will compute validation set scores right away and publish them on-line. The test set results and the competition winner(s) will be disclosed only after the closing deadline.

A development submission may include results on a subset of the datasets. There are no limits on the number of development submissions, except that we request that no more than five submissions per day be made, to avoid overloading the system. All final submissions should include classification results on ALL the datasets for the five tasks (that is, training, validation, and test set, a total of 15 files) and optionally the corresponding confidence values (15 files) and performance guesses (5 files). There is a limit of 5 final submissions. If more than 5 submissions fulfilling the criteria of a valid final submission are entered, only the last 5 will be taken into account in the final ranking. Therefore, you may enter final submissions even during development, but only the last five will be used for the final ranking.

Why should I make development submissions?
Development submissions are not mandatory. However, they can help you in a number of ways:
- To get familiar with the submission procedure and make sure everything runs smoothly before the rush of the deadline.
- To evaluate your method on an independent test set and compare the results with those of others.
- To get alerted by email if we make changes or become aware of a problem of general interest to the participants.

How does the "performance prediction" challenge connect with "model selection"?
The performance prediction challenge offers a platform to test model selection strategies in two ways:
(1) Accurate performance predictions make good model ranking criteria, so you may want to devise a method to predict performances and use it for model selection.
(2) A set of models (called challenge learning objects) using a standard interface are available; this is an opportunity to demonstrate cleverness in choosing the best model and best hyper-parameter setting rather than using your own model.

Can I attend the model selection workshop if I did not participate in the challenge?
Yes. You can even submit a paper for presentation on the themes of the workshop.

Should I use the models provided for the challenge?
You can use your own model(s).

Why do you want people to use "challenge learning objects"?
The analysis of the results of the challenge will allow us to draw stronger conclusions on model selection methods if the competitors use both the same datasets and the same set of models.

Why do you hide the identity of the datasets?
The datasets were prepared using publicly available data. People who have been exposed to the chosen data, or similar ones, would be at an advantage if they knew something about the data. We think it is fairer to put everyone on the same starting line.
In real life, domain knowledge is of great importance to solve a problem. Yet, some methods have proved to work well on a variety of problems, without domain knowledge. This benchmark is about method genericity, not about domain knowledge.
The identity of the datasets and the preprocessing will be disclosed at the end of the challenge. 

Why did you split the data into training, validation, and test set?
The validation set that we reserved could rather be called "development test set". It allows participants to assess their performance relative to other participants' performance during the development period. The performances on the test set will remain confidential until the closing deadline. This prevents participants from tuning their method using the test set, but it allows them to get some feedback during the development period.
The participants are free to do what they want with the training data, including splitting it again to perform cross-validation.

What motivates the proportion of the data split?
The proportions training/validation/test are 10/1/100. The validation set size is purposely small; hence, using the validation set performance as your performance prediction is probably not a good idea. The training set is ten times larger than the validation set, to encourage participants to devise strategies of cross-validation or other ways of using the training data to make performance predictions. The test set is 100 times larger than the validation set. Thus, the error bar of our estimate of your "generalization performance" based on test data predictions will be approximately an order of magnitude smaller than the validation error bar. This will make it possible to assess how well the participants could predict their "generalization error".
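To make the order-of-magnitude argument concrete: the BER error bar shrinks as 1/sqrt(m), so a test set 100 times larger than the validation set yields an error bar about sqrt(100) = 10 times smaller. A small Python sketch, using the sigma formula from the scoring section in the class-balanced case (the set sizes 400 and 40000 are illustrative, not actual dataset sizes):

```python
import math

def ber_sigma(p, m):
    # Error bar on the BER for a class-balanced set of m examples,
    # assuming both classes have the same error rate p (m/2 examples each).
    return 0.5 * math.sqrt(2 * p * (1 - p) / (m / 2))

sigma_valid = ber_sigma(0.10, 400)    # validation-sized set
sigma_test = ber_sigma(0.10, 40000)   # test set, 100x larger
ratio = sigma_valid / sigma_test      # approximately 10
```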

Are the training, validation, and test set distributed differently?
We shuffled the examples randomly before splitting the data. We respected approximately the same proportion of positive and negative examples in each subset. This should ensure that the distributions of examples in the three subsets are similar.

Is it allowed to use the validation and test sets as extra unlabelled training data?

Are the results on NOVA and HIVA correctly reported on the web page?

The datasets NOVA and HIVA are strongly biased: they contain only a small fraction of examples of the positive class. Classifiers that minimize the error rate, rather than the balanced error rate (BER), will tend to systematically predict the negative class. This may yield a reasonable error rate, but a BER of about 50%. However, the AUC may still be very good if the classifier orders the scores in a meaningful way.
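To see why, here is a small Python sketch comparing the plain error rate and the BER of a classifier that always predicts the negative class; the 3.5% positive fraction is an illustrative imbalance, not an official dataset statistic:

```python
def balanced_error_rate(y_true, y_pred):
    # BER: average of the per-class error rates.
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == -1]
    err_pos = sum(p != t for t, p in pos) / len(pos)
    err_neg = sum(p != t for t, p in neg) / len(neg)
    return (err_pos + err_neg) / 2

# Illustrative imbalance: 3.5% positives, classifier predicts -1 everywhere.
y_true = [1] * 35 + [-1] * 965
y_pred = [-1] * 1000
error_rate = sum(p != t for t, p in zip(y_true, y_pred)) / 1000  # 0.035
ber = balanced_error_rate(y_true, y_pred)                        # 0.5
```

The error rate looks excellent (3.5%), yet the BER is 50%: the classifier is no better than chance on a per-class basis.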

Can I get the labels of the validation set to train my classifier?
It has been argued that by making sufficiently many development submissions, participants could guess the validation set labels and obtain an unfair advantage. One month before the challenge is over, we will make the validation set labels available to the participants so they can use them to make their final submissions. THE VALIDATION SET LABELS ARE AVAILABLE NOW.

Will the organizers enter the competition?
The winner of the challenge may not be one of the challenge organizers. However, other workshop organizers who did not participate in the organization of the challenge may enter the competition. The challenge organizers will enter development submissions from time to time to challenge others, under the name "Reference". Reference entries are shown for information only and are not part of the competition.

Can a participant give an arbitrarily hard time to the organizers?

Who can I ask for more help?
For all other questions, email modelselect@clopinet.com.

Last updated: March 26, 2006.