
NIPS2003 challenge on feature selection


When did the benchmark start?

The datasets were released and the web site made public on Monday September 8, 2003.

Is there code to help me read the data and format the results?

Yes: Matlab code is provided for that purpose with the sample code. We also include a simple classifier called "lambda".

Where should I submit my results?

Email the challenge webmaster if you encounter problems.

When should I submit my results?

Anytime before or on December 1st, 2003.

When should I ask to make a presentation at the NIPS workshop?

Anytime before or on December 1st, 2003.

What is the difference between a development submission and a final submission?

A final submission consists of classification results on ALL the datasets provided for the five tasks, together with the corresponding feature sets. Submissions including results on the validation set only, or on a subset of the datasets, may also optionally be entered as development submissions, but they will not be considered for the final ranking. The organizers will compute validation set scores right away and publish them on-line. The test set results and the competition winner will be disclosed only after the closing deadline.

A development submission may include results on a subset of the datasets. There are no limits on the number of development submissions, except that we request that no more than five submissions per day be made to avoid overloading the system. All final submissions should include classification results on ALL the datasets for the five tasks (that is training, validation and test set, a total of 15 files) and optionally the corresponding confidence values (15 files) and feature sets (5 files). There is a limit of 5 final submissions. If more than 5 submissions fulfilling the criterion of a valid final submission are entered, only the last 5 will be taken into account in the final ranking. Therefore, you may enter final submissions even during development, but only the last five will be used for the final ranking.

Why should I make development submissions?

Development submissions are not mandatory. However, they can help you in a number of ways:
- To get familiar with the submission procedure and make sure everything runs smoothly before the rush of the deadline.
- To evaluate your method on an independent test set and compare the results with those of others.
- To get alerted by email if we make changes or become aware of a problem of general interest to the participants.

Can I remain anonymous?

Anonymous development submissions are authorized during the development period. Just ask to remain anonymous and your name will not be mentioned next to the performance of your method. However, participants will not be ranked in the final test unless they return results under their real names on all five test sets. Therefore, do not check the anonymous box for your final submissions.

Can I use an alias or a funky email not to reveal my identity?

Forging names is strictly forbidden. The person who makes a submission must identify him/herself with his/her real name (last name and first name). Do not use pseudonyms or nicknames. Anonymous submissions protect you from revealing your identity during the development period. If you do not make any final submission, your identity will not be revealed. People caught cheating will be excluded from the competition.

Can I or my group make multiple submissions?

Multiple submissions by the same person (uniquely and properly identified by their real name) are permitted, provided that the following conditions are met:

Is it possible to select a different number of features for different datasets?


Can I use a robot to make submissions?

Robot submissions are not explicitly forbidden. However, we require that the total number of submissions per 24 hours from the same origin does not exceed 5. Please be courteous; otherwise we risk overloading the server and will need to take more drastic measures.

Can I make a submission with mixed methods?

Mixed submissions containing results of different methods on the various datasets are permitted. Choosing the methods on the basis of the validation set results is permitted.

Do I need to participate in the workshop to enter the challenge?


Do I need to enter the challenge to participate in the workshop?

No. Email your abstracts to before December 1st, 2003.

Do I need to let you know what my method is?

Disclosing information about your method is optional, but recommended.

Can you hide the number of features of my development submissions on the web page so that others cannot use this valuable information to guess what the "correct" number of features is?

You are free not to submit your feature set for development submissions. Your results will appear as if they used all the features. You may use the number of features of other people's best-performing methods as a clue if you wish. The organizers do not check the validity of the features for development submissions, so do it at your own risk. Reference submissions should provide you with reliable information.

What are the data formats?

All the data sets are in the same format and include 5 files in ASCII format:
dataname.param: Parameters and statistics about the data.
dataname_train.data: Training set (a sparse or a regular matrix, patterns in lines, features in columns).
dataname_valid.data: Validation set.
dataname_test.data: Test set.
dataname_train.labels: Labels (truth values of the classes) for training examples.

The matrix data formats used are (in all cases, each line represents a pattern):
- For regular matrices: a space delimited file with a new-line character at the end of each line.
- For sparse matrices with binary values: for each line of the matrix, a space delimited list of indices of the non-zero values. A new-line character at the end of each line.
- For sparse matrices with non-binary values: for each line of the matrix, a space delimited list of index:value pairs for the non-zero entries, each value separated from its index by a colon. A new-line character at the end of each line.

Matlab code to read the data is provided with the sample code.
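The provided Matlab code is the reference; purely as an illustration, a minimal Python reader for the three matrix formats described above might look like this (1-based indexing in the sparse files is assumed here, by analogy with the .feat convention):

```python
import numpy as np

def read_regular(path):
    # Regular matrix: space-delimited values, one pattern per line.
    return np.loadtxt(path)

def read_sparse_binary(path, n_features):
    # Sparse binary: each line lists the (assumed 1-based) indices
    # of the non-zero entries of one pattern.
    rows = []
    with open(path) as f:
        for line in f:
            row = np.zeros(n_features)
            for tok in line.split():
                row[int(tok) - 1] = 1
            rows.append(row)
    return np.vstack(rows)

def read_sparse(path, n_features):
    # Sparse non-binary: each line lists index:value pairs
    # (assumed 1-based indices) for one pattern.
    rows = []
    with open(path) as f:
        for line in f:
            row = np.zeros(n_features)
            for tok in line.split():
                idx, val = tok.split(":")
                row[int(idx) - 1] = float(val)
            rows.append(row)
    return np.vstack(rows)
```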

How should I format my results?

The results on each dataset should be formatted in 7 ASCII files:
dataname_train.resu: +-1 classifier outputs for training examples (mandatory for final submissions).
dataname_valid.resu: +-1 classifier outputs for validation examples (mandatory for development and final submissions).
dataname_test.resu: +-1 classifier outputs for test examples (mandatory for final submissions).
dataname_train.conf: confidence values for training examples (optional).
dataname_valid.conf: confidence values  for validation examples (optional).
dataname_test.conf: confidence values for test examples (optional).
dataname.feat: list of features selected (one integer feature number per line, starting from one, ordered from the most important to the least important if such order exists). If no list of features is provided, it will be assumed that all the features were used.
Format for classifier outputs:
- All .resu files should have one +-1 integer value per line indicating the prediction for the various patterns.
- All .conf files should have one positive decimal value per line indicating classification confidence. The confidence values can be the absolute discriminant values. They do not need to be normalized to look like probabilities. They will be used to compute ROC curves and the Area Under such Curve (AUC).

Matlab code to format the data is provided with the sample code.
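The provided Matlab code is the reference formatter; as a sketch only, writing these files from Python might look like this (function names and the +-1 sanity check are illustrative, not part of the official code):

```python
import numpy as np

def write_results(basename, subset, predictions, confidences=None):
    # Write +-1 predictions, one integer per line,
    # e.g. "dataname_valid.resu".
    preds = np.asarray(predictions, dtype=int)
    assert set(np.unique(preds)) <= {-1, 1}, "outputs must be +-1"
    with open(f"{basename}_{subset}.resu", "w") as f:
        for p in preds:
            f.write(f"{p}\n")
    if confidences is not None:
        # Optional positive confidence values, one per line,
        # e.g. absolute discriminant values.
        with open(f"{basename}_{subset}.conf", "w") as f:
            for c in confidences:
                f.write(f"{abs(float(c))}\n")

def write_features(basename, feature_indices):
    # 1-based feature numbers, most important first, one per line.
    with open(f"{basename}.feat", "w") as f:
        for i in feature_indices:
            f.write(f"{int(i)}\n")
```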

Create a .zip or .tar.gz archive with your files and give the archive the name of your submission. You may want to check the example submission file.

How will you rate the results?

The classification results will be rated with the balanced error rate (BER = the average of the error rate on positive class examples and the error rate on negative class examples). Note that this is not the regular error rate in which errors on positive and negative examples are penalized in the same way. E.g., if there are fewer positive examples, the errors on positive examples will count more. The area under the ROC curve (AUC) will also be computed, if the participants provide classification confidences in addition to class label predictions. But the relative strength of classifiers will be judged only on the BER. The participants are invited to provide the list of features used.
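As an illustration (not the official scoring code), the BER can be computed as follows; the example reproduces the class-imbalance effect described above:

```python
def balanced_error_rate(y_true, y_pred):
    # BER = average of the per-class error rates; labels are +-1.
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == -1]
    err_pos = sum(1 for t, p in pos if p != t) / len(pos)
    err_neg = sum(1 for t, p in neg if p != t) / len(neg)
    return 0.5 * (err_pos + err_neg)

# With 10% positive examples, always predicting -1 gives a plain
# error rate of 10% but a BER of 50%.
y_true = [1] * 10 + [-1] * 90
y_pred = [-1] * 100
```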
For methods having performance differences that are not statistically significant, the method using the smallest number of features will win. If no feature set is provided, it will be assumed that all the features were used.
A certain number of features that are meaningless by design (random probes) were introduced in the data. The proportion of random probes in the declared feature set will also be computed. It will be used to assess the relative strength of methods that are not significantly different in both error rate and number of features. In that case, the method with the smallest number of random probes in its feature set will win.
The submissions will be ranked for each of the five tasks. A global ranking will also be made. The top submission of the global ranking will be declared winner of the challenge.
We provide an implementation of our ranking method in Matlab, for review purpose. The organizers reserve the right to make changes in the ranking method. The participants will be notified if changes are made.

How will you know that people effectively used the features they declared?

After the closing deadline, the organizers will use a set of canonical classifiers and train and test them using the features declared. If the best performance thus obtained is significantly different from the results obtained by the participant, the organizers may question the participant on the method used. The organizers may also provide the participant with one or several additional test sets containing only the features they selected to verify the accuracy of their classifier when it uses only those features.

What are the datasets you used?

The datasets were prepared using publicly available datasets. The identity of the original data will be revealed after the closing deadline. The intention is to make the chances of the participants more even and to prevent participants from using domain knowledge about the data. The data are sufficiently disguised that it should be non-trivial to recognize them. Although this is not a challenge in data encryption, if experts in data encryption win the challenge by discovering the identity of the test labels through an identification of the original data, they will receive a special mention.

Preparing the data included the following steps:
- Preprocessing data to obtain features in the same numerical range (0 to 999 for continuous data and 0/1 for binary data).
- Adding “random probes” that are features meaningless by design, distributed similarly to the real features. This will allow us to rank algorithms according to their ability to filter out irrelevant features.
- Randomizing the order of the patterns and the features to homogenize the data.
- Training and testing on various data splits using simple feature selection and classification methods to obtain "typical" baseline performances.
- Determining the approximate number of test examples needed for the test set to obtain statistically significant test results, using the rule-of-thumb ntest = 100/p, where p is the estimated "typical" test set error rate; e.g., p = 0.1 calls for about 1000 test examples (see What size test set gives good error rate estimates? I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik. PAMI, 20 (1), pages 52-64, IEEE. 1998).
- Splitting the data into training, validation and test set. The size of the validation set is usually smaller than that of the test set to keep as much training and test data as possible.

What are "random probes"?

Random probes are presumably meaningless features. They were drawn at random using a distribution similar to that of the real features, but using no information about the class labels. They should carry no information about the problem at hand. They are simple "distracters" that were added to the data with two purposes: make the task more challenging, and allow the organizers to control the quality of the feature sets obtained.

Why do you hide the identity of the datasets?

People who have previously been exposed to the chosen datasets, or to similar ones, may be at an advantage. We think it is fairer to put everyone on the same starting line.
In real life, domain knowledge is of great importance to solve a problem. Yet, some methods have proved to work well on a variety of problems, without domain knowledge. This benchmark is about method genericity, not about domain knowledge.
The identity of the datasets and the preprocessing will be disclosed at the end of the benchmark. People can then see whether they can do better using domain knowledge.

Why did you split the data into training, validation, and test set?

The validation set that we reserved could rather be called a "development test set". It allows participants to assess their performance relative to other participants' performance during the development period. The performances on the test set will remain confidential until the closing deadline. This prevents participants from tuning their method on the test set, while still giving them some feedback during the development period.
The participants are free to do what they want with the training data, including splitting it again to perform cross-validation.

Are the training, validation, and test set distributed differently?

We shuffled the examples randomly before splitting the data. We respected approximately the same proportion of positive and negative examples in each subset. This should ensure that the distributions of examples in the three subsets are similar.
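A stratified shuffle-and-split of this kind can be sketched as follows (illustrative only, not the organizers' actual procedure; the split fractions are hypothetical):

```python
import random

def stratified_split(labels, fractions=(0.5, 0.2, 0.3), seed=0):
    # Shuffle indices within each class, then cut each class with the
    # same proportions, so every subset keeps a similar class balance.
    rng = random.Random(seed)
    splits = [[] for _ in fractions]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        start = 0
        for k, frac in enumerate(fractions):
            stop = len(idx) if k == len(fractions) - 1 \
                else start + round(frac * len(idx))
            splits[k].extend(idx[start:stop])
            start = stop
    return splits  # training, validation, test index lists
```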

Is it allowed to use the validation and test sets as extra unlabelled training data?


Are all the results of Dorothea correctly reported on the  web page?

Dorothea is a strongly biased dataset: it has only about 10% positive examples. Classifiers that minimize the error rate, rather than the balanced error rate (BER), will tend to systematically predict the negative class. This yields an error rate of about 10%, but a BER of about 50%. However, the AUC may still be very good if the classifier orders the scores in a meaningful way.

Can I get the labels of the validation set to train my classifier?

It has been argued that by making sufficiently many development submissions, participants could guess the validation set labels and obtain an unfair advantage. After the December 1st deadline, we will make the validation set labels available to the participants who have made valid final submissions. Such participants will then get a deadline extension until December 8 to make additional final submissions using the extra information of the validation set labels. We impose a limit of five additional final submissions. The results on all final submissions (those of December 1st and December 8) will not be revealed until the workshop is held. The test set labels and the identity of the datasets will also be revealed at that time.

Will the organizers enter the competition?

The winner of the challenge cannot be one of the challenge organizers. However, other workshop organizers who did not participate in the organization of the challenge may enter the competition. The challenge organizers will enter development submissions from time to time to challenge others, under the name "Reference". Reference entries are shown for information only and are not part of the competition.

Can a participant give an arbitrarily hard time to the organizers?


Who can I ask for more help?

For technical problems relative to the submissions and results, contact the challenge webmaster.
For all other questions, email

Last updated: October 28, 2003.