Table 1: Datasets of the challenge.

Dataset | Domain               | Size (MB) | Data matrix type | Num. ex. (tr/val/te)  | Num. feat.
ADA     | Marketing            | 0.6       | Non-sparse       | 4147 / 415 / 41471    | 48
GINA    | Digit recognition    | 19.4      | Non-sparse       | 3153 / 315 / 31532    | 970
HIVA    | Drug discovery       | 7.6       | Non-sparse       | 3845 / 384 / 38449    | 1617
NOVA    | Text classification  | 2.3       | Sparse binary    | 1754 / 175 / 17537    | 16969
SYLVA   | Ecology              | 15.6      | Non-sparse       | 13086 / 1308 / 130858 | 216
Table 2: Overall ranking of the final entrants.

Entrant | Method | BER guess | Test BER | Test AUC | Test sigma | Test guess error | Test score | Average rank
Roman Lutz | | 0.10398 | 0.108984 | 0.891016 | 0.002416 | 0.007911 | 0.116482 | 6.2
Gavin Cawley | | 0.110463 | 0.11244 | 0.924643 | 0.002462 | 0.003366 | 0.115187 | 7.6
Radford Neal | | 0.12008 | 0.111803 | 0.930368 | 0.00246 | 0.012167 | 0.123495 | 7.8
Corinne Dahinden | | 0.1087 | 0.115813 | 0.884187 | 0.00248 | 0.010635 | 0.126363 | 7.8
Wei Chu | | 0.106902 | 0.115307 | 0.536672 | 0.002504 | 0.008405 | 0.123346 | 8.2
Nicolai Meinshausen | | 0.107605 | 0.116665 | 0.883335 | 0.002491 | 0.010973 | 0.127356 | 8.6
Marc Boulle | | 0.1306 | 0.130675 | 0.92424 | 0.00267 | 0.009634 | 0.139864 | 10.4
Kari Torkkola & Eugene Tuv | | 0.09904 | 0.119135 | 0.880865 | 0.002524 | 0.020761 | 0.139833 | 14
Olivier Chapelle | | 0.113162 | 0.126242 | 0.920822 | 0.00267 | 0.015373 | 0.141382 | 15.8
J. Wichard | | 0.12772 | 0.125181 | 0.901349 | 0.002681 | 0.017939 | 0.142989 | 16.6
Advanced Analytical Methods, INTEL | | 0.1123 | 0.136637 | 0.90704 | 0.002654 | 0.027929 | 0.164384 | 16.6
Vladimir Nikulin | | 0.1044 | 0.133446 | 0.866661 | 0.002639 | 0.029046 | 0.16239 | 16.8
Edward Harrington | | 0.129347 | 0.139374 | 0.899119 | 0.002622 | 0.010027 | 0.149237 | 16.8
Kai Chi | | 0.1129 | 0.131494 | 0.868506 | 0.002733 | 0.024463 | 0.155948 | 17.6
YuYen Ou | | 0.0966 | 0.127346 | 0.872654 | 0.002656 | 0.033639 | 0.16098 | 17.8
Juha Reunanen | | 0.142651 | 0.146752 | 0.913843 | 0.002954 | 0.008484 | 0.154454 | 19
YenJen Oyang | RBF + ICA ( 3: 1 ) 86 + v | 0.1125 | 0.132369 | 0.867631 | 0.002777 | 0.021092 | 0.153333 | 19
Patrick Haluptzok | NN Vanilla | 0.1396 | 0.139603 | 0.860397 | 0.00276 | 0.024002 | 0.163415 | 19.2
Tobias Glasmachers | KTA+CV+SVM (3) | 0.135685 | 0.148765 | 0.890652 | 0.002722 | 0.015699 | 0.164224 | 19.6
Darby TienHao Chang | PCA+ME+SVM+valid | 0.1425 | 0.150324 | 0.849676 | 0.002537 | 0.018775 | 0.168944 | 21.4
gavin growup | chi+ica+com | 0.1061 | 0.134534 | 0.865466 | 0.002782 | 0.028434 | 0.162963 | 21.6
WHY | chi + svm | 0.1029 | 0.138013 | 0.861987 | 0.002651 | 0.040983 | 0.178989 | 22
Seyna | Fscore+Chi+SVM | 0.10198 | 0.134534 | 0.865466 | 0.002782 | 0.032554 | 0.167083 | 22
Chunghoon Kim | 2DCLDAQuad (2) | 0.1481 | 0.153428 | 0.885056 | 0.002874 | 0.022441 | 0.175736 | 22.6
Machete | 16LSVC+adaboost. | 0.120622 | 0.170669 | 0.881446 | 0.002887 | 0.050047 | 0.220686 | 25.2
decoste | submit_test | 0.137154 | 0.175312 | 0.901088 | 0.002868 | 0.038639 | 0.213911 | 25.6
Myoung Soo Park | Scaling + CAPCA + ELN | 0.1338 | 0.199105 | 0.800895 | 0.00291 | 0.065305 | 0.26441 | 26.4
w_pietrus | CWFS + DT | 0.25776 | 0.193905 | 0.806095 | 0.002728 | 0.279849 | 0.473628 | 27
Ranking by dataset
Two of the per-dataset winners do not appear among the top three participants of the overall ranking.
Table 3: Best entrants by dataset.

Dataset | Entrant | Test BER | Test AUC | Test guess error | Test score
ADA | Marc Boulle | 0.172266 | 0.91491 | 0.007266 | 0.179288
GINA | Kari Torkkola & Eugene Tuv | 0.028833 | 0.971167 | 0.007266 | 0.030216
HIVA | Gavin Cawley | 0.275695 | 0.7671 | 0.001667 | 0.279689
NOVA | Gavin Cawley | 0.04449 | 0.991362 | 0.0065 | 0.044833
SYLVA | Marc Boulle | 0.00614 | 0.999119 | 0.000873 | 0.00618
Result analysis
Dataset profiles
In Figure 1, we show the distribution of test BER for the final entrants. HIVA
(drug discovery) appears to be the most difficult dataset: both the average BER
and the spread are high. ADA (marketing) is the second hardest. Its
distribution is very skewed and heavy-tailed, indicating that a small
group of methods "solved" a problem that was not obvious to others.
NOVA (text classification) and GINA (digit recognition) come next. In both
datasets the classes contain multiple clusters, so the problems are highly
nonlinear, which may explain the very long tails. Finally, SYLVA
(ecology) is the easiest dataset, thanks to its large amount of training data.
In this challenge, many classifiers did well. The top-ranking entries used a variety of methods:
- Ensembles of decision trees (Roman Lutz, Corinne Dahinden, Intel)
- Kernel methods/SVMs (Gavin Cawley, Wei Chu, Kari Torkkola and Eugene Tuv, Olivier Chapelle, Vladimir Nikulin, YuYen Ou)
- Bayesian neural networks (Radford Neal)
- Ensembles of linear methods (Nicolai Meinshausen)
- Naive Bayes (Marc Boulle)
Others used mixed models. It is interesting to note that single methods
did better than mixed models.
In Figure 3, we show the performances of the final entrants, with symbols
coding for the methods used:
- X: mixed or unknown method
- TREE: ensembles of trees (like Random Forests, RF)
- NN/BNN: neural networks or Bayesian neural networks
- NB: Naive Bayes
- LD/SVM/KLS/GP: methods linear in their parameters, including kernel methods and linear discriminant (LD), Support Vector Machines (SVM), Kernel Least-Squares (KLS) and LS-SVMs, Gaussian Processes (GP)
This chart reveals that Naive Bayes did very well on two datasets (ADA and SYLVA) but poorly on the others, so its overall ranking is not good. Similarly, kernel methods did very well on most datasets (they rank first on HIVA, GINA, and NOVA) but poorly on ADA; this failure on ADA leaves them only fifth in the overall ranking. They are the most frequently used type of method, but their performance varies a lot across participants, so they do not perform well in all hands.

By contrast, ensembles of decision trees are less popular, but they performed consistently well on all datasets in this challenge, even though they were never the top entry for any single dataset. The challenge winner used an ensemble of decision trees. Similarly, Bayesian neural networks did well on all datasets without being best on any; they rank third overall.
This analysis also tells us something about the datasets:
- SYLVA has a large number of training examples, so essentially all methods do well, even simple Naive Bayes.
- We know by design that GINA and NOVA are very nonlinear. The top-ranking participants used highly nonlinear methods, and Naive Bayes failed. HIVA also seems to fall into this category, which is not too surprising: chemical molecules with very different structures can be active or inactive drugs.
- ADA has a particularity that makes kernel methods fail and that should be investigated further. We conjecture that it could be because the variables are mixed categorical/binary/continuous.
Preprocessing and feature extraction
Further examination of the fact sheets (see Tables 2 and 3) reveals that
the preprocessing methods used were quite elementary:
- Tree classifiers most often use no preprocessing at all (Roman Lutz, Corinne Dahinden, Intel).
- The most common preprocessing is feature standardization (subtract the mean and divide by the standard deviation for each feature).
- The naive Bayes method (Marc Boullé) uses discretization of continuous variables and constructs extra features as sums of elementary features to alleviate the independence assumptions.
- A few entries used PCA or ICA to extract features.
- Most entries used no feature selection at all: they relied on regularization to avoid overfitting. Some entries used embedded feature selection (ARD priors for the Bayesian neural network of Radford Neal, feature scaling for the SVMs of Olivier Chapelle); a few used univariate filters; the naive Bayes entry (Marc Boullé) used wrapper feature selection; the entry of Torkkola and Tuv used features generated by an ensemble of tree classifiers and computed a threshold of significance using "random probes".
- Some entries resampled the training data to balance the two classes.
There does not seem to be a correlation between the type of preprocessing
used and how well entrants did in the challenge.
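The most common preprocessing step above, per-feature standardization, can be sketched as follows (a minimal illustration of the general technique, not any entrant's code; the epsilon guard against constant features is our own addition):

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-12):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    # Guard against division by ~0 for (near-)constant features.
    return mu, np.where(sd < eps, 1.0, sd)

def standardize(X, mu, sd):
    """Subtract the training mean and divide by the training std, feature-wise."""
    return (X - mu) / sd
```

Note that validation and test data must be transformed with statistics estimated on the training set only, otherwise the performance guess is contaminated.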
Model selection and performance prediction
Even though it is interesting to see how well methods performed as a function
of the classification technique, the most interesting analysis concerns
the methods used to predict the BER and to select models,
since this was the theme of the challenge. These methods are described in
the fact sheets linked from Tables 2 and 3. In the analysis below, the quoted
guess errors do not necessarily correspond to the entries of Table 2, which
are best with respect to the ranking score, not with respect to the guess error.
For the full tables, see http://www.modelselect.inf.ethz.ch/results.php.
Nested cross-validation loops
The entrant who obtained the best average guess error (Gavin Cawley,
average guess error: 0.0034) used a rather sophisticated cross-validation
scheme. He adjusted hyperparameters with a "virtual leave-one-out" cross-validation
(VLOO): for regularized least-squares kernel classifiers (LS-SVMs, kernel
ridge regression, RLSC, Gaussian processes), the leave-one-out error can be
computed without training several classifiers (each time leaving
one example out). Only one training on the entire training set and some
inexpensive additional calculations are required. Gavin Cawley explored various
loss functions for the VLOO method for LS-SVMs. He selected the best loss
function by cross-validation (drawing 100 random 90% training / 10% validation
splits; we call this 100CV). After selecting his model, he re-estimated performance
by 100CV using fresh data splits.
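To make the VLOO idea concrete, here is a minimal sketch for kernel ridge regression (our illustration of the closed-form identity, not Cawley's actual code): with C = (K + lambda*I)^-1 and alpha = C y, the leave-one-out residual of example i is alpha_i / C_ii, so all n leave-one-out predictions follow from a single matrix inversion.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix of the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def virtual_loo_predictions(K, y, lam=1.0):
    """Closed-form leave-one-out predictions of kernel ridge regression.

    With C = (K + lam*I)^{-1} and alpha = C @ y, the LOO residual of
    example i is alpha_i / C_ii -- no retraining needed.
    """
    C = np.linalg.inv(K + lam * np.eye(len(y)))
    alpha = C @ y
    return y - alpha / np.diag(C)

def virtual_loo_ber(K, y, lam=1.0):
    """LOO balanced error rate for +/-1 labels, thresholding at zero."""
    pred = np.sign(virtual_loo_predictions(K, y, lam))
    errs = [np.mean(pred[y == c] != c) for c in (-1.0, 1.0)]
    return float(np.mean(errs))
```

The identity matches retraining n times with one example held out each time, at roughly the cost of a single fit, which is why nested loops built on VLOO stay cheap.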
The entrant who obtained the second best average guess error (Juha Reunanen,
average guess error: 0.0048) performed a similar type of cross-validation
called "cross-indexing", which uses nested cross-validation loops. The BER
guess was obtained from the cross-validation performance in the outer loop.
Nested cross-validation loops can be computationally expensive; however,
when VLOO is used, the computational overhead is minimal. We
note, however, that VLOO is not available for all methods. The exact leave-one-out
error can be computed with VLOO for regularized least-squares kernel
methods. Approximate formulas are available for SVMs (as used by Olivier Chapelle,
average guess error: 0.0137) and neural networks.
Plain cross-validation
Many participants used plain cross-validation, with a preference for
10-fold cross-validation. They chose their hyperparameters on the basis
of the smallest cross-validated BER. The same cross-validated BER (CV BER)
was used to guess the test BER, hoping that the bias introduced by training on
only 90% of the data would be compensated by the bias introduced
by selecting the model with the smallest CV BER. This strategy seems to have
been reasonable, since the challenge winner (Roman Lutz, average guess error:
0.0059) used it. He won because of his good BER performance; the second best
entrant (Gavin Cawley) had better BER guesses (average guess error: 0.0034).
A few entrants took care to balance the fractions of positive and negative
examples in the data splits to reflect the proportions found in the entire
training set (stratified cross-validation).
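This strategy can be sketched as follows, assuming +/-1 labels and a stand-in nearest-centroid classifier (any real entry would plug in its own model and wrap this in a hyperparameter loop):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: average of the per-class error rates."""
    errs = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(errs))

def nearest_centroid_fit_predict(X_tr, y_tr, X_te):
    """Stand-in classifier: assign each test point to the closer class centroid."""
    mu_pos = X_tr[y_tr == 1].mean(axis=0)
    mu_neg = X_tr[y_tr == -1].mean(axis=0)
    d_pos = ((X_te - mu_pos) ** 2).sum(axis=1)
    d_neg = ((X_te - mu_neg) ** 2).sum(axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def cv_ber(X, y, k=10, seed=0):
    """k-fold cross-validated BER, also usable directly as a guess of the test BER."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    bers = []
    for te in np.array_split(order, k):
        tr = np.setdiff1d(order, te)  # all examples outside the held-out fold
        pred = nearest_centroid_fit_predict(X[tr], y[tr], X[te])
        bers.append(balanced_error_rate(y[te], pred))
    return float(np.mean(bers))
```

The single number returned plays both roles described above: the model-selection criterion and the BER guess.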
Other methods
A few entrants performed model selection or estimated the performance
on future data using the training data alone, without reserving validation data or
performing cross-validation; this is possible for methods that are not
prone to overfitting. It was successful
as a model selection strategy for the naive Bayes classifier (Marc
Boullé). Radford Neal also predicted the error rate of his Bayesian neural
network using training data only, but this was not among the
best-performing guess methods (average guess error: 0.0122). Other methods include
the use of performance bounds for SVM model selection, like the radius-margin
bound (Olivier Chapelle).
Bagging methods (including Random Forests) use bootstrap resampling.
The final model makes decisions according to a vote of the models trained
on the various bootstrap samples. The error rate of the model can be estimated
using the "out-of-bag" samples. This method was used by Nicolai Meinshausen
(average guess error: 0.0098).
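The out-of-bag estimate can be sketched as follows (our illustration with a stand-in nearest-centroid base learner, not Meinshausen's actual procedure): each example is scored only by the bootstrap models that did not see it during training.

```python
import numpy as np

def nearest_centroid_fit_predict(X_tr, y_tr, X_te):
    """Stand-in base learner: closest class centroid wins."""
    mu_pos = X_tr[y_tr == 1].mean(axis=0)
    mu_neg = X_tr[y_tr == -1].mean(axis=0)
    d_pos = ((X_te - mu_pos) ** 2).sum(axis=1)
    d_neg = ((X_te - mu_neg) ** 2).sum(axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def oob_error(X, y, n_models=50, seed=0):
    """Out-of-bag error estimate of a bagged ensemble of base learners."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros(n)   # sum of +/-1 votes from models for which i was out-of-bag
    counts = np.zeros(n)
    for _ in range(n_models):
        boot = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        if len(np.unique(y[boot])) < 2:
            continue                            # need both classes to fit the learner
        oob = np.setdiff1d(np.arange(n), boot)  # examples not drawn this round
        pred = nearest_centroid_fit_predict(X[boot], y[boot], X[oob])
        votes[oob] += pred
        counts[oob] += 1
    seen = counts > 0
    oob_pred = np.where(votes[seen] >= 0, 1, -1)  # majority vote (ties -> +1)
    return float(np.mean(oob_pred != y[seen]))
```

Because each example is evaluated only by models that never trained on it, the estimate comes essentially for free with bagging: no extra data splits are needed.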
The least reliable method was to use the validation set of the challenge
to predict the test set performance. The best ranking participants did not
use this method.
Scoring method
The entrants were judged both on their prediction performance (small
test BER) and on their capability to predict this performance accurately (small
absolute difference between the guessed BER and the test BER, that is, a small
"guess error" deltaBER). The guess error is really what matters most to
us in this challenge. However, it is not meaningful to rank entries with
bad BER performance by their guess error since, in the limit, guessing
the labels at random following the right class proportions yields an
easy-to-predict BER of 50%.
Our ranking score is basically E = testBER + deltaBER, where
deltaBER = |guessedBER - testBER|.
We penalize guess errors symmetrically because, from the point of view
of model selection, it is equally bad to select a model performing poorly
and to reject a good model. As can be seen in Figure 4, the guess
error is by no means symmetric: participants most often make overly optimistic
predictions, so the penalization of pessimistic predictions rarely applies.
In fact, the score that we use is E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)).
The weighting factor (1 - exp(-deltaBER/sigma)) takes into account our uncertainty
about what the real BER is: our estimate of the real BER from test data
has an error bar sigma (see the FAQ for details). The behavior of this corrected
score is illustrated in Figure 5. One may wonder how much influence this
correcting factor has. The answer is: not very much. As can be seen in Figure 6,
deltaBER/sigma is usually larger than one, which shows that our test set is
large enough to measure how good the guesses are.
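The score is a one-liner; writing it out makes the limiting behavior explicit:

```python
import math

def challenge_score(test_ber, guessed_ber, sigma):
    """Ranking score E = testBER + deltaBER * (1 - exp(-deltaBER / sigma)).

    When the guess error deltaBER is large compared to the error bar sigma,
    the score approaches testBER + deltaBER; when deltaBER is within the
    measurement noise, the penalty fades out.
    """
    delta = abs(guessed_ber - test_ber)
    return test_ber + delta * (1 - math.exp(-delta / sigma))
```

A perfect guess gives E = testBER; a guess error many sigmas wide adds nearly the full deltaBER. (Note that the per-entrant scores in Table 2 are averages over the five datasets, so they cannot be reproduced by plugging the averaged columns into this formula.)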
Figure 7: Monitoring gamma in score = testBER + deltaBER * (1 - exp(-gamma * deltaBER/sigma)).
The curves show results on the GINA dataset.
Conclusion
Rather unsophisticated methods (e.g. simple 10-fold cross-validation)
predicted performance well. Nested cross-validation loops are
advisable to gain extra prediction accuracy; they come at little extra computational
expense if the inner loop uses virtual leave-one-out. Other methods do
not seem to perform as well. A number of sophisticated methods with
appealing theoretical motivations were proposed for the workshop, but their
authors did not compete in the challenge. We believe there is still a gap
to be filled between theory and practice in this game of performance prediction
and model selection.
Disclaimer
This project is supported by the National Science Foundation under Grant
No. ECS-0424142. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily
reflect the views of the National Science Foundation.
Last updated: April 12, 2006.