Data set name | Domain | Size (MB) | Data matrix type | Num. ex. (tr/val/te) | Num. feat.
ADA | Marketing | 0.6 | Non sparse | 4147 / 415 / 41471 | 48
GINA | Digit recognition | 19.4 | Non sparse | 3153 / 315 / 31532 | 970
HIVA | Drug discovery | 7.6 | Non sparse | 3845 / 384 / 38449 | 1617
NOVA | Text classification | 2.3 | Sparse binary | 1754 / 175 / 17537 | 16969
SYLVA | Ecology | 15.6 | Non sparse | 13086 / 1308 / 130858 | 216
Table 2: Overall ranking.
Entrant | Method | BER Guess | Test BER | Test AUC | Test Sigma | Test Guess Error | Test Score | Average Rank
Roman Lutz |  | 0.10398 | 0.108984 | 0.891016 | 0.002416 | 0.007911 | 0.116482 | 6.2
Gavin Cawley |  | 0.110463 | 0.11244 | 0.924643 | 0.002462 | 0.003366 | 0.115187 | 7.6
Radford Neal |  | 0.12008 | 0.111803 | 0.930368 | 0.00246 | 0.012167 | 0.123495 | 7.8
Corinne Dahinden |  | 0.1087 | 0.115813 | 0.884187 | 0.00248 | 0.010635 | 0.126363 | 7.8
Wei Chu |  | 0.106902 | 0.115307 | 0.536672 | 0.002504 | 0.008405 | 0.123346 | 8.2
Nicolai Meinshausen |  | 0.107605 | 0.116665 | 0.883335 | 0.002491 | 0.010973 | 0.127356 | 8.6
Marc Boulle |  | 0.1306 | 0.130675 | 0.92424 | 0.00267 | 0.009634 | 0.139864 | 10.4
Kari Torkkola & Eugene Tuv |  | 0.09904 | 0.119135 | 0.880865 | 0.002524 | 0.020761 | 0.139833 | 14
Olivier Chapelle |  | 0.113162 | 0.126242 | 0.920822 | 0.00267 | 0.015373 | 0.141382 | 15.8
J. Wichard |  | 0.12772 | 0.125181 | 0.901349 | 0.002681 | 0.017939 | 0.142989 | 16.6
Advanced Analytical Methods, INTEL |  | 0.1123 | 0.136637 | 0.90704 | 0.002654 | 0.027929 | 0.164384 | 16.6
Vladimir Nikulin |  | 0.1044 | 0.133446 | 0.866661 | 0.002639 | 0.029046 | 0.16239 | 16.8
Edward Harrington |  | 0.129347 | 0.139374 | 0.899119 | 0.002622 | 0.010027 | 0.149237 | 16.8
Kai Chi |  | 0.1129 | 0.131494 | 0.868506 | 0.002733 | 0.024463 | 0.155948 | 17.6
Yu-Yen Ou |  | 0.0966 | 0.127346 | 0.872654 | 0.002656 | 0.033639 | 0.16098 | 17.8
Juha Reunanen |  | 0.142651 | 0.146752 | 0.913843 | 0.002954 | 0.008484 | 0.154454 | 19
Yen-Jen Oyang | RBF + ICA ( 3: 1 ) 86 + v | 0.1125 | 0.132369 | 0.867631 | 0.002777 | 0.021092 | 0.153333 | 19
Patrick Haluptzok | NN Vanilla | 0.1396 | 0.139603 | 0.860397 | 0.00276 | 0.024002 | 0.163415 | 19.2
Tobias Glasmachers | KTA+CV+SVM (3) | 0.135685 | 0.148765 | 0.890652 | 0.002722 | 0.015699 | 0.164224 | 19.6
Darby Tien-Hao Chang | PCA+ME+SVM+valid | 0.1425 | 0.150324 | 0.849676 | 0.002537 | 0.018775 | 0.168944 | 21.4
gavin growup | chi+ica+com | 0.1061 | 0.134534 | 0.865466 | 0.002782 | 0.028434 | 0.162963 | 21.6
WHY | chi + svm | 0.1029 | 0.138013 | 0.861987 | 0.002651 | 0.040983 | 0.178989 | 22
Seyna | Fscore+Chi+SVM | 0.10198 | 0.134534 | 0.865466 | 0.002782 | 0.032554 | 0.167083 | 22
Chunghoon Kim | 2D-CLDA-Quad (2) | 0.1481 | 0.153428 | 0.885056 | 0.002874 | 0.022441 | 0.175736 | 22.6
Machete | 16-LSVC+adaboost. | 0.120622 | 0.170669 | 0.881446 | 0.002887 | 0.050047 | 0.220686 | 25.2
decoste | submit_test | 0.137154 | 0.175312 | 0.901088 | 0.002868 | 0.038639 | 0.213911 | 25.6
Myoung Soo Park | Scaling + CA-PCA + ELN | 0.1338 | 0.199105 | 0.800895 | 0.00291 | 0.065305 | 0.26441 | 26.4
w_pietrus | CWFS + DT | 0.25776 | 0.193905 | 0.806095 | 0.002728 | 0.279849 | 0.473628 | 27
Ranking by dataset
Two of the per-dataset winners do not appear among the top three participants of the overall ranking.
Table 3: Best entrants by dataset.
Dataset | Entrant | Method | Test BER | Test AUC | Test Guess error | Test Score
ADA | Marc Boulle |  | 0.172266 | 0.91491 | 0.007266 | 0.179288
GINA | Kari Torkkola & Eugene Tuv |  | 0.028833 | 0.971167 | 0.007266 | 0.030216
HIVA | Gavin Cawley |  | 0.275695 | 0.7671 | 0.001667 | 0.279689
NOVA | Gavin Cawley |  | 0.04449 | 0.991362 | 0.0065 | 0.044833
SYLVA | Marc Boulle |  | 0.00614 | 0.999119 | 0.000873 | 0.00618
Result analysis
Dataset profiles
In Figure 1, we show the distribution of test BER for the final entrants. HIVA
(drug discovery) seems to be the most difficult dataset: the average BER
and the spread are high. ADA (marketing) is the second hardest; its
distribution is very skewed and has a heavy tail, indicating that a small
group of methods "solved" the problem in a way that was not obvious to the others.
NOVA (text classification) and GINA (digit recognition) come next. Both
datasets have classes containing multiple clusters. Hence, the problems
are highly non-linear. This may explain the very long tails. Finally, SYLVA
(ecology) is the easiest dataset, due to the large amount of training data.
In this challenge, many classifiers did well. There is a variety of methods among the top-ranking entries:
- Ensembles of decision trees (Roman Lutz, Corinne Dahinden, Intel)
- Kernel methods/SVMs (Gavin Cawley, Wei Chu, Kari Torkkola and Eugene
Tuv, Olivier Chapelle, Vladimir Nikulin, Yu-Yen Ou)
- Bayesian Neural Networks (Radford Neal)
- Ensembles of linear methods (Nicolai Meinshausen)
- Naive Bayes (Marc Boulle)
Others used mixed models. It is interesting to note that single methods
did better than mixed models.
In Figure 3, we show the performances of the final entrants, with symbols
coding for the methods used:
- X: Mixed or unknown method
- TREE: Ensembles of trees (like Random Forests, RF)
- NN/BNN: Neural networks or Bayesian neural networks
- NB: Naive Bayes
- LD/SVM/KLS/GP: Methods linear in their parameters, including kernel methods and linear discriminant (LD), Support Vector Machines (SVM), Kernel Least-Squares (KLS) and LS-SVM, and Gaussian Processes (GP).
This chart reveals that Naive Bayes did very well on two datasets (ADA and SYLVA) but poorly on the others, so its overall ranking is not good. Similarly, kernel methods did very well on most datasets (they rank first on HIVA, GINA, and NOVA) but poorly on ADA; this failure on ADA makes them rank only fifth in the overall ranking. They are the most frequently used type of method, but their performance shows a lot of variance, so they do not perform equally well in all hands. In contrast, ensembles of decision trees are not as popular, yet they performed consistently well on all the datasets of this challenge, even though they were never the top entry for any single dataset. The challenge winner used an ensemble of decision trees. Similarly, Bayesian neural networks did well on all datasets, even though they were not best on any; they end up ranking third in the overall scoring.
This analysis also tells us something about the datasets:
- SYLVA has a large number of training examples, so all methods essentially
do well, even the simple Naive Bayes.
- We know by design that GINA and NOVA are very non-linear. The top-ranking
participants used highly non-linear methods and Naive Bayes failed.
HIVA also seems to be in this category, which is not too surprising: chemical
molecules with very different structures can be active or inactive drugs.
- ADA has a particularity that makes kernel methods fail and that
should be further investigated. We conjecture that this could be because
its variables are mixed categorical/binary/continuous.
Preprocessing and feature extraction
Further examination of the fact sheets (see Tables 2 and 3) reveals that
the preprocessing methods used are very elementary:
- Tree classifiers most often use no preprocessing at all (Roman Lutz,
Corinne Dahinden, Intel)
- The most common preprocessing is feature standardization (subtract
the mean and divide by the standard deviation for each feature).
- The naive Bayes method (Marc Boullé) discretizes continuous variables and
constructs extra features as sums of elementary features to alleviate the
independence assumptions.
- A few entries used PCA or ICA to extract features.
- Most entries used no feature selection at all: they relied on regularization
to avoid overfitting. Some entries used embedded feature selection
(ARD priors for the Bayesian neural network -- Radford Neal, feature scaling
for SVMs -- Olivier Chapelle); a few used univariate filters; the naive Bayes
entry (Marc Boullé) used wrapper feature selection; the entry of
Torkkola and Tuv used the features generated by an ensemble of tree classifiers
and computed a threshold of significance using "random probes" (see the sketch after this list).
- Some entries resampled the training data to balance the two classes.
There does not seem to be a correlation between the type of preprocessing
used and how well people did in the challenge.
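Below is a minimal sketch of the "random probe" significance-threshold idea, assuming a random-forest importance measure as a stand-in for an ensemble of tree classifiers; the function name probe_feature_selection, the 95th-percentile threshold, and the parameter values are illustrative assumptions, not the exact procedure of Torkkola and Tuv.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def probe_feature_selection(X, y, percentile=95, n_trees=300, seed=0):
    """Keep features whose importance exceeds that of random "probe" features."""
    rng = np.random.default_rng(seed)
    # Probes: independently shuffled copies of each real column (null features).
    probes = np.column_stack([rng.permutation(col) for col in X.T])
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(np.hstack([X, probes]), y)
    n = X.shape[1]
    real_imp, probe_imp = rf.feature_importances_[:n], rf.feature_importances_[n:]
    threshold = np.percentile(probe_imp, percentile)
    return np.flatnonzero(real_imp > threshold)  # indices of selected features
```

The probes act as a null distribution of importances: any real feature that does not clearly beat them is treated as noise.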
Model selection and performance prediction
Even though it is interesting to see how well the entries performed as a function
of the classification technique used, the most interesting analysis concerns
the methods used to predict the BER and to select models,
since this was the theme of the challenge. These methods are described in
the fact sheets linked from Tables 2 and 3. In the analysis below, the guess
errors quoted do not necessarily correspond to the entries of Table 2, which
are best with respect to the ranking score, not with respect to guess error.
For the full tables, see http://www.modelselect.inf.ethz.ch/results.php.
Nested cross-validation loops
The entrant who obtained the best average guess error (Gavin Cawley,
average guess error: 0.0034) used a rather sophisticated cross-validation
scheme. He adjusted hyperparameters with "virtual leave-one-out" cross-validation
(VLOO): for regularized least-squares kernel classifiers (LS-SVMs, kernel
ridge regression, RLSC, Gaussian processes), it is possible to compute the
leave-one-out error without training several classifiers (each time leaving
one example out); only one training on the entire training set and some
inexpensive additional calculations are required. Gavin Cawley explored various
loss functions for the VLOO method for LS-SVMs. He selected the best loss
function by cross-validation (drawing 100 random 90% training / 10% validation
splits; we call this 100CV). After selecting his model, he re-estimated the
performance by 100CV using fresh data splits.
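For illustration, here is a minimal sketch of virtual leave-one-out for a kernel ridge (LS-SVM-style) classifier without a bias term, where the exact LOO residual is (y_i - f_i) / (1 - H_ii) with H the "hat" matrix; the names rbf_kernel and vloo_ber, the RBF kernel, and the hyperparameter defaults are assumptions for illustration, not Gavin Cawley's exact procedure.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of X and Z.
    d2 = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * d2)

def vloo_ber(X, y, gamma=1.0, lam=1.0):
    """Virtual leave-one-out BER for kernel ridge classification, labels in {-1, +1}.

    Requires a single training on the full set: the exact LOO residuals follow
    from the closed form (y_i - f_i) / (1 - H_ii), so no retraining per example.
    """
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    H = K @ np.linalg.inv(K + lam * np.eye(n))   # "hat" matrix: f = H y
    f = H @ y
    loo_pred = y - (y - f) / (1.0 - np.diag(H))  # leave-one-out predictions
    err = np.sign(loo_pred) != y
    # Balanced error rate: average of the per-class error rates.
    return 0.5 * (err[y == 1].mean() + err[y == -1].mean())
```

Hyperparameters (gamma, lam) can then be chosen to minimize vloo_ber, in the spirit of the inner loop described above.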
The entrant who obtained the second best average guess error (Juha Reunanen,
average guess error: 0.0048) performed a similar type of cross-validation
called "cross-indexing", which uses nested cross-validation loops. The BER
guess was obtained from the cross-validation performance of the outer loop.
Nested cross-validation loops can be computationally expensive; however, when
VLOO is used in the inner loop, the extra cost is minimal. We note, however,
that VLOO is not available for all methods: the exact leave-one-out error can
be computed with VLOO for regularized least-squares kernel methods, while
approximate formulas are available for SVMs (as used by Olivier Chapelle,
average guess error: 0.0137) and for neural networks.
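A generic nested cross-validation sketch in the same spirit (not Reunanen's cross-indexing procedure): the inner loop picks a hyperparameter by its CV BER, and the outer loop's held-out BER serves as the guess. The SVC learner, the grid, the fold counts, and the assumption of labels in {-1, +1} are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def ber(y_true, y_pred, classes=(-1, 1)):
    # Balanced error rate: average of the per-class error rates.
    return float(np.mean([(y_pred[y_true == c] != c).mean() for c in classes]))

def nested_cv_ber_guess(X, y, grid=(0.1, 1.0, 10.0), n_outer=5, n_inner=5, seed=0):
    """Inner loop: pick C with the smallest CV BER; outer loop: guess the test BER."""
    outer = StratifiedKFold(n_outer, shuffle=True, random_state=seed)
    outer_bers = []
    for tr, te in outer.split(X, y):
        inner = StratifiedKFold(n_inner, shuffle=True, random_state=seed)
        cv_bers = []
        for C in grid:
            folds = [ber(y[tr][va], SVC(C=C).fit(X[tr][it], y[tr][it]).predict(X[tr][va]))
                     for it, va in inner.split(X[tr], y[tr])]
            cv_bers.append(np.mean(folds))
        best_C = grid[int(np.argmin(cv_bers))]
        clf = SVC(C=best_C).fit(X[tr], y[tr])
        outer_bers.append(ber(y[te], clf.predict(X[te])))
    return float(np.mean(outer_bers))   # BER guess from the outer loop
```

Because the model chosen in each inner loop is evaluated on outer-loop data it never saw, the outer-loop BER is a nearly unbiased guess of the test BER.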
Plain cross-validation
Many participants used plain cross-validation, with a preference for
10-fold cross-validation. They chose their hyperparameters on the basis
of the smallest cross-validated BER. The same cross-validated BER (CV BER)
was used to guess the test BER, hoping that the bias introduced by using
only 90% of the training data would be compensated by the bias introduced
by selecting the model having the smallest CV BER. This strategy seems to have
been reasonable, since the challenge winner (Roman Lutz, average guess error:
0.0059) used it. He won because of his good BER performance; the second-best
entrant (Gavin Cawley) had better BER guesses (average guess error: 0.0034).
A few entrants took care to balance the fractions of positive and negative
examples in the data splits to reflect the proportions found in the entire
training set (stratified cross-validation).
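A minimal sketch of this strategy, assuming labels in {-1, +1} and a tree-ensemble learner as a stand-in for the winner's classifier (the name cv_ber_guess and the GradientBoostingClassifier choice are illustrative, not Roman Lutz's exact setup): the stratified 10-fold CV BER is used both to pick hyperparameters and as the guess of the test BER.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier

def ber(y_true, y_pred, classes=(-1, 1)):
    # Balanced error rate: average of the per-class error rates.
    return float(np.mean([(y_pred[y_true == c] != c).mean() for c in classes]))

def cv_ber_guess(clf, X, y, n_splits=10, seed=0):
    """Stratified 10-fold cross-validated BER, reported as the test-BER guess."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr, va in skf.split(X, y):
        clf.fit(X[tr], y[tr])
        scores.append(ber(y[va], clf.predict(X[va])))
    return float(np.mean(scores))

# Pick the hyperparameter with the smallest CV BER, then report that CV BER as the guess:
# guess = min(cv_ber_guess(GradientBoostingClassifier(n_estimators=n), X, y)
#             for n in (100, 300, 1000))
```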
Other methods
A few entrants performed model selection or estimated the performance
on future data using training data, without reserving validation data or
performing cross-validation, which is possible for methods that are not
prone to overfitting. This was successful
as a model selection strategy for the naïve Bayes classifier (Marc
Boullé). Radford Neal predicted the error rate of his Bayesian neural network
using training data only, but this was not one of the best-performing guessing
methods (average guess error: 0.0122). Other methods include
the use of performance bounds for SVM model selection, such as the radius-margin
bound (Olivier Chapelle).
Bagging methods (including Random Forests) use bootstrap resampling.
The final model makes decisions according to a vote of the models trained
on the various bootstrap samples. The error rate of the model can be estimated
using the "out-of-bag" samples. This method was used by Nicolai Meinshausen
(average guess error: 0.0098).
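A minimal sketch of an out-of-bag BER guess, assuming scikit-learn's RandomForestClassifier as a stand-in bagging ensemble (the name oob_ber_guess and the parameter values are illustrative, not Meinshausen's exact setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_ber_guess(X, y, n_trees=500, seed=0):
    """Guess the test BER from the out-of-bag votes of a bagged tree ensemble."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X, y)
    # Each example is predicted only by the trees whose bootstrap sample did
    # not contain it; with enough trees every example receives such votes.
    oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
    err = oob_pred != y
    # Balanced error rate over the classes.
    return float(np.mean([err[y == c].mean() for c in rf.classes_]))
```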
The least reliable method was to use the validation set of the challenge
to predict the test set performance. The best ranking participants did not
use this method.
Scoring method
The entrants were judged both on their prediction performance (small
test BER) and on their capability to predict this performance accurately (small
absolute difference between the guessed BER and the test BER, that is, a small
"guess error" deltaBER). The guess error is really what matters most to
us in this challenge. However, it is not meaningful to rank entries with
bad BER performance by their guess error since, in the limit, randomly
guessing the labels according to the correct class proportions yields an
easy-to-predict BER of 50%.
Our ranking score is basically E = testBER + deltaBER, where
deltaBER = abs(guessedBER - testBER).
We penalize guess errors symmetrically because, from the point of view
of model selection, it is equally bad to select a model performing poorly
and to reject a good model. As can be seen in Figure 4, the guess
error is by no means symmetric: people most often make overly optimistic
predictions, so the penalty for pessimistic predictions is rarely applied.
In fact, the score that we use is E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)).
The weighting factor (1 - exp(-deltaBER/sigma)) takes into account our uncertainty
about what the real BER is: our estimate of the real BER from the test data
has an error bar sigma (see the FAQ for details). The behavior of this corrected
score is illustrated in Figure 5. One may wonder how much influence this
correcting factor has. The answer is: not very much. As can be seen in Figure 6,
deltaBER/sigma is usually larger than one, which shows that our test set is
large enough to measure how good the guesses are.
Figure 7: Monitoring gamma in score = testBER + deltaBER * (1 - exp(-gamma * deltaBER / sigma)).
The curves show results on the GINA dataset.
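For concreteness, here is the corrected score as a small function. It is a direct transcription of the formula above; the gamma parameter corresponds to the variant monitored in Figure 7 and defaults to 1, which recovers the score used in the challenge.

```python
import numpy as np

def challenge_score(test_ber, guessed_ber, sigma, gamma=1.0):
    """E = testBER + deltaBER * (1 - exp(-gamma * deltaBER / sigma)).

    When the guess error deltaBER is large compared with the error bar sigma,
    the correction factor approaches 1 and the score reduces to
    testBER + deltaBER; when the guess lies within the error bar, the
    penalty is damped.
    """
    delta = abs(guessed_ber - test_ber)
    return test_ber + delta * (1.0 - np.exp(-gamma * delta / sigma))
```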
Conclusion
Rather unsophisticated methods (e.g., simple 10-fold cross-validation)
have done well at predicting performance. Nested cross-validation loops are
advisable to gain extra prediction accuracy; they come at little extra computational
expense if the inner loop uses virtual leave-one-out. Other methods do
not seem to perform as well. A number of sophisticated methods with
appealing theoretical motivations were proposed for the workshop, but their
authors did not compete in the challenge. We believe there is still a gap
to be filled between theory and practice in this game of performance prediction
and model selection.
Disclaimer
This project is supported by the National Science Foundation under Grant
No. ECS-0424142. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily
reflect the views of the National Science Foundation.
Last updated: April 12, 2006.