Results of the Performance Prediction Challenge


*** March 1st, 2006: the challenge is over!  ***
*** Report on datasets available  ***
*** Slides available  ***


Participation:
The challenge in performance prediction started Friday September 30, 2005, and ended Wednesday March 1, 2006 (duration: 21 weeks). We estimated that 145 entrants participated. We received 4228 "development entries" (entries not counting towards the final ranking). A total of 28 participants competed for the final ranking by providing valid challenge entries (results on training, validation, and test sets for all five tasks of the challenge). We received 117 submissions qualifying for the final ranking (a maximum of 5 entries per participant was allowed). Participation doubled compared to the previous challenge we organized on feature selection, both in number of participants and in entry volume.

Goal of the challenge
The goal was to provide the best possible predictive models for the five tasks of the challenge AND predict how these models would perform on test data (see the FAQ for details on the scoring method).

Datasets
Participants had to compete on five 2-class classification problems:

Table 1: Datasets of the challenge
 
Data set name | Domain | Size (MB) | Data matrix type | Num. ex. (tr/val/te) | Num. feat.
ADA | Marketing | 0.6 | Non sparse | 4147 / 415 / 41471 | 48
GINA | Digit recognition | 19.4 | Non sparse | 3153 / 315 / 31532 | 970
HIVA | Drug discovery | 7.6 | Non sparse | 3845 / 384 / 38449 | 1617
NOVA | Text classification | 2.3 | Sparse binary | 1754 / 175 / 17537 | 16969
SYLVA | Ecology | 15.6 | Non sparse | 13086 / 1308 / 130858 | 216
 
Each dataset is split into a training set, a validation set (or development test set), and a final test set. All datasets were available from the start of the challenge, but only the training labels were provided. One month before the end of the challenge, the development period ended and the target values of the validation set were revealed (see FAQ for details). The identity of the datasets was not revealed during the challenge. A report is now available explaining what the data are and how they were preprocessed.
 
Overall ranking

For each dataset, the participants were ranked by a score combining the test set balanced error rate (Test BER) and the error they made in guessing that performance (Test Guess Error). See the FAQ for details. The following table shows the best entry of each participant, with results averaged over all five test sets:
- The winner by average rank is Roman Lutz.
- Gavin Cawley has the best average score and the best guessed BER.
- Radford Neal has the best AUC.

The full result tables are found on the web site of the challenge.

Table 2: Overall ranking. The table includes the best ranking entry of each finalist. We show results averaged over the five datasets. One can access a fact sheet by clicking on the link of the method.

Entrant | Method | BER Guess | Test BER | Test AUC | Test Sigma | Test Guess Error | Test Score | Average Rank
Roman Lutz | LB tree mix cut adapted | 0.10398 | 0.108984 | 0.891016 | 0.002416 | 0.007911 | 0.116482 | 6.2
Gavin Cawley | Final #2 | 0.110463 | 0.11244 | 0.924643 | 0.002462 | 0.003366 | 0.115187 | 7.6
Radford Neal | Bayesian Neural Networks | 0.12008 | 0.111803 | 0.930368 | 0.00246 | 0.012167 | 0.123495 | 7.8
Corinne Dahinden | RF | 0.1087 | 0.115813 | 0.884187 | 0.00248 | 0.010635 | 0.126363 | 7.8
Wei Chu | SVM/GPC | 0.106902 | 0.115307 | 0.536672 | 0.002504 | 0.008405 | 0.123346 | 8.2
Nicolai Meinshausen | ROMA | 0.107605 | 0.116665 | 0.883335 | 0.002491 | 0.010973 | 0.127356 | 8.6
Marc Boulle | SNB(CMA) + 10k F(3D) tv | 0.1306 | 0.130675 | 0.92424 | 0.00267 | 0.009634 | 0.139864 | 10.4
Kari Torkkola & Eugene Tuv | ACE+RLSC | 0.09904 | 0.119135 | 0.880865 | 0.002524 | 0.020761 | 0.139833 | 14
Olivier Chapelle | SVM-LOO | 0.113162 | 0.126242 | 0.920822 | 0.00267 | 0.015373 | 0.141382 | 15.8
J. Wichard | submission 13 | 0.12772 | 0.125181 | 0.901349 | 0.002681 | 0.017939 | 0.142989 | 16.6
Advanced Analytical Methods, INTEL | IDEAL | 0.1123 | 0.136637 | 0.90704 | 0.002654 | 0.027929 | 0.164384 | 16.6
Vladimir Nikulin | GbO+MVf+SVM2 | 0.1044 | 0.133446 | 0.866661 | 0.002639 | 0.029046 | 0.16239 | 16.8
Edward Harrington | ba4 | 0.129347 | 0.139374 | 0.899119 | 0.002622 | 0.010027 | 0.149237 | 16.8
Kai | Chi | 0.1129 | 0.131494 | 0.868506 | 0.002733 | 0.024463 | 0.155948 | 17.6
Yu-Yen Ou | svm+ica | 0.0966 | 0.127346 | 0.872654 | 0.002656 | 0.033639 | 0.16098 | 17.8
Juha Reunanen | CLOP-models-only-5 | 0.142651 | 0.146752 | 0.913843 | 0.002954 | 0.008484 | 0.154454 | 19
Yen-Jen Oyang | RBF + ICA ( 3: 1 ) 86 + v | 0.1125 | 0.132369 | 0.867631 | 0.002777 | 0.021092 | 0.153333 | 19
Patrick Haluptzok | NN Vanilla | 0.1396 | 0.139603 | 0.860397 | 0.00276 | 0.024002 | 0.163415 | 19.2
Tobias Glasmachers | KTA+CV+SVM (3) | 0.135685 | 0.148765 | 0.890652 | 0.002722 | 0.015699 | 0.164224 | 19.6
Darby Tien-Hao Chang | PCA+ME+SVM+valid | 0.1425 | 0.150324 | 0.849676 | 0.002537 | 0.018775 | 0.168944 | 21.4
gavin growup | chi+ica+com | 0.1061 | 0.134534 | 0.865466 | 0.002782 | 0.028434 | 0.162963 | 21.6
WHY | chi + svm | 0.1029 | 0.138013 | 0.861987 | 0.002651 | 0.040983 | 0.178989 | 22
Seyna | Fscore+Chi+SVM | 0.10198 | 0.134534 | 0.865466 | 0.002782 | 0.032554 | 0.167083 | 22
Chunghoon Kim | 2D-CLDA-Quad (2) | 0.1481 | 0.153428 | 0.885056 | 0.002874 | 0.022441 | 0.175736 | 22.6
Machete | 16-LSVC+adaboost. | 0.120622 | 0.170669 | 0.881446 | 0.002887 | 0.050047 | 0.220686 | 25.2
decoste | submit_test | 0.137154 | 0.175312 | 0.901088 | 0.002868 | 0.038639 | 0.213911 | 25.6
Myoung Soo Park | Scaling + CA-PCA + ELN | 0.1338 | 0.199105 | 0.800895 | 0.00291 | 0.065305 | 0.26441 | 26.4
w_pietrus | CWFS + DT | 0.25776 | 0.193905 | 0.806095 | 0.002728 | 0.279849 | 0.473628 | 27

Ranking by dataset

Two of the per-dataset winners do not appear among the top three participants of the overall ranking.

Table 3: Best entrants by dataset. 

Dataset | Entrant | Method | Test BER | Test AUC | Test Guess Error | Test Score
ADA | Marc Boulle | SNB(CMA)+10k F(2D) tv | 0.172266 | 0.91491 | 0.007266 | 0.179288
GINA | Kari Torkkola & Eugene Tuv | ACE+RLSC | 0.028833 | 0.971167 | 0.007266 | 0.030216
HIVA | Gavin Cawley | Final #3 (corrected) | 0.275695 | 0.7671 | 0.001667 | 0.279689
NOVA | Gavin Cawley | Final #1 | 0.04449 | 0.991362 | 0.0065 | 0.044833
SYLVA | Marc Boulle | SNB(CMA) + 10k F(3D) tv | 0.00614 | 0.999119 | 0.000873 | 0.00618


Result analysis

Dataset profiles

In Figure 1, we show the distribution of test BER for the final entrants. HIVA (drug discovery) seems to be the most difficult dataset: both the average BER and the spread are high. ADA (marketing) is the second hardest. Its distribution is very skewed, with a heavy tail, indicating that a small group of methods "solved" the problem in a way that was not obvious to others. NOVA (text classification) and GINA (digit recognition) come next. Both datasets have classes containing multiple clusters, so the problems are highly non-linear; this may explain the very long tails. Finally, SYLVA (ecology) is the easiest dataset, owing to the large amount of training data.

Figure 1: Histogram of the BER test values for the 28 finalists.

Performance over time
In Figure 2, we show how the performance improved over time. After an initial period of fast progress, the performance on the test set did not improve much, while the performance on the validation set continued improving. This is symptomatic of overfitting. One should note, however, that 30 days before the end of the challenge the validation set labels were made available, which explains to some extent the drop in validation set BER during the last 30 days.

Figure 2: Performance improvement over time. We show the performance of the best ranking method as a function of time. Solid line: test set. Dashed line: validation set.

Classification methods

In this challenge, many classifiers did well. There is a variety of methods in the top ranking entries:

- Ensembles of decision trees (Roman Lutz, Corinne Dahinden, Intel)
- Kernel methods/SVMs (Gavin Cawley, Wei Chu, Kari Torkkola and Eugene Tuv, Olivier Chapelle, Vladimir Nikulin, Yu-Yen Ou)
- Bayesian Neural Networks (Radford Neal)
- Ensembles of linear methods (Nicolai Meinshausen)
- Naive Bayes (Marc Boulle)

Others used mixed models. It is interesting to note that single methods did better than mixed models.

In Figure 3, we show the performances of the final entrants, with symbols coding for the methods used:
X: Mixed or unknown method
TREE: Ensembles of trees (like Random Forests, RF)
NN/BNN: Neural networks or Bayesian neural networks
NB: Naive Bayes
LD/SVM/KLS/GP: Methods linear in their parameters, including kernel methods and linear discriminant (LD), Support Vector Machines (SVM), Kernel Least-Squares (KLS) and LS-SVM, Gaussian Processes (GP).

Figure 3: Performance of the 28 finalists (117 entries for each dataset). The arrows indicate the best entries.

This chart reveals that Naive Bayes did very well on two datasets (ADA and SYLVA) but poorly on the others, so its overall ranking is not good. Similarly, kernel methods did very well on most datasets (they rank first for HIVA, GINA, and NOVA), but they did poorly on ADA. This failure on ADA makes them rank only fifth in the overall ranking. They are the most frequently used type of method, but their performance shows a lot of variance, so they do not perform well in all hands. On the contrary, ensembles of decision trees are not as popular, but they performed consistently well on all datasets in this challenge, even though they are never the top entry for any of them. The challenge winner used an ensemble of decision trees. Similarly, Bayesian neural networks did well on all datasets, even though they were not best for any, and end up ranking third in the overall scoring.

This analysis also tells us something about the datasets:
- SYLVA has a large number of training examples, so essentially all methods do well, even simple Naive Bayes.
- We know by design that GINA and NOVA are very non-linear. The top ranking participants used highly non-linear methods, and Naive Bayes failed. HIVA seems to be in this category as well, which is not too surprising: chemical molecules with very different structures can be active or inactive drugs.
- ADA has a particularity that makes kernel methods fail, which should be further investigated. We conjecture that it could be because the variables are mixed categorical/binary/continuous.

Preprocessing and feature extraction

Further examination of the fact sheets (see Tables 2 and 3) reveals that the preprocessing methods used are very elementary:
- Tree classifiers most often use no preprocessing at all (Roman Lutz, Corinne Dahinden, Intel)
- The most common preprocessing is feature standardization (subtract the mean and divide by the standard deviation for each feature).
- The naive Bayes method (Marc Boullé) uses discretization of continuous variables and constructs extra features as sums of elementary features to alleviate the independence assumptions.
- A few entries used PCA or ICA to extract features.
- Most entries used no feature selection at all: they relied on regularization to avoid overfitting. Some entries used embedded feature selection (ARD priors for the Bayesian neural network -- Radford Neal, feature scaling for SVMs -- Olivier Chapelle); a few used univariate filters; the naive Bayes (Marc Boullé) used wrapper feature selection; the entry of Torkkola and Tuv used the features generated by an ensemble of tree classifiers and computed a threshold of significance using "random probes".
- Some entries resampled the training data to balance the two classes.
There does not seem to be a correlation between the type of preprocessing used and how well people did in the challenge.
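As an illustration of the most common preprocessing mentioned above, here is a minimal Python sketch (assuming NumPy; not any particular entrant's code) of feature standardization, where the statistics are estimated on the training set only and then reused for the validation and test sets:

import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    # Estimate per-feature mean and standard deviation on training data only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps guards against constant features
    return mean, std

def standardize(X, mean, std):
    # Subtract the mean and divide by the standard deviation for each feature
    return (X - mean) / std

# Hypothetical usage: the training statistics are reused for validation/test data
X_train = np.random.randn(100, 48)   # placeholder data
X_test = np.random.randn(50, 48)
mean, std = fit_standardizer(X_train)
X_train_s = standardize(X_train, mean, std)
X_test_s = standardize(X_test, mean, std)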

Model selection and performance prediction

Even though it is interesting to see how well methods performed as a function of the classification technique, the most interesting thing is to analyze the methods used to predict the BER and to select models, since this was the theme of the challenge. These methods are described in the fact sheets linked from Tables 2 and 3. In the analysis below, the guess errors quoted do not necessarily correspond to the entries of Table 2, which are best with respect to the ranking score, not with respect to guess error. For the full tables, see http://www.modelselect.inf.ethz.ch/results.php.

Nested cross-validation loops
The entrant who obtained the best average guess error (Gavin Cawley, average guess error: 0.0034) used a rather sophisticated cross-validation scheme. He adjusted hyperparameters with a "virtual leave-one-out" cross-validation (VLOO): for regularized least-squares kernel classifiers (LS-SVMs, kernel ridge regression, RLSC, Gaussian processes), the leave-one-out error can be computed without training several classifiers (each time leaving one example out). Only one training on the entire training set and some inexpensive additional calculations are required. Gavin Cawley explored various loss functions for the VLOO method for LS-SVMs. He selected the best loss function by cross-validation (drawing 100 random 90% training / 10% validation splits; we call this 100CV). After selecting his model, he re-estimated performance by 100CV using fresh data splits.
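To make the idea concrete, here is a minimal sketch of virtual leave-one-out for a kernel ridge regression classifier without a bias term (a simplification of the LS-SVM setting, under our own assumptions, not Gavin Cawley's code). The exact leave-one-out residual of example i is the training residual divided by (1 - H_ii), where H = K (K + lambda*I)^-1 is the "hat" matrix obtained from a single fit on the full training set:

import numpy as np

def rbf_kernel(X, Z, gamma=0.1):
    # Gaussian (RBF) kernel matrix between the rows of X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def virtual_loo_ber(K, y, lam):
    # Exact leave-one-out predictions from a single fit, via the hat matrix.
    # y is a vector of +/-1 labels treated as regression targets.
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))        # hat matrix
    y_hat = H @ y                                     # fitted values
    loo_pred = y - (y - y_hat) / (1.0 - np.diag(H))   # leave-one-out predictions
    pred_labels = np.sign(loo_pred)
    # Balanced error rate of the leave-one-out predictions
    errs = [np.mean(pred_labels[y == c] != c) for c in (-1.0, 1.0)]
    return float(np.mean(errs))

# Hypothetical usage: pick the ridge parameter by minimizing the VLOO BER
X = np.random.randn(200, 20)
y = np.sign(np.random.randn(200))
K = rbf_kernel(X, X)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: virtual_loo_ber(K, y, lam))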

The entrant who obtained the second best average guess error (Juha Reunanen, average guess error: 0.0048) performed a similar type of cross-validation called "cross-indexing", which uses nested cross-validation loops. The BER guess was obtained by the cross-validation performance in the outer loop.
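A generic nested cross-validation sketch (not the entrants' actual code; scikit-learn is used here only for brevity): the inner loop selects hyperparameters, and the outer loop produces the BER guess on folds never used for selection:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import balanced_accuracy_score, make_scorer

X, y = np.random.randn(300, 20), np.random.randint(0, 2, 300)   # placeholder data

# BER = 1 - balanced accuracy; greater_is_better=False makes the scorer negate it
ber_scorer = make_scorer(lambda yt, yp: 1.0 - balanced_accuracy_score(yt, yp),
                         greater_is_better=False)

inner = GridSearchCV(SVC(kernel="rbf"),
                     {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
                     scoring=ber_scorer, cv=5)                    # model selection loop
outer_scores = cross_val_score(inner, X, y, scoring=ber_scorer, cv=10)  # outer loop
ber_guess = -outer_scores.mean()   # undo the sign flip to recover the BER estimate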

Nested cross-validation loops can be computationally expensive; however, when VLOO is used in the inner loop, the extra cost is minimal. We note, however, that VLOO is not available for all methods. The exact leave-one-out error can be computed with VLOO for regularized least-squares kernel methods. Approximate formulas are available for SVMs (as used by Olivier Chapelle, average guess error: 0.0137) and neural networks.

Plain cross-validation
Many participants used plain cross-validation, with a preference for 10-fold cross-validation. They chose their hyperparameters on the basis of the smallest cross-validated BER. The same cross-validated BER (CV BER) was used to guess the test BER, in the hope that the pessimistic bias introduced by training on only 90% of the data would be compensated by the optimistic bias introduced by selecting the model with the smallest CV BER. This strategy seems to have been reasonable, since the challenge winner (Roman Lutz, average guess error: 0.0059) used it. He won because of his good BER performance; the second-best entrant (Gavin Cawley) had better BER guesses (average guess error: 0.0034). A few entrants took care of balancing the fraction of positive and negative examples in the data splits to reflect the proportions found in the entire training set (stratified cross-validation).
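For contrast, a minimal sketch of the plain cross-validation strategy (again using scikit-learn as an illustration, not the winner's actual code): the same stratified 10-fold CV BER serves both to pick the hyperparameter and as the submitted BER guess:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import balanced_accuracy_score

X, y = np.random.randn(300, 20), np.random.randint(0, 2, 300)    # placeholder data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified 10-fold CV

def cv_ber(model):
    # Cross-validated predictions, then the balanced error rate over all folds
    pred = cross_val_predict(model, X, y, cv=cv)
    return 1.0 - balanced_accuracy_score(y, pred)

scores = {C: cv_ber(SVC(C=C)) for C in (0.1, 1.0, 10.0)}   # tiny hyperparameter grid
best_C = min(scores, key=scores.get)
ber_guess = scores[best_C]   # the same number is submitted as the performance guess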

Other methods
A few entrants performed model selection or estimated the performance on future data using the training data only, without reserving validation data or performing cross-validation. This is possible for methods that are not prone to overfitting, and it was successful as a model selection strategy for the naive Bayes classifier (Marc Boullé). Radford Neal also predicted the error rate of his Bayesian neural network using training data only, but this was not one of the best performing methods (average guess error: 0.0122). Other methods include the use of performance bounds for SVM model selection, such as the radius-margin bound (Olivier Chapelle).
Bagging methods (including Random Forests) use bootstrap resampling. The final model makes decisions according to a vote of the models trained on the various bootstrap samples. The error rate of the model can be estimated using the "out-of-bag" samples. This method was used by Nicolai Meinshausen (average guess error: 0.0098).
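A sketch of the out-of-bag estimate (assuming scikit-learn's random forest; an illustration, not Meinshausen's actual ROMA code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

X, y = np.random.randn(500, 30), np.random.randint(0, 2, 500)   # placeholder data
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# Each example is predicted only by the trees that did not see it (out-of-bag),
# which gives a nearly unbiased estimate of the generalization BER.
oob_pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]
oob_ber_guess = 1.0 - balanced_accuracy_score(y, oob_pred)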
The least reliable method was to use the validation set of the challenge to predict the test set performance. The best ranking participants did not use this method.

Scoring method
The entrants were judged both on their prediction performance (small test BER) and on their capability to predict this performance accurately (small absolute difference between the guessed BER and the test BER, that is, small "guess error" deltaBER). The guess error is really what matters most to us in this challenge. However, it is not meaningful to rank entries with bad BER performance by their guess error alone since, in the limit, making a random guess of the label following the right class proportions yields an easy-to-predict BER of 50%.
Our ranking score is basically E = testBER + deltaBER, where deltaBER = abs(guessedBER - testBER).
We penalize guess errors symmetrically because, from the point of view of model selection, it is as bad to select a model performing poorly as to reject a good model. As can be seen in Figure 4, the guess error is by no means symmetric: people most often make overly optimistic predictions, so the penalization for pessimistic predictions rarely applies.
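For reference, a minimal sketch of the two quantities involved (the BER as the average of the per-class error rates, and the guess error deltaBER):

import numpy as np

def balanced_error_rate(y_true, y_pred):
    # Average of the error rates computed separately on each class
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] != c) for c in classes]))

def guess_error(guessed_ber, test_ber):
    # deltaBER: absolute difference between the guessed BER and the test BER
    return abs(guessed_ber - test_ber)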


Figure 4: The guessed BER as a function of the Test BER: the guesses are optimistic.

In fact, the score that we use is E = testBER + deltaBER * (1 - exp(-deltaBER/sigma)).
The weighting factor (1 - exp(-deltaBER/sigma)) takes into account our uncertainty about what the real BER is: our estimate of the real BER from test data has an error bar sigma (see FAQ for details). The behavior of this corrected score is illustrated in Figure 5. One may wonder how much influence this correcting factor has. The answer is: not very much. As can be seen in Figure 6, deltaBER/sigma is usually larger than one, which shows that our test set is large enough to measure how good the guesses are.
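In code, the corrected score is a direct transcription of the formula above (sigma is the per-dataset error bar on the test BER):

import math

def challenge_score(test_ber, guessed_ber, sigma):
    # E = testBER + deltaBER * (1 - exp(-deltaBER / sigma))
    delta = abs(guessed_ber - test_ber)
    return test_ber + delta * (1.0 - math.exp(-delta / sigma))

# Example: a guess 0.02 away from a test BER of 0.12 (sigma = 0.0025) adds nearly
# the full deltaBER to the score: challenge_score(0.12, 0.10, 0.0025) is about 0.1400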

Figure 5: The challenge score as a function of the guessed BER for a given value of the test BER. A sigma of 0.05 is assumed for the figure. The dash-dotted curve is testBER + deltaBER. The red curve is testBER + deltaBER * (1 - exp(-deltaBER/sigma)). The participants are penalized symmetrically for making an error on the test BER: deltaBER = abs(testBER - guessedBER). The factor (1 - exp(-deltaBER/sigma)) diminishes the importance of deltaBER in the score when the error bar sigma on the test BER is large.


Figure 6: The guess error (delta) over the test BER error bar, as a function of the test BER. Delta/Sigma is generally larger than one. The effect of the correction (1-exp(-delta/sigma)) is small.

We varied a factor gamma in the formula E = testBER + deltaBER * (1 - exp(-gamma * deltaBER/sigma)) to see whether gamma = 1 is a good choice (Figure 7). For a very small gamma, we have E = testBER, while for a large gamma, E = testBER + deltaBER. In between, we monitor the influence of deltaBER on the score. We see that the curves cross. Some entrants make better guesses than others and have almost flat curves. Making gamma larger is unnecessary because the curves do not cross beyond gamma = 1 (we verified that this is true for all entries on all datasets). Making gamma smaller is not desirable because the influence of the BER becomes too big. After all, what we want to measure is the accuracy of performance prediction!

Figure 7: Monitoring gamma in score = testBER + deltaBER * (1 - exp(-gamma * deltaBER/sigma)).
The curves show results on the GINA dataset.

Conclusion

Rather unsophisticated methods (e.g. simple 10-fold cross-validation) did well at predicting performance. Nested cross-validation loops are advisable to gain extra prediction accuracy; they come at little extra computational expense if the inner loop uses virtual leave-one-out. Other methods do not seem to perform as well. A number of sophisticated methods with appealing theoretical motivations were proposed for the workshop, but their authors did not compete in the challenge. We believe there is still a gap to be filled between theory and practice in this game of performance prediction and model selection.

Disclaimer
This project is supported by the National Science Foundation under Grant No. ECS-0424142. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


Last updated: April 12, 2006.