WCCI NSF
Results of the Performance Prediction Challenge


*** March 1st, 2006: the challenge is over!  ***
*** Report on datasets available  ***
*** Slides available  ***


Participation:
The performance prediction challenge started Friday September 30, 2005, and ended Wednesday March 1, 2006 (duration: 21 weeks). We estimate that 145 entrants participated. We received 4228 "development entries" (entries not counting towards the final ranking). A total of 28 participants competed for the final ranking by providing valid challenge entries (results on training, validation, and test sets for all five tasks of the challenge). We received 117 submissions qualifying for the final ranking (a maximum of 5 entries per participant was allowed). Participation doubled, in both number of participants and entry volume, compared to the previous challenge we organized on feature selection.

Goal of the challenge
The goal was to provide the best possible predictive models for the five tasks of the challenge AND predict how these models would perform on test data (see the FAQ for details on the scoring method).

Datasets
Participants had to compete on five 2-class classification problems:

Table 1: Datasets of the challenge
 
Data set name  Domain               Size (MB)  Data matrix type  Num. ex. (tr/val/te)    Num. feat.
ADA            Marketing            0.6        Non sparse        4147 / 415 / 41471      48
GINA           Digit recognition    19.4       Non sparse        3153 / 315 / 31532      970
HIVA           Drug discovery       7.6        Non sparse        3845 / 384 / 38449      1617
NOVA           Text classification  2.3        Sparse binary     1754 / 175 / 17537      16969
SYLVA          Ecology              15.6       Non sparse        13086 / 1308 / 130858   216
 
Each dataset is split into a training set, a validation set (or development test set), and a final test set. All datasets (but only the training labels) were available from the start of the challenge. One month before the end of the challenge, the development period ended and the target values of the validation set were revealed (see FAQ for details). The identity of the datasets was not revealed during the challenge. A report is now available explaining what the data are and how they were preprocessed.
 
Overall ranking

For each dataset, the participants were ranked by a score combining the test set balanced error rate (Test BER) and the error they made in guessing their own performance (Test Guess Error). See FAQ for details. The following table shows the best entry of each participant. Averaged over all five test sets:
- The winner by average rank is Roman Lutz.
- Gavin Cawley has the best average score and the best guessed BER.
- Radford Neal has the best AUC.

The full result tables are found on the web site of the challenge.
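
As a rough illustration of the quantities reported in the tables, the sketch below computes a balanced error rate, a guess error, and a combined score. The BER is taken as the average of the error rates on the two classes, and the guess error as the absolute difference between the guessed BER and the test BER. The combining formula used here (test BER plus the guess error damped by a factor 1 - exp(-delta/sigma), with sigma the error bar listed as Test Sigma) is an assumption of this sketch; the FAQ remains the reference for the exact scoring method. The function names and the +1/-1 label coding are illustrative only.

```python
import math


def balanced_error_rate(y_true, y_pred):
    """Average of the error rates on the positive (+1) and negative (-1) class."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    err_pos = sum(p != 1 for p in pos) / len(pos)
    err_neg = sum(p != -1 for p in neg) / len(neg)
    return 0.5 * (err_pos + err_neg)


def guess_error(ber_guess, test_ber):
    """Absolute difference between the guessed BER and the BER measured on test data."""
    return abs(ber_guess - test_ber)


def challenge_score(ber_guess, test_ber, sigma):
    """ASSUMED score form: test BER plus a damped guess-error penalty.

    The penalty is small when the guess error is within the error bar sigma
    and approaches the full guess error when it is much larger than sigma.
    """
    delta = guess_error(ber_guess, test_ber)
    return test_ber + delta * (1.0 - math.exp(-delta / sigma))


# Toy example: one of four positives and one of two negatives are misclassified.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]
ber = balanced_error_rate(y_true, y_pred)  # 0.5 * (1/4 + 1/2) = 0.375
score = challenge_score(ber_guess=0.30, test_ber=ber, sigma=0.05)
```

Under this form, a perfect guess leaves the score equal to the test BER, and an overly optimistic or pessimistic guess can only increase it.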

Table 2: Overall ranking. The table includes the best ranking entry of each finalist. We show results averaged over the five datasets. One can access a fact sheet by clicking on the link of the method.

Entrant                             Method                     BER Guess  Test BER  Test AUC  Test Sigma  Test Guess Error  Test Score  Average Rank
Roman Lutz                          LB tree mix cut adapted    0.10398    0.108984  0.891016  0.002416    0.007911          0.116482    6.2
Gavin Cawley                        Final #2                   0.110463   0.11244   0.924643  0.002462    0.003366          0.115187    7.6
Radford Neal                        Bayesian Neural Networks   0.12008    0.111803  0.930368  0.00246     0.012167          0.123495    7.8
Corinne Dahinden                    RF                         0.1087     0.115813  0.884187  0.00248     0.010635          0.126363    7.8
Wei Chu                             SVM/GPC                    0.106902   0.115307  0.536672  0.002504    0.008405          0.123346    8.2
Nicolai Meinshausen                 ROMA                       0.107605   0.116665  0.883335  0.002491    0.010973          0.127356    8.6
Marc Boulle                         SNB(CMA) + 10k F(3D) tv    0.1306     0.130675  0.92424   0.00267     0.009634          0.139864    10.4
Kari Torkkola & Eugene Tuv          ACE+RLSC                   0.09904    0.119135  0.880865  0.002524    0.020761          0.139833    14
Olivier Chapelle                    SVM-LOO                    0.113162   0.126242  0.920822  0.00267     0.015373          0.141382    15.8
J. Wichard                          submission 13              0.12772    0.125181  0.901349  0.002681    0.017939          0.142989    16.6
Advanced Analytical Methods, INTEL  IDEAL                      0.1123     0.136637  0.90704   0.002654    0.027929          0.164384    16.6
Vladimir Nikulin                    GbO+MVf+SVM2               0.1044     0.133446  0.866661  0.002639    0.029046          0.16239     16.8
Edward Harrington                   ba4                        0.129347   0.139374  0.899119  0.002622    0.010027          0.149237    16.8
Kai                                 Chi                        0.1129     0.131494  0.868506  0.002733    0.024463          0.155948    17.6
Yu-Yen Ou                           svm+ica                    0.0966     0.127346  0.872654  0.002656    0.033639          0.16098     17.8
Juha Reunanen                       CLOP-models-only-5         0.142651   0.146752  0.913843  0.002954    0.008484          0.154454    19
Yen-Jen Oyang                       RBF + ICA ( 3: 1 ) 86 + v  0.1125     0.132369  0.867631  0.002777    0.021092          0.153333    19
Patrick Haluptzok                   NN Vanilla                 0.1396     0.139603  0.860397  0.00276     0.024002          0.163415    19.2
Tobias Glasmachers                  KTA+CV+SVM (3)             0.135685   0.148765  0.890652  0.002722    0.015699          0.164224    19.6
Darby Tien-Hao Chang                PCA+ME+SVM+valid           0.1425     0.150324  0.849676  0.002537    0.018775          0.168944    21.4
gavin growup                        chi+ica+com                0.1061     0.134534  0.865466  0.002782    0.028434          0.162963    21.6
WHY                                 chi + svm                  0.1029     0.138013  0.861987  0.002651    0.040983          0.178989    22
Seyna                               Fscore+Chi+SVM             0.10198    0.134534  0.865466  0.002782    0.032554          0.167083    22
Chunghoon Kim                       2D-CLDA-Quad (2)           0.1481     0.153428  0.885056  0.002874    0.022441          0.175736    22.6
Machete                             16-LSVC+adaboost.          0.120622   0.170669  0.881446  0.002887    0.050047          0.220686    25.2
decoste                             submit_test                0.137154   0.175312  0.901088  0.002868    0.038639          0.213911    25.6
Myoung Soo Park                     Scaling + CA-PCA + ELN     0.1338     0.199105  0.800895  0.00291     0.065305          0.26441     26.4
w_pietrus                           CWFS + DT                  0.25776    0.193905  0.806095  0.002728    0.279849          0.473628    27


Ranking by dataset

Two of the winners by dataset (Marc Boulle, and Kari Torkkola & Eugene Tuv) are not among the top three participants in the overall ranking.
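
How a dataset winner can end up outside the overall top places can be seen in a small sketch of ranking by average rank. It assumes that entrants are ranked within each dataset by their score (lower is better) and that these ranks are then averaged over the datasets; ties are not handled. The entrants A, B, C, the datasets D1 to D3, and all scores below are hypothetical and are not taken from the challenge results.

```python
def average_ranks(scores):
    """Map {dataset: {entrant: score}} to {entrant: mean rank}; rank 1 = lowest score."""
    ranks = {}
    for per_entrant in scores.values():
        # Sort entrants of this dataset by increasing score (best first).
        ordered = sorted(per_entrant, key=per_entrant.get)
        for rank, entrant in enumerate(ordered, start=1):
            ranks.setdefault(entrant, []).append(rank)
    return {entrant: sum(r) / len(r) for entrant, r in ranks.items()}


# Hypothetical scores, for illustration only.
scores = {
    "D1": {"A": 0.10, "B": 0.11, "C": 0.20},
    "D2": {"A": 0.30, "B": 0.31, "C": 0.25},
    "D3": {"A": 0.06, "B": 0.05, "C": 0.09},
}
avg = average_ranks(scores)
overall_winner = min(avg, key=avg.get)
```

In this toy setting, entrant C has the best score on D2 but the worst average rank, while entrant A wins overall by being consistently near the top.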

Table 3: Best entrants by dataset. 

Dataset  Entrant                     Method                   Test BER  Test AUC  Test Guess error  Test Score
ADA      Marc Boulle                 SNB(CMA)+10k F(2D) tv    0.172266  0.91491   0.007266          0.179288
GINA     Kari Torkkola & Eugene Tuv  ACE+RLSC                 0.028833  0.971167  0.007266          0.030216
HIVA     Gavin Cawley                Final #3 (corrected)     0.275695  0.7671    0.001667          0.279689
NOVA     Gavin Cawley                Final #1                 0.04449   0.991362  0.0065            0.044833
SYLVA    Marc Boulle                 SNB(CMA) + 10k F(3D) tv  0.00614   0.999119  0.000873          0.00618


Result analysis

Dataset profiles

Figure 1 shows the distribution of the test BER for the final entrants. HIVA (drug discovery) appears to be the most difficult dataset: both the average BER and the spread are high. ADA (marketing) is the second hardest. Its distribution is very skewed and has a heavy tail, indicating that a small group of methods "solved" the problem in a way that was not obvious to the others. NOVA (text classification) and GINA (digit recognition) come next. Both datasets have classes containing multiple clusters, so the problems are highly non-linear, which may explain the very long tails. Finally, SYLVA (ecology) is the easiest dataset, due to the large amount of training data.

Figure 1: Histogram of the BER test values for the 28 finalists.

Performance over time
Figure 2 shows how the performance improved over time. After an initial period of fast progress, the performance on the test set improved little, while the performance on the validation set continued to improve. This is symptomatic of overfitting. Note, however, that the validation set labels were made available 30 days before the end of the challenge, which explains to some extent the drop in validation set BER during the last 30 days.

Figure 2: Performance improvement over time. We show the performance of the best ranking method as a function of time. Solid line: test set. Dashed line: validation set.

Classification methods

In this challenge, many classifiers did well. There is a variety of methods in the top ranking entries:

- Ensembles of decision trees (Roman Lutz, Corinne Dahinden, Intel)
- Kernel methods/SVMs (Gavin Cawley, Wei Chu, Kari Torkkola and Eugene Tuv, Olivier Chapelle, Vladimir Nikulin, Yu-Yen Ou)
- Bayesian Neural Networks (Radford Neal)
- Ensembles of linear methods (Nicolai Meinshausen)
- Naive Bayes (Marc Boulle)

Others used mixed models. It is interesting to note that single methods did better than mixed models.

In Figure 3, we show the performances of the final entrants, with symbols coding for the methods used:
X: Mixed or unknown method
TREE: Ensembles of trees (like Random Forests, RF)
NN/BNN: Neural networks or Bayesian neural networks
NB: Naive Bayes