Agnostic Learning vs.
Prior Knowledge Challenge &
Data Representation Discovery

August 16-17, 2007
Renaissance Orlando Resort, Florida

*** The challenge ended August 1st, 2007 ***

See the results!

See also August 12: feature selection tutorial




With the proper data representation, learning becomes almost trivial. For the defenders of fully automated data processing, the search for better data representations is just part of learning. At the other end of the spectrum, domain specialists engineer data representations, which are tailored to particular applications.

This workshop will bring together researchers from various domains to bridge the gap between best engineering practices and advances in machine learning methods to devise data representations suitable for learning tasks. Topics of interest include, but are not limited to:

  • space embedding, space dimensionality reduction, latent variable methods
  • preprocessing (noise modeling, spectral transformations, etc.)
  • feature extraction (including feature construction and feature selection)
  • dynamic representations (selective attention; top down expectancy with bottom-up evidence resolution; dynamic feature selection)
  • biologically/psychologically inspired data representations
  • data representations for image, speech, and video processing
  • data representations for molecules, including applications in drug discovery and bioinformatics
  • data representations for text processing, including language translation
  • supervised, unsupervised, semi-supervised methods of learning data representations
  • learning representations with neural networks
  • learning representations with evolutionary computing
  • learning kernels (in kernel methods), learning similarity measures, learning distance metrics
  • transformation invariant representations, robust representations
  • multi-level data representations
  • multi-objective data representations
  • theoretical and empirical assessment of data representations

The emphasis will be on generic, principled methods rather than on ad-hoc solutions.
The results of the "Agnostic Learning vs. Prior Knowledge" challenge will be discussed at the workshop.
This challenge illustrates the quest for better data representations. The participants can compete on five classification problems, formatted in two different ways:
-    In the “prior knowledge” track, the participants have access to the original raw data and as much knowledge as possible about the data. This gives them a lot of flexibility to devise clever representations or similarity measures.
-    In the “agnostic learning” track the participants are forced to use a data representation with ready-made features. They must build predictors without knowing the nature of the features.


"When everything fails, ask for additional domain knowledge"

is the current motto of machine learning. Therefore, assessing the real added value of prior/domain knowledge is both a deep and a practical question: most commercial data mining programs pursue the agnostic learning track. Can their off-the-shelf programs put skilled data analysts out of business? Or else, how close can the agnostic learning methods get to the results obtained using prior knowledge?

The goal of the "Agnostic Learning vs. Prior Knowledge" challenge is to obtain the best possible predictor using either the raw data or the preprocessed data. Entrants must provide results on ALL five data sets provided. Mixed submissions (using the raw data on some tasks and the preprocessed data on others) are allowed. During the development period, participants may submit results on a validation subset of the data to obtain immediate feedback. The final ranking will be performed on a separate test set.

Download all the PRIOR KNOWLEDGE TRACK DATA (58.9 MB) and the PRIOR KNOWLEDGE (a document describing the data).
or download dataset by dataset:

Domain | Num. ex. (tr/val/te) | Raw data (for the prior knowledge track) | Preprocessed data (for the agnostic learning track)
 | 4147 / 415 / 41471 | 14 features, comma separated format, 0.6 MB. | 48 features, non-sparse format, 0.6 MB.
 | 3153 / 315 / 31532 | 784 features, non-sparse format, 7.7 MB. | 970 features, non-sparse format, 19.4 MB.
Drug discovery | 3845 / 384 / 38449 | Chemical structure in MDL-SD format, 30.3 MB. | 1617 features, non-sparse format, 7.6 MB.
Text classif. | 1754 / 175 / 17537 | Text. 14 MB. | 16969 features, sparse format, 2.3 MB.
 | 13086 / 1309 / 130857 | 108 features, non-sparse format, 6.2 MB. | 216 features, non-sparse format, 15.6 MB.

The validation set labels are now available for the agnostic learning track and the prior knowledge track.
The data are also available from the challenge web site.
All datasets are stored in simple text formats. Sample Matlab code is available to read the data and format the results. The results must be uploaded to the challenge web site. See the example of result archive.
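
For readers who do not use Matlab, the following Python sketch illustrates how simple text formats of this kind might be read and how predictions might be written out. The layouts are assumptions made for illustration only: one example per line, whitespace-separated values for the non-sparse format, "index:value" tokens or bare indices for the sparse format, and one predicted label per line for the results. The function names are hypothetical; the data documentation and the sample Matlab code remain the reference for the actual formats.

```python
from typing import Dict, List


def parse_dense(text: str) -> List[List[float]]:
    """Assumed non-sparse layout: one example per line,
    whitespace-separated feature values."""
    return [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]


def parse_sparse(text: str) -> List[Dict[int, float]]:
    """Assumed sparse layout: one example per line; each token is
    either 'index:value' or a bare index (taken as value 1.0).
    An empty line is kept as an example with no non-zero feature."""
    rows = []
    for line in text.splitlines():
        row: Dict[int, float] = {}
        for token in line.split():
            index, sep, value = token.partition(":")
            row[int(index)] = float(value) if sep else 1.0
        rows.append(row)
    return rows


def format_predictions(labels: List[int]) -> str:
    """Assumed result layout: one predicted label (+1 or -1) per line."""
    return "".join(f"{y}\n" for y in labels)


if __name__ == "__main__":
    print(parse_dense("1 0 3.5\n0 2 0\n"))
    print(parse_sparse("1:2 4:0.5\n3 7\n"))
    print(format_predictions([1, -1, 1]), end="")
```

Representing each sparse example as a dictionary avoids allocating a full row for data sets with many thousands of mostly zero features.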


The analysis of the challenge is now available. The prize winners are:
Roman Lutz: Best agnostic learning entry. Best results on SYLVA in the PK track.
Marc Boullé: Best results on ADA in the PK track.
Vladimir Nikulin: Best results on GINA in the PK track.
Chloé-Agathe Azencott: Best results on HIVA in the PK track.
Jorge Sueiras: Best results on NOVA in the PK track.
Gavin Cawley: Best paper award.
The challenge was open to everyone from October 1, 2006 to August 1, 2007. Entries can still be made to the challenge web site.
Some participants submitted papers to the IJCNN 2007 conference; these papers will be published in the proceedings. The winner of the best paper award will be announced at the conference. There are 6 other prizes for challenge participants (see FAQ).

The best papers of the workshop will be published in a book in preparation.


October 1, 2006: Challenge opens.
January 31, 2007: IJCNN 2007 paper submission deadline.
February 1, 2007: Release of the validation set labels.
March 1, 2007: Challenge ends.
August 1, 2007: Challenge ends.
August 16-17, 2007: Workshop (register on the IJCNN website).

Workshop schedule:

Thursday, August 16, 6-10 pm, Crystal Ballroom E:

6 pm: Reception, pizza and drinks.
6:30 pm: Presentation of the main results and Award Ceremony for the IJCNN 2007 challenges "Agnostic learning vs. prior knowledge" and "Neural network forecasting", by Isabelle Guyon and Sven Crone.
7:00 pm: Neural network forecasting competition workshop, part I.
8:30 pm: Poster session (shared between both competitions).

Friday, August 17, 8:30 am-12:30 pm, Crystal Ballroom E:

8:30 am-12:30 pm: Neural network forecasting competition workshop, part II.

Friday, August 17, 1:30-7:00 pm, Crystal Ballroom E:

Part I: Agnostic learning. Chair Gavin Cawley
1:30 - 1:40 p.m. ALvsPK challenge results. Isabelle Guyon [abstract][slides][IJCNN paper][NN paper]
1:40 - 2:10 TUTORIAL Baseline models using kernel methods. Gavin Cawley [abstract][fact sheet][slides][paper]
2:10 - 2:30 Particle Swarm Model Selection. H. Jair Escalante [abstract][fact sheet][slides][paper]
2:30 - 2:50 Feature/Model Selection by the Linear Programming SVM. Erinija Pranckeviciene [abstract][fact sheet][slides][appendix][paper]
2:50 - 3:10 Agnostic Learning with Ensembles of Classifiers. Joerg D. Wichard [abstract][fact sheet][slides][paper]
3:10 - 3:30 Ensemble of ensemble of tree and neural network. Louis Duclos-Gosselin [abstract][fact sheet][slides]
3:30 - 3:50 Stochastic optimization of a serial tree ensemble for CV error.
Victor Eruhimov, Vladimir Martyanov, Eugene Tuv (Intel team)
[abstract][fact sheet1][fact sheet2]

3:50 - 4:00 Break

Part II: Prior Knowledge. Chair Isabelle Guyon
4:00 - 4:30 TUTORIAL. Preprocessing Techniques for Image Analysis Applications.
Hong Zhang [abstract][slides]
4:30 - 5:00 TUTORIAL. Variable Selection and Feature Construction using methods related to Information Theory. Kari Torkkola [abstract][slides]
5:00 - 5:20 High-Throughput Screening with 2D kernel methods. Chloé-Agathe Azencott, Pierre Baldi [abstract][slides][fact sheet]
5:20 - 5:40 Model Selection and Assessment Using Cross-indexing. Juha Reunanen [abstract][fact sheet][slides][paper]
5:40 - 6:00 Data Grid Models in the Agnostic Learning vs. Prior Knowledge Challenge
Marc Boullé
[abstract][fact sheet][slides][paper]

6:00 - 6:10 Break

Part III: 6:10 - 7:00 Discussion, plan future challenge(s).

Extra material from challenge participants (not attending the workshop)


Machine Learning Challenges: Challenges co-organized and sponsored by Clopinet.

WCCI 2006 workshop on model selection and performance prediction challenge. We organized a competition on model selection and the prediction of generalization performance. How good are you at predicting how good you are?

NIPS 2003 workshop on feature extraction and feature selection challenge. We organized a competition on five data sets in which hundreds of entries were made. The web site of the challenge is still available for post challenge submissions. Measure yourself against the winners! Get the book of the challenge (with data, results, tutorials).

Pascal challenges: The Pascal network is sponsoring several challenges in Machine learning.

Data mining competitions:
A list of data mining competitions maintained by KDnuggets, including the well known KDD cup.

List of data sets for machine learning:
A rather comprehensive list maintained by MLnet.

UCI machine learning repository: A great collection of datasets for machine learning research.

DELVE: A platform developed at the University of Toronto to benchmark machine learning algorithms.

Critical Assessment of Microarray Data Analysis, an annual conference on gene expression microarray data analysis. This conference includes a contest with emphasis on gene selection, a special case of feature selection.

International Conference on Document Analysis and Recognition, a biennial conference proposing a contest in printed text recognition. Feature extraction/selection is a key component to win such a contest.

Text Retrieval conference, organized every year by NIST. The conference is organized around the result of a competition. Past winners have had to address feature extraction/selection effectively.

In conjunction with the International Conference on Pattern Recognition, ICPR 2004, a face recognition contest has been organized.

Critical Assessment of Techniques for Protein Structure Prediction, an important competition in protein structure prediction.

Contact information

Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211

Co-organizers and advisors:
Amir Reza Saffari Azar (Graz University of Technology), Joachim Buhmann (ETH, Zurich), Gideon Dror (Academic College of Tel-Aviv-Yaffo), Olivier Guyon (MisterP services), Lambert Schomaker (University of Groningen), Kari Torkkola (Motorola, USA), Steve Gunn (University of Southampton), Horst Bischof (Graz University, Austria), Françoise Fogelman (KXEN, Paris, France), Michèle Sebag (LRI, Orsay, France), André Elisseeff (IBM, Zurich, Switzerland), Kristin Bennett (Rensselaer Polytechnic Institute, USA), Gökhan Bakır (MPI for Biological Cybernetics, Germany), Joaquin Quiñonero Candela (MPI for Biological Cybernetics, Germany), Yi Lu Murphey (University of Michigan-Dearborn), Gavin Cawley (University of East Anglia, UK), Masoud Nikravesh (UC Berkeley), Stefan Lessmann (University of Hamburg, Germany), Erinija Pranckeviciene (Institute for Biodiagnostics, NRC, Canada), and Leszek Rutkowski (Technical University of Czestochowa, Poland).

We are very grateful to the workshop sponsors and the other challenge sponsors:

Microsoft, KXEN, HDC

This project is supported by the National Science Foundation under Grant No. ECCS-0736687. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.