Data Representation Discovery workshop

Agnostic Learning vs.

Prior Knowledge Challenge &
Data Representation Discovery
Workshop

August 16-17, 2007
Renaissance Orlando Resort, Florida

*** The challenge ended August 1st, 2007 ***

See the results!

See also August 12: feature selection tutorial

Participants

Background

With the proper data representation, learning becomes almost trivial. For the defenders of fully automated data processing, the search for better data representations is just part of learning. At the other end of the spectrum, domain specialists engineer data representations, which are tailored to particular applications.

This workshop will bring together researchers from various domains to bridge the gap between best engineering practices and advances in machine learning methods to devise data representations suitable for learning tasks. Topics of interest include, but are not limited to:

space embedding, space dimensionality reduction, latent variable methods
preprocessing (noise modeling, spectral transformations, etc.)
feature extraction (including feature construction and feature selection)
dynamic representations (selective attention; top down expectancy with bottom-up evidence resolution; dynamic feature selection)
biologically/psychologically inspired data representations
data representations for image, speech, and video processing
data representations for molecules, including applications in drug discovery and bioinformatics
data representations for text processing, including language translation
supervised, unsupervised, semi-supervised methods of learning data representations
learning representations with neural networks
learning representations with evolutionary computing
learning kernels (in kernel methods), learning similarity measures, learning distance metrics
transformation invariant representations, robust representations
multi-level data representations
multi-objective data representations
theoretical and empirical assessment of data representations

The emphasis will be on generic, principled methods rather than on ad hoc solutions.
The results of the "Agnostic Learning vs. Prior Knowledge" challenge will be discussed at the workshop. This challenge illustrates the quest for better data representations. The participants can compete on five classification problems, formatted in two different ways:
- In the “prior knowledge” track, the participants have access to the original raw data and as much knowledge as possible about the data. This gives them a lot of flexibility to devise clever representations or similarity measures.
- In the “agnostic learning” track the participants are forced to use a data representation with ready-made features. They must build predictors without knowing the nature of the features.

Challenge

"When everything fails, ask for additional domain knowledge"

is the current motto of machine learning. Assessing the real added value of prior/domain knowledge is therefore both a deep and a practical question. Most commercial data mining programs pursue the agnostic learning track: can their off-the-shelf programs put skilled data analysts out of business? Or else, how close can agnostic learning methods get to the results obtained using prior knowledge?

The goal of the "Agnostic Learning vs. Prior Knowledge" challenge is to obtain the best possible predictor using either the raw data or the preprocessed data. Entrants must provide results on ALL five data sets provided. Mixed submissions (using the raw data on some tasks and the preprocessed data on others) are allowed. During the development period, participants may submit results on a validation subset of the data to obtain immediate feedback. The final ranking will be performed on a separate test set.

Download all the AGNOSTIC LEARNING TRACK DATA (45.6 MB)
Download all the PRIOR KNOWLEDGE TRACK DATA (58.9 MB) and the PRIOR KNOWLEDGE (a document describing the data).
or download dataset by dataset:

Name	Domain	Num. ex. (tr/val/te)	Raw data (for the prior knowledge track)	Preprocessed data (for the agnostic learning track)
ADA	Marketing	4147 / 415 / 41471	14 features, comma separated format, 0.6 MB.	48 features, non-sparse format, 0.6 MB.
GINA	Handwriting recognition	3153 / 315 / 31532	784 features, non-sparse format, 7.7 MB.	970 features, non-sparse format, 19.4 MB.
HIVA	Drug discovery	3845 / 384 / 38449	Chemical structure in MDL-SD format, 30.3 MB.	1617 features, non-sparse format, 7.6 MB.
NOVA	Text classif.	1754 / 175 / 17537	Text. 14 MB.	16969 features, sparse format, 2.3 MB.
SYLVA	Ecology	13086 / 1309 / 130857	108 features, non-sparse format, 6.2 MB.	216 features, non-sparse format, 15.6 MB.

The validation set labels are now available for the agnostic learning track and the prior knowledge track.
The data are also available from the challenge web site.
All datasets are stored in simple text formats. Sample Matlab code is available to read the data and format the results. The results must be uploaded to the challenge web site. See the example of result archive.
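
For readers who do not use Matlab, the following Python sketch illustrates how such text files might be read and how predictions might be written out. The file layouts are assumptions made for illustration only, not taken from the challenge documentation: the non-sparse format is assumed to hold one example per line with whitespace-separated numeric values, the sparse format is assumed to list on each line the 1-based indices of the binary features equal to 1, and the result file is assumed to hold one predicted label per line. The function names are hypothetical. Check the sample Matlab code and the example result archive for the actual conventions.

```python
from typing import List


def parse_dense(text: str) -> List[List[float]]:
    """Parse an ASSUMED non-sparse layout: one example per line,
    whitespace-separated numeric feature values."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:  # ignore blank lines
            rows.append([float(v) for v in fields])
    return rows


def parse_sparse_binary(text: str, num_features: int) -> List[List[int]]:
    """Parse an ASSUMED sparse layout: each line lists the 1-based indices
    of the features equal to 1; every other feature is 0.
    A blank line is an example with no active feature."""
    rows = []
    for line in text.splitlines():
        row = [0] * num_features
        for token in line.split():
            row[int(token) - 1] = 1
        rows.append(row)
    return rows


def format_predictions(labels: List[int]) -> str:
    """Format predictions as one label per line (ASSUMED result layout)."""
    return "\n".join(str(y) for y in labels) + "\n"


if __name__ == "__main__":
    # Tiny in-memory examples standing in for real data files.
    print(parse_dense("1 0 2.5\n3 4 5\n"))
    print(parse_sparse_binary("1 3\n\n2\n", num_features=3))
    print(format_predictions([1, -1]), end="")
```

In practice the text would come from reading a downloaded data file, and the string returned by `format_predictions` would be written to a result file before uploading.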

Results

The analysis of the challenge is now available. The prize winners are:
Roman Lutz: Best agnostic learning entry. Best results on SYLVA in the PK track.
Marc Boullé: Best results on ADA in the PK track.
Vladimir Nikulin: Best results on GINA in the PK track.
Chloé-Agathe Azencott: Best results on HIVA in the PK track.
Jorge Sueiras: Best results on NOVA in the PK track.
Gavin Cawley: Best paper award.

Participation

The challenge is open to everyone from October 1, 2006 to August 1st, 2007. Entries can still be made to the challenge web site.

Some participants submitted papers to the IJCNN 2007 conference; these papers will be published in the proceedings. The winner of the best paper award will be announced at the conference. There are 6 other prizes for challenge participants (see FAQ).

The best papers of the workshop will be published in a book in preparation.

Schedule

October 1, 2006: Challenge opens.
January 31, 2007: IJCNN 2007 paper submission deadline.
February 1, 2007: Release of the validation set labels.
March 1, 2007: Challenge ends.
August 1, 2007: Challenge ends.
August 16-17, 2007: Workshop (register on the IJCNN website).

Workshop schedule:

Thursday, August 16, 6-10 pm, Crystal Ballroom E:

6 pm: Reception, pizza and drinks.
6:30 pm: Presentation of the main results and Award Ceremony for the IJCNN 2007 challenges
Agnostic learning vs. prior knowledge and Neural network forecasting, by Isabelle Guyon and Sven Crone.
7:00 pm: Neural network forecasting competition workshop, part I.
8:30 pm: Poster session (shared between both competitions).

Friday, August 17, 8:30 am-12:30 pm, Crystal Ballroom E:

8:30 am-12:30 pm: Neural network forecasting competition workshop, part II.

Friday, August 17, 1:30-7:00 pm, Crystal Ballroom E:

Part I: Agnostic learning. Chair Gavin Cawley
1:30 - 1:40 p.m. ALvsPK challenge results. Isabelle Guyon [abstract][slides][IJCNN paper][NN paper]
1:40 - 2:10 TUTORIAL Baseline models using kernel methods. Gavin Cawley [abstract][fact sheet][slides][paper]
2:10 - 2:30 Particle Swarm Model Selection. H. Jair Escalante [abstract][fact sheet][slides][paper]
2:30 - 2:50 Feature/Model Selection by the Linear Programming SVM. Erinija Pranckeviciene [abstract][fact sheet][slides][appendix][paper]
2:50 - 3:10 Agnostic Learning with Ensembles of Classifiers. Joerg D. Wichard [abstract][fact sheet][slides][paper]
3:10 - 3:30 Ensemble of ensemble of tree and neural network. Louis Duclos-Gosselin [abstract][fact sheet][slides]
3:30 - 3:50 Stochastic optimization of a serial tree ensemble for CV error.
Victor Eruhimov, Vladimir Martyanov, Eugene Tuv (Intel team) [abstract][fact sheet1][fact sheet2]

3:50 - 4:00 Break

Part II: Prior Knowledge. Chair Isabelle Guyon
4:00 - 4:30 TUTORIAL. Preprocessing Techniques for Image Analysis Applications.
Hong Zhang [abstract][slides]
4:30 - 5:00 TUTORIAL. Variable Selection and Feature Construction using methods related to Information Theory. Kari Torkkola [abstract][slides]
5:00 - 5:20 High-Throughput Screening with 2D kernels methods. Chloé-Agathe Azencott, Pierre Baldi [abstract][slides][fact sheet]
5:20 - 5:40 Model Selection and Assessment Using Cross-indexing. Juha Reunanen [abstract][fact sheet][slides][paper]
5:40 - 6:00 Data Grid Models in the Agnostic Learning vs. Prior Knowledge Challenge
Marc Boullé [abstract][fact sheet][slides][paper]

6:00 - 6:10 Break

Part III: 6:10 - 7:00 Discussion, plan future challenge(s).

Extra material from challenge participants (not attending the workshop)

Doubleboost Roman Lutz [fact sheet][last year's slides][last year's paper]
Random Sets, Boosting and Distance-based Clustering, Vladimir Nikulin [fact sheet][paper]. Some pictures: [suncorp], it is winter in Australia! [mount][snow]
Modified multi-class SVM formulation; Efficient LOO computation. Vojtech Franc [fact sheet]
Hybrid approach for learning. Mehreen Saeed [fact sheet][paper]
Random Subspace Classifier. Dmitry Zhora [fact sheet]
Dimensionality Reduction Techniques. Stijn Vanderlooy & Laurens van der Maaten [fact sheet]
Boot trees. Jorge Sueiras [fact sheet]

Links

Machine Learning Challenges: Challenges co-organized and sponsored by Clopinet.

WCCI 2006 workshop on the model selection and performance prediction challenge. We organized a competition on model selection and the prediction of generalization performance. How good are you at predicting how good you are?

NIPS 2003 workshop on feature extraction and feature selection challenge. We organized a competition on five data sets in which hundreds of entries were made. The web site of the challenge is still available for post challenge submissions. Measure yourself against the winners! Get the book of the challenge (with data, results, tutorials).

Pascal challenges: The Pascal network is sponsoring several challenges in Machine learning.

Data mining competitions:
A list of data mining competitions maintained by KDnuggets, including the well known KDD cup.

List of data sets for machine learning:
A rather comprehensive list maintained by MLnet.

UCI machine learning repository: A great collection of datasets for machine learning research.

DELVE: A platform developed at the University of Toronto to benchmark machine learning algorithms.

CAMDA
Critical Assessment of Microarray Data Analysis, an annual conference on gene expression microarray data analysis. This conference includes a contest with emphasis on gene selection, a special case of feature selection.

ICDAR
International Conference on Document Analysis and Recognition, a biennial conference proposing a contest in printed text recognition. Feature extraction/selection is a key component to win such a contest.

TREC
Text Retrieval conference, organized every year by NIST. The conference is organized around the result of a competition. Past winners have had to address feature extraction/selection effectively.

ICPR
In conjunction with the International Conference on Pattern Recognition, ICPR 2004, a face recognition contest has been organized.

CASP
An important competition in protein structure prediction called Critical Assessment of Techniques for Protein Structure Prediction.

Contact information

Coordinator:
Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211

Co-organizers and advisors:
Amir Reza Saffari Azar (Graz University of Technology), Joachim Buhmann (ETH, Zurich), Gideon Dror (Academic College of Tel-Aviv-Yaffo), Olivier Guyon (MisterP services), Lambert Schomaker (University of Groningen), Kari Torkkola (Motorola, USA), Steve Gunn (University of Southampton), Horst Bischof (Graz University, Austria), Françoise Fogelman (KXEN, Paris, France), Michèle Sebag (LRI, Orsay, France), André Elisseeff (IBM, Zurich, Switzerland), Kristin Bennett (Rensselaer Polytechnic Institute, USA), Gökhan Bakır (MPI for Biological Cybernetics, Germany), Joaquin Quiñonero Candela (MPI for Biological Cybernetics, Germany), Yi Lu Murphey (University of Michigan-Dearborn), Gavin Cawley (University of East Anglia, UK), Masoud Nikravesh (UC Berkeley), Stefan Lessmann (University of Hamburg, Germany), Erinija Pranckeviciene (Institute for Biodiagnostics, NRC, Canada), and Leszek Rutkowski (Technical University of Czestochowa, Poland).

We are very grateful to the workshop sponsors and the other challenge sponsors.

This project is supported by the National Science Foundation under Grant No. ECCS-0736687. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.