Agnostic Learning vs. Prior Knowledge Challenge & Data Representation Discovery Workshop
August 16-17, 2007
Renaissance Orlando Resort, Florida
*** The challenge ended August 1st, 2007 ***
See the results!
See also August 12: feature selection tutorial
Background
With the proper
data representation, learning becomes almost trivial. For the
defenders of fully automated data processing, the search for better
data representations is just part of learning. At the other end
of the spectrum, domain specialists engineer data representations,
which are tailored to particular applications.
This workshop
will bring together researchers from various domains to bridge
the gap between best engineering practices and advances in machine
learning methods to devise data representations suitable
for learning tasks. Topics of interest include, but are not limited
to:
- space embedding, space
dimensionality reduction, latent variable methods
- preprocessing
(noise modeling,
spectral transformations, etc.)
- feature extraction (including
feature construction and feature selection)
- dynamic representations
(selective attention; top down expectancy with bottom-up evidence
resolution; dynamic feature selection)
- biologically/psychologically
inspired data representations
- data representations
for image, speech, and video processing
- data representations
for molecules, including applications in drug discovery and bioinformatics
- data representations
for text processing, including language translation
- supervised, unsupervised,
semi-supervised methods of learning data representations
- learning representations
with neural networks
- learning representations
with evolutionary computing
- learning kernels (in
kernel methods), learning similarity measures, learning distance
metrics
- transformation invariant
representations, robust representations
- multi-level data representations
- multi-objective data
representations
- theoretical and empirical
assessment of data representations
The emphasis will
be on generic, principled methods rather than on ad hoc solutions.
The results of the "Agnostic Learning vs. Prior Knowledge"
challenge will be discussed at the workshop. This challenge illustrates
the quest for better data representations. The participants can
compete on five classification problems, formatted in two different
ways:
- In the “prior
knowledge” track, the participants have access to the original
raw data and as much knowledge as possible about the data. This
gives them a lot of flexibility to devise clever representations
or similarity measures.
- In the “agnostic
learning” track the participants are forced to use a data
representation with ready-made features. They must build predictors
without knowing the nature of the features.
Challenge
"When
everything fails, ask for additional domain knowledge"
is the current motto of machine
learning. Therefore, assessing the real added value of prior/domain
knowledge is both a deep and a practical question: most commercial
data mining programs pursue the agnostic learning track. Can their
off-the-shelf programs put skilled data analysts out of business?
Or else, how close can the agnostic learning methods get to the
results obtained using prior knowledge?
The "Agnostic
Learning vs. Prior Knowledge" challenge is to obtain the best possible
predictor using either the raw data or the preprocessed
data. Entrants must provide results on ALL five data sets provided. Mixed
submissions (using the raw data on some tasks and the preprocessed
data on others) are allowed. During the development
period, participants may submit results on a validation
subset of the data to obtain immediate feedback. The
final ranking will be performed on a separate test set.
Download all the AGNOSTIC LEARNING TRACK DATA (45.6 MB).
Download all the PRIOR KNOWLEDGE TRACK DATA (58.9 MB) and
the PRIOR KNOWLEDGE (a document describing the data).
Or download dataset by dataset:
Name | Domain | Num. ex. (tr/val/te) | Raw data (for the prior knowledge track) | Preprocessed data (for the agnostic learning track)
ADA | Marketing | 4147 / 415 / 41471 | 14 features, comma separated format, 0.6 MB. | 48 features, non-sparse format, 0.6 MB.
GINA | Handwriting recognition | 3153 / 315 / 31532 | 784 features, non-sparse format, 7.7 MB. | 970 features, non-sparse format, 19.4 MB.
HIVA | Drug discovery | 3845 / 384 / 38449 | Chemical structure in MDL-SD format, 30.3 MB. | 1617 features, non-sparse format, 7.6 MB.
NOVA | Text classification | 1754 / 175 / 17537 | Text, 14 MB. | 16969 features, sparse format, 2.3 MB.
SYLVA | Ecology | 13086 / 1309 / 130857 | 108 features, non-sparse format, 6.2 MB. | 216 features, non-sparse format, 15.6 MB.
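The training/validation/test counts listed for the five datasets can be turned into proportions to see how the data is split. The following is a minimal sketch in Python, not part of the challenge software: the counts are copied from the table, while the names `SPLITS`, `split_fractions`, and `summarize` are hypothetical and introduced only for illustration.

```python
# Example counts copied from the dataset table: (training, validation, test).
SPLITS = {
    "ADA": (4147, 415, 41471),
    "GINA": (3153, 315, 31532),
    "HIVA": (3845, 384, 38449),
    "NOVA": (1754, 175, 17537),
    "SYLVA": (13086, 1309, 130857),
}


def split_fractions(train, val, test):
    """Return the share of all examples in each of the three subsets."""
    total = train + val + test
    return train / total, val / total, test / total


def summarize(splits):
    """Compute totals, subset shares, and the test-to-training ratio per dataset."""
    rows = {}
    for name, (train, val, test) in splits.items():
        f_train, f_val, f_test = split_fractions(train, val, test)
        rows[name] = {
            "total": train + val + test,
            "train_frac": f_train,
            "val_frac": f_val,
            "test_frac": f_test,
            "test_to_train": test / train,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(SPLITS).items():
        print(
            f"{name:6s} total={row['total']:7d} "
            f"train={row['train_frac']:.3f} val={row['val_frac']:.3f} "
            f"test={row['test_frac']:.3f} test/train={row['test_to_train']:.2f}"
        )
```

With these counts, each test set comes out roughly ten times larger than its training set, and each validation set is about a tenth of the training set.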
The validation set labels are now available for the agnostic learning track and
the prior knowledge track.
The data are also available
from the challenge web site.
All datasets are stored
in simple text formats. Sample
Matlab code is available to read the data and format the
results. The results must be uploaded to the challenge web site.
See the example of result archive.
Results
The analysis of the challenge is now available.
The prize winners are:
- Roman Lutz: Best agnostic learning entry. Best results on SYLVA in the PK track.
- Marc Boullé: Best results on ADA in the PK track.
- Vladimir Nikulin: Best results on GINA in the PK track.
- Chloé-Agathe Azencott: Best results on HIVA in the PK track.
- Jorge Sueiras: Best results on NOVA in the PK track.
- Gavin Cawley: Best paper award.
Participation
The challenge was
open to everyone from October 1, 2006 to August 1, 2007.
Entries can still be made to the challenge
web site.
Some participants submitted
papers to the IJCNN 2007 conference; these will be published
in the proceedings. The winner of the best paper award will be announced
at the conference. There are 6 other prizes for challenge participants
(see FAQ).
The best papers of the workshop will be published in a book in preparation.
Schedule
October 1, 2006: Challenge opens.
January 31, 2007: IJCNN 2007 paper submission deadline.
February 1, 2007: Release of the validation set labels.
March 1, 2007: Challenge ends.
August 1, 2007: Challenge ends.
August 16-17, 2007: Workshop (register on the IJCNN website).
Workshop schedule:
Thursday, August 16, 6-10 pm, Crystal Ballroom E:
6 pm: Reception, pizza and drinks.
6:30 pm: Presentation of the main results and Award Ceremony for the IJCNN 2007 challenges "Agnostic learning vs. prior knowledge" and "Neural network forecasting", by Isabelle Guyon and Sven Crone.
7:00 pm: Neural network forecasting competition workshop, part I.
8:30 pm: Poster session (shared between both competitions).
Friday, August 17, 8:30 am-12:30 pm,
Crystal Ballroom E:
Friday, August 17, 1:30-7:00 pm,
Crystal Ballroom E:
Part I: Agnostic learning. Chair Gavin Cawley
1:30 - 1:40 p.m. ALvsPK challenge results. Isabelle Guyon [abstract][slides][IJCNN paper][NN paper]
1:40 - 2:10 TUTORIAL. Baseline models using kernel methods. Gavin Cawley [abstract][fact sheet][slides][paper]
2:10 - 2:30 Particle Swarm Model Selection. H. Jair Escalante [abstract][fact sheet][slides][paper]
2:30 - 2:50 Feature/Model Selection by the Linear Programming SVM. Erinija Pranckeviciene [abstract][fact sheet][slides][appendix][paper]
2:50 - 3:10 Agnostic Learning with Ensembles of Classifiers. Joerg D. Wichard [abstract][fact sheet][slides][paper]
3:10 - 3:30 Ensemble of ensemble of tree and neural network. Louis Duclos-Gosselin [abstract][fact sheet][slides]
3:30 - 3:50 Stochastic optimization of a serial tree ensemble for CV error. Victor Eruhimov, Vladimir Martyanov, Eugene Tuv (Intel team) [abstract][fact sheet1][fact sheet2]
3:50 - 4:00 Break
Part II: Prior Knowledge. Chair Isabelle Guyon
4:00 - 4:30 TUTORIAL. Preprocessing Techniques for Image Analysis Applications. Hong Zhang [abstract][slides]
4:30 - 5:00 TUTORIAL. Variable Selection and Feature Construction using methods related to Information Theory. Kari Torkkola [abstract][slides]
5:00 - 5:20 High-Throughput Screening with 2D kernel methods. Chloé-Agathe Azencott, Pierre Baldi [abstract][slides][fact sheet]
5:20 - 5:40 Model Selection and Assessment Using Cross-indexing. Juha Reunanen [abstract][fact sheet][slides][paper]
5:40 - 6:00 Data Grid Models in the Agnostic Learning vs. Prior Knowledge Challenge. Marc Boullé [abstract][fact sheet][slides][paper]
6:00 - 6:10 Break
Part III:
6:10 - 7:00 Discussion, plan future challenge(s).
Extra material
from challenge participants (not attending the workshop)
Links
Machine Learning Challenges: Challenges
co-organized and sponsored by Clopinet.
WCCI 2006 workshop of model selection and performance prediction
challenge. We organized a competition on model selection
and the prediction of generalization performance. How good
are you at predicting how good you are?
NIPS 2003 workshop
on feature extraction and feature selection challenge.
We organized a competition on five data sets in
which hundreds of entries were made. The web site of
the challenge is still available for post challenge submissions.
Measure yourself against the winners! Get the book of the challenge
(with data, results, tutorials).
Pascal challenges: The
Pascal network is sponsoring several challenges in Machine learning.
Data mining competitions:
A list of data mining
competitions maintained by KDnuggets, including the well known KDD
cup.
List
of data sets for machine learning:
A rather comprehensive
list maintained by MLnet.
UCI machine learning
repository: A great collection of datasets for machine
learning research.
DELVE: A platform developed
at University of Toronto to benchmark machine learning algorithms.
CAMDA
Critical Assessment
of Microarray Data Analysis, an annual conference on gene expression
microarray data analysis. This conference includes a contest
with emphasis on gene selection, a special case of feature
selection.
ICDAR
International Conference
on Document Analysis and Recognition, a biennial
conference proposing a contest in printed text recognition.
Feature extraction/selection is a key component to win
such a contest.
TREC
Text Retrieval conference,
organized every year by NIST. The conference
is organized around the result of a competition. Past
winners have had to address feature extraction/selection
effectively.
ICPR
In conjunction with
the International Conference on Pattern Recognition,
ICPR 2004, a face recognition contest has been organized.
CASP
An important competition
in protein structure prediction called Critical
Assessment of Techniques for Protein Structure
Prediction.
Contact information
Coordinator:
Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211
Co-organizers and advisors:
Amir Reza Saffari Azar (Graz University of Technology), Joachim
Buhmann (ETH, Zurich), Gideon Dror
(Academic College of Tel-Aviv-Yaffo), Olivier Guyon (MisterP services), Lambert Schomaker (University
of Groningen), Kari Torkkola (Motorola, USA), Steve Gunn (University of Southampton), Horst Bischof (Graz
University, Austria), Françoise
Fogelman (KXEN, Paris, France), Michèle Sebag (LRI, Orsay,
France), André
Elisseeff (IBM, Zurich, Switzerland), Kristin Bennett (Rensselaer Polytechnic
Institute, USA), Gökhan Bakır (MPI for Biological Cybernetics,
Germany), Joaquin Quiñonero
Candela (MPI for Biological Cybernetics, Germany),
Yi
Lu Murphey (University of Michigan-Dearborn), Gavin Cawley (University of East
Anglia, UK), Masoud Nikravesh
(UC Berkeley), Stefan
Lessmann (University of Hamburg, Germany), Erinija Pranckeviciene
(Institute for
Biodiagnostics, NRC, Canada), and Leszek Rutkowski (Technical
University of Czestochowa, Poland).
We are very grateful to the
workshop sponsors and the other challenge sponsors:
This project is supported by the National Science
Foundation under Grant
No. ECCS-0736687. Any opinions, findings, and conclusions
or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of
the National Science Foundation.