Background
Model selection is a central problem in statistics, machine learning, and data mining. Given training data consisting of input-output pairs, a model is built to predict the output from the input, usually by fitting adjustable parameters. Many predictive models have been proposed for such tasks, including linear models, neural networks, classification and regression trees, and kernel methods. The selection of an optimal model, one that performs best on test data, is the focus of this workshop. A related problem is finding an optimal ensemble of models that form a committee and vote on the final decision according to given scores. Contributions on ensemble methods are also within the scope of the workshop.
Part of the workshop will be devoted to the results of a challenge on performance prediction:
How good are you at predicting how good you are?
In most real-world situations, it is not sufficient to provide a good predictor; it is also important to assess accurately how well that predictor will perform on new, unseen data. Before deploying a model in the field, one must know whether it will meet the specifications, or whether one should invest more time and resources to collect additional data and/or develop more sophisticated models. The performance prediction challenge asks you to provide prediction results on new, unseen test data AND to predict how good these predictions are. Therefore, you must design both a good predictive model and a good performance estimator.
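To make the task concrete, here is a minimal sketch in Python of what such an entry involves: a predictor together with a cross-validation-based guess of its error on unseen data. This is not the official evaluation code; the synthetic data set, the logistic regression model, and the use of the balanced error rate as the figure of merit are illustrative assumptions.

    # A minimal sketch (not the official challenge code): build a predictor
    # and *predict* its own error on unseen test data via cross-validation.
    # Data set, model, and metric choices here are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import cross_val_score, train_test_split

    # Synthetic two-class problem standing in for one of the challenge data sets.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)

    model = LogisticRegression(max_iter=1000)

    # Performance estimator: cross-validated balanced accuracy on the training
    # data, turned into a guess of the balanced error rate (BER) on the test set.
    cv_scores = cross_val_score(model, X_train, y_train, cv=10,
                                scoring="balanced_accuracy")
    ber_guess = 1.0 - cv_scores.mean()

    # The actual test BER, which an entrant would not see before submitting.
    model.fit(X_train, y_train)
    ber_test = 1.0 - balanced_accuracy_score(y_test, model.predict(X_test))

    print(f"predicted BER: {ber_guess:.3f}  actual test BER: {ber_test:.3f}")

A good performance estimator is one for which the two printed numbers agree closely.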
The performance prediction challenge
is connected to model selection because accurate performance predictions
are good model ranking criteria. We formatted five data sets for the purpose
of benchmarking performance prediction in a controlled manner. The data sets
span a wide variety of domains and have sufficiently many test examples to
obtain statistically significant results.
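As an illustration of this connection, the sketch below ranks a handful of candidate models by their cross-validated error estimates and selects the one with the lowest estimate. The data and the candidate set, which echoes the model families listed above, are again assumptions made for the example.

    # A minimal sketch: an accurate performance estimate is, by construction,
    # a good criterion for ranking candidate models. Candidates are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

    # Candidate models echoing the families mentioned above (linear, tree, kernel).
    candidates = {
        "linear": LogisticRegression(max_iter=1000),
        "tree": DecisionTreeClassifier(max_depth=5),
        "kernel": SVC(kernel="rbf"),
    }

    # Estimate each model's error rate as 1 minus its cross-validated accuracy,
    # then select the candidate with the lowest estimated error.
    estimates = {name: 1.0 - cross_val_score(clf, X, y, cv=10).mean()
                 for name, clf in candidates.items()}

    for name, err in sorted(estimates.items(), key=lambda kv: kv[1]):
        print(f"{name}: estimated error {err:.3f}")
    print("selected:", min(estimates, key=estimates.get))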
Challenge
The WCCI 2006 performance prediction challenge is to obtain a good predictor AND to predict how well it will perform on a large test set. Entrants must provide results on ALL five data sets. To facilitate entering results for all five data sets, all tasks are two-class classification problems. During the development period, participants may submit results on a validation subset of the data to obtain immediate feedback. The final ranking will be performed on a separate test set.
How to participate:
The challenge ran from September 30, 2005 until March 1, 2006. The challenge is now over; check the results. The challenge web site will soon reopen for post-challenge submissions.
(Photo: some of the participants and organizers at WCCI 2006)
Authors of the best contributions will be invited to submit a paper to a special topic of the Journal of Machine Learning Research. Participants are also encouraged to submit negative results to the Journal of Interesting Negative Results.
Workshop schedule
The modelselect papers presented at the conference are linked below for convenience. Contact the organizers to gain access if you are entitled to it.
Session TueAM-5: IJCNN Competition Program - Performance Prediction Challenge I
Tuesday, July 18, 8:00AM-10:00AM, Room: Junior Ballroom
8:00am -- Performance Prediction Challenge
Isabelle Guyon, Amir Reza Saffari Azar Alamdari, Gideon Dror and Joachim
Buhmann [Slides][Paper]
8:20am -- LogitBoost with Trees Applied to the WCCI 2006
Performance Prediction Challenge Datasets
Roman Lutz [Slides][Paper]
8:40am -- Leave-one-out Cross-validation Based Model Selection
Criteria for Weighted LS-SVMs
Gavin Cawley [Slides][Paper]
9:00am -- Classification with Tree-based Ensembles Applied
to the WCCI 2006 Performance Prediction Challenge Datasets
Corinne Dahinden [Slides][Paper]
9:20am -- Model Selection: An Empirical Study on Two Kernel
Classifiers
Wei Chu [Slides][Paper]
9:40am -- Regularization and Averaging of the Selective
Naive Bayes Classifier
Marc Boullé [Slides][Paper]
Session TueMM-5: S4: Model Selection
Tuesday, July 18, 1:00PM-3:00PM, Room: Junior Ballroom D
1:00pm -- Nonlinear Model Selection Based on the Modulus
of Continuity
Imhoi Koo and Rhee Man Kil [Slides][Paper]
1:20pm -- Semi-supervised Model Selection Based on Cross-Validation
Matti Kääriäinen [Slides][Paper]
1:40pm -- New Formulation of SVM for Model Selection
Mathias Adankon and Mohamed Cheriet [Slides][Paper]
2:00pm -- Common Subset Selection of Inputs in Multiresponse
Regression
Timo Similä and Jarkko Tikka [Slides][Paper]
2:20pm -- Breakdown Point of Model Selection When the Number
of Variables Exceeds the Number of Observations
David Donoho and Victoria Stodden [Slides][Paper]
2:40pm -- Model Selection via Bilevel Optimization
Kristin Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang
[Slides][Paper]
Session TuePM-5: IJCNN Competition Program - Performance Prediction Challenge II
Tuesday, July 18, 3:15PM-5:15PM, Room: Junior Ballroom D
3:15pm -- Feature Selection Using Ensemble Based Ranking
Against Artificial Contrasts
Eugene Tuv, Alexander Borisov and Kari Torkkola [Slides][Paper]
3:35pm -- Model Selection in an Ensemble Framework
Joerg D. Wichard [Slides][Paper]
3:55pm -- Learning with Mean-variance Filtering, SVM and
Gradient-based Optimization
Vladimir Nikulin [Slides] [Talk][Paper]
4:15pm -- A Study of Supervised Learning with Multivariate
Analysis on Unbalanced Datasets
Yu-Yen Ou, Hao-Geng Hung and Yen-Jen Oyang [Slides][Paper]
4:35pm -- Competition Panel: An open discussion on the results of the competition and planning for future such events. [Log of the discussion]
Links
NIPS 2003 workshop
on feature extraction and feature selection challenge. We organized
a competition on five data sets in which hundreds of entries were
made. The web site of the challenge is still available for post-challenge
submissions. Measure yourself against the winners!
Pascal challenges: The Pascal network is sponsoring several challenges in machine learning.
Data mining competitions: A list of data mining competitions maintained by KDnuggets, including the well-known KDD Cup.
List of data sets for machine learning: A rather comprehensive list maintained by MLnet.
On-line machine learning resources: Includes pointers to software and data. The collections include the famous UCI repositories, the DELVE platform of the University of Toronto, and other resources.
CAMDA
Critical Assessment of Microarray Data Analysis, an annual conference on gene expression microarray data analysis. This conference includes a contest with emphasis on gene selection, a special case of feature selection.
ICDAR
International Conference on Document Analysis and Recognition, a biennial conference proposing a contest in printed text recognition. Feature extraction/selection is a key component of winning such a contest.
TREC
Text REtrieval Conference, organized every year by NIST. The conference is built around the results of a competition. Past winners have had to address feature extraction/selection effectively.
ICPR
In conjunction with the International Conference on Pattern Recognition, ICPR 2004, a face recognition contest was organized.
CASP
An important competition in protein structure prediction called the Critical Assessment of Techniques for Protein Structure Prediction.
Contact information
Principal Investigator:
Isabelle Guyon
Clopinet Enterprises
955 Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211
Collaborators and advisors: Steve Gunn (University of Southampton), Yoshua Bengio (University
of Montréal), Asa Ben-Hur (Colorado State University), Joachim
Buhmann (ETH, Zurich), Gideon Dror (Academic College of
Tel-Aviv-Yaffo), Olivier Guyon
(MisterP services), Amir Reza Saffari Azar (Graz University of Technology), Lambert Schomaker (University
of Groningen), and Vladimir
Vapnik (NEC, Princeton).
This project is supported by the National Science Foundation under Grant No. ECS-0424142. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.