Model Selection Workshop
FAQ
What is the scope of the workshop?
- Model Selection.
- Ensemble Methods (weighting the decisions of several models may be
seen as a way to circumvent model selection).
- Performance Prediction (the estimated performance on future data
may be seen as a criterion for performing model selection).
- Discussion of the results of the Performance
Prediction Challenge.
What do you mean by "model selection"?
We consider supervised learning problems in which input-output
pairs are provided for training and output predictions must be made on
new test inputs. By "model" we mean a "predictive model" whose parameters
are fitted with the training examples. Many predictive models possess "hyper-parameters"
that must be selected prior to training (e.g. the number of hidden units
in a neural network, or the stopping criterion of an algorithm). We might
also have to choose among families of models with no relation to one
another (neural networks, tree classifiers, nearest neighbors, etc.).
The problem of model selection includes finding the best model family and
the best hyper-parameter setting. The workshop participants are invited
to present new model selection strategies or comparisons of existing strategies,
which include (1) searching the model space; (2) finding a good model
ranking criterion.
What is the "Performance Prediction Challenge"?
We are organizing a competition on five 2-class classification problems,
the results of which will be discussed at the workshop. To learn more
about the competition, see the challenge web site,
the challenge FAQ, and the learning
object FAQ.
How do I participate in the workshop?
Just register for WCCI 2006.
The workshop is part of the IJCNN 2006 conference, one of three conferences
held together in Vancouver, Canada, July 16-21, 2006. The Model Selection
workshop will be held on Tuesday July 18, 2006.
How do I contribute a paper to the workshop?
Submit a paper to the IJCNN 2006 conference.
1) Follow the
instructions. Page limit = 8 pages. LaTeX users, use
(1) \documentclass[conference,letterpaper]{IEEEtran};
(2) pdflatex;
to ensure the page size is US letter, as required.
2) Submit your paper. IMPORTANT: On
the IJCNN 2006 submit page,
under "Main research topic", select "S. SPECIAL SESSIONS -> S4. Model
selection (Guyon)".
3) Send email to modelselect@clopinet.com to notify us of
your submission, so we can verify it was forwarded to us.
January 31st, 2006, Submission deadline. (DEADLINE EXTENDED TO FEBRUARY 15th)
March 15th, 2006, Notification of acceptance.
April 15th, 2006, Camera-ready copy due.
Do I need to participate in the workshop to enter the challenge?
No.
Do I need to enter the challenge to participate in the workshop?
No.
I am clueless, what is predictive modeling?
We are interested in problems of predicting unknown outcomes from
observations. We limit ourselves to the setting in which the number
of observations is fixed and can be formatted as a vector of input variables
x. The prediction y will be a scalar. For problems of classification,
y will take discrete values, whereas for problems of regression, it will
take continuous values. Examples of classification problems include:
- Handwritten digit recognition: the predictor is presented with
images of digits and must label them 0-9.
- Spam categorization: the predictor is presented with desirable
emails and spam and must decide which ones to discard.
- Medical diagnosis: the predictor is presented with sets of symptoms
or medical analyses and must determine the disease from which the patient
suffers.
Examples of regression problems include:
- Medical prognosis: Predicting the time of recurrence of a disease
from medical analyses.
- Drug screening: Predicting the binding or solubility of a molecule
from chemical and physical descriptors.
- Marketing: Predicting how much a prospective customer will buy
based on census data.
All the problems we are interested in can be cast into
the framework of inference or “learning from examples”. A function y=f(x; a) or "model",
which may have a parameter vector a, is used to
make predictions. The components of its parameter vector a are adjusted
during training using a set of input-output pair examples
{(x1, y1), (x2, y2), …, (xp, yp)},
called training examples. For non-parametric methods like the “nearest
neighbor” method, the training examples are directly used, without parameter
adjustment per se, but, in our setting, training examples will
be considered parameters in a broader sense. The goal is not to
optimize predictions on the training examples, but to optimize predictions
on a set of test examples, for which the target values y are not available
during training. Such optimization must be carried out, however, using
training input-output pairs only, and, when available, the unlabeled test
set. Since the true objective involves unknown information (the test set
target values), objective functions that approximate the goal are derived.
The simplest one is the prediction error on training examples (training
error). But other objective functions may be better estimators of the true
objective.
Note that, in the challenge, we will have two test sets:
a validation set and a final test set. The validation set will be used
to provide the competitors with feed-back during the development period,
reserving the final test set for the real test.
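To make the setting above concrete, here is a minimal sketch in Python (a made-up toy example, not part of the challenge software) of a predictive model: the nearest-neighbor method, in which the stored training examples play the role of the parameters in the broad sense discussed above.

```python
import math

def nearest_neighbor_predict(train_pairs, x):
    """Label the test input x with the output of the closest training input."""
    closest = min(train_pairs, key=lambda pair: math.dist(pair[0], x))
    return closest[1]

# Training examples {(x1, y1), ..., (xp, yp)}: two classes in the plane.
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]

# Predictions on new test inputs, whose target values are withheld during training.
y_a = nearest_neighbor_predict(train, (0.05, 0.05))  # falls in the class-0 region
y_b = nearest_neighbor_predict(train, (0.95, 0.9))   # falls in the class-1 region
```

The goal, as stated above, is that such predictions be good on test examples, not merely on the training examples themselves.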
Tell me more about model selection and hyper-parameter selection.
Given a number of model families and given training data, which
one will perform best on the test set? This is the problem of model
selection. For example, in classification, many model families are popular:
nearest neighbors, Parzen windows, neural networks, decision trees,
and support vector machines (see e.g. Andrew Moore's tutorial slides). But
no one knows which one will perform best on a given task. Can we predict
it using training data only? This is one of the central questions to be
answered by our challenge.
Most models (if not all) have hyper-parameters.
We shall denote them as q and write f(x; a, q). Those are
parameters that are not subject to training. For example: the number
of neighbors in the nearest neighbor method, the kernel size in the
Parzen windows method, the number of layers and units per layer in a neural
network, the “soft-margin” parameter in a support vector machine. But maybe
they can or should be subject to training. This is the problem of hyper-parameter
selection.
If hyper-parameters are subject to training, what
differentiates them from regular parameters? They are often parameters
on the parameters, hence their name. For instance, the dimension of a is a typical
hyper-parameter. We wish to take the notion of hyper-parameters in a
broader sense. Hyper-parameters will designate parameters that are not
“easy” to train with a primary training algorithm that adjusts the primary
parameters a.
For instance, support vector machines solve a quadratic optimization problem
to compute a,
which has the nice property of having a unique global optimum. Adjusting
the soft-margin parameter or the kernel size is outside this nice optimization
framework and requires an auxiliary technique. It is common to resort
to a two-step process: splitting the training data into a first set used to
train a for fixed values of q, and a second
set (usually referred to as the validation set) used to select the most promising
value of q.
This two-step process is sometimes called two-level inference. Obviously,
one can envision having more than two levels of inference. But, we limit
ourselves to two levels.
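The two-step process above can be sketched as follows (a toy Python example with made-up data: the model is k-nearest-neighbors, the hyper-parameter q is the number of neighbors k, and the stored neighbors play the role of the primary parameters a):

```python
import math
from collections import Counter

def knn_predict(a_set, x, k):
    """Level-one inference: the a-set acts as the parameters a of f(x; a, q)."""
    neighbors = sorted(a_set, key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

def select_k(a_set, q_set, candidate_ks):
    """Level-two inference: score each value of q = k on the validation (q) set."""
    def validation_error(k):
        return sum(knn_predict(a_set, x, k) != y for x, y in q_set)
    return min(candidate_ks, key=validation_error)

# First set (a-set), including one mislabeled "noise" example near class 0:
a_set = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.15, 0.15), 1),
         ((1.0, 1.0), 1), ((0.8, 1.2), 1)]
# Second set (q-set), held out to select the most promising q:
q_set = [((0.1, 0.1), 0), ((0.9, 1.0), 1), ((0.05, 0.2), 0)]

best_k = select_k(a_set, q_set, candidate_ks=[1, 3, 5])
```

Here the mislabeled training point makes k=1 err on the validation set, so the second level of inference favors a larger neighborhood.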
Other questions arise: How to search hyper-parameter
space? How to best split the data? How to assess the models or rank
them? These are some of the questions that the competitors will need
to solve to win the challenge.
It can be argued that hyper-parameter selection
is a subset of the problem of model selection, each hyper-parameter setting
providing a new model. With this interpretation, f(x; a, q) is not a model,
it is a family of models. But reciprocally, model selection can be cast into
a hyper-parameter selection problem. To see that, we must not limit ourselves
to continuous or ordinal hyper-parameters. Nominal hyper-parameters representing
a choice among an unordered set of possibilities allow us to cast model selection
into a hyper-parameter selection problem. We could potentially define a hyper-model
containing all our models of interest, with a model identifier as hyper-parameter.
Feature selection is a particular case of model
selection that we have recently studied in an organized challenge (see
the workshop page).
Each input can be associated with a boolean hyper-parameter determining
whether it should be used or not in the model. In the problems we studied
with tens of thousands of input variables (or features), a large fraction
can be eliminated without performance degradation because they are either
non-informative or redundant. In some cases, performance is even improved.
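The boolean hyper-parameters described above can be sketched as a feature mask (a toy Python example with made-up data; the third input variable is pure noise):

```python
import math

def masked_dist(x1, x2, mask):
    """Distance computed only over the features whose mask bit is True."""
    return math.sqrt(sum((a - b) ** 2 for a, b, m in zip(x1, x2, mask) if m))

def nn_predict(train, x, mask):
    """Nearest-neighbor prediction restricted to the selected features."""
    return min(train, key=lambda pair: masked_dist(pair[0], x, mask))[1]

# Third feature is non-informative noise with a large scale:
train = [((0.0, 0.0, 9.0), 0), ((1.0, 1.0, -3.0), 1)]

# One boolean hyper-parameter per input variable:
all_features = nn_predict(train, (0.1, 0.2, -5.0), (True, True, True))
masked = nn_predict(train, (0.1, 0.2, -5.0), (True, True, False))
```

With the noise feature included, the nearby class-0 point is misclassified; masking it out (one hyper-parameter setting) restores the correct prediction, illustrating how eliminating non-informative features can improve performance.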
Similarly, one can decide whether or not to use
particular training examples. This may be useful if the data are noisy and
contain mislabeled examples or meaningless examples. We intend to include
these aspects of model selection in our challenge.
How is model selection
usually done?
Model selection is a problem typical of the "risk
minimization" inference methods. One method is to use theoretical approximations
of the generalization error (or regularized objective functions) with
formulas of the type:
Test_error ~ Training_error + e(model_complexity).
The test error is also called "expected risk" and the training
error is called "empirical risk". By directly optimizing approximations
of the expected risk, one avoids splitting the training data.
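Selection by a formula of this type can be sketched as follows (a toy Python example; the penalty e(model_complexity) = c * complexity / n_train and all the numbers are made up for illustration, since choosing a good penalty is precisely the open problem):

```python
def estimated_test_error(training_error, complexity, n_train, c=0.5):
    """Test_error ~ Training_error + e(model_complexity), with a
    hypothetical penalty e = c * complexity / n_train."""
    return training_error + c * complexity / n_train

# Hypothetical (training_error, complexity) pairs for three candidate models:
models = {"small": (0.20, 3), "medium": (0.10, 10), "large": (0.05, 60)}
n_train = 100

best = min(models, key=lambda m: estimated_test_error(*models[m], n_train))
```

No data split is needed: the penalized training error alone ranks the models, trading training fit against complexity.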
Another option is to use cross-validation. The training
data are split in multiple ways into two complementary subsets. We shall
call the first set a-set and the
second set q-set.
They are commonly called training set and validation set, but, in the
challenge they will be both extracted from the available training data.
For each value of q, the a-set is used
to estimate a. Then,
the q-set
is used to provide a score for q. If a single
data split is used, the best q is chosen on
the basis of this score. If multiple data splits are made, the scores for
each data split can be combined (e.g. averaged) and q chosen on the
basis of the combined score. Finally, a new predictor can be trained using
the entire training set and the chosen q to get a better
estimate of a.
There are many flavors of cross-validation. “Bootstrap”
consists in randomly drawing multiple data splits, with replacement.
K-fold consists in splitting the data into K subsets of equal size,
training on K-1 subsets and testing on the remaining subset. Subsets
are then rotated in all possible ways. A limiting case of K-fold is the
leave-one-out method in which a single example is kept for testing in
each fold. The leave-one-out estimator may be unreliable because
it has a high variance (Vapnik, 1982). But it is convenient because,
for many methods, there exist approximations or exact formulas allowing
one to compute the leave-one-out error by training a single model (instead
of as many models as there are examples) (Vapnik, 1982; Efron & Tibshirani, 1993;
Amari et al., 1996; Opper & Winther, 1999; Vapnik & Chapelle, 2000; Monari & Dreyfus,
2000; Monari & Dreyfus, 2002).
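The K-fold procedure just described can be sketched as follows (a toy Python example; the model is a trivial majority-class predictor and the data are made up, just to keep the sketch small):

```python
from collections import Counter

def k_fold_score(data, k, train_fn, error_fn):
    """Split data into K subsets; train on K-1, test on the remaining one,
    rotating the held-out subset, and average the K scores."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(error_fn(train_fn(train), held_out))
    return sum(scores) / k  # combined (averaged) score

def majority_class(train):
    """Trivial 'model': always predict the most frequent training label."""
    return Counter(y for _, y in train).most_common(1)[0][0]

def error_rate(label, test):
    return sum(label != y for _, y in test) / len(test)

data = [((i,), 0) for i in range(8)] + [((i,), 1) for i in range(4)]
avg_error = k_fold_score(data, k=4, train_fn=majority_class, error_fn=error_rate)
```

With K equal to the number of examples, the same loop becomes the leave-one-out method discussed above.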
Finally, it should be mentioned that, if there are
enough training data available, the training error itself can be used
to perform model selection. As a refinement, statistical tests may be used
to assess the significance of performance differences between models (Rivals &
Personnaz, 1998).
The approaches mentioned above (performance bounds,
cross-validation, and training error) leave open the problem of finding
a strategy to explore hyper-parameter space. Exhaustive search is possible
for discrete hyper-parameters, but often computationally intractable.
Other search strategies include grid search, pattern search (Lewis & Torczon,
1996; Momma & Bennett, 2002), and all the common search techniques, when
applicable (gradient descent, genetic algorithms, simulated annealing,
greedy methods, relaxation techniques, etc.). Recently, new methods have
been proposed that exploit the availability of unlabeled data in addition
to the training data (Schuurmans, 1997). Also, there has been a surge of
interest in stability-based methods (see, e.g., Bousquet & Elisseeff, 2001; Lange
et al., 2002).
Does model selection fit within the Bayesian framework?
There are two main inference methods in machine
learning: risk minimization and Bayesian inference. Model selection is usually
thought of as a problem of the risk minimization framework. Still, we tie
Bayesian inference to our challenge in two ways:
- Via the MAP (maximum a posteriori)
inference.
- Considering the Bayesian methods as ensemble
methods.
Bayesian methods assume that an underlying process
produced the data by drawing samples according to a given distribution.
That process, in turn, has been drawn at random according to a hyper-distribution
on the model family space. Thus, if we consider a model family y=f(x; a, q), the goal
of Bayesian learning is to estimate:
P(a | Dtrain, q) = P(Dtrain | a, q) P(a | q) / P(Dtrain | q).
Here, to simplify notations, Dtrain = (Xtrain, Ytrain). The
quantity P(a | q)
is called the prior. It is parameterized by the hyper-parameter vector q. The effect
of the prior is equivalent to that of a "structure" in structural risk
minimization, or a regularizer. The difference with the model selection
setting is that no model ends up being selected. Instead, the posterior
distribution of models, P(a | Dtrain, q), is estimated. Predictions
are carried out by voting among all the models. Models that explain the
data better have a larger weight. Model selection in the Bayesian setting
can be performed by selecting the most probable model (Maximum A Posteriori,
or MAP), instead of letting all models vote. A Bayesian predictor
is a meta-predictor that uses the votes of several predictors. Methods
that sample the posterior distribution to approximate the ideal Bayesian
predictor (e.g. Markov chain Monte-Carlo methods) are, from our point of
view, ensemble methods, which can be tested in the framework of our challenge.
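The distinction between MAP selection and Bayesian voting can be sketched on a toy discrete model family (a made-up Python example: "biased coins" whose head probability is the parameter a, with a uniform prior and fabricated count data):

```python
def posterior(models, prior, heads, tails):
    """P(a | D, q) = P(D | a, q) P(a | q) / P(D | q) for a discrete set of a values."""
    likes = [p * a**heads * (1 - a)**tails for p, a in zip(prior, models)]
    z = sum(likes)                      # P(D | q): normalization constant
    return [l / z for l in likes]

models = [0.2, 0.5, 0.8]                # candidate values of the parameter a
prior = [1/3, 1/3, 1/3]                 # uniform prior P(a | q)
post = posterior(models, prior, heads=7, tails=3)

# Model selection in the Bayesian setting: pick the most probable model (MAP).
map_model = models[max(range(len(models)), key=lambda i: post[i])]

# Ideal Bayesian predictor: all models vote, weighted by their posterior.
bayes_prediction = sum(w * a for w, a in zip(post, models))
```

Models that explain the data better get larger posterior weights; MAP keeps only one of them, while the Bayesian vote keeps them all.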
What is so challenging about model selection?
We think model selection is a hard problem that
does not yet have solutions that work across application domains and model
families. It involves devising good search strategies and either good formulas for
regularized objective functions or an optimal way of splitting the data into a
training a-set
and a validation q-set.
There is no consensus right now about which approach works best.
Researchers should unite forces to beat this problem,
not on a specific data set, with a specific model family, or with a
specific inference technique (Bayesian or risk minimization), but in
a generic way. For the challenge, we have packaged
models with a standard interface. This is an opportunity
to test model selection strategies in a controlled environment.
Can I submit results on "ensemble methods"?
Participants are encouraged to submit results on ensemble methods.
Model selection in the proposed paradigm means searching in a space of
possible models and selecting one model (possibly by ranking all the models
explored and possibly by using some ranking score). Ensemble methods in
the proposed paradigm mean searching in a space of possible models and
creating a classifier that votes on the decisions of the classifiers explored
with some voting weights. So, both problems are very similar since voting
weights and ranking scores play similar roles. Think of a Bayesian method
vs. a MAP method. We are not imposing anything about how the search should
be conducted or how the scores/weights should be computed. These are the
problems to be solved. Boosting, bagging, MCMC, etc. are possible strategies
that fit in that framework. The CLOP interface allows participants to make
entries with ensembles of models, using the models provided.
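The ensemble view above can be sketched as a weighted vote (a toy Python example; the base classifiers and their weights are made up, since how the weights are computed — boosting, bagging, posterior weights — is exactly what is left open):

```python
def weighted_vote(classifiers, weights, x):
    """Weighted vote of 2-class predictors returning labels in {-1, +1}."""
    score = sum(w * clf(x) for clf, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1

# Three hypothetical decision stumps on a scalar input:
classifiers = [lambda x: 1 if x > 0.3 else -1,
               lambda x: 1 if x > 0.5 else -1,
               lambda x: 1 if x > 0.7 else -1]
weights = [0.2, 0.5, 0.3]               # voting weights (made up)

vote_high = weighted_vote(classifiers, weights, 0.6)
vote_low = weighted_vote(classifiers, weights, 0.2)
```

Replacing the weighted vote by "keep only the top-ranked classifier" turns this ensemble back into model selection, which is the similarity the paragraph above points out.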
Can a participant give an arbitrarily hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE
PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM ANY
EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND
THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY
RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE,
DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE
CHALLENGE.
Who can I ask for more help?
For all other questions, email modelselect@clopinet.com.
Last updated: February 1st, 2006.