- Model Selection.

- Ensemble Methods (weighting the decisions of several models may be seen as a way to circumvent Model Selection).

- Performance Prediction (the estimated performance on future data may be seen as a criterion to perform Model Selection).

- Discussion of the results of the Performance Prediction Challenge.

We consider supervised learning problems in which input-output pairs are provided for training and output predictions must be made on new test inputs. By "model" we mean a "predictive model" whose parameters are fitted with the training examples. Many predictive models possess "hyper-parameters" that must be selected prior to training (e.g., the number of hidden units in a neural network, or the stopping criterion of an algorithm). We might also have to choose among families of models with no relation to one another (neural networks, tree classifiers, nearest neighbors, etc.). The problem of model selection includes finding the best model family and the best hyper-parameter setting. Workshop participants are invited to present new model selection strategies or comparisons of existing strategies, addressing (1) how to search the model space and (2) how to find a good model ranking criterion.
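
To make these two ingredients concrete, here is a small, purely illustrative sketch (not part of the challenge protocol): the model space is a list of (family, hyper-parameter) candidates, and `score_fn` stands for whatever ranking criterion one chooses (cross-validation error, a penalized training error, ...). All names are hypothetical.

```python
# Hypothetical sketch of model selection as ranking candidates by a score.
def select_model(candidates, score_fn):
    """Return the (family, hyper_params) pair with the lowest score."""
    scored = [(score_fn(family, hyper_params), family, hyper_params)
              for family, hyper_params in candidates]
    return min(scored, key=lambda t: t[0])[1:]

# Example model space mixing model families and hyper-parameter settings.
candidates = [("nearest neighbors", {"k": 1}),
              ("nearest neighbors", {"k": 5}),
              ("neural network",    {"hidden_units": 10}),
              ("decision tree",     {"max_depth": 3})]
# best_family, best_params = select_model(candidates, score_fn=my_criterion)
```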

We are organizing a competition on five 2-class classification problems, the results of which will be discussed at the workshop. To learn more about the competition, see the challenge web site, the challenge FAQ, and the learning object FAQ.

How do I participate in the workshop?

Just register for WCCI 2006. The workshop is part of the IJCNN 2006 conference, one of three conferences held together in Vancouver, Canada, July 16-21, 2006. The Model Selection workshop will be held on

Submit a paper to the workshop:

1) Follow the instructions. Page limit: 8 pages. LaTeX users, use

(1) \documentclass[conference,letterpaper]{IEEEtran};

(2) pdflatex;

to ensure the page size is US letter, as required.

2) Submit your paper.

3) Send email to

March 15th, 2006, notification of acceptance.

April 15th, 2006, camera ready copy due.


**I am clueless, what is predictive modeling?**

We are interested in problems of predicting unknown outcomes from
observations. We limit ourselves to the setting in which the number
of observations is fixed and can be formatted as a vector of input variables
**x**. The prediction y will be a scalar. For problems of classification,
y will take discrete values, whereas for problems of regression, it will
take continuous values. Examples of classification problems include:

- Handwritten digit recognition: the predictor is presented with images of digits and must label them 0-9.
- Spam categorization: the predictor is presented with desirable emails and spam and must decide which ones to discard.
- Medical diagnosis: the predictor is presented with sets of symptoms or medical analyses and must determine the disease from which the patient suffers.

Examples of regression problems include:

- Medical prognosis: Predicting the time of recurrence of a disease from medical analyses.
- Drug screening: Predicting the binding or solubility of a molecule from chemical and physical descriptors.
- Marketing: Predicting how much a prospective customer will buy based on census data.

All the problems we are interested in can be cast into
the framework of inference or “learning from examples”: a function y=f(x; **a**), with parameters **a** adjusted on the training input-output pairs, is then used to predict the outputs of new inputs.

Note that, in the challenge, we will have two test sets: a validation set and a final test set. The validation set will be used to provide the competitors with feedback during the development period, reserving the final test set for the real test.

**Tell me more about
model selection and hyper-parameter selection.**

Given a number of model families and given training data, which
one will perform best on the test set? This is the problem of model
selection. For example, in classification, many model families are popular:
nearest neighbors, Parzen windows, neural networks, decision trees,
and support vector machines (see e.g. Andrew Moore's tutorial slides). But
no one knows which one will perform best on a given task. Can we predict
it using training data only? This is one of the central questions to be
answered by our challenge.

Most models (if not all) have hyper-parameters.
We shall denote them as **q** and write f(x; **a**, **q**). Those are
parameters that are not subject to training. For example: the number
of neighbors in the nearest neighbor method, the kernel size in the
Parzen windows method, the number of layers and units per layer in a neural
network, the “soft-margin” parameter in a support vector machine. But maybe
they can or should be subject to training. This is the problem of hyper-parameter
selection.
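
For instance, in the nearest neighbor method the number of neighbors k is fixed before training, while "training" merely stores the examples. A minimal, illustrative sketch (NumPy only, not part of the challenge software):

```python
import numpy as np

# Toy k-nearest-neighbor classifier: k is a hyper-parameter chosen before
# training; the "trained" parameters a are just the stored training examples.
# y_train must contain non-negative integer class labels.
def knn_predict(X_train, y_train, X_test, k=3):
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)       # distances to all training points
        nearest = y_train[np.argsort(d)[:k]]          # labels of the k closest points
        preds.append(np.bincount(nearest).argmax())   # majority vote among the k neighbors
    return np.array(preds)
```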

If hyper-parameters are subject to training, what
differentiates them from regular parameters? They are often parameters
on the parameters, hence their name. For instance, the dimension of **a** is a typical
hyper-parameter. We wish to take the notion of hyper-parameters in a
broader sense. Hyper-parameters will designate parameters that are not
“easy” to train with a primary training algorithm that adjusts the primary
parameters **a**.
For instance, support vector machines solve a quadratic optimization problem
to compute **a**,
which has the nice property of having a unique global optimum. Adjusting
the soft-margin parameter or the kernel size is outside this nice optimization
framework and requires an auxiliary technique. It is common to resort
to a two-step process: splitting the training data into a first set used to
train **a** for fixed values of **q**, and a second
set (usually referred to as the validation set) used to select the most promising
value of **q**.
This two-step process is sometimes called two-level inference. Obviously,
one can envision having more than two levels of inference. But, we limit
ourselves to two levels.
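
As an illustration of this two-level inference, the sketch below fits **a** on one part of the training data for each candidate **q** and scores **q** on the held-out part. The helper callables `train_fn` and `error_fn` are hypothetical placeholders, not challenge code.

```python
import numpy as np

def two_level_select(X, y, q_candidates, train_fn, error_fn, frac=0.7, seed=0):
    """Single-split two-level inference: fit a on the a-set, score q on the q-set."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    n_a = int(frac * len(y))
    a_idx, q_idx = idx[:n_a], idx[n_a:]                    # a-set / q-set split
    scores = {}
    for q in q_candidates:
        a = train_fn(X[a_idx], y[a_idx], q)                # level 1: train the primary parameters a
        scores[q] = error_fn(a, X[q_idx], y[q_idx], q)     # level 2: score this value of q
    best_q = min(scores, key=scores.get)
    return best_q, scores
```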

Other questions arise: How to search hyper-parameter
space? How to best split the data? How to assess the models or rank
them? These are some of the questions that the competitors will need
to solve to win the challenge.

It can be argued that hyper-parameter selection
is a subset of the problem of model selection, each hyper-parameter setting
providing a new model. With this interpretation, f(x; **a**, **q**) is not a model,
it is a family of models. But reciprocally, model selection can be cast into
a hyper-parameter selection problem. To see that, we must not limit ourselves
to continuous or ordinal hyper-parameters. Nominal hyper-parameters representing
a choice among an unordered set of possibilities allow us to cast model selection
into a hyper-parameter selection problem. We could potentially define a hyper-model
containing all our models of interest, with a model identifier as hyper-parameter.

Feature selection is a particular case of model
selection that we have recently studied in an organized challenge (see the workshop page).
Each input can be associated with a boolean hyper-parameter determining
whether it should be used or not in the model. In the problems we studied
with tens of thousands of input variables (or features), a large fraction
can be eliminated without performance degradation because they are either
non-informative or redundant. In some cases, performance is even improved.
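
In code, such a boolean hyper-parameter is simply a mask over the columns of the data matrix, searched like any other hyper-parameter. A tiny illustrative sketch with made-up data:

```python
import numpy as np

# Each feature gets a boolean hyper-parameter saying whether the model uses it.
def apply_feature_mask(X, mask):
    return X[:, np.asarray(mask, dtype=bool)]

X = np.arange(12.0).reshape(3, 4)                               # 3 examples, 4 features
X_reduced = apply_feature_mask(X, [True, False, True, False])   # keep features 0 and 2
# X_reduced has shape (3, 2); the primary parameters a are then trained on it.
```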

Similarly, one can decide whether or not to use
particular training examples. This may be useful if the data are noisy and
contain mislabeled examples or meaningless examples. We intend to include
these aspects of model selection in our challenge.

Model selection is a problem typical of the "risk
minimization" inference methods. One method is to use theoretical approximations
of the generalization error (or regularized objective functions) with
formulas of the type:

Test_error ~ Training_error + e(model_complexity).

The test error is also called "expected risk" and the training error is called "empirical risk". By directly optimizing approximations of the expected risk, one avoids splitting the training data.

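One classical instance of such a formula is an AIC-style penalty, where the complexity term grows with the number of free parameters relative to the number of training examples. The sketch below (made-up numbers, illustrative only) shows how the penalty can reverse a ranking based on training error alone:

```python
# score = training_error + penalty(model_complexity), here an AIC-like penalty.
def penalized_score(training_error, n_params, n_examples, penalty_weight=2.0):
    return training_error + penalty_weight * n_params / n_examples

n = 1000                                   # number of training examples
candidates = {"small net": (0.20, 50),     # (training error, number of free parameters)
              "large net": (0.12, 5000)}
scores = {name: penalized_score(err, k, n) for name, (err, k) in candidates.items()}
best = min(scores, key=scores.get)         # "small net" wins despite its higher training error
```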

Another option is to use cross-validation. The training
data are split in multiple ways into two complementary subsets. We shall
call the first set **a**-set and the
second set **q**-set.
They are commonly called training set and validation set, but, in the
challenge they will be both extracted from the available training data.
For each value of **q**, the **a**-set is used
to estimate **a**. Then,
the **q**-set
is used to provide a score for **q**. If a single
data split is used, the best **q** is chosen on
the basis of this score. If multiple data splits are made, the scores for
each data split can be combined (e.g. averaged) and **q** chosen on the
basis of the combined score. Finally, a new predictor can be trained using
the entire training set and the chosen **q** to get a better
estimate of **a**.
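
Putting the pieces together, here is a hedged sketch of that procedure with several random splits (again with hypothetical `train_fn` and `error_fn` callables; a real challenge entry would go through the CLOP interface instead):

```python
import numpy as np

def select_q_by_cv(X, y, q_candidates, train_fn, error_fn, n_splits=5, frac=0.8, seed=0):
    """Average the q-set error over several random splits, pick the best q,
    then retrain on the entire training set with that q."""
    rng = np.random.RandomState(seed)
    n_a = int(frac * len(y))
    avg_error = {q: 0.0 for q in q_candidates}
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        a_idx, q_idx = idx[:n_a], idx[n_a:]
        for q in q_candidates:
            a = train_fn(X[a_idx], y[a_idx], q)
            avg_error[q] += error_fn(a, X[q_idx], y[q_idx], q) / n_splits
    best_q = min(avg_error, key=avg_error.get)
    final_a = train_fn(X, y, best_q)        # refit a on all training data with the chosen q
    return best_q, final_a
```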

There are many flavors of cross-validation. “Bootstrap”
consists in randomly drawing multiple data splits, with replacement.
K-fold consists in splitting the data into K subsets of equal size,
training on K-1 subsets and testing on the remaining subset. Subsets
are then rotated in all possible ways. A limit case of K-fold is the
leave-one-out method, in which a single example is kept for testing in
each fold. The leave-one-out estimator may be unreliable because
it has a high variance (Vapnik, 1982). But it is convenient because,
for many methods, there exist approximations or exact formulas allowing
the leave-one-out error to be computed by training a single model (instead
of as many models as there are examples) (Vapnik, 1982; Efron and Tibshirani, 1993;
Amari et al., 1996; Opper and Winther, 1999; Vapnik and Chapelle, 2000; Monari and Dreyfus,
2000; Monari and Dreyfus, 2002).
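
As an example of such an exact formula (the classical "hat matrix" shortcut for linear least squares and ridge regression, given here only as an illustration), the leave-one-out residual of example i equals (y_i - yhat_i)/(1 - H_ii), so the leave-one-out error is obtained from a single fit:

```python
import numpy as np

def ridge_loo_mse(X, y, lam=1.0):
    """Exact leave-one-out mean squared error of ridge regression from one fit,
    using the hat-matrix shortcut: loo_residual_i = residual_i / (1 - H_ii)."""
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # hat matrix
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return float(np.mean(loo_residuals ** 2))

# Selecting the ridge hyper-parameter by exact leave-one-out, without n separate trainings.
rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(50)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: ridge_loo_mse(X, y, lam))
```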

Finally, it should be mentioned that, if there are
enough training data available, the training error itself can be used
to perform model selection. As a refinement, statistical tests may be used
to assess the significance of performance differences between models (Rivals-Personnaz,
1998).

The approaches mentioned above (performance bounds,
cross-validation, and training error) leave open the problem of finding
a strategy to explore hyper-parameter space. Exhaustive search is possible
for discrete hyper-parameters, but is often computationally prohibitive.
Other search strategies include grid search, pattern search (Lewis and Torczon,
1996; Momma and Bennett, 2002), and all the common search techniques, when
applicable (gradient descent, genetic algorithms, simulated annealing,
greedy methods, relaxation techniques, etc.). Recently, new methods have
been proposed that exploit the availability of unlabeled data in addition
to the training data (Schuurmans, 1997). Also, there has been a surge of
interest in stability-based methods (see, e.g., Bousquet and Elisseeff, 2001; Lange
et al., 2002).
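
As a small illustration of the simplest of these strategies, grid search enumerates a few values per hyper-parameter; random sampling of the same space (a common alternative not discussed above) uses the same evaluation budget. The hyper-parameter names below (a soft-margin constant C and a kernel width sigma) are only illustrative:

```python
import numpy as np
from itertools import product

# Grid search: all combinations of a few values per hyper-parameter.
C_values     = np.logspace(-2, 2, 5)          # 0.01, 0.1, 1, 10, 100
sigma_values = np.logspace(-1, 1, 3)          # 0.1, 1, 10
grid = list(product(C_values, sigma_values))  # 15 candidate (C, sigma) settings

# Random search: sample the same (log-scaled) space with the same budget.
rng = np.random.RandomState(0)
random_candidates = [(10 ** rng.uniform(-2, 2), 10 ** rng.uniform(-1, 1))
                     for _ in range(len(grid))]
# Either list of candidates can then be scored by cross-validation as above.
```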

There are two main inference methods in machine
learning: risk minimization and Bayesian inference. Model selection is usually
thought of as a problem of the risk minimization framework. Still, we tie
Bayesian inference to our challenge in two ways:

- Via MAP (maximum *a posteriori*) inference.
- By considering Bayesian methods as ensemble methods.

Bayesian methods assume that an underlying process
produced the data by drawing samples according to a given distribution.
That process, in turn, has been drawn at random according to a hyper-distribution
on the model family space. Thus, if we consider a model family y=f(x; **a**, **q**), the goal
of Bayesian learning is to estimate:

P(**a** | D_{train}, **q**) = P(D_{train} | **a**, **q**) P(**a** | **q**) / P(D_{train} | **q**).

Here, to simplify notations, D_{train} = (X_{train}, Y_{train}). The
quantity P(**a** | **q**)
is called the prior. It is parameterized by the hyper-parameter vector **q**. The effect
of the prior is equivalent to that of a "structure" in structural risk
minimization, or a regularizer. The difference with the model selection
setting is that no model ends up being selected. Instead, the *posterior*
distribution of models, P(**a** | D_{train}, **q**), is estimated. Predictions
are carried out by voting among all the models. Models that explain the
data better have a larger weight. Model selection in the Bayesian setting
can be performed by selecting the most probable model (Maximum A Posteriori,
or MAP), instead of letting all models vote. A Bayesian predictor
is a meta-predictor that uses the votes of several predictors. Methods
that sample the posterior distribution to approximate the ideal Bayesian
predictor (e.g. Markov chain Monte-Carlo methods) are, from our point of
view, ensemble methods, which can be tested in the framework of our challenge.
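
A toy numerical illustration of the difference between MAP selection and Bayesian averaging over a small set of candidate models (the log-evidence values stand for log P(D_{train} | **q**) and are entirely made up):

```python
import numpy as np

# Made-up log-evidences for three candidate models (flat prior over the three).
log_evidence = {"model A": -120.0, "model B": -118.0, "model C": -125.0}

# MAP-style model selection: keep only the most probable model.
map_model = max(log_evidence, key=log_evidence.get)        # "model B"

# Bayesian averaging: every model votes with a weight equal to its posterior probability.
names = list(log_evidence)
logs = np.array([log_evidence[m] for m in names])
weights = np.exp(logs - logs.max())                        # subtract max for numerical stability
weights /= weights.sum()
# A prediction is then sum_k weights[k] * prediction_of_model_k(x).
```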

We think model selection is a hard problem that
does not yet have solutions that work across application domains and model
families. It involves devising good search strategies and either good formulas
for regularized objective functions or an optimal way of splitting the data into
a training **a**-set
and a validation **q**-set.
There is no consensus right now about which approach works best.

Researchers should join forces to tackle this problem,
not on a specific data set, with a specific model family, or with a
specific inference technique (Bayesian or risk minimization), but in
a generic way. For the challenge, we have packaged models with a standard interface. This is an opportunity
to test model selection strategies in a controlled environment.

**Can I submit results on "ensemble
methods"?**

Participants are encouraged to submit results on ensemble methods. Model selection in the proposed paradigm means searching in a space of possible models and selecting one model (possibly by ranking all the models explored and possibly by using some ranking score). Ensemble methods in the proposed paradigm mean searching in a space of possible models and building a classifier that votes on the decisions of the classifiers explored, with some voting weights. So, both problems are very similar, since voting weights and ranking scores play similar roles. Think of a Bayesian method vs. a MAP method. We are not imposing anything about how the search should be conducted or how the scores/weights should be computed; these are the problems to be solved. Boosting, bagging, MCMC, etc. are possible strategies that fit in that framework. The CLOP interface allows participants to make entries with ensembles of models, using the models provided.

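For concreteness, here is a minimal sketch of such a weighted vote on 2-class (0/1) decisions; the three classifiers, their predictions, and the weights are all invented, and in a real entry the individual predictions would come from CLOP models:

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine 0/1 decisions of several classifiers by a weighted majority vote.
    predictions: (n_models, n_examples) array with values in {0, 1};
    weights: one non-negative weight per model (e.g. a ranking score)."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scores = weights @ predictions / weights.sum()   # weighted fraction voting "1"
    return (scores >= 0.5).astype(int)

# Three classifiers, five test examples; the last model carries most of the weight.
preds = [[1, 0, 1, 1, 0],
         [0, 0, 1, 0, 0],
         [1, 1, 0, 1, 1]]
combined = weighted_vote(preds, weights=[0.2, 0.2, 0.6])   # -> [1, 1, 0, 1, 1]
```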

**Can a participant give an
arbitrarily hard time to the organizers?**

DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE
PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM ANY
EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND
THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY
RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE,
DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE
CHALLENGE.

**Who can I ask for more
help? **

For all other questions, email __modelselect@clopinet.com__.

*Last updated: February 1st, 2006.*