
Model Selection Workshop FAQ

What is the scope of the workshop?
- Model Selection.
- Ensemble Methods (weighting the decisions of several models may be seen as a way to circumvent model selection).
- Performance Prediction (the estimated performance on future data may be seen as a criterion to perform model selection).
- Discussion of the results of the Performance Prediction Challenge.

What do you mean by "model selection"?
We consider supervised learning problems in which input-output pairs are provided for training and output predictions must be made on new test inputs. By "model" we mean a "predictive model" whose parameters are fitted with the training examples. Many predictive models possess "hyper-parameters" that must be selected prior to training (e.g. the number of hidden units in a neural network, or the stopping criterion of an algorithm). We might also have to choose among families of models with no relation to one another (neural networks, tree classifiers, nearest neighbors, etc.). The problem of model selection includes finding the best model family and the best hyper-parameter setting. The workshop participants are invited to present new model selection strategies or comparisons of existing strategies, which include (1) searching the model space; and (2) finding a good model ranking criterion.
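The two ingredients above, searching the model space and ranking the candidates, can be sketched on a toy problem. All data and models below are hypothetical, and the ranking criterion is simply the error on a held-out split of the training data:

```python
# Minimal sketch: model selection as (1) enumerating a model space and
# (2) ranking candidates by a selection criterion (held-out error).

def knn_predict(k, train, x):
    """Classify x by majority vote among its k nearest training points (1-D)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in neighbors)
    return 1 if votes * 2 >= k else 0

def error_rate(k, train, test):
    wrong = sum(knn_predict(k, train, x) != y for x, y in test)
    return wrong / len(test)

# Toy 2-class problem on the real line: class 1 above 0.5, class 0 below.
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
held_out = [(0.15, 0), (0.35, 0), (0.65, 1), (0.85, 1)]

# Search a (tiny) model space: one family (k-NN), hyper-parameter k.
ranking = sorted(range(1, 6, 2), key=lambda k: error_rate(k, train, held_out))
best_k = ranking[0]
```

In a real entry the model space would span several families and many hyper-parameter values, but the structure (search loop plus ranking criterion) stays the same.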

What is the "Performance Prediction Challenge"?
We are organizing a competition on five 2-class classification problems, the results of which will be discussed at the workshop. To learn more about the competition, see the challenge web site, the challenge FAQ, and the learning object FAQ.

How do I participate in the workshop?

Just register for WCCI 2006. The workshop is part of the IJCNN 2006 conference, one of three conferences held together in Vancouver, Canada, July 16-21, 2006. The Model Selection workshop will be held on Tuesday July 18, 2006.

How do I contribute a paper to the workshop?
Submit a paper to the IJCNN 2006 conference.
1) Follow the instructions. Page limit: 8 pages. LaTeX users, use
        (a) \documentclass[conference,letterpaper]{IEEEtran};
        (b) pdflatex;
to ensure the page size is US letter, as required.
2) Submit your paper. IMPORTANT: On the IJCNN 2006 submit page, under "Main research topic", select "S. SPECIAL SESSIONS -> S4. Model selection (Guyon)".
3) Send email to modelselect@clopinet.com to notify us of your submission, so we can verify it was forwarded to us.
Submission deadline: January 31st, 2006 (EXTENDED TO FEBRUARY 15th).
Notification of acceptance: March 15th, 2006.
Camera-ready copy due: April 15th, 2006.

Do I need to participate in the workshop to enter the challenge?
No. The challenge is run independently via the challenge web site; you can enter it without attending the workshop.

Do I need to enter the challenge to participate in the workshop?
No. Workshop contributions on model selection, ensemble methods, and performance prediction are welcome whether or not you enter the challenge.
I am clueless, what is predictive modeling?
We are interested in problems of predicting unknown outcomes from observations. We limit ourselves to the setting in which the number of observations is fixed and can be formatted as a vector of input variables x. The prediction y will be a scalar. For problems of classification, y will take discrete values, whereas for problems of regression, it will take continuous values. Examples of classification problems include:

Examples of regression problems include:
All the problems we are interested in can be cast into the framework of inference or “learning from examples”. A function y = f(x; a), or "model", which may have a parameter vector a, is used to make predictions. The components of its parameter vector a are adjusted during training using a set of input-output pairs {(x1, y1), (x2, y2), …, (xp, yp)}, called training examples. For non-parametric methods like the “nearest neighbor” method, the training examples are used directly, without parameter adjustment per se; but, in our setting, training examples will be considered parameters in a broader sense. The goal is not to optimize predictions on the training examples, but to optimize predictions on a set of test examples, for which the target values y are not available during training. Such optimization must be carried out, however, using the training input-output pairs only and, when available, the unlabeled test set. Since the true objective involves unknown information (the test set target values), objective functions that approximate the goal are derived. The simplest one is the prediction error on the training examples (the training error). But other objective functions may be better estimators of the true objective.
Note that, in the challenge, we will have two test sets: a validation set and a final test set. The validation set will be used to provide the competitors with feedback during the development period, reserving the final test set for the real test.
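The distinction between training error and test error can be sketched with a tiny least-squares fit. The toy data and the one-parameter model y = f(x; a) = a*x below are only an illustration:

```python
# Minimal sketch (toy data): fit y = f(x; a) = a*x on training pairs,
# then measure error on held-out test pairs. The closed-form
# least-squares solution is a = sum(x*y) / sum(x*x).

train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
test  = [(4.0, 4.3), (5.0, 4.8)]

a = sum(x * y for x, y in train) / sum(x * x for x, y in train)

def mse(pairs):
    """Mean squared prediction error of the fitted model on a set of pairs."""
    return sum((a * x - y) ** 2 for x, y in pairs) / len(pairs)

train_error, test_error = mse(train), mse(test)
```

On this toy data the training error underestimates the test error, which is precisely why it is only an approximate objective.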

Tell me more about model selection and hyper-parameter selection.
Given a number of model families and given training data, which one will perform best on the test set? This is the problem of model selection. For example, in classification, many model families are popular: nearest neighbors, Parzen windows, neural networks, decision trees, and support vector machines (see e.g. Andrew Moore's tutorial slides). But no one knows which one will perform best on a given task. Can we predict it using training data only? This is one of the central questions to be answered by our challenge.

Most models (if not all) have hyper-parameters. We shall denote them as q and write f(x; a, q). Those are parameters that are not subject to training, for example: the number of neighbors in the nearest neighbor method, the kernel size in the Parzen windows method, the number of layers and units per layer in a neural network, or the “soft-margin” parameter in a support vector machine. But maybe they can or should be subject to training. This is the problem of hyper-parameter selection.
If hyper-parameters are subject to training, what differentiates them from regular parameters? They are often parameters on the parameters, hence their name. For instance, the dimension of a is a typical hyper-parameter. We wish to take the notion of hyper-parameters in a broader sense. Hyper-parameters will designate parameters that are not “easy” to train with a primary training algorithm that adjusts the primary parameters a. For instance, support vector machines solve a quadratic optimization problem to compute a, which has the nice property of having a unique global optimum. Adjusting the soft-margin parameter or the kernel size is outside this nice optimization framework and requires an auxiliary technique. It is common to resort to a two-step process: splitting the training data into a first set used to train a for fixed values of q, and a second set (usually referred to as the validation set) used to select the most promising value of q. This two-step process is sometimes called two-level inference. Obviously, one can envision having more than two levels of inference, but we limit ourselves to two levels.
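The two-step process can be sketched as follows. The toy data are hypothetical, and one-dimensional ridge regression, with regularizer q and closed-form solution a = sum(x*y) / (sum(x*x) + q), stands in for any primary training algorithm:

```python
# Minimal sketch: two-level inference. Level 1 trains the parameter a of
# y = a*x for each fixed hyper-parameter q; level 2 picks the q whose
# model scores best on the q-set (a validation split of the training data).

a_set = [(1.0, 1.2), (2.0, 1.8), (3.0, 3.3)]   # used to fit a
q_set = [(1.5, 1.5), (2.5, 2.4)]               # used to score q

def train_a(pairs, q):
    """Level 1: closed-form ridge solution for fixed q."""
    return sum(x * y for x, y in pairs) / (sum(x * x for x, y in pairs) + q)

def score(a, pairs):
    """Mean squared error of the model y = a*x on a set of pairs."""
    return sum((a * x - y) ** 2 for x, y in pairs) / len(pairs)

candidates = [0.0, 0.1, 1.0, 10.0]
best_q = min(candidates, key=lambda q: score(train_a(a_set, q), q_set))

# Finally, retrain on all the training data with the selected q.
final_a = train_a(a_set + q_set, best_q)
```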
Other questions arise: How to search hyper-parameter space? How to best split the data? How to assess the models or rank them? These are some of the questions that the competitors will need to solve to win the challenge.
It can be argued that hyper-parameter selection is a subset of the problem of model selection, each hyper-parameter setting providing a new model. With this interpretation, f(x; a, q) is not a model, it is a family of models. But reciprocally, model selection can be cast into a hyper-parameter selection problem. To see that, we must not limit ourselves to continuous or ordinal hyper-parameters. Nominal hyper-parameters, representing a choice among an unordered set of possibilities, allow us to cast model selection into a hyper-parameter selection problem. We could potentially define a hyper-model containing all our models of interest, with a model identifier as hyper-parameter.
Feature selection is a particular case of model selection that we have recently studied in an organized challenge (see the workshop page). Each input can be associated with a boolean hyper-parameter determining whether it should be used or not in the model. In the problems we studied with tens of thousands of input variables (or features), a large fraction can be eliminated without performance degradation because they are either non-informative or redundant. In some cases, performances are even improved.
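A boolean-mask sketch of this idea follows. The toy data have one informative and one noisy feature, and the simple proportional model fit inside the scorer is an assumption for illustration only:

```python
# Minimal sketch: feature selection as boolean hyper-parameters.
# Each mask entry decides whether the corresponding input variable is
# used; masks are ranked by error on a held-out split.

from itertools import product

train = [([1.0, 9.0], 1.0), ([2.0, 3.0], 2.0), ([3.0, 7.0], 3.0)]
held_out = [([1.5, 5.0], 1.5), ([2.5, 1.0], 2.5)]
# Feature 0 predicts y; feature 1 is noise.

def fit_and_score(mask):
    """Fit y = a * (sum of selected features) by least squares; score it."""
    def s(x):  # sum of the features kept by the mask
        return sum(v for v, m in zip(x, mask) if m)
    a = sum(s(x) * y for x, y in train) / sum(s(x) ** 2 for x, _ in train)
    return sum((a * s(x) - y) ** 2 for x, y in held_out) / len(held_out)

masks = [m for m in product([0, 1], repeat=2) if any(m)]  # skip empty mask
best_mask = min(masks, key=fit_and_score)
```

With tens of thousands of features, exhaustive enumeration of masks is of course impossible; that is exactly where the search strategies discussed below come in.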
Similarly, one can decide whether or not to use particular training examples. This may be useful if the data are noisy and contain mislabeled examples or meaningless examples. We intend to include these aspects of model selection in our challenge.
How is model selection usually done?
Model selection is a problem typical of the "risk minimization" inference methods. One method is to use theoretical approximations of the generalization error (or regularized objective functions) with formulas of the type:
Test_error ~ Training_error + e(model_complexity).
The test error is also called "expected risk" and the training error is called "empirical risk". By directly optimizing approximations of the expected risk, one avoids splitting the training data.
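This penalized-training-error approach can be sketched on toy data. The concrete penalty e(complexity) = 0.1 × number of parameters below is an arbitrary assumption for illustration, not a recommended formula:

```python
# Minimal sketch: select between two model families by minimizing
# training_error + e(model_complexity), with no data split.

data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9)]

def mse(f, pairs):
    return sum((f(x) - y) ** 2 for x, y in pairs) / len(pairs)

def constant_model(pairs):
    # 1 parameter: the mean of the training targets.
    c = sum(y for _, y in pairs) / len(pairs)
    return (lambda x: c), 1

def linear_model(pairs):
    # 2 parameters: slope and intercept, by ordinary least squares.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    return (lambda x: slope * x + intercept), 2

def penalty(n_params):
    return 0.1 * n_params   # assumed complexity term e(.)

scored = []
for family in (constant_model, linear_model):
    f, n_params = family(data)
    scored.append((mse(f, data) + penalty(n_params), family.__name__))
best_family = min(scored)[1]
```

The extra parameters of the linear family are worth their penalty here; a looser fit would tip the balance the other way.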
Another option is to use cross-validation. The training data are split in multiple ways into two complementary subsets. We shall call the first one the a-set and the second one the q-set. They are commonly called the training set and the validation set but, in the challenge, they will both be extracted from the available training data. For each value of q, the a-set is used to estimate a. Then, the q-set is used to provide a score for q. If a single data split is used, the best q is chosen on the basis of this score. If multiple data splits are made, the scores for each data split can be combined (e.g. averaged) and q chosen on the basis of the combined score. Finally, a new predictor can be trained using the entire training set and the chosen q, to get a better estimate of a.
There are many flavors of cross-validation. “Bootstrap” consists of randomly drawing multiple data splits, with replacement. K-fold consists of splitting the data into K subsets of equal size, training on K-1 subsets and testing on the remaining subset; subsets are then rotated in all possible ways. A limit case of K-fold is the leave-one-out method, in which a single example is kept for testing in each fold. The leave-one-out estimator may be unreliable because it has a high variance (Vapnik, 1982). But it is convenient because, for many methods, there exist approximations or exact formulas allowing one to compute the leave-one-out error by training a single model (instead of as many models as there are examples) (Vapnik, 1982; Efron & Tibshirani, 1993; Amari et al., 1996; Opper & Winther, 1999; Vapnik & Chapelle, 2000; Monari & Dreyfus, 2000; Monari & Dreyfus, 2002).
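K-fold (and, as the limit case, leave-one-out) can be sketched generically. The scorer below, a constant mean predictor, is a hypothetical stand-in for any training-and-scoring routine:

```python
# Minimal sketch: K-fold cross-validation. Each fold serves once as the
# held-out subset while the others form the training subset; the K
# scores are averaged.

def k_fold_score(data, k, train_and_score):
    folds = [data[i::k] for i in range(k)]   # K roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        scores.append(train_and_score(train, held_out))
    return sum(scores) / k

def mean_model_mse(train, test):
    """Toy scorer: predict the mean training target, report test MSE."""
    mean_y = sum(y for _, y in train) / len(train)
    return sum((mean_y - y) ** 2 for _, y in test) / len(test)

data = [(x, float(x)) for x in range(6)]
cv = k_fold_score(data, 3, mean_model_mse)
loo = k_fold_score(data, len(data), mean_model_mse)  # leave-one-out
```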
Finally, it should be mentioned that, if there are enough training data available, the training error itself can be used to perform model selection. As a refinement, statistical tests may be used to assess the significance of performance differences between models (Rivals-Personnaz, 1998).
The approaches mentioned above (performance bounds, cross-validation, and the training error) leave open the problem of finding a strategy to explore hyper-parameter space. Exhaustive search is possible for discrete hyper-parameters, but is often computationally prohibitive. Other search strategies include grid search, pattern search (Lewis & Torczon, 1996; Momma & Bennett, 2002), and all the common search techniques, when applicable (gradient descent, genetic algorithms, simulated annealing, greedy methods, relaxation techniques, etc.). Recently, new methods have been proposed that exploit the availability of unlabeled data in addition to the training data (Schuurmans, 1997). Also, there has been a surge of interest in stability-based methods (see, e.g., Bousquet & Elisseeff, 2001; Lange et al., 2002).
Does model selection fit within the Bayesian framework?
There are two main inference methods in machine learning: risk minimization and Bayesian inference. Model selection is usually thought of as a problem of the risk minimization framework. Still, we tie Bayesian inference to our challenge in two ways:
Bayesian methods assume that an underlying process produced the data by drawing samples according to a given distribution. That process, in turn, has been drawn at random according to a hyper-distribution on the model family space. Thus, if we consider a model family y = f(x; a, q), the goal of Bayesian learning is to estimate:
P(a | Dtrain, q) = P(Dtrain | a, q) P(a | q) / P(Dtrain | q).
Here, to simplify notations, Dtrain = (Xtrain, Ytrain). The quantity P(a | q) is called the prior. It is parameterized by the hyper-parameter vector q. The effect of the prior is equivalent to that of a "structure" in structural risk minimization, or a regularizer. The difference with the model selection setting is that no single model ends up being selected. Instead, the posterior distribution of models P(a | Dtrain, q) is estimated. Predictions are carried out by voting among all the models. Models that explain the data better have a larger weight. Model selection in the Bayesian setting can be performed by selecting the most probable model (Maximum A Posteriori, or MAP), instead of letting all models vote. A Bayesian predictor is a meta-predictor that uses the votes of several predictors. Methods that sample the posterior distribution to approximate the ideal Bayesian predictor (e.g. Markov chain Monte Carlo methods) are, from our point of view, ensemble methods, which can be tested in the framework of our challenge.
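A discrete toy example of this posterior computation follows (hypothetical coin-flip models): the posterior weights drive the Bayesian vote, while MAP keeps only the most probable model:

```python
# Minimal sketch: posterior over a discrete set of models,
# P(a | D) proportional to P(D | a) P(a). The Bayesian predictor votes
# with posterior weights; MAP selects the single most probable model.

models = [0.2, 0.5, 0.8]             # candidate values of P(heads) = a
prior = {a: 1 / 3 for a in models}   # flat prior
data = [1, 1, 0, 1]                  # observed flips (1 = heads)

def likelihood(a, flips):
    p = 1.0
    for y in flips:
        p *= a if y == 1 else 1 - a
    return p

unnorm = {a: likelihood(a, data) * prior[a] for a in models}
z = sum(unnorm.values())             # evidence P(D)
posterior = {a: w / z for a, w in unnorm.items()}

bayes_pred = sum(posterior[a] * a for a in models)    # vote of all models
map_model = max(models, key=lambda a: posterior[a])   # single best model
```

The Bayesian prediction is pulled toward the MAP model but hedges with the others, which is the voting behavior described above.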
What is so challenging with model selection?
We think model selection is a hard problem that does not yet have solutions that work across application domains and model families. It involves devising good search strategies and either good regularized objective functions or an optimal way of splitting the data into a training a-set and a validation q-set. There is no consensus right now about which approach works best.
Researchers should join forces to attack this problem, not on a specific data set, with a specific model family, or with a specific inference technique (Bayesian or risk minimization), but in a generic way. For the challenge, we have packaged models with a standard interface. This is an opportunity to test model selection strategies in a controlled environment.

Can I submit results on "ensemble methods"?
Participants are encouraged to submit results on ensemble methods. Model selection in the proposed paradigm means searching in a space of possible models and selecting one model (possibly by ranking all the models explored and possibly by using some ranking score). Ensemble methods in the proposed paradigm means searching in a space of possible models and creating a classifier that votes on the decisions of the classifiers explored with some voting weights. So, both problems are very similar since voting weights and ranking scores play similar roles. Think of a Bayesian method vs. a MAP method. We are not imposing anything about how the search should be conducted or how the scores/weights should be computed. These are the problems to be solved. Boosting, bagging, MCMC, etc. are possible strategies that fit in that framework. The CLOP interface allows participants to make entries with ensembles of models, using the models provided.
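Under this view, an ensemble is just a weighted vote. The sketch below uses hypothetical threshold classifiers; the weights could come from validation scores, boosting, or posterior probabilities:

```python
# Minimal sketch: a weighted-vote ensemble of 2-class classifiers.
# Each base classifier returns a class in {0, 1}; the ensemble output
# is the class with the larger total vote weight.

def ensemble_predict(classifiers, weights, x):
    vote = sum(w * (1 if clf(x) == 1 else -1)
               for clf, w in zip(classifiers, weights))
    return 1 if vote > 0 else 0

# Three toy threshold classifiers on the real line.
classifiers = [lambda x: 1 if x > 0.4 else 0,
               lambda x: 1 if x > 0.5 else 0,
               lambda x: 1 if x > 0.6 else 0]
weights = [0.2, 0.5, 0.3]   # e.g. derived from validation scores

pred = ensemble_predict(classifiers, weights, 0.55)
```

Setting all but the largest weight to zero recovers model selection, which is the MAP-versus-Bayesian analogy made concrete.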

Can a participant give an arbitrarily hard time to the organizers?

Who can I ask for more help?
For all other questions, email modelselect@clopinet.com.

Last updated: February 1st, 2006.