Where do I get CLOP?
You can just download CLOP. Follow the installation instructions in the README file.
Are there restrictions to
using CLOP?
Model selection game participants must use CLOP on the challenge
data, provided that they read the disclaimer
and agree to the license.
For other uses of CLOP, please contact the organizers at modelselect@clopinet.com.
Is the sample code part of CLOP?
Yes, but you do not need CLOP to run the sample code. It can
be downloaded
separately.
Do I need to use
CLOP to participate in the game?
Yes. You have to use a CLOP model (or a combination
of CLOP models using chains or ensembles) and you have to save your
model and submit it with your challenge entry. This will ensure reproducibility
and validity of your results.
What are learning objects?
Learning objects are Matlab objects that provide methods
for training and testing.
What are Matlab objects?
If you do not know anything about Matlab objects or object-oriented
programming, don't be scared away. You can learn how to use
CLOP from examples, but you will benefit from reading the (short)
Matlab help on objects. Briefly:
isa | checks the class type
class | returns the class
methods | returns all the methods
struct | lets you examine the data members
fieldnames | returns a cell array with the field names
How do I get started?
In the directory sample_code,
you will find an example called main.m,
which loads data and runs example models. Use the functions help,
type, and whereis to view the documentation, the code, and
the location of functions or objects. There is now also a getting started manual in PDF.
Tell me something about the
interface
In the Spider interface, both data and models are objects. Given
two matrices, X containing the examples (patterns in rows and features
in columns) and Y containing the target values, you can construct a data
object:
> training_data = data(X, Y);
The resulting object has 2 members: training_data.X and training_data.Y.
Models are derived from the class algorithm. They are constructed
using a set of hyperparameters. Those are provided as a cell array of
strings:
> hyperparameters = {'h1=val1', 'h2=val2'};
> untrained_model = algorithm(hyperparameters); % Call to the constructor
In this way, hyperparameters can be provided in any order or omitted.
Omitted hyperparameters take default values. To find out about the default
values and allowed hyperparameter range, use the "default" method:
> default(algorithm)
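Because hyperparameters are passed as 'name=value' strings, it can be convenient to build them from numeric variables, for instance when sweeping over candidate values. The following is a minimal sketch in plain Matlab, not part of CLOP; the variable names (shrinkage_grid, hyperparameter_sets) are assumptions chosen for illustration, and the hyperparameter names units and shrinkage are those of the neural object.

```matlab
% Build one hyperparameter cell array per candidate shrinkage value.
units = 10;
shrinkage_grid = [0.001, 0.01, 0.1];   % illustrative candidate values

hyperparameter_sets = cell(1, numel(shrinkage_grid));
for i = 1:numel(shrinkage_grid)
    % Each entry is a cell array of 'name=value' strings.
    hyperparameter_sets{i} = {sprintf('units=%d', units), ...
                              sprintf('shrinkage=%g', shrinkage_grid(i))};
end
```

Each element of hyperparameter_sets can then be passed to a model constructor in the same way as the hand-written cell array above.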
Models all have at least 2 methods: train and test:
> [training_data_output, trained_model] = train(untrained_model, training_data);
> test_data_output = test(trained_model, test_data);
trained_model is a model object identical to untrained_model,
except that its parameters (some data members) have been updated by
training. Repeatedly calling train on the same model may have different
effects depending on the model. test_data_output is a data object
identical to test_data, except that the X member has been replaced
by the output of trained_model when test_data.X
has been processed. The test method does not look at the Y member. training_data_output
is a data object identical to training_data, except that the
X member has been replaced by the output of test(trained_model,
training_data);
Of course, you may give untrained_model and trained_model the
same name if you want.
I do not find the methods train and test in the @a_certain_model directory. Where are they?
Models are derived from the class algorithm; look for them in the directory
spider/basic/@algorithm. The train and test methods of algorithm
objects call the methods "training" and "testing" of the models.
I do not find certain methods in @a_certain_directory. Where are they?
CLOP overloads some of the Spider methods. You can find the missing
methods in the spider directory; the directory tree is the same. CLOP
methods take precedence over the Spider methods.
Is there a different interface
for preprocessing modules and classifiers?
No. The interface is the same. For preprocessing modules, training_data_output.X
and test_data_output.X are the preprocessed data. For classifiers,
they correspond to a vector of discriminant values.
How do I save models?
At the Matlab prompt, type:
> save('filename', 'modelname');
To include your model with your submission, use [dataname]_model
as the file name for the corresponding (trained) model. Hence, each submission
should include 5 models named: ada_model.mat, gina_model.mat, hiva_model.mat,
nova_model.mat, and sylva_model.mat.
To keep your models from making your submission too big to upload, please
set the global variable CleanObject to one before saving your model. This
will save the hyperparameters so we can reproduce your results, but not the
trained parameters. Please also save your trained model with CleanObject
set to zero for further reference. We might ask you to provide it to verify reproducibility.
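The two saves described above might be sketched as follows. This is an illustration under stated assumptions: my_model is a placeholder struct standing in for a trained CLOP model, the file name ada_model_full.mat is hypothetical, and CleanObject only affects what is saved when CLOP is on the Matlab path.

```matlab
global CleanObject

% Placeholder standing in for a trained CLOP model (assumption).
my_model = struct('name', 'placeholder');

% Light version for the submission: hyperparameters only.
CleanObject = 1;
save('ada_model.mat', 'my_model');

% Full version kept for reference (hypothetical file name).
CleanObject = 0;
save('ada_model_full.mat', 'my_model');
```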
How can I chain models?
You can create a chain object. A chain object behaves like any other
learning object. It has a train and a test method. The outputs of one
member of the chain are just fed into the next one. In this example,
feature standardization is used as a preprocessing step; the output is fed
into a neural network:
> my_chained_model = chain({standardize, neural({'units=10', 'shrinkage=0.01'})});
A chain can be trained and tested like any other model. If an element
of the chain is going to be re-used once trained as part of another
chain, it can be extracted with the curly bracket notation. For instance:
> my_model = my_chained_model{1};
will return the first element of the chain.
Chains of trained models can also be formed. Just beware of inconsistencies.
You must use chain objects to save your composite models.
Can I create ensembles of models?
Yes. You can create an ensemble object, e.g.
> my_model = ensemble({neural, kridge, naive});
Models of the same class with different hyperparameters, or models of
different classes, can be combined in this way. Chains can be part of ensembles,
and ensembles can be part of chains.
Like other objects, ensembles have a train and a test method.
The test method forwards the data to all the elements of the ensemble.
The output is a weighted sum of the discriminant values of the models,
plus a bias. Optionally, the sign of the discriminant values is used
in place of the discriminant values. The weights are set to one and the
bias to zero by default. Users are free to set the weights and the
bias to other values, using set_w and set_b, or by rewriting the training
method. By default, the training method trains all the members of the ensemble
and sets the voting weights to one and the bias to zero.
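The voting rule described above can be illustrated in plain Matlab, without CLOP. This is only a sketch of the arithmetic, not the ensemble object's actual code; the matrix D of discriminant values and the variable names are assumptions made for the example.

```matlab
% Discriminant values: one row per pattern, one column per ensemble member.
D = [ 0.8 -0.2  0.5;
     -0.4 -0.9  0.1;
      0.3  0.6 -0.7];

w = ones(size(D, 2), 1);   % default voting weights (one per member)
b = 0;                     % default bias

% Weighted sum of the discriminant values, plus the bias.
raw_vote = D * w + b;

% Optional variant: vote with the sign of the discriminant values.
signed_vote = sign(D) * w + b;
```

With signed votes, each member contributes +1 or -1 regardless of how confident it is, so the third pattern is decided by a two-to-one majority rather than by the magnitudes.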
Which models
are part of CLOP?
Models in the challenge_objects directory were designed for the challenge:
Object name | Function | Hyperparameters | Description
kridge | Classifier | shrinkage, kernel parameters (coef0, degree, gamma), balance | Kernel ridge regression.
naive | Classifier | none | Naive Bayes.
gentleboost | Classifier | child, units, balance, subratio | GentleBoost algorithm.
neural | Classifier | units (num. hidden), shrinkage, maxiter, balance | Neural network (Netlab).
rf | Classifier | units (num. trees), mtry (num. feat. per split) | Random Forest (RF). We are presently having segmentation fault problems with this classifier for large datasets. We very much regret the passing of Leo Breiman, the father of CART and RF, who died on July 5, 2005 at age 77. We will work with the other authors of the original package to fix the problem.
svc | Classifier | shrinkage, kernel parameters (coef0, degree, gamma) | Support vector classifier (LibSVM).
s2n | Feature selection | f_max, w_min | Signal-to-noise ratio coefficient for feature ranking.
relief | Feature selection | f_max, w_min, k_num | Relief ranking criterion.
gs | Feature selection | f_max | Forward feature selection with Gram-Schmidt orthogonalization.
rffs | Feature selection | f_max, w_min, child | Random Forest used as feature selection filter. The "child" argument, which may be passed in the argument array, is an rf object with defined hyperparameters. If no child is provided, an rf with default values is used.
svcrfe | Feature selection | f_max, child | Recursive Feature Elimination filter using svc. The "child" argument, which may be passed in the argument array, is an svc object with defined hyperparameters. If no child is provided, an svc with default values is used.
standardize | Preprocessing | center | Standardization of the features (the columns of the data matrix are divided by their standard deviation; optionally, the mean is first subtracted if center=1).
normalize | Preprocessing | center | Normalization of the rows of the data matrix (optionally the mean of the rows is subtracted first).
shift_n_scale | Preprocessing | offset, factor, take_log | Performs X <- (X-offset)/factor globally on the data matrix. Optionally performs in addition log(1+X). offset and factor are set as hyperparameters, or subject to training.
pc_extract | Preprocessing | f_max | Extraction of features with principal component analysis.
subsample | Preprocessing | p_max, balance | Takes a subsample of the training patterns. The member pidx is set to a random subset of p_max patterns by training, unless it is set "by hand" to the indices of the patterns to be kept, with the method set_idx. May be used to downsize the training set or exclude outliers.
bias | Postprocessing | option | Finds the best threshold on the real-valued outputs of the classifiers, according to one of several post-processing criteria.
chain | Grouping | child | A chain of models, each feeding its outputs as inputs to the next one. The "child" argument is an array of models.
ensemble | Grouping | child, signed_output | A group of models voting to make the final decision. The "child" argument is an array of models. The default training method trains all the members of the ensemble and sets the voting weights to one and the bias to zero. Those can be set to different values with the methods set_w and set_b0. Alternatively, the training method may be overloaded. The hyperparameter signed_output indicates whether the sign of the outputs should be taken prior to voting.
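To make the preprocessing descriptions in the table concrete, here is a small plain-Matlab sketch of the arithmetic behind standardize (with center=1) and shift_n_scale (with offset and factor left to their data-driven defaults). It does not call CLOP and is not its implementation; in particular, the use of the N-1 normalized standard deviation is an assumption.

```matlab
X = [1 2; 3 6; 5 10];      % patterns in rows, features in columns
n = size(X, 1);

% standardize, center=1: subtract column means, divide by column std.
mu = mean(X, 1);
sigma = std(X, 0, 1);      % assumption: N-1 normalization
X_std = (X - repmat(mu, n, 1)) ./ repmat(sigma, n, 1);

% shift_n_scale with offset=[] and factor=[]: values taken from the data.
offset = min(X(:));                  % global minimum of X
factor = max(X(:) - offset);         % global maximum of X - offset
X_scaled = (X - offset) / factor;    % now spans [0, 1]

% Optional take_log=1 step, applied after shifting and scaling.
X_logged = log(1 + X_scaled);
```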
What are reasonable hyperparameter
values?
Kernel methods use a kernel k(x, y) = (coef0 + x . y)^degree exp(-gamma ||x - y||^2)
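As a quick check of how the three kernel parameters interact, the formula can be written as a Matlab anonymous function. This is a sketch for row vectors x and y, independent of the CLOP code; the function name kernel and the sample vectors are assumptions.

```matlab
% k(x, y) = (coef0 + x . y)^degree * exp(-gamma * ||x - y||^2), x and y row vectors.
kernel = @(x, y, coef0, degree, gamma) ...
    (coef0 + x * y') .^ degree .* exp(-gamma * sum((x - y) .^ 2));

x = [1 2];
y = [3 4];

k_linear = kernel(x, y, 0, 1, 0);     % coef0=0, degree=1, gamma=0: plain dot product
k_poly   = kernel(x, y, 1, 2, 0);     % polynomial kernel of degree 2
k_rbf    = kernel(x, y, 0, 0, 0.5);   % degree=0 leaves only the Gaussian factor
```

Setting gamma to 0 switches off the Gaussian factor, and setting degree to 0 switches off the polynomial factor, so the same expression covers linear, polynomial, and radial kernels.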
Hyperparameter | Default value | Range | Description and comments
coef0 | 0 for svc, 1 for kridge | [0, Inf] | Kernel parameter (bias). Often taken as 0 or 1.
degree | 1 | [0, Inf] | Kernel parameter (polynomial degree). Usually taken between 0 and 10. Larger values increase the model capacity.
gamma | 0 | [0, Inf] | Kernel parameter (inverse window/neighborhood width). The range of values may depend upon the geometry of the space (e.g. distance between closest examples of opposite classes). Larger values increase the model capacity.
shrinkage | 1e-14 | [0, Inf] | For kernel methods: small value (ridge) added to the diagonal of the kernel matrix. For neural networks: weight decay. Acts as a regularizer or shrinkage parameter to prevent overfitting.
balance | 0 | {0, 1} | Flag indicating whether to enable class balancing, i.e. compensating for the imbalance in the number of examples in the two classes.
subratio | 0.8 | [0, 1] | For boosting methods, which subsample the training data at each iteration according to the example weights: the fraction of the examples to include in the subsampled set.
units | 10 for neural, 100 for rf, 5 for gentleboost | [0, Inf] | Number of hidden units (for a 2-layer neural network) or number of trees (for rf). For gentleboost, it is the number of weak learners.
maxiter | 100 | [0, Inf] | Maximum number of iterations (for a 2-layer neural network).
mtry | [] | [0, Inf] | Number of candidate features per split (for rf). If [], it is set to sqrt(feature_number).
f_max | Inf | [0, Inf] | Maximum number of features selected. If f_max=Inf then no limit is set on the number of features.
p_max | Inf | [0, Inf] | Maximum number of patterns to train on. If p_max=Inf then no limit is set on the number of patterns.
w_min | -Inf | [-Inf, Inf] | Threshold on the ranking criterion W. If W(i) <= w_min, the feature i is eliminated. W is non-negative. A negative value of w_min means all the features are kept.
k_num | 4 | [0, Inf] | Number of neighbors in the Relief algorithm.
child | a model or an array | NA | A learning object passed as argument to an ensemble or feature selection object, or an array passed to a grouping object.
center | 1 for standardize and 0 for normalize | {0, 1} | Flag indicating whether the data should be centered (the columns for standardize and the rows for normalize).
offset | [] | [-Inf, Inf] | Offset value to be subtracted from the data matrix X. If [], it is set to the min of X.
factor | [] | [0+, Inf] | Scaling factor by which the data matrix X gets divided. If [], it is set to max(X-offset).
take_log | 0 | {0, 1} | Flag indicating whether one should take the log of 1+X.
signed_output | 0 | {0, 1} | Flag indicating for the ensemble object whether the sign of the output of the classifiers should be taken prior to voting. Not to be confused with the private member use_signed_output that is fixed to 0 for all challenge learning objects, so that we can compute the AUC.
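The two feature selection thresholds, w_min and f_max, can be pictured with a short plain-Matlab sketch. It assumes that a ranking criterion W has already been computed for each feature, and that features are first filtered by w_min and then truncated to the f_max best; the order in which CLOP applies the two limits is an assumption here, as are the sample values.

```matlab
W = [0.9 0.1 0.5 0.0 0.7];   % illustrative ranking criterion, one value per feature
w_min = 0.05;                % features with W(i) <= w_min are eliminated
f_max = 3;                   % keep at most f_max features

% Rank features from best to worst.
[sorted_W, order] = sort(W, 'descend');

% Drop features at or below the threshold, then keep the f_max best.
kept = order(sorted_W > w_min);
kept = kept(1:min(f_max, numel(kept)));
```

With f_max=Inf the last line keeps every feature that passed the w_min threshold, which matches the "no limit" behavior described in the table.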
How do I do model selection
with CLOP?
You need to write your own code. You can check some
of the Spider functions that implement cross-validation,
such as param, cv, and gridsel.
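As a starting point for your own code, the overall shape of a hold-out search over one hyperparameter might look like the sketch below. It is plain Matlab and does not use CLOP or Spider: a closed-form linear ridge regression stands in for the learning object, and the toy data, the split, and the candidate shrinkage values are all assumptions. With CLOP, the two lines that fit and apply the stand-in learner would be replaced by calls to train and test on a model built with the candidate hyperparameters.

```matlab
% Toy data: 12 patterns in rows, 2 features in columns, targets in {-1, +1}.
X = [1 2; 2 1; 3 3; 2 3; 1 1; 3 2; -1 -2; -2 -1; -3 -3; -2 -3; -1 -1; -3 -2];
Y = [ones(6, 1); -ones(6, 1)];

% Hold-out split: odd rows for training, even rows for validation.
train_idx = 1:2:12;
valid_idx = 2:2:12;

shrinkage_grid = [1e-3, 1e-1, 10];            % candidate hyperparameter values
valid_error = zeros(size(shrinkage_grid));

for i = 1:numel(shrinkage_grid)
    lambda = shrinkage_grid(i);
    Xtr = X(train_idx, :);
    % Stand-in learner: linear ridge regression in closed form.
    w = (Xtr' * Xtr + lambda * eye(size(Xtr, 2))) \ (Xtr' * Y(train_idx));
    prediction = sign(X(valid_idx, :) * w);
    % Fraction of misclassified validation patterns.
    valid_error(i) = mean(prediction ~= Y(valid_idx));
end

% Keep the candidate with the lowest validation error (first one on ties).
[best_error, best_i] = min(valid_error);
best_shrinkage = shrinkage_grid(best_i);
```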
How do I create an ensemble
of models?
Implement a training method for the ensemble
object. The gentleboost and the rf learning objects are
also ensemble methods. You can check some of the spider functions,
which implement ensemble methods.
Do I need to understand the
algorithms?
You can treat the algorithms as black boxes and play with the
hyperparameters. We provide very simple explanations in the getting started manual.
But I want to understand
the algorithms, where do I learn about them?
Each object comes with an on-line help that includes an appropriate
reference. Here are some useful links to information on CLOP implementations:
- Ridge regression (kridge).
- Naive Bayes (naive).
- Neural networks (neural).
- Support Vector Machines (svc).
- Random Forests (rf, rffs).
- Boosting (gentleboost).
- Feature selection methods (s2n, relief, gs, svcrfe).
Can I modify the code?
There are three ways in which you can
modify the code:
Should we include our models
with our submissions?
Yes, unless your models are very big. Contact the challenge web page administrator
for instructions if your models are too big to upload.
Can a participant give an arbitrary hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA
ARE PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM
ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE,
AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY
RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE,
DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE
CHALLENGE.
Who can I ask
for more help and how do I report bugs?
Email modelselect@clopinet.com.
Last updated: May 12, 2006.