What is CLOP?
CLOP stands for Challenge Learning Object Package. It is a machine learning package written in the Matlab(R) language, based on the interface of the Spider package developed at the Max Planck Institute.
Where do I get CLOP?
You can download CLOP from the challenge website (see the "Model" tab). Follow the installation instructions in the README file.
Are there restrictions on using CLOP?
For use in the challenge, read the disclaimer and agree to the license. For other uses of CLOP, please contact the organizers at modelselect@clopinet.com.
Is the sample code part of CLOP?
Yes, but you do not need CLOP to run the sample code; it can be downloaded separately.
Do I need to use CLOP to participate in the challenge?
No. But if you do, you may save your model and submit it with your challenge entry. This will ensure the reproducibility and validity of your results for the agnostic track.
What are learning objects?
Learning objects are Matlab objects with methods to train and test them.
What are Matlab objects?
If you do not know anything about Matlab objects and/or object-oriented programming, don't be scared away. You can learn how to use CLOP from examples. But you may definitely benefit from reading the (short) Matlab help on objects. Briefly:
- An object is a structure (i.e., it has data members) with a number of programs (or methods) associated with it. The methods may modify the data members.
- The methods of an object "my_object" are stored in a directory named @my_object, which must be in your Matlab path if you want to use the object (e.g., by calling addpath).
- One particular method, called the "constructor", is named my_object. It is called (possibly with some parameters) to create a new object. For example:
> my_object_instance = my_object(some_parameters);
- Once an instance is created, you can call a particular method. For example:
> my_results = my_object_method(my_object_instance, some_other_parameters);
- Note that my_object_instance should be the first argument of my_object_method. Matlab "knows" that, because my_object_instance is an instance of my_object, it must call the method my_object_method found in the directory @my_object. This allows method overloading (i.e., methods with the same name for different objects).
- Inheritance is supported in Matlab, so an object may be derived from another object. A child object inherits the methods of its parent. For example:
> my_results = my_parent_method(my_object_instance, some_other_parameters);
In that case, the method my_parent_method is found in the parent directory
@my_object_parent, unless of course it has been overloaded by a method
of the same name found in @my_object.
- Some useful functions:
isa | checks the class type
class | returns the class
methods | returns all the methods
struct | lets you examine the data members
fieldnames | returns a cell array with the field names
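These are standard Matlab functions, so you can use them to explore any CLOP object. For example, assuming CLOP is in your path and the matrices X and Y are already loaded (a sketch):
> d = data(X, Y);
> class(d)                % returns the string 'data'
> isa(d, 'data')          % returns 1 (true)
> methods('data')         % lists the methods of the data class
> fieldnames(struct(d))   % lists the data members, e.g. X and Y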
How do I get started?
In the directory sample_code, you will find an example called main.m, which loads data and runs example models. Use the functions help, type, and which to view the documentation, the code, and the location of the functions or objects. There is also a getting-started manual in PDF.
Tell me something about the interface
In the Spider interface, both data and models are objects. Given two matrices, X containing the examples (patterns as rows, features as columns) and Y containing the target values, you can construct a data object:
> training_data = data(X, Y);
The resulting object has two members: training_data.X and training_data.Y.
Models are derived from the class algorithm. They are constructed
using a set of hyperparameters. Those are provided as a cell array
of strings:
> hyperparameters = {'h1=val1', 'h2=val2'};
> untrained_model = algorithm(hyperparameters); % Call to the constructor
In this way, hyperparameters can be provided in any order or omitted. Omitted hyperparameters take default values. To find out about the default values and allowed hyperparameter ranges, use the "default" method:
> default(algorithm)
All models have at least two methods, train and test:
> [training_data_output, trained_model] = train(untrained_model, training_data);
> test_data_output = test(trained_model, test_data);
trained_model is a model object identical to untrained_model, except that its parameters (some data members) have been updated by training. Repeatedly calling train on the same model may have different effects depending on the model. test_data_output is a data object identical to test_data, except that the X member has been replaced by the output of trained_model applied to test_data.X. The test method does not look at the Y member. training_data_output is a data object identical to training_data, except that the X member has been replaced by the output of test(trained_model, training_data). Of course, you may give untrained_model and trained_model the same name if you want.
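Putting the pieces together, a minimal session might look like this (a sketch: the neural object and its hyperparameters are described further below, and the values shown are illustrative only):
> training_data = data(X, Y);        % X, Y: training examples and labels
> test_data = data(Xt, []);          % Xt: test examples; Y is not needed for testing
> untrained_model = neural({'units=10'});
> [training_data_output, trained_model] = train(untrained_model, training_data);
> test_data_output = test(trained_model, test_data);
% test_data_output.X now contains the discriminant values for the test examples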
I do not find the methods train and test in the @a_certain_model directory, where are they?
Models are derived from the class algorithm; look for these methods in the directory spider/basic/@algorithm. The train and test methods of algorithm objects call the methods "training" and "testing" of the models.
I do not find certain methods in @a_certain_directory, where are they?
CLOP overloads some of the Spider methods. You can find the missing methods in the Spider directory; the directory tree is the same. CLOP methods take precedence over the Spider methods.
Is there a different interface for preprocessing modules and classifiers?
No, the interface is the same. For preprocessing objects, training_data_output.X and test_data_output.X contain the preprocessed data. For classifiers, they correspond to a vector of discriminant values.
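For instance, chaining the two by hand (a sketch using the standardize preprocessing and svc classifier described later in this FAQ):
> [prep_output, trained_prep] = train(standardize, training_data);
> [train_output, trained_svc] = train(svc, prep_output);   % train_output.X: discriminant values
> test_output = test(trained_svc, test(trained_prep, test_data));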
How do I save models?
To save your model, type at the Matlab prompt:
> save_model([dataname '_model'], your_model, 1, 1);
Each submission should include 5 models named ada_model.mat, gina_model.mat, hiva_model.mat, nova_model.mat, and sylva_model.mat. To prevent your models from making your submission too big to upload, please set the last flag to one before saving your model, or save your model before training. This will save the hyperparameters so we can reproduce your results, but not the trained parameters. Please also save your trained models (setting the last flag to zero) for further reference, but do not upload them. We might ask you to provide them to verify reproducibility.
How can I chain models?
You can create a chain object. A chain object behaves like any other learning object: it has a train and a test method. The outputs of one member of the chain are fed into the next one. In this example, feature standardization is used as a preprocessing, and its output is fed into a neural network:
> my_chained_model = chain({standardize, neural({'units=10', 'shrinkage=0.01'})});
A chain can be trained and tested like any other model. If an element of the chain, once trained, is going to be re-used as part of another chain, it can be extracted with the bracket notation. For instance:
> my_model = my_chained_model{1};
will return the first element of the chain.
Chains of trained models can also be formed. Just beware of
inconsistencies.
You may use chain objects to save your composite models.
Can I create ensembles of models?
Yes. You can create an ensemble object, e.g.
> my_model = ensemble({neural, kridge, naive});
Models of the same class with different hyperparameters, or altogether different models, can be combined in this way. Chains can also be part of ensembles, and ensembles can be part of chains. Like other objects, ensembles have a train and a test method. The test method forwards the data to all the elements of the ensemble. The output is a weighted sum of the discriminant values of the models, plus a bias. Optionally, the sign of the discriminant values is used in place of the discriminant values. The weights are set to one and the bias to zero by default. Users are free to set the weights and the bias to other values, using set_w and set_b, or by rewriting the training method. By default, the training method trains all the members of the ensemble and sets the voting weights to one and the bias to zero.
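For instance, to weight the voters by hand (a sketch; check the on-line help of the ensemble object for the exact signatures of set_w and set_b):
> my_model = ensemble({neural, kridge, naive});
> [train_output, my_model] = train(my_model, training_data);
> my_model = set_w(my_model, [0.5; 0.3; 0.2]);   % trust the first member most
> test_output = test(my_model, test_data);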
Which models are part of CLOP?
Models in the challenge_objects directory were designed for the challenge:

Object name | Function | Hyperparameters | Description
kridge | Classifier | shrinkage, kernel parameters (coef0, degree, gamma), balance | Kernel ridge regression.
naive | Classifier | none | Naive Bayes.
gentleboost | Classifier | child, units, balance, subratio | GentleBoost algorithm.
neural | Classifier | units (num. hidden), shrinkage, maxiter, balance | Neural network (Netlab).
rf | Classifier | units (num. trees), mtry (num. features per split) | Random Forest (RF). We are presently having segmentation-fault problems with this classifier on large datasets. We deeply regret the passing of Leo Breiman, the father of CART and RF, who died on July 5, 2005 at age 77. We will work with the other authors of the original package to fix the problem.
svc | Classifier | shrinkage, kernel parameters (coef0, degree, gamma) | Support vector classifier (LibSVM).
s2n | Feature selection | f_max, w_min | Signal-to-noise ratio coefficient for feature ranking.
relief | Feature selection | f_max, w_min, k_num | Relief ranking criterion.
gs | Feature selection | f_max | Forward feature selection with Gram-Schmidt orthogonalization.
rffs | Feature selection | f_max, w_min, child | Random Forest used as a feature selection filter. The "child" argument, which may be passed in the argument array, is an rf object with defined hyperparameters. If no child is provided, an rf with default values is used.
svcrfe | Feature selection | f_max, child | Recursive Feature Elimination filter using svc. The "child" argument, which may be passed in the argument array, is an svc object with defined hyperparameters. If no child is provided, an svc with default values is used.
standardize | Preprocessing | center | Standardization of the features (the columns of the data matrix are divided by their standard deviation; optionally, the mean is first subtracted if center=1).
normalize | Preprocessing | center | Normalization of the rows of the data matrix (optionally, the mean of the rows is subtracted first).
shift_n_scale | Preprocessing | offset, factor, take_log | Performs X <- (X-offset)/factor globally on the data matrix. Optionally also performs log(1+X). offset and factor are set as hyperparameters, or subject to training.
pc_extract | Preprocessing | f_max | Extraction of features with principal component analysis.
subsample | Preprocessing | p_max, balance | Takes a subsample of the training patterns. The member pidx is set to a random subset of p_max patterns by training, unless it is set "by hand" to the indices of the patterns to be kept, with the method set_idx. May be used to downsize the training set or exclude outliers.
bias | Postprocessing | option | Finds the threshold for the real-valued classifier outputs that minimizes one of several post-processing criteria.
chain | Grouping | child | A chain of models, one feeding its outputs as inputs to the next one. The "child" argument is an array of models.
ensemble | Grouping | child, signed_output | A group of models voting to make the final decision. The "child" argument is an array of models. The default training method trains all the members of the ensemble and sets the voting weights to one and the bias to zero. The hyperparameter signed_output indicates whether the sign of the outputs should be taken prior to voting. The weights and bias can be set to different values with the methods set_w and set_b0. Alternatively, the training method may be overloaded.
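Combining entries from this table, a typical composite model chains a preprocessing, a feature selection filter, and a classifier (a sketch; the hyperparameter values are illustrative only):
> my_model = chain({standardize, s2n({'f_max=100'}), neural({'units=10', 'shrinkage=0.01'})});
> [train_output, my_model] = train(my_model, training_data);
> test_output = test(my_model, test_data);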
What are reasonable hyperparameter values?
Kernel methods use a kernel k(x, y) = (coef0 + x . y)^degree * exp(-gamma ||x - y||^2).
Hyperparameter | Default value | Range | Description and comments
coef0 | 0 for svc, 1 for kridge | [0, Inf] | Kernel parameter (bias). Often taken as 0 or 1.
degree | 1 | [0, Inf] | Kernel parameter (polynomial degree). Usually taken between 0 and 10. Larger values increase the model capacity.
gamma | 0 | [0, Inf] | Kernel parameter (inverse window/neighborhood width). The range of values may depend upon the geometry of the space (e.g., the distance between the closest examples of opposite classes). Larger values increase the model capacity.
shrinkage | 1e-14 | [0, Inf] | For kernel methods: small value (ridge) added to the diagonal of the kernel matrix. For neural networks: weight decay. Acts as a regularizer or shrinkage parameter to prevent overfitting.
balance | 0 | {0, 1} | Flag indicating whether to enable class balancing, i.e., compensating for the imbalance in the number of examples of the two classes.
subratio | 0.8 | [0, 1] | For boosting methods, which at each iteration subsample the training data according to the example weights: the fraction of examples to include in the subsampled set.
units | 10 for neural, 100 for rf, 5 for gentleboost | [0, Inf] | Number of hidden units (for a 2-layer neural network), number of trees (for rf), or number of weak learners (for gentleboost).
maxiter | 100 | [0, Inf] | Maximum number of training iterations (for a 2-layer neural network).
mtry | [] | [0, Inf] | Number of candidate features per split (for rf). If [], it is set to sqrt(feature_number).
f_max | Inf | [0, Inf] | Maximum number of features selected. If f_max=Inf, no limit is set on the number of features.
p_max | Inf | [0, Inf] | Maximum number of patterns to train on. If p_max=Inf, no limit is set on the number of patterns.
w_min | -Inf | [-Inf, Inf] | Threshold on the ranking criterion W. If W(i) <= w_min, feature i is eliminated. W is non-negative, so a negative value of w_min means all the features are kept.
k_num | 4 | [0, Inf] | Number of neighbors in the Relief algorithm.
child | a model or an array | NA | A learning object passed as argument to an ensemble or feature selection object, or an array of models passed to a grouping object.
center | 1 for standardize, 0 for normalize | {0, 1} | Flag indicating whether the data should be centered (the columns for standardize, the rows for normalize).
offset | [] | [-Inf, Inf] | Offset value subtracted from the data matrix X. If [], it is set to min(X).
factor | [] | (0, Inf] | Scaling factor by which the data matrix X is divided. If [], it is set to max(X-offset).
take_log | 0 | {0, 1} | Flag indicating whether to take log(1+X).
signed_output | 0 | {0, 1} | Flag indicating, for the ensemble object, whether the sign of the classifier outputs should be taken prior to voting. Not to be confused with the private member use_signed_output, which is fixed to 0 for all challenge learning objects so that we can compute the AUC.
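Note that, with the kernel formula above, setting gamma=0 yields a polynomial kernel, while setting degree=0 yields a Gaussian (RBF) kernel. For example (the values shown are illustrative only, not recommendations):
> my_poly_svc = svc({'coef0=1', 'degree=2', 'gamma=0', 'shrinkage=0.001'});   % polynomial kernel
> my_rbf_svc = svc({'coef0=0', 'degree=0', 'gamma=0.1', 'shrinkage=0.001'});  % Gaussian kernel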
Can I also use other Spider functions or objects?
Yes, sure. We have not tried and tested them all, so we do not guarantee we can help if you encounter problems, but we will do our best.
How do I do model selection with CLOP?
You can check some of the Spider functions that implement cross-validation, such as param, cv, and gridsel.
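For example, to estimate the performance of a model by cross-validation (a sketch: the 'folds' hyperparameter name is an assumption, check the Spider documentation for the exact syntax):
> my_cv = cv(neural({'units=10'}), {'folds=5'});
> [cv_results, my_cv] = train(my_cv, training_data);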
How do I create an ensemble of models?
Implement a training method for the ensemble object. The gentleboost and rf learning objects are also ensemble methods. You can check some of the Spider functions that implement ensemble methods.
Do I need to understand the algorithms?
You can treat the algorithms as black boxes and play with
the hyperparameters. We provide very simple explanations in the getting started manual.
But I want to understand the algorithms, where do I learn about them?
Each object comes with an on-line help that includes an appropriate
reference. Here are some useful links to information on CLOP implementations:
- Ridge regression (kridge).
- Naive Bayes (naive).
- Neural networks (neural).
- Support Vector Machines (svc).
- Random Forests (rf, rffs).
- Boosting (gentleboost).
- Feature selection methods (s2n, relief, gs, svcrfe).
Can I modify the code?
Sure. If you use CLOP, please cite it, and let us know of additions we can incorporate that will benefit others.
Should we include our models with our submissions?
Yes, this will ensure reproducibility. If your models are
very big and you have trouble uploading them, please contact the challenge web page administrator
for instructions.
Can a participant give the organizers an arbitrarily hard time?
DISCLAIMER: ALL INFORMATION, SOFTWARE,
DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER
ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S
INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER
ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION
MADE AVAILABLE FOR THE CHALLENGE.
Who can I ask for more help?
For all other questions, email agnostic@clopinet.com.