of challenge feature sets with SVM
and hyper parameter space search:
For each (dataset, feature-set, date) the following was performed:
trainer * predictor with the following 8 kernels was tried:
polynomial of degree 2, 3, 4, 5,
rbf with width 0.1, 1, 10
For each of the above, 3 values of C were tried:
For each of the above, 3 preprocessings of features were tried:
normalize1: each feature was separately standardized, by subtracting its
mean and dividing by (std + fudge factor). In case of sparse data, the
mean is not subtracted.
normalize2: each example is divided by its L2 norm
normalize3: each example is divided by the mean L2 norm of examples.
For some datasets we also used the raw data. However, for some datasets (e.g.
Madelon) the SVM classifiers spent too much time on each trial with raw data,
so we removed it from the ‘bank’ of normalization methods.
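
The three preprocessing options described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the submissions: the function names simply mirror the labels in the text, the value of the fudge factor and the use of the population standard deviation are assumptions, and leaving all-zero examples unchanged in normalize2 is also an assumption.

```python
import math

FUDGE = 1e-8  # assumed value; the text only says "fudge factor"


def l2_norm(row):
    """Euclidean (L2) norm of one example."""
    return math.sqrt(sum(v * v for v in row))


def normalize1(X, sparse=False, fudge=FUDGE):
    """Standardize each feature (column) separately.

    For sparse data the mean is not subtracted, so zeros stay zeros.
    """
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    # Population standard deviation (divide by n): an assumption.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    shift = [0.0] * len(cols) if sparse else means
    return [[(v - s) / (sd + fudge) for v, s, sd in zip(row, shift, stds)]
            for row in X]


def normalize2(X):
    """Divide each example (row) by its own L2 norm."""
    out = []
    for row in X:
        norm = l2_norm(row) or 1.0  # assumption: all-zero rows are left as-is
        out.append([v / norm for v in row])
    return out


def normalize3(X):
    """Divide every example by the mean L2 norm over all examples."""
    mean_norm = sum(l2_norm(row) for row in X) / len(X)
    return [[v / mean_norm for v in row] for row in X]
```

Note that normalize2 changes the relative scale of examples, while normalize3 only rescales the whole dataset by a single constant.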
For each (dataset, feature-set, date), we estimated the performance of
the SVM classifier on 8*3*3 = 72 hyper parameter sets. The estimation was
performed using 10-fold cross validation on the training set. (For the December
8 submission, the 10-fold cross validation was performed on the training+validation set.)
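
The 8*3*3 = 72 search can be sketched as a plain enumeration followed by 10-fold cross validation. This is only an illustration under stated assumptions: the text names seven kernels explicitly, so the eighth is assumed here to be linear; the values of C are not given, so placeholders are used; the interleaved fold assignment and the `cv_error` callback are hypothetical.

```python
from itertools import product

# Kernels named in the text, plus "linear" as an ASSUMED eighth kernel.
KERNELS = (
    [("linear", None)]
    + [("polynomial", degree) for degree in (2, 3, 4, 5)]
    + [("rbf", width) for width in (0.1, 1, 10)]
)
C_VALUES = ("C1", "C2", "C3")  # placeholders: the actual values are not given
PREPROCESSORS = ("normalize1", "normalize2", "normalize3")


def kfold_indices(n_examples, k=10):
    """Yield k (train, held_out) index lists using interleaved folds."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        yield train, held_out


def estimate_all(n_examples, cv_error, k=10):
    """Return {hyper-parameter set: mean held-out error over k folds}.

    cv_error(params, train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns the error rate on test_idx.
    """
    estimates = {}
    for params in product(KERNELS, C_VALUES, PREPROCESSORS):
        errors = [cv_error(params, train, test)
                  for train, test in kfold_indices(n_examples, k)]
        estimates[params] = sum(errors) / len(errors)
    return estimates
```

Each hyper parameter set costs 10 training runs, so one (dataset, feature-set, date) triple requires 720 runs in total.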
We used the SVMlight program (T. Joachims, Making
Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support
Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press,
After the 10-fold cross validation we first trained the predictor on the whole
training set and ran it on the test set, and chose a bias such that the ratio
of positive to negative examples corresponds to that of the training set.
This bias is then subtracted from the scores of the predictor when run on each
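
One way to pick such a bias is to place the threshold between two consecutive sorted test scores. The sketch below is an assumption about how this could be done, not the procedure actually used: the function names are hypothetical, and tied scores at the threshold would make the match only approximate.

```python
def choose_bias(test_scores, n_pos_train, n_neg_train):
    """Return a bias b such that the fraction of scores above b matches
    the fraction of positive examples in the training set."""
    frac_pos = n_pos_train / (n_pos_train + n_neg_train)
    ranked = sorted(test_scores, reverse=True)
    k = round(frac_pos * len(ranked))  # target number of positive predictions
    if k <= 0:
        return ranked[0] + 1.0   # every example is predicted negative
    if k >= len(ranked):
        return ranked[-1] - 1.0  # every example is predicted positive
    # Midpoint between the k-th and (k+1)-th largest scores.
    return (ranked[k - 1] + ranked[k]) / 2.0


def apply_bias(scores, bias):
    """Subtract the bias so that the decision threshold becomes zero."""
    return [s - bias for s in scores]
```

For example, with test scores 2.0, 1.5, 1.0, 0.5 and a training set with one positive for every three negatives, the bias is 1.75 and exactly one shifted score is positive.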
Some optimizations were incorporated into the code. For example,
SVMlight reads the examples from disk. Since we run SVMlight many times
(with different hyper-parameters) with exactly the same input data, we
reuse the data written to disk. This speeds the process considerably, especially
for the heavy datasets.
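
The reuse of on-disk data can be sketched as a write-once helper. This is an illustration only: the helper names and the choice of the file path as the cache key are assumptions, and how SVMlight itself was invoked is not shown. The file layout follows the SVMlight example format, one line per example with a target followed by `index:value` pairs using 1-based, increasing feature indices.

```python
import os


def write_svmlight_file(path, examples, labels):
    """Write examples as '<target> <index>:<value> ...' lines.

    Zero-valued features are omitted, which keeps sparse data compact.
    """
    with open(path, "w") as f:
        for x, y in zip(examples, labels):
            feats = " ".join(f"{j + 1}:{v!r}"
                             for j, v in enumerate(x) if v != 0)
            f.write(f"{y:+d} {feats}".rstrip() + "\n")


def cached_data_file(path, examples, labels):
    """Write the data only if it is not on disk yet, then return the path.

    Every run with different hyper-parameters but the same input data
    can then point SVMlight at the same file.
    """
    if not os.path.exists(path):
        write_svmlight_file(path, examples, labels)
    return path
```

The path has to change whenever the preprocessing or the cross validation fold changes, otherwise a stale file would be reused.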