Classification of challenge feature sets with SVM

Contact: Gideon Dror

Parameter and hyperparameter space search:

For each (dataset, feature-set, date) the following was performed:

An SVM trainer and predictor was tried with the following 8 kernels:

linear,
polynomial of degree 2, 3, 4, 5,
rbf with width 0.1, 1, 10

For each of the above, 3 values of C were tried:

C = 0.1, 1, 10;

For each of the above, 3 feature preprocessing methods were tried:

normalize1: each feature was separately standardized, by subtracting its mean and dividing by (std + fudge factor). In the case of sparse data, the mean is not subtracted.
normalize2: each example is divided by its L2 norm
normalize3: each example is divided by the mean L2 norm of examples.
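
The three normalizations above can be sketched as follows. This is a minimal illustration and not the code used for the challenge. The fudge factor value, the use of the population standard deviation, and the handling of all-zero examples are assumptions, since the text does not specify them.

```python
import math

FUDGE = 1e-6  # assumed value; the text does not state the fudge factor


def l2_norm(row):
    """Euclidean (L2) norm of one example."""
    return math.sqrt(sum(v * v for v in row))


def normalize1(rows, sparse=False, fudge=FUDGE):
    """Standardize each feature (column) separately.

    For sparse data the mean is not subtracted, so zeros stay zero.
    """
    n = len(rows)
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        # Population standard deviation (an assumption).
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        shift = 0.0 if sparse else mean
        for i in range(n):
            out[i][j] = (rows[i][j] - shift) / (std + fudge)
    return out


def normalize2(rows):
    """Divide each example by its own L2 norm (all-zero examples are kept as zeros)."""
    out = []
    for r in rows:
        norm = l2_norm(r)
        out.append([v / norm if norm > 0 else 0.0 for v in r])
    return out


def normalize3(rows):
    """Divide every example by the mean L2 norm over all examples."""
    mean_norm = sum(l2_norm(r) for r in rows) / len(rows)
    if mean_norm == 0:
        return [list(r) for r in rows]
    return [[v / mean_norm for v in r] for r in rows]
```

Note that normalize2 gives every example unit length, whereas normalize3 only rescales the whole dataset by a single constant and so preserves the relative lengths of the examples.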

On some datasets we also used the raw data. However, for some datasets (e.g. Madelon) the SVM classifiers spent too much time on each trial with the raw data, so we removed the raw data from the ‘bank’ of normalization methods.

In total, for each (dataset, feature-set, date), we estimated the performance of the SVM classifier on 8*3*3 = 72 hyperparameter sets. The estimation was performed using 10-fold cross validation on the training set (for the December 8 submission, the 10-fold cross validation was performed on the training+validation set).
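
The count of 72 can be checked by enumerating the grid, as in the sketch below. The tuple layout and the names KERNELS, C_VALUES, NORMALIZATIONS and kfold_indices are illustrative choices. The fold assignment shown (example i goes to fold i mod 10) is also an assumption, since the text does not say how the folds were drawn.

```python
from itertools import product

# 8 kernels: 1 linear + 4 polynomial degrees + 3 rbf widths.
KERNELS = (
    [("linear", None)]
    + [("polynomial", degree) for degree in (2, 3, 4, 5)]
    + [("rbf", width) for width in (0.1, 1, 10)]
)
C_VALUES = (0.1, 1, 10)
NORMALIZATIONS = ("normalize1", "normalize2", "normalize3")

# Every (kernel, C, normalization) combination: 8 * 3 * 3 = 72.
GRID = list(product(KERNELS, C_VALUES, NORMALIZATIONS))


def kfold_indices(n_examples, n_folds=10):
    """Yield (train_idx, held_out_idx) pairs for k-fold cross validation.

    Example i is held out in fold i % n_folds (an assumed assignment).
    """
    for fold in range(n_folds):
        held_out = [i for i in range(n_examples) if i % n_folds == fold]
        train = [i for i in range(n_examples) if i % n_folds != fold]
        yield train, held_out
```

Each of the 72 grid entries would then be scored by averaging its performance over the 10 held-out folds.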

We used the SVMlight program (T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999).

Biasing:

Within the 10-fold cross validation, we first trained the predictor on the whole training set and ran it on the test set, and chose a bias such that the ratio of positive to negative examples among the test predictions corresponds to that of the training set. This bias was then subtracted from the scores of the predictor when run on each fold.
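
One way to choose such a bias is sketched below. It is an illustration under stated assumptions and not the original code: the bias is placed midway between two adjacent ranked test scores, and the function names are hypothetical. With tied scores the resulting ratio is only approximate.

```python
def positive_fraction(labels):
    """Fraction of training examples with a positive label."""
    return sum(1 for y in labels if y > 0) / len(labels)


def choose_bias(test_scores, fraction):
    """Return a bias b so that about `fraction` of the test scores satisfy score - b > 0."""
    ranked = sorted(test_scores, reverse=True)
    n = len(ranked)
    k = round(fraction * n)  # number of test examples to label positive
    if k <= 0:
        return ranked[0]  # no score is strictly above the maximum
    if k >= n:
        return ranked[-1] - 1.0  # any value below the minimum works
    # Midpoint between the k-th and (k+1)-th largest scores.
    return 0.5 * (ranked[k - 1] + ranked[k])


def debias(scores, bias):
    """Subtract the bias from the predictor scores."""
    return [s - bias for s in scores]
```

For example, with training labels [1, 1, -1, -1] and test scores [2.0, 1.5, 1.0, 0.5], the positive fraction is 0.5, the bias is 1.25, and two of the four shifted scores are positive.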

Optimizations:

Some optimizations were incorporated into the code. For example, SVMlight reads the examples from disk. Since we ran SVMlight many times (with different hyperparameters) on exactly the same input data, we reused the data already written to disk. This sped up the process considerably, especially for the heavy datasets.
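
The reuse of data files can be sketched as below. This is a hypothetical helper and not the original code. It assumes dense rows with integer labels of +1 or -1, and a caller-supplied cache key such as one naming the dataset, feature set and normalization. The line format follows the SVMlight input convention of a target followed by index:value pairs, with indices starting at 1.

```python
import os


def to_svmlight_lines(rows, labels):
    """Format examples as '<label> <index>:<value> ...' lines.

    Feature indices start at 1 and zero-valued features are omitted.
    """
    lines = []
    for row, label in zip(rows, labels):
        feats = " ".join(
            f"{j}:{float(v)!r}" for j, v in enumerate(row, start=1) if v != 0
        )
        lines.append(f"{label:+d} {feats}".rstrip())
    return lines


def cached_data_file(key, rows, labels, cache_dir):
    """Return the path of the data file for `key`, writing it only if missing.

    Later runs with other hyperparameters but the same key reuse the file.
    """
    path = os.path.join(cache_dir, f"{key}.dat")
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(to_svmlight_lines(rows, labels)) + "\n")
    return path
```

With this layout, the file is written once per (dataset, feature-set, normalization) and shared by all 24 kernel and C settings that use it.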