Questions on lecture 8

Assessment methods (continued)

7.  What is the variance of the test error rate?

For an error rate E computed from m examples, the variance is: E(1-E)/m.
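As a minimal illustration (in Python; the choice of z = 1.96 for an approximate 95% confidence interval is an assumption, not part of the lecture), this variance yields an error bar on the measured error rate:

    import math

    def error_bar(E, m, z=1.96):
        # Standard error of an error rate E estimated from m test examples,
        # using the binomial variance E(1-E)/m, plus an approximate 95%
        # confidence interval (z = 1.96).
        std_err = math.sqrt(E * (1.0 - E) / m)
        return std_err, (max(0.0, E - z * std_err), min(1.0, E + z * std_err))

    # Example: 12 errors out of 200 test examples, i.e. E = 0.06
    std_err, interval = error_bar(0.06, 200)
    print(std_err, interval)   # ~0.017 and roughly (0.03, 0.09)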

8.  What is a good rule of thumb to compute the size of a test set necessary to obtain statistically significant results?
m = 100/E. For example, if we expect an error rate around E = 10%, we need on the order of m = 100/0.1 = 1000 test examples.
9. What is a good test to assess the significance of the difference in performance of two classifiers?
The McNemar paired test.
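A minimal sketch of the McNemar paired test on the predictions of two classifiers over the same test examples (the continuity-corrected chi-square form and the use of scipy are illustrative assumptions):

    from scipy.stats import chi2

    def mcnemar_test(y_true, pred_a, pred_b):
        # Count the disagreements: n01 = A wrong and B right, n10 = A right and B wrong.
        n01 = sum(a != y and b == y for y, a, b in zip(y_true, pred_a, pred_b))
        n10 = sum(a == y and b != y for y, a, b in zip(y_true, pred_a, pred_b))
        # Continuity-corrected chi-square statistic with 1 degree of freedom.
        stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
        p_value = chi2.sf(stat, df=1)
        return stat, p_value

    # A small p-value (e.g. < 0.05) suggests the two classifiers really differ.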

10. What is cross-validation? Give examples of cross-validation methods.
Cross-validation amounts to splitting the training data multiple times into a training set and a validation set. Training is performed on the training set and testing on the validation set. The validation results are then averaged. Note that this does not preclude reserving a separate test set for the final testing. Examples of cross-validation techniques include k-fold cross-validation, the bootstrap, and leave-one-out.
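A minimal k-fold cross-validation sketch (scikit-learn and a linear SVM are illustrative choices, not prescribed by the lecture):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    scores = []
    for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearSVC().fit(X[train_idx], y[train_idx])     # train on the training fold
        scores.append(model.score(X[valid_idx], y[valid_idx]))  # test on the validation fold

    print(sum(scores) / len(scores))   # averaged validation accuracy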
11. Why is it wrong to rank the features using the whole training set and then run cross-validation on nested subsets defined by the ranking to determine the optimum number of features?
This is wrong because it biases the result: all the examples have implicitly been used for training, since they were all used for the feature ranking. It is better (but more computationally expensive) to remove one example, carry out the ranking and the training of all subsets, test on the withheld example, and then average the results for each size of feature subset.
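A minimal sketch of this correct protocol, redoing the ranking inside each leave-one-out split (the correlation-based ranking and the nearest-centroid classifier are illustrative assumptions):

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import NearestCentroid

    def rank_features(X, y):
        # One possible relevance criterion: absolute correlation with the labels.
        corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
        return np.argsort(corr)[::-1]

    def loo_error_per_subset_size(X, y, sizes):
        errors = {k: 0 for k in sizes}
        for train_idx, test_idx in LeaveOneOut().split(X):
            # The ranking is redone WITHOUT the withheld example...
            order = rank_features(X[train_idx], y[train_idx])
            # ...and every nested subset is trained, then tested on that example.
            for k in sizes:
                cols = order[:k]
                clf = NearestCentroid().fit(X[train_idx][:, cols], y[train_idx])
                errors[k] += int(clf.predict(X[test_idx][:, cols])[0] != y[test_idx][0])
        return {k: errors[k] / len(y) for k in sizes}   # average error for each subset size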

12. Is it best to carry out extensive feature/hyperparameter selection experiments, exploring as many possibilities as possible?
No, this is dangerous and prone to overfitting: in that case it is the validation set that is being overfitted. If a large number of possibilities are investigated, they should be ranked in order of preference before running the experiments. Then, if two models have the same performance within the error bar, the better-ranked model is chosen.

Wrappers
1. What is the difference between filters and wrappers?
Filters select features independently of the performance of the learning machine, using a "relevance" criterion. Wrappers select feature subsets on the basis of how well a learning machine performs.
2. Why do wrappers usually need a search method?
Training learning machines on all possible feature subsets is usually computationally infeasible. Wrappers use search strategies to efficiently explore the space of feature subsets.
3. Are all filters feature ranking methods?
No, feature ranking methods are a subset of the filter methods. However, some people call feature ranking methods filters. Filters include methods that select feature subsets on the basis of a relevance (or conditional relevance) criterion. Some people also call filters methods that select features using one learning machine, the features then being used with another one. For example, people use random forests (ensembles of decision trees) as a filter and then train an SVM on the selected features.
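A minimal sketch of that example (the particular scikit-learn estimators and parameters are illustrative assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    def rf_filter_then_svm(X_train, y_train, n_keep=10):
        # "Filter" step: rank the features with the importances of a random forest...
        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
        keep = np.argsort(rf.feature_importances_)[::-1][:n_keep]
        # ...then train a different learning machine (an SVM) on the selected features.
        svm = SVC(kernel="rbf").fit(X_train[:, keep], y_train)
        return keep, svm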
4. Are search methods only useful for wrappers?
No, you can use a search method to generate feature subsets and then assess them with a "relevance" criterion rather than the score of a learning machine.
5. How is the assessment done for filter methods?
To be perfectly consistent with the definition of a filter, determining the optimum number of features should not use a classifier. Statistical tests and the probe methods are consistent in that respect. However, it is very common that people rank features with a "relevance" criterion and determine the optimum number of features with the classifier performance. This can be considered a hybrid between filters and wrappers, or we can just call it a wrapper for which the search is guided by a relevance criterion.
6. How is the assessment done for wrapper methods?
Usually using cross-validation. However, there are "wrong" ways to do it. See question 11 in the assessment methods.
7. How many trainings does an exhaustive search in feature space require?
2^n, n being the number of features (already more than a million trainings for n = 20).
8. If we are selecting from data among N decision rules, what is the minimum number of examples we need to find the one that generated the data?
log2(N): in the best case, each labeled example cuts the set of candidate rules in half, so about log2(N) examples are needed to single one out.
9. Combining the answers to the two previous questions, how should the number of examples scale with the number of features if we use exhaustive search in a wrapper setting?
We have to select among N = 2^n decision rules. The number of examples should therefore be of the order of log2(2^n) = n, i.e. grow linearly with the number of features.
10. What is the number of subsets that are investigated in forward selection, backward elimination, and feature ranking?
For feature ranking and nested feature subset methods, N = n. For forward selection and backward elimination, N = n + (n-1) + ... + 1 = n(n+1)/2.
11. How should the number of examples scale with the number of features for the methods of question 10?
The number of examples should be of the order of log n, since log2(n) and log2(n(n+1)/2) are both of the order of log n.
12. What are the differences, advantages and disadvantages of forward vs. backward selection?
Forward selection starts with an empty set of features and progressively adds features. Backward elimination starts with all the features and progressively eliminates some. They both produce nested subsets of features. From the point of view of statistical complexity, they are equivalent. If the search is ended when a stopping criterion is met, forward selection may be computationally more efficient because the trainings are performed on smaller feature subsets. From the point of view of the quality of the solution, it depends: when we vary the number of features, the best feature subsets may not be nested. The forward nesting ensures some optimality of the small feature subsets, but does not guarantee finding the best larger ones. The backward nesting, on the other hand, ensures some optimality of the larger subsets, but may yield very poorly performing small subsets.
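A minimal greedy forward-selection wrapper sketch (the cross-validated linear SVM used as the wrapper score is an illustrative assumption):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def forward_selection(X, y, max_features):
        # Greedy forward selection: start empty and, at each step, add the feature
        # that gives the best cross-validated score (the wrapper assessment).
        selected, remaining, history = [], list(range(X.shape[1])), []
        for _ in range(max_features):
            scores = {j: cross_val_score(LinearSVC(), X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
            history.append((list(selected), scores[best]))   # the nested subsets and their scores
        return history

    # Backward elimination would instead start from all the features and, at each step,
    # remove the feature whose removal hurts the cross-validated score least.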
13. What is "floating search"?
A method of combining forward and backward selection. One step forward, then backward as long as we find better subsets than those of the same size obtained so far, or vice versa.
14. What is simulated annealing?
A method of optimization inspired by the physics of re-crystallization (annealing), sketched in code after the steps below:
- Make a step in feature space, compute deltaE
- If deltaE<0, accept the change
- Otherwise, accept the change with probability exp(-deltaE/T)
- Progressively “cool down”.
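A minimal sketch of these steps applied to feature subsets (the move, the cooling schedule, and the parameter values are illustrative assumptions):

    import math, random

    def simulated_annealing(cost, n_features, n_steps=1000, T0=1.0, cooling=0.995):
        # cost(subset) should return a value to MINIMIZE, e.g. a validation error rate.
        subset = [random.random() < 0.5 for _ in range(n_features)]   # random starting subset
        E, T = cost(subset), T0
        for _ in range(n_steps):
            # Make a step in feature space: flip one feature in or out of the subset.
            candidate = list(subset)
            j = random.randrange(n_features)
            candidate[j] = not candidate[j]
            deltaE = cost(candidate) - E
            # Accept improvements; accept degradations with probability exp(-deltaE/T).
            if deltaE < 0 or random.random() < math.exp(-deltaE / T):
                subset, E = candidate, E + deltaE
            T *= cooling   # progressively "cool down"
        return subset, E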
15. What are "genetic algorithms"?
A method of optimization inspired by biology: a population of candidate solutions (here, feature subsets encoded as bit strings) evolves through selection, crossover, and mutation, as sketched below.
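A minimal sketch of a genetic algorithm over feature subsets encoded as bit strings (the fitness function, selection scheme, and parameter values are illustrative assumptions):

    import random

    def genetic_search(fitness, n_features, pop_size=20, n_generations=50, p_mut=0.05):
        # fitness(subset) should return a value to MAXIMIZE, e.g. a validation accuracy.
        population = [[random.random() < 0.5 for _ in range(n_features)]
                      for _ in range(pop_size)]
        for _ in range(n_generations):
            # Selection: keep the fitter half of the population as parents.
            population.sort(key=fitness, reverse=True)
            parents = population[: pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_features)                           # crossover point
                child = a[:cut] + b[cut:]                                       # splice the two parents
                child = [bit != (random.random() < p_mut) for bit in child]     # mutation: flip bits
                children.append(child)
            population = parents + children
        return max(population, key=fitness)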
16. Why couldn't we use gradient descent instead of the search methods we talked about in class?
Because we are searching a discrete space. We will see when we talk about embedded methods how we can in fact use gradient descent.