Questions on lecture 8

Assessment methods (continued)

7.  What is the variance of the test error rate?

For an error rate E computed from m examples, the variance is: E(1-E)/m.
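As an aside, this formula gives a quick way to attach an error bar to a measured test error rate. A minimal sketch in Python (the function name is ours, for illustration):

```python
import math

def error_bar(E, m):
    """Standard deviation of an error rate E estimated from m test examples,
    assuming errors are i.i.d. Bernoulli(E): sqrt(E(1-E)/m)."""
    return math.sqrt(E * (1.0 - E) / m)

# Example: a 10% error rate measured on 1000 test examples has an
# error bar of about 0.0095, i.e. roughly one percentage point.
sigma = error_bar(0.10, 1000)
```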

8.  What is a good rule of thumb to compute the size of a test set necessary to obtain statistically significant results?
From the variance formula above, the error bar on E is sqrt(E(1-E)/m), approximately sqrt(E/m) for small E. Taking m of the order of 100/E therefore yields a relative error bar of about 10%.
9. What is a good test to assess the significance of the difference in performance of two classifiers?
The McNemar paired test.
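The test can be computed from the disagreement counts alone. A minimal sketch, assuming the chi-square version with continuity correction (the function name and the example counts are illustrative):

```python
import math

def mcnemar(b, c):
    """McNemar paired test on the disagreements between two classifiers.
    b: examples classifier 1 gets right and classifier 2 gets wrong;
    c: the opposite case. Returns the chi-square statistic (with
    continuity correction) and its p-value (1 degree of freedom)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Example: among the examples where the classifiers disagree,
# classifier 1 wins 30 times and classifier 2 wins 10 times.
stat, p = mcnemar(30, 10)  # p < 0.05: the difference is significant
```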

10. What is cross-validation? Give examples of cross-validation methods.
Cross-validation amounts to splitting the training data multiple times into a training set and a validation set. Training is performed on the training set and testing on the validation set, and the validation results are then averaged. Note that this does not preclude reserving a separate test set for the final testing. Examples of cross-validation techniques include k-fold cross-validation, the bootstrap, and leave-one-out.
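The splitting step of k-fold cross-validation can be sketched as follows (classifier training and the averaging of results are omitted; the function name is ours):

```python
def k_fold_splits(n, k):
    """Split example indices 0..n-1 into k (train, validation) pairs.
    Each example appears in exactly one validation fold."""
    indices = list(range(n))
    # distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, val))
        start += size
    return splits

# With n=10 examples and k=5 folds, each validation fold has 2 examples
# and the corresponding training fold the other 8.
```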
11. Why is it wrong to rank the features with all the training set and then run cross validation on nested subsets defined by the ranking to determine the optimum number of features?
This is wrong because it biases the result: all the examples have implicitly been used for training, since they were used for the feature ranking. It is better (but more computationally expensive) to remove one example, carry out the ranking and the training of all subsets, test on the withheld example, then average the results for each size of feature subset.
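The correct protocol can be sketched as follows, using an illustrative correlation-based ranking and a toy nearest-centroid classifier (the function names, the ranking criterion, and the classifier are our choices, for illustration only):

```python
import numpy as np

def rank_features(X, y):
    """Rank features by absolute correlation with the labels
    (a simple 'relevance' criterion, used here only for illustration)."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]

def nearest_centroid_predict(X_train, y_train, x):
    """Toy classifier: predict the class whose mean is closest to x."""
    dists = {c: np.linalg.norm(x - X_train[y_train == c].mean(axis=0))
             for c in np.unique(y_train)}
    return min(dists, key=dists.get)

def loo_subset_errors(X, y):
    """For each subset size k, estimate the error of the top-k features,
    re-ranking inside each leave-one-out fold to avoid the bias above."""
    n, d = X.shape
    errors = np.zeros(d)
    for i in range(n):                                # withhold example i
        keep = np.arange(n) != i
        ranking = rank_features(X[keep], y[keep])     # rank WITHOUT example i
        for k in range(1, d + 1):
            feats = ranking[:k]
            pred = nearest_centroid_predict(X[keep][:, feats], y[keep],
                                            X[i, feats])
            errors[k - 1] += pred != y[i]
    return errors / n                                 # error rate per subset size
```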

12. Is it best to carry out extensive feature/hyperparameter selection experiments, exploring as many possibilities as possible?
No, this is dangerous and prone to overfitting: in this case it is the validation set that is being overfitted. If a large number of possibilities are investigated, they should be ranked in order of preference before running the experiments. Then, if two models have the same performance within the error bar, the better-ranked model is chosen.

1. What is the difference between filters and wrappers?
Filters select features independently of the performance of the learning machine, using a "relevance" criterion. Wrappers select feature subsets on the basis of how well a learning machine performs.
2. Why do wrappers usually need a search method?
Training learning machines on all possible feature subsets is usually computationally infeasible. Wrappers use search strategies to efficiently explore the space of feature subsets.
2. Are all filters feature ranking methods?
No, feature ranking methods are a subset of the filter methods. However, some people call feature ranking methods filters. Filters include methods of selecting feature subsets on the basis of a relevance (or conditional relevance) criterion. Some people also call "filters" methods that select features with one learning machine, the selection then being used by another one. For example, people use random forests (ensembles of decision trees) as a filter and then train an SVM on the selected features.
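As an illustration of a pure filter criterion, here is a sketch of the Fisher (signal-to-noise) score for two-class data, assuming labels in {+1, -1} (the function name is ours):

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher criterion for two-class data: |mu+ - mu-| / (sigma+ + sigma-),
    computed independently for every feature (a typical filter criterion)."""
    Xp, Xm = X[y == 1], X[y == -1]
    num = np.abs(Xp.mean(axis=0) - Xm.mean(axis=0))
    den = Xp.std(axis=0) + Xm.std(axis=0)
    return num / den

# Rank features by decreasing score; the selected features can then be fed
# to any learning machine (e.g. an SVM), independently of its performance:
# ranking = np.argsort(fisher_scores(X, y))[::-1]
```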
3. Are search methods only useful for wrappers?
No, you can use a search method to generate feature subsets and then assess them with a "relevance" criterion rather than the score of a learning machine.
4. How is the assessment done for filter methods?
To be perfectly consistent with the definition of a filter, determining the optimum number of features should not use a classifier. Statistical tests and the probe methods are consistent in that respect. However, it is very common that people rank features with a "relevance" criterion and determine the optimum number of features with the classifier performance. This can be considered a hybrid between filters and wrappers, or we can just call it a wrapper for which the search is guided by a relevance criterion.
5. How is the assessment done for wrapper methods?
Usually using cross-validation. However, there are "wrong" ways to do it. See question 11 in the assessment methods.
6. How many trainings does an exhaustive search in feature space require?
2^n, n being the number of features.
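This count is easy to verify by enumeration (a small sketch; the function name is ours):

```python
from itertools import combinations

def all_subsets(n):
    """Enumerate every subset of n features (as tuples of feature indices),
    including the empty set: 2**n subsets in total."""
    features = range(n)
    for k in range(n + 1):
        yield from combinations(features, k)

# Already for n=10 features, an exhaustive wrapper needs 1024 trainings.
```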
7. If we are selecting from data among N decision rules, what is the minimum number of examples we need to find the one that generated the data?
log2 N: in the best case each labeled example eliminates half of the remaining rules, so log2 N examples are needed to single one out.
8. Combining the answers to the two previous questions, how should the number of examples scale with the number of features if we use exhaustive search in a wrapper setting?
We have to select among N=2^n decision rules. The number of examples should therefore be of the order of log2 N = n.
9. What is the number of subsets that are investigated in forward selection, backward elimination, and feature ranking?
For feature ranking and nested feature subset methods, N=n. For forward and backward selection N=n(n+1)/2.
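The forward/backward count follows from summing the candidate subsets tried at each step; a small sketch (the function name is ours):

```python
def forward_selection_trainings(n):
    """Number of subsets evaluated by forward selection (or backward
    elimination): at step k there are n - k + 1 candidates to try,
    giving n + (n-1) + ... + 1 = n(n+1)/2."""
    return sum(n - k + 1 for k in range(1, n + 1))
```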
10. How should the number of examples scale with the number of features for the methods of question 9?
The number of examples should be of the order of log n.
11. What are the differences, advantages and disadvantages of forward vs. backward selection?
Forward selection starts with an empty set of features and progressively adds features. Backward elimination starts with all the features and progressively eliminates some. Both produce nested subsets of features, and from the point of view of statistical complexity they are equivalent. If the search is ended when a stopping criterion is met, forward selection may be computationally more efficient because the trainings are performed on smaller feature subsets. From the point of view of the quality of the solution, it depends: as the number of features varies, the best feature subsets may not be nested. The forward nesting ensures some optimality of the small feature subsets, but does not guarantee finding the best larger ones. The backward nesting, on the other hand, ensures some optimality of the larger subsets, but may yield very poorly performing small subsets.
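Forward selection can be sketched as a greedy loop; here `score` stands for any assessment of a feature subset, e.g. cross-validated performance (the function name and the toy score are illustrative):

```python
def forward_selection(n_features, score):
    """Greedy forward selection. `score(subset)` is assumed to return the
    (cross-validated) performance of a learning machine trained on `subset`;
    higher is better. Returns the sequence of nested subsets."""
    selected, remaining, path = [], set(range(n_features)), []
    while remaining:
        # try adding each remaining feature and keep the best one
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path

# Toy score: features 2 and 0 are useful, the others are not.
weights = {0: 1.0, 2: 2.0}
path = forward_selection(4, lambda s: sum(weights.get(f, 0.0) for f in s))
# path[0] == [2], path[1] == [2, 0]: the subsets are nested.
```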
12. What is "floating search"?
A method combining forward and backward selection: one step forward, then backward steps as long as we find better subsets than those of the same size obtained so far, or vice versa.
13. What is simulated annealing?
A method of optimization inspired by the physics of re-crystallization:
- Make a step in feature space and compute the change in cost deltaE
- If deltaE < 0, accept the change
- Otherwise, accept the change with probability exp(-deltaE/T), where T is the "temperature"
- Progressively “cool down” by decreasing T.
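These steps can be sketched as follows for feature subsets (a toy sketch; the function name, the starting subset, and the cooling schedule are our assumptions):

```python
import math
import random

def simulated_annealing(n_features, cost, steps=1000, T0=1.0, cooling=0.995):
    """Sketch of simulated annealing over feature subsets, following the
    steps above. `cost(subset)` is assumed to return a validation error
    (lower is better); a subset is a frozenset of feature indices."""
    random.seed(0)                                 # for reproducibility
    current = frozenset(range(0, n_features, 2))   # arbitrary starting subset
    best, T = current, T0
    for _ in range(steps):
        f = random.randrange(n_features)           # step: flip one feature in/out
        candidate = current ^ {f}
        dE = cost(candidate) - cost(current)
        if dE < 0 or random.random() < math.exp(-dE / T):
            current = candidate                    # accept the change
            if cost(current) < cost(best):
                best = current
        T *= cooling                               # progressively "cool down"
    return best
```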
14. What are "genetic algorithms"?
A method of optimization inspired by biology: a population of candidate solutions evolves through selection, crossover, and mutation.
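A toy sketch for feature selection, where a subset is encoded as a bit string (the function name, population size, and operator details are our assumptions):

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=50):
    """Toy genetic algorithm: a feature subset is a bit string; subsets with
    higher `fitness` (assumed: validation performance) reproduce, with
    one-point crossover and single-bit mutation."""
    random.seed(0)                                 # for reproducibility
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]            # selection: keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            i = random.randrange(n_features)       # mutation: flip one bit
            child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```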
15. Why couldn't we use gradient descent instead of the search methods we talked about in class?
Because we are searching a discrete space. We will see when we talk about embedded methods how we can in fact use gradient descent.