Questions on lecture 9

Embedded methods

1. What is the difference between filters, wrappers, and embedded methods?
Filters select features independently of the performance of the learning machine, using a "relevance" criterion. Wrappers and embedded methods both select feature subsets using a learning machine. Wrappers explore the space of all possible feature subsets with a search method, and the learning machine is used only to assess each subset; any off-the-shelf learning machine can be used for that purpose. Embedded methods perform feature selection in the process of learning and return both a feature subset and a trained learning machine.
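To make the contrast concrete, here is a minimal scikit-learn sketch (the synthetic dataset, the ANOVA-F filter, the forward-selection wrapper, and the L1-penalized logistic regression as the embedded method are all illustrative assumptions, not the lecture's examples):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: rank features with a relevance statistic (ANOVA F-score); no learning machine is trained.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter picks:  ", filt.get_support(indices=True))

# Wrapper: forward search over feature subsets; the learning machine is retrained and
# cross-validated to assess each candidate subset.
wrap = SequentialFeatureSelector(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("wrapper picks: ", wrap.get_support(indices=True))

# Embedded: selection happens during training; the L1 penalty drives some weights exactly
# to zero, so the feature subset and the trained machine come out together.
emb = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("embedded picks:", (emb.coef_.ravel() != 0).nonzero()[0])

Note that only the embedded variant returns the subset and the trained classifier in one pass; the wrapper repeatedly retrains the estimator just to score candidate subsets.
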
2. What are the computational differences between nested subset methods for wrappers and embedded methods?
Assuming we split the data into one training set and one validation set, and calling n the number of candidate features:
- Wrappers perform n(n+1)/2 learning machine trainings.
- Embedded methods perform only n trainings.
This is because, in embedded methods, the next feature to remove or add is chosen by estimating the change in cost function it would incur, not by retraining on every candidate subset.
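A quick sanity check of those counts for a hypothetical n = 10 candidate features (backward elimination is assumed as the search strategy; the numbers are purely illustrative):

# Backward-elimination wrapper: choosing which feature to drop means retraining on every
# candidate subset, so the full sweep costs n + (n-1) + ... + 1 trainings.
n = 10
wrapper_trainings = sum(range(1, n + 1))    # n(n+1)/2 = 55
# Embedded nested-subset method: the effect of each removal is estimated from the
# currently trained machine, so each of the n elimination steps costs one training.
embedded_trainings = n                      # 10
print(wrapper_trainings, embedded_trainings)
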
3. What is the statistical complexity difference between nested subset methods for wrappers and embedded methods?
It is approximately the same, of the order of log(n). One can afford a number of features exponential in the number of examples.
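A back-of-the-envelope way to see the log(n) claim (a union-bound sketch given for illustration, not quoted from the lecture): nested subset search only ever considers the n(n+1)/2 subsets visited along its path, so the capacity term of a guaranteed-risk bound grows roughly like

    sqrt( log(n(n+1)/2) / m ) ≈ sqrt( 2 log(n) / m ),

where m is the number of training examples. Keeping this term small only requires m to grow like log(n), which is why the number of features n can be exponential in the number of examples.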
4. What guides the search of embedded methods?
Changes in the cost function incurred by adding or removing inputs. These changes are estimated without retraining the learning machine, for instance using a Taylor series of the cost function (like in the case of OBD).
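As an illustration of such an estimate, here is a small numpy sketch for a linear least-squares model (the data and the ridge constant are made up; only the saliency formula 0.5 * H_jj * w_j^2, the second-order Taylor estimate of the cost increase when weight j is zeroed, follows the OBD idea):

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 8
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Train once: ridge solution of J(w) = ||Xw - y||^2 / (2m) + (lam/2) ||w||^2.
lam = 1e-3
H = X.T @ X / m + lam * np.eye(n)            # Hessian of J (exact, since J is quadratic)
w = np.linalg.solve(H, X.T @ y / m)

# OBD saliency: at the minimum the gradient vanishes, so zeroing w_j changes the cost
# by about 0.5 * H_jj * w_j^2 -- estimated without any retraining.
saliency = 0.5 * np.diag(H) * w**2
print("features ranked least-to-most useful:", np.argsort(saliency))
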
5. How can one move from a search in a discrete space of feature subsets to a search in a continuous space?
By introducing feature scaling factors.
6.  How does one perform feature selection using scaling factors?
One can perform gradient descent to optimize a performance bound. The resulting scaling factors indicate feature usefulness.
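Here is a minimal sketch of the scaling-factor idea (as a simpler stand-in for the performance bound of the lecture, this version does joint gradient descent on a penalized training cost; the data, the L1 penalty on the scaling factors, and every constant are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 6
X = rng.normal(size=(m, n))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=m)    # only features 0 and 1 matter

w = np.zeros(n)
sigma = np.ones(n)              # one continuous scaling factor per feature
lr, lam = 0.05, 0.05

for _ in range(500):
    Xs = X * sigma                                          # scaled inputs
    r = Xs @ w - y                                          # residuals
    grad_w = Xs.T @ r / m
    grad_sigma = (X * w).T @ r / m + lam * np.sign(sigma)   # L1 push toward zero
    w -= lr * grad_w
    sigma -= lr * grad_sigma

# Large scaling factors flag useful features; factors driven toward zero flag candidates to drop.
print(np.round(sigma, 2))
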
7.  What is the idea behind the "zero norm" method?
We use ||w||0 = number of non-zero weights (i.e. the number of features effectively used) as a regularizer. This is a shrinkage method: ||w||0 is used instead of ||w||2^2 or ||w||1. The idea is to make a tradeoff between the empirical risk (e.g. the number of training errors) and the number of features:
Rreg = Remp + lambda ||w||0
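Since minimizing ||w||0 directly is combinatorial, one common approximation rescales the inputs multiplicatively by the magnitudes of the learned weights and retrains, in the spirit of Weston et al.'s zero-norm approximation (the ridge base learner and all constants below are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 10
X = rng.normal(size=(m, n))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=m)    # only features 0 and 3 matter

z = np.ones(n)        # per-feature scaling, acting as a soft feature indicator
lam = 1e-2
for _ in range(20):
    Xz = X * z                                                     # rescaled inputs
    w = np.linalg.solve(Xz.T @ Xz + lam * np.eye(n), Xz.T @ y)     # ridge fit on rescaled data
    z = z * np.abs(w)                                              # shrink weakly-weighted features

selected = np.flatnonzero(z > 1e-6 * z.max())
print("surviving features:", selected)

Features whose scaling collapses to zero are effectively removed, leaving a model with few non-zero weights.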