**Questions on lecture 9**

__Embedded methods__

**1. What is the difference between filters, wrappers, and embedded methods?**

Filters select features independently of the performance of the learning
machine, using a "relevance" criterion. Wrappers and embedded methods both
select feature subsets using a learning machine. Wrappers explore the space
of all possible feature subsets using a search method; the learning machine
is used to assess the subsets. Any off-the-shelf learning machine can
be used for that purpose. Embedded methods perform feature selection
in the process of learning. They return a feature subset and a trained learning
machine.

**2. What are the computational differences between nested subset methods
for wrappers and embedded methods?**

Assuming we split the data into one training set and one validation set:

- Wrappers perform n(n+1)/2 trainings of the learning machine, where n is the
number of features.

- Embedded methods perform only n trainings.

This is because, in embedded methods, the next feature to remove or add is
chosen by estimating the change in the cost function that it would incur,
not by retraining.
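As a sanity check, this toy calculation (the function names are my own) counts the trainings needed by a backward-elimination wrapper, which retrains one candidate model per remaining feature at each step, versus an embedded method that trains once per elimination step:

```python
# Toy count of learning-machine trainings for n starting features.
# Wrapper: with k features remaining, it retrains k candidate subsets
# (each with one feature removed), giving n + (n-1) + ... + 1 trainings.
# Embedded: one training per elimination step, i.e. n trainings.

def wrapper_trainings(n: int) -> int:
    return sum(k for k in range(1, n + 1))  # equals n(n+1)/2

def embedded_trainings(n: int) -> int:
    return n

n = 10
print(wrapper_trainings(n))   # 55, i.e. n(n+1)/2
print(embedded_trainings(n))  # 10
```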

**3. What is the statistical complexity difference between nested subset methods
for wrappers and embedded methods?**

It is approximately the same, of the order of log(n). One can afford
a number of features exponential in the number of examples.

**4. What guides the search
of embedded methods?**

Changes in the cost function incurred by adding or removing inputs. These
changes are estimated without retraining the learning machine, for instance
using a Taylor series of the cost function (like in the case of OBD).
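A minimal sketch of the OBD-style estimate, assuming a diagonal Hessian approximation (the function name and numbers are illustrative): the saliency s_i = (1/2) H_ii w_i^2 is the second-order Taylor estimate of the cost increase from zeroing weight w_i, so the weight with the smallest saliency is the best pruning candidate, with no retraining needed.

```python
import numpy as np

# OBD-style saliency under a diagonal Hessian approximation:
# zeroing w_i changes the cost by roughly s_i = 0.5 * H_ii * w_i^2.
def obd_saliency(w: np.ndarray, h_diag: np.ndarray) -> np.ndarray:
    return 0.5 * h_diag * w ** 2

w = np.array([2.0, -0.1, 1.0])       # trained weights (illustrative)
h = np.array([1.0, 4.0, 0.5])        # diagonal of the Hessian (illustrative)
s = obd_saliency(w, h)
print(s)                             # [2.    0.02  0.25]
print(int(np.argmin(s)))             # 1: weight 1 is the cheapest to prune
```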

**5. How can one move from a search in a discrete space of feature
subsets to a search in a continuous space?**

By introducing feature scaling factors.

**6. How does one perform feature selection using scaling factors?**

One can perform gradient descent to optimize a performance bound. The
resulting scaling factors indicate feature usefulness.
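A minimal sketch under assumptions of my own (ridge regression as the learner, and finite-difference gradients on the validation error in place of an analytic performance bound): each input is multiplied by a scaling factor before training, and gradient descent on the scaling factors drives the factors of useless features toward zero.

```python
import numpy as np

# Toy setup: 3 features, only feature 0 is relevant to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=80)
Xtr, Xva, ytr, yva = X[:60], X[60:], y[:60], y[60:]

def val_error(sigma):
    """Train ridge regression on sigma-scaled features; return validation MSE."""
    A, B = Xtr * sigma, Xva * sigma
    w = np.linalg.solve(A.T @ A + 5.0 * np.eye(3), A.T @ ytr)
    return np.mean((B @ w - yva) ** 2)

sigma = np.ones(3)
for _ in range(100):
    # Finite-difference gradient of the validation error w.r.t. sigma.
    grad = np.array([(val_error(sigma + 1e-4 * e) - val_error(sigma - 1e-4 * e))
                     / 2e-4 for e in np.eye(3)])
    sigma -= 0.5 * grad  # gradient step on the scaling factors

print(np.round(np.abs(sigma), 2))  # feature 0 should retain the largest factor
```

The resulting magnitudes of sigma rank the features: the relevant feature keeps a large factor, while the irrelevant ones shrink.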

**7. What is the idea behind the "zero norm" method?**

We use ||w||_{0}, the number of nonzero components of w (i.e. the number of
features used), as a regularizer, in place of ||w||_{2}^{2} or ||w||_{1}.
This is a shrinkage method. The idea is to make a tradeoff between the
empirical risk (e.g. the number of training errors) and the number of
features:

R_{reg} = R_{emp} + lambda ||w||_{0}
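A toy illustration of this tradeoff with hypothetical numbers: ||w||_0 counts the nonzero components of w, and the candidate that minimizes R_reg need not be the one that fits best.

```python
import numpy as np

# Regularized risk with the zero "norm": R_reg = R_emp + lambda * ||w||_0,
# where ||w||_0 is the number of nonzero weights (features used).
def regularized_risk(r_emp: float, w: np.ndarray, lam: float) -> float:
    return r_emp + lam * np.count_nonzero(w)

# Hypothetical (empirical risk, weight vector) candidates:
candidates = [
    (0.05, np.array([1.2, -0.7, 0.3, 0.1])),  # best fit, uses 4 features
    (0.08, np.array([1.1, -0.6, 0.0, 0.0])),  # slightly worse, 2 features
    (0.30, np.array([1.0,  0.0, 0.0, 0.0])),  # worst fit, 1 feature
]
lam = 0.05
risks = [regularized_risk(r, w, lam) for r, w in candidates]
print([round(r, 2) for r in risks])   # [0.25, 0.18, 0.35]
best = int(np.argmin(risks))
print(best)                           # 1: the 2-feature solution wins
```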