**Questions on lecture 5**

__Filter methods__

**1. What purpose(s) may be pursued when selecting features?**

- Removing useless features (pure noise or "distracters") to save computing
time and data storage.

- Improving prediction performance (there is less risk of overfitting if
we start from a lower-dimensional space).

- Understanding the process that generated the data (reverse engineering).

**2. How can one define feature "relevance"? What is easier to
define, relevance or irrelevance?**

Relevance might be defined by the existence of a __dependence__ between
a feature and the target values (or "desired outcome", e.g. the classification
labels). Statistical independence is easy to define: P(X,Y)=P(X)P(Y). So
independence (irrelevance) is easier to define than dependence (relevance).
There are several ways of defining dependence and assessing it. One way is to
measure the discrepancy between P(X,Y) and P(X)P(Y) with the KL divergence.
This criterion is called "Mutual Information". Features can sometimes be
irrelevant by themselves but relevant "in the context of others". Therefore we
need to introduce a notion of conditional relevance, e.g. conditional mutual
information.
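As a minimal sketch of the definitions above, the following computes mutual information as the KL divergence between P(X,Y) and P(X)P(Y) for a discrete joint table. The function name `mutual_information` and the XOR toy problem (Y = X1 xor X2 with fair-coin inputs) are illustrative assumptions, not part of the lecture. The toy case shows a feature with zero mutual information on its own that becomes fully informative in the context of the other one.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits from a 2-D joint probability table P(X, Y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal P(X), column vector
    py = joint.sum(axis=0, keepdims=True)    # marginal P(Y), row vector
    indep = px * py                          # table of P(X)P(Y)
    mask = joint > 0                         # convention: 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Hypothetical toy problem: X1, X2 are fair coins and Y = X1 xor X2.
# Joint table of X1 and Y: all four combinations are equally likely.
p_x1_y = np.full((2, 2), 0.25)

# Joint table of the pair (X1, X2) and Y; rows are the states 00, 01, 10, 11.
p_x1x2_y = np.array([[0.25, 0.0],
                     [0.0, 0.25],
                     [0.0, 0.25],
                     [0.25, 0.0]])

mi_single = mutual_information(p_x1_y)     # X1 alone: no information on Y
mi_pair = mutual_information(p_x1x2_y)     # X1 and X2 together: 1 bit

# Chain rule: I(X1; Y | X2) = I(X1, X2; Y) - I(X2; Y), and by symmetry
# I(X2; Y) equals I(X1; Y) in this toy problem.
mi_conditional = mi_pair - mi_single
print(mi_single, mi_pair, mi_conditional)
```

Here X1 is irrelevant by itself but has a conditional mutual information of 1 bit given X2.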

**3. In what respect is it possible to assess feature relevance from
observational data (i.e. without being able to control the values taken by
the features and designing experiments)? What will the limitations be?**

For "canned data" we might be able to observe the dependence between
features and target, but we cannot be sure that a feature showing no dependence
is actually irrelevant. Only designed experiments can allow us to explore
the space of values of the features in a systematic way and rule out dependencies
with confidence. Often observational data consists of a sub-optimal exploration
of input space because some variable values were not explored or the values
of some variables were not recorded at all. One should beware of confounding
factors: variables that have co-varied during the experiment. For instance,
all the samples from diseased patients were stored in certain conditions and all
the samples from healthy patients in a different condition. Storage is then a
confounding factor.

**4. Will causal relationships be determined from the feature selection
process? Is the inference of causality necessary to build a good predictor?**

Feature relevance was defined via the notion of statistical dependence/independence,
a generalization of the notion of correlation. There is no implication about
causality. Correlated events may be causally related in either direction
or result from a common cause but not be directly causally related. For example,
observing that a person has a rash and eats chocolate does not mean that
the chocolate diet caused the rash. The person may have eaten chocolate as
a compensation for the ugliness of the rash! Or there might be a common cause,
for example anxiety resulting from the preparation of an exam.

Causality is more difficult to infer than variable dependence. Luckily, we
do not need to infer causality to build good predictors. For example, protein
levels in blood can be used for cancer diagnosis. Some protein levels may
be causing the disease (like the lack of a given tumor suppressor), others
may be the consequence of the disease (like the presence of a given antibody).
But both may equally well be used to diagnose the disease.

**5. How can we define feature "usefulness"?**

A feature is useful if, when added to a subset of other features, it
results in a prediction performance improvement, or if when removed it results
in a performance loss.
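The definition above can be sketched as a leave-one-feature-out test. In this minimal example, the helper names `heldout_mse` and `usefulness`, the linear least-squares predictor, the synthetic data, and the choice of held-out mean-square error as the performance measure are all assumptions made for illustration.

```python
import numpy as np

def heldout_mse(X_train, y_train, X_test, y_test):
    """Fit least squares with a bias term on the training set; return test MSE."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ w - y_test) ** 2))

def usefulness(X_train, y_train, X_test, y_test, j):
    """Performance loss (increase in held-out MSE) when feature j is removed."""
    keep = [k for k in range(X_train.shape[1]) if k != j]
    full = heldout_mse(X_train, y_train, X_test, y_test)
    reduced = heldout_mse(X_train[:, keep], y_train, X_test[:, keep], y_test)
    return reduced - full

# Synthetic data: features 0 and 1 drive the target, feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

scores = [usefulness(X_tr, y_tr, X_te, y_te, j) for j in range(3)]
print(scores)  # large loss for features 0 and 1, near zero for feature 2
```

Note that the score of a feature depends on which other features are in the subset, so it is a property of the feature in context rather than of the feature alone.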

**6. Are features useful to make predictions always relevant, and vice
versa? Give examples.**

No: relevant features may not be useful, and useful features may be irrelevant.
For example, two relevant features may be redundant, so the removal of one of
them will not cause performance degradation. Note that we can remove either
redundant feature, so usefulness is not an intrinsic feature characteristic:
it depends on all other features. Conversely, irrelevant features may be useful.
A simple constant input in a linear model adds a bias that may result in
performance improvement, but a constant value is not "relevant" to the target.
A more elaborate example is the case of a nuisance factor adding noise to two
features, one of which ("f1") is "relevant" and the other ("f2") "irrelevant".
The nuisance factor is a systematic error that can be removed by subtracting f2
from f1, resulting in improved performance, even though f2 has no relevance to
the target.
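The nuisance-factor example can be simulated in a few lines. This is a sketch under assumed settings: a binary target in {-1, +1}, a Gaussian nuisance term shared by both features, and a least-squares fit evaluated on the training data. The helper `ls_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.choice([-1.0, 1.0], size=n)         # target
nuisance = rng.normal(scale=3.0, size=n)    # systematic error shared by both features
f1 = y + nuisance                           # relevant feature, corrupted by the nuisance
f2 = nuisance                               # no dependence on the target by itself

# f2 alone is (up to sampling noise) uncorrelated with the target.
corr_f2_y = np.corrcoef(f2, y)[0, 1]

def ls_mse(features, target):
    """Training mean-square error of a least-squares fit with a bias term."""
    A = np.column_stack(features + [np.ones(len(target))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.mean((A @ w - target) ** 2))

mse_f1 = ls_mse([f1], y)            # f1 alone: the nuisance dominates
mse_f1_f2 = ls_mse([f1, f2], y)     # f1 and f2: the fit can form f1 - f2 = y
print(corr_f2_y, mse_f1, mse_f1_f2)
```

Adding the "irrelevant" f2 lets the model cancel the nuisance, so the error drops to essentially zero in this idealized setting.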

**7. In what respects is mutual information a good or a bad choice to assess feature relevance?**

MI is a good choice because it does not make any assumption on the data
distribution and looks for dependence in an agnostic way. Therefore, it can
unravel non-linear dependencies. It is a bad choice because it is very difficult
to estimate from data, except in the case of 2-class classification problems
with binary features (both features and target are binary). For problems with
multivariate or continuous features and/or targets, it is preferable to use
ranking indexes based on simple statistics of the distribution (like mean and
variance). Such ranking indexes include the Pearson correlation coefficient and
Fisher's criterion.

**8. What is the Pearson correlation coefficient? Why is it a measure
of goodness of fit of the linear least-square model? Give the formula that
relates R^{2} and the F score (variance explained/residual variance).**

The Pearson correlation coefficient is R=cov(X,Y)/sqrt(var(X) var(Y)). For
least-squares linear regression, 1-R^{2} is equal to the ratio of the residual
variance to the total variance (i.e. it is the normalized mean-square-error),
thus R^{2} is a measure of goodness-of-fit for the linear least-square model.
It follows simply that 1+F=1/(1-R^{2}), i.e. F=R^{2}/(1-R^{2}), because
total variance = variance explained + residual variance.
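The relation between R^{2} and F can be worked out step by step. The symbols for the explained, residual and total variances below are notation introduced here for the derivation, with F defined as in the question (variance explained over residual variance).

$$
\begin{aligned}
\sigma^2_{\mathrm{tot}} &= \sigma^2_{\mathrm{exp}} + \sigma^2_{\mathrm{res}}, \\
R^2 &= \frac{\sigma^2_{\mathrm{exp}}}{\sigma^2_{\mathrm{tot}}},
\qquad
1 - R^2 = \frac{\sigma^2_{\mathrm{res}}}{\sigma^2_{\mathrm{tot}}}, \\
F &= \frac{\sigma^2_{\mathrm{exp}}}{\sigma^2_{\mathrm{res}}}
   = \frac{R^2}{1 - R^2}, \\
1 + F &= \frac{\sigma^2_{\mathrm{res}} + \sigma^2_{\mathrm{exp}}}{\sigma^2_{\mathrm{res}}}
       = \frac{\sigma^2_{\mathrm{tot}}}{\sigma^2_{\mathrm{res}}}
       = \frac{1}{1 - R^2}.
\end{aligned}
$$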

**9. Is correlation related to mutual information? Give examples in which
uncorrelated signals may have a high mutual information. Can correlated signals
have a low mutual information?**

Correlation is related to mutual information, but in some cases uncorrelated
variables can have a lot of mutual information (example of the sinusoid).
On the other hand, strongly correlated variables always have a lot of mutual
information (their dependence is linear). In the case where X and Y are jointly
Gaussian, there is a simple relation: MI = -(1/2) log(1-R^{2}).
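Both points can be checked numerically. The sketch below assumes that "the sinusoid" refers to something like Y = cos(X) with X uniform on a symmetric interval, which is one way to obtain a deterministic dependence with zero correlation. The histogram-based estimator `binned_mi` is a crude plug-in estimate chosen for illustration, and it reports natural-log units (nats) to match the Gaussian formula.

```python
import numpy as np

def binned_mi(x, y, bins=20):
    """Plug-in mutual information estimate (nats) from a 2-D histogram.
    Biased and sensitive to the number of bins; for illustration only."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

rng = np.random.default_rng(0)
n = 20000

# Uncorrelated but strongly dependent: Y is a deterministic function of X.
x = rng.uniform(-np.pi, np.pi, size=n)
y_cos = np.cos(x)
r_cos = np.corrcoef(x, y_cos)[0, 1]     # close to 0 by symmetry
mi_cos = binned_mi(x, y_cos)            # large

# Jointly Gaussian pair with correlation rho: compare with -(1/2) log(1 - R^2).
rho = 0.8
g1 = rng.normal(size=n)
g2 = rho * g1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
r_gauss = np.corrcoef(g1, g2)[0, 1]
mi_gauss_formula = -0.5 * np.log(1 - r_gauss**2)
mi_gauss_binned = binned_mi(g1, g2)

print(r_cos, mi_cos)
print(r_gauss, mi_gauss_formula, mi_gauss_binned)
```

The binned estimate for the Gaussian pair will only roughly agree with the closed-form value, which illustrates why mutual information is hard to estimate for continuous variables.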

**10. How are the S2N coefficient,
the Pearson correlation coefficient and the Fisher score related?**

They all essentially measure the same thing: the ratio of the "signal"
(the difference between the mean values of the two classes), and the "noise"
(the within-class standard deviation). For unbalanced classes, differences
arise because some criteria give more importance to the more abundant class,
either for the calculation of the signal or that of the noise. The S2N coefficient
is the best for unbalanced classes.
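To make the comparison concrete, here is a sketch that computes the three criteria for a single feature and a two-class target. The formulas are not given in these notes, so the definitions used are assumptions taken from common usage: S2N as the difference of class means over the sum of class standard deviations, the Fisher score as the squared difference of class means over the sum of class variances, and Pearson's R computed against labels coded as -1/+1. The function name `ranking_scores` is hypothetical.

```python
import numpy as np

def ranking_scores(x, y):
    """Return (S2N, Fisher score, Pearson R) for feature x and labels y in {-1, +1}.
    Definitions are assumed common forms, not taken from the lecture."""
    pos, neg = x[y == 1], x[y == -1]
    signal = pos.mean() - neg.mean()                  # difference of class means
    s2n = signal / (pos.std() + neg.std())            # noise: sum of class std devs
    fisher = signal**2 / (pos.var() + neg.var())      # noise: sum of class variances
    pearson = np.corrcoef(x, y)[0, 1]
    return s2n, fisher, pearson

# Balanced toy data: 100 examples per class, unit within-class noise.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 100)
x = 1.0 * y + rng.normal(size=y.size)

s2n, fisher, pearson = ranking_scores(x, y)
print(s2n, fisher, pearson)

# With balanced classes and these definitions, R^2 is a monotonic function of
# the Fisher score: R^2 = F / (F + 2).
print(pearson**2, fisher / (fisher + 2))
```

With balanced classes the three criteria rank features similarly. Repeating the experiment with very different class sizes shows where they start to diverge.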

**11. What is conditional relevance? Give examples of feature ranking
methods that take into account the "context" of other features.**

Conditional relevance is "relevance in the context of other features".
We discussed in class the Relief criterion.
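A minimal sketch of a Relief-style ranking is given below. It follows the classic nearest-hit / nearest-miss idea: a feature scores high if it differs at the nearest example of the other class and agrees at the nearest example of the same class. Whether this matches the exact form of the criterion presented in class is an assumption, and the function name `relief_weights`, the Manhattan distance, and the XOR-like toy data are illustrative choices.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief-style weights using one nearest hit and one nearest miss
    per example, with features rescaled to [0, 1]."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / span                    # normalize each feature
    # Pairwise Manhattan distances computed in the full feature space.
    dist = np.abs(Xn[:, None, :] - Xn[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)                     # exclude self-matches
    same = y[:, None] == y[None, :]
    hit = np.where(same, dist, np.inf).argmin(axis=1)      # nearest same-class
    miss = np.where(~same, dist, np.inf).argmin(axis=1)    # nearest other-class
    # Average per-feature difference to the miss minus difference to the hit.
    return np.abs(Xn - Xn[miss]).mean(axis=0) - np.abs(Xn - Xn[hit]).mean(axis=0)

# XOR-like toy problem: features 0 and 1 are relevant only jointly,
# feature 2 is noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

weights = relief_weights(X, y)
corr_0 = np.corrcoef(X[:, 0], y)[0, 1]   # univariate correlation is near zero
print(weights, corr_0)
```

Because the neighbors are found using all features at once, the two jointly relevant features receive larger weights than the noise feature, even though each has almost no individual correlation with the target.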