Questions on lecture 5

Filter methods
1.   What purpose(s) may be pursued when selecting features?
- Removing useless features (pure noise or "distracters") to save computing time and data storage.
- Improving prediction performance (there is less risk of overfitting if we start from a lower dimensional space)
- Understanding the process that generated the data (reverse engineering).
2.   How can one define feature "relevance"?  What is easier to define, relevance or irrelevance?
Relevance might be defined by the existence of a dependence between a feature and the target values (or "desired outcome", e.g. the classification labels). Statistical independence is easy to define: P(X,Y)=P(X)P(Y). So independence is easier to define than dependence. There are several ways of defining dependence and assessing it. One way is to measure the discrepancy between P(X,Y) and P(X)P(Y) with the KL divergence. This criterion is called "mutual information". Features can sometimes be irrelevant by themselves but relevant "in the context of others". Therefore we need to introduce a notion of conditional relevance, e.g. conditional mutual information.
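To make this concrete, here is a minimal sketch (not from the lecture; the helper name and the toy data are illustrative) that estimates mutual information for discrete variables as the KL divergence between the empirical joint P(X,Y) and the product of marginals P(X)P(Y):

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete
    variables: the KL divergence between the joint P(X, Y) and the
    product of marginals P(X) P(Y)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            p_x = np.mean(x == xv)                 # marginal of X
            p_y = np.mean(y == yv)                 # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# A feature tied to the label has high MI; pure noise has MI near 0.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
relevant = labels ^ (rng.random(1000) < 0.1)  # label with 10% flips
noise = rng.integers(0, 2, size=1000)
print(mutual_information(relevant, labels))  # about 0.37 nats
print(mutual_information(noise, labels))     # close to 0
```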
3.   In what respect is it possible to assess feature relevance from observational data (i.e. without being able to control the values taken by the features or to design experiments)? What will the limitations be?
For "canned data" we might be able to observe the dependence between features and target, but we cannot be sure that a feature showing no dependence is actually irrelevant. Only designed experiments can allow us to explore the space of values of the features in a systematic way and rule out dependencies with confidence. Often observational data consits of a sub-optimal exploration of input space because some variable values were not explored or the value of some variables not recorded at all.  One should beware of confounded factors: variables that have co-varied during the experiment. For instance all the disease patient samples were stored in certain conditions and all the healthy patient samples in a different condition. Storage is then a confounding factor.
4.   Will causal relationships be determined from the feature selection process? Is the inference of causality necessary to build a good predictor?
Feature relevance was defined via the notion of statistical dependence/independence, a generalization of the notion of correlation. There is no implication about causality. Correlated events may be causally related in either direction, or they may result from a common cause without being directly causally related. For example, observing that a person has a rash and eats chocolate does not mean that the chocolate diet caused the rash. The person may have eaten chocolate as compensation for the ugliness of the rash! Or there might be a common cause, for example anxiety resulting from preparing for an exam.
Causality is more difficult to infer than variable dependence. Luckily, we do not need to infer causality to build good predictors. For example, protein levels in blood can be used for cancer diagnosis. Some protein levels may be causing the disease (like the lack of a given tumor suppressor), others may be the consequence of the disease (like the presence of a given antibody). But both may equally well be used to diagnose the disease.
5.  How can we define feature "usefulness"?
A feature is useful if, when added to a subset of other features, it improves prediction performance, or if, when removed, it causes a performance loss.
6.  Are features that are useful for making predictions always relevant, and vice versa? Give examples.
No: useful features may be irrelevant and vice versa. For example, two useful features may be redundant, so the removal of one of them will not cause performance degradation. Note that we can remove either redundant feature, so usefulness is not an intrinsic feature characteristic: it depends on all the other features. Conversely, irrelevant features may be useful. A simple constant input in a linear model adds a bias that may result in performance improvement, but a constant value is not "relevant" to the target. A more elaborate example is the case of a nuisance factor adding noise to two features, one of which, "f1", is relevant and the other, "f2", irrelevant. The nuisance factor is a systematic error that can be removed by subtracting f2 from f1, resulting in improved performance, even though f2 has no relevance to the target.
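A small numeric sketch of the nuisance-factor example (illustrative code, not from the lecture): f1 carries the signal corrupted by the nuisance, f2 records only the nuisance, and their difference recovers the signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
target = rng.normal(size=n)
nuisance = rng.normal(scale=3.0, size=n)  # strong systematic error

f1 = target + nuisance   # "relevant" but badly corrupted
f2 = nuisance            # "irrelevant": no dependence on the target

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(corr(f1, target))        # weak: the nuisance drowns the signal
print(corr(f2, target))        # ~0: f2 alone is useless
print(corr(f1 - f2, target))   # ~1: f2 is useful in context
```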
7.   In what respects is mutual information a good or a bad choice to assess feature relevance?
MI is a good choice because it does not make any assumption about the data distribution and looks for dependence in an agnostic way. Therefore, it can unravel non-linear dependencies. It is a bad choice because it is very difficult to estimate from data, except in the case of two-class classification problems with binary features (both the features and the target are binary). For problems with multivariate or continuous features and/or targets, it is preferable to use ranking indexes based on simple statistics of the distribution (like mean and variance). Such ranking indexes include the Pearson correlation coefficient and Fisher's criterion.
8.   What is the Pearson correlation coefficient? Why is it a measure of goodness of fit of the linear least-square model? Give the formula that relates R2 and the F score (variance explained/residual variance).
The Pearson correlation coefficient is R = cov(X,Y)/sqrt(var(X) var(Y)). For the least-square linear regression, 1-R2 is equal to the ratio of the residual variance over the total variance (i.e. it is the normalized mean-square-error), thus R2 is a measure of goodness-of-fit for the linear least-square model. Since total variance = variance explained + residual variance, it follows that 1+F = total variance / residual variance = 1/(1-R2).
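The identity is easy to check numerically. Here is an illustrative sketch (variable names are mine) that fits a least-square line and compares 1+F with 1/(1-R2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # linear signal plus noise

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Least-square fit and variance decomposition
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
explained = np.var(y_hat)        # variance explained by the fit
residual = np.var(y - y_hat)     # residual variance
f_score = explained / residual

print(1 + f_score)       # equals total variance / residual variance...
print(1 / (1 - r**2))    # ...which is 1/(1 - R2)
```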
9.   Is correlation related to mutual information? Give examples in which uncorrelated signals may have a high mutual information? Can correlated signals have a low mutual information?
Correlation is related to mutual information, but in some cases uncorrelated variables can have a lot of mutual information (example of the sinusoid). On the other hand, correlated variables always have nonzero mutual information (their dependence is at least partly linear). In the case where X and Y are Gaussian distributed, there is a simple relation: MI = -(1/2) log(1-R2).
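The sinusoid example can be reproduced with a crude binned MI estimate (illustrative sketch; the estimator and bin count are my choices, not from the lecture). With X uniform over a full period and Y = cos(X), the correlation is near zero while the dependence, and hence the MI, is strong:

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Crude MI estimate (in nats) by discretizing both variables."""
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy /= p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=20000)
y = np.cos(x)  # full period: deterministic, yet uncorrelated with x

print(np.corrcoef(x, y)[0, 1])  # ~0: no linear dependence
print(binned_mi(x, y))          # large: y is a function of x
```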

10.  How are the S2N coefficient, the Pearson correlation coefficient and the Fisher score related?
They all essentially measure the same thing: the ratio of the "signal" (the difference between the mean values of the two classes) to the "noise" (the within-class standard deviation). For unbalanced classes differences arise because some criteria give more importance to the more abundant class, either in the calculation of the signal or in that of the noise. The S2N coefficient is the best for unbalanced classes.
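For reference, an illustrative sketch of the three criteria on a toy two-class problem (the formulas used are the standard ones: Golub's S2N divides by the sum of the standard deviations, Fisher's criterion by the sum of the variances):

```python
import numpy as np

def class_stats(x, y):
    """Per-class mean and standard deviation for a binary label y."""
    x_pos, x_neg = x[y == 1], x[y == 0]
    return x_pos.mean(), x_pos.std(), x_neg.mean(), x_neg.std()

def s2n(x, y):
    m1, s1, m0, s0 = class_stats(x, y)
    return (m1 - m0) / (s1 + s0)             # Golub's signal-to-noise

def fisher(x, y):
    m1, s1, m0, s0 = class_stats(x, y)
    return (m1 - m0) ** 2 / (s1**2 + s0**2)  # Fisher's criterion

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]           # Pearson correlation

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = y + rng.normal(size=500)  # mean shift of 1 between the classes

print(s2n(x, y), fisher(x, y), pearson(x, y))
```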
11.  What is conditional relevance? Give examples of feature ranking methods that take into account the "context" of other features.
Conditional relevance is "relevance in the context of other features". We discussed in class the Relief criterion, which scores each feature by how well it separates nearby samples of different classes, with neighbors found using all features jointly.
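A minimal sketch of the basic Relief idea for binary classification (illustrative code, not necessarily the exact variant given in class): each feature is rewarded when a randomly picked sample differs more from its nearest miss (other class) than from its nearest hit (same class) on that feature:

```python
import numpy as np

def relief(X, y, n_iter=100, rng=None):
    """Basic Relief weights for binary classification. A feature can
    score well through its context, because the nearest hit and miss
    are found with distances computed over all features jointly."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distance to sample i
        dist[i] = np.inf                     # exclude the sample itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest hit
        miss = np.argmin(np.where(other, dist, np.inf))  # nearest miss
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = np.column_stack([y + rng.normal(size=300),   # relevant feature
                     rng.normal(size=300)])      # irrelevant feature
print(relief(X, y))  # the first weight comes out clearly larger
```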