Questions on lecture 6

Note: some answers were drawn from
1. What is a random variable?
A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated. There are two types of random variable - discrete and continuous. A random variable has either an associated probability distribution (discrete random variable) or probability density function (continuous random variable). A ranking index R assessing the dependence between a feature and the target is a random variable.

2. What are the definitions and properties of: expected value, variance, standard deviation, coefficient of variation?
- The expected value E(X) (or population mean µ) of a random variable indicates its average or central value. For a constant a, E(aX)=aE(X). For two random variables X and Y, E(X+Y)=E(X)+E(Y). If X and Y are independent, E(XY)=E(X)E(Y).
- The variance of the random variable X indicates its spread and is defined to be: var(X)=E[(X-E(X))^2]=E(X^2)-E(X)^2. For two constants a and b, var(aX+b)=a^2 var(X). For two independent random variables X and Y, var(X+Y)=var(X)+var(Y).
- The standard deviation (stdev(X)) is the square root of the variance.
- The coefficient of variation is the ratio stdev(X)/E(X).
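These definitions and properties can be checked numerically. The sketch below (a simulation, not part of the lecture) draws many samples of a Uniform(0, 10) random variable and verifies the variance identity and the scaling property var(aX+b)=a^2 var(X) on the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many draws of X ~ Uniform(0, 10);
# its true mean is 5 and its true variance is 100/12 ≈ 8.33.
x = rng.uniform(0, 10, size=100_000)

mean = x.mean()           # estimate of E(X)
var = x.var()             # estimate of var(X)
stdev = np.sqrt(var)      # standard deviation
cv = stdev / mean         # coefficient of variation

# Identity var(X) = E(X^2) - E(X)^2, checked on the sample.
assert np.isclose(var, (x**2).mean() - mean**2)

# Property var(aX + b) = a^2 var(X), with a=3, b=7.
assert np.isclose((3 * x + 7).var(), 9 * var)
```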

3.  What is an estimator?
An estimator is a quantity calculated from the sample data, which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.
An estimator is a random variable.
Not all estimators are "equal": some are more powerful than others. Some are biased: for a given sample size, their expected value is not the unknown quantity that we want to estimate. Some have a large variance.
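A classic illustration of bias (a simulation sketch, not from the lecture): the sample variance computed with denominator n underestimates the population variance, while the denominator n-1 gives an unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw many small samples (n=5) from a population with variance 1 and
# compare two variance estimators: dividing by n (biased) vs n-1 (unbiased).
n, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, n))

# E[biased] ≈ (n-1)/n * sigma^2 = 0.8, while E[unbiased] ≈ sigma^2 = 1.0
biased = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()
```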
4.  What is a probability distribution?
The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values.
5.  What is a cumulative distribution function (cdf)?
All random variables (discrete and continuous) have a cumulative distribution function. It is a function giving the probability that the random variable X is less than or equal to x, for every value x. Formally, the cumulative distribution function F(x) is defined to be: F(x)=Proba(X<=x), -Inf<x<+Inf. The cdf is obtained by integrating the pdf.
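The defining properties of a cdf (non-decreasing, tending to 0 at -Inf and to 1 at +Inf) can be seen on an empirical cdf, estimated as the fraction of draws that are <= x. A small sketch, using a standard Gaussian as an example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical cdf of a standard Gaussian: F(x) = Proba(X <= x),
# estimated as the fraction of draws that are <= x.
draws = rng.normal(size=100_000)

def ecdf(x):
    return np.mean(draws <= x)

# F is non-decreasing, tends to 0 at -Inf and 1 at +Inf,
# and F(0) ≈ 0.5 for a distribution symmetric around zero.
```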

6.  What is a probability density function (pdf)?
The derivative of the cdf.
If you have doubts about the definitions of the Gaussian pdf and the central limit theorem, see
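The central limit theorem can be illustrated by simulation (a sketch, not from the lecture): averages of n i.i.d. Uniform(0,1) draws are approximately Gaussian with mean 1/2 and variance 1/(12n), so about 68% of them fall within one standard deviation of the mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# CLT sketch: averages of n i.i.d. Uniform(0,1) draws are approximately
# Gaussian with mean 1/2 and variance 1/(12 n).
n, trials = 30, 100_000
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

# For a Gaussian, ~68.3% of the mass lies within one standard deviation.
sigma = np.sqrt(1 / (12 * n))
within_one_sigma = np.mean(np.abs(means - 0.5) < sigma)
```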

7.  What are the basic "ingredients" of a statistical test? What are possible outcomes?
1) A null hypothesis H0 that we want to test (and possibly one or several alternative hypotheses).
2) A test statistic T, a random variable such that if H0 is true, the expected value of T is zero.
3) The distribution (cdf) of T, Proba(T<=t), if H0 is true.
4) A risk value alpha and the corresponding threshold talpha such that alpha=Proba(T>talpha).
[This is for a one-sided test where the risk is blocked on one side; for a two-sided test the risk is equally spread on both sides of the cdf.]
5) A realization t of T for a given population sample.
Then, if t>talpha, we reject H0 with risk alpha of being wrong. In the opposite case, the conclusion is weaker: we do not reject H0. In hypothesis testing, we never "accept" H0. [For a two-sided test, we reject if t>talpha/2 or t<-talpha/2.]
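The five ingredients above can be sketched with a one-sided z-test of H0: "the population mean is 0", assuming (for simplicity of the example) that the population standard deviation is known to be 1. The data here are deliberately drawn with a true mean of 0.8, so the test should reject:

```python
import numpy as np

rng = np.random.default_rng(4)

# 1) H0: "the population mean is 0" (known stdev 1 is an assumption
#    made to keep this sketch simple).
sample = rng.normal(0.8, 1.0, size=100)   # data actually have mean 0.8

# 2) Test statistic T: under H0, T ~ N(0, 1), so E(T | H0) = 0.
t = sample.mean() / (1.0 / np.sqrt(len(sample)))

# 3)-4) Risk alpha and threshold t_alpha with alpha = Proba(T > t_alpha):
#    for the standard Gaussian, the 95th percentile is about 1.645.
alpha = 0.05
t_alpha = 1.645

# 5) Decision: reject H0 (with risk alpha of being wrong) if t > t_alpha.
reject = t > t_alpha
```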
8.  What is a pvalue? What does a small pvalue indicate about the null hypothesis?
Given a test statistic T and a realization t, the pvalue is pval=Proba(T>t) [one-sided test]. A small pvalue means that a value as extreme as t is unlikely if H0 is true; small pvalues shed doubt on the null hypothesis.

Assessment methods
1.  What is the definition of a probably approximately irrelevant feature?
For a relevance index R, Proba(R>epsilon)<delta, for epsilon and delta positive constants.
2.  If we want to test the statistical significance of the relevance of a feature, what kind of test can we perform? State the null hypothesis. What is the null distribution? What is the alternative distribution?
We can perform a hypothesis test with null hypothesis: "the feature is irrelevant". The null distribution is the distribution of irrelevant features for the given ranking index. The alternative distribution is the distribution of relevant features. Both are usually unknown, but the null distribution of random features is easier to model. We can for example use "random probes" to estimate it.
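The random-probe idea can be sketched as follows (an illustration, with |Pearson correlation| as the ranking index, which is an assumption of this sketch): shuffled copies of the real features are irrelevant by construction, so their scores sample the null distribution, and real features scoring above all probes are selected.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: 10 features, only feature 0 is relevant to the target y.
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * rng.normal(size=n)

# Random probes: each column shuffled independently, destroying any
# relation to y -- probes are irrelevant by construction.
probes = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])

def index(col):
    # Ranking index: absolute Pearson correlation with the target.
    return abs(np.corrcoef(col, y)[0, 1])

scores = np.array([index(X[:, j]) for j in range(d)])
null_scores = np.array([index(probes[:, j]) for j in range(d)])

# Select features whose score exceeds the largest probe score.
selected = np.where(scores > null_scores.max())[0]
```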
3.  Give examples of test statistics used to test feature relevance. What is being used as ranking index?
The T statistic, the ANOVA statistic (F statistic), the Wilcoxon-Mann-Whitney statistic. The pvalue is the ranking index. For one-sided tests, it gives the same ranking as the test statistic. Some test statistics have positive and negative values; zero corresponds to irrelevant features, large absolute values correspond to relevant features; the sign indicates the direction of the correlation.
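These statistics are available in scipy.stats (assuming scipy is at hand). A sketch on a two-class toy feature whose class-1 values are shifted upward, so all three tests should flag it as relevant; note that for two groups the F statistic is the square of the T statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Two-class toy feature: class 1 values are shifted upward by 1,
# so the feature should look relevant to every test below.
a = rng.normal(0.0, 1.0, size=50)   # feature values for class 0
b = rng.normal(1.0, 1.0, size=50)   # feature values for class 1

t_stat, t_pval = stats.ttest_ind(a, b)      # T statistic
f_stat, f_pval = stats.f_oneway(a, b)       # ANOVA F statistic
u_stat, u_pval = stats.mannwhitneyu(a, b)   # Wilcoxon-Mann-Whitney

# Small pvalues -> the feature is ranked as relevant.
```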
4.  What is the false positive rate (FPR) for feature selection?
This is the fraction of all the irrelevant features that have been selected. It may be approximated by the fraction of all the probes that have been selected. If the distribution of irrelevant features is known, it is also the pvalue at the selection threshold.
5. In the case of multiple testing, does the FPR (or pvalue) estimate correctly the fraction of wrong decisions?
The FPR correctly estimates the type I error (the fraction of incorrect rejections of the null hypothesis, that is, of incorrect decisions that the feature is not irrelevant) if a single feature is being tested (assuming we could test it multiple times by drawing multiple samples of the same size). It does not estimate the fraction of incorrect decisions when multiple (different, independent) features are tested: with multiple testing, the probability of at least one false positive grows with the number of tests. The Bonferroni correction consists in replacing pval by n pval, where n is the number of features tested.
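The effect of the Bonferroni correction can be seen on a simulation (a sketch, not from the lecture): pvalues of irrelevant features are Uniform(0,1), so testing n of them at level alpha yields about alpha*n false positives without correction, and almost none after replacing pval by n pval.

```python
import numpy as np

rng = np.random.default_rng(7)

# Multiple-testing sketch: n irrelevant features tested at level alpha.
# Under H0, pvalues are Uniform(0,1).
n, alpha = 1000, 0.05
pvals = rng.uniform(0, 1, size=n)

raw_rejections = np.sum(pvals < alpha)             # ≈ alpha * n false positives
bonferroni_rejections = np.sum(n * pvals < alpha)  # ≈ 0 false positives
```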
6.  What is the false discovery rate (FDR)?
This is the ratio of the number of irrelevant features selected over the total number of features selected nsc. It is bounded by FDR <= FPR n/nsc, where n is the number of features tested. Setting a threshold on the FDR rather than on the FPR amounts to correcting the pvalues, replacing them by n pval/nsc.
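The corrected value n pval/nsc can be sketched numerically (an illustration with made-up pvalues): with a few strongly relevant features among many irrelevant ones, selecting everything below a pvalue threshold gives nsc selected features, and n*threshold/nsc estimates the FDR of that selection.

```python
import numpy as np

rng = np.random.default_rng(8)

# FDR sketch: n features with known pvalues, of which 20 are strongly
# relevant (tiny pvalues) and the rest are irrelevant (Uniform pvalues).
n = 1000
relevant_p = rng.uniform(0, 1e-4, size=20)
irrelevant_p = rng.uniform(0, 1, size=n - 20)
pvals = np.concatenate([relevant_p, irrelevant_p])

threshold = 1e-3
nsc = np.sum(pvals < threshold)      # number of selected features
fdr_estimate = n * threshold / nsc   # corrected value n * pval / nsc
```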