Questions on lecture 6

Note: some answers were drawn from
1. What is a random variable?
A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated. There are two types of random variable - discrete and continuous. A random variable has either an associated probability distribution (discrete random variable) or probability density function (continuous random variable). A ranking index R assessing the dependence between a feature and the target is a random variable.

2. What are the definitions and properties of: expected value, variance, standard deviation, coefficient of variation?
- The expected value E(X) (or population mean µ) of a random variable indicates its average or central value. For a constant a, E(aX)=aE(X). For two random variables X and Y, E(X+Y)=E(X)+E(Y). If X and Y are independent, E(XY)=E(X)E(Y).
- The variance of the random variable X indicates its spread and is defined to be: var(X)=E[(X-E(X))^2]=E(X^2)-E(X)^2. For two constants a and b, var(aX+b)=a^2 var(X). For two independent random variables X and Y, var(X+Y)=var(X)+var(Y).
- The standard deviation (stdev(X)) is the square root of the variance.
- The coefficient of variation is the ratio stdev(X)/E(X).
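These definitions and properties can be checked numerically. The sketch below (a simulation, not part of the lecture) draws many samples of a Uniform(0, 10) random variable and verifies the variance identity and the scaling property var(aX+b)=a^2 var(X) on the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many draws of X ~ Uniform(0, 10);
# its true mean is 5 and its true variance is 100/12 ≈ 8.33.
x = rng.uniform(0, 10, size=100_000)

mean = x.mean()           # estimate of E(X)
var = x.var()             # estimate of var(X)
stdev = np.sqrt(var)      # standard deviation
cv = stdev / mean         # coefficient of variation

# Identity var(X) = E(X^2) - E(X)^2, checked on the sample.
assert np.isclose(var, (x**2).mean() - mean**2)

# Property var(aX + b) = a^2 var(X), with a=3, b=7.
assert np.isclose((3 * x + 7).var(), 9 * var)
```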

3.  What is an estimator?
An estimator is a quantity calculated from the sample data, which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.
An estimator is a random variable.
Not all estimators are "equal": some are more powerful than others. Some are biased: for a given sample size, their expected value is not the unknown quantity that we want to estimate. Some have a large variance.
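A classic illustration of bias (a simulation sketch, not from the lecture): the sample variance computed with denominator n underestimates the population variance, while the denominator n-1 gives an unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw many small samples (n=5) from a population with variance 1 and
# compare two variance estimators: dividing by n (biased) vs n-1 (unbiased).
n, trials = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(trials, n))

# E[biased] ≈ (n-1)/n * sigma^2 = 0.8, while E[unbiased] ≈ sigma^2 = 1.0
biased = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()
```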
4.  What is a probability distribution?
The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values.
5.  What is a cumulative distribution function (cdf)?
All random variables (discrete and continuous) have a cumulative distribution function. It is a function giving the probability that the random variable X is less than or equal to x, for every value x. Formally, the cumulative distribution function F(x) is defined to be: F(x)=Proba(X<=x), -Inf<x<+Inf. The cdf is obtained by integrating the pdf.
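The defining properties of a cdf (non-decreasing, tending to 0 at -Inf and to 1 at +Inf) can be seen on an empirical cdf, estimated as the fraction of draws that are <= x. A small sketch, using a standard Gaussian as an example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical cdf of a standard Gaussian: F(x) = Proba(X <= x),
# estimated as the fraction of draws that are <= x.
draws = rng.normal(size=100_000)

def ecdf(x):
    return np.mean(draws <= x)

# F is non-decreasing, tends to 0 at -Inf and 1 at +Inf,
# and F(0) ≈ 0.5 for a distribution symmetric around zero.
```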

6.  What is a probability density function (pdf)?
The derivative of the cdf.
If you have doubts about the definitions of the Gaussian pdf and the central limit theorem, see
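The central limit theorem can be illustrated by simulation (a sketch, not from the lecture): averages of n i.i.d. Uniform(0,1) draws are approximately Gaussian with mean 1/2 and variance 1/(12n), so about 68% of them fall within one standard deviation of the mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# CLT sketch: averages of n i.i.d. Uniform(0,1) draws are approximately
# Gaussian with mean 1/2 and variance 1/(12 n).
n, trials = 30, 100_000
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

# For a Gaussian, ~68.3% of the mass lies within one standard deviation.
sigma = np.sqrt(1 / (12 * n))
within_one_sigma = np.mean(np.abs(means - 0.5) < sigma)
```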

7.  What are the basic "ingredients" of a statistical test? What are possible outcomes?
1) A null hypothesis H0 that we want to test (and possibly one or several alternative hypotheses).
2) A test statistic T, a random variable such that if H0 is true, the expected value of T is zero.
3) The distribution (cdf) of T, Proba(T<=t), if H0 is true.
4) A risk value alpha and the corresponding threshold talpha such that alpha=Proba(T>talpha).
[This is for a one-sided test where the risk is blocked on one side; for a two-sided test the risk is equally spread on both sides of the cdf.]
5) A realization t of T for a given population sample.
Then, if t>talpha, we reject H0 with risk alpha of being wrong. In the opposite case, the conclusion is weaker: we do not reject H0. In hypothesis testing, we never "accept" H0. [For a two-sided test, we reject if t>talpha/2 or t<-talpha/2.]
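The five ingredients above can be sketched with a one-sided z-test of H0: "the population mean is 0", assuming (for simplicity of the example) that the population standard deviation is known to be 1. The data here are deliberately drawn with a true mean of 0.8, so the test should reject:

```python
import numpy as np

rng = np.random.default_rng(4)

# 1) H0: "the population mean is 0" (known stdev 1 is an assumption
#    made to keep this sketch simple).
sample = rng.normal(0.8, 1.0, size=100)   # data actually have mean 0.8

# 2) Test statistic T: under H0, T ~ N(0, 1), so E(T | H0) = 0.
t = sample.mean() / (1.0 / np.sqrt(len(sample)))

# 3)-4) Risk alpha and threshold t_alpha with alpha = Proba(T > t_alpha):
#    for the standard Gaussian, the 95th percentile is about 1.645.
alpha = 0.05
t_alpha = 1.645

# 5) Decision: reject H0 (with risk alpha of being wrong) if t > t_alpha.
reject = t > t_alpha
```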
8.  What is a pvalue? What does a small pvalue indicate about the null hypothesis?
Given a test statistic T and a realization t, the pvalue is pval=Proba(T>t) [one-sided test]. A small pvalue means that a value as extreme as t is unlikely if H0 is true; small pvalues shed doubt on the null hypothesis.

Assessment methods
1.  What is the definition of a probably approximately irrelevant feature?
For a relevance index R, Proba(R>epsilon)<delta, for epsilon and delta positive constants.
2.  If we want to test the statistical significance of the relevance of a feature, what kind of test can we perform? State the null hypothesis. What is the null distribution? What is the alternative distribution?
We can perform a hypothesis test with null hypothesis: "the feature is irrelevant". The null distribution is the distribution of irrelevant features for the given ranking index. The alternative distribution is the distribution of relevant features. Both are usually unknown, but the null distribution of random features is easier to model. We can for example use "random probes" to estimate it.
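The random-probe idea can be sketched as follows (an illustration, with |Pearson correlation| as the ranking index, which is an assumption of this sketch): shuffled copies of the real features are irrelevant by construction, so their scores sample the null distribution, and real features scoring above all probes are selected.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: 10 features, only feature 0 is relevant to the target y.
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * rng.normal(size=n)

# Random probes: each column shuffled independently, destroying any
# relation to y -- probes are irrelevant by construction.
probes = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])

def index(col):
    # Ranking index: absolute Pearson correlation with the target.
    return abs(np.corrcoef(col, y)[0, 1])

scores = np.array([index(X[:, j]) for j in range(d)])
null_scores = np.array([index(probes[:, j]) for j in range(d)])

# Select features whose score exceeds the largest probe score.
selected = np.where(scores > null_scores.max())[0]
```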
3.  Give examples of test statistics used to test feature relevance. What is being used as ranking index?
The T statistic, the ANOVA statistic (F statistic), the Wilcoxon-Mann-Whitney statistic. The pvalue is the ranking index. For one-sided tests, it gives the same ranking as the test statistic. Some test statistics have positive and negative values; zero corresponds to irrelevant features, large absolute values correspond to relevant features; the sign indicates the direction of the correlation.
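These statistics are available in scipy.stats (assuming scipy is at hand). A sketch on a two-class toy feature whose class-1 values are shifted upward, so all three tests should flag it as relevant; note that for two groups the F statistic is the square of the T statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Two-class toy feature: class 1 values are shifted upward by 1,
# so the feature should look relevant to every test below.
a = rng.normal(0.0, 1.0, size=50)   # feature values for class 0
b = rng.normal(1.0, 1.0, size=50)   # feature values for class 1

t_stat, t_pval = stats.ttest_ind(a, b)      # T statistic
f_stat, f_pval = stats.f_oneway(a, b)       # ANOVA F statistic
u_stat, u_pval = stats.mannwhitneyu(a, b)   # Wilcoxon-Mann-Whitney

# Small pvalues -> the feature is ranked as relevant.
```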
4.  What is the false positive rate (FPR) for feature selection?
This is the fraction of all the irrelevant features that have been selected. It may be approximated by the fraction of all the probes that have been selected. If the distribution of irrelevant features is known, it is also the pvalue at the selection threshold.
5. In the case of multiple testing, does the FPR (or pvalue) estimate correctly the fraction of wrong decisions?
The FPR correctly estimates the type I error (the fraction of incorrect rejections of the null hypothesis, that is, of incorrect decisions that the feature is not irrelevant) if a single feature is being tested (assuming we could test it multiple times by drawing multiple samples of the same size). It does not estimate the fraction of incorrect decisions when multiple (different, independent) features are tested: with multiple testing, the probability of at least one false positive grows with the number of tests. The Bonferroni correction consists in replacing pval by n pval, where n is the number of features tested.
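The effect of the Bonferroni correction can be seen on a simulation (a sketch, not from the lecture): pvalues of irrelevant features are Uniform(0,1), so testing n of them at level alpha yields about alpha*n false positives without correction, and almost none after replacing pval by n pval.

```python
import numpy as np

rng = np.random.default_rng(7)

# Multiple-testing sketch: n irrelevant features tested at level alpha.
# Under H0, pvalues are Uniform(0,1).
n, alpha = 1000, 0.05
pvals = rng.uniform(0, 1, size=n)

raw_rejections = np.sum(pvals < alpha)             # ≈ alpha * n false positives
bonferroni_rejections = np.sum(n * pvals < alpha)  # ≈ 0 false positives
```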
6.  What is the false discovery rate (FDR)?
This is the ratio of the number of irrelevant features selected over the total number of features selected nsc. It is bounded by FDR <= FPR n/nsc, where n is the number of features tested. Setting a threshold on the FDR rather than on the FPR amounts to correcting the pvalues, replacing them by n pval/nsc.
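The corrected value n pval/nsc can be sketched numerically (an illustration with made-up pvalues): with a few strongly relevant features among many irrelevant ones, selecting everything below a pvalue threshold gives nsc selected features, and n*threshold/nsc estimates the FDR of that selection.

```python
import numpy as np

rng = np.random.default_rng(8)

# FDR sketch: n features with known pvalues, of which 20 are strongly
# relevant (tiny pvalues) and the rest are irrelevant (Uniform pvalues).
n = 1000
relevant_p = rng.uniform(0, 1e-4, size=20)
irrelevant_p = rng.uniform(0, 1, size=n - 20)
pvals = np.concatenate([relevant_p, irrelevant_p])

threshold = 1e-3
nsc = np.sum(pvals < threshold)      # number of selected features
fdr_estimate = n * threshold / nsc   # corrected value n * pval / nsc
```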