Questions on lecture 10

Learning and generalization (revisions)
14.    What is the likelihood?
Predictive models learn the mapping y = f(x) or, more generally, P(y|x). Conversely, generative models (which are probabilistic) learn the opposite, namely to predict the density of x given y, P(x|y). In the maximum likelihood framework, we assume that the data was produced by a model. The model has some parameters. The goodness-of-fit or "likelihood" is the probability that the data was produced by the model, for a given choice of parameters:
Likelihood = P(data | model).
15.    What does "maximum likelihood" mean?
The maximum likelihood method of inference chooses the set of model parameters that maximizes the likelihood.
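As a small sketch of this idea (toy data and a Gaussian model with known variance are assumptions, not from the lecture), one can scan candidate parameter values and keep the one with the highest likelihood:

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. data under a Gaussian model N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [1.2, 0.8, 1.5, 1.1, 0.9]    # toy data (assumed)
# Scan candidate values of mu; for a Gaussian, the maximum-likelihood
# estimate of mu is the sample mean.
candidates = [i / 100 for i in range(0, 301)]
mu_ml = max(candidates, key=lambda m: log_likelihood(m, data))
print(mu_ml)   # close to sum(data)/len(data) = 1.1
```

In practice the maximization is done analytically or by gradient methods rather than a grid scan; the scan just makes the definition concrete.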

17.    What is Bayesian learning?
In Bayesian learning, one assumes that the data was drawn from a double random process: first a function f is drawn according to a "prior" distribution P(f), then data pairs D = {(xi, f(xi))} are drawn. In Bayesian learning, one seeks to estimate, for a new example x, the probability P(y|x,D) by integrating over all possible choices of f, using P(f|D): P(y|x,D) = integral P(y|x,f) dP(f|D).
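A minimal sketch of this integral, assuming a "function space" of just two hypotheses (a coin whose bias is either 0.3 or 0.7, a made-up example): the prediction averages each hypothesis, weighted by its posterior.

```python
# Two candidate "functions" f = P(heads), each with prior probability 0.5.
priors = {0.3: 0.5, 0.7: 0.5}
data = [1, 1, 0, 1]                 # observed flips (1 = heads), assumed

def likelihood(f, D):
    """P(D|f) for i.i.d. coin flips."""
    p = 1.0
    for y in D:
        p *= f if y == 1 else (1 - f)
    return p

# Posterior P(f|D) proportional to P(D|f) P(f)
unnorm = {f: likelihood(f, data) * p for f, p in priors.items()}
Z = sum(unnorm.values())
posterior = {f: u / Z for f, u in unnorm.items()}

# Predictive probability of heads: the "integral" over f reduces to a sum
p_heads = sum(f * posterior[f] for f in posterior)
print(round(p_heads, 4))
```

With a continuous prior over f, the sum becomes the integral P(y|x,D) = integral P(y|x,f) dP(f|D) from the answer above.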
18.    What is a prior? What is a posterior?
P(f) is the prior on function space. Our revised opinion, after we have seen the data, is called the posterior P(f|D).
19.    What is Maximum A Posteriori estimation (MAP)?
The Bayesian approach requires computing a difficult integral. Instead, we can select the single function f that maximizes P(f|D). This is the Maximum A Posteriori (MAP) approach. We use Bayes' formula P(f|D)P(D) = P(D|f)P(f), so that we can replace the maximization of P(f|D) by that of P(D|f)P(f), where P(D|f) is the likelihood and P(f) the prior.
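A sketch of MAP estimation for a toy scalar parameter (Gaussian likelihood with unit variance and a Gaussian prior N(0, tau^2) are assumptions for illustration): maximizing log P(D|mu) + log P(mu) pulls the estimate away from the maximum-likelihood value, toward the prior.

```python
data = [1.2, 0.8, 1.5, 1.1, 0.9]    # toy data (assumed)
tau2 = 0.25                          # prior variance (assumed)

def log_posterior(mu):
    """log P(D|mu) + log P(mu), up to additive constants."""
    log_lik = sum(-(x - mu)**2 / 2 for x in data)
    log_prior = -mu**2 / (2 * tau2)
    return log_lik + log_prior

candidates = [i / 1000 for i in range(0, 2001)]
mu_map = max(candidates, key=log_posterior)
# Closed form for this conjugate case: mu_map = sum(data) / (n + 1/tau2),
# i.e. the ML estimate 1.1 shrunk toward the prior mean 0.
print(round(mu_map, 3))
```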
21.    What is “regularization”? What is a “ridge”?
Regularization is a means of solving "ill-posed" problems, such as inverting a matrix which is not invertible. The penalty term lambda ||w||^2 in ridge regression is called a regularizer. The positive coefficient lambda is called the "ridge".
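A minimal sketch with a single-feature linear model y = w*x and no intercept (toy numbers, for illustration only): the ridge lambda is added to the quantity being "inverted", which keeps the fit well-posed even when sum(x^2) is tiny or zero, and shrinks w.

```python
xs = [0.0, 1.0, 2.0, 3.0]    # toy inputs (assumed)
ys = [0.1, 0.9, 2.1, 2.9]    # toy targets (assumed)

def ridge_fit(xs, ys, lam):
    """w minimizing sum (y - w*x)^2 + lam * w^2 (1-D closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(ridge_fit(xs, ys, 0.0))    # ordinary least squares
print(ridge_fit(xs, ys, 10.0))   # a larger ridge shrinks w toward 0
```

In the multivariate case the denominator becomes the matrix X^T X + lambda I, which is invertible for any lambda > 0 even when X^T X is singular.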

Learning and generalization (new)
24. What is the correspondence between maximum likelihood and empirical risk minimization?
The likelihood of a model can be converted to a risk functional via the formula:
R[f] = - log P(D|f)
For i.i.d. data, P(D|f) can be decomposed as the product of the P(xi, yi|f). Since the risk is the sum of the losses over the example patterns, the loss function is then given by:
L(f(xi), yi) = - log P(xi, yi|f)
Conversely, the risk may be interpreted as an "energy". A likelihood can then be defined following the Boltzmann distribution P(D|f) = exp(-R[f]/T), where T is a "temperature" parameter.
With this correspondence, minimizing the risk is equivalent to maximizing the likelihood.
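The correspondence can be checked numerically on a toy example (Gaussian noise with sigma = 1 is an assumption): the negative log-likelihood equals the squared-error empirical risk plus a constant that does not depend on f.

```python
import math

preds = [1.0, 2.0, 3.0]      # model outputs f(xi) (assumed)
targets = [1.1, 1.9, 3.2]    # observed yi (assumed)

# -log P(D|f) for i.i.d. Gaussian noise N(0, 1)
neg_log_lik = sum(0.5 * math.log(2 * math.pi) + (t - p)**2 / 2
                  for p, t in zip(preds, targets))
# Empirical risk with the squared loss
risk = sum((t - p)**2 / 2 for p, t in zip(preds, targets))
# The two differ by an f-independent constant, so they share a minimizer
const = len(preds) * 0.5 * math.log(2 * math.pi)
print(abs(neg_log_lik - (risk + const)) < 1e-9)   # True
```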
25. What is the correspondence between MAP and regularized risk minimization?
In the MAP framework, we maximize the product of the likelihood and the prior P(D|f)P(f). A regularized risk functional may be obtained by taking the negative log:
R[f] = - log P(D|f) - log P(f)
where - log P(f) takes the role of the regularizer.
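A numerical check of this identity on a toy scalar model (Gaussian noise with sigma = 1 and a Gaussian prior of variance 1/(2a) are assumptions): -log P(D|w) - log P(w) equals the ridge-regularized risk up to an additive constant, so both have the same minimizer.

```python
import math

a = 0.5                              # regularization strength (assumed)
xs, ys = [1.0, 2.0], [1.2, 1.8]      # toy data (assumed)

def neg_log_posterior(w):
    """-log P(D|w) - log P(w), with all normalizing constants included."""
    nll = sum(0.5 * math.log(2 * math.pi) + (y - w * x)**2 / 2
              for x, y in zip(xs, ys))
    nlp = 0.5 * math.log(math.pi / a) + a * w**2
    return nll + nlp

def regularized_risk(w):
    """Squared loss plus the regularizer a * w^2."""
    return sum((y - w * x)**2 / 2 for x, y in zip(xs, ys)) + a * w**2

# The difference between the two objectives is independent of w
diff1 = neg_log_posterior(0.7) - regularized_risk(0.7)
diff2 = neg_log_posterior(1.3) - regularized_risk(1.3)
print(abs(diff1 - diff2) < 1e-9)   # True
```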

26. What is the correspondence between ridge regression, weight decay, Gaussian processes, and ARD priors?
Assume a linear model f(x) = w.x + b is used. Ridge regression means least-square regression with a 2-norm regularizer ||w||^2. A stochastic gradient algorithm optimizing the corresponding regularized risk functional includes a "weight decay" term. Gaussian processes are Bayesian methods assuming the weights are picked using a prior P(f) = exp(-a ||w||^2). Therefore the regularizer obtained by taking -log P(f) is the 2-norm regularizer ||w||^2. In the case of the linear model, this prior is also called an ARD (Automatic Relevance Determination) prior. The method can be "kernelized" by introducing scaling factors.
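The "weight decay" part of this correspondence can be sketched with a single SGD step on the ridge-regularized risk for a scalar weight (toy numbers assumed): the gradient of the regularizer contributes a term -eta*lam*w, i.e. the weight is multiplied by (1 - eta*lam) and so "decays" at every step.

```python
# One SGD step on 0.5*(y - w*x)^2 + 0.5*lam*w^2 for a scalar weight.
w, lam, eta = 1.0, 0.1, 0.01    # weight, ridge, learning rate (assumed)
x, y = 2.0, 1.5                 # one training pair (assumed)

grad = -(y - w * x) * x + lam * w      # data-fit term + regularizer term
w_new = w - eta * grad

# Equivalent form: decay the weight, then take the unregularized step
w_decayed = (1 - eta * lam) * w - eta * (-(y - w * x) * x)
print(abs(w_new - w_decayed) < 1e-9)   # True
```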
27. Why does the 1-norm regularization yield "sparse" solutions?
The surfaces of equal regularization are hyper-diamonds. If the unregularized solution is close enough to a vertex or edge, the regularized solution is pulled onto it, which corresponds to a number of weights being set exactly to zero. For the 2-norm regularization, the surfaces of equal regularization are hyper-spheres, so the weights do not get set preferentially to zero.
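This can be illustrated with the one-dimensional version of each regularized problem (a standard simplification, not from the lecture): for min over w of 0.5*(w - v)^2 + lam*|w| the solution is soft-thresholding, which maps small v exactly to zero, whereas the 2-norm version only rescales.

```python
lam = 1.0    # regularization strength (assumed)

def prox_l1(v):
    """Minimizer of 0.5*(w - v)^2 + lam*|w| (soft-thresholding)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0                   # small v pulled exactly onto the corner

def prox_l2(v):
    """Minimizer of 0.5*(w - v)^2 + lam*w^2: shrinks, never reaches 0."""
    return v / (1 + 2 * lam)

unregularized = [3.0, 0.4, -0.2, -2.5]
print([prox_l1(v) for v in unregularized])   # [2.0, 0.0, 0.0, -1.5]
print([prox_l2(v) for v in unregularized])   # shrunk, but no exact zeros
```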
28. Why is the 1-norm regularization not suitable for the "kernel trick"?
To apply the kernel trick, it should be possible to express the cost function in terms of dot products between patterns. With the 2-norm regularizer this is possible: writing w as a linear combination of the patterns, w = XT alpha, we get ||w||^2 = wT w = alphaT X XT alpha, where X XT is the matrix of the dot products between all pairs of patterns (which becomes the kernel matrix after applying the kernel trick). For the 1-norm regularizer, this is not possible.
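The identity can be checked numerically on toy matrices (three patterns in two dimensions, numbers assumed): when w = XT alpha, the squared 2-norm of w equals the quadratic form alphaT (X XT) alpha, so only dot products between patterns are needed.

```python
X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # three patterns (assumed)
alpha = [0.3, -0.7, 0.2]                     # dual coefficients (assumed)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# w = XT alpha: a linear combination of the training patterns
w = [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(2)]
norm_w2 = dot(w, w)

# Gram matrix of pairwise dot products (the kernel matrix, for the
# linear kernel)
K = [[dot(xi, xj) for xj in X] for xi in X]
quad = sum(alpha[i] * K[i][j] * alpha[j]
           for i in range(3) for j in range(3))
print(abs(norm_w2 - quad) < 1e-9)   # True
```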
29. What is a "link function"? Give examples.
For a discriminant function f, a link function is a function "linking" the functional margin z = y f(x) and the likelihood P(D|f). It is a means of converting the output of a discriminant function to posterior class probabilities. Link functions are usually S-shaped (sigmoids). The tanh and the logistic function 1/(1+e^-z) are often used. A piecewise-linear function, S-shaped but constant below -1 and above +1, may be used to implement "Bayesian" SVMs and obtain support vectors.
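A small sketch of the logistic link (toy margin values assumed): it maps the margin z = y f(x) to a probability, with large positive margins near 1, large negative margins near 0, and logistic(0) = 0.5.

```python
import math

def logistic(z):
    """Logistic link: converts a margin into a class probability."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-3.0, 0.0, 3.0]:
    print(round(logistic(z), 3))
# The curve is S-shaped and symmetric: logistic(z) + logistic(-z) = 1
```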