Questions on lecture 10

Learning and generalization (revisions)
14.    What is the likelihood?
Predictive models learn the mapping y = f(x) or, more generally, P(y|x). Conversely, generative models (which are probabilistic) learn the opposite, namely to predict the density of x given y, P(x|y). In the maximum likelihood framework, we assume that the data was produced by a model. The model has some parameters. The goodness-of-fit or "likelihood" is the probability that the data was produced by the model, for a given choice of parameters:
Likelihood = P(data | model).
15.    What does "maximum likelihood" mean?
The maximum likelihood method of inference chooses the set of model parameters that maximizes the likelihood.
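As a small sketch of this idea (toy data and a Gaussian model with known variance are assumptions, not from the lecture), one can scan candidate parameter values and keep the one with the highest likelihood:

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. data under a Gaussian model N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [1.2, 0.8, 1.5, 1.1, 0.9]    # toy data (assumed)
# Scan candidate values of mu; for a Gaussian, the maximum-likelihood
# estimate of mu is the sample mean.
candidates = [i / 100 for i in range(0, 301)]
mu_ml = max(candidates, key=lambda m: log_likelihood(m, data))
print(mu_ml)   # close to sum(data)/len(data) = 1.1
```

In practice the maximization is done analytically or by gradient methods rather than a grid scan; the scan just makes the definition concrete.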

17.    What is Bayesian learning?
In Bayesian learning, one assumes that the data was drawn from a double random process: first a function f is drawn according to a "prior" distribution P(f), then data pairs D = {(xi, f(xi))} are drawn. In Bayesian learning, one seeks to estimate, for a new example x, the probability P(y|x,D) by integrating over all possible choices of f, using P(f|D): P(y|x,D) = integral P(y|x,f) dP(f|D).
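A minimal sketch of this integral, assuming a "function space" of just two hypotheses (a coin whose bias is either 0.3 or 0.7, a made-up example): the prediction averages each hypothesis, weighted by its posterior.

```python
# Two candidate "functions" f = P(heads), each with prior probability 0.5.
priors = {0.3: 0.5, 0.7: 0.5}
data = [1, 1, 0, 1]                 # observed flips (1 = heads), assumed

def likelihood(f, D):
    """P(D|f) for i.i.d. coin flips."""
    p = 1.0
    for y in D:
        p *= f if y == 1 else (1 - f)
    return p

# Posterior P(f|D) proportional to P(D|f) P(f)
unnorm = {f: likelihood(f, data) * p for f, p in priors.items()}
Z = sum(unnorm.values())
posterior = {f: u / Z for f, u in unnorm.items()}

# Predictive probability of heads: the "integral" over f reduces to a sum
p_heads = sum(f * posterior[f] for f in posterior)
print(round(p_heads, 4))
```

With a continuous prior over f, the sum becomes the integral P(y|x,D) = integral P(y|x,f) dP(f|D) from the answer above.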
18.    What is a prior? What is a posterior?
P(f) is the prior on function space. Our revised opinion, after we have seen the data, is called the posterior P(f|D).
19.    What is Maximum A Posteriori estimation (MAP)?
The Bayesian approach requires computing a difficult integral. Instead, we can select the single function f that maximizes P(f|D). This is the Maximum A Posteriori (MAP) approach. We use Bayes' formula P(f|D)P(D) = P(D|f)P(f), so that we can replace the maximization of P(f|D) by that of P(D|f)P(f), where P(D|f) is the likelihood and P(f) the prior.
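A sketch of MAP estimation for a toy scalar parameter (Gaussian likelihood with unit variance and a Gaussian prior N(0, tau^2) are assumptions for illustration): maximizing log P(D|mu) + log P(mu) pulls the estimate away from the maximum-likelihood value, toward the prior.

```python
data = [1.2, 0.8, 1.5, 1.1, 0.9]    # toy data (assumed)
tau2 = 0.25                          # prior variance (assumed)

def log_posterior(mu):
    """log P(D|mu) + log P(mu), up to additive constants."""
    log_lik = sum(-(x - mu)**2 / 2 for x in data)
    log_prior = -mu**2 / (2 * tau2)
    return log_lik + log_prior

candidates = [i / 1000 for i in range(0, 2001)]
mu_map = max(candidates, key=log_posterior)
# Closed form for this conjugate case: mu_map = sum(data) / (n + 1/tau2),
# i.e. the ML estimate 1.1 shrunk toward the prior mean 0.
print(round(mu_map, 3))
```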
21.    What is “regularization”? What is a “ridge”?
Regularization is a means of solving "ill-posed" problems, such as inverting a matrix which is not invertible. The penalty term lambda ||w||^2 in ridge regression is called a regularizer. The positive coefficient lambda is called the "ridge".
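A minimal sketch with a single-feature linear model y = w*x and no intercept (toy numbers, for illustration only): the ridge lambda is added to the quantity being "inverted", which keeps the fit well-posed even when sum(x^2) is tiny or zero, and shrinks w.

```python
xs = [0.0, 1.0, 2.0, 3.0]    # toy inputs (assumed)
ys = [0.1, 0.9, 2.1, 2.9]    # toy targets (assumed)

def ridge_fit(xs, ys, lam):
    """w minimizing sum (y - w*x)^2 + lam * w^2 (1-D closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(ridge_fit(xs, ys, 0.0))    # ordinary least squares
print(ridge_fit(xs, ys, 10.0))   # a larger ridge shrinks w toward 0
```

In the multivariate case the denominator becomes the matrix X^T X + lambda I, which is invertible for any lambda > 0 even when X^T X is singular.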

Learning and generalization (new)
24. What is the correspondence between maximum likelihood and empirical risk minimization?
The likelihood of a model can be converted to a risk functional via the formula:
R[f] = - log P(D|f)
For i.i.d. data, P(D|f) can be decomposed as the product of the P(xi, yi|f). Since the risk is the sum of the losses over the example patterns, the loss function is then given by:
L(f(xi), yi) = - log P(xi, yi|f)
Conversely, the risk may be interpreted as an "energy". A likelihood can then be defined following the Boltzmann distribution P(D|f) = exp(-R[f]/T), where T is a "temperature" parameter.
With this correspondence, minimizing the risk is equivalent to maximizing the likelihood.
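The correspondence can be checked numerically on a toy example (Gaussian noise with sigma = 1 is an assumption): the negative log-likelihood equals the squared-error empirical risk plus a constant that does not depend on f.

```python
import math

preds = [1.0, 2.0, 3.0]      # model outputs f(xi) (assumed)
targets = [1.1, 1.9, 3.2]    # observed yi (assumed)

# -log P(D|f) for i.i.d. Gaussian noise N(0, 1)
neg_log_lik = sum(0.5 * math.log(2 * math.pi) + (t - p)**2 / 2
                  for p, t in zip(preds, targets))
# Empirical risk with the squared loss
risk = sum((t - p)**2 / 2 for p, t in zip(preds, targets))
# The two differ by an f-independent constant, so they share a minimizer
const = len(preds) * 0.5 * math.log(2 * math.pi)
print(abs(neg_log_lik - (risk + const)) < 1e-9)   # True
```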
25. What is the correspondence between MAP and regularized risk minimization?
In the MAP framework, we maximize the product of the likelihood and the prior P(D|f)P(f). A regularized risk functional may be obtained by taking the negative log:
R[f] = - log P(D|f) - log P(f)
where - log P(f) takes the role of the regularizer.
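A numerical check of this identity on a toy scalar model (Gaussian noise with sigma = 1 and a Gaussian prior of variance 1/(2a) are assumptions): -log P(D|w) - log P(w) equals the ridge-regularized risk up to an additive constant, so both have the same minimizer.

```python
import math

a = 0.5                              # regularization strength (assumed)
xs, ys = [1.0, 2.0], [1.2, 1.8]      # toy data (assumed)

def neg_log_posterior(w):
    """-log P(D|w) - log P(w), with all normalizing constants included."""
    nll = sum(0.5 * math.log(2 * math.pi) + (y - w * x)**2 / 2
              for x, y in zip(xs, ys))
    nlp = 0.5 * math.log(math.pi / a) + a * w**2
    return nll + nlp

def regularized_risk(w):
    """Squared loss plus the regularizer a * w^2."""
    return sum((y - w * x)**2 / 2 for x, y in zip(xs, ys)) + a * w**2

# The difference between the two objectives is independent of w
diff1 = neg_log_posterior(0.7) - regularized_risk(0.7)
diff2 = neg_log_posterior(1.3) - regularized_risk(1.3)
print(abs(diff1 - diff2) < 1e-9)   # True
```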

26. What is the correspondence between ridge regression, weight decay, Gaussian processes, and ARD priors?
Assume a linear model f(x) = w.x + b is used. Ridge regression means least-square regression with a 2-norm regularizer ||w||^2. A stochastic gradient algorithm optimizing the corresponding regularized risk functional includes a "weight decay" term. Gaussian processes are Bayesian methods assuming the weights are picked using a prior P(f) = exp(-a ||w||^2). Therefore the regularizer obtained by taking -log P(f) is the 2-norm regularizer ||w||^2. In the case of the linear model, this prior is also called an ARD (Automatic Relevance Determination) prior. The method can be "kernelized" by introducing scaling factors.
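The "weight decay" part of this correspondence can be sketched with a single SGD step on the ridge-regularized risk for a scalar weight (toy numbers assumed): the gradient of the regularizer contributes a term -eta*lam*w, i.e. the weight is multiplied by (1 - eta*lam) and so "decays" at every step.

```python
# One SGD step on 0.5*(y - w*x)^2 + 0.5*lam*w^2 for a scalar weight.
w, lam, eta = 1.0, 0.1, 0.01    # weight, ridge, learning rate (assumed)
x, y = 2.0, 1.5                 # one training pair (assumed)

grad = -(y - w * x) * x + lam * w      # data-fit term + regularizer term
w_new = w - eta * grad

# Equivalent form: decay the weight, then take the unregularized step
w_decayed = (1 - eta * lam) * w - eta * (-(y - w * x) * x)
print(abs(w_new - w_decayed) < 1e-9)   # True
```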
27. Why does the 1-norm regularization yield "sparse" solutions?
The surfaces of equal regularization are hyper-diamonds. If the unregularized solution is close enough to a vertex or edge, the regularized solution is pulled onto it, which corresponds to a number of weights being set exactly to zero. For the 2-norm regularization, the surfaces of equal regularization are hyper-spheres, so the weights do not get set preferentially to zero.
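This can be illustrated with the one-dimensional version of each regularized problem (a standard simplification, not from the lecture): for min over w of 0.5*(w - v)^2 + lam*|w| the solution is soft-thresholding, which maps small v exactly to zero, whereas the 2-norm version only rescales.

```python
lam = 1.0    # regularization strength (assumed)

def prox_l1(v):
    """Minimizer of 0.5*(w - v)^2 + lam*|w| (soft-thresholding)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0                   # small v pulled exactly onto the corner

def prox_l2(v):
    """Minimizer of 0.5*(w - v)^2 + lam*w^2: shrinks, never reaches 0."""
    return v / (1 + 2 * lam)

unregularized = [3.0, 0.4, -0.2, -2.5]
print([prox_l1(v) for v in unregularized])   # [2.0, 0.0, 0.0, -1.5]
print([prox_l2(v) for v in unregularized])   # shrunk, but no exact zeros
```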
28. Why is the 1-norm regularization not suitable for the "kernel trick"?
To apply the kernel trick, it should be possible to express the cost function in terms of dot products between patterns. With the 2-norm regularizer this is possible: writing w as a linear combination of the patterns, w = XT alpha, we get ||w||^2 = wT w = alphaT X XT alpha, where X XT is the matrix of the dot products between all pairs of patterns (which becomes the kernel matrix after applying the kernel trick). For the 1-norm regularizer, this is not possible.
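The identity can be checked numerically on toy matrices (three patterns in two dimensions, numbers assumed): when w = XT alpha, the squared 2-norm of w equals the quadratic form alphaT (X XT) alpha, so only dot products between patterns are needed.

```python
X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # three patterns (assumed)
alpha = [0.3, -0.7, 0.2]                     # dual coefficients (assumed)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# w = XT alpha: a linear combination of the training patterns
w = [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(2)]
norm_w2 = dot(w, w)

# Gram matrix of pairwise dot products (the kernel matrix, for the
# linear kernel)
K = [[dot(xi, xj) for xj in X] for xi in X]
quad = sum(alpha[i] * K[i][j] * alpha[j]
           for i in range(3) for j in range(3))
print(abs(norm_w2 - quad) < 1e-9)   # True
```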
29. What is a "link function"? Give examples.
For a discriminant function f, a link function is a function "linking" the functional margin z = y f(x) and the likelihood P(D|f). It is a means of converting the output of a discriminant function to posterior class probabilities. Link functions are usually S-shaped (sigmoids). The tanh and the logistic function 1/(1+e^-z) are often used. A piecewise-linear function, S-shaped but constant below -1 and above +1, may be used to implement "Bayesian" SVMs and obtain support vectors.
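A small sketch of the logistic link (toy margin values assumed): it maps the margin z = y f(x) to a probability, with large positive margins near 1, large negative margins near 0, and logistic(0) = 0.5.

```python
import math

def logistic(z):
    """Logistic link: converts a margin into a class probability."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-3.0, 0.0, 3.0]:
    print(round(logistic(z), 3))
# The curve is S-shaped and symmetric: logistic(z) + logistic(-z) = 1
```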