**Questions on lecture 3**

__Architectures and algorithms (continued)__

**13. What is “ridge regression”? How can one train a ridge regression linear model?**

Ridge regression is least-squares regression with an additional penalty
term lambda ||**w**||^{2}. To train a ridge regression linear model
f(**x**) = **x w**^{T}, one can either invert the regularized matrix
(X^{T}X + lambda I) to get **w**^{T} = (X^{T}X + lambda I)^{-1}X^{T}**y**,
or perform gradient descent on the penalized risk functional
R[**w**] = ||X**w**^{T} - **y**||^{2} + lambda ||**w**||^{2}. For the
stochastic gradient, one simply differentiates the contribution of a single
example to the risk, (**x**_{i} **w**^{T} - y_{i})^{2} + lambda ||**w**||^{2},
yielding the learning rule Delta **w** = (y_{i} - **x**_{i} **w**^{T}) **x**_{i} - lambda **w**,
or equivalently **w** <- (1 - lambda) **w** + (y_{i} - **x**_{i} **w**^{T}) **x**_{i}.
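
As a concrete illustration, here is a minimal numpy sketch of both training routes. The data, the value of lambda, and the learning rate eta (which the update rule above absorbs into its coefficients) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 5 features, a linear target plus noise.
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.1   # ridge coefficient lambda (an arbitrary choice)
n = len(y)

# (1) Closed form: w^T = (X^T X + lambda I)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# (2) Stochastic gradient descent with weight decay. The penalty is split
# across the n examples so the updates target the same risk functional.
eta = 0.01
w = np.zeros(5)
for epoch in range(200):
    for i in rng.permutation(n):
        w = (1 - eta * lam / n) * w + eta * (y[i] - X[i] @ w) * X[i]

print(w_closed)
print(w)   # the two estimates agree closely
```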

**14. What is “weight decay”? What is the connection to ridge regression?**

Weight decay means shrinking the weights at every learning step by a
factor (1 - lambda): **w** <- (1 - lambda)**w**. The weight decay in ridge
regression results from differentiating the penalty term lambda ||**w**||^{2}.

**15. What is a “Gaussian process”? What is the connection to ridge regression and weight decay?**

A Gaussian process is a generative model in which the weights of the target
function are drawn according to a Gaussian distribution (for a linear model).
The prior in function space is P(f) proportional to exp(-lambda ||**w**||^{2}).
In the Maximum A Posteriori (MAP) framework, one seeks the function f that
maximizes P(f|D), D being the data, or equivalently P(D|f) P(f). By taking
the negative log, one sees that -log P(D|f) plays the role of the risk and
-log P(f) the role of the penalty term lambda ||**w**||^{2}. Hence a
Gaussian prior on the weights is equivalent to using the penalty
lambda ||**w**||^{2} in the risk minimization framework.
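
Spelled out, the equivalence is a short computation. The sketch below assumes Gaussian observation noise, with the noise variance absorbed into lambda:

```latex
\hat{f}_{\mathrm{MAP}}
  = \arg\max_f \, P(f \mid D)
  = \arg\max_f \, P(D \mid f)\, P(f)
  = \arg\min_f \bigl[ -\log P(D \mid f) - \log P(f) \bigr]
% With P(D \mid f) \propto \exp(-\|X\mathbf{w}^{T} - \mathbf{y}\|^{2})
% and  P(f) \propto \exp(-\lambda \|\mathbf{w}\|^{2}), this becomes
  = \arg\min_{\mathbf{w}} \; \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
    + \lambda \|\mathbf{w}\|^{2} .
```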

**16. What method(s) solve(s) exactly the least square problem for a linear model? How does this relate to LMS and Hebb’s rule?**

The solutions are similar to those of ridge regression, but we let lambda
go to zero. LMS stands for "least mean square". It is the rule obtained by
differentiating the square loss with respect to **w**:
Delta **w** = (y_{i} - **x**_{i} **w**^{T}) **x**_{i}. It is just like
Hebb's rule Delta **w** = y_{i} **x**_{i}, except that it learns less from
the examples for which the predictions are already good.
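
To make the contrast concrete, here is a toy numpy comparison of the two rules on a single example; all values, including the learning rate eta, are made up for illustration.

```python
import numpy as np

x_i = np.array([1.0, -0.5, 2.0])   # one training example (made-up values)
y_i = 1.0                          # its target
w = np.array([0.2, 0.1, -0.3])     # current weights
eta = 0.1                          # learning rate

delta_hebb = eta * y_i * x_i                # Hebb's rule: no error feedback
delta_lms = eta * (y_i - x_i @ w) * x_i     # LMS: scaled by the residual

print(delta_hebb)  # full-strength update regardless of the prediction
print(delta_lms)   # shrinks toward zero as x_i . w approaches y_i
```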

**17. What is the "pseudo-inverse"? How is it linked to ridge regression?**

The pseudo-inverse of X is lim_{lambda -> 0+} (X^{T}X + lambda I)^{-1} X^{T}.
It is involved in solving the least-squares regression problem. The solution
for ridge regression is the same, except that lambda is now a given positive value.
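
This limit can be checked numerically; a minimal numpy sketch, with an arbitrary random matrix and a tiny lambda standing in for the limit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # an arbitrary tall data matrix

lam = 1e-10                   # a tiny ridge, standing in for lambda -> 0+
ridge_inv = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T)

print(np.allclose(ridge_inv, np.linalg.pinv(X)))  # True
```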

**18. What is kernel ridge regression? Give other examples of algorithms using the “kernel trick”?**

We can use the same ridge regression algorithm for models that are linear in
their parameters but non-linear in their input components ("Perceptrons"
f(**x**) = **phi**(**x**) **w**^{T}). Using the "kernel trick", i.e. the fact
that **w** = Sum_{i} alpha_{i} **phi**(**x**_{i}), we can transform the
problem of solving Phi **w**^{T} = **y** into K **alpha** = **y**, where K is
the matrix of dot products between the training examples in phi space. Kernel
ridge regression amounts to solving that equation after adding lambda to the
diagonal of K. The kernel trick may be used with all the algorithms (e.g. the
Perceptron algorithm and support vector machines) for which
**w** = Sum_{i} alpha_{i} **phi**(**x**_{i}), in which the **phi** vectors
appear only through their dot products and can therefore be replaced by a
similarity measure k(**x**, **x**') = **phi**(**x**).**phi**(**x**').
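
Here is a minimal numpy sketch of kernel ridge regression, with an RBF kernel standing in for k(**x**, **x**'); the data, the kernel width gamma, and lambda are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)  # a non-linear target

def rbf_kernel(A, B, gamma=5.0):
    """k(x, x') = exp(-gamma ||x - x'||^2), one choice of similarity."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-3
K = rbf_kernel(X, X)                                  # dot products in phi space
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lambda I) alpha = y

# Prediction at new points: f(x) = Sum_i alpha_i k(x_i, x).
X_test = np.linspace(-1, 1, 5)[:, None]
print(rbf_kernel(X_test, X) @ alpha)
```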

**19. What is Principal Component Analysis (PCA)? Which eigen value indicates the direction of largest variance? In what sense is the representation obtained from a projection onto the eigen directions corresponding to the largest eigen values optimal for data reconstruction?**

PCA is a method of feature construction. The new features are linear
combinations (weighted sums) of the old ones. They are obtained by rotating
the input space into the axes of the principal components of X^{T}X: X -> XU,
where the columns of U are eigenvectors. This transform has the following
properties: (1) the eigen directions corresponding to the largest eigenvalues
explain the variance in the data best; (2) if we limit ourselves to the n'
eigen directions corresponding to the top eigenvalues and rotate back into
the original axes, XU -> XUU^{T}, the reconstructed data XUU^{T} are closest
to the original data X in the least-squares sense. So we cut down on the
number of features with as little information loss as possible.
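
A minimal numpy sketch of the transform and of property (2); the correlated data and the choice n' = 3 are arbitrary, and the data are mean-centered first, as is customary for PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features
X = X - X.mean(axis=0)          # center the data

# Eigendecomposition of X^T X; the columns of U are the eigenvectors.
eigvals, U = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
U = U[:, order]

n_keep = 3
U_k = U[:, :n_keep]                 # top n' eigen directions
X_rec = X @ U_k @ U_k.T             # rotate, truncate, rotate back

# Least-squares reconstruction error: the smallest achievable for
# any projection onto n' = 3 directions.
print(((X - X_rec) ** 2).sum())
```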

**20. What is the connection between PCA and weight decay?**

PCA cuts the dimensions corresponding to the smallest eigenvalues of X^{T}X.
Weight decay pulls the weights to zero preferentially in those directions,
which are the directions of least resistance.
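
One standard way to make this precise uses the singular value decomposition X = U S V^{T} (so the columns **v**_{i} of V are the eigen directions of X^{T}X, with eigenvalues s_{i}^{2}); this derivation is an addition to the text above:

```latex
\mathbf{w}_{\text{ridge}}^{T}
  = (X^{T}X + \lambda I)^{-1} X^{T} \mathbf{y}
  = \sum_i \frac{s_i^{2}}{s_i^{2} + \lambda}
    \cdot \frac{\mathbf{u}_i^{T}\mathbf{y}}{s_i}\, \mathbf{v}_i
% The factor s_i^2 / (s_i^2 + lambda) multiplies the least-squares
% solution component-wise: it is close to 1 along directions with large
% eigenvalues and close to 0 along the directions PCA would discard.
```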

**21. What is an irregular matrix? Name one way of regularizing a matrix.**

An irregular matrix is a matrix which is not invertible. A square matrix of
dimensions (n, n) is not invertible if its rank is smaller than n, or
equivalently if it has fewer than n non-zero eigenvalues. To regularize a
matrix, one can add a small positive value to the diagonal, which has the
effect of making all eigenvalues non-zero.
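
For instance, in a toy numpy sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])        # rank 1, hence not invertible
print(np.linalg.eigvalsh(A))      # one eigenvalue is zero

A_reg = A + 1e-6 * np.eye(2)      # add a small positive value to the diagonal
print(np.linalg.eigvalsh(A_reg))  # all eigenvalues now non-zero
print(np.linalg.inv(A_reg))       # inversion succeeds
```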

__Learning and generalization (continued)__

**17. What is Bayesian learning?**

In Bayesian learning, one assumes that the data was drawn from a double
random process: first a function f is drawn according to a "prior"
distribution P(f); then data pairs are drawn, D = {**x**_{i}, f(**x**_{i})}.
One then seeks to estimate, for a new example **x**, the probability
P(y|**x**, D) by integrating over all possible choices of f, using P(f|D):
P(y|**x**, D) = integral P(y|**x**, f) dP(f|D).
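
Since the integral rarely has a closed form, a toy discretization illustrates the idea; the model class (lines through the origin), the grid, the prior, and the noise level are all made-up choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate functions f(x) = a * x on a grid of slopes; P(f) is a
# Gaussian-shaped prior over the grid.
a_grid = np.linspace(-2, 2, 401)
prior = np.exp(-a_grid ** 2)
prior /= prior.sum()

# A few data pairs generated with slope 1.2 plus Gaussian noise.
x_data = np.array([0.5, 1.0, 1.5])
y_data = 1.2 * x_data + 0.1 * rng.normal(size=3)

# Likelihood P(D|f) under Gaussian noise, posterior P(f|D) by Bayes' rule.
loglik = np.array([-np.sum((y_data - a * x_data) ** 2) / (2 * 0.1 ** 2)
                   for a in a_grid])
posterior = prior * np.exp(loglik - loglik.max())
posterior /= posterior.sum()

# Predictive mean at a new point: the integral of f(x) dP(f|D),
# approximated here by a sum over the grid.
x_new = 2.0
print(np.sum(posterior * a_grid * x_new))   # close to 1.2 * 2.0 = 2.4
```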

**18. What is a prior? What is a posterior?**

P(f) is the prior on function space.
Our revised opinion, after we have seen the data, is called the posterior
P(f|D).

**19. What is Maximum A Posteriori estimation (MAP)?**

The Bayesian approach requires computing a difficult integral. Instead, we
can select just one function f, the one that maximizes P(f|D). This is the
Maximum A Posteriori (MAP) approach. We use Bayes' formula
P(f|D)P(D) = P(D|f)P(f), so that we can replace the maximization of P(f|D)
by that of P(D|f)P(f), where P(D|f) is the likelihood and P(f) the prior.

**20. What is “structural risk minimization”?**

It is a means of controlling the complexity of a model by building a structure
consisting of nested subsets of models. By doing that, we know that the complexity
is increasing from subset to subset. It is then possible to penalize more
complex models, without quantifying their complexity. Constraints on the
structure are imposed and introduced into the risk functional via Lagrange
multipliers.

**21. What is “regularization”? What is a “ridge”?**

Regularization is a means of solving "ill-posed" problems, such as inverting
a matrix which is not invertible. The penalty term lambda ||**w**||^{2} in
ridge regression is called a regularizer. The positive coefficient lambda is
called the "ridge".

**22. What is a Lagrangian?**

In an optimization problem in which one seeks to minimize a functional under
some constraints, a Lagrangian is the functional in which the constraints
have been introduced by multiplying them by positive coefficients called
"Lagrange multipliers". The net effect is to simplify the optimization problem
by putting it in a canonical form.

**23. What is the link between structural risk minimization and regularization?**

We can choose a structure that penalizes models with large ||**w**||^{2}.
Each element of the structure imposes a constraint ||**w**||^{2} < A. Using
a Lagrange multiplier lambda, we recover the same penalized risk functional
used for ridge regression, i.e. the regularizer lambda ||**w**||^{2}.
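
Written out, with the constraint level A as in the text:

```latex
% One element of the structure: empirical risk minimization under a
% norm constraint,
\min_{\mathbf{w}} \; \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
  \quad \text{subject to} \quad \|\mathbf{w}\|^{2} \le A .
% Introducing the constraint with a Lagrange multiplier \lambda \ge 0:
L[\mathbf{w}, \lambda]
  = \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
  + \lambda \bigl(\|\mathbf{w}\|^{2} - A\bigr).
% For a fixed \lambda, the term -\lambda A is constant, so minimizing L
% over \mathbf{w} is exactly ridge regression with the regularizer
% \lambda \|\mathbf{w}\|^{2}.
```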