**Questions on lecture 3**

__Architectures and algorithms (continued)__

**13. What is “ridge regression”? How can one train a ridge regression linear model?**

Ridge regression is least-squares regression with an additional penalty
term lambda ||**w**||^{2}. To train a ridge regression linear model
f(**x**) = **x w**^{T}, one can either invert the regularized matrix
(X^{T}X + lambda I) to get **w**^{T} = (X^{T}X + lambda I)^{-1}X^{T}**y**,
or perform gradient descent on the penalized risk functional
R[**w**] = ||X**w**^{T} - **y**||^{2} + lambda ||**w**||^{2}. For the
stochastic gradient, one simply differentiates the contribution of a single
example to the risk, (**x**_{i} **w**^{T} - y_{i})^{2} + lambda ||**w**||^{2},
yielding the learning rule Delta **w** = (y_{i} - **x**_{i} **w**^{T}) **x**_{i} - lambda **w**,
or equivalently **w** <- (1 - lambda) **w** + (y_{i} - **x**_{i} **w**^{T}) **x**_{i}.
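
As a concrete illustration, here is a minimal numpy sketch of both training routes. The data, the value of lambda, and the learning rate eta (which the update rule above absorbs into its coefficients) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 5 features, a linear target plus noise.
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.1   # ridge coefficient lambda (an arbitrary choice)
n = len(y)

# (1) Closed form: w^T = (X^T X + lambda I)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# (2) Stochastic gradient descent with weight decay. The penalty is split
# across the n examples so the updates target the same risk functional.
eta = 0.01
w = np.zeros(5)
for epoch in range(200):
    for i in rng.permutation(n):
        w = (1 - eta * lam / n) * w + eta * (y[i] - X[i] @ w) * X[i]

print(w_closed)
print(w)   # the two estimates agree closely
```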

**14. What is “weight decay”? What is the connection to ridge regression?**

Weight decay means shrinking the weights at every learning step by a
factor (1 - lambda): **w** <- (1 - lambda)**w**. The weight decay in ridge
regression results from differentiating the penalty term lambda ||**w**||^{2}.

**15. What is a “Gaussian process”? What is the connection to ridge regression and weight decay?**

A Gaussian process is a generative model in which the weights of the target
function are drawn according to a Gaussian distribution (for a linear model).
The prior in function space is P(f) proportional to exp(-lambda ||**w**||^{2}).
In the Maximum A Posteriori (MAP) framework, one seeks the function f that
maximizes P(f|D), D being the data, or equivalently P(D|f) P(f). By taking
the negative log, one sees that -log P(D|f) plays the role of the risk and
-log P(f) the role of the penalty term lambda ||**w**||^{2}. Hence a
Gaussian prior on the weights is equivalent to using the penalty
lambda ||**w**||^{2} in the risk minimization framework.
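
Spelled out, the equivalence is a short computation. The sketch below assumes Gaussian observation noise, with the noise variance absorbed into lambda:

```latex
\hat{f}_{\mathrm{MAP}}
  = \arg\max_f \, P(f \mid D)
  = \arg\max_f \, P(D \mid f)\, P(f)
  = \arg\min_f \bigl[ -\log P(D \mid f) - \log P(f) \bigr]
% With P(D \mid f) \propto \exp(-\|X\mathbf{w}^{T} - \mathbf{y}\|^{2})
% and  P(f) \propto \exp(-\lambda \|\mathbf{w}\|^{2}), this becomes
  = \arg\min_{\mathbf{w}} \; \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
    + \lambda \|\mathbf{w}\|^{2} .
```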

**16. What method(s) solve(s) exactly the least square problem for a linear model? How does this relate to LMS and Hebb’s rule?**

The solutions are similar to those of ridge regression, but we let lambda
go to zero. LMS stands for "least mean square". It is the rule obtained by
differentiating the square loss with respect to **w**:
Delta **w** = (y_{i} - **x**_{i} **w**^{T}) **x**_{i}. It is just like
Hebb's rule Delta **w** = y_{i} **x**_{i}, except that it learns less from
the examples for which the predictions are already good.
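
To make the contrast concrete, here is a toy numpy comparison of the two rules on a single example; all values, including the learning rate eta, are made up for illustration.

```python
import numpy as np

x_i = np.array([1.0, -0.5, 2.0])   # one training example (made-up values)
y_i = 1.0                          # its target
w = np.array([0.2, 0.1, -0.3])     # current weights
eta = 0.1                          # learning rate

delta_hebb = eta * y_i * x_i                # Hebb's rule: no error feedback
delta_lms = eta * (y_i - x_i @ w) * x_i     # LMS: scaled by the residual

print(delta_hebb)  # full-strength update regardless of the prediction
print(delta_lms)   # shrinks toward zero as x_i . w approaches y_i
```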

**17. What is the "pseudo-inverse"? How is it linked to ridge regression?**

The pseudo-inverse of X is lim_{lambda -> 0+} (X^{T}X + lambda I)^{-1} X^{T}.
It is involved in solving the least-squares regression problem. The solution
for ridge regression is the same, except that lambda is now a given positive value.
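
This limit can be checked numerically; a minimal numpy sketch, with an arbitrary random matrix and a tiny lambda standing in for the limit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # an arbitrary tall data matrix

lam = 1e-10                   # a tiny ridge, standing in for lambda -> 0+
ridge_inv = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T)

print(np.allclose(ridge_inv, np.linalg.pinv(X)))  # True
```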

**18. What is kernel ridge regression? Give other examples of algorithms using the “kernel trick”?**

We can use the same ridge regression algorithm for models that are linear in
their parameters but non-linear in their input components ("Perceptrons"
f(**x**) = **phi**(**x**) **w**^{T}). Using the "kernel trick", i.e. the fact
that **w** = Sum_{i} alpha_{i} **phi**(**x**_{i}), we can transform the
problem of solving Phi **w**^{T} = **y** into K **alpha** = **y**, where K is
the matrix of dot products between the training examples in phi space. Kernel
ridge regression amounts to solving that equation after adding lambda to the
diagonal of K. The kernel trick may be used with all the algorithms (e.g. the
Perceptron algorithm and support vector machines) for which
**w** = Sum_{i} alpha_{i} **phi**(**x**_{i}), in which the **phi** vectors
appear only through their dot products and can therefore be replaced by a
similarity measure k(**x**, **x**') = **phi**(**x**).**phi**(**x**').
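
Here is a minimal numpy sketch of kernel ridge regression, with an RBF kernel standing in for k(**x**, **x**'); the data, the kernel width gamma, and lambda are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)  # a non-linear target

def rbf_kernel(A, B, gamma=5.0):
    """k(x, x') = exp(-gamma ||x - x'||^2), one choice of similarity."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-3
K = rbf_kernel(X, X)                                  # dot products in phi space
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lambda I) alpha = y

# Prediction at new points: f(x) = Sum_i alpha_i k(x_i, x).
X_test = np.linspace(-1, 1, 5)[:, None]
print(rbf_kernel(X_test, X) @ alpha)
```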

**19. What is Principal Component Analysis (PCA)? Which eigen value indicates the direction of largest variance? In what sense is the representation obtained from a projection onto the eigen directions corresponding to the largest eigen values optimal for data reconstruction?**

PCA is a method of feature construction. The new features are linear
combinations (weighted sums) of the old ones. They are obtained by rotating
the input space into the axes of the principal components of X^{T}X: X -> XU,
where the columns of U are eigenvectors. This transform has the following
properties: (1) the eigen directions corresponding to the largest eigenvalues
explain the variance in the data best; (2) if we limit ourselves to the n'
eigen directions corresponding to the top eigenvalues and rotate back into
the original axes, XU -> XUU^{T}, the reconstructed data XUU^{T} are closest
to the original data X in the least-squares sense. So we cut down on the
number of features with as little information loss as possible.
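
A minimal numpy sketch of the transform and of property (2); the correlated data and the choice n' = 3 are arbitrary, and the data are mean-centered first, as is customary for PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features
X = X - X.mean(axis=0)          # center the data

# Eigendecomposition of X^T X; the columns of U are the eigenvectors.
eigvals, U = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
U = U[:, order]

n_keep = 3
U_k = U[:, :n_keep]                 # top n' eigen directions
X_rec = X @ U_k @ U_k.T             # rotate, truncate, rotate back

# Least-squares reconstruction error: the smallest achievable for
# any projection onto n' = 3 directions.
print(((X - X_rec) ** 2).sum())
```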

**20. What is the connection between PCA and weight decay?**

PCA cuts the dimensions corresponding to the smallest eigenvalues of X^{T}X.
Weight decay pulls the weights to zero preferentially in those directions,
which are the directions of least resistance.
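
One standard way to make this precise uses the singular value decomposition X = U S V^{T} (so the columns **v**_{i} of V are the eigen directions of X^{T}X, with eigenvalues s_{i}^{2}); this derivation is an addition to the text above:

```latex
\mathbf{w}_{\text{ridge}}^{T}
  = (X^{T}X + \lambda I)^{-1} X^{T} \mathbf{y}
  = \sum_i \frac{s_i^{2}}{s_i^{2} + \lambda}
    \cdot \frac{\mathbf{u}_i^{T}\mathbf{y}}{s_i}\, \mathbf{v}_i
% The factor s_i^2 / (s_i^2 + lambda) multiplies the least-squares
% solution component-wise: it is close to 1 along directions with large
% eigenvalues and close to 0 along the directions PCA would discard.
```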

**21. What is an irregular matrix? Name one way of regularizing a matrix.**

An irregular matrix is a matrix which is not invertible. A square matrix of
dimensions (n, n) is not invertible if its rank is smaller than n, or
equivalently if it has fewer than n non-zero eigenvalues. To regularize a
matrix, one can add a small positive value to the diagonal, which has the
effect of making all eigenvalues non-zero.
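
For instance, in a toy numpy sketch:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])        # rank 1, hence not invertible
print(np.linalg.eigvalsh(A))      # one eigenvalue is zero

A_reg = A + 1e-6 * np.eye(2)      # add a small positive value to the diagonal
print(np.linalg.eigvalsh(A_reg))  # all eigenvalues now non-zero
print(np.linalg.inv(A_reg))       # inversion succeeds
```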

__Learning and generalization (continued)__

**17. What is Bayesian learning?**

In Bayesian learning, one assumes that the data was drawn from a double
random process: first a function f is drawn according to a "prior"
distribution P(f); then data pairs are drawn, D = {**x**_{i}, f(**x**_{i})}.
One then seeks to estimate, for a new example **x**, the probability
P(y|**x**, D) by integrating over all possible choices of f, using P(f|D):
P(y|**x**, D) = integral P(y|**x**, f) dP(f|D).
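
Since the integral rarely has a closed form, a toy discretization illustrates the idea; the model class (lines through the origin), the grid, the prior, and the noise level are all made-up choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate functions f(x) = a * x on a grid of slopes; P(f) is a
# Gaussian-shaped prior over the grid.
a_grid = np.linspace(-2, 2, 401)
prior = np.exp(-a_grid ** 2)
prior /= prior.sum()

# A few data pairs generated with slope 1.2 plus Gaussian noise.
x_data = np.array([0.5, 1.0, 1.5])
y_data = 1.2 * x_data + 0.1 * rng.normal(size=3)

# Likelihood P(D|f) under Gaussian noise, posterior P(f|D) by Bayes' rule.
loglik = np.array([-np.sum((y_data - a * x_data) ** 2) / (2 * 0.1 ** 2)
                   for a in a_grid])
posterior = prior * np.exp(loglik - loglik.max())
posterior /= posterior.sum()

# Predictive mean at a new point: the integral of f(x) dP(f|D),
# approximated here by a sum over the grid.
x_new = 2.0
print(np.sum(posterior * a_grid * x_new))   # close to 1.2 * 2.0 = 2.4
```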

**18. What is a prior? What is a posterior?**

P(f) is the prior on function space.
Our revised opinion, after we have seen the data, is called the posterior
P(f|D).

**19. What is Maximum A Posteriori estimation (MAP)?**

The Bayesian approach requires computing a difficult integral. Instead, we
can select just one function f, the one that maximizes P(f|D). This is the
Maximum A Posteriori (MAP) approach. We use Bayes' formula
P(f|D)P(D) = P(D|f)P(f), so that we can replace the maximization of P(f|D)
by that of P(D|f)P(f), where P(D|f) is the likelihood and P(f) the prior.

**20. What is “structural risk minimization”?**

It is a means of controlling the complexity of a model by building a structure
consisting of nested subsets of models. By doing that, we know that the complexity
is increasing from subset to subset. It is then possible to penalize more
complex models, without quantifying their complexity. Constraints on the
structure are imposed and introduced into the risk functional via Lagrange
multipliers.

**21. What is “regularization”? What is a “ridge”?**

Regularization is a means of solving "ill-posed" problems, such as inverting
a matrix which is not invertible. The penalty term lambda ||**w**||^{2} in
ridge regression is called a regularizer. The positive coefficient lambda is
called the "ridge".

**22. What is a Lagrangian?**

In an optimization problem in which one seeks to minimize a functional under
some constraints, a Lagrangian is the functional in which the constraints
have been introduced by multiplying them by positive coefficients called
"Lagrange multipliers". The net effect is to simplify the optimization problem
by putting it in a canonical form.

**23. What is the link between structural risk minimization and regularization?**

We can choose a structure that penalizes models with large ||**w**||^{2}.
Each element of the structure imposes a constraint ||**w**||^{2} < A. Using
a Lagrange multiplier lambda, we recover the same penalized risk functional
used for ridge regression, i.e. the regularizer lambda ||**w**||^{2}.
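
Written out, with the constraint level A as in the text:

```latex
% One element of the structure: empirical risk minimization under a
% norm constraint,
\min_{\mathbf{w}} \; \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
  \quad \text{subject to} \quad \|\mathbf{w}\|^{2} \le A .
% Introducing the constraint with a Lagrange multiplier \lambda \ge 0:
L[\mathbf{w}, \lambda]
  = \|X\mathbf{w}^{T} - \mathbf{y}\|^{2}
  + \lambda \bigl(\|\mathbf{w}\|^{2} - A\bigr).
% For a fixed \lambda, the term -\lambda A is constant, so minimizing L
% over \mathbf{w} is exactly ridge regression with the regularizer
% \lambda \|\mathbf{w}\|^{2}.
```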