Predictive models learn the mapping y=f(x) or, more generally, P(y|x). Conversely, generative models (which are probabilistic) learn the opposite, namely the density of x given y, P(x|y). In the maximum likelihood framework, we assume that the data was produced by a model with some parameters. The goodness-of-fit or "likelihood" is the probability that the data was produced by the model, for a given choice of parameters.

The maximum likelihood method of inference chooses the set of parameters of the model that maximize the likelihood.
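As a minimal sketch of maximum likelihood inference, consider a Gaussian model for i.i.d. scalar data (the model, data values, and function names below are illustrative assumptions, not from the notes): maximizing the log-likelihood over the mean and variance has the familiar closed-form solution.

```python
import math

# Maximum likelihood for a Gaussian model: assume x_1..x_n ~ N(mu, sigma^2).
# The log-likelihood is sum_i log N(x_i; mu, sigma^2); setting its
# derivatives to zero yields the closed-form estimates below.

def gaussian_mle(data):
    n = len(data)
    mu = sum(data) / n                            # ML estimate of the mean
    var = sum((x - mu) ** 2 for x in data) / n    # ML estimate of the variance (note the 1/n)
    return mu, var

def log_likelihood(data, mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

data = [1.2, 0.8, 1.1, 0.9, 1.0]        # toy data (illustrative)
mu, var = gaussian_mle(data)

# Any other parameter choice yields a lower likelihood:
assert log_likelihood(data, mu, var) >= log_likelihood(data, mu + 0.1, var)
```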

In Bayesian learning, one assumes that the data was drawn from a double random process: first a function f is drawn according to a "prior" distribution P(f); then data pairs are drawn, D = {(x_i, y_i)}, i = 1, ..., n, with each y_i drawn according to P(y|x_i, f).

P(f) is the prior on function space. Our revised opinion, after we have seen the data, is called the posterior P(f|D).

The Bayesian approach requires computing a difficult integral. Instead, we can select the single function f that maximizes P(f|D). This is the Maximum A Posteriori (MAP) approach. We use Bayes' formula P(f|D)P(D) = P(D|f)P(f), so that we can replace the maximization of P(f|D) by that of P(D|f)P(f), where P(D|f) is the likelihood and P(f) the prior.

Regularization is a means of solving "ill-posed" problems, such as inverting a matrix which is not invertible. The penalty term lambda ||w||^2, added to the objective, makes the problem well-posed: for least squares, it amounts to inverting X^T X + lambda I, which is invertible for any lambda > 0.
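The effect can be seen directly on the normal equations. In the sketch below (the data matrix is an illustrative assumption), a duplicated feature column makes X^T X singular, so ordinary least squares has no unique solution; adding lambda * I restores invertibility.

```python
import numpy as np

# Ridge regularization sketch: two identical columns make X^T X singular,
# so the ordinary least-squares normal equations cannot be solved uniquely.
# Adding lambda * I (from the penalty lambda * ||w||^2) fixes this.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])      # second column duplicates the first
y = np.array([1.0, 2.0, 3.0])

XtX = X.T @ X
assert abs(np.linalg.det(XtX)) < 1e-9     # singular: not invertible

lam = 0.1
w = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)  # unique ridge solution
```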

The likelihood of a model can be converted to a risk functional via the formula:

R[f] = - log P(D|f)

For i.i.d. data, P(D|f) can be decomposed as the product of the per-example likelihoods P(y_i|x_i, f), so the risk becomes a sum over training examples. This defines a loss for each example:

L(f(x_i), y_i) = - log P(y_i|x_i, f)
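A standard instance of this correspondence (sketched below under the assumption of additive Gaussian noise, which the notes do not state explicitly): if y = f(x) + noise with noise ~ N(0, sigma^2), then the per-example negative log-likelihood is the squared loss scaled by 1/(2 sigma^2), plus a constant that does not depend on f.

```python
import math

# Loss/likelihood correspondence under a Gaussian noise assumption:
# -log P(y | x, f) = (y - f(x))^2 / (2 sigma^2) + 0.5 * log(2 pi sigma^2).
def neg_log_lik(y, fx, sigma=1.0):
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - fx) ** 2 / (2 * sigma ** 2)

def squared_loss(y, fx):
    return (y - fx) ** 2

const = 0.5 * math.log(2 * math.pi)   # term independent of f (sigma = 1)
assert abs(neg_log_lik(2.0, 1.5) - (const + 0.5 * squared_loss(2.0, 1.5))) < 1e-12
```

Minimizing the squared loss is therefore maximum likelihood under this noise model.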

Conversely, the risk may be interpreted as an "energy". A likelihood can be defined following the Boltzmann distribution P(D|f) = exp(-R[f]/T), where T is a "temperature" parameter.

With this correspondence, minimizing the risk is equivalent to maximizing the likelihood.

In the MAP framework, we maximize the product of the likelihood and the prior P(D|f)P(f). A regularized risk functional may be obtained by taking the negative log:

R[f] = - log P(D|f) - log P(f)

where - log P(f) takes the role of the regularizer.
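For example, under the common assumption of an independent Gaussian prior on the weights of a linear model (an assumption made here for illustration), -log P(w) is exactly an L2 penalty plus a constant:

```python
import math

# MAP-to-regularization sketch: with prior w_i ~ N(0, 1/lam) independently,
# -log P(w) = (lam/2) * ||w||^2 + const, so the regularized risk
# R[w] = -log P(D|w) - log P(w) is the data term plus an L2 penalty.
def neg_log_prior(w, lam):
    return sum(lam * wi ** 2 / 2 for wi in w) \
        + len(w) * 0.5 * math.log(2 * math.pi / lam)

def l2_penalty(w, lam):
    return lam * sum(wi ** 2 for wi in w) / 2

w = [0.5, -1.0, 2.0]    # illustrative weight vector
lam = 0.3
const = len(w) * 0.5 * math.log(2 * math.pi / lam)
assert abs(neg_log_prior(w, lam) - (l2_penalty(w, lam) + const)) < 1e-12
```

A Laplace prior, by the same calculation, yields the 1-norm regularizer discussed next.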

Assume a linear model f(x) = w · x, regularized with the 1-norm penalty ||w||_1 = sum_i |w_i|.

The surfaces of equal regularization are hyper-diamonds. If the unregularized solution is close enough to an edge, the solution is pulled to the edge, corresponding to a number of weights being set to zero. For the 2-norm regularization, the surfaces of equal regularization are hyper-spheres, so the weights are not preferentially set to zero.
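The sparsity effect is easiest to see in one dimension (a minimal sketch; the proximal-operator framing is an illustrative choice, not from the notes): minimizing (w - v)^2/2 + lam * penalty(w), the 1-norm sets w exactly to zero whenever |v| <= lam, while the 2-norm only shrinks it.

```python
# One-dimensional view of why the 1-norm produces sparsity:
# minimize (w - v)^2 / 2 + lam * |w|     -> soft-thresholding (exact zeros)
# minimize (w - v)^2 / 2 + lam * w^2 / 2 -> shrinkage (never exactly zero)
def prox_l1(v, lam):
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0               # pulled exactly onto the edge w = 0

def prox_l2(v, lam):
    return v / (1.0 + lam)   # shrinks toward zero, but stays nonzero for v != 0

assert prox_l1(0.3, 0.5) == 0.0    # small weight is set to zero
assert prox_l2(0.3, 0.5) != 0.0    # merely shrunk
```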

To apply the kernel trick, it should be possible to express the cost function in terms of dot products of patterns. With the 2-norm regularizer this is possible: writing w = sum_i alpha_i x_i, we get ||w||^2 = sum_{i,j} alpha_i alpha_j (x_i · x_j), which involves only dot products and can therefore be kernelized.

For a discriminant function f, a link function is a function "linking" the functional margin z = y f(x) and the likelihood P(D|f). It is a means of converting the output of a discriminant function to posterior class probabilities. Link functions are usually S-shaped (sigmoids). The tanh and the logistic function 1/(1+e^{-z}) are commonly used.
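The logistic link can be sketched directly: it maps the margin z to a probability in (0, 1), with z = 0 giving 1/2 and large margins saturating toward 0 or 1.

```python
import math

# The logistic link function: maps the functional margin z = y * f(x)
# to a posterior class probability in (0, 1).
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

assert logistic(0.0) == 0.5                # zero margin -> maximal uncertainty
assert logistic(10.0) > 0.99               # large margin -> confident prediction
assert abs(logistic(2.0) + logistic(-2.0) - 1.0) < 1e-12   # symmetry
```

Note that tanh(z/2) = 2 * logistic(z) - 1, so the two sigmoids differ only by an affine rescaling of the output.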