**Questions on lecture 1**

__Learning and generalization__

**1. What is “Machine Learning”? Give examples of learning
machines.**

Machine Learning is a discipline dedicated to the design and study of artificial
learning systems, particularly systems that learn from examples. Learning
machines include linear models, artificial neural networks, and decision
trees.

**2. What is supervised learning? Name special cases
of supervised learning depending on whether the inputs/outputs are categorical,
ordinal, or continuous.**

Supervised learning refers to learning in the presence of a teacher. When
trying to learn to classify objects, the teaching signal is the class label.
In this class, data objects are represented as vectors "**x**" of variables
or "features". We seek to predict an attribute "y" of these data objects,
that is another variable. A continuous variable takes real values. Both categorical
and ordinal variables take values from a finite set of choices. For
categorical inputs the set is not ordered (e.g. the country of origin), while
for ordinal inputs it is ordered (e.g. three clinical stages in the advancement
of a disease). Regardless of the type of inputs, if __the output is continuous,
the problem is a regression problem__; if __the output is categorical,
the problem is a classification problem__. "Ordinal regression" problems
have ordinal outputs.

**3. What is unsupervised learning? Give examples of unsupervised
learning tasks.**

In unsupervised learning problems, no teaching signal is available. Dimensionality
reduction by principal component analysis and clustering are examples of
unsupervised learning.

**4. What is a loss function? What is a risk functional?
Give examples.**

A loss function is a function measuring the discrepancy between a predicted
output f(**x**) and the desired outcome y: L(f(**x**), y). The risk
is the average of L over many examples. Examples of loss functions include
the square loss often used in regression (y-f(**x**))^{2} and
the 0/1 loss used in classification, which is 1 in case of error and 0 otherwise.
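
As a minimal sketch, the two losses and the averaging step can be written in a few lines of Python. The function names are chosen here for illustration and are not part of the lecture.

```python
def square_loss(prediction, target):
    """Square loss (y - f(x))^2, often used in regression."""
    return (target - prediction) ** 2


def zero_one_loss(prediction, target):
    """0/1 loss: 1 in case of error, 0 otherwise."""
    return 0 if prediction == target else 1


def average_loss(loss, predictions, targets):
    """Risk estimate: the average of the loss over the given examples."""
    total = sum(loss(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)


# Two errors out of four predictions give an average 0/1 loss of 0.5.
error_rate = average_loss(zero_one_loss, [1, 0, 1, 1], [1, 1, 1, 0])
```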

**5. What is the empirical risk? What is “empirical risk
minimization”?**

The empirical risk is the average loss over a finite number of given
examples. Empirical risk minimization refers to finding the function f(**x**)
in a family of functions that minimizes the empirical risk. Empirical risk
minimization is a form of training/learning.
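
The following sketch illustrates empirical risk minimization on a deliberately tiny family of functions, f(x) = w·x with w taken from a short list of candidates. The family, the candidate list, and the toy data are assumptions made for this example; practical methods search much larger families with dedicated optimization algorithms.

```python
def empirical_risk(w, xs, ys):
    """Average square loss of f(x) = w * x over the given examples."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def minimize_empirical_risk(candidate_ws, xs, ys):
    """Return the candidate w whose function has the lowest empirical risk."""
    return min(candidate_ws, key=lambda w: empirical_risk(w, xs, ys))


# Toy training examples, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
best_w = minimize_empirical_risk(candidates, xs, ys)  # 2.0 on this data
```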

**6. What is the expected risk?**

The expected value of the loss, i.e. the average over an infinite number
of examples.
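
In symbols, and assuming the examples (**x**, y) are drawn independently from a fixed but unknown distribution P(**x**, y) (a standard assumption that the answer above does not state explicitly), the expected risk and the empirical risk over m examples can be written as follows. The notation R, R_emp, and m is introduced here for illustration.

```latex
R[f] = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ L(f(\mathbf{x}), y) \right]
     = \int L(f(\mathbf{x}), y) \, dP(\mathbf{x}, y),
\qquad
R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(\mathbf{x}_i), y_i).
```

Under this assumption, and for a fixed function f, the law of large numbers says that the empirical risk approaches the expected risk as m grows.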

**7. What is “generalization”?**

The capability that a predictive system f(**x**) has to make "good" predictions
on examples that were not used for training.

**8. What is “overfitting”?**

Learning the training examples very well but making poor predictions
on new test examples.
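
An extreme, hypothetical case of overfitting is a model that simply memorizes its training examples. The sketch below uses a made-up toy task (the label is 1 when x is even) to show a training error of zero next to a poor test error.

```python
def fit_memorizer(xs, ys, default=0):
    """Memorize the training pairs; predict `default` for any unseen input."""
    table = dict(zip(xs, ys))
    return lambda x: table.get(x, default)


def error_rate(model, xs, ys):
    """Fraction of examples on which the model's prediction is wrong."""
    return sum(model(x) != y for x, y in zip(xs, ys)) / len(xs)


# Toy task: the label is 1 when x is even, 0 when x is odd.
train_x, train_y = [0, 1, 2, 3], [1, 0, 1, 0]
test_x, test_y = [4, 5, 6, 7], [1, 0, 1, 0]

model = fit_memorizer(train_x, train_y)
train_error = error_rate(model, train_x, train_y)  # 0.0: perfect on training data
test_error = error_rate(model, test_x, test_y)     # 0.5: wrong on every even input
```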

**9. What are training/validation/test sets? What is
“cross-validation”? Name one or two examples of cross-validation methods.**

For the purpose of this class: The training data provided to you is the
union of the training set and the validation set. The training data consist
of input/output pairs. They are available for "training", i.e. determining
the predictive model f(**x**). The test data consist of inputs only. The
accuracy of the predictions made on test data by the predictive model will
be assessed by someone who knows the "true" outputs, but does not give them
to you. You are free to split training data into a subset reserved for training
(training set) and a subset reserved for evaluation (validation set). You
may want to make several splits and average the results; this is called cross-validation.
One cross-validation method called "**bootstrap**" consists in drawing
with replacement several data splits in equal proportion. Another method
called **k-fold** consists in dividing the training data into k subsets,
training on (k-1) subsets and testing on the last one, then repeating the
operation for all groups of (k-1) subsets and averaging the results. 3-fold
and 10-fold cross-validation are popular methods. In the limit one can have
as many folds as there are examples. The method is then called "**leave-one-out**".
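
The k-fold procedure described above can be sketched as follows. The helper names and the `fit`/`evaluate` callables are hypothetical placeholders for whatever model and score are being used. This sketch does not shuffle the data first, which is usually advisable in practice, and it assumes k does not exceed the number of examples.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


def cross_validate(fit, evaluate, xs, ys, k):
    """Train on (k-1) folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        scores.append(
            evaluate(model, [xs[j] for j in validation], [ys[j] for j in validation])
        )
    return sum(scores) / len(scores)
```

Calling `k_fold_splits(n, n)` gives the leave-one-out splits, with a single validation example per fold.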

**10. What are hyper-parameters?**

Predictive models have adjustable parameters subject to training. Some parameters
are not "easy" to train with classical algorithms. They can be fixed during
training. Then, their values are varied and an optimum may be selected by
cross-validation. Such parameters are usually referred to as "hyper-parameters".
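
As an illustration, the sketch below treats the penalty `lam` of a one-dimensional ridge regression as the hyper-parameter: it is held fixed while the weight w is trained, then varied over a list of candidates. The choice of model is an assumption made for this example, and for brevity a single training/validation split stands in for full cross-validation.

```python
def fit_ridge(xs, ys, lam):
    """Train the parameter w of f(x) = w * x with the hyper-parameter lam held fixed."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_square_error(w, xs, ys):
    """Average square loss of f(x) = w * x on the given examples."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_hyper_parameter(candidates, train, validation):
    """Return the candidate lam whose trained model has the lowest validation error."""
    train_x, train_y = train
    val_x, val_y = validation

    def validation_error(lam):
        w = fit_ridge(train_x, train_y, lam)  # lower level: train w
        return mean_square_error(w, val_x, val_y)

    return min(candidates, key=validation_error)  # higher level: pick lam
```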

**11. What are “latent” variables?**

Learning systems have input variables (or "features"), output variables,
and internal variables. Latent variables are internal variables. While input
and output variables are observable from the outside and may be provided for
training, latent variables are not accessible and thus not provided for training.
One must usually initialize them randomly and recompute their values in the
process of learning.

**12. What is “model selection”?**

Machine learning usually consists in adjusting the parameters of a model.
However, we may have a number of candidate models (e.g. linear models, kernel
methods, tree classifiers, neural networks...). Model selection refers
to choosing the model that we believe will generalize best. Model selection
also encompasses hyper-parameter selection and feature selection. Cross-validation
is a commonly used method of model selection.

**13. What do “multiple levels of inference” mean? Is it
advantageous to have multiple levels of inference?**

Inference refers to the ability of a learning system to go from
the "particular" (the examples) to the "general" (the predictive model).
In the best of all worlds, we would not need to worry about model selection.
Inference would be performed in a single step: we input training examples
into a big black box containing all models, hyper-parameters, and parameters;
out comes the best possible trained model. In practice, we often use 2 levels
of inference: we split the training data into a training set and a validation
set. The training set serves to train at the lower level (adjust the parameters
of each model); the validation set serves to train at the higher level (select
the model). Nothing prevents us from using more than 2 levels. However, the
price to pay will be to get smaller data sets to train with at each level.

**14. What is the likelihood?**

Predictive models learn the mapping y=f(**x**) or more generally P(y|**x**).
Conversely, generative models (which are probabilistic) learn the opposite,
namely to predict the density of **x** given y, P(**x**|y). In
the maximum likelihood framework, we assume that the data was produced by
a model. The model has some parameters. The goodness-of-fit or "likelihood"
is the probability that the data was produced by the model, for a given choice
of parameters.

**Likelihood = Proba (data | model)**.

**15. What does “maximum likelihood” mean?**

The maximum likelihood method of inference chooses the set of parameters
of the model that maximizes the likelihood.
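
A small sketch, assuming a coin-flip (Bernoulli) model with a single parameter p = P(outcome is 1) and made-up data: the likelihood of independent flips is the product of the per-flip probabilities, and the code maximizes its logarithm over a grid of candidate values. Taking the logarithm does not change which p is best, because the logarithm is increasing.

```python
import math


def log_likelihood(p, data):
    """Log of Proba(data | model) for independent coin flips with P(1) = p."""
    return sum(math.log(p if outcome == 1 else 1.0 - p) for outcome in data)


def maximum_likelihood(candidates, data):
    """Return the candidate parameter value that maximizes the likelihood."""
    return max(candidates, key=lambda p: log_likelihood(p, data))


data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]          # 7 ones out of 10 flips
candidates = [i / 100 for i in range(1, 100)]  # grid 0.01, 0.02, ..., 0.99
p_hat = maximum_likelihood(candidates, data)   # the observed frequency, 0.7
```

For this model the maximum is also available in closed form: it is the fraction of ones in the data.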

**16. What is the Bayes formula?**

Bayes formula: P(A|B) P(B) = P(B|A) P(A)

Applied to our problem, we can go from a predictive model to a generative
model and vice versa using:

P(**x**|y) P(y) = P(y|**x**) P(**x**)
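
As a worked sketch of going from a generative model to a predictive one, the code below computes P(y|**x**) from P(**x**|y) and P(y), obtaining P(**x**) by summing P(**x**|y) P(y) over all values of y. The numbers are invented for illustration (a hypothetical diagnostic test) and do not come from the lecture.

```python
def posterior(prior, likelihood, x):
    """P(y|x) = P(x|y) P(y) / P(x), with P(x) = sum over y of P(x|y) P(y)."""
    joint = {y: likelihood[y][x] * prior[y] for y in prior}
    evidence = sum(joint.values())  # P(x)
    return {y: joint[y] / evidence for y in joint}


# Hypothetical numbers: P(y) and P(x|y) for a diagnostic test.
prior = {"sick": 0.01, "healthy": 0.99}
likelihood = {
    "sick": {"positive": 0.90, "negative": 0.10},
    "healthy": {"positive": 0.05, "negative": 0.95},
}

p = posterior(prior, likelihood, "positive")
# p["sick"] is about 0.154: a positive result leaves "sick" unlikely
# because the prior P(sick) is small.
```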