Questions on lecture 1

Learning and generalization

1.    What is “Machine Learning”? Give examples of learning machines.
Machine Learning is a discipline dedicated to the design and study of artificial learning systems, particularly systems that learn from examples. Learning machines include linear models, artificial neural networks, and decision trees.
2.    What is supervised learning? Name special cases of supervised learning depending on whether the inputs/outputs are categorical, ordinal, or continuous.
Supervised learning refers to learning in the presence of a teacher. When trying to learn to classify objects, the teaching signal is the class label. In this class, data objects are represented as vectors "x" of variables or "features". We seek to predict an attribute "y" of these data objects, that is another variable. A continuous variable is a real number. Both categorical and ordinal variables take values from a finite set of choices. For categorical inputs the list is not ordered (e.g. the country of origin) while for ordinal inputs it is ordered (e.g. three clinical stages in the advancement of a disease). Regardless of the type of inputs, if the output is continuous, the problem is a regression problem; if the output is categorical, the problem is a classification problem. "Ordinal regression" problems have ordinal outputs.
3.    What is unsupervised learning? Give examples of unsupervised learning tasks.
In unsupervised learning problems, no teaching signal is available. Dimensionality reduction by principal component analysis and clustering are examples of unsupervised learning.
4.    What is a loss function? What is a risk functional? Give examples.
A loss function is a function measuring the discrepancy between a predicted output f(x) and the desired outcome y: L(f(x), y). The risk is the average of L over many examples. Examples of loss functions include the square loss (y - f(x))^2, often used in regression, and the 0/1 loss used in classification, which is 1 in case of error and 0 otherwise.
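As a minimal sketch (the numeric values are made up, not from the lecture), the two loss functions named above can be written directly:

```python
# Sketch: the two loss functions mentioned above, evaluated on
# hand-made predictions (hypothetical values, for illustration only).

def square_loss(y, f_x):
    """Square loss (y - f(x))^2, commonly used in regression."""
    return (y - f_x) ** 2

def zero_one_loss(y, f_x):
    """0/1 loss: 1 if the predicted class is wrong, 0 otherwise."""
    return 0 if y == f_x else 1

# Regression example: desired outcome 3.0, prediction 2.5
print(square_loss(3.0, 2.5))        # 0.25

# Classification example: desired class "cat", prediction "dog"
print(zero_one_loss("cat", "dog"))  # 1
```

The risk is then simply the average of either loss over a collection of (x, y) examples.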
5.    What is the empirical risk? What is “empirical risk minimization”?
The empirical risk is the average loss over a finite number of given examples. Empirical risk minimization refers to finding the function f(x) in a family of functions that minimizes the empirical risk. Empirical risk minimization is a form of training/learning.
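A small sketch of these two notions (the data and the family of constant predictors are assumed here for illustration; they are not from the lecture): under the square loss, the empirical risk of the constant predictor f(x) = c is minimized by the sample mean of the outputs, which a crude grid search recovers.

```python
# Sketch: empirical risk of a constant predictor f(x) = c under the
# square loss, and empirical risk minimization over the family of
# constant predictors (hypothetical data).

def empirical_risk(c, ys):
    """Average square loss of the constant predictor f(x) = c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 6.0]

# Empirical risk minimization by a coarse grid search over c.
candidates = [c / 10 for c in range(0, 101)]
best = min(candidates, key=lambda c: empirical_risk(c, ys))
print(best)  # 3.0, the sample mean of ys
```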
6.    What is the expected risk?
The expected value of the loss, i.e. the average over an infinite number of examples.
7.    What is “generalization”?
The capability that a predictive system f(x) has to make "good" predictions on examples that were not used for training.
8.    What is “overfitting”?
Learning very well the training examples but making poor predictions on new test examples.
9.    What are training/validation/test sets? What is “cross-validation”? Name one or two examples of cross-validation methods.
For the purpose of this class: The training data provided to you are the union of the training set and the validation set. The training data consist of input/output pairs. They are available for "training", i.e. determining the predictive model f(x). The test data consist of inputs only. The accuracy of the predictions made on test data by the predictive model will be assessed by someone who knows the "true" outputs but does not give them to you. You are free to split the training data into a subset reserved for training (the training set) and a subset reserved for evaluation (the validation set). You may want to make several splits and average the results; this is called cross-validation. One cross-validation method, called the "bootstrap", consists in drawing with replacement several data splits in equal proportions. Another method, called k-fold, consists in dividing the training data into k subsets, training on (k-1) subsets and testing on the remaining one, then repeating the operation for all groups of (k-1) subsets and averaging the results. 3-fold and 10-fold cross-validation are popular methods. In the limit one can have as many folds as there are examples; the method is then called "leave-one-out".
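The k-fold procedure described above can be sketched in a few lines (the data and the choice of a constant predictor trained by the mean are assumptions for illustration, not part of the lecture):

```python
# Sketch: 3-fold cross-validation of the constant predictor
# f(x) = mean of the training outputs, under the square loss
# (hypothetical data, stdlib only).

def k_fold_scores(ys, k):
    """Split ys into k folds; "train" (compute the mean) on k-1 folds,
    evaluate the average square loss on the held-out fold."""
    folds = [ys[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        c = sum(train) / len(train)  # training step: fit the constant
        scores.append(sum((y - c) ** 2 for y in held_out) / len(held_out))
    return scores

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = k_fold_scores(ys, 3)
print(sum(scores) / len(scores))  # cross-validated estimate of the risk
```

Setting k to the number of examples turns this into leave-one-out.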
10.    What are hyper-parameters?
Predictive models have adjustable parameters subject to training. Some parameters are not "easy" to train with classical algorithms, so they are held fixed during training; their values are then varied and an optimum may be selected by cross-validation. Such parameters are usually referred to as "hyper-parameters".
11.    What are “latent” variables?
Learning systems have input variables (or "features"), output variables, and internal variables. Latent variables are internal variables. While input and output variables are observable from the outside and may be provided for training, latent variables are not accessible and thus not provided for training. One must usually initialize them randomly and recompute their values in the process of learning.
12.    What is “model selection”?
Machine learning usually consists in adjusting the parameters of a model. However, we may have a number of candidate models (e.g. linear models, kernel methods, tree classifiers, neural networks...). Model selection refers to choosing the model that we believe will generalize best. Model selection also encompasses hyper-parameter selection and feature selection. Cross-validation is a commonly used method of model selection.
13.    What do “multiple levels of inference” mean? Is it advantageous to have multiple levels of inference?
Inference refers to the ability of a learning system to go from the "particular" (the examples) to the "general" (the predictive model). In the best of all worlds, we would not need to worry about model selection. Inference would be performed in a single step: we input training examples into a big black box containing all models, hyper-parameters, and parameters; out comes the best possible trained model. In practice, we often use 2 levels of inference: we split the training data into a training set and a validation set. The training set serves to train at the lower level (adjust the parameters of each model); the validation set serves to train at the higher level (select the model). Nothing prevents us from using more than 2 levels. However, the price to pay is smaller data sets to train with at each level.
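The 2-level scheme can be sketched concretely (the 1-D data and the two candidate models, a constant and a least-squares line, are illustrative assumptions, not from the lecture):

```python
# Sketch: two levels of inference. Lower level: fit each candidate
# model on the training set. Higher level: select the model with the
# smallest square loss on the validation set (hypothetical data).

def fit_constant(xs, ys):
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b in closed form.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def validation_risk(f, xs, ys):
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Training data split into a training set and a validation set.
x_tr, y_tr = [0.0, 1.0, 2.0, 3.0], [0.1, 1.9, 4.1, 5.9]  # roughly y = 2x
x_va, y_va = [4.0, 5.0], [8.0, 10.1]

models = {name: fit(x_tr, y_tr) for name, fit in
          [("constant", fit_constant), ("linear", fit_line)]}
best = min(models, key=lambda name: validation_risk(models[name], x_va, y_va))
print(best)  # the linear model generalizes better on this data
```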
14.    What is the likelihood?
Predictive models learn the mapping y=f(x) or, more generally, P(y|x). Conversely, generative models (which are probabilistic) learn the opposite, namely to predict the density of x given y, P(x|y). In the maximum likelihood framework, we assume that the data was produced by a model. The model has some parameters. The goodness-of-fit or "likelihood" is the probability that the data was produced by the model, for a given choice of parameters.
Likelihood = Proba (data | model).
15.    What means “maximum likelihood”?
The maximum likelihood method of inference chooses the set of parameters of the model that maximize the likelihood.
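A classic textbook sketch (the coin-flip data are made up for illustration): for a biased coin modeled as Bernoulli(p), the likelihood of the observed flips is p^heads * (1-p)^tails, and the maximum likelihood estimate is the observed fraction of heads, which a grid search over p recovers numerically.

```python
# Sketch: maximum likelihood for a biased coin (hypothetical data).
# Model: each flip is heads (1) with probability p, tails (0) otherwise.

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 heads, 3 tails

def likelihood(p, data):
    """Proba(data | model with parameter p) = p^heads * (1-p)^tails."""
    heads = sum(data)
    tails = len(data) - heads
    return p ** heads * (1 - p) ** tails

# Choose the parameter that maximizes the likelihood (grid search).
grid = [i / 1000 for i in range(1001)]
p_ml = max(grid, key=lambda p: likelihood(p, data))
print(p_ml)  # 0.7, the fraction of heads in the data
```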
16.    What is the Bayes formula?
Bayes formula: P(A|B) P(B) = P(B|A) P(A)
Applied to our problem, we can go from a predictive model to a generative model and vice versa using:
P(x|y) P(y) = P(y|x) P(x)
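A worked numeric sketch of this conversion (the spam-filtering numbers are invented for illustration): dividing both sides by P(x), with P(x) obtained by summing P(x|y) P(y) over the classes, turns the generative quantities into predictive ones.

```python
# Sketch: from a generative model P(x|y) to a predictive model P(y|x)
# via Bayes' formula P(y|x) = P(x|y) P(y) / P(x), where
# P(x) = sum over y of P(x|y) P(y). All numbers are hypothetical.

p_y = {"spam": 0.2, "ham": 0.8}           # class priors P(y)
p_x_given_y = {"spam": 0.6, "ham": 0.05}  # P(x = message contains "free" | y)

p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)  # evidence P(x) = 0.16
p_y_given_x = {y: p_x_given_y[y] * p_y[y] / p_x for y in p_y}
print(p_y_given_x["spam"])  # 0.75: the posterior probability of spam
```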