Questions on lecture 2

Feature extraction
1.    What are feature extraction, feature construction, and feature selection?

Feature extraction = feature construction + feature selection.
Feature construction means creating new features from the raw data (this includes normalizations, making products of features, using ad hoc algorithms like extracting edges in image processing). Feature selection means reducing the number of features by removing irrelevant or redundant features.
2.    What are the three “ingredients” of a feature selection method?
- Defining a criterion of selection (for individual features or feature subsets). The criterion may be a ranking coefficient that measures the degree of dependence of a feature on the target, or the performance of a learning machine.
- Choosing a method of estimation (for instance, evaluating the criterion on training examples; in some cases cross-validation should be used).
- Choosing a search strategy. When the number of feature subsets to be assessed is too large for an exhaustive search, the space of feature subsets must be searched in a more efficient way.
3.    What are filters/wrappers/embedded methods?
- Filters use a criterion of selection that does not make use of the learning machine. An example of a filter is the use of the Pearson correlation coefficient for feature ranking.
- Wrappers use the learning machine to evaluate the performance of alternative feature subsets. They use a search method to explore the space of possible subsets. The learning machine is considered as a "black box", i.e. no knowledge of the learning algorithm is necessary to apply the method.
- Embedded methods use feature selection strategies particular to given learning machines.
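As an illustration of a wrapper's search strategy (a minimal sketch, not from the lecture), here is greedy forward selection; the `score` function is a hypothetical stand-in for the cross-validated performance of the black-box learning machine on a feature subset:

```python
def forward_selection(n_features, score, k):
    """Greedy forward selection: grow a subset of size k, at each step
    adding the feature that most improves the (black-box) score."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k:
        # Try adding each remaining feature; keep the best one.
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy search assesses only O(n k) subsets instead of all 2^n, at the price of possibly missing the best subset.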
4.    What is a “univariate method”?
A method making the assumption that variables are independent. Univariate feature selection methods assess the predictive power of individual features.
5.    What is a “multivariate” method?
A method taking into account variable covariance. Multivariate feature selection methods assess the predictive power of feature subsets.
6.    What is the Pearson correlation coefficient? How can it be used for feature selection?
The Pearson correlation coefficient between two vectors x and y is defined as:
C(x, y) = (x - µx).(y - µy) / (sx sy)
where µx is the mean of the coefficients of vector x and sx = ||x - µx|| is the norm of the centered vector (proportional to its standard deviation). So, essentially, the Pearson correlation coefficient is a dot product (or scalar product) between x and y, after "standardization". The standardization operation consists in subtracting the mean and dividing by the norm. The absolute value of the Pearson correlation coefficient is used to rank features. In this case, x is a column of the data matrix and y is the vector of target values. Important: elsewhere, we call x a line of the data matrix.
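The ranking procedure above can be sketched with NumPy (center each column, normalize, and sort features by absolute correlation with the target):

```python
import numpy as np

def pearson_rank(X, y):
    """Rank the features (columns of X) by |Pearson correlation| with
    the target vector y. Returns feature indices, most correlated first."""
    Xc = X - X.mean(axis=0)                # subtract each column's mean
    yc = y - y.mean()
    # Dot products of standardized vectors = correlation coefficients
    c = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(c))
```

This is a univariate filter: each feature is scored independently of the others.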

Architectures and algorithms
1.    What is a “linear discriminant” classifier? Name examples.
A linear discriminant classifier is a function f(x) linear in its parameters. Examples include f(x) = w.x + b (weighted sum of the inputs), f(x) = w.phi(x) + b (the "Perceptron"), and f(x) = sumi alphai k(xi, x) + b (kernel method). The linear discriminant f(x) = w.x + b is also linear in the input components. It builds a decision surface f(x) = 0, which is a hyperplane.
2.    What is an artificial neuron? What is a McCulloch-Pitts neuron?
An artificial neuron is a very simplified model of a biological neuron. For the McCulloch-Pitts artificial neuron, the inputs and outputs are binary (representing two states, "active" or "inactive"); the synapse strength (the connection between neurons via e.g. a neurotransmitter) is modeled by a weight; the "potential" of the neuron is modeled by a weighted sum of the inputs; and whether or not the neuron "fires" is determined by thresholding the potential. Such an artificial neuron is a linear discriminant.
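A minimal sketch of such a neuron (the particular weights and threshold below are just an illustration, not from the lecture):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: binary inputs and output; the 'potential'
    is a weighted sum, and the neuron 'fires' iff it reaches the threshold."""
    potential = sum(w * x for w, x in zip(weights, inputs))
    return 1 if potential >= threshold else 0

# With weights (1, 1): threshold 2 implements AND, threshold 1 implements OR.
```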
3.    What is Hebb’s rule?
This is the simplest way of training an artificial neuron: the synapse between two neurons is reinforced if there is co-activity, or equivalently if for a given neuron its input is 1 and simultaneously its output is 1:
wj <- wj + xj y
For neurons with binary 0/1 states, the weight is updated only if positive activity takes place. But the rule can also be applied to "neurons" with +1/-1 states and to linear discriminants with continuous-valued inputs.
If there are equal numbers of examples in each class, a linear classifier f(x) = w.x + b trained with Hebb's rule classifies examples according to the nearest class centroid. It may classify the training examples with a few errors.
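A sketch of Hebb's rule for a linear classifier with +/-1 labels; starting from w = 0, one pass gives w = sumi yi xi, which points from one class centroid to the other. The choice of bias b (placing the boundary midway between the two centroids) is an assumption added here for the illustration:

```python
import numpy as np

def hebb_train(X, y):
    """One Hebbian pass over examples with +/-1 labels: wj += xj * y.
    With balanced classes, w is proportional to the difference of the
    class centroids, so sign(w.x + b) is a nearest-centroid classifier."""
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        w += xi * yi                       # Hebb's update for each example
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    b = -w @ (mu_pos + mu_neg) / 2         # boundary midway between centroids
    return w, b
```

Predictions are then made with sign(w @ x + b).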
4.    What is the “Perceptron”? What is the Perceptron algorithm?
The Perceptron is a linear discriminant invented by Rosenblatt in the late fifties:
f(x) = w.phi(x) + b. It may be trained with the Perceptron algorithm, which also applies to the simpler f(x) = w.x + b model. The Perceptron algorithm is like Hebb's rule, but updates are made only for misclassified examples. If the training examples are "linearly separable", the Perceptron algorithm converges to a hyperplane separating the examples without error.
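A sketch of the Perceptron algorithm for the simpler f(x) = w.x + b model with +/-1 labels (the epoch limit is an assumption added for safety; on linearly separable data the loop stops earlier):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Perceptron algorithm: Hebb-style updates, but applied only
    to misclassified examples (those with y * f(x) <= 0)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified (or on the boundary)
                w += yi * xi               # same update as Hebb's rule
                b += yi
                errors += 1
        if errors == 0:                    # all examples correctly classified
            break
    return w, b
```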
5.    What is gradient descent?
Gradient descent is a method of optimization. Given a cost function (or risk functional) R[f], steps are made in the parameter space of f to decrease R[f] in the direction of the steepest local slope. The method converges to a local minimum of R[f]. The error rate is the "natural" risk functional for classification problems. However, it cannot be optimized by gradient descent because of its discontinuities. It is often substituted by other risk functionals (the "Perceptron" risk, the mean square error, etc.). Those will be reviewed again in upcoming lectures.
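A one-dimensional illustration (an added example, not from the lecture): gradient descent on the cost R(w) = (w - 3)^2, whose gradient is 2(w - 3), converges to the minimum at w = 3.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Generic gradient descent: repeatedly step against the gradient."""
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)   # move in the direction of steepest descent
    return w

# Minimize R(w) = (w - 3)**2:
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

The learning rate eta controls the step size; too large a value makes the iteration diverge, too small a value makes convergence slow.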
6.    What is "batch gradient" descent? What is "stochastic gradient" or "on-line" gradient descent?
In "batch gradient" descent, a weight update is made using information from all the training examples. Conversely, "on-line" or "stochastic" gradient descent makes a weight update for each example individually.
7.    What is the “Adaline”? What is LMS?
The Adaline is a linear model f(x) = w.x + b proposed in the sixties by Widrow as an artificial neuron model. It was trained with an on-line gradient algorithm optimizing the square loss L(f(x), y) = (f(x) - y)^2. LMS ("least mean squares") is the name of this on-line training algorithm.
8.    What method(s) solve(s) exactly the least square problem for a linear model? How does this relate to LMS and Hebb’s rule?
The normal equations, or "pseudo-inverse" method, solve the least-square problem exactly. The LMS algorithm converges towards the same solution. One can see that, for the learning machine f(x) = w.x + b and targets y = +/-1, the derivative of the loss L(f(x), y) = (f(x) - y)^2 with respect to a weight wj is -2 (1 - y f(x)) y xj. The weight update of LMS goes in the direction of the negative gradient of the loss for a single example. Hence, we have:
wj <- wj + eta (1 - y f(x)) xj y
where eta is the learning rate. We notice that LMS is similar to Hebb's rule, except for the factor (1 - y f(x)). This factor decreases the update if the goal y f(x) = 1 is nearly achieved. Thus, similarly to the Perceptron algorithm, it does not insist on learning examples that are already well classified.
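The update above can be sketched as follows (a minimal illustration with +/-1 labels; the learning rate and epoch count are assumptions chosen for the example):

```python
import numpy as np

def lms_train(X, y, eta=0.1, epochs=50):
    """On-line LMS for f(x) = w.x + b: for each example, a Hebb-like
    update scaled by the factor (1 - y*f(x)), which shrinks as the
    example becomes well learned."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            factor = 1.0 - yi * (w @ xi + b)   # small when y*f(x) is near 1
            w += eta * factor * yi * xi
            b += eta * factor * yi
    return w, b
```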
9.    What is a “kernel”? What is a dot product? Give examples of kernels that are valid dot products.
A kernel is a similarity measure. The kernels we will be talking about in this class are dot products. We all know the "regular" dot product (or scalar product) in a Euclidean space, x.y = sumj xj yj. More generally, a dot product on a vector space V is a positive symmetric bilinear form:
<.,.>: V x V -> R
       (x, x') -> <x, x'>
satisfying:
<x, x'> = <x', x> (symmetry)
<ax, x'> = a <x, x'> and <x, ax'> = a <x, x'> (bilinearity, together with additivity in each argument)
<x, x> >= 0, with equality only for x = 0 (positivity)

Kernels that can be expanded as k(x, x') = phi(x).phi(x') are valid dot products. Examples include the polynomial kernel k(x, x') = (x.x')^d and the Gaussian kernel.
10.    What is an RBF? What is a “potential function”? What is a “Parzen window”?
These are all names for kernels. RBF stands for "radial basis function". A Gaussian kernel is an RBF. A potential function is also an RBF having the form of an electric potential. A Parzen window is also an RBF of any shape, used in particular for density estimation. You do not need to remember this nomenclature, this is just for your information.
11.    What is a kernel classifier? Is a kernel classifier a linear discriminant classifier?
A kernel classifier is of the form f(x) = sumi alphai k(xi, x) + b. It is linear in its parameters, but usually not in its input components (except for the "linear kernel", that is k(x, x') = x.x'). So it is a linear discriminant classifier according to our definition.
12.    What is the “kernel trick”?
The kernel trick consists in noticing that there is an equivalence between the two types of linear discriminant:
f(x) = w.phi(x) + b (Perceptron) and f(x) = sumi alphai k(xi, x) + b (kernel method), in the case where w = sumi alphai phi(xi) and if we define k(xi, x) = phi(xi).phi(x). Replacing one by the other does not seem to be an advantage (if N is the dimension of phi, the Perceptron dot product takes N operations, while the kernel machine would take Nm for m examples). However, we usually never compute the kernel as the dot product of the phi vectors, because we know a faster-to-compute formula that is a simple function of the x vectors (e.g. a function of xi.x or of ||xi - x||). Thus we may be replacing N operations by m times something small. Furthermore, for some kernels, N may be infinite.
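The equivalence can be checked on a standard small example (chosen here as an illustration, not taken from the lecture): on R^2, the degree-2 polynomial kernel k(x, x') = (x.x')^2 equals phi(x).phi(x') for the explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). Here N = 3, so the saving is modest, but for higher input dimensions and degrees N grows combinatorially while the kernel formula stays cheap to compute.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, xp):
    """The same dot product computed directly from the inputs: (x.x')^2."""
    return float(x @ xp) ** 2

# k(a, b) and phi(a) @ phi(b) agree for any a, b in R^2.
```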