Questions on lecture 2
Feature extraction
1. What are feature extraction, feature construction,
and feature selection?
Feature extraction = feature construction + feature selection.
Feature construction means creating new features from the raw data (this
includes normalizations, making products of features, using ad hoc algorithms
like extracting edges in image processing). Feature selection means reducing
the number of features by removing irrelevant or redundant features.
2. What are the three “ingredients” of a feature selection
method?
- Defining a criterion of selection (for individual features or
feature subsets). The criterion may be a ranking coefficient that measures
the degree of dependance of a feature with the target. It may be the performance
of a learning machine.
- Choosing a method of estimation (for instance evaluating
the criterion on training examples; in some cases cross-validation should
be used).
- Choosing a search strategy. When the number of subsets of
features to be assessed is too large to do an exhausive search, the space
of feature subsets must be search in a more efficent way.
3. What are filters/wrappers/embedded methods?
- Filters use a criterion of selection that does not make use of the
learning machine. An example of filter is the use of the Pearson correlation
coefficient for feature ranking.
- Wrappers use the learning machine to evaluate the performance of
alternative feature subsets. They use a search method to explore the space
of possible subsets. The learning machine is considered as a "black box",
i.e. no knowledge of the learning algorithm is necessary to apply the method.
- Embedded methods use feature selection strategies particular to
given learning machines.
4. What is a “univariate method”?
A method making the assumption that variables are independent. Univariate
feature selection methods assess the predictive power of individual features.
5. What is a “multivariate” method?
A method taking into account variable covariance. Multivariate feature selection
methods assess the predictive power of feature subsets.
6. What is the Pearson correlation coefficient? How
can it be used for feature selection?
The Pearson correlation coefficient between two vectors x and
y is defined as: C(x,y) = ( x-µ_{x}
).( y-µ_{y}
)/ (s_{x} s_{y})
where µ_{x} is the mean of the coefficients
of vector x and s_{x} its variance. So, essentially,
the Pearson correlation coefficient is a dot product (or scalar product)
between x and y, after "standardization". The standardization
operation consists in subtracting the mean and dividing by the standard deviation.
The absolute value of the Pearson correlation coefficient is used to rank
features. In this case, x is a column of the data matrix and
y is the vector of target values. Important: elsewhere, we call x
a line of the data matrix.
Architectures and
algorithms
1. What is a “linear discriminant” classifier? Name
examples.
A linear discriminant classifier if a function f(x) linear in
its parameters. Examples include f(x)=w.x+b (weighted sum of
the inputs), f(x)=w. phi(x) +b (the "Perceptron"),
and f(x)=sum_{i} alpha_{i} k(x_{i}, x) + b (kernel method). The
linear discriminant f(x)=w.x+b
is also linear in the input components. It builds a decision surface f(x)=0,
which is a hyperplane.
2. What is
an artificial neuron? What is a McCullogh-Pitts neuron?
An artificial neuron is a very simplified model of brain neuron. For the
McCullogh-Pitts artificial neuron, the inputs and outputs are binary (representing
2 states "active" or "inactive"), the synapse strength (connection between
neuron via e.g. a neurotransmitter) is modeled by a weight, the "potential"
of the neuron is modeled by a weighted sum of the inputs, and whether or
not the neuron "fires" by the thresholded potential. Such an artificial neuron
is a linear discriminant.
3. What is Hebb’s rule?
This is the simplest way of training an artificial neuron: the synapse
between two neurons is reinforced if there is co-activity, or equivalently
if for a given neuron its input is 1 and simultaneously its output is 1.
w_{j} <- w_{j} + x_{i} y
For neurons with binary 0/1 states,
the weight is updated only if positive activity takes place. But the rule
can also be applied to "neurons" with +1/-1 states and for linear discriminant
with continuous value inputs.
If there are the same numbers of examples in either class, a linear classifier
f(x)=w.x+b trained with
Hebb's rule classifies examples according to the nearest class centroid.
It may classify the training examples with a few errors.
4. What is
the “Perceptron”? What is the Perceptron algorithm?
The Perceptron is a linear discriminant invented by Rosenblatt in the
sixties: f(x)=w.
phi(x) +b. It may be trained with the Perceptron
algorithm, which also applies to the simpler f(x)=w.x+b model. The
Perceptron algorithm is like Hebb's rule, but updates are made only for misclassified
examples. If training examples are "linearly separable" the Perceptron algorithm
converges to a hyperplane separating the examples without error.
5. What is
gradient descent?
Gradient descent is a method of optimization. Given a cost function (or risk
functional) R[f] steps are made in the parameter space of f to decrease R[f]
in the direction of the steepest local slope. The method converges to a local
minimum of f. The error rate is the "natural" risk functional for classification
problems. However, it cannot be optimized by gradient descent because of
its discontinuities. It is often substituted by other risk functionals (the
"Perceptron" risk, the mean-square-error, etc.). Those will be reviewed again
in upcoming lectures.
6. What means “Batch gradient” descent? What means
“stochastic gradient” or "on-line" gradient descent?
In "batch gradient" a weight update is made using information from all
the training examples. Conversely, "on-line" or "stochastic" gradient makes
a weight update for each example individually.
7. What is the “Adaline”? What is LMS?
The Adaline is a linear model f(x)=w.x+b proposed in
the sixties by Widrow as an artificial neuron model. It was trained with
an on-line gradient algorithm optimizing the square loss L(f(x), y)
= (f(x) - y)^{2}.
8. What method(s)
solve(s) exactly the least square problem for a linear model? How does this
relate to LMS and Hebb’s rule?
The normal equations, or "pseudo-inverse" method solve the least-square
problem. So does the LMS algorithm. On can see that for the learning machine f(x)=w.x+b, the derivative of the loss L(f(x), y) = (f(x) - y)^{2}
with respect to a weight wj is
-2 (1-yf(x)) y x_{i}. The weight update of LMS goes in the
direction of the negative gradient of the loss for a single example. Hence,
we have:
w_{j} <- w_{j} + eta (1-yf(x)) x_{i} y
where eta is the learning rate.
We notice that LMS is similar to Hebb's rule, except that we have the factor
(1-yf(x)).
This factor decreases the update if the goal yf(x)=1 is nearly achived.
Thus, similarly to the Percetron algorithm, it does not insist on learning
examples already known.
9. What is
a “kernel”? What is a dot product? Give examples of kernels that are valid
dot products.
A kernel is similarity measure.
The kernels we will be taking about in this class are dot products. We all
know the "regular" dot product (or scalar product) in a Euclidean space x.y=sum_{j
}x_{j }y_{j }. More generally, a dot product on a vector
space V is a positive symmetric bilinear form:
<.,.>: V X V -> R
(x, x') -> <x,
x'>
with
<ax, x'> = a <x, x'>
and <x, ax'>
= a <x, x'>
(bilinearity)
and
<x, x> >= 0 with equality only for x=0 (positivity)
Kernels that can be expanded as
k(x, x') = phi(x) . phi(x') are
valid dot products.
10. What is
an RBF? What is a “potential function”? What is a “Parzen window”?
These are all names for kernels. RBF stands for "radial basis function".
A Gaussian kernel is an RBF. A potential function is also an RBF having the
form of an electric potential. A Parzen window is also an RBF of any shape,
used in particular for density estimation. You do not need to remember this
nomenclature, this is just for your information.
11. What is a kernel classifier? Is a kernel classifier
a linear discriminant classifier?
A kernel classifier is of
the form f(x)=sum_{i} alpha_{i} k(x_{i}, x) + b . It is linear in its
parameters, but usually not in its input components (except for the "linear
kernel", that is k(x, x')=x.x'. So it is a linear discriminant
classifier according to our definition.
12.
What is the “kernel trick”?
The kernel trick consists in noticing that there is an equivalence between
the two types of linear discriminant:
f(x)=w. phi(x)
+b (Perceptron) and f(x)=sum_{i} alpha_{i} k(x_{i}, x) + b (Kernel method), in the case where w =
sum_{i} alpha_{i} phi(x_{i}) and if we define k(x_{i}, x)=phi(x_{i}).phi(x). Replacing one by the other does not
seem to be an advantage (if N is the dimension of phi, the Preceptron dot
product takes N operations, while the kernel machine would take Nm for m
examples). However, we usually never compute the kernel as the dot
product of the phi vectors, because we know a faster-to-compute formular
that is a simple function of the x vectors (e.g. a function of x_{i}.x or of || x_{i}-x ||). Thus we may be replacing N operations
by m.something_small. Furthermore,
for some kernels, N may be infinite.