Questions on lecture 7

Support Vector Machines
What is the decision function of a support vector classifier (SVC)?
There are two possible "dual" representations:
- The Perceptron (primal) representation f(x) = w.phi(x)
- The kernel (dual) representation f(x) = sum_k alpha_k y_k k(x, x_k)
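As a minimal numerical sketch of the kernel representation (the coefficients alpha_k, the labels and the RBF kernel choice below are hypothetical, not from the lecture):

```python
import numpy as np

def rbf_kernel(x, xk, gamma=1.0):
    # k(x, x_k) = exp(-gamma * ||x - x_k||^2)
    return np.exp(-gamma * np.sum((x - xk) ** 2))

def svc_decision(x, alphas, ys, xs, b=0.0, gamma=1.0):
    # Dual representation: f(x) = sum_k alpha_k y_k k(x, x_k) + b
    return sum(a * y * rbf_kernel(x, xk, gamma)
               for a, y, xk in zip(alphas, ys, xs)) + b

# Two training points with opposite labels; alphas chosen arbitrarily
xs = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
ys = [1, -1]
alphas = [0.5, 0.5]

# f(x) is positive near the +1 example, negative near the -1 example
print(svc_decision(np.array([0.0, 0.0]), alphas, ys, xs))
print(svc_decision(np.array([2.0, 0.0]), alphas, ys, xs))
```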
What objectives are being pursued when training an SVC?
- Minimizing the number of training errors
- Having a margin between examples of either class that is as large as possible
In what do those objectives differ from those of the Perceptron algorithm?
The Perceptron algorithm only attempts to minimize the number of training errors.
What are support vectors?
Support vectors are those examples that are closest to the decision boundary; they entirely define the solution of the SVM optimization problem.
What are the main properties of the SVC solution?
- Unique solution
- A function only of "support vectors"
- Stable solution (does not change if any example other than a support vector is removed; the training error does not change under small changes of the weight vector). Consequence: good leave-one-out error bounds.
Starting from a linear optimum margin classifier, how can one handle the non-linearly separable case (three answers)?
- Make the classifier non-linear using the phi functions.
- Use a negative margin classifier (not a unique solution anymore).
- Use a "soft-margin" classifier.
What are the two types of support vectors for soft margin classifiers?
- Marginal support vectors (on the margin).
- Non-marginal support vectors (within the margin or misclassified).
What is the "kernel trick"?
In algorithms for linear models where an input vector x and the training examples x_k appear only through their dot product x.x_k, the "trick" is to replace x.x_k by another dot product k(x, x_k) to make the model non-linear.
Another viewpoint is to start with a model linear in its parameters but non-linear in its input x, f(x) = w.phi(x). Then, if the learning algorithm yields a linear combination of the examples, w = sum_k alpha_k y_k phi(x_k), one obtains the dual representation:
f(x) = sum_k alpha_k y_k phi(x_k).phi(x) = sum_k alpha_k y_k k(x, x_k),
where k(x, x_k) = phi(x_k).phi(x) is a valid dot product.
The "trick" is then to use any valid dot product k(x, xk) having a phi expansion (even an infinite expansion or an integral).
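The equivalence between the kernel and its phi expansion can be checked numerically. A small sketch for the homogeneous quadratic kernel k(x, z) = (x.z)^2 in two dimensions, whose explicit feature map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2):

```python
import numpy as np

def phi(x):
    # Explicit feature map of the homogeneous quadratic kernel in 2D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # The same quantity computed directly in input space: k(x, z) = (x.z)^2
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The trick: k(x, z) equals phi(x).phi(z) without ever forming phi
print(k(x, z), np.dot(phi(x), phi(z)))  # both equal 1.0 here
```

The payoff is that k(x, z) costs a single dot product in input space, while some kernels (e.g. the RBF kernel) correspond to an infinite-dimensional phi that could not be formed explicitly at all.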
What is the loss function of an SVC?
For an example x of margin z = y f(x): max(0, 1-z) or max(0, 1-z)^2.
Name other large margin loss functions. How does the plot of the loss vs. margin look like?
The logistic loss log(1 + e^-z) and the Adaboost loss e^-z. They all increase sharply for z < 0 (misclassified examples), but are non-zero in the 0 < z < 1 margin region. They are also continuous, smooth functions allowing gradient descent (unlike the 0/1 loss).
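A small sketch evaluating these margin losses side by side (printing values at a few margins instead of plotting):

```python
import numpy as np

def hinge(z):          # SVC loss: max(0, 1 - z)
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):  # max(0, 1 - z)^2
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):       # log(1 + e^{-z})
    return np.log1p(np.exp(-z))

def exponential(z):    # Adaboost loss: e^{-z}
    return np.exp(-z)

# Margins spanning misclassified (z < 0), inside-margin (0 < z < 1)
# and well-classified (z >= 1) regions
z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, loss in [("hinge", hinge), ("sq-hinge", squared_hinge),
                   ("logistic", logistic), ("exp", exponential)]:
    print(name, loss(z))
```

Note that the hinge losses are exactly zero for z >= 1, while the logistic and exponential losses stay strictly positive everywhere.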
What regularizers are used with SVC?
The 1-norm or the 2-norm of w.
Can those regularizers be used with the square loss as well? What are the names of the corresponding techniques?
2-norm regularization -> kernel ridge regression. 1-norm regularization -> lasso.
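Kernel ridge regression even has a closed-form dual solution, alpha = (K + lambda I)^-1 y, where K is the Gram matrix of the kernel. A minimal sketch on synthetic data (the RBF kernel, gamma and lambda values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = np.sin(X[:, 0])  # toy regression target

def rbf_gram(A, B, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-2
K = rbf_gram(X, X)
# Closed-form kernel ridge solution: alpha = (K + lambda I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions are again a kernel expansion: f(x) = sum_k alpha_k k(x, x_k)
y_hat = rbf_gram(X, X) @ alpha
print(np.abs(y_hat - y).max())  # training residual
```

The lasso, in contrast, has no closed form; its 1-norm penalty is solved iteratively (e.g. by coordinate descent) and drives many coefficients exactly to zero.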
How can one define a regression SVM?
One can define an epsilon-tube playing the role of a margin. Examples in the tube do not incur a loss. The loss increases linearly outside the tube.
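The corresponding epsilon-insensitive loss can be sketched as follows (the epsilon value is an illustrative choice):

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.1):
    # Zero inside the epsilon-tube, linear outside:
    # L = max(0, |y - f(x)| - eps)
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

print(eps_insensitive(1.0, 1.05))  # inside the tube -> 0.0
print(eps_insensitive(1.0, 1.30))  # outside the tube -> about 0.2
```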
Can one define an SVM for unsupervised learning?
Yes. In several different ways. The simplest one is the "one-class" SVM, with application to define the support of a distribution, novelty detection, and clustering.