Questions on lecture 7

Support Vector Machines
What is the decision function of a support vector classifier (SVC)?
There are two possible "dual" representations:
- The Perceptron (primal) representation f(x) = w.phi(x)
- The kernel (dual) representation f(x) = sum_k alpha_k y_k k(x, x_k)
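As a minimal numerical sketch of the kernel representation (the coefficients alpha_k, the labels and the RBF kernel choice below are hypothetical, not from the lecture):

```python
import numpy as np

def rbf_kernel(x, xk, gamma=1.0):
    # k(x, x_k) = exp(-gamma * ||x - x_k||^2)
    return np.exp(-gamma * np.sum((x - xk) ** 2))

def svc_decision(x, alphas, ys, xs, b=0.0, gamma=1.0):
    # Dual representation: f(x) = sum_k alpha_k y_k k(x, x_k) + b
    return sum(a * y * rbf_kernel(x, xk, gamma)
               for a, y, xk in zip(alphas, ys, xs)) + b

# Two training points with opposite labels; alphas chosen arbitrarily
xs = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
ys = [1, -1]
alphas = [0.5, 0.5]

# f(x) is positive near the +1 example, negative near the -1 example
print(svc_decision(np.array([0.0, 0.0]), alphas, ys, xs))
print(svc_decision(np.array([2.0, 0.0]), alphas, ys, xs))
```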
What objectives are being pursued when training an SVC?
- Minimizing the number of training errors
- Having a margin between examples of either class that is as large as possible
In what do those objectives differ from those of the Perceptron algorithm?
The Perceptron algorithm only attempts to minimize the number of training errors.
What are support vectors?
Support vectors are those examples that are closest to the decision boundary; they entirely define the solution of the SVM optimization problem.
What are the main properties of the SVC solution?
- Unique solution
- A function only of "support vectors"
- Stable solution (does not change if any example other than a support vector is removed; the training error does not change under small changes of the weight vector). Consequence: good leave-one-out error bounds.
Starting from a linear optimum margin classifier, how can one handle the non-linearly separable case (three answers)?
- Make the classifier non-linear using the phi functions.
- Use a negative margin classifier (not a unique solution anymore).
- Use a "soft-margin" classifier.
What are the two types of support vectors for soft margin classifiers?
- Marginal support vectors (on the margin).
- Non-marginal support vectors (within the margin or misclassified).
What is the "kernel trick"?
In algorithms for linear models where an input vector x and the training examples x_k appear only through their dot product x.x_k, the "trick" is to replace x.x_k by another dot product k(x, x_k) to make the model non-linear.
Another viewpoint is to start with a model linear in its parameters but non-linear in its input x, f(x) = w.phi(x). Then, if the learning algorithm yields a linear combination of the examples, w = sum_k alpha_k y_k phi(x_k), one obtains the dual representation:
f(x) = sum_k alpha_k y_k phi(x_k).phi(x) = sum_k alpha_k y_k k(x, x_k),
where k(x, x_k) = phi(x_k).phi(x) is a valid dot product.
The "trick" is then to use any valid dot product k(x, xk) having a phi expansion (even an infinite expansion or an integral).
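The equivalence between the kernel and its phi expansion can be checked numerically. A small sketch for the homogeneous quadratic kernel k(x, z) = (x.z)^2 in two dimensions, whose explicit feature map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2):

```python
import numpy as np

def phi(x):
    # Explicit feature map of the homogeneous quadratic kernel in 2D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # The same quantity computed directly in input space: k(x, z) = (x.z)^2
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The trick: k(x, z) equals phi(x).phi(z) without ever forming phi
print(k(x, z), np.dot(phi(x), phi(z)))  # both equal 1.0 here
```

The payoff is that k(x, z) costs a single dot product in input space, while some kernels (e.g. the RBF kernel) correspond to an infinite-dimensional phi that could not be formed explicitly at all.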
What is the loss function of an SVC?
For an example x of margin z = y f(x): max(0, 1-z) or max(0, 1-z)^2.
Name other large margin loss functions. How does the plot of the loss vs. margin look like?
The logistic loss log(1 + e^-z) and the Adaboost loss e^-z. They all increase sharply for z < 0 (misclassified examples), but are non-zero in the 0 < z < 1 margin region. They are also continuous, smooth functions allowing gradient descent (unlike the 0/1 loss).
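A small sketch evaluating these margin losses side by side (printing values at a few margins instead of plotting):

```python
import numpy as np

def hinge(z):          # SVC loss: max(0, 1 - z)
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):  # max(0, 1 - z)^2
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):       # log(1 + e^{-z})
    return np.log1p(np.exp(-z))

def exponential(z):    # Adaboost loss: e^{-z}
    return np.exp(-z)

# Margins spanning misclassified (z < 0), inside-margin (0 < z < 1)
# and well-classified (z >= 1) regions
z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, loss in [("hinge", hinge), ("sq-hinge", squared_hinge),
                   ("logistic", logistic), ("exp", exponential)]:
    print(name, loss(z))
```

Note that the hinge losses are exactly zero for z >= 1, while the logistic and exponential losses stay strictly positive everywhere.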
What regularizers are used with SVC?
The 1-norm or the 2-norm of w.
Can those regularizers be used with the square loss as well? What are the names of the corresponding techniques?
2-norm regularization -> kernel ridge regression. 1-norm regularization -> lasso.
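Kernel ridge regression even has a closed-form dual solution, alpha = (K + lambda I)^-1 y, where K is the Gram matrix of the kernel. A minimal sketch on synthetic data (the RBF kernel, gamma and lambda values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = np.sin(X[:, 0])  # toy regression target

def rbf_gram(A, B, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-2
K = rbf_gram(X, X)
# Closed-form kernel ridge solution: alpha = (K + lambda I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions are again a kernel expansion: f(x) = sum_k alpha_k k(x, x_k)
y_hat = rbf_gram(X, X) @ alpha
print(np.abs(y_hat - y).max())  # training residual
```

The lasso, in contrast, has no closed form; its 1-norm penalty is solved iteratively (e.g. by coordinate descent) and drives many coefficients exactly to zero.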
How can one define a regression SVM?
One can define an epsilon-tube playing the role of a margin. Examples in the tube do not incur a loss. The loss increases linearly outside the tube.
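The corresponding epsilon-insensitive loss can be sketched as follows (the epsilon value is an illustrative choice):

```python
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.1):
    # Zero inside the epsilon-tube, linear outside:
    # L = max(0, |y - f(x)| - eps)
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

print(eps_insensitive(1.0, 1.05))  # inside the tube -> 0.0
print(eps_insensitive(1.0, 1.30))  # outside the tube -> about 0.2
```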
Can one define an SVM for unsupervised learning?
Yes. In several different ways. The simplest one is the "one-class" SVM, with application to define the support of a distribution, novelty detection, and clustering.