Questions on lecture 12

Ensemble methods
1.   What is usually referred to as a learning machine ensemble, mixture of experts, or committee machine?
F(x) = sum_k alpha_k f_k(x), where f_k(x) is an expert or a base learner (depending on the point of view) and alpha_k is its weight.
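As a concrete illustration (a minimal Python/NumPy sketch; the stumps and the weights below are made up, not from the lecture), the committee prediction for binary labels in {-1, +1} is just a weighted vote:

    import numpy as np

    def ensemble_predict(x, base_learners, alphas):
        """Weighted committee F(x) = sum_k alpha_k f_k(x); the sign gives the predicted class."""
        votes = np.array([f(x) for f in base_learners])    # each f_k returns -1 or +1
        return np.sign(np.dot(alphas, votes))

    # Hypothetical example: three decision stumps on a 2-feature input
    stumps = [lambda x: 1 if x[0] > 0.5 else -1,
              lambda x: 1 if x[1] > 1.0 else -1,
              lambda x: 1 if x[0] + x[1] > 1.2 else -1]
    alphas = np.array([0.5, 0.3, 0.2])
    print(ensemble_predict(np.array([0.7, 0.9]), stumps, alphas))   # -> 1.0
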
2.   Give examples of simple "base learners"
- Decision stumps: these are classifiers using a single variable and a threshold, which are equivalent to the root node of a tree classifier (a minimal sketch follows this list).
- Kernel basis functions.
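A minimal sketch of such a decision stump (pure NumPy, binary labels in {-1, +1}; the class name and the brute-force fitting are illustrative, not the lecture's code):

    import numpy as np

    class DecisionStump:
        """Threshold classifier on a single feature: sign * (+1 if x[feature] > threshold else -1)."""
        def fit(self, X, y):                               # y in {-1, +1}
            best_err = np.inf
            for j in range(X.shape[1]):                    # try every feature...
                for t in np.unique(X[:, j]):               # ...every threshold...
                    for s in (+1, -1):                     # ...and both orientations
                        pred = s * np.where(X[:, j] > t, 1, -1)
                        err = np.mean(pred != y)
                        if err < best_err:
                            best_err, self.feature, self.threshold, self.sign = err, j, t, s
            return self

        def predict(self, X):
            return self.sign * np.where(X[:, self.feature] > self.threshold, 1, -1)

    X = np.array([[0.1], [0.4], [0.6], [0.9]]); y = np.array([-1, -1, 1, 1])
    print(DecisionStump().fit(X, y).predict(X))            # -> [-1 -1  1  1]
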
3.   What is the relation between ensemble methods and Bayesian learning?

In the Bayesian framework, a model f is one possible explanation of the data, and it can be "marginalized out": P(y|x,D) = sum_f P(f|D) P(y|x,f,D). In this formula,
P(y|x,f,D) is our "base learner" and P(f|D) is its weight. The "Bayesian" makes predictions according to P(y|x,D), that is, according to a vote of the base learners, with the weight P(f|D) measuring how strongly the data support each of them. This is also called the "weighted majority" algorithm.
4.   What is the relation between "stability" and ensemble methods?
Assume we have a training set D of size m. Consider an "ideal" committee built by voting among models of our model class trained on all possible training sets of size m. The expected loss can be split into two terms: a "bias" term (measuring the error of the ideal committee) and a "variance" term
(measuring the discrepancy between the prediction of our model and that of the ideal committee). Both bias and variance must be reduced to get a good prediction error, but there is often a tradeoff. Committees work on reducing variance. Stable methods have low variance (predictors built with training sets of the same size make similar predictions). Noting that the "ideal committee" has zero variance, we see that it is very stable. Committee machines imitate the "ideal committee".
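For squared loss this split can be written down explicitly (a standard decomposition, stated here in the same plain notation as above for a fixed input x and target y; the 0/1-loss version used for classifiers is analogous in spirit):

    E_D[ (y - f_D(x))^2 ] = (y - Fbar(x))^2 + E_D[ (f_D(x) - Fbar(x))^2 ],   where Fbar(x) = E_D[ f_D(x) ]

Here f_D is the predictor trained on a random training set D of size m and Fbar is the ideal committee; the first term is its error (the bias term, plus any irreducible noise in y) and the second term is the variance.
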
5.    What makes a good "base learner"?
Looking at the bias-variance tradeoff (Question 4), we see that committees reduce variance but not bias. In fact, the ideal committee has the same bias as the base learners. So, to reduce both bias and variance, the base learners should have low bias.
6.    What are MCMCs and how does one use them to train ensembles of learning machines?
MCMC stands for Markov Chain Monte Carlo. MCMCs are methods for sampling from probability distributions. They can be used to train ensembles in the Bayesian sense by sampling from the posterior distribution
P(f|D). In this way, we get an empirical estimate of the weighted majority.
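A minimal sketch of this idea (not from the lecture: the logistic-regression base model, the Gaussian prior, the step size and all names are illustrative assumptions). Random-walk Metropolis draws samples from P(w|D); averaging the sampled models' predictions gives the empirical weighted majority:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_posterior(w, X, y, prior_var=1.0):
        """Unnormalized log P(w|D) for a logistic model with a Gaussian prior; y in {0, 1}."""
        z = X @ w
        return np.sum(y * z - np.log1p(np.exp(z))) - 0.5 * np.sum(w ** 2) / prior_var

    def metropolis_samples(X, y, n_samples=2000, step=0.1):
        """Random-walk Metropolis: a Markov chain whose stationary distribution is P(w|D)."""
        w = np.zeros(X.shape[1])
        lp = log_posterior(w, X, y)
        samples = []
        for _ in range(n_samples):
            w_new = w + step * rng.standard_normal(w.shape)   # propose a nearby model
            lp_new = log_posterior(w_new, X, y)
            if np.log(rng.random()) < lp_new - lp:            # accept with prob min(1, ratio)
                w, lp = w_new, lp_new
            samples.append(w.copy())
        return np.array(samples)

    def bayes_predict(Xtest, samples):
        """Empirical weighted majority: average P(y=1|x, w) over the posterior samples."""
        return (1.0 / (1.0 + np.exp(-Xtest @ samples.T))).mean(axis=1)

    # Illustrative usage on a toy 2-feature problem:
    X = rng.standard_normal((100, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(bayes_predict(X[:5], metropolis_samples(X, y)))     # posterior-averaged P(y=1|x)
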
7.   How does one use MCMCs for feature selection?
One can imagine a generative model in which subsets of the features are chosen with some probability distribution. One can then compute P(feature_set|D) by marginalizing out the models. Similarly,
P(feature|D) can be computed by marginalizing out both the models and the other features.
8.  What is "bagging"?
Bagging is a bootstrap method applied to learning ensembles of classifiers. Each base learner is trained with a resampled training set. A resampled training set is obtained by sampling with replacement m examples from a training set of size m. All the base learners vote with the same weight.
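A minimal sketch of bagging (scikit-learn trees as base learners; the function names and sizes are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def bagging_fit(X, y, n_learners=25):
        """Train each base learner on a bootstrap sample: m draws with replacement from the m examples."""
        m = len(X)
        return [DecisionTreeClassifier().fit(X[idx], y[idx])
                for idx in (rng.integers(0, m, size=m) for _ in range(n_learners))]

    def bagging_predict(Xtest, learners):
        """All base learners vote with the same weight; return the majority class (integer labels)."""
        votes = np.stack([clf.predict(Xtest) for clf in learners])        # (n_learners, n_test)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

    # Illustrative usage:
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    learners = bagging_fit(X, y)
    print(np.mean(bagging_predict(X, learners) == y))                     # resubstitution accuracy of the vote
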
9.  How does one use the "out-of-bag" error estimate to create a feature ranking index and compute p-values?
The out-of-bag error estimate is obtained by computing, for each base learner, the errors made on the examples not used for its training (the out-of-bag examples), then averaging the results over all base learners. The idea is to randomly permute the values of one feature in the out-of-bag examples and compute the difference in error rate between the unperturbed and the perturbed data. This difference may be used as a ranking index for features. If it is normalized by its standard error, it is approximately normally distributed, which provides a means of computing a p-value.
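A sketch of this procedure (NumPy plus scikit-learn trees, with scipy's normal tail for the p-value; the dataset and all names are illustrative):

    import numpy as np
    from scipy.stats import norm
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def oob_permutation_importance(X, y, n_trees=100):
        m, d = X.shape
        deltas = [[] for _ in range(d)]                       # per-tree error increases, one list per feature
        for _ in range(n_trees):
            idx = rng.integers(0, m, size=m)                  # bootstrap sample (with replacement)
            oob = np.setdiff1d(np.arange(m), idx)             # out-of-bag examples for this learner
            if len(oob) == 0:
                continue
            clf = DecisionTreeClassifier().fit(X[idx], y[idx])
            err = np.mean(clf.predict(X[oob]) != y[oob])      # unperturbed out-of-bag error
            for j in range(d):
                Xp = X[oob].copy()
                Xp[:, j] = rng.permutation(Xp[:, j])          # permute feature j among the OOB examples
                deltas[j].append(np.mean(clf.predict(Xp) != y[oob]) - err)
        importance, pvalue = np.zeros(d), np.ones(d)
        for j in range(d):
            dj = np.array(deltas[j])
            importance[j] = dj.mean()                         # the ranking index
            se = dj.std(ddof=1) / np.sqrt(len(dj))            # standard error over base learners
            if se > 0:
                pvalue[j] = norm.sf(importance[j] / se)       # one-sided p-value, normal approximation
        return importance, pvalue

    # Illustrative usage on synthetic data:
    X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
    imp, pv = oob_permutation_importance(X, y)
    print(imp.round(3), pv.round(3))
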
10.  What are "random forests" (RF)?
Random Forests are ensembles of tree classifiers. Each base learner is trained from a bootstrap sample. Additional variability is introduced by splitting nodes in the trees on the basis of a random subset of the original features, not all the features. Typically, if there is a total of N features, one uses sqrt(N) features at each split.
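In practice one would typically not code this by hand; an illustrative call with scikit-learn (the dataset and parameters are placeholders, not the lecture's settings):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=25, random_state=0)   # toy data
    # max_features="sqrt": each split considers a random subset of about sqrt(N) of the N features;
    # oob_score=True: also compute the out-of-bag accuracy estimate discussed in Question 9.
    rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", oob_score=True,
                                random_state=0).fit(X, y)
    print(rf.oob_score_)                                                        # out-of-bag accuracy
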
11.  How does one exploit the built-in feature selection of tree classifiers to define a ranking index with RF?
At each node, the feature selected to split the data is the one providing the largest information gain, i.e. the one having the largest mutual information with the target. An index is obtained for each feature by adding up the information gains it did or would provide at each node split. An average index can be obtained by averaging the indices obtained from several trees in a forest.
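scikit-learn exposes an index of this kind: feature_importances_ sums the (sample-weighted) impurity decrease over the splits where a feature is actually used and averages over the trees; with criterion="entropy" that decrease is the information gain. An illustrative sketch (dataset and parameters are made up):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=0).fit(X, y)
    ranking = rf.feature_importances_.argsort()[::-1]        # features sorted by average information gain
    print(ranking[:5])                                        # indices of the top-ranked features
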
12.  What is "boosting"?
Boosting is a method of training an ensemble by adding base learners in sequence. Each newly added base learner is trained with a higher proportion of (or a higher weight on) the examples that were hard to learn up to that point.
13.  How does one use boosting and decision stumps for feature selection?
One can use decision stumps as base learners. A decision stump is a classifier built from one variable (like the root node of a tree classifier). Boosting with decision stumps is a form of forward selection of variables.
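An illustrative sketch with scikit-learn's AdaBoost, whose default base learner is a depth-1 tree, i.e. a decision stump (dataset and parameters are made up):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=300, n_features=20, n_informative=3, random_state=0)
    ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

    # Each boosting round adds one stump; the feature it splits on is the variable "selected" at that round.
    selected = [stump.tree_.feature[0] for stump in ada.estimators_]
    print(selected[:10])                                      # order in which variables were picked
    print(np.unique(selected))                                # the selected feature subset
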
14.  What are generic methods of combining sets of features selected with various methods?
- Averaging ranking indices to get a new global index
- Averaging ranks to get a new global index
- Intersecting feature subsets (soft intersections may be defined by setting a threshold on the number of times a feature appears in the subsets considered; see the sketch after this list)
- Computing the "centroid" of feature subsets (the subset intersecting most with all others)
- Computing the "centroid" of feature rankings (the ranking closest to all other rankings)