Questions on lecture 12: Ensemble methods
1. What is usually referred to as a learning machine ensemble, mixture of
experts, or committee machine?
F(x) = sum_k alpha_k f_k(x), where f_k(x) is an expert or a base learner
(depending on the point of view) and alpha_k is its voting weight.
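As a toy illustration, the weighted sum F(x) = sum_k alpha_k f_k(x) can be sketched as follows; the three "experts" and their weights are made up for the example:

```python
# Minimal sketch of a committee machine F(x) = sum_k alpha_k f_k(x).
# The base learners f_k and the weights alpha_k are illustrative.

def make_committee(base_learners, weights):
    """Return F(x) combining base learners with fixed weights."""
    def F(x):
        return sum(a * f(x) for a, f in zip(weights, base_learners))
    return F

# Three toy "experts", each voting +1 or -1 based on the features.
experts = [
    lambda x: 1.0 if x[0] > 0 else -1.0,
    lambda x: 1.0 if x[1] > 0 else -1.0,
    lambda x: 1.0 if x[0] + x[1] > 0 else -1.0,
]
F = make_committee(experts, [0.5, 0.3, 0.2])
prediction = 1 if F((2.0, -1.0)) > 0 else -1
```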
2. Give examples
of simple "base learners"
- Decision stumps: these are classifiers using a single variable and
a threshold, which are equivalent to the root node of a tree classifier.
- Kernel basis functions.
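A decision stump can be fit by brute force over (feature, threshold, polarity) triples; this is an illustrative sketch, not the lecture's exact formulation:

```python
# Illustrative decision-stump training: exhaustively pick the single
# feature, threshold, and polarity that minimize the training error.

def train_stump(X, y):
    """X: list of feature tuples; y: labels in {-1, +1}."""
    best = None
    for j in range(len(X[0])):
        for theta in sorted({x[j] for x in X}):
            for sign in (1, -1):
                errs = sum(1 for x, t in zip(X, y)
                           if (sign if x[j] > theta else -sign) != t)
                if best is None or errs < best[0]:
                    best = (errs, j, theta, sign)
    _, j, theta, sign = best
    return lambda x: sign if x[j] > theta else -sign

# Toy data separable on either feature.
X = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
y = [-1, -1, 1, 1]
stump = train_stump(X, y)
```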
3. What is the relation between ensemble methods and Bayesian learning?
In the Bayesian framework, each model f is one possible explanation of the
data, and the models can be "marginalized out":
P(y|x,D) = sum_f P(f|D) P(y|x,f,D).
In this formula, P(y|x,f,D) is the prediction of a "base learner" f and
P(f|D) is its weight. The "Bayesian" makes predictions according to
P(y|x,D), that is, by a vote of the base learners in which the weight P(f|D)
measures how much confidence the data gives to each of them. This is also
called the "weighted-majority" algorithm.
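The marginalization above can be sketched numerically; the two models and the posterior weights below are made up for the example:

```python
# Toy weighted-majority / Bayesian vote: average the models' predictions,
# weighted by an (assumed) posterior P(f|D).

def bayes_predict(models, posterior, x):
    """Return P(y=1|x,D) = sum_f P(f|D) * P(y=1|x,f,D)."""
    return sum(p * f(x) for f, p in zip(models, posterior))

models = [lambda x: 0.9, lambda x: 0.2]   # each returns P(y=1|x,f,D)
posterior = [0.75, 0.25]                  # P(f|D), sums to one
p = bayes_predict(models, posterior, x=None)   # 0.75*0.9 + 0.25*0.2
```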
4. What is the
relation between "stability" and ensemble methods?
Assume we have a training set D of size m. Consider an "ideal" committee
built by voting among models of our model class trained with all possible
training sets of size m. Loss functions can be split into two terms: a
"bias" term (measuring the error of the ideal committee) and a "variance"
term (measuring the discrepancy between the prediction of our model and that
of the ideal committee). Both variance and bias must be reduced to get a good
prediction error, but there is often a tradeoff. Committees work on reducing
variance. Stable methods have low variance (predictors built with training
sets of the same size make similar predictions). Noting that the "ideal
committee" has zero variance, we see that it is very stable. Committee
machines imitate the "ideal committee".
5. What makes
a good "base learner"?
Looking at the bias-variance tradeoff (Question 4), we see that committees
reduce variance, but not bias. In fact, the ideal committee has the same
bias as the base learners. So to reduce both bias and variance, the base
learners should have low bias.
6. What are MCMCs and how does one use them to train
ensembles of learning machines?
MCMC stands for Markov Chain Monte Carlo. MCMCs are methods for "sampling"
probability distributions. They can be used to train ensembles in the
Bayesian sense by sampling the posterior distribution P(f|D). In this way,
we get an empirical estimate of the weighted majority.
7. How does one
use MCMCs for feature selection?
One can imagine a generative model in which subsets of the features
are chosen with some probability distribution. One then can compute P(feature_set|D)
by marginalizing out the models. Similarly, P(feature|D) can be computed by marginalizing
out both the models and the other features.
8. What is "bagging"?
Bagging is a bootstrap method applied to learning ensembles of classifiers.
Each base learner is trained with a resampled training set. A resampled
training set is obtained by sampling with replacement m examples from a
training set of size m. All the base learners vote with the same weight.
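A minimal bagging sketch, with a made-up one-feature base learner standing in for whatever classifier one would actually use:

```python
import random

def train_mean_stump(X, y):
    """Toy base learner: threshold feature 0 midway between the class means."""
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == -1]
    theta = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > theta else -1

def bag(X, y, n_learners=25, seed=0):
    rng = random.Random(seed)
    m = len(X)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw m examples with replacement (redraw if a class
        # is missing, so the toy base learner stays well defined).
        while True:
            idx = [rng.randrange(m) for _ in range(m)]
            if len({y[i] for i in idx}) == 2:
                break
        learners.append(train_mean_stump([X[i] for i in idx],
                                         [y[i] for i in idx]))
    # All base learners vote with the same weight.
    return lambda x: 1 if sum(f(x) for f in learners) > 0 else -1

X = [(-2.0,), (-1.5,), (-1.0,), (1.0,), (1.5,), (2.0,)]
y = [-1, -1, -1, 1, 1, 1]
F = bag(X, y)
```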
9. How does one use the
"out-of-bag" error estimate to create a feature ranking index and compute
pvalues?
The out-of-bag error estimate
is obtained by computing, for each base learner, the errors made on the
examples not used for training (out-of-bag examples), then averaging the
results over all base learners. The idea is to permute randomly the values
of one feature in the out-of-bag examples and compute the difference in error
rate between the unperturbed and perturbed data. This difference may be used
as a ranking index for features. If it is normalized by its standard error,
it is approximately normally distributed, which provides a means of computing
a p-value.
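The permutation step can be sketched as follows; the learner and the out-of-bag data are toy stand-ins, and only the bookkeeping (intact vs. permuted error) matters:

```python
import random

def permutation_importance(learner, oob_X, oob_y, feature, seed=0):
    """Rise in out-of-bag error after shuffling one feature's values."""
    rng = random.Random(seed)
    def error(rows):
        return sum(1 for x, t in zip(rows, oob_y) if learner(x) != t) / len(rows)
    base = error(oob_X)                  # error on the intact OOB examples
    col = [x[feature] for x in oob_X]
    rng.shuffle(col)                     # destroy the feature/target association
    perturbed = [x[:feature] + [v] + x[feature + 1:]
                 for x, v in zip(oob_X, col)]
    return error(perturbed) - base       # difference = ranking index

# Toy learner that only looks at feature 0; feature 1 is pure noise.
learner = lambda x: 1 if x[0] > 0 else -1
oob_X = [[-2.0, 5.0], [-1.0, 7.0], [1.0, 6.0], [2.0, 8.0]]
oob_y = [-1, -1, 1, 1]
imp0 = permutation_importance(learner, oob_X, oob_y, feature=0)
imp1 = permutation_importance(learner, oob_X, oob_y, feature=1)
```

As expected, shuffling the ignored noise feature leaves the error unchanged, so its index is zero.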
10. What are "random
forests" (RF)?
Random Forests are ensembles of tree classifiers. Each base learner is
trained from a bootstrap sample. Additional variability is introduced by
splitting nodes in the tree on the basis of a random subset of the original
features, not all the features. Typically, if there is a total of N features,
one uses sqrt(N) features at each split.
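The extra randomization step at each node can be sketched in isolation (function name is illustrative):

```python
import math
import random

def candidate_features(n_features, rng):
    """Random subset of about sqrt(N) features considered at one node split."""
    k = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)
subset = candidate_features(16, rng)   # 4 of the 16 features
```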
11. How does one
exploit the built-in feature selection of tree classifiers to define a ranking
index with RF?
At each node, the feature selected to split the data is the one providing
the largest information gain, i.e. the one having the largest mutual
information with the target. An index is obtained for each feature by adding
up the information gains it did or would provide at each node split. An
average index can be obtained by averaging the indices obtained from several
trees in a forest.
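The per-node quantity being summed can be computed directly; a toy example with a perfect split of a balanced binary node:

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(p * math.log2(p)
                for p in (labels.count(c) / n for c in set(labels)))

def information_gain(parent, left, right):
    """Mutual information between the split and the target, in bits."""
    n = len(parent)
    return entropy(parent) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

# A perfect split of a balanced binary node yields one full bit of gain.
gain = information_gain([0, 0, 1, 1], [0, 0], [1, 1])
```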
12. What is "boosting"?
Boosting is a method of training an ensemble by adding base learners in
sequence. Each newly added base learner is trained with a higher proportion
of the examples that were hard to learn up to that point.
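One concrete instance of this scheme is AdaBoost; the sketch below (toy data, stump base learner) shows the reweighting step that makes hard examples count more in the next round:

```python
import math

def train_weighted_stump(X, y, w):
    """Stump minimizing the weighted training error."""
    best = None
    for j in range(len(X[0])):
        for theta in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(wi for x, t, wi in zip(X, y, w)
                          if (sign if x[j] > theta else -sign) != t)
                if best is None or err < best[0]:
                    best = (err, j, theta, sign)
    err, j, theta, sign = best
    return err, (lambda x: sign if x[j] > theta else -sign)

def adaboost(X, y, rounds=5):
    m = len(X)
    w = [1.0 / m] * m                  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, f = train_weighted_stump(X, y, w)
        err = max(err, 1e-10)          # guard against log(0) on perfect fits
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f))
        # Up-weight misclassified examples, down-weight correct ones.
        w = [wi * math.exp(-alpha * t * f(x)) for wi, t, x in zip(w, y, X)]
        Z = sum(w)
        w = [wi / Z for wi in w]
    return lambda x: 1 if sum(a * f(x) for a, f in ensemble) > 0 else -1

X = [(-2.0, 0.0), (-1.0, 1.0), (1.0, 0.0), (2.0, -1.0)]
y = [-1, -1, 1, 1]
F = adaboost(X, y, rounds=3)
```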
13. How does one
use boosting and decision stumps for feature selection?
One can use decision stumps as base learners. A decision stump is a classifier
built from one variable (like the root node of a tree classifier). Boosting
with decision stumps is a form of forward selection of variables.
14. What are generic
methods of combining sets of features selected with various methods?
- Averaging ranking indices to get a new global index
- Averaging ranks to get a new global index
- Intersecting feature subsets (soft intersections may be defined by setting
a threshold on the number of times a feature appears in all the subsets considered)
- Computing the "centroid" of feature subsets (the subset intersecting most
with all others)
- Computing the "centroid" of feature rankings (the ranking closest to all
other rankings)
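Two of the simpler schemes above can be sketched directly (feature names and counts are illustrative):

```python
# Rank averaging and soft intersection, on made-up toy inputs.

def average_ranks(rankings):
    """rankings: list of dicts feature -> rank (1 = best).
    Returns features sorted by average rank."""
    feats = rankings[0].keys()
    return sorted(feats, key=lambda f: sum(r[f] for r in rankings) / len(rankings))

def soft_intersection(subsets, min_count):
    """Keep features appearing in at least min_count of the subsets."""
    all_feats = set().union(*subsets)
    return {f for f in all_feats if sum(f in s for s in subsets) >= min_count}

combined = average_ranks([{'a': 1, 'b': 2, 'c': 3},
                          {'a': 2, 'b': 1, 'c': 3}])
kept = soft_intersection([{'a', 'b'}, {'a', 'c'}, {'a', 'b'}], min_count=2)
```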