Questions on lecture 11

Information Theoretic Methods
1.   What is the mutual information?
MI(X, Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
This is the KL divergence between
p(x,y) and p(x)p(y). Independent random variables have zero MI.
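Below is a minimal numerical sketch of this definition, assuming a small discrete joint distribution given as a table p_xy[i, j] = p(x=i, y=j); the function name is only illustrative. It computes MI (in nats) as the KL divergence between p(x,y) and p(x)p(y) and checks that independent variables give zero MI.

```python
import numpy as np

def mutual_information(p_xy):
    """MI in nats from a joint probability table p_xy[i, j] = p(x=i, y=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # skip zero-probability cells
    # KL( p(x,y) || p(x)p(y) )
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

# Independent variables: MI = 0
p_indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(p_indep))   # ~0.0

# Perfectly dependent binary variables: MI = H(X) = log 2
p_dep = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(p_dep))     # ~0.693 nats
```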
2.   What is the relation between mutual information and information gain?
Two names for the same thing: MI(X, Y)=H(Y)-H(Y|X).
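The identity follows directly from the definition in question 1; a short derivation (writing H(Y|X) for the conditional entropy):

```latex
\begin{aligned}
MI(X, Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
          = \sum_{x,y} p(x,y) \log \frac{p(y \mid x)}{p(y)} \\
         &= -\sum_{y} p(y) \log p(y) \;+\; \sum_{x,y} p(x,y) \log p(y \mid x)
          = H(Y) - H(Y \mid X).
\end{aligned}
```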

3.   What is the basic functioning of a binary tree classifier?
At the root node, all the training examples are in the same bin. The "disorder" before we see the training data is measured by H(Y). To split a node into two child nodes, one selects the feature providing the largest information gain, that is, the largest reduction in entropy:
MI(X, Y) = H(Y) - H(Y|X). To do so, we discretize a candidate feature using a threshold and create two nodes containing the examples below and above the threshold. We then compute the entropy of each candidate child node. H(Y|X) is the weighted average of these child-node entropies, the weights being the proportions of examples that went into the child nodes. The procedure is iterated until all the nodes are "pure". The terminal nodes are labeled by the class of their members. To classify a new example, one lets it go down the tree (going right or left according to the thresholds on the selected features) and assigns it the label of the terminal node in which it ends up.
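The following is a minimal sketch of this procedure, assuming numeric features and integer class labels (e.g., 0/1); the function names and the nested-dictionary tree representation are only illustrative, and the only stopping criterion is node purity.

```python
import numpy as np

def entropy(y):
    """Shannon entropy H(Y) of a vector of class labels, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Find the (feature, threshold) pair with the largest information gain H(Y) - H(Y|X)."""
    best_j, best_t, best_gain = None, None, 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:              # candidate thresholds for feature j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # H(Y|X): weighted average of the child-node entropies
            h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - h_cond
            if gain > best_gain:
                best_j, best_t, best_gain = j, t, gain
    return best_j, best_t

def grow(X, y):
    """Recursively split nodes until they are pure; the tree is a nested dict."""
    if len(np.unique(y)) == 1:                         # pure node: label it with its class
        return {"label": int(y[0])}
    j, t = best_split(X, y)
    if j is None:                                      # no informative split: majority label
        return {"label": int(np.bincount(y).argmax())}
    below, above = X[:, j] <= t, X[:, j] > t
    return {"feature": j, "threshold": t,
            "left": grow(X[below], y[below]), "right": grow(X[above], y[above])}

def classify(tree, x):
    """Send an example down the tree and return the label of its terminal node."""
    while "label" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["label"]
```

For example, grow(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float), np.array([0, 0, 0, 1])) learns the AND function, and classify on [1, 1] then returns 1.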
4.    What is the connection between MI, channel capacity, and rate distortion theory?
The channel capacity is defined as the maximum MI between the input and output of a noisy channel, taken over all input distributions. The rate-distortion function is the minimum MI between a signal and a representation of this signal that allows the signal to be reproduced with at most a given distortion.
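Written out (assuming standard notation, with d a distortion measure and D the allowed expected distortion):

```latex
C = \max_{p(x)} \; MI(X, Y)
\qquad\qquad
R(D) = \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} \; MI(X, \hat{X})
```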
5.    What is the information bottleneck method?
We consider a signal X, a new feature representation Phi(X), and an outcome Y. To maximize transmission (and thus minimize the error rate), we should maximize MI(Phi, Y), which amounts to minimizing the residual information MI(X, Y | Phi). Simultaneously, if we want a compact representation, we need to minimize the rate, measured by MI(X, Phi). This leads to the following optimization problem: min MI(X, Phi) - beta MI(Phi, Y).
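Written out, with the minimization taken over the stochastic mapping p(phi|x) and beta trading off compactness against prediction:

```latex
\min_{p(\phi \mid x)} \; MI(X, \Phi) \;-\; \beta \, MI(\Phi, Y)
```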

6.   How does MI relate to the Pearson correlation coefficient for Gaussian signals?
MI = -(1/2) log(1 - R^2), where R is the Pearson correlation coefficient (with the natural log, MI is in nats).
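Below is a small numerical sanity check of this formula, assuming a bivariate Gaussian with known correlation rho; it compares the closed form with a crude plug-in histogram estimate of MI (the histogram estimate is biased, so agreement is only approximate).

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)

# Sample a bivariate Gaussian with unit variances and correlation rho
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

# Closed form for Gaussian signals: MI = -(1/2) log(1 - R^2), in nats
mi_formula = -0.5 * np.log(1 - rho**2)

# Plug-in estimate from a 2-D histogram (biased, for illustration only)
p_xy, _, _ = np.histogram2d(x, y, bins=40)
p_xy = p_xy / p_xy.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi_hist = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

print(f"closed form: {mi_formula:.3f} nats, histogram estimate: {mi_hist:.3f} nats")
```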
7. Can you suggest methods of forward selection or backward elimination using MI?

- Forward selection 1: One method was described in class: we start with the feature having maximum MI with the target. Subsequent features are added by selecting the feature having maximal MI with the target, conditioned on each previously selected feature (not conditioned on all the previously selected features simultaneously). This works well for binary features; MI is hard to estimate otherwise. A sketch of this procedure is given after this list.
- Backward elimination: One could use the Torkkola method described in class, starting with all scaling factors equal to one. Instead of adjusting the scaling factors by gradient descent, one can eliminate the feature with smallest gradient. By iterating, features are progressively eliminated.
- Forward selection 2: Same idea as the backward elimination above, but starting with all the scaling factors at zero and adding the feature corresponding to the largest gradient. Iterate to progressively add features.
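Below is a minimal sketch of the forward-selection procedure of item 1, assuming X and y are NumPy arrays of small non-negative integers (e.g., binary 0/1 features and labels); the function names are only illustrative. Each candidate is scored by its MI with the target conditioned on each previously selected feature taken one at a time; taking the minimum of these scores (as in Fleuret's CMIM criterion) is one way to combine them.

```python
import numpy as np

def mi(a, b):
    """MI(A, B) in nats for two discrete vectors, from their joint frequency table."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return np.sum(joint[m] * np.log(joint[m] / (pa * pb)[m]))

def cond_mi(a, b, c):
    """MI(A, B | C) = sum_c p(C=c) MI(A, B | C=c) for discrete vectors."""
    return sum((c == v).mean() * mi(a[c == v], b[c == v]) for v in np.unique(c))

def forward_select(X, y, k):
    """Greedy forward selection: start with the feature of maximum MI with the target,
    then add, at each step, the candidate whose MI with the target (conditioned on each
    already-selected feature, one at a time) is largest in the worst case."""
    selected = [max(range(X.shape[1]), key=lambda j: mi(X[:, j], y))]
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: min(cond_mi(X[:, j], y, X[:, s]) for s in selected) for j in remaining}
        selected.append(max(scores, key=scores.get))
    return selected
```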

8.  Can you suggest new embedded methods not described in class combining loss functions and models?
The answer to this question is not provided. Students having difficulties answering this question should come to the office hour.