**Questions on lecture 11**

**Information Theoretic Methods**

**1. What is the mutual information?**

MI(X, Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ]

This is the KL divergence between p(x,y) and p(x)p(y). Independent random
variables have zero MI.
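As a quick illustration, here is a minimal sketch computing MI directly from this definition, assuming a small hypothetical discrete joint distribution stored as a NumPy array:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y) for two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# MI as the KL divergence between p(x,y) and p(x)p(y), in nats.
mask = p_xy > 0  # skip zero-probability cells (0 log 0 = 0)
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))
print(mi)  # ~0.193 nats; independent variables would give 0
```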

**2. What is the relation between mutual information and information gain?**

Two names for the same thing: MI(X, Y)=H(Y)-H(Y|X).
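One can check the identity numerically on the same hypothetical joint distribution as above:

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])      # same hypothetical joint as above
p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y = p_xy.sum(axis=0)             # marginal p(y)

H_y = -np.sum(p_y * np.log(p_y))   # entropy H(Y) before seeing X

# H(Y|X) = sum_x p(x) H(Y | X = x), computed from the rows p(y|x).
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))

print(H_y - H_y_given_x)           # ~0.193 nats, equal to MI(X, Y) above
```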

**3. What is the basic functioning of a binary tree classifier?**

At the root node, all the training examples are in the same bin. The
"disorder" before we see the training data is measured by H(Y). To split
a node into two children, one selects the feature providing the largest
information gain, that is, the largest reduction in entropy:
MI(X, Y) = H(Y) - H(Y|X). So we need to discretize a candidate feature
using a threshold and create two nodes containing the examples below and
above the threshold. We then compute the entropy of each candidate child
node. H(Y|X) is the weighted average of these child node entropies, the
weights being the proportions of examples that went into the child nodes.
The procedure is iterated until all the nodes are "pure". The terminal
nodes are labeled by the class of their members. To classify a new
example, one lets the example go down the tree (going right or left
according to the thresholds on the selected features) and classifies it
according to the label of the terminal node in which it ends up.
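A minimal sketch of one such split (a decision stump) on a single continuous feature, assuming binary labels and entropy measured in nats; candidate thresholds are taken midway between consecutive distinct feature values:

```python
import numpy as np

def entropy(y):
    """Empirical entropy H(Y) of a label vector, in nats."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_split(x, y):
    """Return the threshold on feature x with the largest information gain."""
    h_parent = entropy(y)              # "disorder" before the split: H(Y)
    xs = np.unique(x)                  # sorted distinct feature values
    best_gain, best_t = -1.0, None
    for t in (xs[:-1] + xs[1:]) / 2:   # candidate thresholds (midpoints)
        left, right = y[x <= t], y[x > t]
        # H(Y|X): child entropies weighted by the fraction of examples
        # falling into each child node.
        h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = h_parent - h_cond       # information gain MI(X, Y)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([0.1, 0.35, 0.4, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (0.55, ~0.693): a "pure" split with gain ln 2
```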

**4. What is the connection between MI, channel capacity, and rate distortion theory?**

The channel capacity is defined as the maximum MI between the input and output
of a noisy channel, over all input distributions. The rate distortion function
is the minimum MI between a signal and a representation of this signal, subject
to the representation allowing the signal to be reproduced within a prescribed
distortion.
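In symbols, writing d for the distortion measure and D for the maximum allowed expected distortion, the two definitions read:

```latex
C = \max_{p(x)} \mathrm{MI}(X, Y)
\qquad
R(D) = \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} \mathrm{MI}(X, \hat{X})
```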

**5. What is the information bottleneck method?**

We consider a signal X, a new feature representation Phi(X), and an outcome
Y. To maximize transmission (thus minimize error rate), we should maximize
MI(Phi, Y). Simultaneously, if we want a compact representation, we need to
minimize the rate measured by MI(X, Phi). This leads to the following
optimization problem: min MI(X, Phi) - beta MI(Phi, Y)
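Written out, the minimization is over the stochastic map p(phi|x) defining the representation, with beta a Lagrange multiplier trading compression against prediction (this is the standard formulation of Tishby, Pereira, and Bialek):

```latex
\min_{p(\phi \mid x)} \; \mathrm{MI}(X, \Phi) - \beta\, \mathrm{MI}(\Phi, Y)
```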

**6. How does MI relate to the Pearson correlation coefficient for Gaussian signals?**

MI = -(1/2) log (1 - R^2)
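For example, a hypothetical correlation of R = 0.9 gives MI of about 0.83 nats:

```python
import numpy as np

r = 0.9                       # hypothetical Pearson correlation coefficient
mi = -0.5 * np.log(1 - r**2)  # MI in nats for jointly Gaussian X, Y
print(mi)                     # ~0.831; r = 0 gives 0, |r| -> 1 diverges
```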

**7. Can you suggest methods of forward selection or backward elimination using MI?**

**- Forward selection 1:** One method was described in class: we start with
the feature having maximum MI with the target. Subsequent features are added
by selecting the feature having maximal MI with the target, conditioned on
each previously selected feature taken one at a time (not conditioned on all
the previously selected features simultaneously); see the sketch after this
list. This works well for binary features; MI is hard to estimate otherwise.

**- Backward elimination:** One could use the Torkkola method described
in class, starting with all scaling factors equal to one. Instead of adjusting
the scaling factors by gradient descent, one can eliminate the feature with
the smallest gradient. By iterating, features are progressively eliminated.

**- Forward selection 2:** Same idea as the Torkkola-based backward
elimination, but starting with all the scaling factors at zero and adding
the feature corresponding to the largest gradient. Iterate to progressively
add features.
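A minimal sketch of the first forward-selection method, assuming binary (0/1) features and labels stored as NumPy integer arrays; since the lecture does not specify how to aggregate the conditional-MI scores over the already-selected features, taking the minimum (worst case) is an assumption here:

```python
import numpy as np

def mi(a, b):
    """Empirical MI(A, B) in nats for discrete integer arrays a, b."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for ai, bi in zip(a, b):
        joint[ai, bi] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return np.sum(joint[m] * np.log(joint[m] / (pa * pb)[m]))

def cond_mi(a, b, c):
    """Empirical MI(A, B | C) = sum_v p(C=v) MI(A, B | C=v)."""
    return sum((c == v).mean() * mi(a[c == v], b[c == v]) for v in np.unique(c))

def forward_select(X, y, k):
    """Start with the feature of maximum MI with the target, then repeatedly
    add the feature whose worst-case MI with the target, conditioned on each
    previously selected feature (one at a time), is largest."""
    selected = [max(range(X.shape[1]), key=lambda j: mi(X[:, j], y))]
    while len(selected) < k:
        rest = [j for j in range(X.shape[1]) if j not in selected]
        selected.append(max(rest, key=lambda j: min(
            cond_mi(X[:, j], y, X[:, s]) for s in selected)))
    return selected

# Hypothetical data: the target is the AND of features 0 and 3.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] & X[:, 3]
print(forward_select(X, y, 2))  # expected: features 0 and 3
```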

**8. Can you suggest new embedded methods not described in class combining loss functions and models?**

The answer to this question is
not provided. Students having difficulties answering this question should
come to the office hour.