Questions on lecture 4

Feature construction
1.    What is a sigma-pi unit?
A sigma-pi unit is a special kind of perceptron in which the phi functions are products of subsets of the original features. The unit thus effectively computes a polynomial function of the inputs.
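A minimal NumPy sketch of such a unit (the feature values, product terms, and weights below are made up for illustration):

```python
import numpy as np

def sigma_pi(x, terms, weights):
    """Sigma-pi unit: a weighted sum (sigma) of products (pi) of input features.

    `terms` lists the feature-index tuples whose products play the role
    of the phi functions.
    """
    phis = np.array([np.prod(x[list(t)]) for t in terms])
    return weights @ phis

# The output is w0*x0 + w1*x0*x1 + w2*x1*x2: a polynomial in the inputs.
x = np.array([2.0, 3.0, 4.0])
terms = [(0,), (0, 1), (1, 2)]
weights = np.array([1.0, 0.5, 0.25])
print(sigma_pi(x, terms, weights))  # 1*2 + 0.5*6 + 0.25*12 = 8.0
```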
2.    What  is a bottleneck neural network? How does this relate to PCA?
A bottleneck neural network is a two-layer network in which the input layer and output layer have the same dimension n and the hidden layer has a smaller number of outputs n' < n. A bottleneck network can be trained with the same examples at the input and the output. If the units are linear and the square loss is used for training, a bottleneck network actually computes the first n' principal components, which are the weights of the neurons of the first layer. The second layer reconstructs the inputs; its weight matrix is the transpose of the weight matrix of the first layer.
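Rather than training such a network, a short NumPy sketch can illustrate the function it converges to under the PCA interpretation above (the data here is synthetic, and the principal components are obtained directly by SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 examples, n = 5 features
X -= X.mean(axis=0)                    # center the data

# Top n' principal components via SVD: these play the role of the
# first-layer weights of a trained linear bottleneck network.
n_prime = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:n_prime]                       # shape (n', n): encoder weights

H = X @ W.T                            # hidden layer: n' outputs per example
X_hat = H @ W                          # second layer: the transpose reconstructs

# The reconstruction error is the variance in the discarded components.
print(np.sum((X - X_hat) ** 2))
```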
3.    What becomes of the dot product between patterns when patterns are normalized with the L2 (Euclidean) norm?
The cosine between the two patterns.
4.    What becomes of the dot product between feature and target when the features (and the target) are standardized?
The Pearson correlation coefficient.
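A quick NumPy check of this fact (the feature and target below are synthetic; the dot product is divided by the number of examples m):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=50)                # a feature, m = 50 examples
y = 2 * f + rng.normal(size=50)        # a correlated target

def standardize(v):
    return (v - v.mean()) / v.std()    # zero mean, unit standard deviation

# The dot product of standardized vectors, divided by m, is the Pearson r.
r = standardize(f) @ standardize(y) / len(f)
print(np.isclose(r, np.corrcoef(f, y)[0, 1]))  # True
```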
5.    When does it make sense to take the Log of the data matrix?
When the variance of the data increases with the magnitude of the features.
6.    What is a systematic error? What is an intrinsic error?
A systematic error is an error that can be explained and reduced by calibration or normalization. An intrinsic error corresponds to the unexplained "random" noise.
7.    How can one get rid of systematic errors?
By modeling the noise and trying to reverse the noise generating process, by calibration or normalization.
8.    What is an ANOVA model?
ANOVA stands for Analysis of Variance. An ANOVA model is a model of the effect on observations x of a systematic (or "controlled") factor of variability v, taking a discrete number of values {v1, v2, …, vj, …}, plus intrinsic variability e (random error, Normally distributed):
            xij = m + vj + eij
(i indexes the observation, j indexes the "treatment" or "class")
The ANOVA model assumes additive noise and equal variance in the classes (so take the log if you see the variance increase with the variable magnitude). The ANOVA test compares the variance of the controlled factor v (the variance explained by the model) to the intrinsic variance of e (residual variance or "intra-class" variance). If the first one is statistically significantly larger than the second one, factor v is found to contribute significantly to the overall variance.
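The F statistic of this comparison can be computed by hand; a NumPy sketch with three made-up "treatment" groups:

```python
import numpy as np

# Observations x_ij = m + v_j + e_ij for three "treatments" (classes).
groups = [
    np.array([5.1, 4.9, 5.3, 5.0]),
    np.array([6.2, 6.0, 6.4, 6.1]),
    np.array([5.0, 5.2, 4.8, 5.1]),
]

n = sum(len(g) for g in groups)
k = len(groups)
grand = np.concatenate(groups).mean()

# Between-class variance (explained by factor v) vs. within-class
# variance (residual, intrinsic error e).
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (n - k))
print(F)  # a large F means factor v explains a significant share of variance
```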

9.    Build a taxonomy of factors of variability in terms of whether they are desired, known, controllable or observable. Explain the various cases and give examples. When building a new instrument, in which direction should one go?
    factor of variability
        desired
        undesired
            known
                controllable
                uncontrollable
                    observable
                    unobservable
            unknown
- The desired factor is our target (class labels) e.g. disease or normal
- The undesired factors are all the nuisance variables causing variance in the data that is not related to our target, e.g. differences in sample processing, temperature, patient gender, etc.
- The unknown factors are those which we have not considered yet (not recorded or controlled); the others are considered known.
- The uncontrollable factors are those on which we have no handle (e.g. the weather, something happening inside the instrument to which we do not have access, some patient behavior that we cannot change).
- Controllable factors, on the other hand, let us choose their values and lend themselves to experimental design.
- Unobservable factors are those uncontrolled factors that we cannot even record (something happening inside the instrument to which we do not have access, some patient behavior that we cannot monitor).
- Observable factors are all the remaining factors that we can record, even though we might not be able to control them (e.g. the weather).
When designing an instrument, we should try to go in the direction
    unobservable -> observable
    uncontrollable -> controllable
    unknown -> known
so that we can more effectively reduce the undesired variance.

10.  What is experimental design? What is a "confounding factor"? Give examples of experimental plans.
Experimental design is the science of planning experiments to most effectively study the effect of a set of given factors on a given outcome. A confounding factor is a factor (usually unknown) whose value co-varies with another known factor under study. For example, if we want to study the effect of age on weight but all our young subjects are male and all our old subjects are female, gender is a confounding factor; such a design is of course bogus. Well-planned experiments try to consider "all" possible combinations of assignments of variables to values. In a factorial plan, each variable is allowed to take only 2 values, so for k variables this leads to 2^k experiments. To avoid the effect of possible unknown factors correlated with time, the order of the experiments can be randomized (randomized plan). To be able to study the variance a given factor contributes to the outcome, the other factors can be kept constant in some experiment blocks (block design).
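A small Python sketch of a 2^k factorial plan with a randomized run order (the factor names and levels are invented for illustration):

```python
import itertools
import random

# Full 2^k factorial plan: every combination of two levels per factor.
factors = {"temperature": ["low", "high"],
           "solvent": ["A", "B"],
           "operator": ["day", "night"]}

plan = list(itertools.product(*factors.values()))   # 2**3 = 8 experiments
print(len(plan))  # 8

# Randomized plan: shuffle the run order to decorrelate unknown
# time-dependent factors from the controlled ones.
random.seed(0)
random.shuffle(plan)
for run, levels in enumerate(plan, 1):
    print(run, dict(zip(factors, levels)))
```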
11.  Why is it important to record a lot of "meta data"? Why is it difficult to plan experiments with a lot of factors of variability? How should one proceed?
It is important to record a lot of "meta" data to be able to eventually explain some of the unexplained variance. However, not all the recorded factors are controlled in the planned experiment, because of the combinatorial explosion of the number of experiments to be run when many factors are considered simultaneously. One should proceed iteratively, ruling out hypotheses progressively.
12.  What is a standard operating procedure (SOP)? What is calibration? What is this good for?
A standard operating procedure is a series of steps taken to generate the experimental data that is well documented and as reproducible as possible. SOPs are used to reduce the unexplained variance. Calibration is a measurement made in a standard way, which allows normalizing the data (e.g. shifting or scaling it). For example, a standard solution may periodically be run in place of the real solutions to be analyzed.
13.  Is calibration always desirable?
Not always. The calibration measurements themselves have variance, so calibration can in some cases increase the variance. One may prefer instead to normalize with a local average of the data itself, because the normalization factor is then computed from more data.
14.  What is a "match filter"? Give examples of learning algorithms using "match filters".
It is a vector of coefficients tk, or "template", used to compute a feature value fk as the dot product between tk and the input pattern x: fk = tk . x. Instead of a dot product, other similarity measures can be used. "Template matching" and "nearest neighbor" algorithms use match filters.
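A minimal NumPy sketch of template matching with dot products (the templates and test pattern are made up):

```python
import numpy as np

# Each class is represented by a "match filter" (template) t_k; the feature
# f_k = t_k . x is a dot product, and the best-matching class is predicted.
templates = np.array([[1.0, 0.0, 0.0, 1.0],    # class 0 template
                      [0.0, 1.0, 1.0, 0.0]])   # class 1 template

def template_match(x, templates):
    features = templates @ x            # one dot-product feature per template
    return int(np.argmax(features))     # nearest template wins

x = np.array([0.9, 0.1, 0.2, 0.8])      # resembles the class 0 template
print(template_match(x, templates))      # 0
```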
15.  What is a "filter bank"? Give examples of classical transforms based on filter banks.
An ensemble of match filters is called a filter bank. Often the elements of a filter bank are chosen to be orthogonal. The cosine transform and the Fourier transform use orthogonal filters, and so does PCA.
16.  What is a convolution? Give examples of convolutional kernels. What is their effect on the signal?
A convolution is also a dot-product operation aiming at producing new features. But this time, instead of using templates that are as different from one another as possible, we use a single template called a "kernel" that we translate in all possible ways. For each position, we compute the dot product to obtain one feature in the new representation. A Gaussian kernel performs a local average and therefore smooths the signal. A Mexican-hat kernel enhances edges. Other kernels can be designed to extract end-points or lines.
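A NumPy sketch of the Gaussian-smoothing case (the noisy signal is synthetic; smoothness is measured by the variance of successive differences):

```python
import numpy as np

# A Gaussian convolutional kernel: each output feature is the dot product
# between the translated kernel and the signal, i.e. a local average.
t = np.arange(-3, 4)
kernel = np.exp(-t**2 / 2.0)
kernel /= kernel.sum()                   # normalize to a local average

rng = np.random.default_rng(2)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.3 * rng.normal(size=200)
smoothed = np.convolve(signal, kernel, mode="same")

# Smoothing suppresses high-frequency noise: the variance of the
# successive differences drops.
print(np.diff(smoothed).var() < np.diff(signal).var())  # True
```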
17.  What are the similarities and differences between methods based on filter banks and convolutional methods?
Both methods are based on dot products. Filter bank methods use templates that are as different as possible from one another. Convolutional methods use a single template in all possible positions.
18.  If a convolution is performed in input space, to what transform does this correspond in Fourier space and vice versa?
A convolution in input space corresponds to match filtering (pointwise multiplication) in Fourier space, the match filter being the Fourier transform of the convolutional kernel, and vice versa.
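This convolution theorem is easy to verify numerically; a NumPy sketch with a random signal and kernel:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=64)                  # signal
h = rng.normal(size=9)                   # convolutional kernel

# Convolve in input space, then compare spectra: the Fourier transform of
# the convolution equals the pointwise product of the Fourier transforms
# (i.e. multiplication by the kernel's spectrum in Fourier space).
conv = np.convolve(x, h)                 # linear convolution, length 64 + 9 - 1
L = len(conv)
lhs = np.fft.fft(conv)
rhs = np.fft.fft(x, L) * np.fft.fft(h, L)   # zero-padded transforms
print(np.allclose(lhs, rhs))  # True
```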
19.  What are low/high/band pass filters? Give examples of convolutional kernels and match filters in Fourier space implementing such filters.
A low-pass filter removes high-frequency components, i.e. it smooths the signal. Example: convolution with a Gaussian kernel. A high-pass filter, on the contrary, removes low-frequency components (e.g. the baseline); to achieve that effect, one can convolve with a wide Gaussian kernel and subtract the result from the original. A band-pass filter lets components in a given frequency band go through; one can convolve with the difference of two Gaussian kernels of different widths to achieve that effect. In Fourier space, the Fourier transform of a Gaussian being a Gaussian, one can simply multiply with a Gaussian match filter.
20.   What is the Fourier transform of: a rectangle, a triangle, a Gaussian, a sinc?
rectangle -> sinc
triangle -> sinc²
Gaussian -> Gaussian
sinc -> rectangle
21.   Give examples of feature construction methods that are not simple normalizations and cannot be implemented by either match filters or convolutions.
- contour following algorithms
- connected component algorithms
- deskewing
- histograms
22.   What is a convolutional neural network?
A multi-layer neural network implementing several successive convolutions. Each convolution is followed by a subsampling step to progressively reduce the resolution of the input and extract higher and higher level features. The weights of the network are the coefficients of the convolutional kernels, and they are obtained by training.
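A minimal NumPy sketch of the forward pass of such a network, with two convolution + subsampling stages (the kernels here are random, whereas in a real network they would be learned by training):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """One convolutional layer: the kernel's dot product at every position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def subsample(x, factor=2):
    """Reduce resolution by averaging non-overlapping windows."""
    n = len(x) - len(x) % factor
    return x[:n].reshape(-1, factor).mean(axis=1)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(4)
x = rng.normal(size=32)                      # input signal
k1, k2 = rng.normal(size=5), rng.normal(size=3)   # stand-ins for learned kernels

h1 = subsample(relu(conv1d_valid(x, k1)))    # lower resolution, level-1 features
h2 = subsample(relu(conv1d_valid(h1, k2)))   # lower still, level-2 features
print(len(x), len(h1), len(h2))  # 32 14 6
```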