Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Radford M. Neal and Jianguo Zhang
radford@cs.utoronto.ca
jianguo@utstat.utoronto.ca
University of Toronto

We describe the methods we used for the high-dimensional classification problems in the NIPS feature selection contest.  As a preliminary step, we reduced each problem either to a moderate number of principal components of the data, or to a moderate number of features selected using simple significance tests.  This reduces the computational requirements to a manageable level.  We then applied two very different approaches to classification.  Neural networks (multi-layer perceptrons) are supervised methods, which do not model the distribution of the training cases.  With Bayesian methods, these networks can be made quite complex without fear of overfitting the training data.  An Automatic Relevance Determination (ARD) prior can also be used, allowing the model to discover that some inputs (features or principal components) are much more relevant than others.  The results of runs using ARD were sometimes used to select a further reduced set of features.  Dirichlet diffusion trees are a Bayesian method for modeling a distribution, based on hierarchical clustering.  We used Dirichlet diffusion trees to model the input distribution for all cases (training, validation, and test), ignoring class labels.  Hyperparameters allowed the model to learn which inputs relate to others, and which are (possibly) irrelevant "noise".  The result is a sample of trees that cluster the cases.  We then used distances in these trees to classify cases by simple neighbor-based methods.  Based on results on the validation set, we chose the Dirichlet diffusion tree method for the Arcene data and the Bayesian neural network method for Gisette, Dexter, and Dorothea.  For Madelon, where the two methods did about equally well, we averaged the probabilities produced by the two methods.  The results were very good, but these methods do look at most or all of the features for most of the data sets, due to the use of principal components.  Results were not as good using Bayesian neural networks with a much smaller fraction of the features (4.74% on average), but this approach might be useful when reducing the number of features used is important.
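The following is a minimal sketch, in Python with NumPy, of the preliminary dimensionality-reduction step described above: projecting the data onto its leading principal components, or keeping the features that score best on a simple per-feature significance test (here a two-sample t statistic, one plausible choice).  The function names, the choice of 50 retained dimensions, and the Welch form of the t statistic are illustrative assumptions, not taken from the authors' software.

    import numpy as np

    def reduce_by_pca(X, n_components=50):
        """Project the rows of X onto the top principal components."""
        Xc = X - X.mean(axis=0)                 # center each feature
        # SVD of the centered data; rows of Vt are the principal axes
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T

    def select_by_ttest(X, y, n_features=50):
        """Keep the features with the largest two-sample t statistics."""
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
        # Welch t statistic per feature; epsilon guards zero variance
        t = (m1 - m0) / np.sqrt(v1 / len(X1) + v0 / len(X0) + 1e-12)
        keep = np.argsort(-np.abs(t))[:n_features]
        return X[:, keep], keep

Either reduction yields a moderate-dimensional input, e.g. Z = reduce_by_pca(X, 40), that a Bayesian neural network or Dirichlet diffusion tree model can then be fit to at manageable computational cost.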