Feature selection in the setting of many irrelevant features

Andrew Y. Ng and Michael I. Jordan
University of California, Berkeley
<ang@cs.berkeley.edu>
<jordan@cs.berkeley.edu>

We consider feature selection in the setting of many irrelevant features. We show that, using a certain prior, the sample
complexity of Bayesian feature selection---that is, the number of training examples needed to learn ``well''---grows only
\emph{logarithmically} in the number of irrelevant features. This means that, from an information-theoretic point of
view, the algorithm can learn even when the number of irrelevant features is \emph{exponential} in the number of
training examples. This result recovers the best known such rates, previously proved in the frequentist setting
(Littlestone, 1988; Kivinen and Warmuth, 1994; Ng, 1998), and beats the sample complexity of the standard wrapper
approach to feature selection (John et al., 1994). Experimental results also show that the method exhibits very high
tolerance to the presence of many irrelevant features.
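
As a rough illustration of the claimed scaling (the notation below is ours, not necessarily the paper's): with $r$ relevant features among $N$ total features, a logarithmic dependence on the number of irrelevant features corresponds to a sample complexity of roughly
\[
m \;=\; O\!\left(\mathrm{poly}\!\left(r, \tfrac{1}{\epsilon}, \tfrac{1}{\delta}\right) \cdot \log N\right),
\]
so that $N$ can grow exponentially in the number of training examples $m$ while learning remains information-theoretically possible. The polynomial dependence on $r$, $\epsilon$, and $\delta$ is assumed here only for concreteness of the sketch.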