Feature Selection Tutorial

IJCNN 2007 2-hour tutorial:

Feature selection and causal discovery
fundamentals and applications

Isabelle Guyon

August 12, 2007
Renaissance Orlando Resort, Florida

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. This tutorial will cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

Most feature selection methods do not attempt to uncover causal relationships between the features and the target; they focus instead on making the best predictions. We will examine situations in which knowledge of causal relationships benefits feature selection. Such benefits may include: explaining relevance in terms of causal mechanisms, distinguishing between actual features and experimental artifacts, predicting the consequences of actions performed by external agents, and making predictions in non-stationary environments.

The objective of this tutorial is to provide a theoretical framework and practical tools to address the problem of feature selection in a variety of situations:
-   when the number of samples and/or features varies across several orders of magnitude,
-   when the features or the target are binary, categorical, or continuous,
-   when the features are sparse or dense,
-   when the classes are balanced or imbalanced,
-   when the features are independent or correlated,
-   when the data are clean or plagued with noise or experimental artifacts,
-   when the data are i.i.d. or when there are changes in distribution, possibly resulting from interventions on the system by external agents.

A variety of algorithms will be reviewed and contrasted. Particular attention will be given to feature validity assessment methods, via statistical testing and/or proper use of cross-validation.
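As a rough illustration of feature ranking followed by validity assessment via statistical testing, the sketch below ranks features by absolute Pearson correlation with the target and then estimates a permutation p-value for each feature. It is not taken from the tutorial material: the function names (`pearson`, `rank_features`, `permutation_pvalue`), the toy data, and the choice of correlation as the ranking criterion are all assumptions made for this example.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / (sxx * syy) ** 0.5


def rank_features(X, y):
    """Return (indices sorted by decreasing |r|, list of |r| scores)."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Estimate how often shuffled targets give |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real link between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: feature 0 tracks the binary target, features 1-2 are noise.
rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[label + rng.gauss(0, 0.3), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]

order, scores = rank_features(X, y)
p_values = [permutation_pvalue([row[j] for row in X], y) for j in range(len(X[0]))]

for j in order:
    print(f"feature {j}: |r| = {scores[j]:.3f}, permutation p-value = {p_values[j]:.4f}")
```

When many features are tested at once, the p-values would normally need a correction for multiple comparisons. If a performance estimate is obtained by cross-validation instead, the ranking step should be recomputed inside each training fold so that the held-out data never influences the selection.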

Audience

This tutorial will be accessible to a broad audience, including practitioners, researchers, and students, who want to catch up with the latest developments in feature selection. Some previous exposure to machine learning/data mining problems is preferable, but not necessary.

Links and resources

Book on feature selection: Part of the material covered will be taken from the book “Feature Extraction: Foundations and Applications” edited by I. Guyon et al., Springer, 2006. The book includes tutorial chapters and chapters reviewing the results of the NIPS 2003 feature selection challenge. Slides of a class taught using this material are available.
Tutorial on causality and feature selection: The book on feature selection is complemented by more recent developments described in the tutorial “Causal Feature Selection” by I. Guyon et al. (to appear in “Computational Methods of Feature Selection”, Liu-Motoda Eds., Chapman & Hall, 2007). Slides for this tutorial are also available.
Video conference:

This class was taught again at the MMDSS07 workshop, Villa Cagnola, Gazzada, Italy, Sept 10-21, 2007. The video of the lecture is available on-line.

A related class was taught at the Pascal 2007 bootcamp, Vilanova, Spain, July 2007. A video of the courses is available on-line as well as practical work based on the CLOP package.

Software package: The Challenge Learning Object Package (CLOP) was used to reproduce the best results of the NIPS 2003 feature selection challenge. See the results obtained by students using this package.
Slides of the presentation [ppt].

Contact information

Lecturer:
Isabelle Guyon
Clopinet Enterprises
955, Creston Road,
Berkeley, CA 94708, U.S.A.
Tel/Fax: (510) 524 6211

We are grateful to Health Discovery Corporation for sponsoring this event.