Feature Extraction, Foundations and Applications

Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi Zadeh, Editors. 

Series Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.

Order from Amazon or from Springer.

This book is organized around the results of a benchmark that took place in 2003 (the website of the challenge is still active) and whose results were discussed at the NIPS 2003 workshop on feature extraction. Dozens of research groups competed on five large feature selection problems from various application domains: medical diagnosis, text processing, drug discovery, and handwriting recognition. The results of this effort pave the way for a new generation of methods capable of analyzing data tables with millions of lines and/or columns. This book is a step towards validating, unifying, and formalizing these approaches. The data and sample code, with useful baseline methods implemented in Matlab, are available as a single large archive (93 MB). Course material is also available.

FREE BOOK INTRODUCTION (courtesy of the publisher).

"This book compiles some very promising techniques in feature selection and supervised learning, coming from an extremely smart collection of researchers, delivering their best ideas in a competitive environment."
-- Trevor Hastie, Stanford University

"Feature Selection is a key technology for making sense of the high dimensional data which surrounds us. Isabelle Guyon et al. have done a splendid job in designing a challenging competition, and collecting the lessons learned."
-- Bernhard Schoelkopf, Max Planck Institute for Biological Cybernetics

"The major steps in building a pattern classifier are 1) choosing the learning algorithm, 2) choosing the features, 3) collecting training data, 4) training and 5) testing the classifier; designers typically jump between steps until the classifier attains a desired accuracy. While most research has addressed steps 1, 4 and 5, there has been insufficient consideration of step 2, feature selection, and no unified presentation of leading methods and systematic comparisons of their strengths and weaknesses. Until now. This large volume, a product of the 2003 Neural Information Processing Systems Feature Selection Challenge, is noteworthy for the breadth of methods covered, the clarity of presentations, the unity in notation and the helpful statistical appendices. And while it would be a mistake to crown a single algorithm as "the best" for all applications, the diligent reader will surely find methods here appropriate for any in a wide range of real-world problems. This well-organized volume belongs on the shelf of every researcher developing pattern classifiers."
-- David G. Stork, Ricoh Innovations

Feature Extraction, one of the pillars of machine learning, finds application in biotechnology, industrial inspection, the Internet,
radar, sonar, and speech recognition, just to name a few. This new book, edited by Isabelle Guyon, Steve Gunn, Masoud Nikravesh and Lotfi Zadeh, is unique:
- The book owes its origin to a competition, followed by a Neural Information Processing Systems (NIPS) Workshop that was held in December 2003.
- Most important, the book embodies many of the state-of-the-art methods in feature extraction.
Simply put, the book will make a difference to the literature on machine learning.
-- Simon Haykin, McMaster University

Foreword by Peter Norvig

Everyone loves a good competition. As I write this, two billion fans are eagerly anticipating the 2006 World Cup. Meanwhile, a fan base that is somewhat smaller (but presumably includes you, dear reader) is equally eager to read all about the results of the NIPS 2003 Feature Selection Challenge, contained herein. Fans of Radford Neal and Jianguo Zhang (or of Bayesian neural networks and Dirichlet diffusion trees) are gloating "I told you so" and looking for proof that their win was not a fluke. But the matter is by no means settled, and fans of SVMs are shouting "wait 'til next year!" You know this book is a bit more edgy than your standard academic treatise as soon as you see the dedication: "To our friends and foes."
Competition breeds improvement. Fifty years ago, the champion in the 100m butterfly was 22% slower than today's champion; the women's marathon champion from just 30 years ago was 26% slower. Who knows how much better our machine learning algorithms would be today if Turing in 1950 had proposed an effective competition rather than his elusive Test?
But what makes an effective competition? The field of Speech Recognition has had NIST-run competitions since 1988; error rates have been reduced by a factor of three or more, but the field has not yet had the impact expected of it. Information Retrieval has had its TREC competition since 1992; progress has been steady and refugees from the competition have played important roles in the hundred-billion-dollar search industry. Robotics has had the DARPA Grand Challenge for only two years, but in that time we have seen the results go from complete failure to resounding success (although it may have helped that the
second year's course was somewhat easier than the first's).
I think there are four criteria that define effective technical competitions:

  1. The task must be approachable.  Non-experts should be able to enter, to see some results, and learn from their better-performing peers.
  2. The scoring must be incremental. A pass-fail competition where everyone always fails (such as the Turing Test) makes for a boring game and discourages further competition.  On this score the Loebner Prize, despite its faults, is a better competition than the original Turing Test.  In one sense, everyone failed the DARPA Grand Challenge in the first year (because no entrant finished the race), but in another sense there were incremental scores: the distance each robot travelled, and the average speed achieved.
  3. The results should be open. Participants and spectators alike should be able to learn the best practices of all participants.  This means that each participant should describe their approaches in a written document, and that the data, auxiliary programs, and results should be publicly available.
  4. The task should be relevant to real-world tasks.  One of the problems with early competitions in speech recognition was that the emphasis on reducing word error rates did not necessarily lead to a strong speech dialog system---you could get almost all the words right and still have a bad dialog, and conversely you could miss many of the words and still recover. More recent competitions have done a better job of concentrating on tasks that really matter.

The Feature Selection Challenge meets the first three criteria easily. 75 teams entered, so they must have found it approachable. The scoring did a good job of separating the top performers while keeping everyone on the scale. And the results are all available online, in this book, and on the accompanying CD. All the data and Matlab code are provided, so the Challenge is easily reproducible. The level of explication provided by the entrants in the chapters of this book is higher than in other similar competitions. The fourth criterion, real-world relevance, is perhaps the hardest to achieve. Only time will tell whether the Feature Selection Challenge meets this one. In the meantime, this book sets a high standard as the public record of an interesting and effective competition.

-- Peter Norvig
Palo Alto, CA

Table of Contents

An Introduction to Feature Extraction
Isabelle Guyon, André Elisseeff
Feature extraction includes feature construction, space dimensionality reduction, sparse representations, and feature selection. All these techniques are commonly used as preprocessing for machine learning and statistics prediction tasks, including pattern recognition and regression. Although such problems have been tackled by researchers for many years, there has recently been renewed interest in feature extraction. A number of new applications with very large input spaces critically need space dimensionality reduction for the efficiency and efficacy of the predictors. These applications include bioinformatics (DNA microarrays, mass-spectrometric data, etc.), combinatorial chemistry (e.g. high-throughput screening of drug candidates), text processing (e.g. spam filtering), decision making (e.g. oil drilling), pattern recognition (e.g. handwriting recognition), speech processing, and vision.
An earlier tutorial by the same authors published in JMLR.

Part I: Feature Extraction Fundamentals
The first part of the book consists of tutorial chapters that synthesize recent approaches and new theoretical advances.
1) Learning machines

Krzysztof Grabczewski & Norbert Jankowski

Brief tutorial on some of the most widely used learning machines (linear discriminants, naïve Bayes, neural networks, fuzzy logic, kernel methods including SVMs, decision trees, clustering). The goal of this chapter is to allow the reader to read the other chapters without referring to other textbooks.
2) Assessment methods
Gerard Dreyfus
Fundamentals of statistical testing and cross-validation. The goal of this chapter is to provide the reader with the tools necessary to assess the goodness of selected features in a statistically significant way.
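The book's own sample code is in Matlab; purely as an illustration of the cross-validation idea this chapter covers, here is a minimal k-fold sketch in Python (the function names are hypothetical, not the book's API):

```python
import random

def k_fold_cross_validation(X, y, train_and_score, k=5, seed=0):
    """Estimate generalization performance by k-fold cross-validation.

    train_and_score(X_train, y_train, X_test, y_test) must train a model on
    the training split and return its score on the test split.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        scores.append(train_and_score(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / k                    # mean score across folds
```

Each example serves as test data exactly once, so the averaged score is a less optimistic estimate of performance on unseen data than the training score.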
3) Filter methods
Wlodzislaw Duch
Correlation methods and other methods for ranking features in order of relevance. The goal of this chapter is to provide the reader with easy-to-implement methods, which are also important baseline methods.
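The simplest such filter can be sketched in a few lines. The Python snippet below (an illustration only, not the book's Matlab baseline code) ranks features by the absolute value of their Pearson correlation with the target:

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def rank_features(X, y):
    """Return feature indices sorted by |correlation| with y, best first."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])
```

Keeping only the top-ranked features gives a cheap baseline; its main known weakness is that each feature is scored in isolation, so redundant or jointly informative features are not handled.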
4) Search strategies
Juha Reunanen
This chapter covers techniques that can be combined with any off-the-shelf learning machine package in a "wrapper" or "filter" feature selection setting, including exhaustive search (and its computational and statistical limitations), classical greedy search (forward, backward), best-first search, branch and bound, beam search, genetic algorithms, and simulated annealing.
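As a flavor of the wrapper setting, here is a minimal greedy forward-selection sketch in Python (illustrative only; `evaluate` stands in for any scoring routine, e.g. the cross-validated accuracy of an off-the-shelf learner on the candidate subset):

```python
def forward_selection(n_features, evaluate, max_features=None):
    """Greedy forward selection: repeatedly add the single feature whose
    addition most improves evaluate(subset); stop when nothing improves."""
    selected, best_score = [], float("-inf")
    limit = max_features or n_features
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        # score every one-feature extension of the current subset
        score, j = max((evaluate(selected + [j]), j) for j in candidates)
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(j)
        best_score = score
    return selected
```

Forward selection evaluates only O(n^2) subsets instead of the 2^n required by exhaustive search, at the cost of possibly missing feature combinations that are only useful jointly.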
5) Embedded methods
Navin Lal, Olivier Chapelle, Jason Weston, Andre Elisseeff
This chapter covers algorithms that select features in the process of learning, including gradient based feature scaling; nested subset methods; use of performance bounds; direct objective optimization; decision trees.
6) Information theoretic methods
Kari Torkkola
Methods derived from information theoretic principles both for feature selection and construction, including information bottleneck, and Markov blankets.
7) Ensemble learning
Eugene Tuv
Methods involving voting over several feature subsets. Bayesian approaches to feature selection, bagging/stability methods.
8) Fuzzy Neural Networks
Madan M. Gupta, Noriyasu Homma, Zeng-Guang Hou
A new generation of learning machines combining Fuzzy Logic and neural networks. 

Part II:  Feature Selection Challenge
The second part of the book includes the results of the NIPS 2003 feature selection challenge. See the slides of the NIPS 2003 workshop.
9) Design and Analysis of the NIPS2003 Challenge
Isabelle Guyon, Steve Gunn, Asa Ben Hur, Gideon Dror
The challenge attracted 75 competitors. Researchers all over the world competed for 12 weeks on 5 sizeable datasets from a variety of applications, with the number of input features ranging from 500 to 100,000. The goal was to use as few input features as possible without sacrificing prediction performance. The competitors succeeded in reducing the input space dimensionality by orders of magnitude while retaining or improving performance. Some of the most efficient methods are among the simplest and fastest. The benefits of using such techniques include performance improvements in terms of prediction speed and accuracy, as well as better data understanding gained by identifying the inputs that are most predictive of a given outcome. A short version was published in the NIPS proceedings.
10) High Dimensional Classification with Bayesian Neural
Networks and Dirichlet Diffusion Trees
Radford M. Neal, Jianguo Zhang
11) Ensembles of Regularized Least Squares Classifiers for High-Dimensional Problems
Kari Torkkola and Eugene Tuv
12) Combining SVMs with Various Feature Selection Strategies
Yi-Wei Chen, Chih-Jen Lin
13) Feature Selection with Transductive Support Vector Machines
Zhili Wu, Chunhung Li
14) Variable Selection using Correlation and Single Variable Classifier Methods: Applications
Amir Reza Saffari Azar Alamdari
15) Tree-Based Ensembles with Dynamic Soft Feature Selection
Alexander Borisov, Victor Eruhimov, Eugene Tuv
16) Sparse, Flexible and Efficient Modeling using L1 Regularization
Saharon Rosset, Ji Zhu
17) Margin Based Feature Selection and Infogain with Standard Classifiers
Ran Gilad-Bachrach, Amir Navot
18) Bayesian Support Vector Machines for Feature Ranking and Selection
Wei Chu, S. Sathiya Keerthi, Chong Jin Ong, Zoubin Ghahramani
19) Nonlinear Feature Selection with the Potential Support Vector Machine
Sepp Hochreiter, Klaus Obermayer
20) Combining a Filter Method with SVMs
Thomas Navin Lal, Olivier Chapelle, Bernhard Schölkopf
21) Feature Selection via Sensitivity Analysis with Direct Kernel PLS
Mark J. Embrechts, Robert A. Bress, Robert H. Kewley
22) Information Gain, Correlation and Support Vector Machines
Danny Roobaert, Grigoris Karakoulas, Nitesh V. Chawla
23) Mining for Complex Models Comprising Feature Selection and Classification
Krzysztof Grabczewski, Norbert Jankowski
24) Combining Information-Based Supervised and Unsupervised Feature Selection
Sang-Kyun Lee, Seung-Joon Yi, Byoung-Tak Zhang
25) An Enhanced Selective Naïve Bayes Method with Optimal Discretization
Marc Boullé
26) An Input Variable Importance Definition based on Empirical Data Probability Distribution
V. Lemaire, F. Clérot 

Part III New Perspectives in Feature Extraction
The last part of the book is devoted to recent advances in feature extraction.
27) Spectral Dimensionality Reduction

Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-Francois Paiement, Pascal Vincent, Marie Ouimet
28) Constructing Orthogonal Latent Features for Arbitrary Loss
Michinari Momma, Kristin P. Bennett
29) Large Margin Principles for Feature Selection
Ran Gilad-Bachrach, Amir Navot, Naftali Tishby
30) Feature Extraction for Classification of Proteomic Mass Spectra: A Comparative Study
Ilya Levner, Vadim Bulitko, Guohui Lin
31) Sequence motifs: highly predictive features of protein function
Asa Ben-Hur, Douglas Brutlag

A. Elementary Statistics
B. Feature Selection Challenge Datasets
C. Feature Selection Challenge Fact Sheets
D. Feature Selection Challenge Results Tables

Further information

Please contact Isabelle Guyon with any queries.