Feature Selection 

Pascal bootcamp, Vilanova i la Geltrú, Spain

July 2-6, 2007


** THIS CLASS HAS BEEN VIDEOTAPED **

This course covers feature selection fundamentals and applications. Students will first be reminded of the basics of machine learning algorithms and of the problem of overfitting avoidance. In the wrapper setting, feature selection will be introduced as a special case of the model selection problem. Methods to derive principled feature selection algorithms will be reviewed, as well as heuristic methods that work well in practice. One class will be devoted to feature construction techniques, and a final lecture to the connections between feature selection and causal discovery. The classes will be accompanied by several lab sessions. The course will appeal to students who like playing with data and want to learn practical data analysis techniques. The instructor has ten years of experience consulting for US startup companies in pattern recognition and machine learning. Datasets from a variety of application domains will be made available: handwriting recognition, medical diagnosis, drug discovery, text classification, ecology, and marketing.

         The classes take place on the Vilanova campus of the Universitat Politècnica de Catalunya (UPC), July 2-6, 2007. The event is organized by Jose Luis Balcázar.

       Feature extraction book:
       The class is based on a book, which compiles the results of the NIPS 2003 feature selection challenge and includes tutorial chapters.
       Download the feature extraction book introduction. Copies of the full book may be purchased.

       Suggested readings:

·    Structural risk minimization for character recognition
·    Kernel Ridge Regression tutorial
·    Linear discriminant and support vector classifiers
·    Causal feature selection
·    Naive Bayes

            CLOP package Installation
        You may download the version of CLOP (version 1.2) designed for this class. Other versions of CLOP exist; please do not use them.

       ==> Windows users only need to run a script that sets the Matlab path properly; most functions will then work.
       ==> Unix users will have to compile the LibSVM package if they want to use support vector machines. Please use the latest Makefile: Makefile_amir. Some '\' characters used to cause problems, but we believe those have been fixed.
       ==> All users will have to install R to use random forests (RF and RFFS). Make sure you remove any file named Clop/challenge_objects/packages/Rlink/__Rpath. When you first start RF or RFFS, you will be prompted for the path of the R executable.

Schedule (all dates in July 2007; each entry gives the slide set number, title, date, times, lecture topic, and exercise class)
1. Introduction — Monday, July 2
   Lecture: 9:30-10:30; Lab group 1: 15:30-16:45; Lab group 2: 17:00-18:15
   Lecture: Introduction to Machine Learning. Basic learning machines; principles of learning.
   Exercise: Introduction to the Spider: loading data, training and testing a simple model (e.g. naïve Bayes) on a toy dataset. Description of the CLOP library.
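The lab itself uses CLOP's Matlab objects; as a language-neutral illustration of the train-then-test pattern practiced there, here is a minimal Gaussian naïve Bayes written from scratch in plain Python (a sketch of the idea, not the CLOP implementation — all function names here are our own):

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    model = {}
    n = len(y)
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # population variance, floored to avoid division by zero
        vars_ = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
                 for c, m in zip(cols, means)]
        model[label] = (len(rows) / n, means, vars_)
    return model

def predict_nb(model, x):
    """Pick the class maximizing log prior + sum of log Gaussian likelihoods."""
    def score(label):
        prior, means, vars_ = model[label]
        s = math.log(prior)
        for v, m, var in zip(x, means, vars_):
            s += -0.5 * (math.log(2 * math.pi * var) + (v - m) ** 2 / var)
        return s
    return max(model, key=score)

# Toy two-class dataset: class 0 clustered near (0,0), class 1 near (3,3).
X = [(0.1, 0.2), (0.3, -0.1), (-0.2, 0.0), (3.1, 2.9), (2.8, 3.2), (3.0, 3.1)]
y = [0, 0, 0, 1, 1, 1]
nb = fit_gaussian_nb(X, y)
print(predict_nb(nb, (0.0, 0.1)))  # falls in the class-0 region
print(predict_nb(nb, (3.0, 3.0)))  # falls in the class-1 region
```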
2. Overfitting — Wednesday, July 4
   Lecture 1: 9:00-9:50; Lecture 2: 10:00-10:50; Lecture 3: 11:30-12:20
   Lab group 1: 15:30-16:45; Lab group 2: 17:00-18:15
   Lecture: Learning without overlearning. Overfitting avoidance, performance prediction, cross-validation.
   Exercise: Play with the Dexter and Madelon datasets of the feature selection challenge. Apply naïve Bayes, ridge regression, and SVM; add filters.
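The performance-prediction idea behind the lecture — estimating held-out error by k-fold cross-validation — can be sketched in plain Python as follows (a simple nearest-centroid learner stands in for the CLOP models; all names are illustrative):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_error(X, y, k, fit, predict):
    """Average held-out error rate over k train/test splits."""
    errs = []
    for fold in kfold_indices(len(y), k):
        test = set(fold)
        Xtr = [x for i, x in enumerate(X) if i not in test]
        ytr = [v for i, v in enumerate(y) if i not in test]
        model = fit(Xtr, ytr)
        wrong = sum(predict(model, X[i]) != y[i] for i in fold)
        errs.append(wrong / len(fold))
    return sum(errs) / k

# Nearest-centroid classifier as a stand-in learner.
def fit_centroid(X, y):
    cents = {}
    for label in set(y):
        rows = [x for x, v in zip(X, y) if v == label]
        cents[label] = [sum(c) / len(c) for c in zip(*rows)]
    return cents

def predict_centroid(cents, x):
    return min(cents, key=lambda l: sum((a - b) ** 2
                                        for a, b in zip(x, cents[l])))

# Two well-separated clusters: cross-validated error should be zero.
X = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
     (3.0, 3.1), (2.9, 2.8), (3.2, 3.0), (3.1, 3.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(cross_val_error(X, y, 4, fit_centroid, predict_centroid))  # 0.0
```

Note that every fold is trained without its test points — reusing training data for evaluation is exactly the overfitting trap the lecture warns against.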
3. Feature selection 1 — Wednesday, July 4
   Lecture: Introduction to feature selection. Filters, wrappers, and embedded methods.

4. Feature selection 2 — Wednesday, July 4
   Lecture: Learning theory put to work to build feature selection algorithms.
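Of the three families in the lecture, filters are the simplest: each feature is scored independently of any learner. As one classic example, a signal-to-noise style two-class filter can be sketched in plain Python (illustrative code, not part of the class material):

```python
import math

def snr_filter(X, y):
    """Rank features by |mu+ - mu-| / (sd+ + sd-), a classic
    univariate filter criterion for two-class problems."""
    def stats(vals):
        m = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals)) or 1e-9
        return m, sd
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, label in zip(X, y) if label == 1]
        neg = [x[j] for x, label in zip(X, y) if label == -1]
        (mp, sp), (mn, sn) = stats(pos), stats(neg)
        scores.append(abs(mp - mn) / (sp + sn))
    # return feature indices, best first
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Feature 0 separates the classes; feature 1 is pure noise.
X = [(1.0, 0.3), (1.2, -0.2), (0.9, 0.1),
     (-1.1, 0.2), (-0.9, -0.3), (-1.0, 0.0)]
y = [1, 1, 1, -1, -1, -1]
print(snr_filter(X, y))  # informative feature 0 is ranked first: [0, 1]
```

A wrapper would instead search over feature subsets using the learner's cross-validated performance as the score, and an embedded method would let the learner's own training select features.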
5. Feature construction — Thursday, July 5
   Lecture 1: 9:00-9:50; Lecture 2: 10:00-10:50
   Lab group 1: 15:30-16:45; Lab group 2: 17:00-18:15
   Lecture: Feature construction. How to build better features with simple methods, convolutions, PCA, etc.
   Exercise: Play with the Gisette dataset of the feature selection challenge. See how simple feature extraction methods can improve performance over the purely “agnostic” approach.
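To make the PCA idea from the lecture concrete: a constructed feature can be the projection of each sample onto the leading principal component, computed here by power iteration in plain Python (an illustrative sketch; helper names are our own):

```python
import math

def first_pc(X, iters=200):
    """Leading principal component via power iteration on the covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # covariance matrix C = Xc^T Xc / n
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(means, v, row):
    """New 1-D constructed feature: centered projection onto the first PC."""
    return sum((x - m) * c for x, m, c in zip(row, means, v))

# Points lying along the diagonal y = x: variance concentrates on one direction.
X = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.1)]
means, v = first_pc(X)
print([round(abs(c), 2) for c in v])  # roughly [0.71, 0.71], i.e. the diagonal
```

Keeping only a few such projections both compresses the data and can discard noise directions — one way simple construction beats the raw "agnostic" representation.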

6. Causality — Thursday, July 5
   Lecture: Causality and feature selection. Limitations of feature selection methods that ignore the data selection process.

7. Panel and final lab — Friday, July 6
   Panel: 11:30-13:20
   Lab group 1: 15:30-16:45; Lab group 2: 17:00-18:15
   No lecture.
   Exercise: Explore the latest CLOP version from the last challenge, with R and Weka extensions. Choose any dataset from the feature selection challenge or the AL vs PK challenge, and try to match or outperform the best results (see the ETH student results and the NIPS model selection game).
