This course covers feature selection fundamentals and applications. Students will first be reminded of the basics of machine learning algorithms and the problem of overfitting avoidance. In the wrapper setting, feature selection will be introduced as a special case of the model selection problem. Methods to derive principled feature selection algorithms will be reviewed, as well as heuristic methods that work well in practice. One class will be devoted to feature construction techniques, and a final lecture to the connections between feature selection and causal discovery. The classes will be accompanied by several lab sessions. The course will appeal to students who like playing with data and want to learn practical data analysis techniques. The instructor has ten years of experience consulting for startup companies in the US on pattern recognition and machine learning. Datasets from a variety of application domains will be made available: handwriting recognition, medical diagnosis, drug discovery, text classification, ecology, and marketing.
The classes take place on the Vilanova campus of Universitat Politècnica de Catalunya (UPC), July 2-6, 2007. The event is organized by Jose Luis Balcázar.
Instructor: Isabelle Guyon, ClopiNet consulting, Berkeley, California
Email: isabelle@clopinet.com
Feature extraction book:
The class is based on a book that compiles the results of the NIPS 2003 feature selection challenge and includes tutorial chapters. Download the feature extraction book introduction; copies of the full book may be purchased.
Suggested readings:
· Structural risk minimization for character recognition
· Kernel Ridge Regression tutorial
· Linear discriminant and support vector classifiers
· Causal feature selection
· Naive Bayes
CLOP package installation
You may download a version of CLOP (version 1.2) designed for this class. Other versions of CLOP exist; please do not use them for this class.
==> Windows users just have to run a script that sets the Matlab path properly to use most functions.
==> Unix users will have to compile the LibSVM package if they want to use support vector machines. Please use the latest Makefile: Makefile_amir. Also, some '\' characters created problems, but we think we have gotten rid of those.
==> All users will have to install R to use random forests (RF and RFFS). Make sure you remove any file named Clop/challenge_objects/packages/Rlink/__Rpath. When you first start RF or RFFS, you will be prompted for the path of the R executable.
Schedule (all dates in July 2007; slides accompany each lecture)
Lecture 1: Introduction
Date: Monday, July 2
Times: Lecture 9:30-10:30; Lab group 1 15:30-16:45; Lab group 2 17:00-18:15
Lecture: Introduction to machine learning. Basic learning machines; principles of learning.
Exercise class: Introduction to the Spider: loading data, training and testing a simple model (e.g. naïve Bayes) on a toy dataset. Description of the CLOP library.
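The first lab trains a simple model such as naïve Bayes through the Spider/CLOP interface in Matlab. For readers without Matlab, here is a minimal, self-contained sketch of the same idea (a Gaussian naïve Bayes classifier) in plain Python; the toy data and function names are illustrative and not part of CLOP:

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature Gaussian parameters."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # floor the variance to avoid division by zero on constant features
        var = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
               for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(y), means, var)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior."""
    def log_post(c):
        prior, means, var = model[c]
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, var):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max(model, key=log_post)

# Toy dataset: two well-separated classes in two dimensions.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
```

In the actual lab the same experiment is run with CLOP's naïve Bayes object on the provided toy dataset.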
Lecture 2: Overfitting
Date: Wednesday, July 4
Times: Lecture 1 9:00-9:50; Lecture 2 10:00-10:50; Lecture 3 11:30-12:20; Lab group 1 15:30-16:45; Lab group 2 17:00-18:15
Lecture: Learning without overlearning. Overfitting avoidance, performance prediction, cross-validation.
Exercise class: Play with the Dexter and Madelon datasets of the feature selection challenge. Apply naïve Bayes, ridge regression, and SVM; add filters.
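The overfitting lecture relies on cross-validation to predict performance and select models. As a language-neutral illustration (not CLOP code), the sketch below uses k-fold cross-validation to compare ridge parameters for a one-dimensional ridge regressor; the data are synthetic and all names are made up for this example:

```python
import random

def ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression (no intercept): w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=5, seed=0):
    """k-fold cross-validated mean squared error for a given ridge parameter."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total, count = 0.0, 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        w = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        for i in fold:  # evaluate on the held-out fold only
            total += (ys[i] - w * xs[i]) ** 2
            count += 1
    return total / count

# Synthetic data: y is roughly 2*x plus a little noise.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
best_lam = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(xs, ys, lam))
```

The same logic underlies the lab: the cross-validated error, not the training error, guides the choice of model and regularization strength.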
Lecture 3: Feature selection 1
Date: Wednesday, July 4
Lecture: Filters, wrappers, and embedded methods.

Lecture 4: Feature selection 2
Date: Wednesday, July 4
Lecture: Learning theory put to work to build feature selection algorithms.
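Filters, covered in lecture 3, rank each feature independently by a relevance score. One classic choice is the Golub-style signal-to-noise criterion |μ₊ − μ₋| / (σ₊ + σ₋). The sketch below (plain Python with illustrative names, not the CLOP implementation) scores and ranks features for a binary problem:

```python
def snr_scores(X, y):
    """Signal-to-noise score per feature for binary labels:
    |mu_pos - mu_neg| / (sigma_pos + sigma_neg)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        mp, mn = sum(p) / len(p), sum(n) / len(n)
        sp = (sum((v - mp) ** 2 for v in p) / len(p)) ** 0.5
        sn = (sum((v - mn) ** 2 for v in n) / len(n)) ** 0.5
        # small constant guards against zero spread
        scores.append(abs(mp - mn) / (sp + sn + 1e-12))
    return scores

# Feature 0 separates the classes; feature 1 is mostly noise.
X = [[0.0, 0.5], [0.2, 0.4], [5.0, 0.5], [5.2, 0.6]]
y = [0, 0, 1, 1]
scores = snr_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])  # best features first
```

A filter keeps the top-ranked features before any classifier is trained; wrappers and embedded methods, by contrast, involve the learning machine itself in the selection.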
Lecture 5: Feature construction
Date: Thursday, July 5
Times: Lecture 1 9:00-9:50; Lecture 2 10:00-10:50; Lab group 1 15:30-16:45; Lab group 2 17:00-18:15
Lecture: Feature construction: how to build better features with simple methods, convolutions, PCA, etc.
Exercise class: Play with the Gisette dataset of the feature selection challenge. See how simple feature extraction methods can improve performance over the pure "agnostic" approach.
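One of the simple feature construction methods named in the lecture is PCA. As an illustration only (not the CLOP code path), the sketch below computes the leading principal component by power iteration on the covariance matrix and projects samples onto it, yielding one constructed feature:

```python
def first_pc(X, iters=200):
    """Leading principal component via power iteration on the covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]  # center the data
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # repeated multiplication converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(x, means, v):
    """One constructed feature: the sample's coordinate along the leading component."""
    return sum((xi - m) * vi for xi, m, vi in zip(x, means, v))

# Points lying near the diagonal direction (1, 1).
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.1]]
means, v = first_pc(X)
```

For data concentrated along one direction, as here, the single projected coordinate retains most of the variance of the two raw features.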
Lecture 6: Causality
Date: Thursday, July 5
Lecture: Causality and feature selection. Limitations of feature selection methods that ignore the data selection process.
Session 7
Date: Friday, July 6
Times: Panel 11:30-13:20; Lab group 1 15:30-16:45; Lab group 2 17:00-18:15
Lecture: No lecture.
Exercise class: Explore the latest CLOP version (from the last challenge) with R and Weka extensions. Choose any dataset from the feature selection challenge or the AL vs. PK challenge, and try to match or outperform the best results (see the ETH student results and the NIPS model selection game).