# Preliminary tutorial chapter outlines

## Chapter 1: Learning machines

Authors: Krzysztof Grabczewski & Norbert Jankowski
Contact: Krzysztof Grabczewski kgrabcze@phys.uni.torun.pl

### 1. Introduction

1.1 Learning – general info
– supervised and unsupervised

– bias–variance, generalization – simple models first, complex if necessary, MDL etc.

1.2 Classification

1.3 Regression

1.4 Clustering

1.5 Visualization

### 2. Learning algorithms

2.1 Introduction
– different groups of models/algorithms

* knowledge modelling with different structures

* different theoretical background

* different building strategies

– learning via statistics, search strategies, optimization

* optimization theory

· mathematical programming (linear and quadratic)

– simple (blind) and heuristic search methods

– evolutionary algorithms

– simulated annealing etc.

2.2 Optimal Bayes Classifier

2.3 Naive Bayes Classifier

2.4 Linear discriminant methods

2.5 Kernel methods, SVM

2.6 Neural networks, MLP, CC, RBF, LVQ

2.7 Instance based learning, kNN, prototype vectors

2.8 Decision trees C45, CART, SSV

### 3. Complex systems

3.1 Committees, boosting
3.2 Data preprocessing (e.g. selection) for particular learning algorithms

3.3 Meta-learning, weighting in kNN, simple search, . . .

### 4. Problems

4.1 Complexity of learning methods
4.2 Learning and validation – the context of conjunction of transformation and

classification (or regression)

4.3 Missing data

We plan to provide more details on the following subjects:

• bias–variance, generalization

• Bayes classifiers

• linear discriminant methods

• kernel methods, SVM

• neural networks, MLP, CC, RBF, LVQ

• instance based learning, kNN, prototype vectors

• decision trees C45, CART, SSV

• committees

## Chapter 2: Assessment methods

Author/contact: Gerard Dreyfus Gerard.Dreyfus@espci.fr

This chapter aims at providing the reader with the tools required for a statistically significant assessment of feature relevance. The methods presented in this chapter can be integrated in feature selection wrappers. They can also serve for hyper-parameter selection or model selection. Finally, they can be helpful for assessing the confidence on predictions made by learning machines on fresh data. The concept of model complexity is ubiquitous in this chapter; however, a detailed discussion of model complexity in relation to training data is deferred to Chapter 10.

### 1 Elementary statistics

1.1Point estimation
1.2Interval estimation

1.3Guaranteed estimators, statistical tests, p-values

### 2 Validation/cross-validation

2.1Methodology: principles and caveats
2.2Cross-validation and its variants: leave-one-out (with/without analytical estimates), k-fold cross-validation, bootstrap, …

2.3Current best practices in cross-validation for performance assessment

### 3 Random probes

3.1Derivation and use of p-values using “random probes” to select the number of features in ordered features or nested feature subsets,
3.2Type I vs. type II errors (i.e. falsely relevant features vs. falsely irrelevant features),

3.3Relation to statistical tests commonly used for feature selection.

4Feature selection and bagging:

4.1Construction of bagged feature sets,

4.2Cross-validation for assessing the performance of bagged feature sets.

5Experimental planning for nonlinear models

5.1Principles and issues,

5.2Relation to feature and model selection.

## Chapter 3: Filter methods for feature relevance ranking

Author/contact: Wlodzislaw Duch (email Google: Duch)

The goal of this chapter is to provide the reader with the easiest to implement methods, which are also important baseline methods.

### 1 Introduction

filters and feature selection: relevance and redundancy;
filters for ranking and selection;

types of filters.

### 2 General issues

how to measure relevance,
local and global relevance,

logical rules and their relevance,

binary features and Bayesian optimality.

### 3 Filters based on information theory

Similarity of probability distributions
KL, mutual information, normalized information gain, Mantaras distance,

GD-distance, J-measure, average weight of evidence, symetrical uncertainty,...

### 4 Correlation-based filters

correlation coefficients: consistency, relevance (RELFSS), Pearson's
correlation, $\chi^2$-statistics, G-statistics ...

Relief, MDL and its variants, ROC .

### 5 Decision trees for filtering

One and two-level trees
1R discretization, Gini, separability ...

### 6 Reliability of relevance calculations:

discretization vs kernel methods

### 7 Discussion

Biases of relevance indices for multivalued attributes;
Noisy XOR

## Chapter 4: Search strategies for feature selection

Author/contact: Juha Reunanen Juha.Reunanen@iki.fi

The intent in this chapter is to describe a few algorithms at such a

level that implementation should be rather straightforward, and to

mention a few others to widen the scope a little bit. In other words,

the purpose is not to give a comprehensive list of all the existing

search strategies.

### 1. Introduction

- Mention that the emphasis is on the wrapper approach, although all
the algorithms can be used with a filter-type evaluation function as

well

### 2. Optimal results

2.1. Exhaustive search
2.2. Branch and bound

- Not in much detail

### 3. Sequential selection

3.1. Sequential pruning (SBS)
3.2. Sequential growing (SFS)

- Detailed descriptions of these simple yet powerful approaches

### 4. Extensions to sequential selection

4.1. Generalized sequential selection (GSFS, GSBS)
4.2. Backtracking during search (PTA, GPTA)

4.3. Floating search (SFFS, SBFS, AFS)

4.4. Oscillating search

- PTA(l,r) and two different versions of SFFS in detail (the original

one proposed by Pudil et al in 1994, and the corrected one by Somol

et al, 1999)

### 5. Stochastic search

5.1. Simulated annealing
5.2. Genetic algorithms

- Not in a lot of detail I think

### 6. Discussion and conclusions

- Any ideas are welcome :)

The notation that will be introduced and used:

1) A placeholder for any feature set represented as an array of 0s and

1s that state whether a particular feature is included in the set;

S, S'

2) A binary value denoting whether the j:th feature is included in the

set S; S_j

3) A placeholder for the number of features; k

4) A placeholder for the index of a feature; j

5) The best feature subset of size k found; B(k)

6) A function used to evaluate the performance of feature set S -- the

return value is the risk, i.e., smaller is better; J(S)

## Chapter 5: Embedded methods of feature selection

Authors: Navin Lal, Olivier Chapelle, Jason Weston, Andre Elisseeff
Contact: Thomas Navin Lal navin.lal@tuebingen.mpg.de

### 1 Introduction

What makes a method embedded
Which criterion is being used

o same criterion as used for training

o bound on expected risk

o ...

How is it optimized

### 2 Forward-Backward-Nested

2.1 Explicit addition or removal of features
o Feature selection for SVMs using bounds [Rakotomamonjy, 2003]

o Polynomials and neural networks [Rivals and Personnaz, 2003]

o Greedy least squares [Couvreur and Bresler, 2000]

o Gram Schmidt orthogonalization [Stoppiglia et al., 2003],

o Briefly: decision trees (e.g. CART [Breiman et al., 1984], ID3 [Quinlan, 1986], C4.5

[Quinlan, 1993], [Adam et al., 2002])

2.2 Scaling factors

o Feature selection for SVMs using bounds [Rakotomamonjy, 2003]

2.3 Sensitivity of the output

o For neural networks [Leray and Gallinari, 1999]

2.4 Weight based analyzes

o Recursive Feature Elimination [Guyon et al., 2003] (Extensions: [Zhu and Hastie, 2003]

to penalized logisitic regression, [Lal et al., 2003] treat grouped features.)

o RFE-perceptron with stopping criteria [Gentile, 2004]

o Optimal Brain Damage [LeCun et al., 1990]

### 3 Optimization of scaling factors

o Gradient descent on the R2w2 bound [Weston et al., 2000]
o Automatic Relevance Determination [Li et al., 2002]

o Optimization of scaling factors for SVMs: [Grandvalet and Canu, 2003]

o Variable scaling: extension to the Maximum Entropy Discrimination [Jebara and

Jaakkola, 2000]

o Joint classifier and feature optimization (JCFO) [Krishnapuram et al., 2004]

### 4 Sparsity term

o One-norm SVM:
– [Fung and Mangasarian, 2003]

– [Zhu et al., 2004]

– [Bi et al., 2003]

o Least Squares with l1-norm LASSO and Sparse Logistic Regression [Tibshirani, 1996]

(extensions [Roth, 2003] and [Shevade and Keerthi, 2003])

o Sparse Kernel Fisher Discriminant [Mika et al., 2000]

o [Perkins et al., 2003]: Objective includes l2 l1 and l0 regularization.

o Zero norm optimization [Weston et al., 2003].

o Concave minimization [Bradley and Mangasarian, 1998]

o Potential Support Vector Machine [Hochreiter and Obermayer, 2003]

## Chapter 6: Information theoretic methods

Author/contact: Kari Torkkola Kari.Torkkola@motorola.com

### 2. Basic concepts of information theory

- entropy, conditional entropy, mutual information (information gain)
- KL divergence and other divergence measures

- Usefulness of measures other than Shannon's

### 3. MI to enhance information relevant to the task

3.1 MI for classifier feature selection
- Optimal criterion for classification is Bayes error rate

- MI as a proxy

- bounds relating MI to Bayes error rate

- evaluation of MI in practice

- evaluation of the joint MI

- Markov blankets

3.2 MI for distance metrics

- learning local metrics that enhance relevant information

3.3 MI for feature construction

- learning global transforms that enhance relevant information

3.4 Distributional clustering

- Sufficient statistics

- Information Bottleneck, its variations, and rate-distortion theory

### 4. Discussion

- IT methods in relation to methods using 2nd order statistics
- goodness of IT methods

## Chapter 7: Variable/FeatureSelection and Ensemble Learning (dual view)

Author/contact: Eugene Tuv eugene.tuv@intel.com

### 1 Overview and evolution of learning with ensembles

-- combining multiple learners improves performance
-- parallel/independent ensembles (each expert built independently from

the rest):Bagging, Bumping, RandomForest, Stochastic Discrimination

-- serial/sequential ensembles (each new base learner built using

information on existing members of the committee): Arching, Boosting,

and Stagewise modeling [features - base learners (often simple functions

of input variables)]:

---- simple example: forward /backward stepwise selection in linear

regression

---- more advanced: forward stagewise modeling with regularization,

penalized regression resulting in sparse ensemble (e.g. l1

regularization/lasso) -> feature selection

-- combining experts, stacking, postprocessing

### 2 Effective ensembles combine accurate and diverse experts - feature

selection to improve performance of ensembles
-- measures of diversity, fitness functions

-- search strategies: different flavors of random subsampling, feature

decompositions, GAs, hill-climbing, etc

-- dealing with high dimensional noisy data: tree-based ensembles with

dynamic soft feature selection [BorisovEruhimovTuv]

### 3 Ensemble based variable filtering - by building very fast exploratory

model with RF or hybrid tree based ensemble [BorisovEruhimovTuv] which
can handle huge (both in terms of cases and vars) and very noisy

datasets.

-- variable selection is done based on global (not univariate)

importance to the exploratory model (one such measure of importance is

based on total variance reduction, requires no extra computation;

another is based on sensitivity of the learned model to a change in the

var of interest)

### 4 Bayesian approach to V/FS

- intro to Bayesian methods, relationship to bootstrap
- Bayesian variable selection in multivariate regression and

classification using latent binary vectors to identify different

submodels.

- variable selection using Bayesian decision theory attaching costs to

the inclusion of variables

- Bayesian Variable Selection for Clustering

- Bayesian neural networks and Gaussian Process Models

- automatic relevance determination

- practical implementations - Monte Carlo Methods, Gibbs Sampler,

Metropolis-Hastings algorithm

## Chapter 8: Signal-Based Feature Construction

Author/contact: Masoud Nikravesh nikravesh@cs.berkeley.edu

Over the past several years we have successfully developed the main and core technology for intelligent data analysis and mining, pattern recognition and trend analysis, information retrieval, filtering and analysis for a variety of applications including sensory information processing using state-of-the-art computational intelligence.

This chapterfocuses on the next-generation of sensor information processing, analysis and interpretation, and its application in medicine, oil industry, industrial inspection and diagnosis systems, and speech recognition. The technologies can be grouped according to the functionality of different components of the system as follows:

### 1 Signal processing and feature extraction

1.1 Determine relevant acoustic wave attributes associated with specific properties of the products and integrate various attributes and features to expand the diagnostic and predictive capability;
1.2 Pseudo signal generation for characterization and calibration: As an example, pseudo signals can be generated with constraints to expand the size of the datasets to ensure statistical significance, addressing a crucial problem of data scarcity in the current project;

1.3 Signal attributes/characteristics analysis such as conventional Hilbert attributes, FFT-based frequency-domain attributes, modulation spectral features, and other techniques such as wavelet analysis and empirical correlation of attributes, as well as linear and non-linear aggregation operators for attribute fusion.

### 2  Feature pre-processing and dimension reduction

2.1 Various data compression techniques for data mining, to simplify similarity evaluation between signals and cases;
2.2 Evaluation of the relative importance, synergy and redundancy of various attributes for preliminary feature selection, and reduction of data dimension, using linear and non-linear techniques including conventional statistical methods such as PCA and SVD, as well as fuzzy measures and fuzzy integrals.

## Chapter 9: Learning theory of feature selection

The goal of this chapter is to provide the reader with theoretical guidance to evaluate the complexity of feature selection methods. Some methods are more prone to overfitting than others and thus require more training data. The chapter goes over the notion of performance bounds involving number of training examples and model complexity. Feature selection is presented as a particular case of the more general problem of model selection. The ambition of this chapter is to provide sizes of training sets required for obtaining guarantees of generalization with various methods, or conversely a ranking of methods in order of anticipated prediction performance given the training set size.

This chapter complements Chapter 3. In Chapter 3, empirical methods of performance evaluation are provided. Such methods, like k-fold cross validation, are widely used by practitioners because they have proven reliable in applications. Error bars for the performance predictions of such methods are however difficult or impossible to compute (see e.g. the work of Yoshua Bengio). Learning theoretical performance bounds involving the ratio of the complexity of the model and the number of training examples complement this empirical approach. It has been suggested that fitting the constants of the bound with empirical data might yield better estimates of performance error bars (see e.g. the work of Corina Cortes). Thus, in relation to Chapter 3, it may be possible to compare more models using cross-validation methods without running at risk of overfitting, if some notion of model complexity is involved. Second, having a notion of model complexity may allow practitioners to rank models according to anticipated predictive power and decide which model to try first when faced with a new problem that has a given set of descriptive statistics. Ranking models is a problem that is easier than predicting performance. If the order is good, fewer methods need to be tried and cross-validation becomes more reliable.

Since the existing body of theoretical results may be insufficient to answers all the questions raised, this chapter will take care of carefully stating mathematically the problems so that theoreticians might be challenged to solve them.

Examples and dreams:

Some authors (see e.g. the work of Andrew Ng) have proved that methods like “ordered-FS” have low complexity. The sample complexity of ordered-FS, a forward selection technique, is logarithmic in the number of irrelevant features, compared to a power law for wrapper subset selection methods with extensive search strategies. This means that such variable ranking can tolerate a number of irrelevant variables that is exponential in the number of training examples. Can such complexities be calculated for all canonical feature selection methods? Beyond purely theoretical bounds that assume we know some data statistics like the number of useless features, which may be unknown, can we rank feature selection methods in order of predicted relative predictive strength, using descriptive statistics of the data? Such statistics may include: number of training examples, fraction of positive and negative examples, number of features, an index of feature covariance, the fraction of useless features determined with statistical tests.

## Chapter 10: Neuro-Fuzzy and Evolutionary Computing methods

Masoud Nikravesh nikravesh@cs.berkeley.edu

This chapter covers the basic ideas underlying soft computing. Its current incarnations have links to many earlier influences, among them Prof. Zadeh’s 1965 paper on fuzzy sets; the 1973 paper on the analysis of complex systems and decision processes; and the 1979 report (1981 paper) on possibility theory and soft data analysis. The principal constituents of soft computing (SC) are fuzzy logic (FL), neural network theory (NN) and probabilistic reasoning (PR), with the latter subsuming belief networks, evolutionary computing including DNA computing, chaos theory and parts of learning theory. Some of the achievements of field of soft computing are: fuzzy reasoning (set and logic), new soft computing algorithms making intelligent, semi-unsupervised use of large quantities of complex data, uncertainty analysis, perception-based decision analysis and decision support systems for risk analysis and management, computing with words, computational theory of perception (CTP), and precisiated natural language (PNL).

### 1 Hybrid case-based and rule-based classification and decision making

1.1 Neuro-fuzzy systems that integrate rule-based and case-based reasoning and provide a smooth transition between the two depending on the availability of data and expert knowledge. This may be constructed in a hierarchical, decision-tree like structure and also include unsupervised clustering.
1.2 Related methodologies and techniques for data fusion and mining, machine learning, and prediction and interpretation will be developed based on adaptation and integration of existing tools.

### 2 Further recognition and decision making technologies, and feature selection methods

2.1 Fuzzy-integration-based aggregation techniques and hybrid fuzzy-logic/genetic-algorithm for pattern recognition, decision analysis, multi-criteria decision-making and multi-attribute optimization;
2.2 Genetic algorithms (GA) and reinforcement learning for feature set optimization and preference learning, within a filter or wrapper model of feature selection;

2.3 Fuzzy reinforcement learning techniques for online learning, maintenance, and adaptation of developed systems.

Development of new tools and methodologies based on computing with words and perceptions (CWP). This will include the use of perception-based reinforcement learning and BISC-DSS system.