– bias–variance, generalization – simple models first, complex if necessary, MDL etc.
* knowledge modelling with different structures
* different theoretical background
* different building strategies
– learning via statistics, search strategies, optimization
* optimization theory
· mathematical programming (linear and quadratic)
· miscellaneous gradient methods
– simple (blind) and heuristic search methods
– evolutionary algorithms
– simulated annealing etc.
2.2 Optimal Bayes Classifier
2.3 Naive Bayes Classifier
2.4 Linear discriminant methods
2.5 Kernel methods, SVM
2.6 Neural networks, MLP, CC, RBF, LVQ
2.7 Instance-based learning, kNN, prototype vectors
2.8 Decision trees: C4.5, CART, SSV
3.3 Meta-learning, weighting in kNN, simple search, ...
classification (or regression)
4.3 Missing data
We plan to provide more details on the following subjects:
• bias–variance, generalization
• Bayes classifiers
• linear discriminant methods
• kernel methods, SVM
• neural networks, MLP, CC, RBF, LVQ
• instance-based learning, kNN, prototype vectors
• decision trees: C4.5, CART, SSV
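As a taste of the instance-based methods listed above, here is a minimal k-nearest-neighbour classification sketch. The function name, the toy 2-D data, and the choice k=3 are illustrative assumptions, not material from the chapter:

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D data: two well-separated clusters (illustrative only)
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = ["a", "a", "b", "b"]
```

A query near the first cluster, e.g. `knn_predict(X, y, (0.05, 0.1))`, is assigned to class "a" by majority vote.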
This chapter aims at providing the reader with the tools required for a statistically significant assessment of feature relevance. The methods presented in this chapter can be integrated in feature selection wrappers. They can also serve for hyper-parameter selection or model selection. Finally, they can be helpful for assessing the confidence of predictions made by learning machines on fresh data. The concept of model complexity is ubiquitous in this chapter; however, a detailed discussion of model complexity in relation to training data is deferred to Chapter 10.
1.3 Guaranteed estimators, statistical tests, p-values
2.3 Current best practices in cross-validation for performance assessment
3.3 Relation to statistical tests commonly used for feature selection
4 Feature selection and bagging:
4.1 Construction of bagged feature sets
4.2 Cross-validation for assessing the performance of bagged feature sets
5 Experimental planning for nonlinear models
5.1 Principles and issues
5.2 Relation to feature and model selection
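The two bagging steps above (construction of bagged feature sets, and cross-validation to assess them) could be sketched as follows. The function names, the sampling-with-replacement scheme, and the toy sizes are illustrative assumptions:

```python
import random

def bagged_feature_sets(n_features, n_sets, set_size, seed=0):
    """Construct bagged feature sets: sample feature indices with replacement."""
    rng = random.Random(seed)
    return [sorted(set(rng.choices(range(n_features), k=set_size)))
            for _ in range(n_sets)]

def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

feature_sets = bagged_feature_sets(n_features=10, n_sets=3, set_size=5)
splits = list(kfold_indices(12, 3))
```

Each bagged feature set would then be scored by training on every train split and averaging the error over the corresponding test splits.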
The goal of this chapter is to provide the reader with the easiest-to-implement methods, which are also important baseline methods.
types of filters.
logical rules and their relevance,
binary features and Bayesian optimality.
GD-distance, J-measure, average weight of evidence, symmetrical uncertainty, ...
Relief, MDL and its variants, ROC.
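One of the filter measures listed, symmetrical uncertainty SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)), is easy to implement from empirical entropies. A minimal sketch; the toy feature/class vectors are assumptions for illustration:

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy (bits) of a discrete sample."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)); 0 = independent, 1 = deterministic."""
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))   # I(X;Y) = H(X)+H(Y)-H(X,Y)
    return 2 * mi / (hx + hy) if hx + hy else 0.0

feat = [0, 0, 1, 1]
cls  = [0, 0, 1, 1]    # perfectly informative feature: SU = 1.0
```

An irrelevant feature such as `[0, 1, 0, 1]` against the same class vector scores 0.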
The intent in this chapter is to describe a few algorithms at such a
level that implementation should be rather straightforward, and to
mention a few others to widen the scope a little bit. In other words,
the purpose is not to give a comprehensive list of all the existing methods.
- Not in much detail
- Detailed descriptions of these simple yet powerful approaches
4.3. Floating search (SFFS, SBFS, AFS)
4.4. Oscillating search
- PTA(l,r) and two different versions of SFFS in detail (the original
one proposed by Pudil et al. in 1994, and the corrected one by Somol
et al., 1999)
- Not in a lot of detail I think
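To make the floating-search idea concrete, here is a deliberately simplified SFFS sketch. It omits the bookkeeping subtleties that distinguish the original Pudil et al. version from the Somol et al. correction; the function signature, the risk convention (smaller is better), and the toy risk are illustrative assumptions:

```python
def sffs(J, n_features, target_size):
    """Sequential Floating Forward Selection (simplified sketch).

    J(S) returns the risk of feature subset S (a frozenset); smaller is
    better. Returns the best subset of size target_size found."""
    S, best = frozenset(), {}            # best: size -> best subset seen
    while len(S) < target_size:
        # forward step: add the feature whose inclusion gives the lowest risk
        f = min(set(range(n_features)) - S, key=lambda f: J(S | {f}))
        S = S | {f}
        if len(S) not in best or J(S) < J(best[len(S)]):
            best[len(S)] = S
        # floating backward steps: drop a feature only while that beats the
        # best subset previously recorded at the smaller size
        while len(S) > 2:
            r = min(S, key=lambda r: J(S - {r}))
            if len(S) - 1 in best and J(S - {r}) >= J(best[len(S) - 1]):
                break
            S = S - {r}
            best[len(S)] = S
    return best[target_size]

# hypothetical risk: 3 relevant features, mild penalty for irrelevant ones
relevant = {0, 1, 2}
def toy_risk(S):
    return len(relevant - S) + 0.1 * len(S - relevant)
```

On this toy risk, `sffs(toy_risk, 6, 3)` recovers the relevant subset {0, 1, 2}.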
The notation that will be introduced and used:
1) A placeholder for any feature set, represented as an array of 0s and
1s that state whether a particular feature is included in the set; S
2) A binary value denoting whether the j-th feature is included in the
set S; S_j
3) A placeholder for the number of features; k
4) A placeholder for the index of a feature; j
5) The best feature subset of size k found; B(k)
6) A function used to evaluate the performance of feature set S -- the
return value is the risk, i.e., smaller is better; J(S)
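One hypothetical Python rendering of this notation (the concrete values and the trivial J are assumptions for illustration only):

```python
# S:    a feature set as an array of 0s and 1s (S[j] == 1 iff feature j is in S)
# S_j:  whether the j-th feature is included
# k:    number of features;  j: index of a feature
# B(k): best subset of size k found;  J(S): risk of subset S, smaller is better
k = 5
S = [1, 0, 1, 1, 0]          # features 0, 2 and 3 are included
j = 2
S_j = S[j]                   # 1: feature 2 is in the set

def J(S):
    """Toy risk: penalize every included feature (a real J would be,
    e.g., a cross-validated error estimate)."""
    return sum(S)

B = {3: S}                   # B(3): best subset of size 3 found so far
```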
o same criterion as used for training
o bound on expected risk
How is it optimized?
o Polynomials and neural networks [Rivals and Personnaz, 2003]
o Greedy least squares [Couvreur and Bresler, 2000]
o Gram-Schmidt orthogonalization [Stoppiglia et al., 2003]
o Briefly: decision trees (e.g. CART [Breiman et al., 1984], ID3 [Quinlan, 1986], C4.5
[Quinlan, 1993], [Adam et al., 2002])
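The Gram-Schmidt orthogonalization approach of [Stoppiglia et al., 2003] can be sketched as an orthogonal forward ranking: repeatedly pick the column most correlated with the residual of the target, then project it out. This is a simplified pure-Python illustration of the idea, not the authors' implementation; the toy data are assumptions:

```python
def rank_features_gs(X, y):
    """Rank features by Gram-Schmidt orthogonalization. X is a list of
    feature columns; returns feature indices, most relevant first."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    def proj_out(v, u):              # remove from v its component along u
        uu = dot(u, u)
        if uu == 0:
            return list(v)
        c = dot(v, u) / uu
        return [p - c * q for p, q in zip(v, u)]
    cols, resid = [list(c) for c in X], list(y)
    remaining, order = list(range(len(X))), []
    while remaining:
        # squared cosine between each candidate column and the residual
        def score(j):
            den = dot(cols[j], cols[j]) * dot(resid, resid)
            return dot(cols[j], resid) ** 2 / den if den else 0.0
        best = max(remaining, key=score)
        order.append(best)
        remaining.remove(best)
        u = cols[best]
        resid = proj_out(resid, u)
        for j in remaining:
            cols[j] = proj_out(cols[j], u)
    return order

# toy data: y is exactly twice feature 1
X = [[1, 0, 0, 1], [1, 2, 3, 4], [0, 1, 0, 1]]
y = [2, 4, 6, 8]
order = rank_features_gs(X, y)
```

Feature 1, which determines y exactly, is ranked first.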
2.2 Scaling factors
o Feature selection for SVMs using bounds [Rakotomamonjy, 2003]
2.3 Sensitivity of the output
o For neural networks [Leray and Gallinari, 1999]
2.4 Weight-based analyses
o Recursive Feature Elimination [Guyon et al., 2003] (Extensions: [Zhu and Hastie, 2003]
to penalized logistic regression, [Lal et al., 2003] treat grouped features.)
o RFE-perceptron with stopping criteria [Gentile, 2004]
o Optimal Brain Damage [LeCun et al., 1990]
o Optimization of scaling factors for SVMs: [Grandvalet and Canu, 2003]
o Variable scaling: extension to the Maximum Entropy Discrimination [Jebara and
o Joint classifier and feature optimization (JCFO) [Krishnapuram et al., 2004]
– [Zhu et al., 2004]
– [Bi et al., 2003]
o Least Squares with l1-norm LASSO and Sparse Logistic Regression [Tibshirani, 1996]
(extensions [Roth, 2003] and [Shevade and Keerthi, 2003])
o Sparse Kernel Fisher Discriminant [Mika et al., 2000]
o [Perkins et al., 2003]: Objective includes l2, l1, and l0 regularization.
o Zero norm optimization [Weston et al., 2003].
o Concave minimization [Bradley and Mangasarian, 1998]
o Potential Support Vector Machine [Hochreiter and Obermayer, 2003]
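As an illustration of the weight-based family, here is a minimal RFE sketch built on a perceptron, in the spirit of (but not identical to) the RFE and RFE-perceptron entries above; the training loop, names, and toy data are illustrative assumptions:

```python
def train_perceptron(X, y, feats, epochs=50):
    """Fit a perceptron on the selected feature columns; labels in {-1, +1}."""
    w = [0.0] * len(feats)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = sum(w[k] * xi[j] for k, j in enumerate(feats))
            if yi * s <= 0:                      # misclassified: update
                for k, j in enumerate(feats):
                    w[k] += yi * xi[j]
    return w

def rfe(X, y, n_keep):
    """Recursive Feature Elimination: retrain, drop the feature with the
    smallest squared weight, repeat until n_keep features remain."""
    feats = list(range(len(X[0])))
    while len(feats) > n_keep:
        w = train_perceptron(X, y, feats)
        worst = min(range(len(feats)), key=lambda k: w[k] ** 2)
        del feats[worst]
    return feats

# toy data: variable 0 carries the signal, variable 1 is small noise
X = [[1, 0.1], [2, -0.2], [-1, 0.15], [-2, -0.1]]
y = [1, 1, -1, -1]
```

On this toy problem, `rfe(X, y, 1)` eliminates the noise variable and keeps variable 0.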
- Usefulness of measures other than Shannon's
- MI as a proxy
- bounds relating MI to Bayes error rate
- evaluation of MI in practice
- evaluation of the joint MI
- Markov blankets
3.2 MI for distance metrics
- learning local metrics that enhance relevant information
3.3 MI for feature construction
- learning global transforms that enhance relevant information
3.4 Distributional clustering
- Sufficient statistics
- Information Bottleneck, its variations, and rate-distortion theory
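The practical evaluation of MI mentioned above, including the joint-MI case, can be sketched with a plug-in (empirical) estimator; the XOR toy data are illustrative assumptions:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in (empirical) estimate of I(X;Y) in bits for discrete samples."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# XOR: each feature alone is useless, the pair is fully informative
x1, x2 = [0, 0, 1, 1], [0, 1, 0, 1]
cls = [0, 1, 1, 0]
mi_single = mutual_information(x1, cls)                  # 0 bits
mi_joint = mutual_information(list(zip(x1, x2)), cls)    # joint MI via tuples
```

The XOR example shows why joint MI matters: a univariate MI filter would discard both features, yet jointly they carry one full bit about the class.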
-- parallel ensembles (each base learner built independently of
the rest): Bagging, Bumping, Random Forest, Stochastic Discrimination
-- serial/sequential ensembles (each new base learner built using
information on existing members of the committee): Arching, Boosting,
and Stagewise modeling [features - base learners (often simple functions
of input variables)]:
---- simple example: forward/backward stepwise selection in linear regression
---- more advanced: forward stagewise modeling with regularization,
penalized regression resulting in sparse ensemble (e.g. l1
regularization/lasso) -> feature selection
-- combining experts, stacking, postprocessing
-- search strategies: different flavors of random subsampling, feature
decompositions, GAs, hill-climbing, etc.
-- dealing with high dimensional noisy data: tree-based ensembles with
dynamic soft feature selection [BorisovEruhimovTuv]
-- variable selection is done based on global (not univariate)
importance to the exploratory model (one such measure of importance is
based on total variance reduction and requires no extra computation;
another is based on the sensitivity of the learned model to a change in
the variable of interest)
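The sensitivity-based importance just mentioned can be sketched as a permutation test: shuffle one variable's column and record the increase in the model's error. The model, data, and names below are illustrative assumptions:

```python
import random

def permutation_importance(model, X, y, j, n_rounds=10, seed=0):
    """Importance of variable j = average increase in error when its
    column is randomly permuted (a simple global sensitivity measure)."""
    def error(Xe):
        return sum(model(x) != t for x, t in zip(Xe, y)) / len(y)
    base = error(X)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        col = [x[j] for x in X]
        rng.shuffle(col)
        Xp = [list(x) for x in X]
        for i, v in enumerate(col):
            Xp[i][j] = v
        total += error(Xp) - base
    return total / n_rounds

model = lambda x: 1 if x[0] > 0 else 0   # toy model: looks only at variable 0
X = [[1, 5], [2, -3], [-1, 7], [-2, 0]]
y = [1, 1, 0, 0]
imp0 = permutation_importance(model, X, y, 0)
imp1 = permutation_importance(model, X, y, 1)
```

Shuffling the ignored variable leaves the error unchanged (importance 0), while shuffling the signal variable degrades it.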
classification using latent binary vectors to identify different
- variable selection using Bayesian decision theory attaching costs to
the inclusion of variables
- Bayesian Variable Selection for Clustering
- Bayesian neural networks and Gaussian Process Models
- automatic relevance determination
- practical implementations - Monte Carlo Methods, Gibbs Sampler,
Over the past several years we have successfully developed the core technology for intelligent data analysis and mining, pattern recognition and trend analysis, and information retrieval, filtering, and analysis for a variety of applications, including sensory information processing using state-of-the-art computational intelligence.
This chapter focuses on the next generation of sensor information processing, analysis and interpretation, and its application in medicine, the oil industry, industrial inspection and diagnosis systems, and speech recognition. The technologies can be grouped according to the functionality of different components of the system as follows:
1.3 Signal attributes/characteristics analysis such as conventional Hilbert attributes, FFT-based frequency-domain attributes, modulation spectral features, and other techniques such as wavelet analysis and empirical correlation of attributes, as well as linear and non-linear aggregation operators for attribute fusion.
The goal of this chapter is to provide the reader with theoretical guidance to evaluate the complexity of feature selection methods. Some methods are more prone to overfitting than others and thus require more training data. The chapter goes over the notion of performance bounds involving number of training examples and model complexity. Feature selection is presented as a particular case of the more general problem of model selection. The ambition of this chapter is to provide sizes of training sets required for obtaining guarantees of generalization with various methods, or conversely a ranking of methods in order of anticipated prediction performance given the training set size.
This chapter complements Chapter 3. In Chapter 3, empirical methods of performance evaluation are provided. Such methods, like k-fold cross-validation, are widely used by practitioners because they have proven reliable in applications. Error bars for the performance predictions of such methods are however difficult or impossible to compute (see e.g. the work of Yoshua Bengio). Learning-theoretic performance bounds involving the ratio of the complexity of the model to the number of training examples complement this empirical approach. It has been suggested that fitting the constants of the bound with empirical data might yield better estimates of performance error bars (see e.g. the work of Corinna Cortes). Thus, in relation to Chapter 3, it may first be possible to compare more models using cross-validation methods without running the risk of overfitting, if some notion of model complexity is involved. Second, having a notion of model complexity may allow practitioners to rank models according to anticipated predictive power and decide which model to try first when faced with a new problem that has a given set of descriptive statistics. Ranking models is an easier problem than predicting performance. If the order is good, fewer methods need to be tried and cross-validation becomes more reliable.
Since the existing body of theoretical results may be insufficient to answer all the questions raised, this chapter will also take care to state the problems mathematically, with enough precision that theoreticians may be challenged to solve them.
Examples and dreams:
Some authors (see e.g. the work of Andrew Ng) have proved that methods like “ordered-FS” have low complexity. The sample complexity of ordered-FS, a forward selection technique, is logarithmic in the number of irrelevant features, compared to a power law for wrapper subset selection methods with extensive search strategies. This means that such variable ranking can tolerate a number of irrelevant variables that is exponential in the number of training examples. Can such complexities be calculated for all canonical feature selection methods? Beyond purely theoretical bounds that assume we know some data statistics like the number of useless features, which may be unknown, can we rank feature selection methods in order of predicted relative predictive strength, using descriptive statistics of the data? Such statistics may include: number of training examples, fraction of positive and negative examples, number of features, an index of feature covariance, the fraction of useless features determined with statistical tests.
This chapter covers the basic ideas underlying soft computing. Its current incarnations have links to many earlier influences, among them Prof. Zadeh’s 1965 paper on fuzzy sets; the 1973 paper on the analysis of complex systems and decision processes; and the 1979 report (1981 paper) on possibility theory and soft data analysis. The principal constituents of soft computing (SC) are fuzzy logic (FL), neural network theory (NN) and probabilistic reasoning (PR), with the latter subsuming belief networks, evolutionary computing including DNA computing, chaos theory and parts of learning theory. Some of the achievements of the field of soft computing are: fuzzy reasoning (set and logic), new soft computing algorithms making intelligent, semi-unsupervised use of large quantities of complex data, uncertainty analysis, perception-based decision analysis and decision support systems for risk analysis and management, computing with words, the computational theory of perception (CTP), and precisiated natural language (PNL).
2.3 Fuzzy reinforcement learning techniques for online learning, maintenance, and adaptation of developed systems.
Development of new tools and methodologies based on computing with words and perceptions (CWP). This will include the use of perception-based reinforcement learning and the BISC-DSS system.