– bias–variance, generalization – simple models first, complex if necessary, MDL etc.
1.2 Classification
1.3 Regression
1.4 Clustering
1.5 Visualization
* knowledge modelling with different structures
* different theoretical background
* different building strategies
– learning via statistics, search strategies, optimization
* optimization theory
· mathematical programming (linear and quadratic)
· miscellaneous gradient methods
– simple (blind) and heuristic search methods
– evolutionary algorithms
– simulated annealing etc.
2.2 Optimal Bayes Classifier
2.3 Naive Bayes Classifier
2.4 Linear discriminant methods
2.5 Kernel methods, SVM
2.6 Neural networks, MLP, CC, RBF, LVQ
2.7 Instance-based learning, kNN, prototype vectors
2.8 Decision trees: C4.5, CART, SSV
3.3 Meta-learning, weighting in kNN, simple search, ... for classification (or regression)
4.3 Missing data
We plan to provide more details on the following subjects:
• bias–variance, generalization
• Bayes classifiers
• linear discriminant methods
• kernel methods, SVM
• neural networks, MLP, CC, RBF, LVQ
• instance-based learning, kNN, prototype vectors
• decision trees: C4.5, CART, SSV
• committees
This chapter aims at providing the reader with the tools required for a statistically significant assessment of feature relevance. The methods presented in this chapter can be integrated in feature selection wrappers. They can also serve for hyper-parameter selection or model selection. Finally, they can be helpful for assessing the confidence in predictions made by learning machines on fresh data. The concept of model complexity is ubiquitous in this chapter; however, a detailed discussion of model complexity in relation to training data is deferred to Chapter 10.
1.3 Guaranteed estimators, statistical tests, p-values
2.3 Current best practices in cross-validation for performance assessment
3.3 Relation to statistical tests commonly used for feature selection.
4 Feature selection and bagging:
4.1 Construction of bagged feature sets,
4.2 Cross-validation for assessing the performance of bagged feature sets.
5 Experimental planning for nonlinear models
5.1 Principles and issues,
5.2 Relation to feature and model selection.
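The cross-validation items above can be made concrete with a minimal fold-splitting sketch. This is our own illustrative helper (the name `kfold_indices` is not from the chapter), showing only the index bookkeeping that underlies k-fold performance assessment:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous test folds for cross-validation.
    Fold sizes differ by at most one; returns (train, test) index pairs."""
    base, extra = divmod(n, k)
    splits, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # first `extra` folds get one more
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        splits.append((train, test))
        start += size
    return splits

# 5-fold split of 10 examples: every example lands in exactly one test fold.
splits = kfold_indices(10, 5)
```

In practice one would shuffle (or stratify) the indices before splitting; the contiguous split above is the simplest deterministic variant.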
The goal of this chapter is to provide the reader with the easiest-to-implement methods, which are also important baseline methods.
types of filters.
logical rules and their relevance,
binary features and Bayesian optimality.
GD-distance, J-measure, average weight of evidence, symmetrical uncertainty, ...
Relief, MDL and its variants, ROC.
The intent in this chapter is to describe a few algorithms at such a level that implementation should be rather straightforward, and to mention a few others to widen the scope a little bit. In other words, the purpose is not to give a comprehensive list of all the existing search strategies.
- Not in much detail
- Detailed descriptions of these simple yet powerful approaches
4.3 Floating search (SFFS, SBFS, AFS)
4.4 Oscillating search
- PTA(l,r) and two different versions of SFFS in detail (the original one proposed by Pudil et al. in 1994, and the corrected one by Somol et al., 1999)
- Not in a lot of detail, I think
The notation that will be introduced and used:
1) A placeholder for any feature set, represented as an array of 0s and 1s that state whether a particular feature is included in the set; S, S'
2) A binary value denoting whether the j-th feature is included in the set S; S_j
3) A placeholder for the number of features; k
4) A placeholder for the index of a feature; j
5) The best feature subset of size k found; B(k)
6) A function used to evaluate the performance of feature set S -- the return value is the risk, i.e., smaller is better; J(S)
o same criterion as used for training
o bound on expected risk
o ...
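The notation above lends itself to a compact sketch of a greedy forward-selection loop. This is our own illustrative code (a toy risk function stands in for a real evaluation criterion J): S is a 0/1 array of length k, J(S) returns a risk, and B(m) collects the best subset of each size found:

```python
def forward_select(k, J):
    """Greedy forward selection: grow the feature set one feature at a time,
    always adding the feature whose inclusion minimizes the risk J(S).
    Returns B, where B[m] is the best subset of size m found (as a 0/1 tuple)."""
    S = [0] * k
    B = {0: tuple(S)}
    for size in range(1, k + 1):
        best_j, best_risk = None, float("inf")
        for j in range(k):            # tentatively add each excluded feature j
            if S[j] == 0:
                S[j] = 1
                risk = J(S)
                if risk < best_risk:
                    best_j, best_risk = j, risk
                S[j] = 0
        S[best_j] = 1                 # commit the single best addition
        B[size] = tuple(S)
    return B

# Toy risk: features 0 and 2 are useful, features 1 and 3 slightly hurt.
J = lambda S: 2.0 - S[0] - S[2] + 0.1 * (S[1] + S[3])
B = forward_select(4, J)              # B[2] is (1, 0, 1, 0)
```

Note that the greedy loop evaluates J only O(k^2) times, versus 2^k for exhaustive search; this is the trade-off the search-strategy methods in this chapter navigate.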
How is it optimized?
o Polynomials and neural networks [Rivals and Personnaz, 2003]
o Greedy least squares [Couvreur and Bresler, 2000]
o Gram-Schmidt orthogonalization [Stoppiglia et al., 2003]
o Briefly: decision trees (e.g. CART [Breiman et al., 1984], ID3 [Quinlan, 1986], C4.5 [Quinlan, 1993], [Adam et al., 2002])
2.2 Scaling factors
o Feature selection for SVMs using bounds [Rakotomamonjy, 2003]
2.3 Sensitivity of the output
o For neural networks [Leray and Gallinari, 1999]
2.4 Weight-based analyses
o Recursive Feature Elimination [Guyon et al., 2003] (extensions: [Zhu and Hastie, 2003] to penalized logistic regression; [Lal et al., 2003] treat grouped features)
o RFE-perceptron with stopping criteria [Gentile, 2004]
o Optimal Brain Damage [LeCun et al., 1990]
o Optimization of scaling factors for SVMs [Grandvalet and Canu, 2003]
o Variable scaling: extension to Maximum Entropy Discrimination [Jebara and Jaakkola, 2000]
o Joint classifier and feature optimization (JCFO) [Krishnapuram et al., 2004]
– [Zhu et al., 2004]
– [Bi et al., 2003]
o Least squares with l1-norm (LASSO) and sparse logistic regression [Tibshirani, 1996] (extensions: [Roth, 2003] and [Shevade and Keerthi, 2003])
o Sparse kernel Fisher discriminant [Mika et al., 2000]
o [Perkins et al., 2003]: objective includes l2, l1, and l0 regularization.
o Zero-norm optimization [Weston et al., 2003]
o Concave minimization [Bradley and Mangasarian, 1998]
o Potential Support Vector Machine [Hochreiter and Obermayer, 2003]
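To illustrate the weight-based analyses listed above, here is a minimal Recursive Feature Elimination loop. This is our own sketch, not the published algorithm in full: a real RFE would refit an SVM or other linear model at each step, whereas here a simple per-feature covariance with the target stands in for the fitted weight vector:

```python
def rfe(X, y, n_keep, fit_weights):
    """Recursive Feature Elimination sketch: repeatedly fit a model on the
    active features, then drop the feature with the smallest absolute
    weight, until n_keep features remain. Returns the surviving indices."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        w = fit_weights([[row[j] for j in active] for row in X], y)
        drop = min(range(len(active)), key=lambda i: abs(w[i]))
        del active[drop]                  # eliminate the least useful feature
    return active

def cov_weights(X, y):
    """Stand-in 'fit': per-feature covariance with y (not a real linear
    model, just enough to exercise the elimination loop)."""
    n = len(y)
    my = sum(y) / n
    return [sum(row[j] * (yi - my) for row, yi in zip(X, y)) / n
            for j in range(len(X[0]))]

# Toy data: y = x0 + x2, so features 1 and 3 are eliminated first.
X = [[1, 0, 1, 1], [2, 1, 0, 1], [3, 0, 2, 1], [4, 1, 1, 1]]
y = [2, 2, 5, 5]
kept = rfe(X, y, 2, cov_weights)          # kept == [0, 2]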
- Usefulness of measures other than Shannon's
- MI as a proxy
- bounds relating MI to Bayes error rate
- evaluation of MI in practice
- evaluation of the joint MI
- Markov blankets
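The "evaluation of MI in practice" item above can be sketched for discrete variables with the empirical plug-in estimator. This is our own illustrative helper (real chapters on this topic would also cover bias correction and continuous variables):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples:
    I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x)*p(y)) ),
    with all probabilities replaced by empirical frequencies."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))      # joint counts
    px = Counter(xs)                # marginal counts
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Perfectly dependent variables: I(X;Y) = H(X) = 1 bit.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # → 1.0
```

The plug-in estimator is biased upward for small samples, which is one reason the joint-MI and bound-based items above matter in practice.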
3.2 MI for distance metrics
- learning local metrics that enhance relevant information
3.3 MI for feature construction
- learning global transforms that enhance relevant information
3.4 Distributional clustering
- Sufficient statistics
- Information Bottleneck, its variations, and rate-distortion theory
the rest): Bagging, Bumping, Random Forest, Stochastic Discrimination
-- serial/sequential ensembles (each new base learner built using information on existing members of the committee): Arching, Boosting, and stagewise modeling [features - base learners (often simple functions of input variables)]:
---- simple example: forward/backward stepwise selection in linear regression
---- more advanced: forward stagewise modeling with regularization, penalized regression resulting in a sparse ensemble (e.g. l1 regularization/lasso) -> feature selection
-- combining experts, stacking, postprocessing
-- search strategies: different flavors of random subsampling, feature decompositions, GAs, hill-climbing, etc.
-- dealing with high-dimensional noisy data: tree-based ensembles with dynamic soft feature selection [BorisovEruhimovTuv]
-- variable selection is done based on global (not univariate) importance to the exploratory model (one such measure of importance is based on total variance reduction and requires no extra computation; another is based on the sensitivity of the learned model to a change in the variable of interest)
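The parallel-ensemble idea above (bagging) reduces to a short loop: resample the training set with replacement, fit a base learner on each resample, and average the predictions. This is our own minimal sketch; the trivial mean-predictor base learner is just a placeholder for the trees or other models the chapter discusses:

```python
import random

def bagged_predict(X_train, y_train, x, fit, n_models=25, seed=0):
    """Bagging sketch: fit n_models base learners on bootstrap resamples
    of the training data, then average their predictions at point x.
    `fit(X, y)` must return a callable model."""
    rng = random.Random(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        model = fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        preds.append(model(x))
    return sum(preds) / n_models

# Placeholder base learner: ignores x and predicts the training mean.
fit_mean = lambda X, y: (lambda x: sum(y) / len(y))
p = bagged_predict([[0], [0], [0], [0]], [1, 2, 3, 4], [0], fit_mean)
```

Random Forest adds one twist to this loop: each base learner also sees only a random subset of the features, which is where ensemble methods and feature selection meet.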
classification using latent binary vectors to identify different submodels.
- variable selection using Bayesian decision theory, attaching costs to the inclusion of variables
- Bayesian variable selection for clustering
- Bayesian neural networks and Gaussian process models
- automatic relevance determination
- practical implementations: Monte Carlo methods, Gibbs sampler, Metropolis-Hastings algorithm
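A Metropolis-Hastings sampler for Bayesian variable selection fits in a few lines when the state is a binary inclusion vector. This is our own sketch with a toy log-posterior (a real application would compute the marginal likelihood of the data under each submodel):

```python
import math
import random

def mh_inclusion(log_post, k, steps, seed=0):
    """Metropolis-Hastings over binary inclusion vectors gamma (length k).
    The proposal flips one randomly chosen coordinate (symmetric), so the
    acceptance probability is min(1, exp(log_post(new) - log_post(old))).
    Returns estimated marginal posterior inclusion probabilities."""
    rng = random.Random(seed)
    gamma = [0] * k
    lp = log_post(gamma)
    counts = [0] * k                       # running inclusion counts
    for _ in range(steps):
        j = rng.randrange(k)
        gamma[j] ^= 1                      # propose a single-bit flip
        lp_new = log_post(gamma)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            lp = lp_new                    # accept the move
        else:
            gamma[j] ^= 1                  # reject: undo the flip
        for i in range(k):
            counts[i] += gamma[i]
    return [c / steps for c in counts]

# Toy log-posterior strongly favoring inclusion of variable 0 only.
probs = mh_inclusion(lambda g: 3.0 * g[0] - 3.0 * (g[1] + g[2]), 3, 4000)
```

A Gibbs sampler replaces the accept/reject step by sampling each coordinate exactly from its conditional posterior; the bookkeeping is otherwise the same.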
Over the past several years we have successfully developed core technology for intelligent data analysis and mining, pattern recognition and trend analysis, and information retrieval, filtering and analysis for a variety of applications, including sensory information processing using state-of-the-art computational intelligence.
This chapter focuses on the next generation of sensor information processing, analysis and interpretation, and its application in medicine, the oil industry, industrial inspection and diagnosis systems, and speech recognition. The technologies can be grouped according to the functionality of different components of the system as follows:
1.3 Signal attribute/characteristic analysis, such as conventional Hilbert attributes, FFT-based frequency-domain attributes, modulation spectral features, and other techniques such as wavelet analysis and empirical correlation of attributes, as well as linear and non-linear aggregation operators for attribute fusion.
The goal of this chapter is to provide the reader with theoretical guidance for evaluating the complexity of feature selection methods. Some methods are more prone to overfitting than others and thus require more training data. The chapter goes over the notion of performance bounds involving the number of training examples and model complexity. Feature selection is presented as a particular case of the more general problem of model selection. The ambition of this chapter is to provide the sizes of training sets required for obtaining guarantees of generalization with various methods, or conversely a ranking of methods in order of anticipated prediction performance given the training set size.
This chapter complements Chapter 3. In Chapter 3, empirical methods of performance evaluation are provided. Such methods, like k-fold cross-validation, are widely used by practitioners because they have proven reliable in applications. Error bars for the performance predictions of such methods are, however, difficult or impossible to compute (see e.g. the work of Yoshua Bengio). Learning-theoretic performance bounds involving the ratio of the complexity of the model to the number of training examples complement this empirical approach. It has been suggested that fitting the constants of the bound with empirical data might yield better estimates of performance error bars (see e.g. the work of Corinna Cortes). Thus, in relation to Chapter 3, it may be possible to compare more models using cross-validation methods without running the risk of overfitting, if some notion of model complexity is involved. Second, having a notion of model complexity may allow practitioners to rank models according to anticipated predictive power and decide which model to try first when faced with a new problem that has a given set of descriptive statistics. Ranking models is an easier problem than predicting performance. If the order is good, fewer methods need to be tried and cross-validation becomes more reliable.
Since the existing body of theoretical results may be insufficient to answer all the questions raised, this chapter will take care to state the problems mathematically, so that theoreticians might be challenged to solve them.
Examples and dreams:
Some authors (see e.g. the work of Andrew Ng) have proved that methods like “ordered-FS” have low complexity. The sample complexity of ordered-FS, a forward selection technique, is logarithmic in the number of irrelevant features, compared to a power law for wrapper subset selection methods with extensive search strategies. This means that such variable ranking can tolerate a number of irrelevant variables that is exponential in the number of training examples. Can such complexities be calculated for all canonical feature selection methods? Beyond purely theoretical bounds that assume we know some data statistics, like the number of useless features, which may be unknown, can we rank feature selection methods in order of predicted relative predictive strength using descriptive statistics of the data? Such statistics may include: the number of training examples, the fraction of positive and negative examples, the number of features, an index of feature covariance, and the fraction of useless features determined with statistical tests.
This chapter covers the basic ideas underlying soft computing. Its current incarnations have links to many earlier influences, among them Prof. Zadeh’s 1965 paper on fuzzy sets; the 1973 paper on the analysis of complex systems and decision processes; and the 1979 report (1981 paper) on possibility theory and soft data analysis. The principal constituents of soft computing (SC) are fuzzy logic (FL), neural network theory (NN) and probabilistic reasoning (PR), with the latter subsuming belief networks, evolutionary computing including DNA computing, chaos theory and parts of learning theory. Some of the achievements of the field of soft computing are: fuzzy reasoning (set and logic), new soft computing algorithms making intelligent, semi-unsupervised use of large quantities of complex data, uncertainty analysis, perception-based decision analysis and decision support systems for risk analysis and management, computing with words, the computational theory of perception (CTP), and precisiated natural language (PNL).
2.3 Fuzzy reinforcement learning techniques for online learning, maintenance, and adaptation of developed systems.
Development of new tools and methodologies based on computing with words and perceptions (CWP). This will include the use of perception-based reinforcement learning and the BISC-DSS system.