Stochastic optimization of a serial tree ensemble for CV error.
Victor Eruhimov, Vladimir Martyanov, Eugene Tuv
The problem of building a universal tool for data classification poses
quite a few challenges for researchers. One such tool is a serial ensemble
of decision trees, in which each consecutive tree explains the error of the
current tree set (so-called Gradient Boosted Trees, GBT). It also provides
a natural and efficient method of calculating the influence of each predictor
variable on the response in a given dataset. We propose a fast method for
building accurate tree-based models by applying a decision tree feature
weighting algorithm at each step of the greedy procedure. Each variable is
assigned a so-called importance weight: the higher the weight, the higher
the chance that the variable will be considered as a candidate for
split calculation. The weights are dynamically recalculated at each step,
taking the previous values into account, in order to prevent a single
variable from being overweighted.
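The candidate-selection idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed multiplicative damping factor, and the dictionary representation of importance weights are all assumptions made for the example.

```python
import random

def sample_split_candidates(importance, k, damping=0.5):
    """Sample k distinct candidate variables for a split, with probability
    proportional to their current importance weights, then damp the weights
    of the chosen variables so that no single variable dominates the
    sampling on subsequent steps (hypothetical sketch)."""
    names = list(importance)
    weights = [importance[n] for n in names]
    chosen = set()
    while len(chosen) < min(k, len(names)):
        # weighted sampling: higher importance -> higher selection chance
        chosen.add(random.choices(names, weights=weights)[0])
    for n in chosen:
        importance[n] *= damping
    return sorted(chosen)
```

In a real GBT implementation this selection would run once per node split, and the importance weights would be updated from the trees already built rather than by a fixed damping constant.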
The predictive power of the ensemble depends considerably on the choice of
several real-valued training parameters. We propose to choose these parameters
with an algorithm based on particle filtering with simulated annealing that
optimizes the cross-validation error of the classifier.
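A generic simulated-annealing loop over a real-valued parameter vector gives the flavor of such a search; here the objective function stands in for the cross-validation error, and the step scale, cooling schedule, and step count are illustrative assumptions, not the paper's settings (the particle-filtering component is omitted for brevity).

```python
import math
import random

def anneal(objective, x0, steps=200, t0=1.0, cooling=0.97, scale=0.1):
    """Minimize `objective` (a stand-in for CV error) over a real-valued
    parameter vector by simulated annealing: perturb, accept improvements,
    and accept worsening moves with a temperature-dependent probability."""
    x, fx = list(x0), objective(x0)
    best, fbest = list(x), fx
    t = t0
    for _ in range(steps):
        # Gaussian perturbation of every parameter
        cand = [xi + random.gauss(0.0, scale) for xi in x]
        fc = objective(cand)
        # Metropolis acceptance rule
        if fc < fx or random.random() < math.exp((fx - fc) / max(t, 1e-12)):
            x, fx = cand, fc
            if fc < fbest:
                best, fbest = list(cand), fc
        t *= cooling  # geometric cooling schedule
    return best, fbest
```

In the setting described above, each objective evaluation would train the tree ensemble and measure its cross-validation error, so the search budget (`steps`) trades accuracy of the tuned parameters against total training time.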
We consider this tool to be a universal method for versatile data analysis.