Sunday, February 20, 2011

Boosting and Random Trees

Described by Learning OpenCV as the best out-of-the-box supervised classification techniques available in the library

AdaBoost
  • Build a set of weak decision trees (weak = higher misclassification rates due to fewer splits).
  • Each tree could only be trained to fit a certain portion of the data well.
  • Each tree is assigned a weight on how its result will contribute to the final sum.
  • The OpenCV implementation supports only 2-class classification (yes/no). The result is taken as the sign of the weighted sum of the individual tree outputs.
  • Four kinds of boosting are supported, of which the book recommends 'REAL' for categorical data and 'GENTLE' for regression data (see the sketch after this list).
  • weight_trim_rate: used to reduce build time by skipping samples that have small enough weights.
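A minimal sketch (my own, not from the book) of training OpenCV's 2-class CvBoost on made-up toy data with REAL boosting and weight trimming; all shapes and parameter values here are illustrative assumptions.

    #include <cstdio>
    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>

    int main()
    {
        // Toy 2-class data: 100 samples, 5 numerical features, labels 0/1 (all made up).
        cv::Mat samples(100, 5, CV_32F), labels(100, 1, CV_32S);
        cv::randu(samples, 0, 1);
        cv::randu(labels, 0, 2);

        // Mark every feature as ordered and the response as categorical.
        cv::Mat varType(samples.cols + 1, 1, CV_8U, cv::Scalar(CV_VAR_ORDERED));
        varType.at<uchar>(samples.cols) = CV_VAR_CATEGORICAL;

        // REAL boosting, 100 weak trees (stumps of depth 1), weight_trim_rate = 0.95.
        CvBoostParams params(CvBoost::REAL, 100, 0.95, 1, false, 0);

        CvBoost boost;
        boost.train(samples, CV_ROW_SAMPLE, labels,
                    cv::Mat(), cv::Mat(), varType, cv::Mat(), params);

        // predict() returns the class label, i.e. the sign of the weighted sum of tree outputs.
        printf("predicted class: %f\n", boost.predict(samples.row(0)));
        return 0;
    }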
Random Tree - based on Random Forests by Leo Breiman.
  • Train a collection of 'pure' (over-fit, high-variance) trees. Feed the same input vector to all trees. Final result is the class with the majority vote (for classification) or average (for regression) among tree results.
  • 'Random-Feature-Subset' - each node is split by choosing the best from a random subset (size = square-root of feature count) of variables.
  • 'Random-Data' - each tree is built from a different bootstrap sample of the data (drawn with replacement); the samples a tree never sees form its 'out-of-bag' set.
  • Question: is it possible to get fully-classified(pure) tree if the variable required is not in the current feature-subset?
  • Good for parallel processing - the trees are trained independently of each other, and at prediction time the input vector can be run down all trees at the same time.
  • Able to report the proximity of two input vectors (see the sketch below).
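Below is a small, hypothetical sketch of training CvRTrees and querying a prediction, the proximity of two samples, and variable importance; the toy data and parameter values are assumptions, not taken from the book or the samples.

    #include <cstdio>
    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>

    int main()
    {
        // Toy data: 200 samples, 8 numerical features, labels 0/1 (all made up).
        cv::Mat data(200, 8, CV_32F), labels(200, 1, CV_32S);
        cv::randu(data, 0, 1);
        cv::randu(labels, 0, 2);

        // Mark the response as categorical so the forest does classification.
        cv::Mat varType(data.cols + 1, 1, CV_8U, cv::Scalar(CV_VAR_ORDERED));
        varType.at<uchar>(data.cols) = CV_VAR_CATEGORICAL;

        // 100 trees of max depth 10; nactive_vars = 0 lets OpenCV pick the
        // sqrt(feature-count) random subset size; variable importance enabled.
        CvRTParams params(10, 10, 0, false, 15, 0, true, 0, 100, 0.01f,
                          CV_TERMCRIT_ITER | CV_TERMCRIT_EPS);

        CvRTrees forest;
        forest.train(data, CV_ROW_SAMPLE, labels,
                     cv::Mat(), cv::Mat(), varType, cv::Mat(), params);

        // Prediction: each tree votes and the majority wins.
        printf("predicted class: %f\n", forest.predict(data.row(0)));

        // Proximity: fraction of trees in which the two samples land in the same leaf.
        cv::Mat a = data.row(0), b = data.row(1);
        CvMat ca = a, cb = b;
        printf("proximity: %f\n", forest.get_proximity(&ca, &cb));

        // Variable importance, one entry per feature.
        const CvMat* imp = forest.get_var_importance();
        if (imp)
            printf("importance of feature 0: %f\n", cvmGet(imp, 0, 0));
        return 0;
    }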
More Trees (comments made without a thorough understanding of the papers)
  • Extremely Randomized Trees (CvERTrees) - a.k.a. Extra-Trees. Takes the randomization further by also randomizing the choice of 'cut-point' (the value to split at), regardless of the purity of the resulting nodes. On the other hand, it is trained on the full sample set instead of on bootstrap samples (bagging).
  • Gradient Boosting Trees (CvGBTrees) - designed primarily for regression. The paper presents this from a mathematical point of view; it reads like an approximation series. Starting from a first approximation, each tree refines the previous prediction by adding or subtracting a correction value. It also borrows from Breiman (Random Trees) the idea of training each tree on a random subset of the data. The term 'Gradient' refers to the gradient-descent style of minimizing the remaining approximation error (a sketch follows this list).
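A minimal, hypothetical sketch of fitting CvGBTrees to a toy 1-D regression problem (squared loss, shrinkage 0.1, each tree trained on a random 80% subsample); the data and parameter values are my assumptions.

    #include <cstdio>
    #include <cmath>
    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>

    int main()
    {
        // Toy regression problem: y = sin(x) sampled at 200 points.
        cv::Mat x(200, 1, CV_32F), y(200, 1, CV_32F);
        for (int i = 0; i < x.rows; i++)
        {
            x.at<float>(i, 0) = i * 0.05f;
            y.at<float>(i, 0) = std::sin(i * 0.05f);
        }

        // Squared loss, 200 boosting iterations, shrinkage 0.1,
        // 80% random subsample per tree, trees of depth 3.
        CvGBTreesParams params(CvGBTrees::SQUARED_LOSS, 200, 0.1f, 0.8f, 3, false);

        CvGBTrees gbt;
        gbt.train(x, CV_ROW_SAMPLE, y,
                  cv::Mat(), cv::Mat(), cv::Mat(), cv::Mat(), params);

        // Each added tree nudges the running approximation toward the target.
        cv::Mat query(1, 1, CV_32F, cv::Scalar(1.0f));
        printf("f(1.0) ~ %f (true value %f)\n", gbt.predict(query), std::sin(1.0f));
        return 0;
    }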
No free lunch - despite the good performance of R-Trees, the book still recommends trying out different classifiers (D-Tree, R-Tree, AdaBoost) once the classification data is defined.

Sample (tree_engine.cpp)
  • Compare different 'forests' with the same set of data (waveform or mushroom).
  • The waveform feature vector is numerical while the response is categorical.
  • Introduces the CvMLData class and its member functions, particularly the API for reading sample data from a CSV file and setting parameters for train/test splitting and attribute types (see the sketch after this list).
  • Trees/forests under comparison:
    • Decision Tree (without surrogate splits or pruning)
    • Random Trees (forest of 100 trees)
    • Extremely Randomized Trees (100 trees)
    • Gradient Boosting Trees (not sure how many)
  • These APIs are not in the cv namespace, i.e. they are the C-style Cv* classes.
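A sketch of the CvMLData workflow loosely modeled on tree_engine.cpp; the file name "waveform.data", the response column index, and the 50/50 split are assumptions about how the sample is typically run.

    #include <cstdio>
    #include <opencv2/ml/ml.hpp>

    int main()
    {
        CvMLData data;
        if (data.read_csv("waveform.data") != 0)   // load the samples from a CSV file
        {
            printf("Could not read the CSV file\n");
            return -1;
        }

        data.set_response_idx(21);                     // column holding the class label (assumed)
        data.change_var_type(21, CV_VAR_CATEGORICAL);  // mark the response as categorical

        CvTrainTestSplit split(0.5f);                  // 50% of the samples for training
        data.set_train_test_split(&split);

        // Train a plain decision tree (no surrogate splits, no pruning) straight from CvMLData.
        CvDTree dtree;
        dtree.train(&data, CvDTreeParams(10, 2, 0, false, 16, 0, false, false, 0));

        printf("train error: %f\n", dtree.calc_error(&data, CV_TRAIN_ERROR));
        printf("test error:  %f\n", dtree.calc_error(&data, CV_TEST_ERROR));
        return 0;
    }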
Results and Observations (waveform data)
  • R-Trees and ER-Trees have 0 training error.
  • The (single) Decision Tree has a test error of 28.8 while the others are between 17 and 18.
  • Variable importance for ER-Trees does not fall off, so it would be hard to use for cutting down the size of the feature vector.
  • GB-Trees take a much longer time to finish; presumably the majority of the time is spent on training.
Sample (letter_recog) - 20000 samples, 16-dimensional feature vectors, 26 classes (A-Z)
  • Three choices: Random-Trees, Boost, MLP-ANN
  • Boost: demonstrates the technique of 'unrolling' the sample data so that the 2-class boosted classifier can handle a 26-class problem (see the sketch after this list):
    • The dimension of the input vector is increased by 1; the appended value is a class number (0..25).
    • Each sample is duplicated 26 times, and each copy is appended with a different class number. The response values are changed from A..Z to 0 or 1: the response is set to 1 only on the copy whose appended class number matches the true class. In other words, only one of the 26 copies gets a response of '1', so the total number of samples marked '1' stays the same.
    • As expected, the prediction phase requires manually collecting the votes by calling the prediction function 26 times, each time with the input vector appended with the corresponding class number.
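The sketch below shows the unrolling idea itself (a simplified, hypothetical version rather than the exact letter_recog.cpp code); 'data' is assumed to be nsamples x 16 of CV_32F and 'labels' holds class indices 0..25.

    #include <opencv2/core/core.hpp>

    // Build the unrolled training set: 26 copies per sample, each with an extra
    // "candidate class" feature and a 0/1 response.
    void unroll(const cv::Mat& data, const cv::Mat& labels,
                cv::Mat& newData, cv::Mat& newResponses)
    {
        const int nclasses = 26;
        const int nsamples = data.rows, dims = data.cols;

        newData.create(nsamples * nclasses, dims + 1, CV_32F);
        newResponses.create(nsamples * nclasses, 1, CV_32S);

        for (int i = 0; i < nsamples; i++)
        {
            for (int c = 0; c < nclasses; c++)
            {
                int row = i * nclasses + c;
                // copy the original 16 features
                data.row(i).copyTo(newData.row(row).colRange(0, dims));
                // append the candidate class number as the extra feature
                newData.at<float>(row, dims) = (float)c;
                // 1 only on the copy whose candidate class matches the true label
                newResponses.at<int>(row, 0) = (labels.at<int>(i, 0) == c) ? 1 : 0;
            }
        }
    }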
Observations
Random-Trees (80% of the samples for training, 100 trees, 10 max splits, random-set-size of 4)
  • Proximity values are low: single-digit percentages, even for two samples of the same letter 'T'.
  • For similar characters, such as 'I' compared to 'T', the proximity already drops to 0.
  • Variable importance is presented as a percentage of the overall total.
  • Demonstrates the use of a 'sample-idx' array to mask the training set (see the sketch after these observations).
  • Recognition Rate: train 85.6%, test 81.6% [ build-time 31.795 secs ]
  • At a random-set-size of 10: train 85.6%, test 78.7% [ build time 36.288 secs ]; proximity values are higher, up to 22%. Do the test results drop because the random-set-size is too big, reducing the differences between the trees?
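A sketch of the sample-idx masking mentioned above, assuming 'data' (nsamples x 16, CV_32F) and 'responses' (nsamples x 1, CV_32S) are already loaded; the 80% split, 100 trees, max depth 10, and 4-variable random set mirror the settings listed above, while the remaining parameters are assumptions.

    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>

    // Train a random forest on the first 80% of the samples by masking with sample_idx.
    void train_forest(const cv::Mat& data, const cv::Mat& responses, CvRTrees& forest)
    {
        int nsamples_all = data.rows;
        int ntrain_samples = (int)(nsamples_all * 0.8);

        // sample-idx mask: 1 = use the sample for training, 0 = hold it out for testing.
        cv::Mat sample_idx = cv::Mat::zeros(1, nsamples_all, CV_8U);
        sample_idx.colRange(0, ntrain_samples).setTo(cv::Scalar::all(1));

        // All features ordered, response categorical.
        cv::Mat var_type(data.cols + 1, 1, CV_8U, cv::Scalar(CV_VAR_ORDERED));
        var_type.at<uchar>(data.cols) = CV_VAR_CATEGORICAL;

        // 100 trees, max depth 10, random subset of 4 variables per split.
        forest.train(data, CV_ROW_SAMPLE, responses,
                     cv::Mat(), sample_idx, var_type, cv::Mat(),
                     CvRTParams(10, 10, 0, false, 15, 0, true, 4, 100, 0.01f,
                                CV_TERMCRIT_ITER | CV_TERMCRIT_EPS));
    }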
Boost (REAL, unrolled to 26 classes, 50% of the samples for training, 100 trees with a max depth of 5)
  • Recognition Rate: train 92.6%, test 86.9% [ Build-time 356.778 seconds ]
  • Recognition Rate (weight_trim_rate=0.7): train 92.6%, test 86.9% [ Build-time 354.784 seconds ]
  • Recognition Rate (GENTLE boost): train 96.9%, test 90.9% [ Build-time 374.948 seconds ]. This higher rate is consistent with the OpenCV book's recommendation to use Gentle Boost for numerical data.
Multilayer-Perceptron (ANN)

Resources
Readings
  • Learning OpenCV
  • Introduction to Machine Learning
  • Extremely randomized trees, Geurts et al.
  • On various boosting variants: Additive Logistic Regression: A Statistical View of Boosting, Friedman, Hastie, Tibshirani
  • GBT
    • Greedy Function Approximation: A Gradient Boosting Machine, J. Friedman
    • Stochastic Gradient Boosting, J. Friedman

2 comments:

  1. Hi,

    Any tutorials on the Opencv code to implement regression calculations?

    Dan

    Peace.

  2. Hi Dan,

    As far as I know, CvStatModel is the base class for many of the machine-learning algorithms in OpenCV. It supports regression function approximation (var_type = CV_VAR_NUMERICAL), and the Decision Tree implementation supports regression. You could probably find some datasets here to try it out:
    http://archive.ics.uci.edu/ml/
