Friday, February 25, 2011

Learning Deformable Models with Latent SVM

The sample program only demonstrates how to use the latent SVM for classification. The paper describes the training part in details. Although I don't understand all of it, here is the summary:

Latent SVM is a system built to recognize object by matching both
1. the HOG models, which consists of the 'whole' object and a few of its 'parts', and 2. the position of parts. The learned positions of object-parts and the 'exact' position of the whole object are the Latent Variables. The 'exact' position is with regard to the annotated bounding box from the input image. As an example, a human figure could be modeled by its outline-shape (whole-body head-to-toe) together with its parts (head, upper-body, left arm, right arm, left lower lib, right lower lib, feet).

The HOG descriptor for the whole body is Root Filter and those for the body parts are Parts Filter.

The target function is the best response by scanning a window over an image. The responses consists of the outputs from the all the filters. The search for best match is done in a multi-scale image pyramid. The classifier is trained iteratively using coordinate-descent method by holding some components constant while training the others. The components are Model Parameters (Filters Positions, Sizes), weight coefficients and error constants. The iteration process is a bit complicated - so much to learn! One important thing to note is the positive samples are composed of moving the parts around an allowable distance. There is a set of latent variables for this ( size of the movable-region, center of all the movable-regions, quadratic loss function coefficients). Able to consider the 'movable' parts is what I think being 'deformable' means.

Detection Code

The code for latent SVM detector code is located at OpenCV/modules/objdetect/. It seems to be self-contained. It has all the code needed to build HOG pyramids.
The detection code extract HOG descriptors from the input image and build multi-scale pyramids. It then scan the models (root and parts) over the pyramids for the good matches. Non-max suppression is used I think to remove those proximity matches. A threshold is applied to the score from SVM equation to determine the classification.

Some trained models in matlab file format (voc-release4.tgz and older) are available for download at the website. But how to convert the available matlab files (such as cat_final.mat) to that XML format? There is a VOCWriteXML function in the VOC devkit (in matlab). Wonder if that could help.

Sample (latentsvmdetector.cpp)
  • Load a pre-built model and detect the object from an input image.
  • There does not seem to be a detector builder in OpenCV.
  • By looking at cat.xml The cat model has 2 models. They are probably bilateral symmetric model. Each model has 6 parts. The root filter sizes are 7x11 and 8x10.
Results (with cat.xml model)
  • [cat.jpg] Took 61 seconds to finish. Able to detect the cat. Two false-positives at the top-right corner.
  • [lena.jpg] Took 77 seconds. It detected Lena's beautiful face (including the purple feather hat and shoulder) ! Two other detected objects: her hat and some corner at the top-left corner of the picture.
  • [tennis-cats.jpg] Took 44 seconds. It detected all 3 cats. Although the middle one and left cat and treated as one. Those two are closer together.
  • [295087.jpg from GrabCut collection] Took 50 seconds. Somehow classified the Tree and the Rock Landscape as a cat!
  • [260058.jpg from GrabCut collection] Took 76.5 seconds. Detected two false objects: 1) an area of the desert sand (small pyramid at the top edge), 2) part of the sky with clouds nears the edges.
  • Without knowing how the model is trained, hard to tell the quality of this detector.; It is possible that it is taken from the 'trained' classifier parameters from the releases from the paper author (voc*-release.tgz).

Latent SVM:

A Discriminatively Trained, Multiscale, Deformable Part Model, P. Felzenszwalb, et al.

Object Recognition - Bag of Keypoints

Bag of Keypoints is a object recognition technique presented in a paper "Visual Categorization with Bags of Keypoints". The Bag-of-Keypoints idea is borrowed from Bag-of-Words for text data mining.

The paper differentiates this 'multi-class' object recognition technique from 'object recognition', 'content-based image retrieval' and 'object detection'.

Very Brief Summary
The goal is to recognize 'classes' of object given an input image. Each object class is associated with a bunch of interest-points (features). In Naive-Bayes terms, we could train a classifier with labeled data to predict the class of object present in an image by looking at the set of detected interest-points. Linear SVM is also trained as classifier for comparison. Since it is a  multi-class problem, a one-against-all method is used. They trained 'm = # classes' numbers of SVM. Each output a confidence value on whether the input image belongs to that class.
The bag-of-words refers to a vocabulary set. The vocabulary set is built by k-mean clustering of keyPoint-descriptors (such as SIFT). A BOW Descriptor is a histogram of vocabularies. One BOW descriptor for one image. Key-points detected on images are looked up from its associated vocabulary. The corresponding bin from the histogram is incremented. The BOW descriptors (histogram) from training images are then used to train the SVM classifiers.

Harris-Affine-Detector -> SIFT Descriptor -> Assigned to Cluster -> 'Feature Vector' -- (+ label) -> Naive Bayer Bayes (or m SVMs).

Sample (bagofwords_classification.cpp)

The bagofwords sample program performs the bag-of-keypoints training and classification as in the paper. The training and test data format supported is PASCAL 2007-2010.

The program defines a class VocData that understands the PASCAL VOC challenge data format. It is used to look up the list of training-set and test-set image files for a specific object-class. The class also defines helper functions to load/save classifier results and gnuplots.

The program defines functions to load/save last run parameters, vocabulary-set and BOW descriptors.

Despite the big chunk of code, the functions are pretty well-defined. There are sufficient code-comments.

User specifies the keypoint-detection method, keypoint-descriptor and keypoint-matching method to the BOWKmeansTrainer. The vocabulary-set is built with one chosen class of object training images. The code-comment says building with one particular class is enough.

SVM is used for object classification. The CvSVM class is used. The number of instances is the same as the number of object classes. Each is trained with both positive and negative samples of a particular class object. That SVM would be tested with all classes of test objects.

See LIBSVM for more details on CvSvm implementation for OpenCV.

BOW Image descriptor is the histogram of vocabulary occurrences in a single image. It is a simple array - rows-of-image x cols-of-vocabulary. Each row is send to SVM for training.

DDMParams load/stores the keypoint detector-descriptor-matcher type.

VocabTrainParams stores the name of the object-class to be used for training. It also loads/saves the maximum vocabulary size, memory to use, and proportion-of-descriptor to use for building vocabulary. Not all the detected image key-points are used. The last parameter specifies the fraction of that to be randomly picked from each input.

SVMTrainParamsExt stores some parameters that control the input to the SVM training process. These do not overlap with the CvSVMParam. There are 3 parameters:

  1. descPercent controls the fraction of the BOW image descriptors to be used for training. 
  2. targetRatio is preferred ratio of positive to negative samples. Precisely this parameter is the fraction of positive samples from all samples. It also means that some of the samples will be thrown away to maintain this ratio. 
  3. balanceClasses is a boolean. If it is true, then the C-SVC weight given to the positive and negative samples will be same to the pos:neg ratio of samples used for training. See CvSVMParams::class_weights for usage. If it's set to true, the targetRatio will not be used.

RBF is chosen as kernel function for SVMs. The related parameters will be chosen automatically, presumably by the crawling the 'Grid'. See LIBSVM docs.


Used Harris Affine Detector - SIFT descriptor - BruteForce matcher for key Points matching.
Default parameters for BOWKmeansTrainer.
As stated above, the demo application saves user-preferences, BOW descriptors, SVM classifier parameters and Test results to an output directory.
Stopped the running after 'aeroplane' class. It took too long. Save for another time when there is a spare PC. On the other hand, 10103 BOW descriptors are already built. And there are 11322 JPEG images. That means only 1000 more image descriptors to extract. Most of the time would be spent on training SVMs in the future.

Took very long time to build the vocabulary - k-means never seem to converge below the default error value. So it stops after 100 iterations which is the default maximum.
Computing Feature Descriptors (Detect + Extract): 6823 secs ~ 2hrs
Vocabulary Training ( 3 attempts of (k=1000)-means ): 75174 secs ~ 21 hours

SVM Classifier Training (for one classifer, aeroplane)
  • Took 5 hours to extract BOW Descriptors from 4998 Training Set images.
  • Took another 2.6 hours to train SVM classifer with 2499 descriptors of above. Meaning only 50% is used for training. Of which 143 are positive and 2356 are negative.
SVM Classifier Testing (for one classifier, aeroplane)
  • Took 5 hours to extract image descriptors from the 5105 Test Set images.
  • Took only 0.04 seconds to classify all the Test Set descriptors.
  • The output has a gnuplot command file. Applied to cygwin gnuplot, output a PNG file. It shows the Average Precision of 0.058 and a plot of Precision versus Recall.

  • Visual Categorization with Bags of Keypoints, Csurka, et al.
  • A Practical Guide to Support Vector Classification, see LIBSVM from Resources

Wednesday, February 23, 2011

HOG Descriptor

Excellent paper by Dalal and Triggs. It gives a working example on choosing of various modules at the recognition pipeline for human figure (pedestrians).

Much simplified summary
It uses Histogram of Gradient Orientations as a descriptor in a 'dense' setting. Meaning that it does not detect key-Points like SIFT detectors (sparse). Each feature vector is computed from a window (64x128) placed across an input image. Each vector element is a histogram of gradient orientations (9 bins from 0-180 degrees, +/- directions count as the same). The histogram is collected within a cell of pixels (8x8). The contrasts are locally normalized by a block of size 2x2 cells (16x16 pixels). Normalization is an important enhancement. The block moves in 8-pixel steps - half the block size. Meaning that each cell contributes to 4 different normalization blocks. A linear SVM is trained to classify whether a window is human-figure or not. The output from a trained linear SVM is a set of coefficient for each element in a feature vector.

I presume Linear SVM means the Kernel Method is linear, and no projections to higher dimension. The paper by Hsu, et al suggests that linear method is enough when the feature dimension is already high.

OpenCV implementation (hog.cpp, objdetect.hpp)
The HOGDescriptor class is not found in the API documentation. Here is notable points judging by the source code and sample program(people_detect.cpp):

  • Comes with a default human-detector. It says at the file comment that it is "compatible with the INRIA Object Detection and Localization toolkit. I presume this is a trained linear SVM classifier represented as a vector of coefficients;
  • No need to call SVM code. The HOGDescriptor.detect() function simply uses the coefficients on the input feature-vector to compute the weight-sum. If the sum is greated than the user specified 'hitThreshold' (default to 0), then it is a human-figure.
  • 'hitThreshold' argument could be negative.
  • 'winStride' argument (default 8x8)- controls how the window is slide across the input window.
  • detectMultiScale() arguments
    • 'groupThreshold' pass-through to cv::groupRectangles() API - non-Max-Suppression?
    • 'scale0' controls how much down-sampling is performed on the input image before calling 'detect()'. It is repeated for 'nlevels' number of times. Default is 64. All levels could be done in parallel.
Sample (people_detect.cpp)
  • Uses the built-in trained coefficients.
  • Actually needs to eliminate for duplicate rectangles from the results of detectMultiScale(). Is it because it's calling to match at multiple-scales?
  • detect() return list of detected points. The size is the detector window size.
  • With GrabCut BSDS300 test images - only able to detect one human figure (89072.jpg). The rest could be either too small or big or obscured. Interestingly, it detected a few long-narrow upright trees as human figure. It takes about 2 seconds to process each picture.
  • With GrabCut Data_GT test images - able to detect human figure from 3 images: tennis.jpg, bool.jpg (left), person5.jpg (right), _not_ person7.jpg though. An interesting false-positive is from grave.jpg. The cut-off tomb-stone on the right edge is detected. Most pictures took about 4.5 seconds to process.
  • MIT Pedestrian Database (64x128 pedestrian shots):
    • The default HOG detector window (feature-vector) is the same size as the test images.
    • Recognized 72 out of 925 images with detectMultiScale() using default parameters. Takes about 15 ms for each image.
    • Recognized 595 out of 925 images with detect() using default parameters. Takes about 3 ms for each image.
    • Turning off gamma-correction reduces the hits from 595 to 549.
  • INRIA Person images (Test Batch)
    • (First half) Negative samples are smaller in size at (1 / 4) of Positives, 800 - 1000 ms, the others takes about 5 seconds.
    • Are the 'bike_and_person' samples there for testing occlusion?
    • Recognized 232/288 positive images. 65 / 453 negative images - Takes 10-20 secs for each image.
    • Again cut-off boxes resulting in long vertical shape becomes false positives
    • Lamp Poles, Trees, Rounded-Top Extrances, Top part of a tower, long windows are typical false positives. Should upright statue considered 'negative' sample?
    • Picked a few false-negatives to re-run with changing parameters. I picked those with large human-figure and stands mostly upright. (crop_00001.jpg, crop001688.jpg, crop001706.jpg, person_107.jpg).
      • Increased the nLevels from default(64) to 256.
      • Decrease 'hitThreshold' to -2: a lot more small size hits.
      • Half the input image size from the original.
      • Decrease the scaleFactor from 1.05 to 1.01.
      • Tried all the above individually - still unable to recognize the tall figure. I suppose this has something to do with their pose, like how they placed their arms.
Histograms of Oriented Gradients for Human Detection, Dalal & Triggs.
A Practical Guide to Support Vector Classifier, Hsu, Chang & Lin

Tuesday, February 22, 2011

Cascade Classifier and Face Detection

There is an excellent and easy-to-understand description from OpenCV Book on using the Haar Features Cascade Classifiers for Face Detection.

Very Simplified Summary
Haar Feature is similar to Haar Wavelet. The weights inside the box-filter could be oriented horizontally, vertically, diagonally.
Viola-Jones Classifier is a 2-class Cascade Classifier. The cascade is made up of a series of nodes.  Each node is a AdaBoost forest (2-class-classifier). An input vector is classified as 'Yes' only if it 'passes' all the cascaded nodes. The classification process aborts when it sees a 'No' from the current node.
Each node is built with high-acceptance rate - therefore many false-positives, and low rejection rate. The trees of AdaBoost forest typically has only a single split. And each forest has about 10  decision stumps (single-split tree). The theory is that the nodes are built to recognize faces of different orientations. Early rejection meaning it spends little time for negative samples.

Found this excellent page from the forum after I wrote this entry:

It requires thousands of good samples and 10s of thousands of bad samples to train the classifier. The book says it could take hours or whole day even for a fast computer. There is no exact number given. I guess it depends on the size of the feature-vector or number of features.
'haarTraining' is a standalone program that will train the classifier with pre-processed feature points from Positive Samples and Negative Samples. User is able to specify parameters to 'shape' the nodes and trees.

Sample Vectors
Positive Samples: Images with faces marked with rectangle. Best results if the the faces are aligned similarly. Do not mix upright with tilted.
Negative Samples: Simply pictures without faces. Preferably with backgrounds similar to the 'Positive samples'.
'createSample' is a standalone program that extracts the face-rectangles and rescale it to the same size as specified by the user.

(Paraphrasing) OpenCV book says Haar Feature Detector works well with Rigid Body with blocky features (like eyes). Objects that's only distinguishing feature is its outline (coffee mug) is hard to detect. 'Rigid' means object that the amount deformation by external pressure is negligible.

Building and Running 'createSamples' and 'haarTraining'
Source code:  OpenCV/modules/haartraining/
VC++ Solution file: CMAKE_Build/modules/haartraining/
Documentation: OpenCV/doc/

Test Sample with Coca-Cola Logo (Step 1: createSample)

createSample uses OpenCV built-in C API to make training and test images by superimposing an input foreground image into a list of user-provided background images. In order to create varieties, the object(foreground) image is transformed (perspective), intensity-adjusted before finally scaled to the specified size and overlaid on to the background image.

  • Training Samples: Use createSample to produce a _single_ 'vec' file suitable for training. All the input images are embedded in that file. See header file for details (comment added).
  • Test Samples: Use createSample to produce a set of test images together with an 'info' file. The plain text file specifies the region of the transformed foreground object inside each test image. Only a single object would be overlaid on each background image.
  • 'createSample' application can be used to view the images inside a 'vec' file.

Produced 500 images of with coca-cola logo embedded on 6 of the background images chosen from the GrabCut BSDS300 test images.

Test Sample with Coca-Cola Logo (Step 2: haartraining)

The haartraining program is straightforward, it calls the cvCreateTreeCascadeClassifier with the necessary cascade-parameters, input 'vec' file location and output directory location.

What is the difference between cvCreateTreeCascadeClassifier() and cvCreateCascadeClassifier()?

No idea. Glanced through the code. cvCascadeClassifier seems to be more straightforward. cvCreateTreeCascadeClassifier does more than basic Cascade training. There is early termination condition checking. And there is training-data clustering, probably for evaluation of the classifier stages.

Explanation of the 'mem' command-line parameter of haartraining.cpp is misguided. 

haartraining.htm says it specifies the maximum memory allocated for pre-calculation in Megabytes. It is actually passed to cvCreateTreeCascadeClassifier() as 'numprecalculated' argument. It specifies the number of features to be 'pre-calculated', whatever that means. So it is true that a higher number requires more memory. But the value itself does not cap the amount of memory allocated for this pre-calculation task. In fact,  code-comment from cvhaartraining.hpp includes a formula on how the memory for 'feature pre-calculation' is a function of this argument.

  • Used the 'createSample' to produce 500 Positive Samples with a Coca-Cola Logo embedded on about 6 background images chosen at random. The cola-cola logo image is reduced from 482x482 to 36x36 in size.
  • Used all 200 images from the set GrabCut test samples as Negative Samples.
  • Classifier is created in 2 forms.  A single XML file and a database format. The database consists of a set of directories - one per stage. cvCreateTreeCascadeClassifier() actually calls cvLoadHaarCascadeClassifierCascade() to produce the XML file from the directory-set, as demonstrated from in convert_cascade sample.
  • The number of stages built is actually 8 instead of 14 as specified. The training stops with this message: "Required leaf false alarm rate achieved. Branch training terminated.".
  • The training function reports the performance using training data: 98.6 hit rate, 8.96e-6 false-alarm rate.
  • The 99% hit rate is achieved at the first stage, the rest of the stages lowers the false-alarm rate which starts at 10%.
Console Output
  • BACKGROUND PROCESSING TIME: Time taken to load negative sample (and extract Haar features?)
  • "Number of used features": Varies from 1 to 5, corresponding to the number of rows a tabular format output. This number seems to represent the number of trees at the current cluster (stage).
  • "Number of features used" (different from the last point): Simply calculated from the size of the object and not from 'feature-detection' of the actual training pictures.
  • How come 'Chosen number of splits' and 'Total number of splits' are always zero?
  • Training time could be long and requires lots of CPU and memory.
  • In fact, the CPU constantly maxes out.
  • Time required is proportional to the number of features, and that in turn is proportional to the size of the foreground object picture (coca-cola logo).
  • At original resolution (482x482) - program ran out of memory in a few minutes.
  • At 48x48 resolution ~ about 4.1 million 'features' and MEM set to 512. 1st stage takes an hour ( did not wait to complete).
  • At 36x36 resolution ~ about 1.3 million features and MEM kept at 512. it takes 3 hours to complete. It terminates by itself after 8 stages out of 14, with reason stated earlier.

Test Sample with Coca-Cola Logo (Step 3 - final: face_detect)

OpenCV book gives excellent description on function parameters for CascadeClassifer::detectMultiScale(). Especially on the 'scaleFactor' and the 'flags' arguments.

Test Data
  • Create 6 test image similar to training images.
Test Results
  • Original parameters: Able to detect from 3 out of 6 images.
    One that have failed are much smaller size than the rest (36x36), which is actually the original object size! The other two failures are probably related to the object is tilted.
    The book suggests training separately the upright and tilted objects.
  • Reduced 'scaleFactor' from 1.1 to 1.01: Able to detect the 36x36 object.
    The detection is scale-sensitive. So giving it a finer scaling steps increases the hit-rate, at the expense of receiving more false-positive results.
  • Re-generate the set of test images, with half the maximum rotation angle for distortion: more 1 more object is recognized.

Test Sample with Running Face Classifier (face_detect)

The face_detect sample demonstrates how to 'nest' classifiers to detect finer features. By default the sample deploys the face-alt-2 classifiers to find face regions. Followed by the eye-tree-eyeglasses classifier to find smaller features from within each of the regions returned by the face-alt-2 classifier.

Pre-built Face Cascade Classifiers
  • Location: OpenCV/data/haarcascades/
  • Dimensions of the training object could be found in most classifier files inside XML comments near the beginning.
  • Check the value of 'minSize' to detectMultiScale() of nested Classifier. The minimum for face could be too big as for mouth.
  • Set the 'minSize' argument to maintain aspect ratio of the trained object.
  • It takes around one second to finish the detecting process for a 30x30 object from a VGA picture. ScaleFactor at 1.1.
Wikipedia on Rigid Body: (Script to expand CMU GroundTruth Data)
CMU-MIT Face Detection Test Sets:
Face Databases as noted from some of the haarcascade classifier files:
Robust Real-Time Face Detection, Viola & Jones, International Journal of Computer Vision 57.

Sunday, February 20, 2011

Boosting and Random Trees

Described by Learning OpenCV as Best Out-of-the-box Supervised Classification Techniques available in the library

  • Build a set of weak decision trees (weak = higher misclassification rates due to fewer splits).
  • Each tree could only be trained to fit a certain portion of the data well.
  • Each tree is assigned a weight on how its result will contribute to the final sum.
  • OpenCV Implementation supports only 2-Class Classification (yes-no). Result is presented by the Sigmoid(?) of the sum of weighted outputs from each tree.
  • 4 kinds of Boosting supported, of which the book recommends 'REAL' for categorical data and 'GENTLE' for regression data.
  • weight_trim_rate: used to reduce build time by skipping samples that have small enough weights.
Random Tree - based on Random Forests by Leo Breiman.
  • Train a collection of 'pure' (over-fit, high-variance) trees. Feed the same input vector to all trees. Final result is the class with the majority vote (for classification) or average (for regression) among tree results.
  • 'Random-Feature-Subset' - each node is split by choosing the best from a random subset (size = square-root of feature count) of variables.
  • 'Random-Data' - each tree is built with a different subset of data (out-of-bag, with replacement)
  • Question: is it possible to get fully-classified(pure) tree if the variable required is not in the current feature-subset?
  • Good for parallel processing - trees training are independent of each other. Input vector run down all trees at the same time at predication phase.
  • Able to report Proximity of two input vectors.
More Trees (comment without thorough understanding of the paper)
  • Extremely Randomized Tree (CvERTree) - a.k.a Extra-Trees. Taking the randomization further by randomizing the choice of 'cut-point' (value to split), regardless of purity of the resulting nodes. On the other hand, it is trained with full sample instead of bootstrap samples (bagging).
  • Gradient Boosting Tree (CvGBTree) - designed primarily for regression. The paper presents this from math point-of-view. It looks like a approximation series. Starts with an first approximation, each tree refines the last tree's prediction by adding/subtracting a value. It also borrows from Breiman (Random Trees) the OOB method of choosing random subset data for each tree. The term 'Gradient' could be referring to gradient-descent method of minimizing approximation error.
No free lunch - Despite good performance of R-Tree, the book still recommends deploying different classifiers (D-Tree, R-Tree, AdaBoost) once the classification data is defined.

Sample (tree_engine.cpp)
  • Compare different 'forests' with the same set of data (waveform or mushroom).
  • The waveform feature vector is numerical while the response is categorical.
  • Introduces CvMLData struct and member functions, particularly API for reading sample-data from CSV file, setting parameters regarding train/test splitting, attribute types.
  • Tree/Forest under comparison
    • Decision Tree (without surrogates-splits and pruning)
    • Random Tree (forest have 100 trees)
    • Extreme Random Tree ( 100 trees )
    • Gradient Boosting Tree ( not sure how many )
  • These APIs is not cv namespace, i.e. available in C only.
Results and Observations (waveform data)
  • R-Tree and ER-Tree has 0 training errors.
  • (Single) Decision Tree has test error 28.8 while the others are between 17-18.
  • Variable-Importance for ER-Tree does not fall off. Would be hard to use for cutting down the size of feature-vector.
  • GB-Tree takes much longer time to finish, presumably majority of the time is spent on training.
Sample (letter_recog) - 20000 data, 16-dimension feature vector, 26 classes (A-Z)
  • Three choices: Random-Trees, Boost, MLP-ANN
  • Boost: Demonstrate the technique of "Unrolling sample data" such that it could handle > 2-class problems: 
    • Increase dimension of the input vector by 1. The appended value of each is the class-number (0..25).
    • Each sample is duplicated 26 times, the response value becomes part of the feature vector - appending with the corresponding class number. The new response values are changed from A..Z to 0 or 1. The response is only set to 1 on the tuple where the appended class-number matches the expected class. In other words, only one of the 26 copies get response value of '1'. and therefore the total number of samples marked as '1' stays the same.
    • As expected, the prediction phase requires manually collecting the majority vote by calling the prediction function 26 times. The input vector is appended with the corresponding class number for each time.
Random-Trees (80% samples for training, 100 trees, 10-max-splits, random-set-size of 4)
  • Proximity value is low: Single-digit percentage value, even for the same letter 'T'.
  • Similar characters such as 'I' comparing to 'T' already drops to 0.
  • Variable Importance is presented as percentage of the overall total.
  • Demonstrate use of 'sample-idx' array to mask training set.
  • Recognition Rate: train 85.6%, test 81.6% [ build-time 31.795 secs ]
  • @ random-set-size of 10: train 85.6%, test 78.7% [ build time 36.288 secs ]; Proximity value is higher up to 22%. Test results drop because the random-set-size too big and reducing trees differences?
Boost (REAL, Unrolling to 26 classes, 50% samples for training, 100 trees - max-depth of 5)
  • Recognition Rate: train 92.6, test 86.9% [ Build-time 356.778 seconds ]
  • Recognition Rate (weight_trim_rate=0.7): train 92.6% test 86.9% [ Build-time 354.784 seconds ]
  • Recognition Rate (GENTLE boost): train 96.9% test 90.9% [ Build-time 374.948 ]. This higher rate is consistent with the OpenCV Book recommendation on using Gentle Boost for numerical data.
Multilayer-Perceptron (ANN)

  • Learning OpenCV
  • Introduction to Machine Learning
  • Extremely randomized trees, Geurts et al.
  • On various Boosting variants: ADDITIVE LOGISTIC REGRESSION: A STATISTICAL VIEW OF BOOSTING, Friedman, Hastie, Tibshirani)
  • GBT
    • Greedy Function Approximation: A Gradient Boosting Machine, J. Friedman
    • Stochastic Gradient Boosting, J. Friedman

Friday, February 18, 2011

Multilayer Perceptron - ANN (BackPropagation)

According to OpenCV Book, MLP (Multi-layer Perceptron) is slow to train but able to make quick predict due to it's simplicity. And it's good for text recognition.

API documentation also give a good introduction. What follows is basically paraphrasing from there:

There are 2 implementations in OpenCV:
  1. classical random-sequential back-propagation
  2. batch RPROP (default)
There are 3 choices of activation function - used for all layers of neurons:
  1. Identity - simply the weighted sum
  2. Sigmoid (default) - result would be converted to -1 to 1 (zero-crossing at x=0)
  3. Gaussian (not completely supported?)
Things to consider when deciding network topology (# layers, # hidden nodes)
  • Too big causes over-fitting
  • Takes long time to train
Recommends pre-process the inputs with CalcPCA (Principal Component Analysis) to reduce the dimension of feature-vector - speed up training.

ANN is designed for numerical data, the API documentation suggests a work-around to handle categorical data.

  • Training Method
  • Activation function
  • Network topology: Number of nodes at each layer and number of layers
  • Free parameters: activation functions (alpha, beta) - shape the Sigmoid
Sample (letter-recog.cpp, for comparing with Random Trees and AdaBoost)
  • 80% sample data used for training
  • Topology: Input layers 16 nodes, 2 hidden layer @ 100 nodes each, Output layer 26 nodes
  • Unroll categorical response data (A...Z) to numerical. In this case, each response is a 26-element vector, conceptually representing a bit-vector. For example, if the response is 'C', then the 3rd element is set to 1, others kept at 0. The choice of 26-element vector because it corresponds to the 26 output layer nodes.
  • Is there any easy way in OpenCV to create a new matrix out of an arbitrary set of columns from an existing matrix (without copying)?
Observations (16000 training samples, 16-variable feature-vector, 26 classes)
  • Classical Random-Sequential BP (bp_bw_scale=0.001): train 95.2% test 93.5% (Build-time 3367.66 secs !)
  • RPROP (rp_dw0=0.05): train 75.6% test 74.3% (Build-time 1350.67 secs)
  • RPROP(rp_dw0=0.1): train 78.9% test 78% (Build-time 1333.09 secs)
  • RPROP(rp_dw0=1.0): train 15.2% test 14.3% (Build-time 1329.59 secs)
RPROP parameter rp_dw0 is initialized to different values (0.1,  1.0) in various occasions. Why?

  • . Wikipedia article about the back-propagation algorithm.
  • LeCun, L. Bottou, G.B. Orr and K.-R. Muller, “Efficient backprop”, in Neural Networks—Tricks of the Trade, Springer Lecture Notes in Computer Sciences 1524, pp.5-50, 1998.
  • Riedmiller and H. Braun, “A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm”, Proc. ICNN, San Francisco (1993).

Thursday, February 17, 2011

K-means Clustering

Good description available from Learning OpenCV book.


  • Support multiple attempts with a single call, each time starting with a different set of initial centers/labels. cv::kmeans() returns the result from the best one.
  • 'Best' results is the one with least error (best fit?). Error is the squared-sum of distances between each center and the points that are classified to which it belongs.. 
Sample (kmeans.cpp)

  • Generate 2-D data points, gaussian distributed around K center-points. Data and results visualized as an image.
  • Uses Kmeans++ center initialization (see Resources) by default.
  • Re-discover the K center-points with cv::kmeans()
  • Introduces a shuffle-function: randShuffle(), to randomly re-order the generated data points of a matrix. Allow setting the number of iterations to shuffle.

The idea of kmeans++ is that Far-Apart centers makes good initialization. The first center is picked with random (uniform weights) from all data points. Of the second to the K-th center, we would pick from a weighted distribution. Weights are proportional to the distance of the candidate point from the last chosen center.


  • Picking from 3 'attempts' definitely improve results compared to '1', regardless of how the initial centers are chosen.
  • Error increases when two(or more) 'real' centers are close together such that part of their data points overlaps. In such cases, the centers could easily be misplaced. The function would predict one center for 2 nearby clusters. And one large cluster having two centers.
  • Kmeans++ shows significant advantage over pure random choice of initial points in cases where there are 5 center points and they are far apart (such as one of them near the corner of the range), provided that sample size is large enough.
  • In other cases, their results could be similar, it is not rare to see Kmeans++ actually results in higher error.

Kmeans++ center initialization:


  • Learning OpenCV, Gary Bradski & Adrian Kaehler (O'Reilly Press)
  • Introduction to Machine Learning, Ethern Alpaydin(MIT Press)

Decision Tree

Learning OpenCV book has a good description of the Decision Tree ML method. The API documentation also have a good summary.

Based on paper by "Classification and Regression Trees" by Brieman et al.


  • Related to Tree Pruning (Post-Build)
    •   use_1st_rule, truncate_prune_tree
  • Is cv_fold use for both tree building and pruning with respect to average Gini error?
  • Support surrogate splits to handle unknown data - meaning it finds 'backup' attributes from the feature vector that would split the node with similar 'purity'.

Sample (mushroom)
  • Classification of mushrooms of being Poisonous or Edible based on 20 discrete attributes.
  • Demonstrate decision tree traversal with interactive-prediction phase.
  • Display a table of importance of attributes after the tree is built.
  • Lots of data available to use from UC-Irvine ML Data Repository (see Resources)
  • Sample could be easily modified to tackle other classification databases.
  • Setting 'penalty-weight' to 1 gives 8 percent of false-negatives. Quickly decreased to 0 when it is set to 2.


  • Learning OpenCV, Gary Bradski & Adrian Kaebler (O'Reilly Press)
  • Introduction to Machine Learning, 2nd Edition, Ethern Alpaydin (MIT Press)

Wednesday, February 16, 2011

Single Camera Calibration

Besides OpenCV book, the code documentation section "Camera Calibration and 3d Reconstruction" also provide a brief overview of the elements involved in calibration.

OpenCV provides a set of functions to estimate camera intrinsic and distortion effects from a set of image-views taken from a variety of perspectives on a 3D/planar object. A set of built-in functions to support using NxM chessboard as a planar-object (Z=0) is also included.

Camera Intrinsic [ 3x3 matrix ]

  • focal-lengths (fx, fy) in x, y directions, center-of-projection (cx, cy).

Distortion effects [ (5 to 8) x 1 vector ]

  • Radial Distortion - a vector of 3 to 6 (k1,...k6) coefficients; The book only mentions up to k3.
  • Tangential Distortion. a vector of 2 coefficients (p1, p2). The book suggests setting Tangential coefficient 0 zero for well-built camera. Value too close to 0 could introduce large noise values.

Camera Extrinsic [ 3x4 matrix per view ]

  • 3x3 Rotational appended with 3x1 Translation. This transforms the objects coordinate-system to match that of projected image-plane. 
  • It could be further simplified if the object is planar (2D, Z=0). 
  • For computation efficiency the 3x3 rotational matrix is represented by a 3x1 vector. See Rodrigues Transform below.

The chessboard grid elements are by default represented as 1x1 unit. This is handy to describe the object-coordinates. Conversion that to real-life metrics (such as inches, millimeters) in post-processing is up to the user.

Essential functions
  • calibrateCamera() - estimate the camera intrinsic, distortion coefficients and optionally the extrinsic for each input view. The RMS value of projection-error is returned for result evaluation.
  • projectPoints() - could be used to calculate projection error to find other points of interest from a view that has been used to deduce camera parameters, together with perspective transform of this view.
ChessBoard Functions
  • findChessBoardCorners()
  • drawChessBoardCorners()
  • ChessBoardGeneratorClass - from test and 'calibration_artificial sample'
Undistort functions - undistort the projected view with distortion parameters.
  • undistort()
  • undistortPoints()
  • initUndistortRectifyMap()
  • getOptimalNewCameraMatrix() - adjust the camera intrinsic such that would result in a zoomed view with fewer black regions. The new camera matrix will be passed to initUndistortRectifyMap().
Supporting functions
  • Rodrigues(): perform Rodrigues Transform

Rodrigues Transform
It provides the two-way conversion between 3x3 rotational matrix and 3x1 rotational vector. My understanding is, the 3x3 rotational matrix is made up of a sequential rotation of the 3 axes. The direction of the corresponding 3x1 vector represents a new axis in the 3D space that is equivalent to the combined result of the 3 rotations. The amount of rotation is represented by magnitude of the 3x1 vector.

Chessboard Picture
OpenCV comes with a 9x7 Chessboard in the name of 'pattern.pdf' under OpenCV/docs directory.

Sample (calibration.cpp)
  • Tested with 9x6 chessboard (left series), begin with zero-tangential distortion
  • Discovered bug in saving extrinsic parameters to file-storage - the type of rvecs[i] and tvecs[i] does not match bigmat.
  • Radial distortion example - see right edge of chessboard at left06.jpg before and after 'undistort'.
  • Despite what the manual says, a thin black strip remains at the middle top from distortion-correction even by setting 'alpha' value to 0 to getOptimalNewCameraMatrix().
  • Didn't have a chance to test the video capture method, try later.  
  • calibrateCamera() returns the RMS of back-projection errors. The sample does it's own version in computeReprojectionErrors().
  • Higher order Radial distortion coefficients support with flag CV_CALIB_RATIONAL_MODEL
Sample (Calibration_Artificial.cpp)
  • The camera, chessboard, projected-chessboard-images are all created / defined in code, no actual chessboard/camera is used.
  • The program define camera intrinsic and distortion coefficients for a "virtual camera". Uses chess-board-generator class (copied from 'test' module) to make up 20 chessboard pictures that could have been captured by this virtual camera. The perspective of the chessboards are constrained by the camera parameters.
  • It detects the chess-corners from the generated pictures and call cameraCalibration() function with those detected points. We could see how well it could recover the original camera parameters and what the error rate is.
  • All 'k' elements of a set of valid Distortion Coefficients should be used together. The estimated k1, k2 and k3 values of the previous calibration sample are non-zero (using the leftN.jpg series). Tried replacing the sample-defined distortion coefficients with only the k1 and k2 from the previous example yields projected chessboard images that does not look planar, especially at the 4 corners.
  • ChessBoardGenerator::rendererResolutionMultiplier draws a finer resolution chessboard by first scaling up the square-vertexes locations with the multiplier. Connect the points with contours at the enlarged canvas before down-sampling the result to fit the window.
    Multiplier values - Avg Reprojection Error
    1: 0.0010
    4: 0.0005 (default)
    16: 0.0004

Sunday, February 13, 2011

Video Capture with DirectShow

Hooked up my Canon camera as a PC webcam with ExtraWebCam. Worked fine with Skype/VideoLAN. OpenCV failed to open as DirectShow device. In fact, the DirectShow and code of other video capture methods are disabled because HAVE_VIDEOINPUT macro is not defined. (see cap.cpp)

Updated CMakefile according to this ticket and it worked.

At the beginning I thought the macro is somehow related to missing DirectShow SDK. Then I also installed DirectX-11 SDK (thinking that it comes with DirectShow), followed by Windows SDK for Windows 7 (.Net Framework 3.5SP1). Full installation of Windows SDK requires 4G of hard-disk space. Removing .Net related matters lowers the space needed to 1.5G.

'fback' is a good sample to test.

Saturday, February 12, 2011

Kalman Filter

The OpenCV book gives decent description of Kalman filter. I suspect there are quite a few typos at the equations though. My attempt to summarize the basic idea: Kalman Filter can be used to estimate motion for computer vision. User provides a model of the motion, measurement and Gaussian error co-variances. The Kalman filter updates the uncertainty parameters (mean, variances) of the model with each new measurement value. The model gives prediction on the next time-frame. The weight given to the new measurement is inversely proportional to its variance.

Cycle: Collect Data -> Make Prediction -> Update Model

Prediction must be made before model update. Prediction error is part of the update parameter.

Sample (kalman.cpp)
  • Predict the location of a 'pixel' moving in a circle of some changing angular velocity.
  • Visualization: red line drawn between 'real' and 'meansured'. yellow line drawn between 'real' and 'predict'. Seeing the 'yellow' line 'chasing' the 'red'.
  • Simple break-down:
    • User specified model parameters: (F, Q) that governs the pixel movement and (H, R) for measurement.
    • Real movements are simulated with (F + Q) [ never be known to the filter ]
    • Measurements are simulated with Real movements x H + R.
    • Kalman predicts a location (angle).
    • Kalman (mean, covar) is corrected with the measurement.

How to take advantage of this prediction? As demonstrated from the sample, the measurements are noisy, jumps around a lot, the estimation is more stable (slower to respond to direction or large changes). As the name tells it, the it is a signal filter. It reduces noise with a weighted sum of measurement history. It requires a model (transitional matrix) for action and noise.

I suppose it would help like in the case of road-side camera. The locations are observed(measured), find other aspects of the car motion that could be modeled with transitional-matrix (F), like the accelerator and steering wheel in the GPS application. The model is updated with camera measurements. This is also called sensor data fusion (from Wikipedia).

Kalman filter is also good performance. At each iteration it updates its 'last best model' instead of keeping the whole history. Simplifies weighted sum to only 2 terms (last+best + new).

Seems like it's popular for tracking in Augmented / Virtual Reality, which is a class of applications that requires real-time video tracking.

Learning OpenCV, O'Reilly.

Future Reading
  • High-Performance Wide-Area Optical Tracking, Welch, Bishop, et al.
  • Robust Optical User Motion Tracking Using a Kalman Filter, Dorfm¨ ullerUlhaas
  • Subject on Condensation (Particle Filter) Algorithm and application to non-linear and non-Gaussian motion

Friday, February 11, 2011

Meanshift, CamShift

This is essentially tracking by histogram. An object with a distinguished color histogram is to be tracked. Typical choice is a face. The input image is back-projected with this histogram, essentially picking up areas there is a high probability of these color appearing. Using mean-shift algorithm, a randomly placed window will try to converge to a local maxima within certain iterations. This is repeated at each new frame. The starting location could be re-used to speed up the convergence, assuming the object does not move much / disappear from the scene.

The Cam-Shift method improve on the mean-shift such that the input window size is updated (by size and orientation) to fit the object in the current frame. The paper presented this method as part a computer user-interface that tracks the user's face/head movement. This allows tracking as the user move towards / away from the camera besides the lateral movements that is supported by Mean-Shift. It does so by changing the window size based on the value of the Moments calculated at each iteration.

Meanshift() documentation recommends removing noise from the back-projections before calling meanshift.

CalcBackProject() documentation has a very brief description of the camshift algorithm.

Sample (camshiftdemo)

The actual mean-shift is done on the Hue plane of the HSV space. Histogram covers first half of the full-hue value range - 0 to 180.  0-90 is close to the human skin while the far end is blue-ish.

Trivial change could be made to compare Meanshift and Camshift.

CamShift window resizes and re-orient according to the subject motion or change in camera perspective. The accuracy is pretty impressive.

In cases where it fails to converge (typically when the object disappear from scene or too far away from the last 'seen' location),  the marker stops at a place no where near a face-colored region. This happens when the video cuts to another scene or shown with camera from a vastly different perspective.

The good thing about auto-sizing feature of CamShift is that, I could simply select a small area inside the face. And quickly it will expand automatically to fill the largest area possible.

It is designed to track only one object (see the paper).

The application crashes consistently with more than 1 video files, typically after a few minutes.

Computer Vision Face Tracking For Use in a Perceptual User Interface, Bradski

Thursday, February 10, 2011

Motion Templates

Trying to briefly summarize the description from the "Learning OpenCV"
The motion templates records location of blobs at each time-stamp. It estimates the direction of one(local, segmented) or more blobs (global) with the corresponding 'spatial-timestamp gradients'.
It makes use of a motion-history-image keep track of the latest movements together with timestamps. There is a predefined time window of which history will be kept. The detail mechanics could be referred from Davis99, Bradski00.

The 3 functions that together provide motion template technique:

Sample (motempl.cpp)

The demo application uses simple frame-differencing and thresholding to get the object silhouettes. It uses a cyclical buffer to store the successive frames. The difference is computed with respect to the Nth previous frame where N is the buffer size.

Visualization: The values of motion-history-image (mhi) is converted into the Blue channel of a RGB image by scaling the time-stamp values to fit the 256 discrete levels. This results in a blue-on-black image, where newer and older blobs are shown, overlaid with estimated orientations.  Older blob locations will have faded blue colors. Both global and local orientations are shown.

Test Video 1: Road-Side Camera Video
The orientation values perturbed quite a bit even as the direction of where the car is going does not change. Increasing the cyclical buffer size from 4 to 16 helps. The code comment remarked choosing this value according to video frame rate.

Test Video 2: Aquarium
The perturbation is less serious than the road-side-camera video even with default values. The relatively smooth movements of fish is probably the reason.

G Bradski is one of the authors of the Learning OpenCV and also of the Motion Templates paper?

  • Learning OpenCV
  • Motion segmentation and pose recognition with motion history gradients, Bradski and Davis.
  • Real-time motion template gradients using Intel CVLib, Davis and Bradski

Dense Optical Flow - Polynomial Expansion by Gunner Farneback

It probably requires time to get into details of Polynomial Expansion in order to have a full understanding.

The follow paragraph could be very wrong. Need revisiting later to understand what's under-the-hood.
Farneback method uses Polynomial Expansion to approximate the neighbors of a pixel. The Expansion could be seen as a quadratic equation with Matrices and Vectors as variable and coefficients. This dense optical flow analysis produces a displacement field from two successive video frames. Each displacement vector in the field is estimated by minimizing the 'error' under a 'constraint'. The 'constraint' is an equation A(x)d(x)=delta-b(x) derived from the polynomial expansion. The 'error' is the weighted sum of differences in the pixel neighborhood (x + delta-x) between the images.
Image Pyramids is used to detect large displacements. Also uses Gaussian to smooth out the neighboring displacements. I suppose that's based on the assumption that pixels in proximity move in similar directions.

Sample (fback.cpp)
  • The larger distance movement are not detected with fewer pyramid levels. And there is no noticeable gain in performance.
  • Displacement vectors are drawn at 16 pixels, possibly to avoid cluttering.'
  • Overall is slower than sparse as expected. It seems to have fewer noise. And it detects new objects by nature (Dense).

Two Frame Motion Estimation Based on Polynomial Expansion, Farneback
Polynomial Expansion for Orientation and Motion Estimation, Farneback

Wednesday, February 9, 2011

Sparse Optical Flow - Lucas-Kanade

Quoted from the paper, the goal of feature tracking is to find displacement vector d such that v = u + d for a feature point.

Lucas-Kanade published a sparse tracking method. It asserts some properties for a pixel-in-motion. Devises a velocity equation and track each feature point from one frame to the next. It does so by iterative approximation (Newton Method?). The properties are: there is relatively little change in (1) brightness, and (2) distance, and (3) neighboring pixels move in the same direction. The last property is used to avoid Aperture Effect. The 'matching' is done by comparing the pixel intensities between patch areas. The best-match is the one that gives minimum value of sum of squared-differences.

Image pyramid is built to detect large movement. The algorithm starts from the tip of the pyramid (lowest res), and works its way down to the original resolution. The displacement vector dL calculated at each level is passed down to the next level as the initial estimates. So the final vector d is like a sum of vectors from each level.

Sample (lkdemo) - Lucas-Kanade Pyramid 
  • The demo app uses GoodFeaturesToTrack to find feature points and refine the location with cornerSubPix().  User needs to press a key to (re)initialize the set of feature points from the current frame. User could select points from the video and add to the set.  The current frame becomes the reference frame for the next frame. Reference feature points that failed to be tracked will be dropped. Re-initializing set of feature points from is required to track new objects entering the screen.
  • Noticed that the feature points would quickly be lost track with fewer pyramid levels, as they move away from the camera (tested with the road-side camera video).
  • Using the default parameters give pretty good results in terms of tracking accuracy despite too slow for a real-time video running on this PC.
  • Learning OpenCV (Book)
  • Pyramidal Implementation of the Lucas-Kanade Feature Tracker Description of the algorithm, by Bouguet.

Tuesday, February 8, 2011

BRIEF descriptor

According to the paper, BRIEF is designed to speed up keypoint-matching. The descriptor is simple to create and compare. The trade off is the current design is variant to scale and orientation.

The descriptor is a binary string. Each bit represents a simple comparison between two points inside a 'patch'. The key-point is centered at the patch area. Bit is set to '1' if first point is of higher intensity than the other, '0' otherwise. The paper suggests different ways of selecting points. The empirical results with best recognition rates are points chosen with Gaussian distributed at the image center.

The image needs to be smoothed before comparing pixels to reduce noise.

Hamming distance is used for matching. This takes advantage of the XOR and bit-counting CPU instructions(SSE).

OpenCV Implementation

A typical BRIEF descriptor is made of 16, 32 or 64 (x8) comparisons. The patch size is 48 of length. The spatial-distribution of the comparing-pixels is defined in test-pairs.txt. Integral of the image is computed once. And subsequent comparisons are done in terms of the sum of kernel (9x9). Implying that only the KERNEL_SIZE parameter could be adjusted.

Sample (brief_match_test.cpp)

Using data-sets provided by Visual Geometry Group from Oxford, UK. The results is as expected. The paper uses the same set of data. It could tolerate small rotation but not scaling or change in perspective.
The brief_match_test uses the matches to warp the second image to match the first one(homography). It shows a differential picture by subtracting the warped image from the  first. The more the good quality matches, the better the 'difference' would look like a Canny edge image.

Sample (video_homography.cpp)

This sample makes use of FAST detector, BRIEF descriptor and BruteForce (Hamming distance) matching to correlate Key-Points between video frames. Tested with video clips. It is able to track larger movements  if the camera zooms in to an object, or in cases where the camera pan while the objects stays relatively stationary. By default, it is not able to correlate the points on the car moving in the road-side camera. The tracks does show up after disabling the match_mask to drawMatchesRelative(). A new homography matrix is computed from the matches of every new frame. The match_mask makes up the keyPoints that fits the transform. But since there is no perspective-change from the fixed road-side camera, the mask becomes over-constraint.

The sample introduces a OpenCV class GridAdaptedFeatureDetector. Caller specifies an arbitrary grid size (default 4x4) and maximum feature points. The detection process would divide the max feature points among the grid cells. The grid spreads the feature points around the frame area. The per-cell max-feature-points selects the strongest feature points - highest 'response' value from the KeyPoint class.

The sample uses a technique to track only the points that appear in previous frames. This makes sense because a new scene would cut in and the old points becomes irrelevant. It does that by back-projecting the key points of the current frame with the previous perspective-transform matrix H. The back-projection is done by  applying the inverse of H.  A 2D mask is created with cv::windowedMatchingMask(). The mask elements would be set to 1 if the back-projected location is within a certain distance from a KeyPoint of the reference frame. The reference frame could be last frame or a user selected frame.

The homography matrix also resets itself to Identity Matrix if too few points between the reference (previous) frame and the current frame can be fit with the newly homography matrix.

BRIEF: Binary Robust Independent Elementary Features, Calonder et al.