The paper differentiates this 'multi-class' object recognition technique from 'object recognition', 'content-based image retrieval' and 'object detection'.
Very Brief Summary
The goal is to recognize the class of an object in an input image. Each object class is associated with a set of interest points (features). In Naive Bayes terms, a classifier is trained on labeled data to predict the class of the object present in an image from the set of detected interest points. A linear SVM is also trained as a classifier for comparison. Since this is a multi-class problem, a one-against-all method is used: m SVMs are trained, where m is the number of classes, and each outputs a confidence value for whether the input image belongs to its class.
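To make the one-against-all idea concrete, here is a minimal sketch. It assumes the final decision is simply the class whose classifier reports the highest confidence; that rule, the function name and the example values are illustrative assumptions, not taken from the paper or the sample program.

```python
def predict_one_vs_all(confidences):
    """Pick the class whose one-against-all classifier is most confident.

    `confidences` maps class name -> confidence that the image belongs
    to that class (one value per trained classifier). Taking the maximum
    is an assumed decision rule for illustration.
    """
    if not confidences:
        raise ValueError("need at least one class confidence")
    return max(confidences, key=confidences.get)


# Hypothetical confidences from m = 3 classifiers for one image.
scores = {"aeroplane": -0.7, "car": 0.4, "bicycle": 0.1}
predicted = predict_one_vs_all(scores)
```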
The 'bag of words' refers to a vocabulary set. The vocabulary is built by k-means clustering of keypoint descriptors (such as SIFT), with each cluster center acting as one word. A BOW descriptor is a histogram over the vocabulary words, one per image. Each keypoint detected in an image is looked up in the vocabulary to find its nearest word, and the corresponding histogram bin is incremented. The BOW descriptors (histograms) of the training images are then used to train the SVM classifiers.
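The histogram step can be sketched in a few lines. This is only an illustration in plain Python, assuming Euclidean distance for the nearest-word lookup and raw counts in the bins; the tiny 2-D "descriptors" are made up (real SIFT descriptors have 128 dimensions).

```python
import math


def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each vocabulary word.

    `vocabulary` is a list of cluster centers; `descriptors` are the
    keypoint descriptors of a single image. Returns one bin per word.
    """
    hist = [0] * len(vocabulary)
    for d in descriptors:
        # Nearest word by Euclidean distance (an assumed metric).
        nearest = min(range(len(vocabulary)),
                      key=lambda i: math.dist(d, vocabulary[i]))
        hist[nearest] += 1
    return hist


# Toy example: 2 words, 3 descriptors from one image.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0)]
hist = bow_histogram(descriptors, vocabulary)
```

Dividing each bin by the number of descriptors would give a normalized histogram, which makes images with different keypoint counts comparable.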
Harris-Affine-Detector -> SIFT Descriptor -> Assigned to Cluster -> 'Feature Vector' -- (+ label) -> Naive
The bagofwords sample program performs the bag-of-keypoints training and classification as in the paper. The supported training and test data format is PASCAL VOC 2007-2010.
The program defines a class VocData that understands the PASCAL VOC challenge data format. It is used to look up the list of training-set and test-set image files for a specific object-class. The class also defines helper functions to load/save classifier results and gnuplots.
The program defines functions to load/save last run parameters, vocabulary-set and BOW descriptors.
Despite the big chunk of code, the functions are pretty well-defined. There are sufficient code-comments.
The user specifies the keypoint-detection method, keypoint descriptor and keypoint-matching method to the BOWKMeansTrainer. The vocabulary set is built from the training images of one chosen object class. The code comment says that building it from one particular class is enough.
SVMs are used for object classification, via the CvSVM class. One instance is created per object class, and each is trained with both positive and negative samples for its class. Each SVM is then tested against test images from all classes.
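As a rough sketch of how the positive and negative samples for one class could be labeled, assuming each training image comes with the set of object classes it contains. The +1/-1 convention and all names here are illustrative assumptions.

```python
def one_vs_all_labels(image_classes, target):
    """Label each image +1 if it contains `target`, otherwise -1.

    `image_classes` is a list with one set of class names per image.
    """
    return [1 if target in classes else -1 for classes in image_classes]


# Hypothetical annotations for four training images.
annotations = [{"aeroplane"}, {"car", "person"}, {"aeroplane", "person"}, {"dog"}]
labels_aeroplane = one_vs_all_labels(annotations, "aeroplane")
labels_person = one_vs_all_labels(annotations, "person")
```

The same set of BOW descriptors is reused for every class; only the label vector changes from one SVM to the next.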
See LIBSVM for more details on the CvSVM implementation in OpenCV.
A BOW image descriptor is the histogram of vocabulary occurrences in a single image. The descriptors are stored as a simple array with one row per image and one column per vocabulary word. Each row is sent to the SVM for training.
DDMParams loads/stores the keypoint detector-descriptor-matcher type.
VocabTrainParams stores the name of the object class used for training. It also loads/saves the maximum vocabulary size, the memory to use, and the proportion of descriptors used to build the vocabulary. Not all detected keypoints are used: the last parameter specifies the fraction of descriptors to pick at random from each input image.
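The random pick of a fraction of descriptors per image might look like the following. This is a sketch only; the function name, the seed argument and rounding down are assumptions for illustration, not the sample program's actual code.

```python
import random


def sample_descriptors(descriptors, fraction, seed=None):
    """Randomly keep `fraction` of an image's descriptors (rounded down)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded only to make runs repeatable
    count = int(len(descriptors) * fraction)
    return rng.sample(descriptors, count)


# 100 dummy descriptors (just indices here), keep a quarter of them.
all_descriptors = list(range(100))
kept = sample_descriptors(all_descriptors, 0.25, seed=0)
```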
SVMTrainParamsExt stores some parameters that control the input to the SVM training process. These do not overlap with CvSVMParams. There are three parameters:
- descPercent controls the fraction of the BOW image descriptors to be used for training.
- targetRatio is the preferred ratio of positive to negative samples. More precisely, it is the fraction of positive samples among all samples. Some samples are therefore thrown away to maintain this ratio.
- balanceClasses is a boolean. If true, the C-SVC weights given to the positive and negative samples follow the pos:neg ratio of the samples used for training, and targetRatio is not used. See CvSVMParams::class_weights for usage.
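A small sketch of what these two options imply for the sample counts. Both functions are assumptions written for illustration: the first computes how many samples would survive if the majority side is thinned to reach targetRatio, the second returns each class's share of the samples as a stand-in for the class weights. Which weight is assigned to which class in the real program is not covered here.

```python
def counts_for_target_ratio(n_pos, n_neg, target_ratio):
    """Return (pos_kept, neg_kept) so positives are ~target_ratio of the total.

    Only the over-represented side is reduced. Requires 0 < target_ratio < 1.
    """
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    current = n_pos / (n_pos + n_neg)
    if current < target_ratio:
        # Too few positives: discard negatives.
        neg_kept = int(n_pos * (1 - target_ratio) / target_ratio)
        return n_pos, min(neg_kept, n_neg)
    # Too many positives: discard positives.
    pos_kept = int(n_neg * target_ratio / (1 - target_ratio))
    return min(pos_kept, n_pos), n_neg


def class_shares(n_pos, n_neg):
    """Fraction of positive and negative samples (assumed weight basis)."""
    total = n_pos + n_neg
    return n_pos / total, n_neg / total


# Counts from the aeroplane run described in these notes.
kept = counts_for_target_ratio(143, 2356, 0.5)
shares = class_shares(143, 2356)
```

With the aeroplane counts, a targetRatio of 0.5 would keep all 143 positives and only 143 of the 2356 negatives, which shows how much data the option can discard.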
RBF is chosen as the kernel function for the SVMs. The related parameters are chosen automatically, presumably by searching the parameter grid. See the LIBSVM docs.
Used Harris Affine detector, SIFT descriptor and BruteForce matcher for keypoint matching.
Default parameters for BOWKMeansTrainer.
As stated above, the demo application saves user-preferences, BOW descriptors, SVM classifier parameters and Test results to an output directory.
Stopped the run after the 'aeroplane' class because it took too long; saving the rest for when a spare PC is available. On the other hand, 10103 BOW descriptors are already built out of 11322 JPEG images, so only about 1200 more image descriptors remain to be extracted. Most of the remaining time would be spent training SVMs.
Building the vocabulary took a very long time: k-means never seems to converge below the default error value, so it stops after 100 iterations, the default maximum.
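The two stopping conditions can be illustrated with a toy k-means. This is a plain-Python sketch of Lloyd's algorithm with a naive initialization (the first k points), not the OpenCV implementation; the parameter names `max_iter` and `eps` are made up for this example.

```python
import math


def kmeans(points, k, max_iter=100, eps=1e-6):
    """Return (centers, iterations_run, converged).

    Stops when no center moves by more than `eps`, or after `max_iter`
    iterations, whichever comes first.
    """
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    dims = len(points[0])
    for iteration in range(1, max_iter + 1):
        # Assignment step: accumulate each point into its nearest center.
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            counts[j] += 1
            for d, value in enumerate(p):
                sums[j][d] += value
        # Update step: move centers and track the largest movement.
        shift = 0.0
        for j in range(k):
            if counts[j] == 0:
                continue  # leave an empty cluster's center where it is
            new_center = [s / counts[j] for s in sums[j]]
            shift = max(shift, math.dist(new_center, centers[j]))
            centers[j] = new_center
        if shift < eps:
            return centers, iteration, True
    return centers, max_iter, False


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, iterations, converged = kmeans(points, 2)
_, _, converged_early = kmeans(points, 2, max_iter=1)
```

On real data with k=1000 and many high-dimensional descriptors, the center movement can stay above the threshold for a long time, so the iteration cap ends up being the condition that stops the run.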
Computing Feature Descriptors (Detect + Extract): 6823 secs ~ 2hrs
Vocabulary Training ( 3 attempts of (k=1000)-means ): 75174 secs ~ 21 hours
SVM Classifier Training (for one classifier, aeroplane)
- Took 5 hours to extract BOW Descriptors from 4998 Training Set images.
- Took another 2.6 hours to train the SVM classifier with 2499 of the above descriptors, meaning only 50% are used for training. Of these, 143 are positive and 2356 are negative (about 5.7% positive).
- Took 5 hours to extract image descriptors from the 5105 Test Set images.
- Took only 0.04 seconds to classify all the Test Set descriptors.
- The output includes a gnuplot command file. Running it through Cygwin gnuplot produces a PNG file showing a plot of precision versus recall and an average precision of 0.058.
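For reference, average precision can be computed from the classifier confidences roughly as below. This sketch uses the simple non-interpolated definition (mean of the precision at each correctly ranked positive), which is an assumption; the PASCAL VOC evaluation may use an interpolated variant, so numbers need not match the program's output exactly.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision.

    `scores` are classifier confidences, `labels` are True for images
    that really contain the class. Images are ranked by descending score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking: positives at ranks 1 and 3 out of 4 images.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```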
- Object Recognition Data from Oxford VGG: http://www.robots.ox.ac.uk/~vgg/data/
- PASCAL VOC http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/
- Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
- Visual Categorization with Bags of Keypoints, Csurka, et al.
- A Practical Guide to Support Vector Classification, see LIBSVM from Resources