Tuesday, December 6, 2011

Building OpenCV 2.2 SDK Package

I built an OpenCV 2.2 package on 64-bit Windows 7 and installed it on 32-bit Windows XP. The package is an executable. It includes source (optional), built binaries, docs, and samples. NSIS is required to create the package.

CMake Configuration
  • Supply Path to MSVC Runtime Redistributable DLLs
  • Disable BUILD_TESTS
  • Disable Python support
From Visual C++ 2010:
  • Build OpenCV Solution in Debug Configuration
  • Build OpenCV Solution in Release Configuration
  • Copy opencv.ico to where it is specified in Package.cmake.in
  • Copy the opencv.pdf to where it is specified in Package.cmake.in
  • Build PACKAGE project in OpenCV Solution

Tuesday, November 22, 2011

Show Running Video Frame Number

Spent some time figuring out which existing (free of charge) media player could display the current frame number, or the playback position with millisecond resolution. I focused my search on the Windows platform.
  • VLC: Unable to find a way to step frame-by-frame
  • VirtualDub: Yes - it is on by default. Container support is limited, though: it works with AVI and MPG, but not MP4.
  • Media Player Classic: Yes - but I am only able to show it as OSD, not in the status / information area.
    1. Open the video file
    2. Configure the OSD option to display 'frame number' from Media Player Classic menu: Play -> Filters -> ffdshow Video Decoder -> Properties
    3. Turn on OSD option from ffdshow (available from windows taskbar)

Friday, August 26, 2011

Convert RAW YUV File to AVI with AVISynth and VirtualDubMod

Here is a straightforward way of placing a raw video file in YUV420 format into an AVI container.
  • Download and Install AviSynth 2.5.8
  • Download and Place the RawSource AviSynth Plug-in in the AviSynth plugins directory.
  • Download and Install VirtualDubMod
  • Now prepare a simple AviSynth script file called importYUV.avs with the following lines:
file_1 = rawsource("overbridge.yuv", 1920, 1088, "I420")
return file_1
  • Run VirtualDubMod
  • Open the importYUV.avs as Video File.
    Now the first frame of video should appear in the VirtualDubMod window.
  • Choose File->SaveAs to save the imported YUV file as AVI
    • Choose 'Direct stream copy' as "Video mode".
    • Press 'Save' Button.
  • Ta-da! 
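A raw YUV file has no header, which is why RawSource has to be told the width, height and pixel format. A quick sanity check before importing is that the file size is a whole multiple of the frame size. Here is a minimal C++ sketch of that arithmetic for I420; the helper names are my own, and it assumes even width and height.

```cpp
#include <cstdint>

// Bytes in one I420 (planar YUV 4:2:0) frame: a full-resolution Y plane
// followed by U and V planes at half resolution in each dimension.
// Assumes even width and height.
std::uint64_t i420FrameBytes(std::uint64_t width, std::uint64_t height) {
    const std::uint64_t luma = width * height;
    const std::uint64_t chroma = (width / 2) * (height / 2);  // per chroma plane
    return luma + 2 * chroma;
}

// Number of whole frames contained in a raw file of the given size.
std::uint64_t i420FrameCount(std::uint64_t fileBytes,
                             std::uint64_t width, std::uint64_t height) {
    return fileBytes / i420FrameBytes(width, height);
}
```

For the 1920x1088 clip above, one frame works out to 3,133,440 bytes, so a file whose size is not a multiple of that probably has different dimensions or a different pixel format.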

Tuesday, August 2, 2011


Some highlights of my primitive understanding of NPP and CUDA. The description below does not cover the graphics-related aspects of CUDA, such as textures and getting resources from other 3D APIs.

General notions
Host - CPU, Device - GPU
Thread, Block, Grid, Core
Kernel - a single function to operate on data-array.

NPP - NVIDIA Performance Primitives (NPP_Library.pdf)
It is a set of C primitives that operate on arrays of data. Typical program flow: allocate memory on the device, copy the input array to the device, call the NPP functions, and copy the result array back from device memory.
There are 2 sets of APIs. The first set operates on 1D arrays - Signals. The other set operates on 2D arrays - Images.

  • Signal functions (nppsXXX) : arithmetic, set, shift, logical, reduction.
  • Image functions (nppiXXX) : arithmetic, set/copy, stats (min, max, mean,...), histogram, transforms (warp, affine, perspective, domain), color-space conversion, filtering, etc.
NVCC - NVIDIA compiler
The detailed compilation flow is sophisticated. NVCC separates the input source (.cu) into the parts that run on the host and on the device. It delegates the host code to the compiler responsible for the host application. The device portion of the code is compiled to intermediate code (.ptx) or architecture-specific code (.cubin), depending on the compiler options. Either way, the compiled device code is embedded in the application binary. At application launch time, the PTX code is compiled to an arch-specific image and downloaded to device memory. PTX is the output for a 'virtual architecture': an abstract version of a device, characterized by its compute-capability index (1.0, 1.1, .., 2.0,...). The NVIDIA Runtime library finds out what the actual device hardware is at execution time and compiles the PTX code accordingly.

CUDA - C extensions, Parallelizing Framework ( CUDA_C_Programming_Guide.pdf )
It has nothing to do with NPP (?). It primarily lets host applications perform parallel data processing using GPU cores (SIMT). It defines a set of C extensions (Appendix B) so that the programmer can define how code and data are placed and executed on the device. The framework supplies a Runtime API and a Driver API. The Driver API is a lot like the Runtime API. It allows finer control in some cases, e.g. pushing / popping contexts.
A device context is similar to a CPU process.
Driver API - cuXXXX()
Typical programming flow: Initialize device - Create contexts - Load module ( PTX or arch-specific-binary) - Choose Kernel (function) from current context - Execute it.
Each host thread keeps a stack of contexts. The top of the stack is the 'current' context. Creating a context for a device automatically pushes it on top of the stack. A context can be popped from the stack and it remains valid; any thread can pick it up and run it. Omitted from the simplified flow above: calls to copy data to and from the device around the kernel execution.
Runtime API - cudaXXXX()
Concept - Single Function Many Data
A Function is called Kernel. It is defined by prefixing __global__ to a C function in CUDA source (.cu). These functions could only be called by functions defined in CUDA source. Typically a Kernel calculates one element of output array.
Each core runs a block of threads by time-slicing(?). A GPU has N cores. GeForce 310M has 16.
The workload of processing an array is spread across the available cores by grouping threads into blocks. The NVIDIA runtime decides how these blocks are scheduled onto the available cores. CUDA 4 supports up to 3 array dimensions (x, y, z).
Program Flow
Typical program flow is similar to aforementioned - Allocate host and/or device memory - Copy/Map data to device - Launch the kernel with the (in, out) array data locations, and number of blocks and block-size - Copy back the result data to host.
Error handling needs special care because of the asynchronous nature. 2 types of error checking: At-Entry (parameter checking) and At-Finish (when the Kernel function returns).
__global__ defines the function as a Kernel.
__device__ and __constant__ define a variable in the device global and constant memory areas respectively.
__shared__ defines a variable to be placed in thread-block memory.
Only supports a subset of C++. See Appendix B of CUDA_C_Programming_Guide.pdf.
A pair of triple angle brackets, <<< and >>>, specifies the thread-block info (number of blocks and block size) and launches the kernel; the data arguments follow in ordinary parentheses. The launch is ASYNCHRONOUS!
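To make the "one kernel call computes one output element" idea concrete, here is a host-only C++ sketch (deliberately not CUDA code, so it runs anywhere) that mimics how a launch configuration maps block and thread indices onto array elements. The function names are my own, and the two nested loops stand in for what the GPU would run in parallel.

```cpp
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * blockSize >= n (ceiling division).
std::size_t blocksFor(std::size_t n, std::size_t blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Stand-in for a __global__ kernel: each (block, thread) pair handles one element.
void addKernel(std::size_t blockIdx, std::size_t blockDim, std::size_t threadIdx,
               const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& out) {
    const std::size_t i = blockIdx * blockDim + threadIdx;  // global element index
    if (i < out.size()) {  // the last block may be only partly used
        out[i] = a[i] + b[i];
    }
}

// Serial stand-in for a launch of the form kernel<<<blocks, blockSize>>>(...).
// Assumes a and b have the same length.
std::vector<float> launchAdd(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::size_t blockSize) {
    std::vector<float> out(a.size());
    const std::size_t blocks = blocksFor(a.size(), blockSize);
    for (std::size_t blk = 0; blk < blocks; ++blk) {
        for (std::size_t t = 0; t < blockSize; ++t) {
            addKernel(blk, blockSize, t, a, b, out);
        }
    }
    return out;
}
```

The bounds check inside the kernel matters because the array length is rarely an exact multiple of the block size, so some threads in the last block have nothing to do.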
Other aspects
  • Device Memory Hierarchy - Global, Block(shared), Thread
  • Efficient use of memory - Copying from global to block is expensive. Make use of __shared__.
  • Concurrency - Streams ( a sequence of asynchronous commands ). Data copies can be made asynchronous with the 'async' variants of the data copy functions. Concurrent data transfer and concurrent kernel execution depend on GPU capability.
  • Synchronization - Events, Explicit CUDA API, Implicit (device data copy to host, and others).

Monday, August 1, 2011

OpenCV 2.3.0 GPU speed-up with CUDA 4

Now it's time to build OpenCV 2.3.0 with GPU enabled.

Configuration and Build - http://opencv.willowgarage.com/wiki/OpenCV_GPU
Follow the steps described in the OPENCV_GPU page for Visual Studio 64-bit build.
module-gpu build error: Configuration(null)
Solution - missing vcvars64.bat in Windows SDK amd64 directory. Create that by following the simple instructions here http://www.w7forums.com/vcvarsall-bat-no-64bit-support-vcvars64-bat-missing-t6606.html
Taken by surprise at first, because I was able to build 64-bit OpenCV. I suspect it has to do with the NVIDIA compiler (nvcc). It probably opens a Windows shell to do the compilation, and that shell would not have the 64-bit environment set up without this vcvars64.bat.

Test GPU build by running module-gpu-test suite from VS 2010 Express
See "Implementing tests" section of http://opencv.willowgarage.com/wiki/CodingStyleGuide
Setting up Test Data
Test-Data is required by the gpu-test-suite (and others too). Download a snapshot of the opencv-extra package that is tagged for OpenCV 2.3.0 release from WillowGarage. There is a "Download Zip" link in the source browsing page that makes it convenient.
Set the environment variable OPENCV_TEST_DATA_PATH to point to the testdata directory.
Run the project module-gpu-test
Resulted in 3 types of failures
  1. My NVIDIA hardware has a compute capability of 1.2, and 1 case requires 1.3.
  2. Crash in meanShift and meanShiftProc. The stack trace shows that it dies at the point where GpuMat variable is being released.
  3. Assertion error in NVidia.TestHaarCascadeAppl. (Didn't investigate further).
The other tests run OK.

Learned to use the --gtest_ command-line arguments - see code comments above ParseGoogleTestFlagsOnlyImpl()
  • --gtest_list_tests : lists the tests selected to run, then quits
  • --gtest_filter= : selects the tests to run (or not to run) by matching a specified pattern against the test name. Patterns for negative matching follow a minus sign.
  • --gtest_output=xml[: directory name / file-name ] : outputs a summary of test results in XML. For details see ts_gtest.cpp (search for GTEST_DEFINE_string_)
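The filter syntax is easier to remember with a small model of it. The sketch below is my own simplified re-implementation of the matching rules as I understand them from the Google Test documentation (':' separates patterns, '*' matches any string, '?' matches one character, and everything after the first '-' is a negative pattern). It is not the code from ts_gtest.cpp, and the test names used with it are only examples.

```cpp
#include <cstddef>
#include <string>

// Wildcard match: '*' matches any substring, '?' matches any single character.
bool wildcardMatch(const char* pattern, const char* text) {
    if (*pattern == '\0') return *text == '\0';
    if (*pattern == '*') {
        return wildcardMatch(pattern + 1, text) ||
               (*text != '\0' && wildcardMatch(pattern, text + 1));
    }
    if (*text == '\0') return false;
    return (*pattern == '?' || *pattern == *text) &&
           wildcardMatch(pattern + 1, text + 1);
}

// True if name matches any of the ':'-separated patterns in list.
bool matchesAny(const std::string& list, const std::string& name) {
    std::size_t start = 0;
    while (start <= list.size()) {
        std::size_t end = list.find(':', start);
        if (end == std::string::npos) end = list.size();
        const std::string pattern = list.substr(start, end - start);
        if (wildcardMatch(pattern.c_str(), name.c_str())) return true;
        start = end + 1;
    }
    return false;
}

// Simplified model of a filter of the form POSITIVE[-NEGATIVE].
bool filterSelects(const std::string& filter, const std::string& name) {
    const std::size_t dash = filter.find('-');
    std::string positive =
        (dash == std::string::npos) ? filter : filter.substr(0, dash);
    const std::string negative =
        (dash == std::string::npos) ? std::string() : filter.substr(dash + 1);
    if (positive.empty()) positive = "*";  // "-X" means "everything except X"
    return matchesAny(positive, name) &&
           (negative.empty() || !matchesAny(negative, name));
}
```

Under this model a filter such as "-*meanShift*" keeps every test except those with meanShift in the name, which is a handy way to skip the crashing cases while running the rest.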
OpenCV GPU module
The library implements accelerated versions of other areas of OpenCV - image processing, image filtering, matrix calculations, features-2D, object detection, and camera calibration. The API and data structures are defined in the nested namespace cv::gpu::. The acceleration makes use of both the NPP API and CUDA parallelization.

Ran a few OpenCV GPU samples that could readily be compared with non-GPU ones
  • surf_keypoint_matcher vs matcher_simple: speed up from 46 secs to 6 secs with the graffiti image from VGG set.
  • mofology vs morphology2 : not very obvious in my quick test. Still noticeable when changing the element shape with Open/Close set at 17 iterations.
  • hog_gpu vs peopledetect : speed up from 67 to 17 secs with my 5M-pixel test image.
  • cascadeclassifier_nvidia_api vs cascadeclassifier (GPU) vs facedetect (no-nested-cascade) : overall (secs): 5.1 / 4.8 / 4.5; detection-only (secs): 1 / 1 / 3.1

Saturday, July 30, 2011

CUDA 4 installation

OpenCV requires CUDA 4 Toolkit. Moreover, CUDA 4 supports VS 2010 while CUDA 3.2 only supports up to VS 2008.

Installation - README_SDK_Release_Notes.txt
Requires NVIDIA driver version 270+. Upgraded the driver for my GeForce 310M.
Installed 64-bit CUDA Toolkit 4.0
Installed GPU Computing SDK.
Also downloaded BuildCustomization FIX just in case.

Verify the installation - CUDA_C_Getting_Started_Windows.pdf
Simply follow the Getting Started guide.
Some hiccups to resolve -
  1. Error building 64-bit 'cutil' - Changed the Toolset configuration to Windows SDK 7.1 in order to get the $(WindowsSdkDir) macro pointing to 64-bit instead of 32-bit. http://stackoverflow.com/questions/3599079/windowssdkdir-is-not-set-correctly-in-visual-studio-2010
  2. Error building shrUtils - the source and include files are misplaced. Copied the required sources, as listed in the VC++ Project file, from "C/Common/src". Added "C/Common/inc" to the shrUtils "Additional Include Directories" to pick up the misplaced headers. http://forums.nvidia.com/index.php?showtopic=197097
Now able to build and run the tests suggested in the Getting Started guide.

Applied the BuildCustomization Fix
Not essential here, but I definitely came across the problem in later experiments, so it is still a fix needed for this toolkit. A description of the problem is in the README file that comes with the patch. Quite straightforward, actually.

Friday, July 29, 2011

Install OpenCV 2.3.0 for CUDA

Decided to try out what OpenCV + CUDA is like. Prefer to start with 2.3.0 - quite a few fixes since 2.2.0 in this area.

Simple Build and Run of OpenCV 2.3.0 Release
Download OpenCV-2.3.0-win-src package.
Typical configuration with CMakefile for VS2010 Express - with C samples.
Build 32-bit - run video-starter to check video-capture is working - YES.
Build 64-bit - run video-starter to see video-capture is working - YES.

Build API documentation to understand GPU module
Downloaded and installed MiKTeX 2.9 Portable version to C:\Program Files (x86).
Configure CMake: BUILD_DOCS=yes and set MIKTEX_BINARY_PATH = < miktex/bin directory >
Needs the Sphinx Python module (sphinx-build.exe) to satisfy the CMake configuration for building documentation
  • Install Python 2.6.5 (Win32)
  • Install setuptools
  • Download sphinx python egg
  • Run easy_install (from setuptools) on the egg.
  • Run CMake config again, and specify the exact path to sphinx. This entry appears now that the Python installation is detected.
Open OpenCV.sln with VS2010 Express - the ALL_BUILDS configuration currently excludes the 'docs' and 'html_docs' projects.
Compile 'docs' first - error at the end(?) saying pdflatex.exe: Access is denied. Seems like pdflatex is trying to write some file within the MiKTeX program directory. Next time I should move the MiKTeX Portable directory to C:\ProgramData instead.
Build 'html_docs' - finished OK; the HTML API docs now appear in the build directory.

Also noticed:
a WITH_OPENNI option in CMake - and there is a kinect_sample.

Thursday, July 28, 2011

Speed up with Intel Integrated Performance Primitives (IPP)

Current OpenCV support
OpenCV 2.3 : IPP 7.0
OpenCV 2.2 : IPP 5 - 6.1
The directory structure of IPP 7.0 and IPP 6.1 is different. And the OpenCV 2.2 CMakefile only checks for IPP library versions up to 6.1.

It has both commercial and non-commercial licenses. A single-user commercial license costs $199 + $80 for annual renewal. The non-commercial version only supports Linux.

Tried the 30-day evaluation of non-commercial version. Surprised it requires 2GB disk space for installation. The package itself is about 237MB.

What is IPP?
Relationship with OpenCV according to Intel - outdated with respect to current OpenCV code.
IPP uses OpenMP for parallelization
Did not spend time going into details. My impression is that it provides a lot of OpenCV routines in image processing, camera calibration, and optical flow, and quite a few of them have a 1-to-1 mapping to the corresponding function in IPP. Surprised to see that so few areas actually use IPP now. Based on what I read in the discussion forum, IPP sped up OpenCV 1.x a lot, before OpenCV had SSE acceleration.
SSE is instruction-set-level hardware acceleration. IPP is a library that implements algorithms that take advantage of SSE.

How is it used
Did a quick search for 'HAVE_IPP' in the OpenCV 2.2 source code. These are the areas where IPP currently appears to be relevant: dxt.cpp (e.g. DFT), haar.cpp, hog.cpp, LK-optical-flow, HS-optical-flow.
Something I noticed is how the IPP library is loaded at run time. There are a few linking methods to choose from at build time. I think what OpenCV 2.2 uses is what is called Static Linking with Dispatching - ippStaticInit(). At build time OpenCV links in a static library which provides a 'jump-table' to the actual dynamic library at run time. The IPP routines are optimized differently across processors based on their capabilities. ippStaticInit() chooses the suitable library to load based on the processor it's running on. See the IPP User Guide for details.

CMakeFile Configuration
Downloaded and installed IPP 6.1 Build 6 on Linux.
Point IPP_PATH CMake variable to the IPP lib directory, not IPP root directory.

dft sample - the time taken to complete the dft() function dropped from 7XX ms to 3XX ms for a 5M-pixel image file.
facedetect sample - 5% speed-up of the face-detection function using haarcascade_frontalface_alt.xml. Some classifiers will not trigger the IPP-enhanced code. See haar.cpp for the conditions required.

Wednesday, July 27, 2011

Parallelizing Loops with Intel Threading Building Blocks

According to OpenCV Release Notes, the code will use Intel TBB (2.2+) instead of OpenMP.

Overview of TBB.

It is basically a library of C++ templates for concurrency. It covers concurrent containers (random-access and even sequential ones) and parallel iteration. It also provides a concurrent memory allocator (reducing cache-line collisions). Moreover, it has a scheduler for light-weight tasks instead of the more 'bulky' threads.

Paraphrasing from the TBB FAQ section: use OpenMP for C programs and "large, predictable data parallel problems"; use TBB for C++ and "less-structured and consistent parallelism". And of course OpenMP works at the compiler level, meaning it relies on compiler support.


Intel uses dual licensing for the TBB library: a commercial license with support and other goodies, and a no-frills open-source license. The commercial license costs $299 for the first year, and $120 for renewal. The open-source license is GPLv2 with a C++ Runtime-Exception clause. I think it means that instantiating TBB templates does not require you to open-source your program, unless you modified TBB itself.

Use in OpenCV 2.2 - (2.3 should have more in ML module)

internal.hpp: Implements a tiny subset of TBB in serial fashion when HAVE_TBB is undefined: inline templates cv::parallel_for(), cv::parallel_do() and cv::parallel_reduce(), plus ConcurrentVector, Split and BlockedRange.
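To show what such a serial fallback might look like, here is a minimal sketch. It is written from scratch in a hypothetical namespace and is not copied from internal.hpp, so the real OpenCV declarations may well differ. The point is only that the calling code stays the same whether or not TBB is available: without TBB, the "parallel" loop hands the whole range to the body in one call.

```cpp
#include <vector>

namespace sketch {  // hypothetical namespace, not OpenCV's

// Half-open index range [begin, end), modelled loosely on tbb::blocked_range<int>.
class BlockedRange {
public:
    BlockedRange(int b, int e) : begin_(b), end_(e) {}
    int begin() const { return begin_; }
    int end() const { return end_; }
private:
    int begin_;
    int end_;
};

// Serial stand-in: the body processes the entire range in a single call.
template <typename Body>
void parallel_for(const BlockedRange& range, const Body& body) {
    body(range);
}

}  // namespace sketch

// Example body: doubles every element of a vector inside the given range.
struct DoubleInPlace {
    explicit DoubleInPlace(std::vector<int>& v) : values(&v) {}
    void operator()(const sketch::BlockedRange& r) const {
        for (int i = r.begin(); i < r.end(); ++i) {
            (*values)[i] *= 2;
        }
    }
    std::vector<int>* values;
};
```

With real TBB the same body would be invoked several times on sub-ranges from different threads, which is why the body must only touch the elements inside the range it is given.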

Summary by template:
  • parallel_for: boost, cascadedetect, distransform, haar, hog, lkpyramid, stereobm, surf
  • parallel_reduce: rtrees.cpp, tree, features2d::evaluation.cpp
  • parallel_do : stereobm; won't end until all the items in the list are processed, even when new items are added in operator(); downside - no random access, meaning no concurrency.
  • ConcurrentRectVector: cascadedetect, haar, hog

Configure and Build - OpenCV 2.2 Win32 + TBB 3.0

  • Picked up some changes from OpenCV trunk to add TBB support for VS2010 in CMakeFile.
  • The rest is straight-forward. http://opencv.willowgarage.com/wiki/TBB
  • Run sample facedetect (CascadeClassifier):
    Shortens time required to detect faces and mouths (nested) from a selected picture: 34 sec -> 14 sec
  • Run sample peopledetect (HOG descriptor):
    Half the time taken: 1.5 -> 0.7 secs

Configure and Build - OpenCV 2.2 Linux GCC 4 + TBB 3.0
  • Must specify TBB_LIB_DIR all the way to the exact location (e.g. <tbb-top-level-lib-dir>/ia32/<gcc-version>)
  • traincascade finishes in only around half the time: 2:14:13 -> 1:13:08.

Tuesday, July 26, 2011

Parallelizing Loops with OpenMP

According to the OpenCV Release Notes, OpenMP has not been actively supported since OpenCV 2.1. It has been replaced by Threading Building Blocks (TBB).

OpenMP relies on #pragma directives that tell the compiler to parallelize loops / code blocks. The changes to existing code are small compared to other methods.

_OPENMP will be defined by compiler that supports OpenMP.
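A minimal sketch of both points, using a made-up function rather than anything from OpenCV: one pragma parallelizes the loop, and _OPENMP guards the code that needs the OpenMP header. Built without OpenMP support (for GCC, without -fopenmp) the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sum of squares. Iterations are independent apart from the running total,
// which the reduction clause combines safely across threads.
double sumOfSquares(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}

// Number of threads OpenMP may use, or 1 when compiled without OpenMP.
int availableThreads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

The loop body is untouched by the parallelization, which is what makes OpenMP attractive for retrofitting existing code.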

Searching for _OPENMP in the source code uncovers the current 'leftover' implementations here:
  • selfsimilarity.cpp (disabled with #if 0 block)
  • spinimages.cpp ( parallel for in computeSpinImages() )
  • stardetector.cpp ( parallel for in icvStarDetectorComputeResponses() )
  • system.cpp (getNumThreads(),...)
  • cvboost.cpp ( parallel private in cvCreateMTStumpClassifier() )
  • cvhaartraining ( uses CV_OPENMP instead of _OPENMP. Moreover, it's only enabled with MSVC and ICC compilers, not GCC )
  • blobtrackanalysisior.cpp: parallel CvBlobTrackAnalysisIOR::Process()
  • blobtrackingmsfg.cpp: parallel UpdateWeightsMS(), UpdateWeightsCC()

    Experiment with haartraining
    1. Used the Linux build to take advantage of GCC 4 OpenMP support. There are OpenMP options in MSVC 2010 Express, but the Microsoft webpage states there is no support. (http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx)
    2. Recovered ENABLE_OPENMP in the OpenCV 2.2 CMakeFiles by un-commenting its occurrences.
    3. Used CMake-GUI to configure the build with ENABLE_OPENMP turned on.
    4. Called cvGetNumThreads() in haartraining.cpp to see if both CPU cores are available for use.
    • Took half the time to perform the same training. The results are basically the same. The noticeable difference is that the node-split threshold values differ from the 5th decimal place on.
    • cvhaartraining.cpp uses a variant of the OPENMP define - CV_OPENMP. Training terminated with SegFault with that turned on.

    Thursday, March 24, 2011

    What are my GUI options?

    Evaluating my GUI options for Windows PC applications
    1. Windows Forms (.Net) with C++ or C#; .Net port of Emgu Project
    2. MFC (Native) - Visual Studio Standard/Professional and up
    Nokia -> Digia?
    1. Qt: Visual Studio - Qt-add-in to take care of the Qt-specific compilation process. Visual Studio Standard/Pro and up.
    2. Qt: QT Creator
    Win32++: Does it have widget like OpenFileDialog?

    1. HighGUI + Qt as back-end
    2. There is a window_gtk.cpp, wonder if it works on Windows also.

    Initial thoughts
    Based on what I read from various internet discussions, I could use another GUI framework while keeping the highGUI module for image and (camera) video encode / decode / read / write. The only code I would need to write is an implementation of cvDrawImage().
    I like C# but am afraid that .Net would slow down execution. And I wonder if a C# port would introduce another set of issues/bugs.
    The advantage of choosing Qt is that cvDrawImage() is already there in window_QT.cpp.

    Going with Qt

    Problem with MSVC OpenCV DLL
    Installed QTCreator. Created a simple app that opens a file with the QT File Dialog. It uses the OpenCV API to open an image file. Result: the C API compiled and ran fine (cvLoadImageM()). Using the corresponding C++ API gave a linking error (undefined references to cv::imread()). Made the case simpler by trying cvGetTickCount(). Again the C API worked while the C++ function cv::getTickCount() did not. This is the configuration I used for building the application: http://www.barbato.us/2011/03/18/using-opencv2-within-qtcreator-in-windows-2/
    Tried to fix the problem with the 'reimp' MinGW utility like this: http://www.mingw.org/wiki/MSVC_and_MinGW_DLLs
    Unfortunately, 'reimp' did not like opencv_core220d.lib, giving the error 'invalid or corrupt import library'. Someone filed a bug report on this, not sure if that's the same case http://sourceforge.net/tracker/?func=detail&atid=102435&aid=3120866&group_id=2435
    Someone points to this article to describe the common issues of linking C++ libraries built with different compilers: http://chadaustin.me/cppinterface.html
    Giving up on this for now.

    Debugger failed to step into OpenCV DLL
    Even though calling the OpenCV C API worked, I was unable to get the debugger to step into OpenCV functions. Installed 32-bit Debugging Tools for Windows (x86). Enabled CDB in the QTCreator Options menu. Followed instructions here: http://msdn.microsoft.com/en-us/windows/hardware/gg463016.aspx ; Still cannot get it to work.

    OpenCV MinGW Build
    Since QTCreator compiles with the MinGW toolchain, I rebuilt OpenCV 2.2 with the latest stable version of MinGW (not the one packaged with QTCreator). Instructions here: http://opencv.willowgarage.com/wiki/MinGW ; Encountered an error linking with VideoInput.a - the workaround is described here: https://code.ros.org/trac/opencv/ticket/587
    OpenCV was then built successfully. There are some compilation errors with the C/C++ samples. g++ doesn't like:
        vector<vector<Point>> cpoints
    So I changed it to this:
        vector<vector<Point> > cpoints
    Fixes are trivial. I am now able to run both the adaptiveskindetector and camshiftdemo from the command-line. It can even read from the webcam right away!

    MinGW versions
    QtCreator uses its own MinGW. The gcc version is 4.4. The latest MinGW that I used to build OpenCV is gcc 4.5.2. The mismatch caused a problem at runtime. I fixed it by setting the PATH such that the newer MinGW DLLs are picked up.
    Debugger: Unusably slow stepping in OpenCV DLLs. Wonder why.... Related to this?! http://bugreports.qt.nokia.com/browse/QTCREATORBUG-3115
    The qmake manual says automatic code completion and syntax highlighting will be available for external libraries after they are declared in the INCLUDEPATH and LIBS variables. I noticed that it only works if the directory is assigned to INCLUDEPATH as a literal string, not a variable value $().

    At the end...
    I am able to get the application to open an image and display it correctly. It uses the OpenCV C++ API to load the image and swap the color channels from BGR to RGB. The display part uses QGraphicsScene and QGraphicsView to show the QImage.
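The channel swap is needed because OpenCV loads color images in B, G, R byte order while the Qt image expects R, G, B. In OpenCV itself this is a cv::cvtColor() call with CV_BGR2RGB; the sketch below just shows what that amounts to on a raw buffer. The function name is my own, and it assumes a tightly packed 8-bit, 3-channel buffer with no row padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Swap the first and third byte of every 3-byte pixel in place,
// turning B,G,R order into R,G,B (and vice versa).
void swapBgrToRgb(std::vector<std::uint8_t>& pixels) {
    for (std::size_t i = 0; i + 2 < pixels.size(); i += 3) {
        std::swap(pixels[i], pixels[i + 2]);
    }
}
```

If the swap is skipped the picture still displays, only with red and blue exchanged, which is an easy symptom to recognize.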

    Here is a hyperlink found in window_QT.hpp, a Qt article on how to interactively pan and zoom images smoothly

    Wednesday, March 9, 2011

    One Way Descriptor Matching

    Might be useful to know that Lepetit and Fua, who co-authored the Ferns and Randomized Trees Key-Point Classifier technique, are also contributors to One Way Descriptor paper.

    The team devises a new patch descriptor that combines both offline and online training. Descriptor extraction is not needed for run-time query. That is why it is 'One-Way'. The goal was to save time for feature-point matching and real-time object-tracking from video. The team's experience with SLAM suggests that this technique works well for objects that lack distinguishing texture.

    Training: Try to find an image with a frontal view of the object-to-track. The idea is to train a classifier with the same set of feature points viewed from multiple poses. The key-point patches come from only a few input images. They are expanded into many warped versions. In the end, a key-point patch is represented by a set of mean-patches. Each mean-patch represents one single pose. At each single pose, the image is again expanded into many poses. Only this time the variations are small and centered on the associated pose. A mean-patch is the 'average' of patches that 'differ' a little bit around that pose.

    Matching: Directly compare an incoming patch with the mean-patches from all the poses of all the patches in the database. (How does that work? Comparing pixels to PVs?) The patches are normalized before comparison. (Some heuristics, like a K-D Tree, speed up the search.) The search returns not only a mean-patch but also its associated 'coarse' pose.
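To pin down my own reading of the training and matching steps, here is a small brute-force sketch. It is not the OpenCV implementation: the type and function names are mine, the zero-mean / unit-norm normalization and the squared Euclidean distance are assumptions about what "normalized" and "compare" mean, and all patches are assumed to have the same size.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<float> Patch;  // pixel values in row-major order

// Pixel-wise average of equally sized patches
// (e.g. the many small warps generated around one pose).
Patch meanPatch(const std::vector<Patch>& patches) {
    Patch mean(patches.empty() ? 0 : patches[0].size(), 0.0f);
    for (std::size_t p = 0; p < patches.size(); ++p) {
        for (std::size_t i = 0; i < mean.size(); ++i) {
            mean[i] += patches[p][i] / static_cast<float>(patches.size());
        }
    }
    return mean;
}

// Zero-mean, unit-norm copy of a patch, so brightness and contrast cancel out.
Patch normalized(const Patch& in) {
    Patch out(in);
    if (out.empty()) return out;
    float mean = 0.0f;
    for (std::size_t i = 0; i < out.size(); ++i) mean += out[i];
    mean /= static_cast<float>(out.size());
    float norm = 0.0f;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] -= mean;
        norm += out[i] * out[i];
    }
    norm = std::sqrt(norm);
    if (norm > 0.0f) {
        for (std::size_t i = 0; i < out.size(); ++i) out[i] /= norm;
    }
    return out;
}

// Index of the mean-patch closest to the query, or -1 if there are none.
// The index identifies both the key-point and its coarse pose.
int nearestMeanPatch(const std::vector<Patch>& meanPatches, const Patch& query) {
    const Patch q = normalized(query);
    int best = -1;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t m = 0; m < meanPatches.size(); ++m) {
        const Patch c = normalized(meanPatches[m]);
        float d = 0.0f;
        for (std::size_t i = 0; i < q.size(); ++i) {
            const float diff = q[i] - c[i];
            d += diff * diff;
        }
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(m);
        }
    }
    return best;
}
```

The linear scan is where a K-D tree or similar index would take over once the database holds many key-points times many poses.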

    Speed up the calculation of mean-patches
    The authors make use of a linear method of computing mean-patches, which they consider preferable to other blurring techniques such as Gaussian. The perspective transforms needed to make mean-patches take up most of the training time, too slow to be used for online training. According to the authors, it takes 300 patches to compute the mean in order to get good results. A 'math trick' allows the training to be split into two parts.

    Offline Training
    Principal Components Analysis is used for offline training. A reference patch is broken down into a mean and L principal vectors (L is user-defined). So instead of warping the image patch in terms of pixel arrays, the warps are applied to the mean and the principal vectors. The mean-patch is a weighted sum of the average-warped mean and the average-warped PVs.

    Online Training
    With the offline training having done the heavy lifting, the mean-patch calculation now only requires time proportional to the number of PV components, not the number of 'small-warps'. The major work left in online training is to deduce the 'weights' for each new patch. The patch is projected into the eigenvector space to solve for a set of coefficients (weights). These are used to compute a mean-patch - the linear sum. (But which feature-point-pose eigenvector-space to use?!)
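Written out, the 'math trick' as I understand it looks like the following. The notation is my own rather than the paper's, and it relies on the warp being linear in the pixel values and on the principal vectors being orthonormal.

```latex
% Notation (mine): p = new patch, \bar{p} = PCA mean, v_l = l-th principal vector,
% w(\cdot, H) = warp by homography H, H_{h,1}, \dots, H_{h,N} = small warps around pose h.

% PCA approximation of the new patch, with weights obtained by projection:
p \approx \bar{p} + \sum_{l=1}^{L} a_l \, v_l ,
\qquad a_l = v_l^{\top} (p - \bar{p})

% Mean-patch for pose h, using linearity of the warp:
m_h(p) = \frac{1}{N} \sum_{j=1}^{N} w(p, H_{h,j})
\approx \underbrace{\frac{1}{N} \sum_{j=1}^{N} w(\bar{p}, H_{h,j})}_{\text{precomputed offline}}
+ \sum_{l=1}^{L} a_l \,
  \underbrace{\frac{1}{N} \sum_{j=1}^{N} w(v_l, H_{h,j})}_{\text{precomputed offline}}
```

Online, only the L weights have to be computed for a new patch, after which each pose costs L multiply-adds per pixel instead of N warps.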

    Demo Application (one_way_sample.cpp)
    The demo code makes use of OneWayDescriptorBase (not the OneWayDescriptorMatcher).

    Offline phase: Build the mean-patches from a set of 2 images (same chessboard viewed from 2 different angles). The number of 'dimensions' (L) is set to 100 by default. The image patches are of size 24x24. The OpenCV implementation uses SURF to detect key-points from the training images. And it builds 2 versions, the second at half the specified patch dimensions (12x12). The paper mentions something about this to improve consistency. There will be 50 random poses for each patch. The result is saved to the pca.yml file, and it is loaded back in as PCA data, as an array of OneWayDescriptor. I am so far unable to find the definition of how many 'small-pose-changes' the 'mean' is computed from.
    Online phase: SURF detector is used to detect key-points from the reference input image. The OneWayDescriptorBase would compute One-Way Descriptors for these key-points.

    A SURF detector is used to find key-points in the second (incoming) image. Incoming key-points are queried one-by-one against the first (reference) key-points with OneWayDescriptorBase::FindDescriptor(). The matching criterion is a distance threshold. The pose is also available.
    At the end a correspondence map will be drawn and displayed.

    • 42 keypoints detected from the training images (scene_l.bmp, scene_r.bmp)
    • Took 2 hours to create 100 PCA components.
    • Reference image descriptors prepared in 2.1 ms
    • Matching 37 keypoints takes 2.2 ms [ result is good but I guess it's too simple ]
    Tried matching box.png(train) and box_in_scene.png(query): 72 keypoints
    • Beware pca.yml was produced under training directory while it was loaded from working directory.
    • The matching result is so-so, 1/3 of them are false-matches.
    Tried img1.ppm and img2.ppm set from Dundee Graf set: 502 keypoints
    • Took several minutes to do matching.
    • Cannot tell if it is good or not with so many correspondences on screen.
    Demo (generic_descriptor_match.cpp)
    • Able to load the PCA data (with a bug discovered) from previous example.
    • It does not work - at least not the way it is supposed to. First, GenericDescriptorMatcher::match() calls OneWayDescriptorMatcher::clone() with the parameter indicating that the training was not performed. That means the OneWayDescriptorMatcher is instantiated again, discarding the PCA data loaded with GenericDescriptorMatcher::create(). I noticed this when the training part took too long. And the function Initialize() is called instead of InitializeFast() inside OneWayDescriptorBase::InitializeDescriptor().

    More note: There is trouble writing the PCA file (150MB) from the PC. It stops (without warning) at about 7MB. It was still able to do the matching (I suppose the data is cached). No such problem running from the notebook.

    Further note: The paper is too brief for me to understand totally, especially on how to learn new key-points. It seems kind of like magic: training with unrelated images produces some mean-patches that are used to compare 2 other pairs of images. Is it supposed to work like this?!

    Real-Time Learning of Accurate Patch Rectification, Hinterstoisser et al.

    Future Reading
    • Simultaneous recognition and homography extraction of local patches with a simple linear classifier, Hinterstoisser et al.
    • Online learning of patch perspective rectification for efficient object detection, Hinterstoisser et al.

    Sunday, March 6, 2011

    Random Ferns Classifier - Semi-Naive-Bayes

    The team that proposes training Randomized Trees on Binary Descriptors for fast key-point matching is trying another approach to speed up training and matching. This time they use Semi-Naive-Bayes classification instead of Randomized Trees. The word 'semi' here means that not all the input elements are independent. The input vector would be divided into groups. Only the probability densities among groups are assumed to be independent. The grouping is selected by randomized permutation. Input vector is extracted from a key-point region using binary-intensity-differences. 300 of them will be extracted from a 32x32 patch region. A typical group-size is 11, so there will be about 28 groups. Each group is a 'Fern', so it's called a Fern Classifier / Matcher. An input patch will be characterized by the SNB classifier to one of the classes - set of stable key-points. A product of posterior probabilities is calculated given a class label is true. The input patch would be classified to the one of highest value.

    Training Phase: Very similar to Randomized Trees. Only a few training images are required. A set of stable key-points is chosen by transforming the input images in many ways (300). These stable key-points become the class labels. Each image is then transformed again many more times (1000) to obtain the view-set. The classifier keeps count of each Fern pattern (vector of binary-intensity-differences of a group of pixel-pairs) for each associated class label. The counts are used to set the class-conditional probabilities.

    The training and testing for 2D matching is done on a video frame sequence. The frame with upright front facing object is chosen for training.

    An implementation decision has to be made on how to divide up the input vector into groups, i.e. the Fern size. Increasing the Fern size yields better handling of 'variations'. (Is this referring to perspective, lighting variants?) Care must be taken with respect to memory usage: the amount required to store the distributions increases quickly with Fern size. It would also need more training samples (to build distributions of a bigger set of possible values?). On the other hand, increasing the number of Ferns while keeping the same Fern size (small?) (increased vector size?) gives a better recognition rate. This comes with only a linear memory increase, but the run-time cost increases (relevant?!).
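
    A back-of-the-envelope calculation shows why the Fern size matters so much for memory. A Fern of size S has 2^S possible patterns and needs one table entry per pattern per class. The class count (200) and the 4 bytes per entry below are assumed figures for illustration only.

```python
def fern_table_bytes(num_ferns, fern_size, num_classes, bytes_per_entry=4):
    """Memory for the distributions: one entry per Fern pattern per class."""
    return num_ferns * (2 ** fern_size) * num_classes * bytes_per_entry

# Assumed example: 200 classes, 4-byte entries.
base = fern_table_bytes(num_ferns=28, fern_size=11, num_classes=200)
more_ferns = fern_table_bytes(num_ferns=56, fern_size=11, num_classes=200)
bigger_ferns = fern_table_bytes(num_ferns=28, fern_size=12, num_classes=200)

print(base / 2**20)         # 43.75 (MiB) for the base setting
print(more_ferns / base)    # doubling the number of Ferns doubles the memory
print(bigger_ferns / base)  # a single extra test per Fern also doubles it
```

    So the table grows linearly with the number of Ferns but exponentially with the Fern size.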

    There is a paper on mobile AR application using Ferns - Citation 34 "Pose Tracking from Natural Features on Mobile Phones", Wagner et al.

    Demo (find_obj_ferns)

    This demo uses the LDetector class to detect object keypoints. And it uses PlanarObjectDetector class to do matching. FernsClassifier is one of PlanarObjectDetector members.
    1. Determine the most stable key-points from the object image (by recovering the key-points from affine-transformed object images).
    2. Build 3-level Image Pyramid for object image.
    3. Train the stable key-points with FernsClassifier and save the result to a file. The image pyramid is also supplied for training. Parameters include Ferns size, number of Ferns, Patch size, and Patch Generator.
    4. Load the PlanarObjectDetector from the file obtained from the last step.
    5. Use the LDetector to find keypoints from the scene image pyramid. Match them against the object key-points using the PlanarObjectDetector. The results are represented as index-pairs between the model-keypoints and the scene-keypoints. The model-keypoints are the stable keypoints of object-image. The list is available from the loaded PlanarObjectDetector instance.
    6. Draw the correspondences on screen.
    More notes: Object and Scene Images are loaded as grayscale. And they are smoothed with a Gaussian Filter.

    Demo (generic_descriptor_matcher)

    The simplest way to exercise Ferns matching is to use the FernDescriptorMatcher class. The demo program is very straightforward. The find_obj_ferns demo app is more informative.

    Results and Observations

    Using ball.pgm (book is pictured sideways) from Dundee test set as training image.

    In most cases, it is able to find and locate the object correctly in the scene images it appears in. The worst result is TestImg010.jpg. It cannot locate the whole upside down book. I suppose that is because of the lack of keypoints detected. The book title "Rivera" is obscured.

    Test for false-positive using TestImg02.pgm. The detector returned a status of 'found' but it was obviously wrong. Half of it is 'out-of-the-picture'.

    Fast Keypoint Recognition using Random Ferns, Ozuysal et al.

    Calonder Descriptor, Generic Trees and Randomized Trees

    Summary of both papers (see Reading)

    The first paper proposes to use ML classification techniques to do fast keypoint matching. Time is spent in offline training, resulting in a shorter matching time. They found that Randomized Trees are a good candidate: they support multi-class classification and, by experiment, give good matching results. If the node-split criteria are also chosen randomly (Extremely Randomized Trees?), not only is the training time reduced, but the matching results are also better. The forest classifies an input key-point against the set of training key-points. So given a key-point from an input image, the forest is able to tell whether it matches one of the original keypoints (or none-of-the-above). The training key-points are chosen by their ability to be recovered from a set of distorted views.

    Additionally, they found that using a simple key-point descriptor (that inspired the BRIEF descriptor?) is good enough to achieve illumination invariance. The paper devises a way to achieve invariance with this simple descriptor by 'reproducing' each training image multiple times into a 'view set'. Each original image is randomly rotated and scaled into 100 separate images for the 'view set'. All images from the view-set are used to train the classifier so that it is able to detect the same keypoint patch from various viewpoints at run-time. Scale invariance is improved by building image pyramids from which key-points are also extracted. (Small) position invariance is enhanced by injecting random 'noise' (in terms of positions?) into the view-set. The training patches are placed on random backgrounds so that the classifier can pick out the trained key-points from a cluttered background. A Binary Intensity Descriptor like this, together with the view-set, performs very well against the sophisticated SIFT descriptor, provided that the random trees are able to divide the keypoint space up to a certain granularity.
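
    The view-set idea can be sketched as follows. This only tracks where a key-point lands under each random rotation and scale; it does not warp pixels. The function names and the angle / scale ranges are assumptions for illustration, not values from the paper.

```python
import math
import random

def random_view_params(rng, max_angle_deg=180.0, min_scale=0.6, max_scale=1.5):
    """Draw one random rotation angle (radians) and scale; ranges are assumed."""
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(min_scale, max_scale)
    return angle, scale

def warp_point(point, center, angle, scale):
    """Rotate and scale a key-point location about the image center."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (center[0] + scale * (cos_a * dx - sin_a * dy),
            center[1] + scale * (sin_a * dx + cos_a * dy))

def make_view_set(keypoint, center, num_views=100, seed=0):
    """Return (angle, scale, warped key-point) for each synthetic view."""
    rng = random.Random(seed)
    views = []
    for _ in range(num_views):
        angle, scale = random_view_params(rng)
        views.append((angle, scale, warp_point(keypoint, center, angle, scale)))
    return views
```

    Because the transform of every view is known, the patch around the warped location can be labeled with the original key-point's class without any manual annotation.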

    The second paper is a continuation of the first paper. The focus this time is on recognizing objects by key-point matching fast enough to use in real-time video. An important application is SLAM, where offline learning is not practical as objects cannot be learned ahead of time. The authors propose a Generic Tree Algorithm. First, a randomized tree classifier is trained with a set of key-points called the 'base-set'. The key-points are selected from only a small number of training images. Similarly, the images are warped in many ways for training to achieve invariance. At run time, a query key-point is passed down the classifier, resulting in a set of probabilities for the n classes. This set is treated as a Signature (descriptor-vector) for this key-point. The Signature has n elements (corresponding to the n classes). Each element is a thresholded value of the class output. The matching between key-point signatures is done using Euclidean Distance. The theory is that the two key-points of any new correspondence pair will have similar classifier output, even though they do not belong to the base-set.
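
    A minimal sketch of the signature idea, assuming the classifier output is already available as a list of n class probabilities. The function names and the threshold value are made up for illustration.

```python
import math

def make_signature(class_probabilities, threshold=0.1):
    """Zero out weak class responses; the threshold value is an assumption."""
    return [p if p >= threshold else 0.0 for p in class_probabilities]

def euclidean(a, b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_signature(query, candidates):
    """Return the index of the candidate signature closest to the query."""
    return min(range(len(candidates)), key=lambda i: euclidean(query, candidates[i]))
```

    Note that the query key-point never has to be one of the n trained classes. Only the shape of its response across the base-set matters.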


    OpenCV defines a CalonderDescriptor class that can produce the Signature of a query point-of-interest from a given set of Randomized Trees. The RTreeClassifier class represents the forest and it is trained with a given base-set and a patch-generator. The base-set is basically a collection of key-point locations from a training image. The size of the base-set is the number of classes the forest is trained to classify. PatchGenerator objects are used to warp an image using the specified ranges - angles for rotation, intensities for background colors, deltas for position-noises, lambda for scales.

    Demo code (find_obj_calonder.cpp)

    Dataset - Dundee University set.

    The demo code trains a Randomized Tree Classifier of 48 trees, each 9 levels deep. It took more than 30 minutes to train 176 keypoints (selected out of 1423) from a single image. PatchGenerator creates 100 views for each key-point. The classifier is saved to a file after training. At run-time, it uses SURF to pick interest points from the reference and query images, extracts Calonder Descriptors with the classifier and performs Brute-Force (L2) matching. By default the input image from the command-line argument is used as the reference image. The query image is a warped version of itself. All images are converted to gray-scale for training and tests.

    Wrote another test function so that instead of warping the input image, user supplies another image for matching.

    Results and Observations
    • Trained (one-at-a-time) with upright object-only image: book1, book2, ball
    • Finding the object image from the object-in-a-bigger-picture images did not do well. Many false matches.
    • Most time spent on loading the classifier data file (~16MB).
    Used one of the trained classifiers to run the default test-case: matching between a query image and its warped version. The testing images are not related to the training image. The warping is not too severe. The results are satisfactory.

    Site for image databases and associated 3D model (stereo-vision):

    • Keypoint Recognition with Randomized Trees, Lepetit & Fua
    • Keypoint Signatures for Fast Learning and Recognition, Calonder, Lepetit & Fua

    Opponent Color Space Descriptors, Evaluations

    The van de Sande, Gevers and Snoek paper proposes using color to complement the existing intensity-based corner detection and salient region description. The goal is to be able to find more salient key-points, and to represent the surrounding region with a discriminative descriptor for better matching and object recognition.

    The idea is basically to extend the current single-channel methods to support multiple channels. Images get pre-processed with a color-space transform. The authors use Opponent Color Space as an example and describe how to extend the Harris-Laplace Corner Detector. The paper briefly goes over a few color-SIFT descriptors such as OpponentSIFT, WSIFT and rgSIFT.

    The authors compared the degree of color-invariance among those color-SIFT methods. Color Invariance - invariant to illumination highlights, shadow, noise.

    Opponent Color Transformation Steps:
    1. RGB -> Opponent Color Space (O1, O2, O3)
    2. Salient Color Boosting - Normalize the Opponent values with weights 0.850, 0.524, 0.065.
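
    The two steps can be sketched per pixel as below. The opponent transform follows the definition I know from the van de Sande et al. paper. Applying the three weights to O1, O2, O3 in that order is my assumed reading of step 2, and this is not OpenCV's implementation.

```python
import math

BOOST_WEIGHTS = (0.850, 0.524, 0.065)  # weights quoted in step 2

def rgb_to_opponent(r, g, b):
    """Opponent color space: two color-difference channels plus intensity."""
    o1 = (r - g) / math.sqrt(2)
    o2 = (r + g - 2 * b) / math.sqrt(6)
    o3 = (r + g + b) / math.sqrt(3)
    return o1, o2, o3

def boost(opponent, weights=BOOST_WEIGHTS):
    """Scale each opponent channel by its weight (assumed channel order)."""
    return tuple(w * o for w, o in zip(weights, opponent))
```

    For a gray pixel (R = G = B) both color channels O1 and O2 are zero, and the small weight on O3 plays down the remaining intensity information relative to color.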

    W-SIFT - I guess that 'W' is the W invariant property, which is a ratio of some spatial-differential transformed pixel value. The transformation is Gaussian Color Model. I suppose this property would be part of the descriptor, useful in matching.

    OpenCV implements an Opponent Color Conversion. But I cannot find where it does the Saliency Boosting. It does not seem to implement the corner detection using multi-channel images. And it supports descriptors using separate channels implicitly.

    The OpponentDescriptorExtractor expands an existing Descriptor Type with Opponent Color. It does so by repeating the extraction on all 3 channels and concatenating the results together as one big descriptor.

    Demo (descriptor_extractor_matcher.cpp)
    Could not find a dedicated demo for Opponent Color Space, so I borrowed this generic one as a try-out.

    User picks a trio of Detector, Descriptor and Matcher to perform keypoint-matching between 2 images. The second (query) image could come from 2 different sources: 1) user provided image, 2) a 'warped' version synthesized from the first one.
    • Since Opponent Color Descriptor builds on an intensity-based one, specify the latter by piggy-backing, such as OpponentSURF = OpponentColor + SURF.
    • Cross-Check-Matching: Pick out strong matches by including only those appearing on both forward and backward matching.
    • Optionally draw in-lier matches only: Use the original homography or (RANSAC) approximate one from the strongly matched keypoints. Transform the reference key-points using this H. Query key-points must be within a threshold distance from the corresponding warped key-point in order to be considered an inlier match.
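
    Cross-check matching and the inlier test can be sketched as below. This is a brute-force illustration, not the demo's code: the function names are made up and the 3-pixel threshold is an assumed value.

```python
def nearest(query, train):
    """Index of the closest train descriptor for each query descriptor (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(train)), key=lambda j: dist2(q, train[j])) for q in query]

def cross_check_matches(desc1, desc2):
    """Keep (i, j) only if j is i's best match and i is j's best match."""
    forward = nearest(desc1, desc2)
    backward = nearest(desc2, desc1)
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]

def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def is_inlier(H, ref_point, query_point, max_dist=3.0):
    """True if the projected reference point lands near the query point."""
    px, py = project(H, ref_point)
    return (px - query_point[0]) ** 2 + (py - query_point[1]) ** 2 <= max_dist ** 2
```

    In the toy case of descriptors [0], [10], [4] matched against [0.5], [9], the third descriptor also prefers [0.5], but [0.5] prefers [0], so that match is dropped by the cross-check.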
    A secondary function of this application is to demonstrate how to use the Evaluation API from OpenCV. The implementation is in evaluation.cpp under the features2d module.
    • Feature-Detector Evaluation - Given 2 sets of key-points from different viewpoints of the same scene and its homography matrix. The evaluator returns the number of correspondences and repeatability value. Repeatability is basically a ratio of the correspondences to key-points count. It does so by analyzing the overlapping elliptical key-point region between the query key-point and the projected reference key-point.
    • Generic Descriptor Matcher Evaluation - Given a DescriptorExtractor-DescriptorMatcher pair in the form of GenericDescriptorMatcher type, and two-sets of key-points and its homography. The evaluator returns the Recall-Precision Curve. Recall values are the ratio of the Current Correct Matches to Total Correspondences. Precision values are the ratio of Current Correct Matches to Current Total Correspondences. Using the term 'Current' in a sense that each ratio value is associated with a match. They are calculated in the order of descending strength (matching distance). A match is Correct if the overlapping region is big enough, just like how the detector is evaluated.
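
    The recall / 1-precision curve can be sketched like this. It is my reading of the description, not the code in evaluation.cpp: 'current' is taken to mean 'among the matches visited so far', and matches are visited from the smallest distance (strongest) upward.

```python
def recall_precision_curve(matches, total_correspondences):
    """matches: list of (distance, is_correct).
    Returns one (one_minus_precision, recall) point per match."""
    curve = []
    correct = 0
    ordered = sorted(matches, key=lambda m: m[0])
    for seen, (_, is_correct) in enumerate(ordered, start=1):
        if is_correct:
            correct += 1
        precision = correct / seen                # correct among matches so far
        recall = correct / total_correspondences  # correct among all correspondences
        curve.append((1.0 - precision, recall))
    return curve
```

    Recall can only grow as more matches are visited, while 1-precision rises each time a false match is added. That is why the result tables list recall against increasing 1-precision.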
    Results using Graf set from Oxford VGG

    Results (img1 vs img2, 3203 vs 3536 keypoints detected)
    Strong Matches 988 - Inliers 499
    Strong Matches 1323 - Inliers 834

    Results (img1 vs warped-img1 3203 vs 2758 keypoints detected)
    FD Evaluation (took 5 minutes) - repeatability 0.673, correspondences 1719
    Strong Matches 993 - Inlier 565
    GDM Evaluation (took 20 minutes)
    1-precision = 0.0; recall = 0.000966184
    1-precision = 0.1; recall = 0.000966184
    1-precision = 0.2; recall = 0.000966184
    1-precision = 0.3; recall = 0.000966184
    1-precision = 0.4; recall = 0.000966184
    1-precision = 0.5; recall = 0.000966184
    1-precision = 0.6; recall = 0.00966184
    1-precision = 0.7; recall = 0.0995169
    1-precision = 0.8; recall = 0.19372
    1-precision = 0.9; recall = 0.319324
    Strong Matches 1175 - Inlier 761
    1-precision = 0.0; recall = 0.00193237
    1-precision = 0.1; recall = 0.00193237
    1-precision = 0.2; recall = 0.00193237
    1-precision = 0.3; recall = 0.00193237
    1-precision = 0.4; recall = 0.00289855
    1-precision = 0.5; recall = 0.00338164
    1-precision = 0.6; recall = 0.00772947
    1-precision = 0.7; recall = 0.0144928
    1-precision = 0.8; recall = 0.0550725
    1-precision = 0.9; recall = 0.241063

    • Color Descriptors for Object Category Recognition, van de Sande, Gevers and Snoek
    • (Opponent Color Space) Boosting Saliency in Color Image Features, Weijer, Gevers.

    Wednesday, March 2, 2011

    VC++ 2010 Express Migration

    Able to run the C/C++ samples from VC++ 2010 Express on Windows 7, with WebCam working too. But only in 32-bit mode. So in CMake configuration choose Visual Studio 10 without x64.

    Problem for 64-bit target on OpenCV 2.2
    Link error for highgui - see bug 735. The patches have yet to solve the 64-bit build problem for me.
    *update* The problem does not appear any more in OpenCV 2.3. Able to build 32-bit and 64-bit out of the box. Video capture from webcam works with the starter_video sample.
    *update-2* seems like vcvars64.bat is not installed by VC Express by default, causing OpenCV 2.3 gpu module build to fail with error: configuration file '(null)' could not be found for installation at "C:/Program Files (x86)/Microsoft Visual Studio 10.0/VC/bin/../.."
    Solution: follow the simple instructions here to generate the vcvars64.bat. http://www.w7forums.com/vcvarsall-bat-no-64bit-support-vcvars64-bat-missing-t6606.html

    VC++ Directory Settings have changed.
    For me, the tricky part was finding out that the per-user (all projects) settings cannot be viewed or edited until a project (OpenCV in this case) is opened.
    On the property sheet (which was already there in earlier versions of VS):

    Not sure if this is a windows 7 thing - but ran into failure when building OpenCV documentation with buildall script. Fortunately, people have already solved this issue - http://www.mylifestartingup.com/2009/04/fatal-error-unable-to-remap-to-same.html
    I followed all the steps except the 'Reboot' part and it worked.

    Redistributable Binaries
    The VS2010 SP1 32-bit redist binaries are installed to Windows\System32 instead of C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\redist\, and not in a directory structure like x86\Microsoft.VC90.CRT. CMake Config complains about not being able to find the DLLs when turning on the BUILD_PACKAGE option.

    Tuesday, March 1, 2011

    Blob Tracking - Video Surveillance Demo

    A blob tracking system is included in OpenCV
    Code: OpenCV/modules/legacy/
    Doc: OpenCV/docs/vidsurv/

    The blob-tracking code consists of a pipeline for detecting, tracking and analyzing foreground objects. Judging by the name of its folder, it is meant to be a video surveillance system demo. It not only detects and tracks blobs, it also tries to pick out unusual movements with the analyzer.

    The main loop for processing the input video frames is CvBlobTrackerAuto1::Process(). The processing order differs a little from the one in Blob_Tracking_Modules.doc. For example, the BlobDetector is actually run after BlobTracker and Post-Processing.

    Each stage has a few methods to choose from. For example, I could select between FGD and MOG for the FG/BG separation stage.

    Some notes from skimming through the code of the following stages:

    Blob-Detectors (BD_CC, BD_Simple)
    • The file header suggests reading literature by Andrew Senior et al (see Resources section).
    • The purpose of this module is to identify blobs that are 'worthy' of tracking. It does so by first filtering out noise from the foreground mask. The main criterion is that the connected components (CCs) have to move at a reasonable 'speed' (uniform motion). It determines this by keeping a list of candidate connected-components for the last 5 frames. That probably means that the frame-to-frame change of blob location must not exceed a certain amount in order to qualify.
    • Well, I cannot tell the difference between the BD_CC and BD_Simple methods by looking at the code.

    Blob-Trackers (CCMSPF, CC, MS, MSFG, Particle-Filter)
    • Some literature listed at blobtrackingccwithcr.cpp regarding Particle Filter and tracking temporarily obstructed objects. 
    • Some code is there for Multiple Hypothesis Tracking (pBlobHyp) - but execution breakpoints were not triggered during my testing.
    • The CC trackers use Contours and Shapes (Moments) to represent blobs while the MS-based ones use color intensity histograms.
    Both Connected-Component trackers use a Kalman Filter to predict the next blob location. The ProcessBlob() function updates the blob model with the weighted sum of the newly captured value and the prediction. In the case of a blob collision, only the predicted value is used.
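
    The update rule reads roughly like the sketch below. The function name, the (x, y, w, h) state layout and the weight alpha are assumptions for illustration. The real ProcessBlob() may weight things differently.

```python
def update_blob(predicted, measured, alpha=0.5, collided=False):
    """Blend the predicted and the measured blob state (x, y, w, h).
    alpha is the weight given to the measurement (assumed value).
    During a collision only the prediction is trusted."""
    if collided or measured is None:
        return tuple(predicted)
    return tuple(alpha * m + (1.0 - alpha) * p for p, m in zip(predicted, measured))
```

    For example, update_blob((10, 10, 4, 4), (12, 10, 4, 6)) lands halfway between the two states, while passing collided=True returns the prediction unchanged.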

    Collision Checking

    • CC Method: Detect collision by exhaustively examining every Blob (position + size) 
    • CCMSPF Method: Go further by 'resolving' collision with a MeanShift-Particle-Filter.
    All MeanShift trackers below are an instance of CvBlobTrackerList. It holds a list of the corresponding CvBlobTrackerOneMSXXX instances. One tracker for each blob.
    • MS Method: Simple mean-shift tracker for every blob. 
    • MS with FG weights: Foreground pixels are weighted during calculations. A higher value makes the blob's movement and resizing in the model speed up.
    • MS with Particle Filter: 200 particles are allocated for each Tracker (Blob). Each particle represents the same blob moving and resizing a little differently from the others. Its position and size deltas are generated within some preset variances plus some random value. At each frame, the particles are randomized with new values using those parameters. Each particle is associated with a weight, and the weights are updated every frame. They are functions of the Bhattacharyya Coefficients calculated between the current Model Histogram and the Candidate Histogram. The Model Histogram is updated every frame from the blob position and size. The Candidate Histogram is the histogram calculated with the hypothesis particle. A weighted sum of the particles yields the new prediction (position, size). Then the particles are shuffled and the weights are reset to 1. Where is the mean-shift?!
    Condensation Algorithm (Conditional Density Propagation) - used to represent non-Gaussian distributions. When an object is obstructed, there are multiple possibilities of what could happen, and that is not Gaussian. My understanding is that the Particle Filter is able to represent a multi-modal distribution. The distribution is represented by groups of particles. The density of each group represents the probability density of one of the ranges along the x-axis.
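
    The weighting step of the particle filter can be sketched as follows. As a simplification, the Bhattacharyya coefficient is used directly as the particle weight (the OpenCV code uses some function of it), and candidate_hist_of is a hypothetical callback standing in for the histogram extraction.

```python
import math

def bhattacharyya(hist_a, hist_b):
    """Bhattacharyya coefficient of two histograms (1.0 means identical shape)."""
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    return sum(math.sqrt((a / sum_a) * (b / sum_b)) for a, b in zip(hist_a, hist_b))

def estimate_state(particles, model_hist, candidate_hist_of):
    """particles: list of (x, y, w, h) hypotheses.
    candidate_hist_of: callable returning the histogram seen under a hypothesis.
    Returns the weight-averaged state."""
    weights = [bhattacharyya(model_hist, candidate_hist_of(p)) for p in particles]
    total = sum(weights)
    return tuple(sum(w * p[k] for w, p in zip(weights, particles)) / total
                 for k in range(len(particles[0])))
```

    Particles whose candidate histogram looks like the model pull the estimate toward themselves. Several well-separated groups of heavy particles are how a multi-modal distribution shows up.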

    Post-Processing (Kalman Filter)
    Results from Tracking stage will be adjusted by Kalman Filter. That is the Blob Position and Size will be updated.

    Track Generator
    • Record the track (position and size) of each blob to a user-specified file. 
    • The values of both information are represented as a fraction of the video frame size. 

    Blob-Track-Analyzers (Histogram analysis of 2D, 4D, 5D, SS feature-vectors, TrackDist, IOR)
    A 'status' value is maintained for all active blobs: Normal or Abnormal.

    • Tracks of Past Blobs (no longer a foreground in recent frames) are added to track database. They will be used as templates to compare with the active blobs tracks. 
    • Find the closest match from the templates for each active blob in terms of their similarity in position, velocity. 
    • The state is Normal if it could find a reasonably close match.
    Histogram P, PV, PVS
    • Each active blob has a histogram representing its track. There are 3 types of dimensions: 1) position, 2) position and velocity, 3) position, velocity and state-change. As far as I could understand, the state-change represents the number of successive frames during which the blob moves very slowly, so slowly that it is almost stationary between frames.
    • A sparse-matrix is used to store the histogram of these continuous vector values.
    • Nearby histogram bins are smoothed each time a new vector is collected.
    • All histograms are updated at every frame.
    • Past Blobs have their histograms merged into a global histogram. Similarly, it is used to decide whether a particular active blob track is 'normal'.
    Histogram SS
    Similar to P-V-S histogram except that the vector consists only of starting-position and stop-position. A blob would be seen as stopped as soon as the state-change counter reached 5.

    Demo Code (blobtrack_sample.cpp):
    • The demo code puts everything together. In its simplest form, the user supplies an input video file and it displays the 'Tracking' window, marking the moving blobs on the video with a location-size-circle, BlobID and Analyzer status (Abnormal in red, Normal in green).
    • The tracking video can be saved to a video file if a file name is provided.
    • The foreground masks can be shown in a separate window and saved to a video file of the user's choice.
    • If a Track-Generator output file is specified, the precise blob locations at every frame, together with the frame-of-entrance, will be recorded to that file.
    • And there is a general log file showing the user parameters, the same as those that appear on the command console.
    • Although I haven't tried it, the user should be able to pass method-specific arguments. For example, in addition to choosing the Meanshift method for tracking, the user is also able to pass specific parameters. See function set_params().
    Results (from Road-Side Camera Video)

    Command-Line Arguments: bd=BD_Simple fg=FG_1 bt=MSPF fgavi=fgSeparated.mp4 btavi=blobtracked.mp4 bta_data=trackAnalysis.log track=track.log log=blobtracksample.log

    In general I am not too satisfied with the results in terms of tracking. I don't know whether this is expected with the video I have. For example, the blobID for the same car could change as it goes farther from the camera, and vice-versa. The analyzer result is often abnormal for some reason even if the cars are simply going along the road.

    cvBlobsLib: http://opencv.willowgarage.com/wiki/cvBlobsLib

    Learning OpenCV, O'Reilly Press.

    Friday, February 25, 2011

    Learning Deformable Models with Latent SVM

    The sample program only demonstrates how to use the latent SVM for classification. The paper describes the training part in detail. Although I don't understand all of it, here is the summary:

    Latent SVM is a system built to recognize objects by matching both
    1. the HOG models, which consist of the 'whole' object and a few of its 'parts', and
    2. the position of parts.
    The learned positions of object-parts and the 'exact' position of the whole object are the Latent Variables. The 'exact' position is with regard to the annotated bounding box from the input image. As an example, a human figure could be modeled by its outline-shape (whole-body head-to-toe) together with its parts (head, upper-body, left arm, right arm, left lower limb, right lower limb, feet).

    The HOG descriptor for the whole body is the Root Filter and those for the body parts are the Part Filters.

    The target function is the best response obtained by scanning a window over an image. The response consists of the outputs from all the filters. The search for the best match is done in a multi-scale image pyramid. The classifier is trained iteratively using a coordinate-descent method, holding some components constant while training the others. The components are Model Parameters (Filter Positions, Sizes), weight coefficients and error constants. The iteration process is a bit complicated - so much to learn! One important thing to note is that the positive samples are composed by moving the parts around within an allowable distance. There is a set of latent variables for this (size of the movable-region, center of all the movable-regions, quadratic loss function coefficients). Being able to consider the 'movable' parts is what I think 'deformable' means.
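
    A simplified sketch of how one window could be scored. It keeps only a quadratic displacement penalty, and all names and default values are assumptions. The paper's model has more terms, and this is not the OpenCV code.

```python
def part_score(response, anchor, max_shift, cost_a, cost_b):
    """Best part-filter response near its anchor, minus a quadratic
    displacement penalty. response: 2D list indexed [y][x]; anchor: (x, y)."""
    best = float("-inf")
    ax, ay = anchor
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            x, y = ax + dx, ay + dy
            if 0 <= y < len(response) and 0 <= x < len(response[0]):
                penalty = cost_a * dx * dx + cost_b * dy * dy
                best = max(best, response[y][x] - penalty)
    return best

def window_score(root_response, part_responses, anchors, bias=0.0,
                 max_shift=1, cost_a=0.5, cost_b=0.5):
    """Root filter response plus the best displaced response of every part."""
    total = root_response + bias
    for response, anchor in zip(part_responses, anchors):
        total += part_score(response, anchor, max_shift, cost_a, cost_b)
    return total
```

    The chosen displacement of each part is exactly the kind of value that is not annotated in the training data, which is why it is treated as a latent variable.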

    Detection Code

    The latent SVM detector code is located at OpenCV/modules/objdetect/. It seems to be self-contained. It has all the code needed to build HOG pyramids.
    The detection code extracts HOG descriptors from the input image and builds multi-scale pyramids. It then scans the models (root and parts) over the pyramids for good matches. Non-max suppression is used, I think, to remove nearby duplicate matches. A threshold is applied to the score from the SVM equation to determine the classification.
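
    Thresholding followed by non-max suppression can be sketched as below. This is a generic greedy version using intersection-over-union, with made-up names and an assumed overlap limit of 0.5. I have not checked how the OpenCV detector measures overlap.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, score_threshold, max_overlap=0.5):
    """detections: list of (score, box). Keep boxes that pass the threshold and
    do not overlap a higher-scoring kept box by more than max_overlap."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if score < score_threshold:
            break  # everything after this scores even lower
        if all(overlap_ratio(box, other) <= max_overlap for _, other in kept):
            kept.append((score, box))
    return kept
```

    With a greedy scheme like this, two objects that sit close together can end up merged into one detection, because the weaker box is suppressed by the stronger one.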

    Some trained models in matlab file format (voc-release4.tgz and older) are available for download at the website. But how to convert the available matlab files (such as cat_final.mat) to that XML format? There is a VOCWriteXML function in the VOC devkit (in matlab). Wonder if that could help. http://fwd4.me/wSG

    Sample (latentsvmdetector.cpp)
    • Load a pre-built model and detect the object from an input image.
    • There does not seem to be a detector builder in OpenCV.
    • Looking at cat.xml, the cat model has 2 models. They are probably bilaterally symmetric models. Each model has 6 parts. The root filter sizes are 7x11 and 8x10.
    Results (with cat.xml model)
    • [cat.jpg] Took 61 seconds to finish. Able to detect the cat. Two false-positives at the top-right corner.
    • [lena.jpg] Took 77 seconds. It detected Lena's beautiful face (including the purple feather hat and shoulder) ! Two other detected objects: her hat and some corner at the top-left corner of the picture.
    • [tennis-cats.jpg] Took 44 seconds. It detected all 3 cats, although the middle and left cats are treated as one. Those two are closer together.
    • [295087.jpg from GrabCut collection] Took 50 seconds. Somehow classified the Tree and the Rock Landscape as a cat!
    • [260058.jpg from GrabCut collection] Took 76.5 seconds. Detected two false objects: 1) an area of the desert sand (small pyramid at the top edge), 2) part of the sky with clouds near the edges.
    • Without knowing how the model is trained, hard to tell the quality of this detector. http://tech.dir.groups.yahoo.com/group/OpenCV/message/75507; It is possible that it is taken from the 'trained' classifier parameters from the releases from the paper author (voc*-release.tgz).

    Latent SVM: http://people.cs.uchicago.edu/~pff/latent/

    A Discriminatively Trained, Multiscale, Deformable Part Model, P. Felzenszwalb, et al.

    Object Recognition - Bag of Keypoints

    Bag of Keypoints is an object recognition technique presented in the paper "Visual Categorization with Bags of Keypoints". The Bag-of-Keypoints idea is borrowed from Bag-of-Words for text data mining.

    The paper differentiates this 'multi-class' object recognition technique from 'object recognition', 'content-based image retrieval' and 'object detection'.

    Very Brief Summary
    The goal is to recognize 'classes' of objects given an input image. Each object class is associated with a bunch of interest-points (features). In Naive-Bayes terms, we could train a classifier with labeled data to predict the class of object present in an image by looking at the set of detected interest-points. A Linear SVM is also trained as a classifier for comparison. Since it is a multi-class problem, a one-against-all method is used. They trained m SVMs, where m is the number of classes. Each outputs a confidence value on whether the input image belongs to that class.
    The bag-of-words refers to a vocabulary set. The vocabulary set is built by k-means clustering of keypoint descriptors (such as SIFT). A BOW Descriptor is a histogram of vocabulary words, one BOW descriptor per image. Each key-point detected in an image is looked up in the vocabulary, and the corresponding bin of the histogram is incremented. The BOW descriptors (histograms) from the training images are then used to train the SVM classifiers.
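
    Building one BOW descriptor can be sketched as below, assuming the vocabulary (the k-means cluster centers) is already available. The function names and the optional normalization are choices made for this sketch.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry (cluster center) closest to the descriptor."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: dist2(descriptor, vocabulary[k]))

def bow_descriptor(descriptors, vocabulary, normalize=True):
    """Histogram of vocabulary-word occurrences for one image."""
    histogram = [0.0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1.0
    if normalize and descriptors:
        histogram = [h / len(descriptors) for h in histogram]
    return histogram
```

    Whatever the number of key-points found in an image, the result always has one bin per vocabulary word. That fixed length is what makes it usable as an SVM input.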

    Harris-Affine-Detector -> SIFT Descriptor -> Assigned to Cluster -> 'Feature Vector' -- (+ label) -> Naive Bayes (or m SVMs).

    Sample (bagofwords_classification.cpp)

    The bagofwords sample program performs the bag-of-keypoints training and classification as in the paper. The training and test data format supported is PASCAL 2007-2010.

    The program defines a class VocData that understands the PASCAL VOC challenge data format. It is used to look up the list of training-set and test-set image files for a specific object-class. The class also defines helper functions to load/save classifier results and gnuplots.

    The program defines functions to load/save last run parameters, vocabulary-set and BOW descriptors.

    Despite the big chunk of code, the functions are pretty well-defined. There are sufficient code-comments.

    User specifies the keypoint-detection method, keypoint-descriptor and keypoint-matching method to the BOWKmeansTrainer. The vocabulary-set is built with one chosen class of object training images. The code-comment says building with one particular class is enough.

    SVM is used for object classification. The CvSVM class is used. The number of instances is the same as the number of object classes. Each is trained with both positive and negative samples of a particular class object. That SVM would be tested with all classes of test objects.

    See LIBSVM for more details on the CvSVM implementation in OpenCV.

    The BOW Image descriptor is the histogram of vocabulary occurrences in a single image. It is a simple array - rows-of-image x cols-of-vocabulary. Each row is sent to the SVM for training.

    DDMParams load/stores the keypoint detector-descriptor-matcher type.

    VocabTrainParams stores the name of the object-class to be used for training. It also loads/saves the maximum vocabulary size, memory to use, and proportion-of-descriptor to use for building vocabulary. Not all the detected image key-points are used. The last parameter specifies the fraction of that to be randomly picked from each input.

    SVMTrainParamsExt stores some parameters that control the input to the SVM training process. These do not overlap with CvSVMParams. There are 3 parameters:

    1. descPercent controls the fraction of the BOW image descriptors to be used for training. 
    2. targetRatio is the preferred ratio of positive to negative samples. Precisely, this parameter is the fraction of positive samples among all samples. It also means that some of the samples will be thrown away to maintain this ratio.
    3. balanceClasses is a boolean. If it is true, then the C-SVC weights given to the positive and negative samples will be the same as the pos:neg ratio of samples used for training. See CvSVMParams::class_weights for usage. If it is set to true, the targetRatio will not be used.
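To make the targetRatio behaviour concrete, here is a minimal sketch of how many samples would have to be discarded to reach a given fraction of positives. `subsample_counts` is a hypothetical helper written for this note, not the sample program's code, and the 0.5 ratio in the example is an arbitrary choice; the 143 / 2356 split is the aeroplane training set reported later in this post.

```python
import math


def subsample_counts(n_pos, n_neg, target_ratio):
    """Return (keep_pos, keep_neg) so that keep_pos / (keep_pos + keep_neg) is
    about target_ratio, discarding only from the over-represented side."""
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    if n_pos + n_neg == 0:
        return 0, 0
    current = n_pos / (n_pos + n_neg)
    if current < target_ratio:
        # Too few positives: keep all of them and drop surplus negatives.
        keep_neg = math.floor(n_pos * (1.0 - target_ratio) / target_ratio)
        return n_pos, min(n_neg, keep_neg)
    # Too many positives: keep all negatives and drop surplus positives.
    keep_pos = math.floor(n_neg * target_ratio / (1.0 - target_ratio))
    return min(n_pos, keep_pos), n_neg


# 143 positives and 2356 negatives, arbitrary target of 50% positives.
kept = subsample_counts(143, 2356, 0.5)
print(kept)
```

With such an unbalanced set, a high target ratio throws away most of the negatives, which is presumably why the class-weight alternative (balanceClasses) exists.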

    RBF is chosen as the kernel function for the SVMs. The related parameters will be chosen automatically, presumably by crawling the 'Grid'. See LIBSVM docs.


    Used Harris Affine Detector - SIFT descriptor - BruteForce matcher for key Points matching.
    Default parameters for BOWKMeansTrainer.
    As stated above, the demo application saves user-preferences, BOW descriptors, SVM classifier parameters and Test results to an output directory.
    Stopped the run after the 'aeroplane' class. It took too long. Saved for another time when there is a spare PC. On the other hand, 10103 BOW descriptors are already built, and there are 11322 JPEG images. That means only about 1200 more image descriptors to extract. Most of the remaining time would be spent on training SVMs.

    It took a very long time to build the vocabulary - k-means never seems to converge below the default error value. So it stops after 100 iterations, which is the default maximum.
    Computing Feature Descriptors (Detect + Extract): 6823 secs ~ 2hrs
    Vocabulary Training ( 3 attempts of (k=1000)-means ): 75174 secs ~ 21 hours

    SVM Classifier Training (for one classifier, aeroplane)
    • Took 5 hours to extract BOW Descriptors from 4998 Training Set images.
    • Took another 2.6 hours to train the SVM classifier with 2499 of the above descriptors, meaning only 50% is used for training. Of these, 143 are positive and 2356 are negative.
    SVM Classifier Testing (for one classifier, aeroplane)
    • Took 5 hours to extract image descriptors from the 5105 Test Set images.
    • Took only 0.04 seconds to classify all the Test Set descriptors.
    • The output includes a gnuplot command file. Running it through cygwin gnuplot produces a PNG file. It shows the Average Precision of 0.058 and a plot of Precision versus Recall.

    • Visual Categorization with Bags of Keypoints, Csurka, et al.
    • A Practical Guide to Support Vector Classification, see LIBSVM from Resources

    Wednesday, February 23, 2011

    HOG Descriptor

    Excellent paper by Dalal and Triggs. It gives a worked example of choosing the various modules of a recognition pipeline for human figures (pedestrians).

    Much simplified summary
    It uses Histogram of Gradient Orientations as a descriptor in a 'dense' setting, meaning that it does not detect key-Points like SIFT detectors (sparse). Each feature vector is computed from a window (64x128) placed across an input image. Each vector element is a histogram of gradient orientations (9 bins from 0-180 degrees, +/- directions count as the same). The histogram is collected within a cell of pixels (8x8). The contrasts are locally normalized by a block of size 2x2 cells (16x16 pixels). Normalization is an important enhancement. The block moves in 8-pixel steps - half the block size, meaning that each cell contributes to 4 different normalization blocks. A linear SVM is trained to classify whether a window is a human-figure or not. The output from a trained linear SVM is a set of coefficients, one for each element in a feature vector.
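The numbers in the summary above fix the length of the feature vector. A short sketch of the arithmetic, where `hog_descriptor_length` is a hypothetical helper and the defaults are simply the values quoted in the summary:

```python
def hog_descriptor_length(win=(64, 128), cell=(8, 8), block_cells=(2, 2),
                          stride=(8, 8), bins=9):
    """Number of values in one HOG feature vector for a detection window."""
    # Block size in pixels (2x2 cells of 8x8 pixels gives 16x16).
    block = (block_cells[0] * cell[0], block_cells[1] * cell[1])
    # Number of block positions along each axis when stepping by the stride.
    blocks_x = (win[0] - block[0]) // stride[0] + 1
    blocks_y = (win[1] - block[1]) // stride[1] + 1
    # Each block holds one 9-bin histogram per cell.
    values_per_block = block_cells[0] * block_cells[1] * bins
    return blocks_x * blocks_y * values_per_block


length = hog_descriptor_length()
print(length)  # 7 x 15 blocks, 36 values each
```

This gives 7 x 15 = 105 overlapping blocks of 36 values, i.e. 3780 values per window, which is also the number of coefficients the linear SVM has to learn (plus a bias term).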

    I presume Linear SVM means the Kernel Method is linear, with no projection to a higher dimension. The paper by Hsu, et al suggests that a linear method is enough when the feature dimension is already high.
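Under that presumption, classifying a window needs no kernel evaluation at all: it is a weighted sum of the feature values plus a bias, compared against a threshold. A minimal sketch, with made-up weights and hypothetical function names:

```python
def linear_svm_score(features, weights, bias=0.0):
    """Weighted sum of the feature vector: the whole 'classifier' at run time."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_positive(features, weights, bias=0.0, threshold=0.0):
    """A window counts as a detection when its score exceeds the threshold."""
    return linear_svm_score(features, weights, bias) > threshold


# Made-up 2-element example; a real HOG window has thousands of elements.
score = linear_svm_score([1.0, 2.0], [0.5, -0.25], bias=0.1)
print(score, is_positive([1.0, 2.0], [0.5, -0.25], bias=0.1))
```

Lowering the threshold below zero accepts windows that fall on the negative side of the separating plane, trading more hits for more false positives.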

    OpenCV implementation (hog.cpp, objdetect.hpp)
    The HOGDescriptor class is not found in the API documentation. Here are some notable points judging by the source code and sample program (people_detect.cpp):

    • Comes with a default human-detector. The file comment says it is "compatible with the INRIA Object Detection and Localization toolkit". I presume this is a trained linear SVM classifier represented as a vector of coefficients.
    • No need to call SVM code. The HOGDescriptor.detect() function simply applies the coefficients to the input feature-vector to compute the weighted sum. If the sum is greater than the user-specified 'hitThreshold' (defaults to 0), then it is a human-figure.
    • 'hitThreshold' argument could be negative.
    • 'winStride' argument (default 8x8) - controls how the window slides across the input image.
    • detectMultiScale() arguments
      • 'groupThreshold' pass-through to cv::groupRectangles() API - non-Max-Suppression?
      • 'scale0' controls how much down-sampling is performed on the input image before calling 'detect()'. It is repeated up to 'nlevels' times. The default for 'nlevels' is 64. All levels could be done in parallel.
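The interplay of 'scale0' and 'nlevels' can be sketched as below. This is my reading of the loop, not a copy of hog.cpp: it assumes each level divides the image size by scale0 once more and that the loop stops early when the detector window no longer fits. `pyramid_scales` is a hypothetical name.

```python
def pyramid_scales(image_size, window_size=(64, 128), scale0=1.05, nlevels=64):
    """List the down-sampling factors that would be tried for one image."""
    scales = []
    scale = 1.0
    for _ in range(nlevels):
        scaled_w = int(round(image_size[0] / scale))
        scaled_h = int(round(image_size[1] / scale))
        # Assumption: stop as soon as the window is larger than the scaled image.
        if scaled_w < window_size[0] or scaled_h < window_size[1]:
            break
        scales.append(scale)
        if scale0 <= 1.0:
            break  # no further down-sampling is possible
        scale *= scale0
    return scales


vga_levels = pyramid_scales((640, 480))
exact_fit_levels = pyramid_scales((64, 128))
print(len(vga_levels), len(exact_fit_levels))
```

If this reading is right, a 640x480 picture runs out of room long before 64 levels, so raising 'nlevels' alone would not add any scales; only a smaller 'scale0' would.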
    Sample (people_detect.cpp)
    • Uses the built-in trained coefficients.
    • Actually needs to eliminate duplicate rectangles from the results of detectMultiScale(). Is it because it is matching at multiple scales?
    • detect() returns a list of detected points. The size is the detector window size.
    • With GrabCut BSDS300 test images - only able to detect one human figure (89072.jpg). The rest could be either too small or big or obscured. Interestingly, it detected a few long-narrow upright trees as human figure. It takes about 2 seconds to process each picture.
    • With GrabCut Data_GT test images - able to detect human figure from 3 images: tennis.jpg, bool.jpg (left), person5.jpg (right), _not_ person7.jpg though. An interesting false-positive is from grave.jpg. The cut-off tomb-stone on the right edge is detected. Most pictures took about 4.5 seconds to process.
    • MIT Pedestrian Database (64x128 pedestrian shots):
      • The default HOG detector window (feature-vector) is the same size as the test images.
      • Recognized 72 out of 925 images with detectMultiScale() using default parameters. Takes about 15 ms for each image.
      • Recognized 595 out of 925 images with detect() using default parameters. Takes about 3 ms for each image.
      • Turning off gamma-correction reduces the hits from 595 to 549.
    • INRIA Person images (Test Batch)
      • Negative samples in the first half are smaller, at about (1 / 4) the size of the Positives, and take 800 - 1000 ms each; the others take about 5 seconds.
      • Are the 'bike_and_person' samples there for testing occlusion?
      • Recognized 232 / 288 positive images. Reported hits on 65 / 453 negative images. Takes 10-20 secs for each image.
      • Again, cut-off boxes that result in a long vertical shape become false positives.
      • Lamp Poles, Trees, Rounded-Top Entrances, Top part of a tower, long windows are typical false positives. Should an upright statue be considered a 'negative' sample?
      • Picked a few false-negatives to re-run with changing parameters. I picked those with large human-figure and stands mostly upright. (crop_00001.jpg, crop001688.jpg, crop001706.jpg, person_107.jpg).
        • Increased the nLevels from default(64) to 256.
        • Decreased 'hitThreshold' to -2: a lot more small size hits.
        • Halved the input image size from the original.
        • Decreased the scaleFactor from 1.05 to 1.01.
        • Tried all the above individually - still unable to recognize the tall figure. I suppose this has something to do with their pose, like how they placed their arms.
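For easier comparison, the hit counts reported above can be turned into rates. The script below only re-states the numbers from this entry; the dictionary labels are my own.

```python
# (hits, total) pairs copied from the test notes above.
results = {
    "MIT, detectMultiScale()": (72, 925),
    "MIT, detect()": (595, 925),
    "MIT, detect() without gamma correction": (549, 925),
    "INRIA positives": (232, 288),
    "INRIA negatives (false positives)": (65, 453),
}

rates = {name: hits / total for name, (hits, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

So detect() finds roughly two thirds of the MIT pedestrians, while detectMultiScale() with default parameters finds under a tenth of them on the same window-sized images.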
    Histograms of Oriented Gradients for Human Detection, Dalal & Triggs.
    A Practical Guide to Support Vector Classification, Hsu, Chang & Lin

    Tuesday, February 22, 2011

    Cascade Classifier and Face Detection

    There is an excellent and easy-to-understand description in the OpenCV Book on using the Haar Features Cascade Classifiers for Face Detection.

    Very Simplified Summary
    A Haar Feature is similar to a Haar Wavelet. The weights inside the box-filter could be oriented horizontally, vertically, or diagonally.
    The Viola-Jones Classifier is a 2-class Cascade Classifier. The cascade is made up of a series of nodes. Each node is an AdaBoost forest (2-class-classifier). An input vector is classified as 'Yes' only if it 'passes' all the cascaded nodes. The classification process aborts when it sees a 'No' from the current node.
    Each node is built with a high acceptance rate (and therefore many false-positives) and a low rejection rate. The trees of an AdaBoost forest typically have only a single split, and each forest has about 10 decision stumps (single-split trees). The theory is that the nodes are built to recognize faces of different orientations. Early rejection means it spends little time on negative samples.
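The early-rejection idea can be sketched with a toy cascade. Everything here is made up for illustration (feature values, thresholds, votes, function names); it is not the OpenCV data structure, only the control flow described above.

```python
def stump_vote(x, feature, threshold, below, above):
    """Single-split tree: vote 'below' or 'above' depending on one feature value."""
    return below if x[feature] < threshold else above


def cascade_classify(x, nodes):
    """Return (accepted, nodes_evaluated). Stops at the first node that says 'No'."""
    for evaluated, (stumps, node_threshold) in enumerate(nodes, start=1):
        score = sum(stump_vote(x, *s) for s in stumps)
        if score < node_threshold:
            return False, evaluated  # early rejection: later nodes never run
    return True, len(nodes)


# Toy cascade: each node is (list of stumps, node threshold);
# each stump is (feature index, split threshold, vote if below, vote if above).
nodes = [
    ([(0, 0.5, -1.0, 1.0)], 0.0),
    ([(1, 0.2, -1.0, 1.0), (2, 0.7, -0.5, 0.5)], 0.0),
]

face_like = cascade_classify([0.9, 0.8, 0.9], nodes)
background = cascade_classify([0.1, 0.8, 0.9], nodes)
print(face_like, background)
```

The background sample is thrown out after one node, which is where the speed comes from when most windows in a picture contain no face.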

    Found this excellent page from the forum after I wrote this entry: http://note.sonots.com/SciSoftware/haartraining.html

    It requires thousands of good samples and tens of thousands of bad samples to train the classifier. The book says it could take hours or a whole day even on a fast computer. There is no exact number given. I guess it depends on the size of the feature-vector or the number of features.
    'haarTraining' is a standalone program that will train the classifier with pre-processed feature points from Positive Samples and Negative Samples. User is able to specify parameters to 'shape' the nodes and trees.

    Sample Vectors
    Positive Samples: Images with faces marked with a rectangle. Best results if the faces are aligned similarly. Do not mix upright with tilted.
    Negative Samples: Simply pictures without faces. Preferably with backgrounds similar to the 'Positive samples'.
    'createSample' is a standalone program that extracts the face-rectangles and rescales them to the same size as specified by the user.

    (Paraphrasing) The OpenCV book says the Haar Feature Detector works well with a Rigid Body with blocky features (like eyes). An object whose only distinguishing feature is its outline (coffee mug) is hard to detect. 'Rigid' means an object whose deformation under external pressure is negligible.

    Building and Running 'createSamples' and 'haarTraining'
    Source code:  OpenCV/modules/haartraining/
    VC++ Solution file: CMAKE_Build/modules/haartraining/
    Documentation: OpenCV/doc/

    Test Sample with Coca-Cola Logo (Step 1: createSample)

    createSample uses the OpenCV built-in C API to make training and test images by superimposing an input foreground image onto a list of user-provided background images. To create variety, the object (foreground) image is perspective-transformed and intensity-adjusted before it is finally scaled to the specified size and overlaid onto the background image.

    • Training Samples: Use createSample to produce a _single_ 'vec' file suitable for training. All the input images are embedded in that file. See header file for details (comment added).
    • Test Samples: Use createSample to produce a set of test images together with an 'info' file. The plain text file specifies the region of the transformed foreground object inside each test image. Only a single object would be overlaid on each background image.
    • 'createSample' application can be used to view the images inside a 'vec' file.
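As I understand the haartraining documentation, each line of the 'info' file holds an image path, the number of objects, and then x, y, width, height for each object. A minimal parser sketch under that assumption; `parse_info_line` and the example file name are hypothetical, and paths containing spaces are not handled.

```python
def parse_info_line(line):
    """Split one 'info' line into (image_path, [(x, y, width, height), ...])."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected at least a path and an object count")
    path, count = parts[0], int(parts[1])
    values = [int(v) for v in parts[2:]]
    if len(values) != 4 * count:
        raise ValueError("object count does not match the number of coordinates")
    # Group the flat coordinate list into one rectangle per object.
    boxes = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return path, boxes


# Hypothetical line: one 36x36 object placed at (140, 100).
entry = parse_info_line("test_0001.jpg 1 140 100 36 36")
print(entry)
```

Since only one object is overlaid per background image here, every line would carry a count of 1 and a single rectangle.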

    Produced 500 images with the coca-cola logo embedded on 6 of the background images chosen from the GrabCut BSDS300 test images.

    Test Sample with Coca-Cola Logo (Step 2: haartraining)

    The haartraining program is straightforward: it calls cvCreateTreeCascadeClassifier with the necessary cascade-parameters, the input 'vec' file location and the output directory location.

    What is the difference between cvCreateTreeCascadeClassifier() and cvCreateCascadeClassifier()?

    No idea. Glanced through the code. cvCreateCascadeClassifier seems to be more straightforward. cvCreateTreeCascadeClassifier does more than basic Cascade training. There is early termination condition checking. And there is training-data clustering, probably for evaluation of the classifier stages.

    The explanation of the 'mem' command-line parameter of haartraining.cpp is misleading.

    haartraining.htm says it specifies the maximum memory allocated for pre-calculation in Megabytes. It is actually passed to cvCreateTreeCascadeClassifier() as the 'numprecalculated' argument. It specifies the number of features to be 'pre-calculated', whatever that means. So it is true that a higher number requires more memory. But the value itself does not cap the amount of memory allocated for this pre-calculation task. In fact, the code-comment in cvhaartraining.hpp includes a formula showing how the memory for 'feature pre-calculation' is a function of this argument.

    • Used 'createSample' to produce 500 Positive Samples with a Coca-Cola Logo embedded on about 6 background images chosen at random. The coca-cola logo image is reduced from 482x482 to 36x36 in size.
    • Used all 200 images from the set GrabCut test samples as Negative Samples.
    • The classifier is created in 2 forms: a single XML file and a database format. The database consists of a set of directories - one per stage. cvCreateTreeCascadeClassifier() actually calls cvLoadHaarClassifierCascade() to produce the XML file from the directory-set, as demonstrated in the convert_cascade sample.
    • The number of stages built is actually 8 instead of 14 as specified. The training stops with this message: "Required leaf false alarm rate achieved. Branch training terminated.".
    • The training function reports the performance using training data: 98.6% hit rate, 8.96e-6 false-alarm rate.
    • The 99% hit rate is achieved at the first stage; the rest of the stages lower the false-alarm rate, which starts at 10%.
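The early stop at 8 stages makes sense if the overall targets are simply the per-stage targets compounded over the requested number of stages. The sketch below assumes the per-stage defaults of haartraining are a minimum hit rate of 0.995 and a maximum false-alarm rate of 0.5 (I did not change them, but treat these values and the exact termination formula as assumptions); `cascade_targets` is a hypothetical helper.

```python
def cascade_targets(n_stages, min_hit_rate=0.995, max_false_alarm=0.5):
    """Overall (hit rate, false-alarm rate) if every stage just meets its target."""
    return min_hit_rate ** n_stages, max_false_alarm ** n_stages


hit_target, false_alarm_target = cascade_targets(14)

# False-alarm rate reported by the training run after 8 stages.
reported_false_alarm = 8.96e-6
target_met = reported_false_alarm <= false_alarm_target

print(f"required false-alarm rate for 14 stages: {false_alarm_target:.2e}")
print(f"reported after 8 stages: {reported_false_alarm:.2e}, target met: {target_met}")
```

Under these assumptions the 14-stage goal is a false-alarm rate of about 6.1e-5, and the 8 trained stages already beat it, so there is nothing left for the remaining stages to do.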
    Console Output
    • BACKGROUND PROCESSING TIME: Time taken to load negative sample (and extract Haar features?)
    • "Number of used features": Varies from 1 to 5, corresponding to the number of rows in the tabular output. This number seems to represent the number of trees at the current cluster (stage).
    • "Number of features used" (different from the last point): Simply calculated from the size of the object and not from 'feature-detection' of the actual training pictures.
    • How come 'Chosen number of splits' and 'Total number of splits' are always zero?
    • Training time could be long and requires lots of CPU and memory.
    • In fact, the CPU constantly maxes out.
    • Time required is proportional to the number of features, which in turn grows rapidly with the size of the foreground object picture (coca-cola logo).
    • At original resolution (482x482) - program ran out of memory in a few minutes.
    • At 48x48 resolution ~ about 4.1 million 'features' and MEM set to 512. 1st stage takes an hour (did not wait for it to complete).
    • At 36x36 resolution ~ about 1.3 million features and MEM kept at 512. It takes 3 hours to complete. It terminates by itself after 8 stages out of 14, for the reason stated earlier.
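The jump from 1.3 million to 4.1 million features for a modest increase in window size can be checked by counting. The sketch below counts only the five basic upright Haar shapes (haartraining uses a larger feature set, so absolute numbers differ from the ones it reports), and `count_features` is a hypothetical helper.

```python
# Base sizes (width, height) of the five basic upright Haar-like shapes:
# two-rectangle (horizontal, vertical), three-rectangle (both ways), four-rectangle.
BASIC_SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]


def count_features(window, shapes=BASIC_SHAPES):
    """Count every position and integer scaling of each shape in a square window."""
    total = 0
    for base_w, base_h in shapes:
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # Number of top-left corners where a w x h feature still fits.
                total += (window - w + 1) * (window - h + 1)
    return total


count_24 = count_features(24)
ratio = count_features(48) / count_features(36)
print(count_24, round(ratio, 2), round((48 / 36) ** 4, 2))
```

The count grows roughly with the fourth power of the window side, so going from 36x36 to 48x48 roughly triples the number of features, in line with the 1.3 million versus 4.1 million reported by the program.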

    Test Sample with Coca-Cola Logo (Step 3 - final: face_detect)

    The OpenCV book gives an excellent description of the function parameters for CascadeClassifier::detectMultiScale(), especially the 'scaleFactor' and the 'flags' arguments.

    Test Data
    • Created 6 test images similar to the training images.
    Test Results
    • Original parameters: Able to detect from 3 out of 6 images.
      One of the failures has a much smaller object than the rest (36x36), which is actually the original object size! The other two failures are probably because the object is tilted.
      The book suggests training separately the upright and tilted objects.
    • Reduced 'scaleFactor' from 1.1 to 1.01: Able to detect the 36x36 object.
      The detection is scale-sensitive, so finer scaling steps increase the hit-rate, at the expense of more false-positive results.
    • Re-generated the set of test images with half the maximum rotation angle for distortion: one more object is recognized.
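The cost of a finer 'scaleFactor' is easy to estimate: the detector window is grown by that factor until it no longer fits, so the number of passes over the image depends on how many steps fit between the trained size and the image size. A rough sketch, assuming a 36x36 trained window and the 480-pixel side of a VGA picture; `scale_steps` is a hypothetical helper and ignores 'minSize' and other details of detectMultiScale().

```python
def scale_steps(image_side, window_side, scale_factor):
    """Number of window sizes tried when growing the window by scale_factor each step."""
    if scale_factor <= 1.0:
        raise ValueError("scale_factor must be greater than 1")
    steps = 0
    scale = 1.0
    # Keep enlarging the window until it is bigger than the image.
    while window_side * scale <= image_side:
        steps += 1
        scale *= scale_factor
    return steps


coarse = scale_steps(480, 36, 1.1)
fine = scale_steps(480, 36, 1.01)
print(coarse, fine)
```

Going from 1.1 to 1.01 multiplies the number of scales by roughly nine, which explains both the extra hit on the 36x36 object and the extra false positives and run time.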

    Test Sample with Running Face Classifier (face_detect)

    The face_detect sample demonstrates how to 'nest' classifiers to detect finer features. By default the sample deploys the face-alt-2 classifier to find face regions, followed by the eye-tree-eyeglasses classifier to find smaller features within each of the regions returned by the face-alt-2 classifier.

    Pre-built Face Cascade Classifiers
    • Location: OpenCV/data/haarcascades/
    • Dimensions of the training object could be found in most classifier files inside XML comments near the beginning.
    • Check the value of 'minSize' passed to detectMultiScale() of the nested Classifier. The minimum suitable for a face could be too big for a mouth.
    • Set the 'minSize' argument to maintain aspect ratio of the trained object.
    • It takes around one second to finish the detecting process for a 30x30 object from a VGA picture. ScaleFactor at 1.1.
    Wikipedia on Rigid Body: http://en.wikipedia.org/wiki/Rigid_body
    http://note.sonots.com/SciSoftware/haartraining.html (Script to expand CMU GroundTruth Data)
    CMU-MIT Face Detection Test Sets: http://www.ri.cmu.edu/research_project_detail.html?project_id=419&menu_id=261
    Face Databases as noted from some of the haarcascade classifier files:
    Robust Real-Time Face Detection, Viola & Jones, International Journal of Computer Vision 57.