This is essentially tracking by histogram. The object to be tracked should have a distinctive color histogram; a face is a typical choice. The input image is back-projected with this histogram, which highlights the areas where those colors are likely to appear. The mean-shift algorithm then moves a search window, placed at an arbitrary starting position, toward a local maximum of this probability image within a limited number of iterations. This is repeated for each new frame. The window location from the previous frame can be reused as the starting point to speed up convergence, assuming the object does not move much or disappear from the scene.
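To make the window iteration concrete, here is a minimal sketch in plain Python (no OpenCV). It is not the library's implementation: the function name `mean_shift`, its parameters, and the toy back-projection are all made up for illustration. It simply re-centres a fixed-size window on the centroid of the weights underneath it until the window stops moving.

```python
def mean_shift(prob, window, max_iter=10, eps=1):
    """Move window (x, y, w, h) toward a local peak of prob.

    prob is a list of rows of non-negative weights (a back-projection).
    Returns the final window and the number of iterations performed.
    Hypothetical helper for illustration, not an OpenCV function.
    """
    rows, cols = len(prob), len(prob[0])
    x, y, w, h = window
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        # Zeroth and first moments of the weights inside the window
        m00 = m10 = m01 = 0.0
        for r in range(max(y, 0), min(y + h, rows)):
            for c in range(max(x, 0), min(x + w, cols)):
                p = prob[r][c]
                m00 += p
                m10 += c * p
                m01 += r * p
        if m00 == 0:
            break  # no probability mass under the window, nothing to follow
        # Re-centre the window on the centroid, keeping it inside the image
        new_x = int(round(m10 / m00 - (w - 1) / 2))
        new_y = int(round(m01 / m00 - (h - 1) / 2))
        new_x = min(max(new_x, 0), cols - w)
        new_y = min(max(new_y, 0), rows - h)
        moved = abs(new_x - x) + abs(new_y - y)
        x, y = new_x, new_y
        if moved < eps:
            break  # converged: the window no longer moves
    return (x, y, w, h), iterations


# Toy back-projection: a 5x5 bright blob in a 20x20 image
prob = [[0] * 20 for _ in range(20)]
for r in range(10, 15):
    for c in range(12, 17):
        prob[r][c] = 255

# Start with a 7x7 window that only overlaps a corner of the blob
window, iterations = mean_shift(prob, (8, 6, 7, 7))
print(window, iterations)
```

Starting from a window that only touches a corner of the blob, the window drifts until it is centred on the blob. A window that starts with nothing underneath it does not move at all, which matches the failure mode where the tracker loses the object.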
The CamShift method improves on mean-shift by updating the search window (in size and orientation) to fit the object in the current frame. The paper presents this method as part of a computer user interface that tracks the user's face and head movement. This allows tracking as the user moves toward or away from the camera, in addition to the lateral movements that mean-shift supports. It does so by changing the window size based on the moments calculated at each iteration.
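As a rough illustration of how moments can drive the window size and orientation, the sketch below computes the centroid, a window side length, and an orientation angle from the zeroth, first, and second moments of a probability image. The formulas follow my reading of the Bradski paper; the normalising constant `max_value` in particular is an assumption and should be checked against the paper. The function name and parameters are hypothetical, and OpenCV's own implementation may differ.

```python
import math


def window_from_moments(prob, window, max_value=255.0):
    """Return (xc, yc, side, angle) for the weights inside window.

    prob is a list of rows of weights in [0, max_value].
    side is a suggested window side length and angle is in radians,
    measured in image coordinates (y grows downward).
    Returns None when there is no weight under the window.
    Hypothetical helper for illustration, not an OpenCV function.
    """
    rows, cols = len(prob), len(prob[0])
    x, y, w, h = window
    m00 = m10 = m01 = m11 = m20 = m02 = 0.0
    for r in range(max(y, 0), min(y + h, rows)):
        for c in range(max(x, 0), min(x + w, cols)):
            p = prob[r][c]
            m00 += p
            m10 += c * p
            m01 += r * p
            m11 += c * r * p
            m20 += c * c * p
            m02 += r * r * p
    if m00 == 0:
        return None
    xc, yc = m10 / m00, m01 / m00
    # Window size grows with the amount of probability mass found
    side = 2.0 * math.sqrt(m00 / max_value)
    # Orientation of the major axis from the central second moments
    a = m20 / m00 - xc * xc
    b = 2.0 * (m11 / m00 - xc * yc)
    c2 = m02 / m00 - yc * yc
    angle = 0.5 * math.atan2(b, a - c2)
    return xc, yc, side, angle


# A wide, flat blob: 10 columns by 4 rows of full-strength pixels
prob = [[0] * 20 for _ in range(20)]
for r in range(5, 9):
    for c in range(3, 13):
        prob[r][c] = 255

print(window_from_moments(prob, (0, 0, 20, 20)))
```

Because the size is derived from the total mass rather than from the previous window, a small initial selection inside a larger object grows over successive iterations, and the window shrinks again as the object moves away from the camera.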
The meanShift() documentation recommends removing noise from the back-projection before calling meanShift().
The calcBackProject() documentation has a very brief description of the CamShift algorithm.
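The actual mean-shift is done on the Hue plane of the HSV color space. The histogram covers hue values from 0 to 180, which in OpenCV's 8-bit HSV representation is the full hue circle (hue in degrees is halved to fit in a byte). Hues near the low end are close to human skin, while values further up the range are blue-ish.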
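The histogram and back-projection step can be sketched in plain Python as follows. This is only an illustration of the idea, not OpenCV's calcHist() or calcBackProject(): the function names, the choice of 16 bins, and the scaling to 0-255 are assumptions made for the example.

```python
def hue_histogram(hues, bins=16, hue_max=180):
    """Count how many sample hues (0 <= hue < hue_max) fall in each bin."""
    hist = [0] * bins
    for hue in hues:
        hist[hue * bins // hue_max] += 1
    return hist


def back_project(hue_image, hist, hue_max=180):
    """Replace each hue with its histogram count, scaled to 0..255.

    Pixels whose hue is common in the sampled object become bright;
    pixels whose hue never appeared in the sample become 0.
    """
    bins = len(hist)
    peak = max(hist) or 1  # avoid dividing by zero for an empty histogram
    return [[255 * hist[hue * bins // hue_max] // peak for hue in row]
            for row in hue_image]


# Hues sampled from the selected region (mostly low, skin-like values)
sample = [5, 6, 7, 8, 100]
hist = hue_histogram(sample)

# One row of a frame: a skin-like hue, a blue-ish hue, and an unseen hue
print(back_project([[5, 100, 170]], hist))
```

The resulting image is what mean-shift or CamShift then operates on: bright wherever the frame contains hues that were frequent in the selected region.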
A trivial change could be made to the sample to compare mean-shift and CamShift.
The CamShift window resizes and re-orients according to the subject's motion or changes in camera perspective. The accuracy is pretty impressive.
In cases where it fails to converge (typically when the object disappears from the scene or is too far from its last 'seen' location), the marker stops at a place nowhere near a face-colored region. This happens when the video cuts to another scene or switches to a camera with a vastly different perspective.
The good thing about the auto-sizing feature of CamShift is that I can simply select a small area inside the face, and it quickly expands automatically to fill the largest area possible.
It is designed to track only one object (see the paper).
The application crashes consistently on more than one video file, typically after a few minutes.
Computer Vision Face Tracking For Use in a Perceptual User Interface, Bradski