Single object tracking based on OpenCV

In this tutorial, we will learn to track objects using OpenCV. The tracking API was first introduced in OpenCV 3.0. We will learn how and when to use the eight different trackers available in OpenCV 4.2: BOOSTING, MIL, KCF, TLD, MEDIANFLOW, GOTURN, MOSSE and CSRT. We will also learn the general theory behind modern tracking algorithms.

1. What is target tracking?

Simply put, locating an object in consecutive frames of a video is called tracking.
This definition sounds simple, but in computer vision and machine learning, tracking is a very broad term that covers ideas which are conceptually similar but technically different. For example, all of the following different but related ideas are commonly studied under object tracking:

  • 1. Dense optical flow: these algorithms estimate the motion vector of every pixel in a video frame.
  • 2. Sparse optical flow: these algorithms, such as the Kanade-Lucas-Tomasi (KLT) feature tracker, track the positions of a few feature points in the image.
  • 3. Kalman filtering: a very popular signal processing algorithm used to predict the position of a moving object based on prior motion information. One of the early applications of this algorithm was missile guidance!
  • 4. Mean shift and CAMShift: these are algorithms for locating the maxima of a density function. They are also used for tracking.
  • 5. Single object trackers: in this class of trackers, the first frame is marked with a rectangle indicating the position of the object we want to track. The object is then tracked in subsequent frames using the tracking algorithm. In most real-life applications, these trackers are used together with an object detector.
  • 6. Multiple object tracking algorithms: when we have a fast object detector, it makes sense to detect multiple objects in each frame and then run a track-finding algorithm that identifies which rectangle in one frame corresponds to which rectangle in the next frame.

2. Tracking and detection

If you have ever played with OpenCV face detection, you know that it works in real time and that you can easily detect a face in every frame. So why do you need tracking in the first place? Let's explore the different reasons you might want to track objects in a video rather than just run detection repeatedly.

  • 1. Tracking is faster than detection: tracking algorithms are usually faster than detection algorithms. The reason is simple: when you track an object that was detected in the previous frame, you already know a lot about its appearance, its position in the previous frame, and the direction and speed of its motion. In the next frame you can use all this information to predict the object's location and perform a small search around the expected position to locate it accurately. A good tracking algorithm uses all the information it has about the target, while a detection algorithm always starts from scratch. Therefore, when designing an efficient system, object detection is usually run every N frames and the tracking algorithm is used on the N-1 frames in between. Why not simply detect the object in the first frame and track it forever? Tracking does benefit from the extra information it carries, but you can also lose track of an object when it stays behind an obstacle for a long time or moves too fast for the tracker to keep up. Tracking algorithms also accumulate error, so the bounding box slowly drifts away from the object it is tracking. We use detection algorithms periodically to fix these problems of tracking algorithms. Detection algorithms are trained on large amounts of data, so they have a better understanding of the general category of an object; tracking algorithms, on the other hand, know more about the specific instance of the class they are tracking.
  • 2. Tracking can help when detection fails: if you run a face detector on a video and the person's face becomes occluded by an object, the face detector will most likely fail. A good tracking algorithm, on the other hand, will handle some degree of occlusion.
  • 3. Tracking preserves identity: the output of object detection is an array of rectangles containing objects, but the objects carry no identity. For example, a detector that detects red dots will output rectangles corresponding to all the dots it has found in a frame; in the next frame it will output another array of rectangles. In the first frame, a particular dot might be represented by the rectangle at position 10 in the array, while in the second frame it might be at position 17. With per-frame detection alone, we have no idea which rectangle corresponds to which object. Tracking, on the other hand, provides a way to literally connect the dots!

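The identity problem from the last bullet can be illustrated with a toy matcher. This is a hedged, pure-Python sketch: real pipelines use a tracker or Hungarian matching, and the box coordinates and the `match_ids` helper below are invented for illustration. Greedy Intersection-over-Union (IoU) matching re-attaches each previous ID to the overlapping new detection, regardless of the detector's output order:

```python
# Toy identity association: link detections across two frames by IoU so
# each box keeps a stable ID even though the detector returns them in
# arbitrary order. Boxes are (x, y, w, h) tuples.

def iou(a, b):
    # Intersection-over-Union of two (x, y, w, h) boxes
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def match_ids(prev_boxes, new_boxes, threshold=0.3):
    # prev_boxes: {id: box}; returns {id: box} for the new frame
    assigned, result = set(), {}
    for obj_id, pbox in prev_boxes.items():
        best, best_iou = None, threshold
        for i, nbox in enumerate(new_boxes):
            if i in assigned:
                continue
            score = iou(pbox, nbox)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            assigned.add(best)
            result[obj_id] = new_boxes[best]
    return result

prev = {10: (100, 100, 40, 40), 17: (300, 50, 40, 40)}
new = [(302, 52, 40, 40), (103, 101, 40, 40)]   # detector output, arbitrary order
linked = match_ids(prev, new)
print(linked)   # IDs follow the objects, not the array positions
```

Object 10 is matched to the second rectangle in the new array and object 17 to the first, which is exactly the "connect the dots" step that raw detection cannot provide.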
3. Object tracking with OpenCV 4

OpenCV 4 comes with a tracking API, which includes the implementation of many single object tracking algorithms. There are eight different trackers available in OpenCV 4.2 - BOOSTING, MIL, KCF, TLD, MEDIANFLOW, GOTURN, MOSSE, and CSRT.

Note: OpenCV 3.2 implements these six trackers: BOOSTING, MIL, TLD, MEDIANFLOW, MOSSE and GOTURN. OpenCV 3.1 implements these five trackers: BOOSTING, MIL, KCF, TLD and MEDIANFLOW. OpenCV 3.0 implements the following four trackers: BOOSTING, MIL, TLD and MEDIANFLOW.

In OpenCV 3.3, the tracking API changed. The code below checks the installed version and then uses the appropriate API.

Before briefly describing these algorithms, let us look at how they are set up and used. In the commented code below, we first set up the tracker by selecting the tracker type: BOOSTING, MIL, KCF, TLD, MEDIANFLOW, GOTURN, MOSSE or CSRT. We then open a video and grab a frame. We define a bounding box containing the object in the first frame and initialize the tracker with the first frame and the bounding box. Finally, in a loop, we read frames from the video and update the tracker to get a new bounding box for the current frame. The results are then displayed.

3.1 Object tracking with OpenCV 4: C++ code

#include <opencv2/opencv.hpp>
#include <opencv2/tracking.hpp>
#include <opencv2/core/ocl.hpp>

using namespace cv;
using namespace std;

// Convert to string
#define SSTR( x ) static_cast< std::ostringstream & >( ( std::ostringstream() << std::dec << x ) ).str()

int main(int argc, char **argv)
{
    // List of tracker types in OpenCV 3.4.1
    string trackerTypes[8] = {"BOOSTING", "MIL", "KCF", "TLD", "MEDIANFLOW", "GOTURN", "MOSSE", "CSRT"};

    // Create a tracker
    string trackerType = trackerTypes[2];

    Ptr<Tracker> tracker;

#if (CV_MINOR_VERSION < 3)
    {
        tracker = Tracker::create(trackerType);
    }
#else
    {
        if (trackerType == "BOOSTING")
            tracker = TrackerBoosting::create();
        if (trackerType == "MIL")
            tracker = TrackerMIL::create();
        if (trackerType == "KCF")
            tracker = TrackerKCF::create();
        if (trackerType == "TLD")
            tracker = TrackerTLD::create();
        if (trackerType == "MEDIANFLOW")
            tracker = TrackerMedianFlow::create();
        if (trackerType == "GOTURN")
            tracker = TrackerGOTURN::create();
        if (trackerType == "MOSSE")
            tracker = TrackerMOSSE::create();
        if (trackerType == "CSRT")
            tracker = TrackerCSRT::create();
    }
#endif

    // Read video
    VideoCapture video("videos/chaplin.mp4");

    // Exit if the video is not opened
    if (!video.isOpened())
    {
        cout << "Could not read video file" << endl;
        return 1;
    }

    // Read the first frame
    Mat frame;
    bool ok = video.read(frame);

    // Define initial bounding box
    Rect2d bbox(287, 23, 86, 320);

    // Uncomment the line below to select a different bounding box
    // bbox = selectROI(frame, false);

    // Display the bounding box
    rectangle(frame, bbox, Scalar(255, 0, 0), 2, 1);
    imshow("Tracking", frame);

    // Initialize the tracker with the first frame and bounding box
    tracker->init(frame, bbox);

    while (video.read(frame))
    {
        // Start timer
        double timer = (double)getTickCount();

        // Update the tracking result
        bool ok = tracker->update(frame, bbox);

        // Calculate frames per second (FPS)
        float fps = getTickFrequency() / ((double)getTickCount() - timer);

        if (ok)
        {
            // Tracking succeeded: draw the tracked object
            rectangle(frame, bbox, Scalar(255, 0, 0), 2, 1);
        }
        else
        {
            // Tracking failure detected
            putText(frame, "Tracking failure detected", Point(100, 80), FONT_HERSHEY_SIMPLEX, 0.75, Scalar(0, 0, 255), 2);
        }

        // Display tracker type on frame
        putText(frame, trackerType + " Tracker", Point(100, 20), FONT_HERSHEY_SIMPLEX, 0.75, Scalar(50, 170, 50), 2);

        // Display FPS on frame
        putText(frame, "FPS : " + SSTR(int(fps)), Point(100, 50), FONT_HERSHEY_SIMPLEX, 0.75, Scalar(50, 170, 50), 2);

        // Display frame
        imshow("Tracking", frame);

        // Exit if ESC is pressed
        int k = waitKey(1);
        if (k == 27)
        {
            break;
        }
    }

    return 0;
}

3.2 Object tracking with OpenCV 4: Python code

import cv2
import sys

(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

if __name__ == '__main__' :

    # Establish tracker
    # In addition to MIL, you can also use

    tracker_types = ['BOOSTING', 'MIL','KCF', 'TLD', 'MEDIANFLOW', 'GOTURN', 'MOSSE', 'CSRT']
    tracker_type = tracker_types[2]

    if int(minor_ver) < 3:
        tracker = cv2.Tracker_create(tracker_type)
    else:
        if tracker_type == 'BOOSTING':
            tracker = cv2.TrackerBoosting_create()
        if tracker_type == 'MIL':
            tracker = cv2.TrackerMIL_create()
        if tracker_type == 'KCF':
            tracker = cv2.TrackerKCF_create()
        if tracker_type == 'TLD':
            tracker = cv2.TrackerTLD_create()
        if tracker_type == 'MEDIANFLOW':
            tracker = cv2.TrackerMedianFlow_create()
        if tracker_type == 'GOTURN':
            tracker = cv2.TrackerGOTURN_create()
        if tracker_type == 'MOSSE':
            tracker = cv2.TrackerMOSSE_create()
        if tracker_type == 'CSRT':
            tracker = cv2.TrackerCSRT_create()

    # Read video
    video = cv2.VideoCapture("videos/chaplin.mp4")

    # Exit if the video is not opened.
    if not video.isOpened():
        print("Could not open video")
        sys.exit()

    # Read the first frame.
    ok, frame = video.read()
    if not ok:
        print("Cannot read video file")
        sys.exit()

    # Define an initial bounding box
    bbox = (287, 23, 86, 320)

    # Uncomment the line below to select a different bounding box
    # bbox = cv2.selectROI(frame, False)

    # Initialize the tracker with the first frame and bounding box
    ok = tracker.init(frame, bbox)

    while True:
        # Read a new frame
        ok, frame = video.read()
        if not ok:
            break

        # Start timer
        timer = cv2.getTickCount()

        # Update tracker
        ok, bbox = tracker.update(frame)

        # Calculate frame rate (FPS)
        fps = cv2.getTickFrequency() / (cv2.getTickCount() - timer)

        # Draw bounding box
        if ok:
            # Tracking successful
            p1 = (int(bbox[0]), int(bbox[1]))
            p2 = (int(bbox[0] + bbox[2]), int(bbox[1] + bbox[3]))
            cv2.rectangle(frame, p1, p2, (255,0,0), 2, 1)
        else :
            # Tracking failed
            cv2.putText(frame, "Tracking failure detected", (100,80), cv2.FONT_HERSHEY_SIMPLEX, 0.75,(0,0,255),2)

        # Display the tracker type on the frame
        cv2.putText(frame, tracker_type + " Tracker", (100,20), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (50,170,50), 2)
        # Display FPS on the frame
        cv2.putText(frame, "FPS : " + str(int(fps)), (100,50), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (50,170,50), 2)

        # Display results
        cv2.imshow("Tracking", frame)

        # Press ESC to exit
        k = cv2.waitKey(1) & 0xff
        if k == 27 : break

4. Analysis of tracking algorithms

In this section, we will dig into the different tracking algorithms. Our goal is not a deep theoretical understanding of each tracker, but a practical one.
Let me first explain some general principles of tracking. In tracking, our goal is to find an object in the current frame, given that we have successfully tracked it in all (or nearly all) previous frames.

Because we have been tracking the object up to the current frame, we know how it has been moving. In other words, we know the parameters of its motion model. "Motion model" is just a fancy way of saying that you know the position and velocity (speed + direction of motion) of the object in the previous frames. Even if you knew nothing else about the object, you could predict its new position from the current motion model alone, and that prediction would be very close to the object's actual new position.
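The motion model described here can be as small as a constant-velocity predictor. A toy sketch, with made-up coordinates:

```python
# Constant-velocity motion model: given the object's position in the last
# two frames, predict where to search in the next frame.

def predict_next(pos_prev, pos_curr):
    # Velocity = displacement per frame; extrapolate one frame ahead
    vx = pos_curr[0] - pos_prev[0]
    vy = pos_curr[1] - pos_prev[1]
    return (pos_curr[0] + vx, pos_curr[1] + vy)

# Object seen at (100, 50), then (104, 52): moving ~(+4, +2) per frame
prediction = predict_next((100, 50), (104, 52))
print(prediction)   # (108, 54)
```

Real trackers use richer models (e.g. a Kalman filter), but the principle of "extrapolate, then search nearby" is the same.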

But we have more information than object motion. We know what the object looks like in every previous frame. In other words, we can build an appearance model that encodes the appearance of the object. The appearance model can be used to search the position in the small neighborhood predicted by the motion model, so as to predict the position of the object more accurately.

The motion model predicts the approximate position of the object, and the appearance model fine-tunes this estimate to provide a more accurate, appearance-based estimate.

If the object is very simple and its appearance does not change much, we can use a simple template as the appearance model and search for that template. However, real life is not that simple: the appearance of an object can change dramatically. To solve this problem, many modern trackers use an appearance model that is a classifier trained online. Don't panic! Let me explain it in simpler terms.

The job of the classifier is to classify a rectangular region of the image as either object or background. The classifier takes an image patch as input and returns a score between 0 and 1, indicating the probability that the patch contains the object. The score is 0 when it is absolutely sure the patch is background, and 1 when it is absolutely sure the patch is the object.

In machine learning, we use the word "online" to refer to algorithms that perform dynamic training at runtime. Offline classifiers may need thousands of examples to train a classifier, but online classifiers usually use few examples for training at run time.

The classifier is trained by inputting positive (object) and negative (background) examples to the classifier. If you want to build a classifier for detecting cats, you can train it with thousands of images containing cats and thousands of images without cats. In this way, the classifier learns to distinguish between what is a cat and what is not. When building an online classifier, we don't have the opportunity to have thousands of examples of positive and negative classes.
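One simple flavor of online appearance updating can be sketched in NumPy with made-up numbers: rather than retraining on thousands of images, the model blends each newly confirmed patch into a running average with a small learning rate. This is a generic illustration of the online idea, not the exact update rule of any particular OpenCV tracker.

```python
# Online appearance update: blend the newly observed patch into the current
# appearance model with a learning rate, instead of retraining from scratch.
import numpy as np

template = np.full((4, 4), 100.0)     # appearance model learned so far
new_patch = np.full((4, 4), 120.0)    # confirmed object patch in the new frame
lr = 0.1                              # online learning rate

template = (1 - lr) * template + lr * new_patch
print(template[0, 0])                 # ~102.0: nudged toward the new appearance
```

A small learning rate makes the model adapt slowly (robust to noise, slow to follow appearance change); a large one does the opposite. That trade-off is at the heart of every online tracker.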

Let's see how different tracking algorithms deal with the problem of online training.

4.1 BOOSTING Tracker

This tracker is based on an online version of AdaBoost, the algorithm used internally by the HAAR-cascade face detector. This classifier needs to be trained at run time with positive and negative examples of the object. The initial bounding box supplied by the user (or by another object detection algorithm) is taken as a positive example of the object, and many image patches outside the bounding box are treated as background.

Given a new frame, the classifier is run on every pixel in a neighborhood of the previous position, and its score is recorded. The new location of the object is the one where the score is highest. Now the classifier has one more positive example. As more frames come in, the classifier is updated with this additional data.

Advantages: none. This algorithm is a decade old and works OK, but I could not find a good reason to use it, especially when other advanced trackers (MIL, KCF) based on similar principles are available.
Disadvantages: mediocre tracking performance. It cannot reliably tell when tracking has failed.

4.2 MIL Tracker

This tracker is similar in spirit to the BOOSTING tracker described above. The big difference is that instead of considering only the current position of the object as a positive example, it looks at several potential positive examples in a small neighborhood around the current position. You might think this is a bad idea, because in most of these "positive" examples the object is not centered.

This is where MIL (Multiple Instance Learning) comes in. In MIL, you do not specify positive and negative examples, but positive and negative "bags". The collection of images in a positive bag is not all positive examples; instead, only one image in the positive bag needs to be a positive example!

In our example, a positive bag contains the patch centered on the current position of the object plus patches in a small neighborhood around it. Even if the current location of the tracked object is not accurate, when samples from the neighborhood of the current location are put in the positive bag, there is a good chance that this bag contains at least one image in which the object is well centered.

Advantages: good performance. It does not drift as much as the BOOSTING tracker and does a reasonable job under partial occlusion. If you are using OpenCV 3.0, this may be the best tracker available to you. If you are using a later version, consider KCF.
Disadvantages: tracking failure is not reported reliably. It cannot recover from full occlusion.
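The positive "bag" construction can be sketched as follows. The frame contents, patch size, and sampling offsets are made up for illustration, and real MIL draws many more samples and trains a boosted classifier over the bags; this only shows how a neighborhood around an imprecise position yields a bag that likely contains one well-centered patch.

```python
# Building a MIL-style positive bag: patches sampled in a small neighborhood
# around the current (possibly slightly wrong) object position.
import numpy as np

frame = np.arange(100 * 100, dtype=np.float32).reshape(100, 100)  # dummy frame
cx, cy, size, radius = 50, 50, 16, 4   # current position, patch size, sampling radius

positive_bag = []
for dy in range(-radius, radius + 1, radius):      # offsets -4, 0, +4
    for dx in range(-radius, radius + 1, radius):
        x, y = cx + dx, cy + dy
        positive_bag.append(frame[y:y + size, x:x + size])

print(len(positive_bag))   # 9 patches; MIL only requires ONE of them to be positive
```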

4.3 KCF Tracker

KCF stands for Kernelized Correlation Filters. This tracker builds on the ideas presented in the previous two trackers. It exploits the fact that the multiple positive samples used in the MIL tracker have large overlapping regions. This overlapping data leads to some nice mathematical properties that the tracker uses to make tracking faster and more accurate at the same time.

Advantages: the accuracy and speed are better than MIL, and it reports better tracking failures than BOOSTING and MIL. If you are using OpenCV 3.1 and later, I recommend using it for most applications.
Disadvantages: cannot recover from full occlusion.

4.4 TLD Tracker

TLD stands for Tracking, Learning and Detection. As the name suggests, this tracker decomposes the long-term tracking task into three components: (short-term) tracking, learning and detection. From the author's paper: "The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector's errors and updates it to avoid these errors in the future."

The output of this tracker tends to jump around a bit. For example, if you are tracking a pedestrian and there are other pedestrians in the scene, this tracker can sometimes temporarily track a different pedestrian than the one you intended to track. On the positive side, this tracker appears to track objects over a larger range of motion and occlusion. If you have a video sequence where the object is hidden behind another object, this tracker may be a good choice.

Advantages: works best under occlusion over multiple frames. It also handles scale changes best.
Disadvantages: a large number of false positives make it almost unusable.

4.5 MEDIANFLOW Tracker

Internally, this tracker tracks the object both forward and backward in time and measures the discrepancy between the two trajectories. Minimizing this forward-backward error lets it reliably detect tracking failures and select reliable trajectories in video sequences.

In my tests, I found that the tracker works best when the motion is predictable and small. Unlike other trackers, which continue to run even when the tracking fails significantly, this tracker knows when the tracking fails.

Advantages: excellent tracking-failure reporting. Works very well when the motion is predictable and there is no occlusion.
Disadvantages: fails under large motion.

4.6 GOTURN tracker

Of all the tracking algorithms above, this is the only one based on a convolutional neural network (CNN). From the OpenCV documentation, we know it is "robust to viewpoint changes, lighting changes and deformations". But it does not handle occlusion well.

Note: GOTURN is a CNN-based tracker that uses a Caffe model for tracking. The Caffe model and the prototxt file must be present in the directory in which the code is run. These files can be downloaded from the opencv_extra repository and must be unzipped before use.

4.7 MOSSE tracker

MOSSE (Minimum Output Sum of Squared Error) uses an adaptive correlation filter for object tracking, producing stable correlation filters when initialized from a single frame. The MOSSE tracker is robust to variations in lighting, scale, pose and non-rigid deformation. It also detects occlusion based on the peak-to-sidelobe ratio, which lets the tracker pause and then resume where it left off when the object reappears. The MOSSE tracker also operates at a very high frame rate (450 fps and higher). On top of that, it is very easy to implement, as accurate as more complex trackers, and much faster. In terms of raw tracking performance, however, it lags behind deep-learning-based trackers.
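The correlation-filter idea behind MOSSE fits in a few lines of NumPy. This sketch learns the closed-form filter from one synthetic patch and a desired Gaussian response, then finds a circularly shifted copy of the patch via the peak of the response map. Real MOSSE adds preprocessing, cosine windowing, and online updates over many frames; the patch and shift here are made up.

```python
# MOSSE core idea: H* = (G . conj(F)) / (F . conj(F) + lambda) in the Fourier
# domain maps the training patch to a sharp peak; the peak of the response
# to a new patch gives the object's new position.
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal((64, 64))                 # training patch
yy, xx = np.mgrid[0:64, 0:64]
g = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 2.0 ** 2))  # peak at (32, 32)

F, G = np.fft.fft2(f), np.fft.fft2(g)
H = (G * np.conj(F)) / (F * np.conj(F) + 1e-3)    # closed-form MOSSE filter

z = np.roll(f, shift=(5, 7), axis=(0, 1))         # object moved down 5, right 7
response = np.real(np.fft.ifft2(np.fft.fft2(z) * H))
peak = np.unravel_index(np.argmax(response), response.shape)
print("peak at", peak)   # the (32, 32) peak, shifted by the motion
```

Because everything happens with FFTs, one filter evaluation costs a couple of transforms regardless of search area, which is why MOSSE reaches hundreds of frames per second.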

4.8 CSRT tracker

DCF-CSR (Discriminative Correlation Filter with Channel and Spatial Reliability) uses a spatial reliability map to adjust the filter support to the part of the selected region that is suitable for tracking. This ensures enlarging and localization of the selected region and improves tracking of non-rectangular regions or objects. It uses only two standard features (HoG and Colornames). It runs at a relatively low frame rate (around 25 fps) but provides high object tracking accuracy.

Added by slough on Mon, 07 Mar 2022 14:48:14 +0200