References:
- OpenCV parallel_for_ / ParallelLoopBody tutorial
- Use of OpenCV's parallel computing function parallel_for_
- Official website: Parallel Processing
- Gao Xiang, Zhang Tao, *14 Lectures on Visual SLAM*
Image processing in OpenCV involves a large amount of computation, so computing efficiently can save a lot of running time. Many options exist: OpenMP, TBB, OpenCL and, of course, NVIDIA's CUDA.
CUDA is powerful, but it is not well suited to millisecond-level programs: in my tests, transferring a single image between CPU and GPU took around 300 ms using OpenCV's CUDA functions, and even programming CUDA directly on a TX2 the transfer cost exceeded 50 ms (excluding initialization), which rules out real-time use. So computing efficiently on the CPU becomes especially important. I happened upon OpenCV's parallel computing function parallel_for_, which integrates the frameworks listed below.
For basic loop operations, a plain loop — even one using pointers — is not very efficient; parallel computing improves throughput considerably. Many operations inside OpenCV already use parallel acceleration. In OpenCV 3.2, the parallel frameworks are tried in the following order:
- Intel Threading Building Blocks (TBB) — third-party library, must be explicitly enabled
- OpenMP — integrated into the compiler, must be explicitly enabled
- Apple GCD — system-wide, used automatically (Apple only)
- Windows RT concurrency — system-wide, used automatically (Windows RT only)
- Windows concurrency — part of the runtime, used automatically (Windows only, MSVC >= 10)
- Pthreads (if available)
As you can see, several parallel frameworks can back the OpenCV library. Some are third-party libraries that must be explicitly built and enabled in CMake (such as TBB); others ship with the platform and are used automatically (such as Apple GCD). In either case, to access a given parallel framework you rebuild OpenCV with the corresponding CMake option enabled.
The second (weaker) prerequisite concerns the task itself, because not every computation can or should run in parallel. Put simply, a task that decomposes into multiple elementary operations with no memory dependency (no possible race conditions) is easy to parallelize. Computer vision processing is usually easy to parallelize because, most of the time, the processing of one pixel does not depend on the state of other pixels.
Parallel_for_
This article mainly describes the use of parallel_for_ together with ParallelLoopBody; you need the header `<opencv2/core/utility.hpp>`.
parallel_for_ is a parallel data processor and can be used in two ways:
```cpp
void cv::parallel_for_(const Range& range, const ParallelLoopBody& body, double nstripes = -1.);

static void cv::parallel_for_(const Range& range, std::function<void(const Range&)> functor, double nstripes = -1.);
```
1. Start with overloading the parentheses operator ()
When we call a function, we use the parentheses operator (). A constructor call looks the same, and C++ also lets us overload operator() on our own classes. What does that look like? Let's start with a simple example:
```cpp
class Animal {
public:
    Animal(float weight_, int age_) {
        weight = weight_;
        age = age_;
    }
    void operator()(float height) const  // overload operator ()
    {
        std::cout << "the age is : " << age << std::endl;
        std::cout << "the weight is : " << weight << std::endl;
        std::cout << "the height is : " << height << std::endl;
    }
private:
    float weight;
    int age;
};
```
Now call:
```cpp
int main(int argc, char* argv[]) {
    Animal animal(50.0, 25);  // create the object
    animal(100.0);            // call the overloaded operator () through the object
    getchar();
    return 0;
}
/*
the age is : 25
the weight is : 50
the height is : 100
*/
```
Summary:
(1) An overloaded parentheses operator behaves like a method of the object: it is still invoked through the object, in the form "object(argument list)";
(2) The general form is "return_type operator()(parameter list)", optionally followed by const. The parameter list can be arbitrary:
If there are no parameters, the call is: obj()
With one parameter, the call is: obj(arg1)
With multiple parameters, the call is: obj(arg1, arg2, ...)
2. General steps for using parallel_for_ with ParallelLoopBody
Use generally follows three steps:
(1) Step 1: customize a class or structure to inherit from the ParallelLoopBody class, as follows:
```cpp
class MyParallelClass : public ParallelLoopBody {};

struct MyParallelStruct : public ParallelLoopBody {};
```
(2) Step 2: override the parentheses operator () in the custom class or struct. Note: although an operator() overload can in general take any number of parameters, here it must take exactly one parameter of type Range (unlike the general case), and it must be const, because that is what parallel_for_ requires:
```cpp
void operator()(const Range& range) const
{
    // the work of "this loop" goes here
}
```
(3) Step 3: use parallel_for_ for parallel processing
Look again at the parallel_for_ function prototype:
```cpp
CV_EXPORTS void parallel_for_(const Range& range,           // the Range passed to the overloaded operator ()
                              const ParallelLoopBody& body, // an object of your class/struct derived from ParallelLoopBody
                              double nstripes = -1.);
```
The usage is as follows:
```cpp
parallel_for_(Range(start, end), MyParallelClass(constructor arguments));
// Range(start, end) is a range object
// MyParallelClass(constructor arguments) is an object of a class derived from ParallelLoopBody
```
Conventional usage would look like the following. It is perfectly valid, but written this way the body of the overloaded operator () runs sequentially, with no concurrency:
```cpp
MyParallelClass obj = MyParallelClass(constructor arguments);  // construct the object
obj(Range(start, end));                                        // direct call
```
Whereas with the call described earlier:
```cpp
parallel_for_(Range(start, end), MyParallelClass(constructor arguments));
```
there is no explicit call to the overloaded parentheses operator; it is invoked implicitly, and the body of the overload is executed concurrently.
3. Acceleration experiment: parallel_for_ combined with ParallelLoopBody
3.1 User-defined class implementation
Task description: compute the element-wise product of two Mat matrices, as shown below.
(1) Define a class that inherits from ParallelLoopBody and overloads the parentheses operator:
```cpp
class ParallelAdd : public ParallelLoopBody  // parallel loop-body class, following the official example
{
public:
    ParallelAdd(Mat& _src1, Mat& _src2, Mat _result)  // constructor (Mat headers share the underlying data)
    {
        src1 = _src1;
        src2 = _src2;
        result = _result;
        CV_Assert((src1.rows == src2.rows) && (src1.cols == src2.cols));
        rows = src1.rows;
        cols = src1.cols;
    }
    void operator()(const Range& range) const  // overload operator ()
    {
        // elements per row (cols * channels); also the stride between two rows within a column
        int step = (int)(result.step / result.elemSize1());
        for (int col = range.start; col < range.end; ++col)
        {
            float* pData = (float*)result.col(col).data;
            float* p1 = (float*)src1.col(col).data;
            float* p2 = (float*)src2.col(col).data;
            for (int row = 0; row < result.rows; ++row)
                pData[row * step] = p1[row * step] * p2[row * step];
        }
    }
private:
    Mat src1;
    Mat src2;
    Mat result;
    int rows;
    int cols;
};
```
The overloaded operator carries the time-consuming work. Two functions now implement the element-wise product of the two Mats: one processes elements via an ordinary direct call, the other uses parallel_for_ for concurrent processing:
```cpp
// direct call through obj(): no concurrency
void testParallelClassWithFor(Mat _src1, Mat _src2, Mat& result)
{
    result = Mat(_src1.rows, _src1.cols, _src1.type());
    int totalCols = _src1.cols;
    ParallelAdd add = ParallelAdd(_src1, _src2, result);
    add(Range(0, totalCols));  // direct call, no concurrency
}

void testParallelClassWithParallel_for_(Mat _src1, Mat _src2, Mat& result)
{
    result = Mat(_src1.rows, _src1.cols, _src1.type());
    int totalCols = _src1.cols;
    parallel_for_(Range(0, totalCols), ParallelAdd(_src1, _src2, result));  // implicit call, concurrent
}
```
Now compare the running times:
```cpp
int main(int argc, char* argv[])
{
    Mat testInput1 = Mat::ones(6400, 5400, CV_32F);
    Mat testInput2 = Mat::ones(6400, 5400, CV_32F);
    Mat result1, result2, result3;
    clock_t start, stop;

    // **************** timing comparison ****************
    start = clock();
    testParallelClassWithFor(testInput1, testInput2, result1);
    stop = clock();
    cout << "Running time using 'general for': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    start = clock();
    testParallelClassWithParallel_for_(testInput1, testInput2, result2);
    stop = clock();
    cout << "Running time using 'parallel for': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    start = clock();
    result3 = testInput1.mul(testInput2);
    stop = clock();
    cout << "Running time using 'mul function': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    return 0;
}
/*
Running time using 'general for': 645ms
Running time using 'parallel for': 449ms
Running time using 'mul function': 70ms
*/
```
Summary: the parallel_for_ version is indeed faster than the direct call, by nearly 200 ms, but it still does not match OpenCV's standard mul function. mul is fast because OpenCV's own functions are not only parallelized but also use stronger low-level optimizations. So prefer OpenCV's built-in methods whenever they exist, unless you can genuinely write something better than OpenCV.
3.2 User-defined struct implementation
Task description: define a parallel loop-body struct that cubes a Mat element by element.
(1) Define a struct that inherits from ParallelLoopBody and overloads the parentheses operator, as follows:
```cpp
struct ParallelPow : ParallelLoopBody  // loop-body struct used by parallel_for_
{
    Mat* src;  // struct member: a pointer to a Mat

    ParallelPow(Mat& _src)  // constructor
    {
        src = &_src;
    }
    void operator()(const Range& range) const
    {
        Mat& result = *src;
        int step = (int)(result.step / result.elemSize1());
        for (int col = range.start; col < range.end; ++col)
        {
            float* pData = (float*)result.col(col).data;
            for (int row = 0; row < result.rows; ++row)
                pData[row * step] = std::pow(pData[row * step], 3);  // cube element by element
        }
    }
};
```
Two functions are defined below: one cubes the elements directly through the functor call, the other uses parallel_for_ for concurrent execution:
```cpp
void testParallelStructWithFor(Mat _src)
{
    int totalCols = _src.cols;
    ParallelPow obj = ParallelPow(_src);
    obj(Range(0, totalCols));
}

void testParallelStructWithParallel_for(Mat _src)
{
    int totalCols = _src.cols;
    parallel_for_(Range(0, totalCols), ParallelPow(_src));
}
```
Now compare the running times:
```cpp
int main(int argc, char* argv[])
{
    Mat testInput1 = Mat::ones(6400, 5400, CV_32F);
    clock_t start, stop;

    start = clock();
    testParallelStructWithFor(testInput1);
    stop = clock();
    cout << "Running time using 'general for': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    start = clock();
    testParallelStructWithParallel_for(testInput1);
    stop = clock();
    cout << "Running time using 'parallel for': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    start = clock();
    testInput1.mul(testInput1).mul(testInput1);
    stop = clock();
    cout << "Running time using 'mul function': " << (double)(stop - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;

    return 0;
}
/*
Running time using 'general for': 881ms
Running time using 'parallel for': 195ms
Running time using 'mul function': 76ms
*/
```
Parallel execution clearly improves efficiency, but it is still slower than OpenCV's standard implementation. So when an OpenCV standard function exists, prefer it over a hand-written operation: OpenCV's standard functions are heavily optimized and very fast.
4. parallel_for_ with and without ParallelLoopBody
In fact, since C++11 it is no longer necessary to inherit from ParallelLoopBody: a lambda can be passed to parallel_for_ instead. The two approaches are compared below:
```cpp
class Parallel_My : public ParallelLoopBody
{
public:
    Parallel_My(Mat& img, const float x1, const float y1, const float scaleX, const float scaleY)
        : m_img(img), m_x1(x1), m_y1(y1), m_scaleX(scaleX), m_scaleY(scaleY)
    {
    }

    virtual void operator()(const Range& range) const
    {
        for (int r = range.start; r < range.end; r++)  // the for loop to parallelize
        {
            /* ... */
        }
    }

    Parallel_My& operator=(const Parallel_My&) { return *this; }

private:
    Mat& m_img;
    float m_x1;
    float m_y1;
    float m_scaleX;
    float m_scaleY;
};

int main()
{
    //! [mandelbrot-transformation]
    Mat Img(4800, 5400, CV_8U);
    float x1 = -2.1f, x2 = 0.6f;
    float y1 = -1.2f, y2 = 1.2f;
    float scaleX = Img.cols / (x2 - x1);
    float scaleY = Img.rows / (y2 - y1);

#ifdef CV_CXX11
    // method one: C++11 lambda
    parallel_for_(Range(0, Img.rows * Img.cols), [&](const Range& range) {
        for (int r = range.start; r < range.end; r++)  // the for loop to parallelize
        {
        }
    });
#else
    // method two: ParallelLoopBody subclass
    Parallel_My parallel_my0(Img, x1, y1, scaleX, scaleY);
    parallel_for_(Range(0, Img.rows * Img.cols), parallel_my0);
#endif
}
```
5. Single-level optical flow (supplementary content)
With the above clear, here is an optical-flow example:
```cpp
class OpticalFlowTracker
{
public:
    OpticalFlowTracker(
        const Mat &img1_,
        const Mat &img2_,
        const vector<KeyPoint> &kp1_,
        vector<KeyPoint> &kp2_,
        vector<bool> &success_,
        bool inverse_ = true,
        bool has_initial_ = false)
        : img1(img1_), img2(img2_), kp1(kp1_), kp2(kp2_),
          success(success_), inverse(inverse_), has_initial(has_initial_) {}

    void calculateOpticalFlow(const Range &range);

private:
    const Mat &img1;
    const Mat &img2;
    const vector<KeyPoint> &kp1;
    vector<KeyPoint> &kp2;
    vector<bool> &success;
    bool inverse = true;
    bool has_initial = false;
};

void OpticalFlowSingleLevel(
    const Mat &img1,
    const Mat &img2,
    const vector<KeyPoint> &kp1,
    vector<KeyPoint> &kp2,
    vector<bool> &success,
    bool inverse,
    bool has_initial)
{
    kp2.resize(kp1.size());
    success.resize(kp1.size());
    OpticalFlowTracker tracker(img1, img2, kp1, kp2, success, inverse, has_initial);
    parallel_for_(Range(0, kp1.size()),
                  std::bind(&OpticalFlowTracker::calculateOpticalFlow, &tracker, placeholders::_1));
}

// GetPixelValue is a bilinear image-lookup helper defined elsewhere in the accompanying code.
void OpticalFlowTracker::calculateOpticalFlow(const Range &range)
{
    // parameters
    int half_patch_size = 4;
    int iterations = 10;
    for (size_t i = range.start; i < range.end; i++) {
        auto kp = kp1[i];
        double dx = 0, dy = 0;  // dx, dy need to be estimated
        if (has_initial) {
            dx = kp2[i].pt.x - kp.pt.x;
            dy = kp2[i].pt.y - kp.pt.y;
        }

        double cost = 0, lastCost = 0;
        bool succ = true;  // indicate if this point succeeded

        // Gauss-Newton iterations
        Eigen::Matrix2d H = Eigen::Matrix2d::Zero();  // hessian
        Eigen::Vector2d b = Eigen::Vector2d::Zero();  // bias
        Eigen::Vector2d J;                            // jacobian
        for (int iter = 0; iter < iterations; iter++) {
            if (inverse == false) {
                H = Eigen::Matrix2d::Zero();
                b = Eigen::Vector2d::Zero();
            } else {
                // only reset b
                b = Eigen::Vector2d::Zero();
            }
            cost = 0;

            // compute cost and jacobian
            for (int x = -half_patch_size; x < half_patch_size; x++)
                for (int y = -half_patch_size; y < half_patch_size; y++) {
                    double error = GetPixelValue(img1, kp.pt.x + x, kp.pt.y + y) -
                                   GetPixelValue(img2, kp.pt.x + x + dx, kp.pt.y + y + dy);
                    // Jacobian
                    if (inverse == false) {
                        J = -1.0 * Eigen::Vector2d(
                            0.5 * (GetPixelValue(img2, kp.pt.x + dx + x + 1, kp.pt.y + dy + y) -
                                   GetPixelValue(img2, kp.pt.x + dx + x - 1, kp.pt.y + dy + y)),
                            0.5 * (GetPixelValue(img2, kp.pt.x + dx + x, kp.pt.y + dy + y + 1) -
                                   GetPixelValue(img2, kp.pt.x + dx + x, kp.pt.y + dy + y - 1)));
                    } else if (iter == 0) {
                        // in inverse mode, J stays the same for all iterations
                        // NOTE this J does not change when dx, dy are updated, so we can store it
                        // and only recompute error
                        J = -1.0 * Eigen::Vector2d(
                            0.5 * (GetPixelValue(img1, kp.pt.x + x + 1, kp.pt.y + y) -
                                   GetPixelValue(img1, kp.pt.x + x - 1, kp.pt.y + y)),
                            0.5 * (GetPixelValue(img1, kp.pt.x + x, kp.pt.y + y + 1) -
                                   GetPixelValue(img1, kp.pt.x + x, kp.pt.y + y - 1)));
                    }
                    // accumulate H, b and the cost
                    b += -error * J;
                    cost += error * error;
                    if (inverse == false || iter == 0) {
                        // also update H
                        H += J * J.transpose();
                    }
                }

            // compute update
            Eigen::Vector2d update = H.ldlt().solve(b);
            if (std::isnan(update[0])) {
                // sometimes occurs when the patch is all black or white and H is not invertible
                cout << "update is nan" << endl;
                succ = false;
                break;
            }

            if (iter > 0 && cost > lastCost) {
                break;
            }

            // update dx, dy
            dx += update[0];
            dy += update[1];
            lastCost = cost;
            succ = true;

            if (update.norm() < 1e-2) {
                // converged
                break;
            }
        }

        success[i] = succ;

        // set kp2
        kp2[i].pt = kp.pt + Point2f(dx, dy);
    }
}
```
The code implements single-level optical flow in the calculateOpticalFlow function. cv::parallel_for_ invokes OpticalFlowTracker::calculateOpticalFlow in parallel to compute the optical flow of the feature points in the given range. Internally this parallel for loop is executed by one of OpenCV's parallel backends (for example Intel TBB, when OpenCV is built with it); we only need to define the function body according to its interface and pass it in as an std::function object.
If you are not familiar with std::bind, you can refer to the previous article.