Corrosion and expansion of binary images beyond halcon speed to realize the fastest radius correlation algorithm at present (with core source code)

This blog post is from the blogger Imageshop. If you want to reward or view more content, you can move to Imageshop.

Reprinted from: https://www.cnblogs.com/Imageshop/p/10563354.html Invasion and deletion

I wrote in my blog two years ago SSE image algorithm optimization series 7: fast rectangular core corrosion and expansion (maximum and minimum) algorithm based on SSE Through SSE optimization, the corrosion and expansion of rectangular core are not only independent of radius, but also quite fast. At that time, some bloggers put forward the following questions in the comments of Blog:

#1st floor
2018-02-21 20:26 | Hu Yitan  
The blogger's idea is very clever, but the algorithm itself is not fast enough. There is still a big gap between the optimization effect and commercial software, 4096 X8192 Size grayscale commercial software( halcon)Only 33 ms, This article requires 250 ms,Considering that commercial software adopts multi-core optimization, my test machine is 4-core,
Generally, the optimization acceleration ratio is about 3 times, so the theoretical time after parallelization in this paper is 250/3=83.33ms. But I use OpenMP After optimizing the algorithm in this paper, the acceleration ratio is less than 3 times. We still need to find better ideas.

At that time, after reading this comment, I really felt that this blogger was wrong. How could such a large image be processed in only 33ms? It is the simplest image processing algorithm. Invert also takes about 7 / 8 milliseconds after extreme optimization. Therefore, I didn't recognize this speed in my heart at that time.

Later, I also considered the particularity of binary images. I once considered, for example, stopping the cycle when I encounter a white pixel during expansion. I also considered using histogram for optimization. After all, the histogram has only two pixels, but it still can't reach the above speed, and some are even slower. So there has been no progress in the follow-up.

A few days ago, the netizen LQC Jack suddenly mentioned this problem again. He believes that there are indeed faster methods to solve this problem. After all, the second is worth the particularity:

     

Among them, the key is "if you filter, sum > 0, the current point is 255". Yes, for the binary image, find the maximum value in the local rectangle and the local mean of the binary image. If we can establish a relationship, we can indirectly realize corrosion or expansion with the help of a fast local mean algorithm, I have many articles in my blog about the ultimate optimization of local mean, especially SSE image algorithm optimization series 13: implementation and optimization of ultra-high speed BoxBlur algorithm (five times the speed of Opencv) The method mentioned in the article is extremely efficient. It can be done in about 30ms for the binary graph of 4096X8192. Hope ignites.

How to bridge the two? It's really simple to think about it carefully. If you want to find the maximum value (expansion), as long as a local pixel is 255, the result will be 255, At this time, the local mean value must be greater than 0 (considering the actual factors, it should be the local cumulative value, because considering the final division, it does not exclude a local area, and there is only one white point. When the local area is too large, the result after division may also be 0). Only when all local pixels are 0, the maximum value is 0, and the local cumulative value at this time must also be 0. If it is to find the minimum value (corrosion), as long as one local pixel is 0, the result will be 0. Only if all local pixels are 255, the result will be 255. Then the information fed back to the local mean value here is equivalent to saying that the average value is 255, the result will be 255, otherwise the result will be 0 (similarly, the local cumulative value should be used in actual programming, not the average value).

In this way, we will find that this implementation process has fewer steps than the standard box fuzziness. Let's paste the core part of SSE optimization box Fuzziness:

 1 int BlockSize = 4, Block = (Width - 1) / BlockSize;
 2 __m128i OldSum = _mm_set1_epi32(LastSum);
 3 __m128 Inv128 = _mm_set1_ps(Inv);
 4 for (int X = 1; X < Block * BlockSize + 1; X += BlockSize)
 5 {
 6     __m128i ColValueOut = _mm_loadu_si128((__m128i *)(ColValue + X - 1));
 7     __m128i ColValueIn = _mm_loadu_si128((__m128i *)(ColValue + X + Radius + Radius));
 8     __m128i ColValueDiff = _mm_sub_epi32(ColValueIn, ColValueOut);                            //    P3 P2 P1 P0                                                
 9     __m128i Value_Temp = _mm_add_epi32(ColValueDiff, _mm_slli_si128(ColValueDiff, 4));        //    P3+P2 P2+P1 P1+P0 P0
10     __m128i Value = _mm_add_epi32(Value_Temp, _mm_slli_si128(Value_Temp, 8));                 //    P3+P2+P1+P0 P2+P1+P0 P1+P0 P0
11     __m128i NewSum = _mm_add_epi32(OldSum, Value);
12     OldSum = _mm_shuffle_epi32(NewSum, _MM_SHUFFLE(3, 3, 3, 3));                              //    Reassign to the latest value
13     __m128 Mean = _mm_mul_ps(_mm_cvtepi32_ps(NewSum), Inv128);
14     _mm_storesi128_4char(_mm_cvtps_epi32(Mean), LinePD + X);
15 }

Note that lines 14 and 15 are the process of calculating the mean value and finally saving the data to memory. The stored value in NewSum is the accumulated value. Note that because there is division, the relevant instructions of floating-point version are borrowed, and the relevant type conversion process is added.

Here, in order to meet our corrosion and expansion needs, these two sentences are unnecessary. According to the above analysis, such as expansion effect, only the following changes need to be made:

LinePD[X + 0] = NewSum.m128i_i32[0] > 0 ? 255 : 0;
LinePD[X + 1] = NewSum.m128i_i32[1] > 0 ? 255 : 0;
LinePD[X + 2] = NewSum.m128i_i32[2] > 0 ? 255 : 0;
LinePD[X + 3] = NewSum.m128i_i32[3] > 0 ? 255 : 0;

The above code is only for illustration. If it is written like this, it will destroy the overall harmony of SSE algorithm, and the insertion of ordinary C code in SSE will bring great loss in performance. A processing method is as follows:

__m128i Flag = _mm_cmpgt_epi32(NewSum, _mm_setzero_si128());
Flag = _mm_packs_epi32(Flag, Flag);
*((int *)(LinePD + X)) = _mm_cvtsi128_si32(_mm_packs_epi16(Flag, Flag));

We use the particularity of comparison operators in SSE to produce results such as 0xfffffffff, and then reduce them to 8 bits through saturation operations. Note that the above uses signed saturation calculations.

Do you know how to write about the corrosion process?

After a simple test, we processed a 4096X8192 binary graph with arbitrary radius. The time-consuming was basically stable at about 24ms, which was much faster than boxblur.

I have also conceived the judgment of impractical summation, such as using or operation or and operation, but they can not solve the processing problem of in and out pixels. Therefore, on the whole, summation is the most scientific.

In fact, there are faster methods for small radius. Here is a brief description, but many people may not understand it.

In our above implementation, we use int type data to save the accumulated value. This is because if the radius is a little larger, the accumulated value may exceed the range expressed by short type. However, int type SSE can only process 4 at a time, while short type data SSE can process 8 at a time. Therefore, is it possible to use short type if appropriate changes are made, It's possible.

Because it is a binary graph, there are only two values: 0 and 255. The value of 0 doesn't matter. If we change the value of 255 to 1, when the radius is not greater than a certain value (64 or other numbers, you can draw it yourself), the accumulated value will be controllable within the range that can be expressed by the short type.

There is another problem, that is, how to change the value of 255 to 1, if used_ mm_ blendv_ The judgment statements related to epi8 set can be implemented, but this Blend is time-consuming, but the gain is not worth the loss. One of the best ways is to make full use of the characteristics between unsigned and signed numbers. When we cast an unsigned char data type equal to 255 to signed char, its value is equal to - 1, which is opposite to the value 1 we want. At this time, it is in our original code_ mm_add_epi8 can be used_ mm_sub_epi8 replaces and vice versa. In SSE, this type conversion does not need to be enforced because it operates directly on memory.

If we post the following code, someone may understand what it means.

memset(ColValue + Radius, 0, Width * sizeof(unsigned char));
for (int Z = -Radius; Z <= Radius; Z++)
{
    unsigned char *LinePS = Src + ColOffset[Z + Radius] * Stride;
    int BlockSize = 16, Block = Width / BlockSize;
    for (int X = 0; X < Block * BlockSize; X += BlockSize)
    {
        unsigned char *DestP = ColValue + X + Radius;
        __m128i Sample = _mm_loadu_si128((__m128i *)(LinePS + X));                //    255 becomes - 1
        _mm_storeu_si128((__m128i *)DestP, _mm_sub_epi8(_mm_loadu_si128((__m128i *)DestP), Sample));
    }
    for (int X = Block * BlockSize; X < Width; X++)
    {
        ColValue[X + Radius] += (LinePS[X] == 255 ? 1 : 0);                                            //    Update column data
    }
}

The ordinary C code part is implemented directly in time, while the SSE part does not see the obvious conversion between 255 and 1, which is all in those simple codes.

Through this related optimization, the graph of 4096X8192 can achieve between 12 and 13 milliseconds, which has completely exceeded the speed of Halcon.

The corrosion and expansion in halcon also have circular radius. Under the same radius, the time-consuming of circular radius in halcon is about 8 times that of rectangular radius. I believe that halcon's algorithm of circular radius is also realized through EDM algorithm. See details for details SSE image algorithm optimization series 25: Calculation and optimization of Euclidean distance map (EDM) feature map of binary image One article, and I don't always have such a time ratio here.

Extremely optimized version project: https://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar , see binary - > processing - > erode / digest menu.

 

Added by il_cenobita on Thu, 20 Jan 2022 07:36:10 +0200