avx2 uses vpshufb instruction to classify characters

Recently, I saw simdjson's paper on character matching and classification using vpshufb instruction. I feel that this method is very fruitful. I want to share the following.

Let's talk about why this instruction is used in simdjson. It needs to extract six control characters (':', \ ',': ',' ',' {','} ') from the character array, and four meaningless characters (' \ r ',' \ n ','t', ') such as space newline.

vpshufb is an assembly instruction, which can be used in C + +__ m256i _mm256_shuffle_epi8 (__m256i a, __m256i b).

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Algorithm process:
FOR j := 0 to 15
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
    IF b[128+i+7] == 1
        dst[128+i+7:128+i] := 0
    ELSE
        index[3:0] := b[128+i+3:128+i]
        dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
    FI
ENDFOR
dst[MAX:256] := 0

For a brief explanation__ m256i can be regarded as having 256 bits. Assuming i=8, b[i+3:i] is b[11:8], that is, the 8th to 11th bits. These four bits form an unsigned integer. The maximum of the four bits is 1111, which is the hexadecimal f.

_ mm256_shuffle_epi8 can be regarded as a table lookup method, but it can only be checked with the lower four bits of each byte. Get the lower four bits of each byte from b. The value can only be one of 0-15. Take this value as an index, and then find the value corresponding to this index from a.
In short, it can be considered that_ mm256_shuffle_epi8 is a mapping relationship that maps one number to another.

Below is_ mm256_ shuffle_ Two use examples of epi8.

If you have a byte array with ascii, you need to find lowercase letters from it.
The hexadecimal representation of a-z is: 61-6f, 70-7a (61-7a is not written here because it is required for subsequent processing)

61-6f are grouped because they are 6 in the upper four places and 7 in the upper four places of 70-7a. Because the same number of the upper four digits can be divided into a group.

The idea of the algorithm is to use the upper four bits and the lower four bits of a byte respectively_ mm256_shuffle_epi8 is used for mapping, and the results of the two mappings are used for and operation. If the final result is not 0, it means that it is the required character.

The key is how to construct this mapping table.
Take the letters a and b for example:

a (0110 0001)and b(0110 0010),Their upper four bits are the same, so the mapping result of their upper four bits is the same.

For the top four digits:

We assume 0110 passes_mm256_shuffle_epi8 After mapping, 0000 0001

For the lower four digits:

So let's put a and b After the low four bit mapping, only the minimum is 1(xxxx xxx1),For example, 0000 0001 or 0000 0011,Then the result of the sum operation of these two numbers must be 0000 0001.

If it matches four characters a, B, P and Q

a (0110 0001),b(0110 0010),p(0111 0000),q(0111 0001)

For the top four digits:

Suppose 0110 passes_mm256_shuffle_epi8 0000 0001 after mapping
 Assumption 0111 passed_mm256_shuffle_epi8 0000 0010 after mapping

For the lower four digits:

hold a and b After the low four bit mapping, only the lowest bit contains 1(xxxx xxx1),p and q After mapping, the second bit is required to be 1(xxxx xx1x),But you'll find, a and q The lower four are the same,That is, they are the same after mapping. So the mapping result is:
0000-->0010,0001-->0011,0010-->0001. 

I won't push it down, just share examples
For letters a-z A-Z
High 4bit:

0110 And 0100, mapped to 0001
0111 And 0101, mapped to 0001
 Other mapped to 0

Low 4bit:

0000 Map to 0010
0001 To 1010 to 0011
1011 To 1111 to 0001

In fact, the extraction of letters can be handled in the way of 'a' > = character & & character < ='z '_ mm256_shuffle_epi8, we still need to deal with some more complex classifications When I extract some structures in HTML:
I need to get three types of characters
First: A-Z, A-Z/ (because < + letter may be the beginning of an html tag, < / <? May be a comment < / may be a closed tag)
The second type: = "'> (string or wrap label attribute is linked with a = sign >, which may be closed by the label)
Third: \ n \f \r \t n \ f \ R \ t \ 0 spaces (these are meaningless characters that need to be skipped)

Direct code:

void strExtractTestAvx2(char* ms ,unsigned int length) {

    //Conversion of lower four bits
    __m256i vpshufbCharMaskn = _mm256_setr_epi64x(0x1303030303130bc2L, 0x0d21a18101838303L, 0x1303030303130bc2L, 0x0d21a18101838303L);
    //Conversion of upper four bits
    __m256i vpshufbCharMaskh = _mm256_setr_epi64x(0x0201020124580080L, 0, 0x0201020124580080L, 0);
    //Character type judgment
    __m256i alphaMask = _mm256_set1_epi8(0b00001111);
    __m256i constructMask = _mm256_set1_epi8(0b00110000);
    __m256i emptyMask = _mm256_set1_epi8(0b11000000);
    //All 0
    __m256i zero = _mm256_set1_epi64x(0);
    //Four high positions 0 per byte
    __m256i vpshufbMask = _mm256_set1_epi64x(0x0f0f0f0f0f0f0f0fL);

    const unsigned int length1 = length - 31;

    for (int i = 0; i < length1; i += 32)
    {

        __m256i ymm1 = _mm256_lddqu_si256((__m256i*) & ms[i]);

        __m256i ymm2 = _mm256_srli_epi16(ymm1, 4);
        ymm2 = _mm256_and_si256(ymm2, vpshufbMask);
        __m256i ymm3 = _mm256_shuffle_epi8(vpshufbCharMaskn, ymm1);
        ymm2 = _mm256_shuffle_epi8(vpshufbCharMaskh, ymm2);
        ymm3 = _mm256_and_si256(ymm3, ymm2);
        
        __m256i  alpha = _mm256_and_si256(ymm3, alphaMask);
        __m256i construct = _mm256_and_si256(ymm3, constructMask);
        __m256i  empty = _mm256_and_si256(ymm3, emptyMask);
        // <
        __m256i less = _mm256_cmpeq_epi8(ymm1, lessSignMask);

        //Letters/
        alpha = _mm256_cmpgt_epi8(alpha, zero);
        
        //= " ' >
        construct = _mm256_cmpgt_epi8(construct, zero);

        empty = _mm256_srli_epi64(empty, 1);
        //6 spaces
        empty = _mm256_cmpgt_epi8(empty, zero);

    }
}

Keywords: C++

Added by raytri on Sun, 26 Dec 2021 20:35:31 +0200