How to access memory efficiently

The main factors affecting memory access speed are:
1. Memory bandwidth: the amount of data read from and written to memory per second, determined by the hardware configuration.
2. CPU CACHE: the buffer between the CPU and memory. When the hit rate is high, it can greatly improve the average memory access speed.
3. TLB (Translation Lookaside Buffer): a high-speed lookup mechanism for translating virtual addresses to physical addresses, much faster than an ordinary page-table walk.

We can only optimize points 2 and 3. Given the CACHE's small capacity and the synchronization contention between CPUs on SMP systems, making the most of the CACHE is our clear optimization target (taking common data structures as examples):
1. Compress the structure size: works around the CACHE's small capacity.
2. Align the structure: addresses the read/write characteristics of CACHE synchronization contention on SMP.
3. Allocate memory with contiguous addresses: improves hits in the small-capacity TLB and in the CACHE.
4. Other optimizations: a comprehensive consideration of various factors.

Specific optimization methods:
1. Compress structure size
The system CACHE is limited and its capacity is very small. Compressing the structure size so that the CACHE can hold more of the data being accessed is clearly one of the effective ways to improve the average memory access speed.
Compressing a structure requires a more reasonable design of the application logic, removing unnecessary fields as far as possible. There are also some additional compression techniques for the structure itself.

1.1. Arrange the structure fields reasonably
Because of structure alignment rules, the same set of fields can produce structures of different sizes depending on the field order.

//Size: 12 bytes
struct box_a
{
    char a; 
    short b;
    int c;  
    char d; 
};  


//Size: 8 bytes
struct box_b
{
    char a; 
    char d; 
    short b;
    int c;  
};
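
The effect is easy to confirm by printing both sizes (a minimal sketch, assuming the box_a and box_b definitions above on a typical platform where short is 2 bytes and int is 4):

#include <stdio.h>

int main(void)
{
    /* box_a needs padding after 'a' and after 'd'; box_b packs both chars together */
    printf("box_a: %zu bytes\n", sizeof(struct box_a));  /* 12 */
    printf("box_b: %zu bytes\n", sizeof(struct box_b));  /* 8 */
    return 0;
}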

1.2. Use bit fields
In fact, some structure fields do not need that much storage. For example, a flag field representing true or false only ever takes one of two values, 0 or 1, so a single bit is enough; dedicating a whole int to it wastes a great deal of space.
Example: tcp.h

struct tcphdr {
    __be16  source;
    __be16  dest;
    __be32  seq;
    __be32  ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u16   res1:4,
        doff:4,
        fin:1,
        syn:1,
        rst:1,
        psh:1,
        ack:1,
        urg:1,
        ece:1,
        cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
    __u16   doff:4,
        res1:4,
        cwr:1,
        ece:1,
        urg:1,
        ack:1,
        psh:1,
        rst:1,
        syn:1,
        fin:1;
#else
#error  "Adjust your <asm/byteorder.h> defines"
#endif  
    __be16  window;
    __sum16 check;
    __be16  urg_ptr;
};
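
The same idea on a smaller scale (a minimal sketch with made-up flag names): eight boolean flags stored as ints take 32 bytes, while eight 1-bit fields share a single unsigned int:

//Size: 32 bytes (each flag takes a full int)
struct flags_wide {
    int is_valid, is_dirty, is_locked, is_shared;
    int is_urgent, is_acked, is_retry, is_done;
};

//Size: 4 bytes (all eight flags packed into one unsigned int)
struct flags_packed {
    unsigned int is_valid:1, is_dirty:1, is_locked:1, is_shared:1;
    unsigned int is_urgent:1, is_acked:1, is_retry:1, is_done:1;
};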

1.3. Use unions
A union is also one of the ways to compress structure size. It lets us overlap several structure fields, or store small fields inside a larger one, when those fields are never needed at the same time.
Example: skbuff.h

struct sk_buff { 
    ...
    union {
        __wsum      csum;
        struct {
            __u16   csum_start;
            __u16   csum_offset;
        };
    };
    ...
};
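
A minimal sketch of the same technique with hypothetical fields (sizes assume an LP64 platform): when two fields are never valid at the same time, a union lets them share storage instead of each occupying its own space:

struct packet {
    unsigned int type;  // selects which union member is currently valid
    union {
        unsigned long seq;          // valid for data packets
        struct {
            unsigned short code;    // valid for control packets
            unsigned short reason;
        } ctrl;
    } u;  // the members overlap: the union takes 8 bytes instead of 8 + 4
};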

2. Align structures
Aligning structures has two meanings: aligning smaller structures to the machine word, and aligning larger structures to the CACHE LINE.

2.1 Machine word alignment for small structures
We know that modern computer hardware can only access memory at specific aligned addresses (such as machine-word boundaries). For example, on a 64-bit machine, whether we want to read byte 0 or byte 1, the signals on the hardware are the same: all 8 bytes from address 0 to address 7 are read into the CPU. When we only need byte 0, the following 7 bytes are discarded; when we only need byte 1, the byte before it and the 6 bytes after it are discarded.
When the bytes we want to read straddle two machine words, memory is accessed twice, and the final result is assembled through some extra logic.
Therefore, to improve performance, structures should be aligned to the machine word (or a multiple of it) as far as possible, and frequently accessed fields should be placed at machine-word-aligned positions within the structure.

//Size: 12 bytes
struct box_c
{
    char a; 
    char d; 
    short b;
    int c;  
    int e;  
};

//Size: 16 bytes
struct box_d
{
    char a; 
    char d; 
    short b;
    int c;  
    int e;  
    char padding[4];
};

The box_d structure above uses an added padding field to increase the structure size to 16 bytes, a machine-word multiple, so that when we allocate a contiguous array of box_d structures, every element in the array is still aligned to a machine-word multiple.
Padding a structure to a machine-word multiple with a filler field is common practice and can be seen everywhere in the Linux kernel source code.
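
The array property is easy to verify (a minimal sketch, assuming the box_d definition above and an 8-byte machine word):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    /* malloc returns a machine-word aligned block; since sizeof(struct box_d)
     * is 16, every element of the array stays machine-word aligned */
    struct box_d *arr = malloc(4 * sizeof(struct box_d));
    int i;
    for (i = 0; i < 4; i++)
        printf("arr[%d] address mod 8 = %lu\n", i,
               (unsigned long)((uintptr_t)&arr[i] % 8));  /* always 0 */
    free(arr);
    return 0;
}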

2.2 CACHE LINE alignment for large structures
We know that the minimum unit exchanged between the CACHE and memory is the CACHE LINE; take a CACHE LINE size of 64 bytes as an example. When our structure is not 64-byte aligned, it may occupy more CACHE LINEs than it actually needs: a structure that does not start on a 64-byte boundary in memory, even if the structure itself is shorter than 64 bytes, may straddle two CACHE LINEs, and both lines are evicted when it is evicted.
That is not the most serious problem, though. Structures that are not CACHE LINE aligned easily cause a CACHE problem called false sharing on SMP machines. For example, suppose structures T1 and T2 are not CACHE LINE aligned, and the rear half of T1 and the front half of T2 occupy the same CACHE LINE (call it CACHE LINE 1). If CPU 0 modifies the rear half of T1, CACHE LINE 1 on CPU 1 is invalidated; likewise, if CPU 1 modifies the front half of T2, CACHE LINE 1 on CPU 0 is invalidated. If CPU 0 and CPU 1 keep making such modifications, the harm is obvious. T1 and T2, which logically share nothing, actually share CACHE LINE 1; this is called false sharing.
The Linux source code provides macros, built on the GCC __attribute__ extension, to perform this alignment. Several similar macros can be found in the file /linux-2.6.xx/include/linux/cache.h, such as:

#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))

This macro can be applied to a structure field to force the field's address to be aligned to the start of a CACHE LINE.
See the implementation of struct nic in /linux-2.6.xx/drivers/net/e100.c: three of its fields are decorated with ____cacheline_aligned, which forces those fields to be aligned to the start of a CACHE LINE.

struct nic {
    /* Begin: frequently used values: keep adjacent for cache effect */
    u32 msg_enable              ____cacheline_aligned;
    /* 4 Byte hole */
    struct net_device *netdev;
    struct pci_dev *pdev;
    /* 40 Byte hole */
    struct rx *rxs              ____cacheline_aligned;
    struct rx *rx_to_use;
    struct rx *rx_to_clean;
    struct rfd blank_rfd;
    enum ru_state ru_running;
    /* 20 Byte hole */
    spinlock_t cb_lock          ____cacheline_aligned;
    spinlock_t cmd_lock;
    struct csr __iomem *csr;
    enum scb_cmd_lo cuc_cmd;
    unsigned int cbs_avail;
    struct napi_struct napi;
    ...
};

Returning to the earlier problem: if we add the ____cacheline_aligned modifier to the first field of structure T2, the false sharing is resolved.
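
Outside the kernel, the same fix can be written directly with the GCC attribute that the macro expands to. A minimal sketch (hypothetical per-CPU counters; 64 bytes is an assumed CACHE LINE size): each counter occupies its own CACHE LINE, so one CPU updating its counter no longer invalidates the line holding its neighbor's counter:

#define CACHE_LINE_SIZE 64  /* assumed CACHE LINE size */

/* aligned(64) makes both the start address and sizeof a multiple of 64,
 * so array elements never share a CACHE LINE */
struct padded_counter {
    unsigned long value;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct padded_counter per_cpu_hits[2];  /* e.g. one counter per CPU */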

2.3 Isolation and alignment of read-only fields and read-write fields
The purpose of isolating and aligning read-only fields from read-write fields is to make sure they end up in different CACHE LINEs. Since read-only fields hardly ever need to be updated, they can stay stably cached in the CACHE; keeping them away from the read-write fields avoids the frequent CACHE LINE invalidations that mixing the two would cause, and so improves efficiency. Keeping the read-write fields together also ensures that only a small number of CACHE LINEs are dirtied when the program writes to the structure.

typedef struct {
    /* ro data */
    size_t block_count;     // number of total blocks
 
    size_t meta_block_size; // sizeof per skb meta block
    size_t data_block_size; // sizeof per skb data block
     
    u8 *meta_base_addr;     // base address of skb meta buffer 
    u8 *data_base_addr;     // base address of skb data buffer 
 
    /* rw data */
    size_t current_index    ____cacheline_aligned;  // index
     
} bc_buff, * bc_buff_t;

3. Allocate memory with contiguous addresses
As address spaces have grown from 32-bit to 64-bit, page-table directories have gained more and more levels, and a 4-level page-table walk for address translation is considerable overhead. Hardware vendors provide the TLB to accelerate virtual-to-physical address translation, but the TLB is limited after all. When accessing memory with contiguous addresses, the TLB gets more hits, and the CACHE also has a better chance of hitting.
The two pieces of code below implement the same function, but the first method has noticeably better memory read/write efficiency in practice, especially when the allocated memory is large (malloc failure handling is omitted):

//Method 1: one malloc; all MAX structures sit in one contiguous block
#define MAX 100
int i;
char *p;
struct box_d *box[MAX];
p = (char *)malloc(sizeof(struct box_d) * MAX);
for (i = 0; i < MAX; i++)
{
    box[i] = (struct box_d *)(p + sizeof(struct box_d) * i);
}

//Method 2: MAX separate mallocs; the structures may be scattered in memory
#define MAX 100
int i;
struct box_d *box[MAX];
for (i = 0; i < MAX; i++)
{
    box[i] = (struct box_d *)malloc(sizeof(struct box_d));
}

In addition, using a paging mechanism with larger pages (such as 2M or 1G) can also improve performance. Compared with the traditional 4K pages, an application requesting the same amount of memory needs far fewer large pages and therefore occupies fewer TLB entries; this reduces the number of virtual-to-physical translations, improves the TLB hit rate, and shortens the time each translation takes. Because most operating systems must align allocations to page boundaries, the drawback of large pages is comparatively serious memory waste. Large pages only show their advantage when physical memory is plentiful.
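
On Linux, one way to request large pages explicitly is mmap with MAP_HUGETLB (a minimal sketch; it assumes huge pages have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages, and simply reports failure otherwise):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)  /* one 2M huge page */

int main(void)
{
    /* MAP_HUGETLB asks for huge pages: fewer pages means fewer TLB entries */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* likely no huge pages reserved */
        return 1;
    }
    memset(p, 0, LEN);  /* touch the memory */
    munmap(p, LEN);
    return 0;
}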

4. Other optimizations
4.1 Prefetch instructions for reading memory
Prefetching data from memory into the CACHE ahead of time, to raise the CACHE hit rate and speed up memory reads, is the main purpose of prefetch instructions. If the current computation is relatively heavy, prefetching can run in parallel with it, hiding the memory access latency of the next step. The corresponding prefetch assembly instructions are prefetcht0, prefetcht1, prefetcht2 and prefetchnta.
A prefetch instruction is only a hint to the CPU, so the CPU may ignore it, and even prefetching an invalid address will not cause a CPU exception. prefetchnta is generally used because it does not pollute the CACHE: it brings each fetched line in while occupying only a limited part of the cache (for example, one way of the L2 CACHE on some implementations), whereas the other three instructions fill the cache normally, evicting the least recently used CACHE LINEs.
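
With GCC or Clang, these instructions can be reached portably through __builtin_prefetch (a minimal sketch; a locality hint of 0 typically compiles to prefetchnta on x86):

#include <stddef.h>

/* Prefetch a few CACHE LINEs ahead while summing, so the memory fetch
 * overlaps with the computation on the current elements. */
long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0);  /* read, non-temporal */
        sum += data[i];
    }
    return sum;
}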

4.2. Non-temporal move instructions for writing memory
We know that to keep the data in the CACHE and in memory consistent, a CPU write to the CACHE is synchronized to memory in one of two main ways: Write Through and Write Back. Either way costs performance, and in some cases writing to the CACHE is simply unnecessary.
In which situations is writing the CACHE unnecessary? For example, when copying data (an efficient memcpy implementation), or when we already know that the data being written will not be used in the near future (or ever). In such cases we can bypass the CACHE, letting the corresponding CACHE LINEs be invalidated so they can cache other data; this is very useful in some special scenarios. The corresponding assembly instructions are movntq, movntsd, movntss, movntps, movntpd, movntdq and movntdqa.
A complete high-speed memory copy function built from prefetch and non-temporal move instructions:

void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size_t)
{
  //dest and src must be 16-byte aligned (movdqa), and the size a multiple of 128
  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer
    mov ebx, size_t; //ebx is our counter
    shr ebx, 7;      //divide by 128 (8 * 128bit registers)
 
    loop_copy:
      prefetchnta 128[ESI]; //SSE2 prefetch: pull the next block in early
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];
 
      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];
 
      movntdq 0[EDI], xmm0; //move data from registers to dest, bypassing CACHE
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;
 
      add esi, 128;
      add edi, 128;
      dec ebx;
 
      jnz loop_copy; //loop please
    loop_copy_end:
      sfence; //make the non-temporal stores globally visible before returning
  }
}
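
For compilers without MSVC-style inline assembly, the same technique can be expressed with SSE2 intrinsics. A minimal sketch under the same assumptions (both pointers 16-byte aligned, size a multiple of 16):

#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

void stream_memcpy(void *dest, const void *src, size_t size)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dest;
    size_t i;
    for (i = 0; i < size / 16; i++) {
        __m128i x = _mm_load_si128(&s[i]);  /* aligned 128-bit load (movdqa) */
        _mm_stream_si128(&d[i], x);         /* non-temporal store (movntdq) */
    }
    _mm_sfence();  /* order the streaming stores */
}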

Summary:
To access memory efficiently, we must make full use of the system CACHE, because CACHE access is currently much faster than memory access. The specific optimization methods are:
1. Compress the structure size through careful design.
2. Align structures to the machine word (or a multiple of it) as far as possible.
3. Place the frequently accessed fields of a structure at machine-word-aligned positions as far as possible.
4. Allocate structure variables that are frequently read and written together at the same time, so that they are distributed over a small range of linear addresses and make good use of the TLB.
5. When a structure is relatively large, initialize or set its fields in order starting from the first field, so that memory is accessed sequentially.
6. For additional optimization, use non-temporal move instructions (such as movntdq) and prefetch instructions (such as prefetchnta).
7. In special cases, consider SIMD instruction sets such as SSE2 and SSE4.
Of course, some of these points conflict with one another, such as compressing the structure versus aligning it, and they must be weighed together.

Reprinted from: http://www.lenky.info/archives/2011/11/310
