What is memory out of order access?

What is memory out of order access?


It's more and more interesting to keep digging into the underlying principles of the computer. Today, let's talk about the topic of memory out of order execution.
First of all, ask a question: will our written programs be executed in the established order?
There seems to be no doubt about it. However, understanding the "bottom" knowledge of compilation and linking principles will not be easy to come to a conclusion. This problem is also exposed when using multithreading and memory sharing without locking.
So unfortunately, in some cases, the execution order of program instructions will change, which leads to what we call Memory out of order problem.

Out - of - order execution technology is the optimization that the processor makes against the original order of code in order to improve the operation speed

But fortunately, we can take the hand to correct "disorder" to "order".
Memory out of order access is generally divided into two types: compilation out of order and execution out of order. Next, we illustrate the phenomenon and introduce the methods to avoid disorder.

1. Compilation disorder

The fundamental reason for compiler out of order optimization is that the processor can only analyze a small piece of instructions at a time, but the compiler can analyze the code in a large range, so as to make a better strategy.

Let's write two simple lines of program to reproduce the performance of compilation disorder.

int x, y, z;
void fun(){
    x = y;
    z = 1;
}

View the compiled assembly instructions through gcc. Here we use O3 optimization level:

gcc -S demo.c -O3

Intercept a code we focus on:

fun:
.LFB0:
	.cfi_startproc
	endbr64
	movl	$1, z(%rip)  " z = 1
	movl	y(%rip), %eax
	movl	%eax, x(%rip) " x = y
	ret
	.cfi_endproc

Obviously, the compiler swapped x = y; z = 1; The execution order of the two statements.

So how to solve the trouble caused by disordered compilation? There are several solutions:

  • Compile optimization level
  • volatile
  • Compiler barrier
  • Lock

1.1 compilation optimization level

We adjust the compilation optimization level to O0 and observe the effect.
gcc -S demo.c -O0

fun:
.LFB0:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	y(%rip), %eax   " x = y
	movl	%eax, x(%rip)
	movl	$1, z(%rip)	  " z = 1
	nop
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc

Generally, hardware devices are compiled with the optimization level of - Os, which is between - O2 and - O3. The differences are as follows:

  • -Os minimizes the size of the object code on the basis of - O2;
  • -O3 will try its best to improve the running speed, even if it increases the size of the object code

1.2 using volatile

Volatile keyword is no stranger to us. When accessing a variable modified by volatile, we force access to the value in memory instead of the value in the cache. The variable declared by volatile indicates that the variable may change at any time. Do not compile and optimize the operations related to the variable to avoid errors
volatile official description
Therefore, using volatile to modify variables, that is, using O3 level optimization, will not change the order of statements.

volatile int x, y, z;
void fun(){
    x = y;
    z = 1;
}

Compilation result:

fun:
.LFB0:
	.cfi_startproc
	endbr64
	movl	y(%rip), %eax
	movl	%eax, x(%rip)
	movl	$1, z(%rip)
	ret
	.cfi_endproc

1.3 compiler barrier

The Linux kernel provides the function barrier() to let the compiler ensure that its previous memory access is completed before its subsequent memory access. This can prevent the code before the compilation barrier and the code after the compilation barrier from being out of order.

#define barrier() _asm_ _volatile_("": : :"memory")

Continue overwriting source program:

 int x, y, z;
void fun(){
    x = y;
    __asm__ __volatile__("": : :"memory");
    z = 1;
}

Compilation result:

fun:
.LFB0:
	.cfi_startproc
	endbr64
	movl	y(%rip), %eax
	movl	%eax, x(%rip)
	movl	$1, z(%rip)
	ret
	.cfi_endproc

1.4 locking

Locking shared memory is necessary, which can save a lot of trouble.

#include <pthread.h>
pthread_mutex_t m;

 int x, y, z;
void fun(){
    pthread_mutex_lock(&m);
    x = y;
    pthread_mutex_unlock(&m);
    z = 1;
}

Compilation result:

fun:
.LFB1:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	leaq	m(%rip), %rdi
	call	pthread_mutex_lock@PLT
	movl	y(%rip), %eax
	leaq	m(%rip), %rdi
	movl	%eax, x(%rip)
	call	pthread_mutex_unlock@PLT
	movl	$1, z(%rip)
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

2. Disordered operation

When running, the CPU itself will execute instructions out of order.
Early processors were in order processors, which always executed instructions in the order written by the developer. If the input operands of the instructions were unavailable (usually due to the need to get them from memory), the processor would not execute the instructions available to the input operands, but wait for the current input operands to be available.

In contrast, out of order processors will first process the instructions with available input operation objects (rather than sequential execution), so as to avoid waiting and improve efficiency. On modern computers, processors run much faster than memory, and ordered processors can process a large number of instructions in the time they spend waiting for available data. Even if modern processors execute out of order, on a single CPU, instructions can be obtained and executed through the instruction queue sequence, and the results are returned to the register file in the queue sequence (for details, please refer to http:// http://en.wikipedia.org/wiki/Out-of-order_execution ), which makes all memory access operations seem to be executed in the order in which the program code is written, Therefore, the memory barrier is unnecessary (provided that compiler optimization is not considered).

The following is an example to illustrate the phenomenon and solutions.

/*=============================================================================
*
* Author: Terrance[wangtao_27@qq.com]
*
* Official account: embedded Island
*
* Last modified: 2021-11-13 23:02
*
* Filename: cpuchaos.c
*
* Description: Access and prevention of memory out of order execution
*
=============================================================================*/
#define _GNU_SOURCE
#include  <stdio.h>
#include  <stdlib.h>
#include  <pthread.h>
#include  <string.h>

int x, y, p, q;
int runtime = 0;

static pthread_barrier_t barrier_start;
static pthread_barrier_t barrier_end;

static void *thread1(void *args)
{
    for (; ;){
        pthread_barrier_wait(&barrier_start);
        x = 1;
#ifdef CPU_MEM_FENCE
        __asm__ __volatile__("mfence":::"memory"); // CPU memory barrier
#endif
        p = y;
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}

static void *thread2(void *args)
{
    for (; ;){
        pthread_barrier_wait(&barrier_start);
        y = 1;
#ifdef CPU_MEM_FENCE
        __asm__ __volatile__("mfence":::"memory");
#endif
        q = x;
        pthread_barrier_wait(&barrier_end);
    }
    return NULL;
}
void start(void)
{
    x = y = p = q = 0;
}

void end(void)
{
    ++runtime;
    printf("[%d] %d %d\n", runtime, p, q);

    /* In case of disorder, terminate the program */
    if (p == 0 && q == 0){ 
        puts("chaos coming!");
        exit(-1);
    }

}

int main(int argc, char *argv[])
{
    int err;
    pthread_t t1, t2;

    err = pthread_barrier_init(&barrier_start, NULL, 3);
    if (err != 0){
        perror("pthread_barrier_init");
        exit(-1);
    }

    err = pthread_barrier_init(&barrier_end, NULL, 3);
    if (err != 0){
        perror("pthread_barrier_init");
        exit(-1);
    }

    /* create thread */
    err = pthread_create(&t1, NULL, thread1, NULL);
    if (err != 0){
        perror("pthread_create");
        exit(-1);
    }

    err = pthread_create(&t2, NULL, thread2, NULL);
    if (err != 0){
        perror("pthread_create");
        exit(-1);
    }
    /* Thread 1 is bound to CPU0 for execution */
    cpu_set_t cst;
    CPU_ZERO(&cst);
    CPU_SET(0, &cst);
    err = pthread_setaffinity_np(t1, sizeof(cst), &cst);
    if (err != 0){
        perror("pthread_setaffinity_np");
        exit(-1);
    }
	
    /* Thread 2 is bound to CPU1 for execution */
    CPU_ZERO(&cst);
    CPU_SET(1, &cst);
    err = pthread_setaffinity_np(t2, sizeof(cst), &cst);
    if (err != 0){
        perror("pthread_setaffinity_np");
        exit(-1);
    }

    for (;;){
        start();
        pthread_barrier_wait(&barrier_start);
        pthread_barrier_wait(&barrier_end);
        end();
    }

    return 0;
}
# linux @ ubuntu in ~/codelab/c/Nov [21:35:52] 
$ gcc cpuchaos.c  -o chaos -lpthread

# linux @ ubuntu in ~/codelab/c/Nov [21:35:53] 
$ ./chaos 
[1] 1 0
[2] 1 0
[3] 1 0
[4] 1 0
[5] 1 0
[6] 1 0
[7] 0 1
......
[6000] 0 1
[6001] 1 0
[6002] 1 0
[6003] 1 0
[6004] 1 0
[6005] 0 0
chaos coming!

Disorder occurred and the program terminated.

# linux @ ubuntu in ~/codelab/c/Nov [21:35:58] C:255
$ gcc cpuchaos.c  -o chaos -lpthread -DCPU_MEM_FENCE

# linux @ ubuntu in ~/codelab/c/Nov [21:37:54] 
$ ./chaos 
[1] 1 0
[2] 1 0
[3] 1 0
[4] 1 0
[5] 1 0
[6] 1 0
[7] 0 1
......
[405185] 0 1
[405186] 0 1
[405187] 0 1
[405188] 0 1
[405189] 0 1
[405190] 0 1
[405191] 0 1
[405192] 0 1
[405193] 0 1
^C

After running more than 400000 times, there is no disorder, and the memory barrier takes effect.

However, if the hardware product is a single core, there is no need to worry about out of order.

3. Summary

This paper discusses the phenomenon of memory disorder, including compilation disorder and execution disorder. Therefore, for shared data, the lock can basically avoid the memory optimization problem.

Keywords: memory management compiler

Added by yeehawjared on Sat, 20 Nov 2021 16:11:08 +0200