[OpenMP learning notes] Compiler directives

Preface

OpenMP parallelizes a serial program by inserting compiler directives into it. A compiler that supports OpenMP recognizes and processes these directives and implements the corresponding functionality. All directives start with #pragma omp, followed by a specific directive name and optional clauses. The general format is as follows:

#pragma omp directive [clause [[,] clause]...]
    structured block

Parallel construct (parallel region)

To make a program execute in parallel, we first need to construct a parallel region. This is done with the parallel directive, whose syntax is as follows:

#pragma omp parallel  [clause [[,] clause]...]
     structured block

As we can see, a parallel keyword is simply added after omp. The main function of this directive is to construct a parallel region: it creates a team of threads that execute the enclosed block concurrently. Note that the directive only makes the code run in parallel; it does not distribute work among the threads. At the end of the parallel region there is an implicit barrier that synchronizes all threads in the region. Here is an example:

void parallel_construct() {
    #pragma omp parallel 
    {
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
}

Here omp_get_thread_num() returns the number of the current thread. This function is declared in <omp.h>. The output is as follows:

Hello from thread 1
Hello from thread 3
Hello from thread 0
Hello from thread 2

The parallel directive can be followed by the clauses listed below:

if(scalar-expression)

num_threads(integer-expression)

private(list)

firstprivate(list)

shared(list)

default(none | shared)

copyin(list)

reduction(operator:list)

The usage of these clauses will be introduced later
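
Before that, here is a quick minimal sketch (not from the original notes) of two of these clauses, if and num_threads; the threshold of 100 and the thread count of 4 are arbitrary choices for the example:

#include <stdio.h>
#include <omp.h>

void parallel_clauses(int n) {
    // Run with 4 threads only when n is large enough; otherwise the
    // region is executed by a team of just one thread.
    #pragma omp parallel if(n > 100) num_threads(4)
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
}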

Work-sharing constructs

Work-sharing directives are mainly used to distribute different tasks to the threads. A work-sharing region must be associated with an active parallel region; if a work-sharing directive appears in an inactive parallel region or in serial code, it is simply ignored. In C/C++ there are three work-sharing directives: for, sections and single. Strictly speaking, only for and sections are work-sharing constructs; single is merely a directive that assists work sharing.

for

Used on a for loop to distribute the iterations among the threads. The syntax is as follows:

#pragma omp for [clause[[,] clause]...]
    for-loop

Here is an example:

void parallel_for() {
    int n = 9;
    int i = 0;
    #pragma omp parallel shared(n) private(i) 
    {
        #pragma omp for
        for(i = 0; i < n; i++) {
            printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(),i);
        }
    }
}

The following is the result of program execution

Thread 2 executes loop iteration 5
Thread 2 executes loop iteration 6
Thread 3 executes loop iteration 7
Thread 3 executes loop iteration 8
Thread 0 executes loop iteration 0
Thread 0 executes loop iteration 1
Thread 0 executes loop iteration 2
Thread 1 executes loop iteration 3
Thread 1 executes loop iteration 4

In the above program, 4 threads execute 9 iterations in total: one thread is assigned 3 iterations and the remaining threads 2 each. This is the default scheduling method: with n iterations and t threads, each thread is assigned n/t or n/t + 1 consecutive iterations. In some cases this is not the best choice; we can use the schedule clause to specify the scheduling method, which will be described in detail later. The following clauses can follow the for directive:

private(list)

firstprivate(list)

lastprivate(list)

reduction(operator:list)

ordered

schedule(kind[,chunk_size])

nowait

sections

The sections directive assigns different code sections to different threads. The syntax is as follows:

#pragma omp sections [clause[[,] clause]...] 
    {
        [#pragma omp section]
            structured block
        [#pragma omp section]
            structured block
        ...
    }

From the above code, we can see that sections divide the code into multiple sections, and each thread processes one section. The following is an example:

/**
 * Use #pragma omp sections and #pragma omp section to let different threads perform different tasks
 * If the number of threads is greater than the number of sections, the extra threads are idle
 * If the number of threads is less than the number of sections, one thread executes multiple sections
 */

void funcA() {
    printf("In funcA: this section is executed by thread %d\n",
            omp_get_thread_num());
}

void funcB() {
    printf("In funcB: this section is executed by thread %d\n",
            omp_get_thread_num());
}

void parallel_section() {
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section 
            {
                (void)funcA();
            }

            #pragma omp section 
            {
                (void)funcB();
            }
        }
    } 
}

Here are the execution results:

In funcA: this section is executed by thread 3
In funcB: this section is executed by thread 0

Here are the clauses that can follow the sections directive:

private(list)

firstprivate(list)

lastprivate(list)

reduction(operator:list)

nowait

single

The single directive specifies that a block of code is executed by only one thread. Without a nowait clause, all threads synchronize at the implicit barrier at the end of the single construct; with nowait, the other threads continue directly. Note that single does not specify which thread executes the block. The syntax is as follows:

#pragma omp single [clause[[,] clause]...]
    structured block

The following is a usage example

void parallel_single() {
    int a = 0, n = 10, i;
    int b[n];
    #pragma omp parallel shared(a, b) private(i)
    {
        // Only one thread will execute this code, and other threads will wait for the thread to finish executing
        #pragma omp single 
        {
            a = 10;
            printf("Single construct executed by thread %d\n", omp_get_thread_num());
        }

        // A barrier is automatically inserted here
        
        #pragma omp for
        for(i = 0; i < n; i++) {
            b[i] = a;
        }
    }

    printf("After the parallel region:\n");
    for (i=0; i<n; i++)
        printf("b[%d] = %d\n",i,b[i]);
}

Here are the execution results:

Single construct executed by thread 2
After the parallel region:
b[0] = 10
b[1] = 10
b[2] = 10
b[3] = 10
b[4] = 10
b[5] = 10
b[6] = 10
b[7] = 10
b[8] = 10
b[9] = 10

The following clauses can follow the single directive (a short sketch of copyprivate follows the list):

private(list)

firstprivate(list)

copyprivate(list)

nowait
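
The copyprivate clause is not demonstrated elsewhere in these notes, so here is a minimal sketch of my own: it broadcasts the value computed by the one executing thread to the private copies of all other threads.

void test_copyprivate() {
    int x;
    #pragma omp parallel private(x)
    {
        // One thread sets x; copyprivate then copies its value into the
        // private x of every other thread before they continue.
        #pragma omp single copyprivate(x)
        {
            x = 42;
        }
        printf("Thread %d sees x = %d\n", omp_get_thread_num(), x);
    }
}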

Combined Parallel Work-Sharing Constructs

Combining the parallel directive with a work-sharing directive makes the code more concise, as shown in the following code:

#pragma omp parallel
{
    #pragma omp for
    for(.....)
}

Can be written as

#pragma omp parallel for
    for(.....)

Using these combined constructs not only improves readability but can also help performance: the compiler knows what comes next and may be able to generate more efficient code.
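
As a small, self-contained sketch of the combined form (the function and array here are purely illustrative):

void parallel_for_combined(int n, double *a) {
    int i;
    // One directive both creates the thread team and distributes the iterations.
    #pragma omp parallel for shared(a, n) private(i)
    for (i = 0; i < n; i++) {
        a[i] = 2.0 * i;
    }
}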

Clauses to Control Parallel and Work-Sharing Constructs

OpenMP directives can be followed by clauses that control the behavior of the constructs. Here are some common clauses.

shared

The shared clause is used to specify which data is shared between threads. The syntax form is shared(list). The following is how to use it:

#pragma omp parallel for shared(a)
    for(i = 0; i < n; i++)
    {
        a[i] += i;
    }

When a shared variable is written inside the parallel region, access to it must be protected: several threads may modify it at the same time, or one thread may read it while another is updating it, which can lead to incorrect results.
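
For example, if every iteration adds into a single shared variable, the update needs such protection; a minimal sketch using the atomic construct described later (function and variable names are illustrative only):

void test_shared_sum(int n, int *a) {
    int i, total = 0;
    #pragma omp parallel for shared(a, n, total) private(i)
    for (i = 0; i < n; i++) {
        // Without the atomic, concurrent updates of total could be lost.
        #pragma omp atomic
        total += a[i];
    }
    printf("total = %d\n", total);
}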

private

The private clause specifies which variables are private to each thread, i.e. each thread gets its own copy and the threads do not affect each other. The syntax is private(list), and the usage is as follows:

void test_private() {
    int n = 8;
    int i=2, a = 3;
    // After i and a are declared private, their original values are not changed
    #pragma omp parallel for private(i, a)
    for ( i = 0; i<n; i++)
    {
        a = i+1;
        printf("In for: thread %d has a value of a = %d for i = %d\n", omp_get_thread_num(),a,i);
    }

    printf("\n"); 
    printf("Out for: thread %d has a value of a = %d for i = %d\n", omp_get_thread_num(),a,i);
}

The following is the result of the program:

In for: thread 2 has a value of a = 5 for i = 4
In for: thread 2 has a value of a = 6 for i = 5
In for: thread 3 has a value of a = 7 for i = 6
In for: thread 3 has a value of a = 8 for i = 7
In for: thread 0 has a value of a = 1 for i = 0
In for: thread 0 has a value of a = 2 for i = 1
In for: thread 1 has a value of a = 3 for i = 2
In for: thread 1 has a value of a = 4 for i = 3

Out for: thread 0 has a value of a = 3 for i = 2

For variables in the private clause, you need to pay attention to the following two points:

  • Whether or not the variable has an initial value, it is uninitialized on entry to the parallel region
  • Modifications to the variable inside the parallel region take effect only there; after leaving the region, the variable keeps the value it had before entering

lastprivate

lastprivate saves the last value of the listed variable when exiting the parallel region. It can be applied to for and sections, with the syntax lastprivate(list). The "last value" is defined as follows: for the for directive it is the value after the iteration that would be executed last in a serial run; for the sections directive it is the value after the lexically last section containing the variable has executed. The usage is as follows:

void test_last_private() {
    int n = 8;
    int i=2, a = 3;
    // lastprivate copies the value that a has in the last iteration (i == n-1) out of the loop
#pragma omp parallel for private(i) lastprivate(a)
    for ( i = 0; i<n; i++)
    {
        a = i+1;
        printf("In for: thread %d has a value of a = %d for i = %d\n", omp_get_thread_num(),a,i);
    }

    printf("\n");
    printf("Out for: thread %d has a value of a = %d for i = %d\n", omp_get_thread_num(),a,i);
}

The program execution result is:

In for: thread 3 has a value of a = 7 for i = 6
In for: thread 3 has a value of a = 8 for i = 7
In for: thread 2 has a value of a = 5 for i = 4
In for: thread 2 has a value of a = 6 for i = 5
In for: thread 1 has a value of a = 3 for i = 2
In for: thread 0 has a value of a = 1 for i = 0
In for: thread 0 has a value of a = 2 for i = 1
In for: thread 1 has a value of a = 4 for i = 3

Out for: thread 0 has a value of a = 8 for i = 2

firstprivate

The firstprivate clause provides an initial value for a private variable: a variable listed in firstprivate is initialized in each thread with the value the variable of the same name had before the construct. The syntax is firstprivate(list), and the usage is as follows:

void test_first_private() {
    int n = 8;
    int i=0, a[n];

    for(i = 0; i < n ;i++) {
        a[i] = i+1;
    }
#pragma omp parallel for private(i) firstprivate(a)
    for ( i = 0; i<n; i++)
    {
        printf("thread %d: a[%d] is %d\n", omp_get_thread_num(), i, a[i]);
    }
}

The results are as follows:

thread 0: a[0] is 1
thread 0: a[1] is 2
thread 2: a[4] is 5
thread 2: a[5] is 6
thread 3: a[6] is 7
thread 3: a[7] is 8
thread 1: a[2] is 3
thread 1: a[3] is 4

default

The default clause sets the default data-sharing attribute of variables. In C/C++ only default(none | shared) is supported: default(shared) makes all variables shared by default, while default(none) removes the default, so every variable used in the region must be explicitly specified as shared or private.
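
A minimal sketch of default(none) (the names are illustrative): every variable used in the region must now be listed explicitly, otherwise the compiler reports an error.

void test_default(int n, int *a) {
    int i;
    // With default(none), forgetting to list a, n or i here is a compile-time error.
    #pragma omp parallel for default(none) shared(a, n) private(i)
    for (i = 0; i < n; i++) {
        a[i] = i;
    }
}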

nowait

It cancels the implicit barrier at the end of a work-sharing construct. The following is an example:

void test_nowait() {
    int i, n =6;
    #pragma omp parallel 
    {
        #pragma omp for nowait 
        for(i = 0; i < n; i++) {
            printf("thread %d: ++++\n", omp_get_thread_num());
        }

        #pragma omp for
        for(i = 0; i < n; i++) {
            printf("thread %d: ----\n", omp_get_thread_num());
        }
    }
}

If the first for is not followed by nowait, the output is as follows:

thread 3: ++++
thread 0: ++++
thread 0: ++++
thread 2: ++++
thread 1: ++++
thread 1: ++++
thread 0: ----
thread 0: ----
thread 3: ----
thread 1: ----
thread 1: ----
thread 2: ----

Because the for construct has an implicit barrier, all threads are synchronized at the end of the first loop before execution continues. Adding nowait removes this barrier, so a thread can start the second loop without waiting for the others to finish the first one. The following is the output with nowait added:

thread 2: ++++
thread 2: ----
thread 1: ++++
thread 1: ++++
thread 1: ----
thread 1: ----
thread 3: ++++
thread 3: ----
thread 0: ++++
thread 0: ++++
thread 0: ----
thread 0: ----

When using nowait, pay attention to dependences between the two loops: if the second loop needs results from the first, nowait may cause incorrect results.

schedule

The schedule clause applies only to loop constructs and sets how the loop iterations are scheduled. The syntax is schedule(kind[,chunk_size]), where kind is one of static, dynamic, guided, auto and runtime, and chunk_size is optional. Here's how to use it:

void test_schedule() {
    int i, n = 10;

#pragma omp parallel for default(none) schedule(static, 2) \
    private(i) shared(n)
    for(i = 0; i < n; i++) {
        printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());
    }
}

Let's go through the meaning of each kind. Suppose there are n iterations and t threads.

static

Static scheduling. If chunk_size is not specified, each thread is assigned n/t (or n/t + 1 when n is not divisible by t) consecutive iterations. If chunk_size is specified, chunks of chunk_size consecutive iterations are handed to the threads in turn; if the first round of allocation does not cover all iterations, allocation continues cyclically. Assuming n = 8 and t = 4, the following table shows the allocation when chunk_size is unspecified, equal to 1, and equal to 3:

Thread \ chunk_size    Unspecified    chunk_size = 1    chunk_size = 3
0                      0 1            0 4               0 1 2
1                      2 3            1 5               3 4 5
2                      4 5            2 6               6 7
3                      6 7            3 7               (none)

dynamic

Dynamic scheduling assigns iterations to threads on demand: whenever a thread is idle it receives the next chunk of iterations, so threads that compute faster are assigned more iterations. If chunk_size is not specified, one iteration is handed out at a time (equivalent to chunk_size = 1); if it is specified, chunk_size iterations are handed out at a time. With dynamic scheduling the allocation is not fixed: running the same program repeatedly generally gives a different allocation each time. For n = 12 and t = 4, the table below shows the allocation with chunk_size unspecified and with chunk_size = 2 (each run twice):

Thread \ chunk_size    Unspecified (1st run)      Unspecified (2nd run)      chunk_size = 2 (1st run)    chunk_size = 2 (2nd run)
0                      2                          0                          4 5 8 9 10 11               0 1
1                      0 4 5 6 7 8 9 10 11        3                          0 1                         4 5
2                      3                          1 4 5 6 7 8 9 10 11        2 3                         6 7
3                      1                          2                          6 7                         2 3 8 9 10 11

Using dynamic can reduce load imbalance to some extent, but note that dynamically requesting work incurs some overhead.

guided

Guided scheduling is a heuristic self-scheduling method. At the beginning each thread is assigned a relatively large block of iterations, and the size of the blocks handed out then decreases gradually. If chunk_size is specified, the block size decreases exponentially down to chunk_size; if it is not specified, it decreases down to 1 (equivalent to chunk_size = 1). As with dynamic scheduling, threads that finish their blocks quickly are assigned more work; the difference is that the block size varies. Again the allocation is not fixed, and repeated runs give different allocations. The table below shows the allocation for n = 20 and t = 4 with chunk_size unspecified and with chunk_size = 3 (each run twice):

Thread \ chunk_size    Unspecified (1st run)      Unspecified (2nd run)         chunk_size = 3 (1st run)      chunk_size = 3 (2nd run)
0                      12 13                      0 1 2 3 4                     0 1 2 3 4                     5 6 7 8, 18 19
1                      5 6 7 8, 16 17, 18 19      5 6 7 8                       9 10 11                       9 10 11
2                      0 1 2 3 4, 14 15           9 10 11, 14 15 16, 17 18 19   5 6 7 8, 15 16 17, 18 19      0 1 2 3 4, 15 16 17
3                      9 10 11                    12 13                         12 13 14                      12 13 14

When chunk_size = 3, only iterations 18 and 19 are left at the end, so the last thread to receive work gets only 2 iterations.

[Figure: allocation of 200 iterations to 4 threads under static, (dynamic,7) and (guided,7) scheduling]

runtime

Runtime scheduling is not a real scheduling method: the scheduling type is read at run time from the environment variable OMP_SCHEDULE, and the final scheduling is still one of the methods above. Under bash it can be set as follows:

export OMP_SCHEDULE="static"
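
The same binary can then be run with different schedules, e.g. OMP_SCHEDULE="dynamic,2". As an alternative sketch (assuming an OpenMP 3.0 or later runtime, where omp_set_schedule is available), the runtime schedule can also be set from code:

void test_runtime_schedule(int n) {
    int i;
    // Equivalent to OMP_SCHEDULE="dynamic,2"; requires OpenMP >= 3.0.
    omp_set_schedule(omp_sched_dynamic, 2);

    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++) {
        printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());
    }
}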

auto

The choice is left to the compiler and runtime, which pick whatever scheduling they consider appropriate.

Load imbalance

In a for loop, if different iterations take different amounts of time, load imbalance can occur. The following code simulates this situation:

void test_schedule() {
    int i,j, n = 10;

    double start, end;
    GET_TIME(start);
#pragma omp parallel for default(none) schedule(static) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        //printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("static   : use time %.2fs\n", end-start);

    GET_TIME(start);
#pragma omp parallel for default(none) schedule(static,2) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("static,2 : use time %.2fs\n", end-start);

    GET_TIME(start);
#pragma omp parallel for default(none) schedule(dynamic) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("dynamic  : use time %.2fs\n", end-start);

    GET_TIME(start);
#pragma omp parallel for default(none) schedule(dynamic, 2) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("dynamic,2: use time %.2fs\n", end-start);

    GET_TIME(start);
#pragma omp parallel for default(none) schedule(guided) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("guided   : use time %.2fs\n", end-start);

    GET_TIME(start);
#pragma omp parallel for default(none) schedule(guided, 2) \
    private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("guided,2 : use time %.2fs\n", end-start);
}

GET_TIME is defined as follows:

#ifndef _TIMER_H_
#define _TIMER_H_

#include <sys/time.h>
#include <time.h>
#include <stdio.h>

#define GET_TIME(now) { \
   struct timeval t; \
   gettimeofday(&t, NULL); \
   now = t.tv_sec + t.tv_usec/1000000.0; \
}

#endif

In the above code, the larger i is, the more time iteration i takes. Here is the output for n = 10:

static   : use time 1.74s
static,2 : use time 1.84s
dynamic  : use time 1.53s
dynamic,2: use time 1.84s
guided   : use time 1.63s
guided,2 : use time 1.53s

Here is the output of n=20

static   : use time 8.67s
static,2 : use time 6.42s
dynamic  : use time 5.62s
dynamic,2: use time 6.43s
guided   : use time 5.92s
guided,2 : use time 6.43s

For static scheduling without chunk_size, the last few iterations, which are also the most time-consuming ones, all go to the last thread; the other threads then sit idle waiting for it, wasting resources and causing load imbalance. dynamic and guided can reduce the imbalance to some extent, but not always; the best choice depends on the specific problem.

Synchronization constructs

Synchronization directives are mainly used to control access to shared variables by multiple threads. They can ensure that threads update shared variables in a certain order, or that two or more threads do not modify a shared variable at the same time.

barrier

Synchronization barrier: when a thread reaches a barrier it must stop and wait until all threads in the parallel region have reached the barrier point. There is an implicit barrier at the end of every parallel region and work-sharing region, i.e. after the constructs created by parallel, for, sections and single, so in many cases there is no need to insert a barrier explicitly. The syntax is:

#pragma omp barrier

Here is an example:

void print_time(int tid, char* s ) {
    int len = 10;
    char buf[len];
    NOW_TIME(buf, len);
    printf("Thread %d %s at %s\n", tid, s, buf);
}

void test_barrier() {
    int tid;
#pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        if(tid < omp_get_num_threads() / 2)
            system("sleep 3");
        print_time(tid, "before barrier ");
        
        #pragma omp barrier

        print_time(tid, "after  barrier ");
    }
}

where NOW_TIME is defined as follows:

#ifndef _TIMER_H_
#define _TIMER_H_

#include <sys/time.h>
#include <time.h>
#include <stdio.h>

#define NOW_TIME(buf, len) { \
    time_t nowtime; \
    nowtime = time(NULL); \
    struct tm *local; \
    local = localtime(&nowtime); \
    strftime(buf, len, "%H:%M:%S", local); \
}

#endif

In the above code, half of the threads (those with tid < 2 when 4 threads are used) sleep for 3 seconds before continuing. First, look at the output without the barrier, i.e. with #pragma omp barrier removed:

Thread 3 before barrier  at 16:55:44
Thread 2 before barrier  at 16:55:44
Thread 3 after  barrier  at 16:55:44
Thread 2 after  barrier  at 16:55:44
Thread 1 before barrier  at 16:55:47
Thread 0 before barrier  at 16:55:47
Thread 0 after  barrier  at 16:55:47
Thread 1 after  barrier  at 16:55:47

The output with the barrier added is shown below:

Thread 3 before barrier  at 17:05:29
Thread 2 before barrier  at 17:05:29
Thread 0 before barrier  at 17:05:32
Thread 1 before barrier  at 17:05:32
Thread 0 after  barrier  at 17:05:32
Thread 1 after  barrier  at 17:05:32
Thread 2 after  barrier  at 17:05:32
Thread 3 after  barrier  at 17:05:32

Comparing the two, we can see that with the barrier each thread synchronizes at the barrier point before continuing.

ordered

The ordered construct allows a piece of code inside a parallel loop to be executed in serial order. For example, if we want different threads to print the data they computed in order, we can use this construct. The syntax is:

#pragma omp ordered
    structured block

There are two points to pay attention to when using it:

  • ordered works only on loop constructs
  • when using ordered, the ordered clause must also be added to the directive that constructs the parallel loop, as shown below

The following is a usage example:

void test_order() {
    int i, tid, n = 5;
    int a[n];
    for(i = 0; i < n; i++) {
        a[i] = 0;
    }

#pragma omp parallel for default(none) ordered  schedule(dynamic) \
    private (i, tid) shared(n, a)
    for(i = 0; i < n; i++) {
        tid = omp_get_thread_num();
        printf("Thread %d updates a[%d]\n", tid, i);

        a[i] += i;

        #pragma omp ordered
        {
            printf("Thread %d printf value of a[%d] = %d\n", tid, i, a[i]);
        }
    }
}

The following is the result of the program:

Thread 0 updates a[0]
Thread 2 updates a[2]
Thread 1 updates a[3]
Thread 0 printf value of a[0] = 0
Thread 0 updates a[4]
Thread 3 updates a[1]
Thread 3 printf value of a[1] = 1
Thread 2 printf value of a[2] = 2
Thread 1 printf value of a[3] = 3
Thread 0 printf value of a[4] = 4

From the output we can see that the updates happen in arbitrary order, but the values are printed in serial order.

critical

Critical section: a critical section ensures that only one thread at a time executes the code inside it. To enter the critical section, a thread must wait until it is free. The syntax is:

#pragma omp critical [(name)]
    structured block

Here name is an optional name for the critical section. The following is a summation example; note that it is only meant to illustrate the critical section, since for a summation we could simply use the reduction clause.

void test_critical() {
    int n = 100, sum = 0, sumLocal, i, tid;
    int a[n];
    for(i = 0; i < n; i++) {
        a[i] = i;
    }

#pragma omp parallel shared(n, a, sum) private (tid, sumLocal)
    {
        tid = omp_get_thread_num();
        sumLocal = 0;
        #pragma omp for
        for(i = 0; i < n; i++) {
            sumLocal += a[i];
        }

        #pragma omp critical(update_sum) 
        {
            sum += sumLocal;
            printf("Thread %d: sumLocal = %d sum =%d\n", tid, sumLocal, sum);
        }
    }

    printf("Value of sum after parallel region: %d\n",sum);
}

In this code, sum is shared by all threads and sumLocal is each thread's private partial sum; adding every thread's sumLocal to sum gives the final result. When executing sum += sumLocal, only one thread may perform the operation at a time, which is why a critical section is used. The following is the output:

Thread 2: sumLocal = 1550 sum =1550
Thread 3: sumLocal = 2175 sum =3725
Thread 1: sumLocal = 925 sum =4650
Thread 0: sumLocal = 300 sum =4950
Value of sum after parallel region: 4950

The following is a result with the critical section removed (the result is not fixed; this is just one possible output):

Thread 2: sumLocal = 1550 sum =1550
Thread 3: sumLocal = 2175 sum =2475
Thread 1: sumLocal = 925 sum =925
Thread 0: sumLocal = 300 sum =300
Value of sum after parallel region: 2475

The comparison shows that the critical section ensures the correctness of the program.
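
As noted above, for a summation the reduction clause is the simpler tool; a minimal sketch of the same computation using reduction(+:sum):

void test_reduction() {
    int n = 100, sum = 0, i;
    int a[n];
    for (i = 0; i < n; i++) {
        a[i] = i;
    }

    // Each thread accumulates into a private copy of sum; the copies are
    // combined with + at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }

    printf("Value of sum after parallel region: %d\n", sum);
}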

atomic

The atomic construct locks a single memory location (a scalar variable or an array element) so that it can only be updated atomically, i.e. multiple threads are not allowed to write it at the same time. atomic applies only to a single assignment statement, not to a code block. The syntax is:

#pragma omp atomic
    statement

In C/C + +, statement must be one of the following forms

  • x++, x--, ++x, --x
  • x binop= expr, where binop is one of the binary operators: +, -, *, /, &, ^, |, <<, >>

atomic can make use of the hardware's atomic instructions to control concurrent writes to a shared variable, so it is comparatively efficient. The following is an example:

void test_atomic() {
    int counter=0, n = 1000000, i;

#pragma omp parallel for shared(counter, n)
    for(i = 0; i < n; i++) {
        #pragma omp atomic
        counter += 1;
    }

    printf("counter is %d\n", counter);
}

For the following cases

#pragma omp atomic
ic += func();

atomic only guarantees that the update of ic is atomic, i.e. ic cannot be updated by several threads at the same time; it does not make the execution of func atomic, so several threads may still execute func simultaneously. To make the call to func atomic as well, a critical section can be used.
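
A minimal sketch of that alternative, serializing both the call and the update (func is the same hypothetical function as above):

#pragma omp critical
{
    ic += func();   // both the execution of func() and the update of ic are now serialized
}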

locks

Mutex locks provide a lower-level synchronization mechanism. They are more flexible than critical and atomic, but also more complex to use. OpenMP provides two types of locks: simple locks and nested locks. A simple lock may not be acquired again while it is already locked, whereas a nested lock may be acquired multiple times by the same thread. Here are the functions for simple locks:

void omp_init_lock(omp_lock_t *lck)   // Initialize mutex
void omp_destroy_lock(omp_lock_t *lck)   // Destroy mutex
void omp_set_lock(omp_lock_t *lck)   // Get mutex
void omp_unset_lock(omp_lock_t *lck)   // Release mutex
int omp_test_lock(omp_lock_t *lck)   // Try to acquire the mutex; returns nonzero on success, 0 on failure

The functions of nested locks are slightly different from simple locks, as shown below

void omp_init_nest_lock(omp_nest_lock_t *lck)
void omp_destroy_nest_lock(omp_nest_lock_t *lck)
void omp_set_nest_lock(omp_nest_lock_t *lck)
void omp_unset_nest_lock(omp_nest_lock_t *lck)
int omp_test_nest_lock(omp_nest_lock_t *lck)   // Try to acquire the lock; returns the new nesting count on success, 0 on failure

The following is a usage example of a simple lock (a nested-lock sketch follows the output below):

void test_lock() {
    omp_lock_t lock;
    int i,n = 4;
    omp_init_lock(&lock);
#pragma omp parallel for
    for(i = 0; i < n; i++) {
        omp_set_lock(&lock);
        printf("Thread %d: +\n", omp_get_thread_num());
        system("sleep 0.1");
        printf("Thread %d: -\n", omp_get_thread_num());
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);
}

The call system("sleep 0.1") provides an interval between the two outputs so that the result can be compared with the unlocked version. The following is the output of the program:

Thread 1: +
Thread 1: -
Thread 2: +
Thread 2: -
Thread 3: +
Thread 3: -
Thread 0: +
Thread 0: -

The following is the output of removing the lock

Thread 3: +
Thread 2: +
Thread 0: +
Thread 1: +
Thread 2: -
Thread 3: -
Thread 0: -
Thread 1: -
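
The example above uses a simple lock. The original notes give no nested-lock example, so here is a minimal sketch of my own: the same thread acquires the lock a second time while already holding it, which is allowed for a nested lock but would deadlock with a simple lock.

omp_nest_lock_t nlock;

void add_value(int *counter, int v) {
    omp_set_nest_lock(&nlock);      // may be called with the lock already held
    *counter += v;
    omp_unset_nest_lock(&nlock);
}

void add_pair(int *counter, int v1, int v2) {
    omp_set_nest_lock(&nlock);      // first acquisition
    add_value(counter, v1);         // re-acquisition by the same thread: allowed
    add_value(counter, v2);
    omp_unset_nest_lock(&nlock);    // each set must be matched by an unset
}

void test_nest_lock() {
    int i, counter = 0;
    omp_init_nest_lock(&nlock);
#pragma omp parallel for
    for (i = 0; i < 4; i++) {
        add_pair(&counter, i, i + 1);
    }
    omp_destroy_nest_lock(&nlock);
    printf("counter is %d\n", counter);
}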

master

Used to specify that a piece of code is executed only by the master thread. The differences between the master directive and the single directive are as follows:

  • The code in a master construct is executed only by the master thread, while the code in a single construct can be executed by any one thread
  • The master construct has no implicit barrier at the end, and the nowait clause cannot be used

Here is an example:

void test_master() {
    int a, i, n = 5;
    int b[n];
#pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp master
        {
            a = 10;
            printf("Master construct is executed by thread %d\n", omp_get_thread_num());
        }
        #pragma omp barrier

        #pragma omp for
        for(i = 0; i < n; i++)
            b[i] = a;
    }

    printf("After the parallel region:\n");
    for(i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);
}

Here are the output results

Master construct is executed by thread 0
After the parallel region:
b[0] = 10
b[1] = 10
b[2] = 10
b[3] = 10
b[4] = 10
