Preface
OpenMP implements parallelization by inserting compiler directives into a serial program. A compiler that supports OpenMP recognizes and processes these directives and generates the corresponding parallel code. All directives start with #pragma omp, followed by a specific directive name and optional clauses. The general format is as follows:
```c
#pragma omp directive [clause [[,] clause]...]
    structured block
```
Parallel construct (parallel region)
To make a program execute in parallel, we first need to construct a parallel region. The parallel directive is used to construct a parallel region; its syntax is as follows:
```c
#pragma omp parallel [clause [[,] clause]...]
    structured block
```
We can see that it is simply the parallel keyword added after omp. The main function of this directive is to construct a parallel region, create a team of threads, and execute the enclosed code concurrently. Note that the directive only ensures that the code is executed in parallel; it is not responsible for distributing work among the threads. At the end of the parallel region there is an implicit barrier that synchronizes all threads in the region. Here is an example:
```c
#include <stdio.h>
#include <omp.h>

void parallel_construct() {
    #pragma omp parallel
    {
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
}
```
Here omp_get_thread_num() returns the number of the current thread; it is declared in <omp.h>. The output is as follows:
```
Hello from thread 1
Hello from thread 3
Hello from thread 0
Hello from thread 2
```
The parallel directive can be followed by the clauses listed below:
```
if(scalar-expression)
num_threads(integer-expression)
private(list)
firstprivate(list)
shared(list)
default(none | shared)
copyin(list)
reduction(operator:list)
```
The usage of these clauses is introduced later; a quick sketch of if and num_threads follows.
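Since if and num_threads do not reappear in the later examples, here is a minimal hedged sketch of how they might be used; the team size 4 and the cutoff n > 100 are arbitrary illustration values, not from the original text.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: num_threads requests a team size; the if clause makes the
   region run in parallel only when the condition is true (otherwise it
   executes on a single thread). The cutoff 100 is an arbitrary example. */
void parallel_clauses_sketch(int n) {
    #pragma omp parallel num_threads(4) if(n > 100)
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
}
```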
Work-sharing constructs
Work-sharing directives are mainly used to distribute work among threads. A work-sharing region must be bound to an active parallel region; if a work-sharing directive appears in an inactive parallel region or in serial code, it is simply ignored. In C/C++ there are three such directives: for, sections, and single. Strictly speaking, only for and sections are work-sharing directives, while single is merely a directive that assists work sharing.
for
Used on a for loop to assign different iterations to different threads. The syntax is as follows:
```c
#pragma omp for [clause[[,] clause]...]
    for-loop
```
Here is an example:
```c
void parallel_for() {
    int n = 9;
    int i = 0;
    #pragma omp parallel shared(n) private(i)
    {
        #pragma omp for
        for(i = 0; i < n; i++) {
            printf("Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);
        }
    }
}
```
The following is the result of running the program:
```
Thread 2 executes loop iteration 5
Thread 2 executes loop iteration 6
Thread 3 executes loop iteration 7
Thread 3 executes loop iteration 8
Thread 0 executes loop iteration 0
Thread 0 executes loop iteration 1
Thread 0 executes loop iteration 2
Thread 1 executes loop iteration 3
Thread 1 executes loop iteration 4
```
In the above program, 4 threads execute 9 iterations in total: one thread is assigned 3 iterations and the remaining threads are assigned 2 each. This is the common default scheduling: given n iterations and t threads, each thread is allocated n/t or n/t + 1 consecutive iterations. In some cases this is not the best choice; we can use the schedule clause to specify the scheduling policy, which is described in detail later. The following clauses can follow the for directive:
```
private(list)
firstprivate(list)
lastprivate(list)
reduction(operator:list)
ordered
schedule(kind[,chunk_size])
nowait
```
sections
The sections directive assigns different code sections to different threads. The syntax is as follows:
```c
#pragma omp sections [clause[[,] clause]...]
{
    [#pragma omp section]
    structured block
    [#pragma omp section]
    structured block
    ...
}
```
From the syntax above, we can see that sections divides the code into multiple sections, and each section is executed by one thread. The following is an example:
```c
/**
 * Use #pragma omp sections and #pragma omp section to let different threads perform different tasks.
 * If the number of threads is greater than the number of sections, the extra threads are idle.
 * If the number of threads is less than the number of sections, some threads execute more than one section.
 */
void funcA() {
    printf("In funcA: this section is executed by thread %d\n", omp_get_thread_num());
}

void funcB() {
    printf("In funcB: this section is executed by thread %d\n", omp_get_thread_num());
}

void parallel_section() {
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                (void)funcA();
            }
            #pragma omp section
            {
                (void)funcB();
            }
        }
    }
}
```
Here are the execution results:
```
In funcA: this section is executed by thread 3
In funcB: this section is executed by thread 0
```
Here are some clauses that can follow sections:
```
private(list)
firstprivate(list)
lastprivate(list)
reduction(operator:list)
nowait
```
single
The single directive specifies that a block of code is executed by only one thread. If there is no nowait clause, all threads synchronize at the implicit barrier at the end of the single construct; if the nowait clause is present, the other threads continue without waiting. The single directive does not specify which thread executes the block. The syntax is as follows:
```c
#pragma omp single [clause[[,] clause]...]
    structured block
```
The following is a usage example:
```c
void parallel_single() {
    int a = 0, n = 10, i;
    int b[n];
    #pragma omp parallel shared(a, b) private(i)
    {
        // Only one thread executes this block; the others wait for it to finish
        #pragma omp single
        {
            a = 10;
            printf("Single construct executed by thread %d\n", omp_get_thread_num());
        }
        // A barrier is automatically inserted here

        #pragma omp for
        for(i = 0; i < n; i++) {
            b[i] = a;
        }
    }

    printf("After the parallel region:\n");
    for (i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);
}
```
Here are the execution results:
```
Single construct executed by thread 2
After the parallel region:
b[0] = 10
b[1] = 10
b[2] = 10
b[3] = 10
b[4] = 10
b[5] = 10
b[6] = 10
b[7] = 10
b[8] = 10
b[9] = 10
```
The following clauses can follow the single directive:
```
private(list)
firstprivate(list)
copyprivate(list)
nowait
```
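The copyprivate clause listed above is not demonstrated elsewhere in this text, so here is a minimal hedged sketch (my own illustration, not from the original): one thread produces a value inside single, and copyprivate broadcasts it to the private copies of all the other threads.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: broadcast a value computed by one thread to every thread's
   private copy using single + copyprivate. The value 42 is arbitrary. */
void test_copyprivate() {
    int x;
    #pragma omp parallel private(x)
    {
        #pragma omp single copyprivate(x)
        {
            x = 42;   /* e.g. a value read from input by one thread */
        }
        /* after the single construct, every thread's private x is 42 */
        printf("Thread %d sees x = %d\n", omp_get_thread_num(), x);
    }
}
```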
Combined Parallel Work-Sharing Constructs
A parallel directive and a work-sharing directive can be combined to make the code more concise. For example, the following code:
```c
#pragma omp parallel
{
    #pragma omp for
    for(.....)
}
```
can be written as:
```c
#pragma omp parallel for
for(.....)
```
Using these combined constructs not only improves readability but can also help performance: with a combined construct the compiler knows what follows and may generate more efficient code. The same shortcut exists for sections, as sketched below.
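As a small illustration (assuming the funcA and funcB functions from the earlier sections example), the combined parallel sections form might look like this:

```c
/* Sketch: combined form of parallel + sections, reusing funcA/funcB
   from the sections example above. */
void parallel_sections_combined() {
    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();
        #pragma omp section
        funcB();
    }
}
```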
Clauses to Control Parallel and Work-Sharing Constructs
OpenMP directives can be followed by clauses that control the behavior of the constructs. Here are some common clauses.
shared
The shared clause specifies which data is shared between threads. The syntax is shared(list), and it is used as follows:
```c
#pragma omp parallel for shared(a)
for(i = 0; i < n; i++) {
    a[i] += i;
}
```
When a shared variable is written inside the parallel region, access to it must be protected: several threads may modify the shared variable at the same time, or one thread may read it while another thread is updating it, which can lead to incorrect results. A sketch of the problem and one way to protect the update follows.
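The following hedged sketch (not from the original text) contrasts an unprotected update of a shared counter with a protected one; the critical construct used here is described in the synchronization section below, and reduction is another option for this pattern.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: the unprotected update of "racy" is a deliberate data race and
   gives an unpredictable result; the update of "safe" is protected by a
   critical section and always equals n. */
void shared_update_sketch() {
    int racy = 0, safe = 0, i, n = 100000;

    #pragma omp parallel for shared(racy, safe)
    for (i = 0; i < n; i++) {
        racy += 1;              /* data race: result is unpredictable */
        #pragma omp critical
        safe += 1;              /* serialized: always equals n */
    }
    printf("racy = %d, safe = %d (n = %d)\n", racy, safe, n);
}
```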
private
The private clause specifies which data is thread-private, that is, each thread has its own copy of the variable and the threads do not affect each other. The syntax is private(list), and it is used as follows:
```c
void test_private() {
    int n = 8;
    int i = 2, a = 3;
    // After i and a are declared private, their original values are not changed
    #pragma omp parallel for private(i, a)
    for (i = 0; i < n; i++) {
        a = i + 1;
        printf("In for: thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
    }
    printf("\n");
    printf("Out for: thread %d has a value of a = %d for i = %d\n",
           omp_get_thread_num(), a, i);
}
```
The following is the result of the program:
```
In for: thread 2 has a value of a = 5 for i = 4
In for: thread 2 has a value of a = 6 for i = 5
In for: thread 3 has a value of a = 7 for i = 6
In for: thread 3 has a value of a = 8 for i = 7
In for: thread 0 has a value of a = 1 for i = 0
In for: thread 0 has a value of a = 2 for i = 1
In for: thread 1 has a value of a = 3 for i = 2
In for: thread 1 has a value of a = 4 for i = 3

Out for: thread 0 has a value of a = 3 for i = 2
```
For variables in the private clause, you need to pay attention to the following two points:
- Whether or not the variable has an initial value, it is uninitialized on entry to the parallel region
- Modifications of the variable inside the parallel region are only visible there; after leaving the region, the variable still has the value it had before entering
lastprivate
lastprivate saves the last value of the listed variable when exiting the parallel region. It can be applied to for and sections; the syntax is lastprivate(list). Definition of the last value: for the for directive, it is the value from the iteration that would come last in a sequential execution; for the sections directive, it is the value after the lexically last section containing the variable has executed. The usage is as follows:
```c
void test_last_private() {
    int n = 8;
    int i = 2, a = 3;
    // lastprivate assigns the value of a from the last iteration (i == n-1) to a
    #pragma omp parallel for private(i) lastprivate(a)
    for (i = 0; i < n; i++) {
        a = i + 1;
        printf("In for: thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
    }
    printf("\n");
    printf("Out for: thread %d has a value of a = %d for i = %d\n",
           omp_get_thread_num(), a, i);
}
```
The program execution result is:
```
In for: thread 3 has a value of a = 7 for i = 6
In for: thread 3 has a value of a = 8 for i = 7
In for: thread 2 has a value of a = 5 for i = 4
In for: thread 2 has a value of a = 6 for i = 5
In for: thread 1 has a value of a = 3 for i = 2
In for: thread 0 has a value of a = 1 for i = 0
In for: thread 0 has a value of a = 2 for i = 1
In for: thread 1 has a value of a = 4 for i = 3

Out for: thread 0 has a value of a = 8 for i = 2
```
firstprivate
The firstprivate clause provides an initial value for a private variable: a variable listed in firstprivate is initialized with the value of the variable of the same name before the construct. The syntax is firstprivate(list), and it is used as follows:
```c
void test_first_private() {
    int n = 8;
    int i = 0, a[n];
    for(i = 0; i < n; i++) {
        a[i] = i + 1;
    }
    #pragma omp parallel for private(i) firstprivate(a)
    for (i = 0; i < n; i++) {
        printf("thread %d: a[%d] is %d\n", omp_get_thread_num(), i, a[i]);
    }
}
```
The results are as follows:
```
thread 0: a[0] is 1
thread 0: a[1] is 2
thread 2: a[4] is 5
thread 2: a[5] is 6
thread 3: a[6] is 7
thread 3: a[7] is 8
thread 1: a[2] is 3
thread 1: a[3] is 4
```
default
The default clause sets the default data-sharing attribute of variables. In C/C++ only default(none | shared) is supported. default(shared) makes all variables shared by default, while default(none) removes the default, so every variable used in the construct must be explicitly specified as shared or private, as in the sketch below.
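A minimal hedged sketch (my own example, not from the original) of default(none): every variable referenced in the loop must appear in an explicit data-sharing clause, otherwise the compiler reports an error.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: with default(none), n, i and sum must all be given explicit
   data-sharing attributes (shared, private, reduction, ...). */
void test_default_none() {
    int i, n = 8, sum = 0;
    #pragma omp parallel for default(none) shared(n) private(i) reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += i;
    }
    printf("sum = %d\n", sum);
}
```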
nowait
The nowait clause removes the implicit barrier at the end of a work-sharing construct. The following is an example:
```c
void test_nowait() {
    int i, n = 6;
    #pragma omp parallel
    {
        #pragma omp for nowait
        for(i = 0; i < n; i++) {
            printf("thread %d: ++++\n", omp_get_thread_num());
        }

        #pragma omp for
        for(i = 0; i < n; i++) {
            printf("thread %d: ----\n", omp_get_thread_num());
        }
    }
}
```
If the first for is not followed by nowait, the output is as follows:
```
thread 3: ++++
thread 0: ++++
thread 0: ++++
thread 2: ++++
thread 1: ++++
thread 1: ++++
thread 0: ----
thread 0: ----
thread 3: ----
thread 1: ----
thread 1: ----
thread 2: ----
```
Because the for construct has an implicit barrier, all threads synchronize at the end of the first loop before continuing. Adding nowait removes this barrier, so a thread can move on to the second loop as soon as it has finished its share of the first one, without waiting for the other threads. The following is the output with nowait added:
```
thread 2: ++++
thread 2: ----
thread 1: ++++
thread 1: ++++
thread 1: ----
thread 1: ----
thread 3: ++++
thread 3: ----
thread 0: ++++
thread 0: ++++
thread 0: ----
thread 0: ----
```
When using nowait, check whether there is a dependency between the two loops: if the second loop needs the results of the first, using nowait may produce incorrect results. A sketch of a safe use of nowait follows.
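As a hedged illustration (not from the original text), nowait is safe in the following sketch because the two loops write to independent arrays, so no iteration of the second loop depends on results of the first.

```c
/* Sketch: nowait is safe because the loops touch independent arrays a and b. */
void test_nowait_safe(int n, double a[], double b[]) {
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] = 2.0 * i;

        #pragma omp for   /* keeps its implicit barrier at the end */
        for (i = 0; i < n; i++)
            b[i] = i + 1.0;
    }
}
```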
schedule
The schedule clause applies only to loop constructs and sets the scheduling policy for the loop iterations. The syntax is schedule(kind[,chunk_size]), where kind is one of static, dynamic, guided, auto, and runtime, and chunk_size is optional. Here's how to use it:
```c
void test_schedule() {
    int i, n = 10;
    #pragma omp parallel for default(none) schedule(static, 2) \
        private(i) shared(n)
    for(i = 0; i < n; i++) {
        printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());
    }
}
```
Let's look at the meaning of each kind. Suppose there are n iterations and t threads.
static

Static scheduling. If chunk_size is not specified, each thread is allocated n/t or n/t + 1 (when n is not divisible by t) consecutive iterations. If chunk_size is specified, chunks of chunk_size consecutive iterations are assigned to the threads in round-robin order until all iterations are distributed. Assuming n = 8 and t = 4, the following table shows the allocation when chunk_size is not specified, equals 1, and equals 3.
| Thread \ chunk_size | Unspecified | chunk_size = 1 | chunk_size = 3 |
|---|---|---|---|
| 0 | 0 1 | 0 4 | 0 1 2 |
| 1 | 2 3 | 1 5 | 3 4 5 |
| 2 | 4 5 | 2 6 | 6 7 |
| 3 | 6 7 | 3 7 | |
dynamic

Dynamic scheduling assigns iterations to threads on demand: whenever a thread is idle it is given more work, so faster threads end up executing more iterations. If chunk_size is not specified, one iteration is handed out at a time (equivalent to chunk_size = 1); if chunk_size is specified, chunk_size iterations are handed out at a time. Under dynamic scheduling the allocation is not fixed: running the same program repeatedly generally yields different allocations. For n = 12 and t = 4, the following table shows the allocation with chunk_size unspecified and with chunk_size = 2 (each run twice).
| Thread \ chunk_size | Unspecified (first run) | Unspecified (second run) | chunk_size = 2 (first run) | chunk_size = 2 (second run) |
|---|---|---|---|---|
| 0 | 2 | 0 | 4 5 8 9 10 11 | 0 1 |
| 1 | 0 4 5 6 7 8 9 10 11 | 3 | 0 1 | 4 5 |
| 2 | 3 | 1 4 5 6 7 8 9 10 11 | 2 3 | 6 7 |
| 3 | 1 | 2 | 6 7 | 2 3 8 9 10 11 |
Using dynamic scheduling can reduce load imbalance to some extent, but note that dynamically handing out work incurs some overhead.
guided

Guided scheduling is a heuristic self-scheduling method. At the beginning each thread is given a large block of iterations, and the size of the blocks handed out then decreases gradually. If chunk_size is specified, the block size decreases exponentially down to the specified chunk_size; if it is not specified, the block size decreases down to 1 (equivalent to chunk_size = 1). As with dynamic scheduling, threads that finish their blocks quickly are given more work; the difference is that the block size varies. The allocation under guided scheduling is likewise not fixed, and repeated runs produce different allocations. The following table shows the allocation for n = 20 and t = 4, with chunk_size unspecified and with chunk_size = 3 (each run twice).
| Thread \ chunk_size | Unspecified (first run) | Unspecified (second run) | chunk_size = 3 (first run) | chunk_size = 3 (second run) |
|---|---|---|---|---|
| 0 | 12 13 | 0 1 2 3 4 | 0 1 2 3 4 | 5 6 7 8 18 19 |
| 1 | 5 6 7 8 16 17 18 19 | 5 6 7 8 | 9 10 11 | 9 10 11 |
| 2 | 0 1 2 3 4 14 15 | 9 10 11 14 15 16 17 18 19 | 5 6 7 8 15 16 17 18 19 | 0 1 2 3 4 15 16 17 |
| 3 | 9 10 11 | 12 13 | 12 13 14 | 12 13 14 |
With chunk_size = 3, only iterations 18 and 19 remain at the end, so the last block handed out contains only 2 iterations.
The following figure shows the allocation under static, (dynamic,7), and (guided,7) scheduling when there are 200 iterations and 4 threads.
runtime

Runtime is not really a scheduling policy of its own: the schedule is taken from the environment variable OMP_SCHEDULE at run time, and the final policy is still one of the kinds above. Under bash it can be set as follows:
```bash
export OMP_SCHEDULE="static"
```
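A hedged sketch (not from the original) of how the runtime kind appears in code; the loop body is just a placeholder. The actual policy is then chosen via OMP_SCHEDULE when the program runs, or programmatically via omp_set_schedule().

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: with schedule(runtime) the actual policy comes from the
   OMP_SCHEDULE environment variable (or from omp_set_schedule()). */
void test_runtime_schedule(int n) {
    int i;
    /* optional: pick the schedule programmatically instead of via the env */
    /* omp_set_schedule(omp_sched_dynamic, 2); */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++) {
        printf("Iteration %d executed by thread %d\n",
               i, omp_get_thread_num());
    }
}
```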
auto

The scheduling decision is delegated to the compiler and/or runtime system, which chooses whatever policy it considers appropriate.
Load imbalance
In a for loop, if different iterations take different amounts of time, load imbalance may occur. The following code simulates this situation:
```c
void test_schedule() {
    int i, j, n = 10;
    double start, end;

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(static) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        //printf("Iteration %d executed by thread %d\n", i, omp_get_thread_num());
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("static : use time %.2fs\n", end-start);

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(static,2) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("static,2 : use time %.2fs\n", end-start);

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(dynamic) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("dynamic : use time %.2fs\n", end-start);

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(dynamic, 2) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("dynamic,2: use time %.2fs\n", end-start);

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(guided) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("guided : use time %.2fs\n", end-start);

    GET_TIME(start);
    #pragma omp parallel for default(none) schedule(guided, 2) \
        private(i, j) shared(n)
    for(i = 0; i < n; i++) {
        for(j = 0; j < i; j++) {
            system("sleep 0.1");
        }
    }
    GET_TIME(end);
    printf("guided,2 : use time %.2fs\n", end-start);
}
```
GET_TIME is defined as follows:
```c
#ifndef _TIMER_H_
#define _TIMER_H_

#include <sys/time.h>
#include <time.h>
#include <stdio.h>

#define GET_TIME(now) {                       \
    struct timeval t;                         \
    gettimeofday(&t, NULL);                   \
    now = t.tv_sec + t.tv_usec/1000000.0;     \
}

#endif
```
In the above code, the larger i is, the longer iteration i takes. Here is the output for n = 10:
```
static : use time 1.74s
static,2 : use time 1.84s
dynamic : use time 1.53s
dynamic,2: use time 1.84s
guided : use time 1.63s
guided,2 : use time 1.53s
```
Here is the output for n = 20:
```
static : use time 8.67s
static,2 : use time 6.42s
dynamic : use time 5.62s
dynamic,2: use time 6.43s
guided : use time 5.92s
guided,2 : use time 6.43s
```
For static scheduling without a chunk_size, the last few iterations go to the last thread, and those are exactly the most time-consuming ones; the other threads must wait for that thread to finish, wasting resources and causing load imbalance. dynamic and guided can reduce the imbalance to some extent, but not always; the best choice depends on the specific problem.
Synchronization constructs
Synchronization constructs are mainly used to control access to shared variables by multiple threads. They can ensure that threads update shared variables in a certain order, or that two or more threads do not modify a shared variable at the same time.
barrier
Barrier: when a thread reaches a barrier, it must stop and wait until all threads in the parallel region have reached the barrier point. There is an implicit barrier at the end of each parallel region and each work-sharing region, i.e. at the end of the regions constructed by parallel, for, sections, and single, so in many cases we do not need to insert a barrier explicitly. The syntax is as follows:
```c
#pragma omp barrier
```
Here is an example:
```c
void print_time(int tid, char* s) {
    int len = 10;
    char buf[len];
    NOW_TIME(buf, len);
    printf("Thread %d %s at %s\n", tid, s, buf);
}

void test_barrier() {
    int tid;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        if(tid < omp_get_num_threads() / 2)
            system("sleep 3");

        print_time(tid, "before barrier ");
        #pragma omp barrier
        print_time(tid, "after barrier ");
    }
}
```
where NOW_TIME is defined as follows:
```c
#ifndef _TIMER_H_
#define _TIMER_H_

#include <sys/time.h>
#include <time.h>
#include <stdio.h>

#define NOW_TIME(buf, len) {                  \
    time_t nowtime;                           \
    nowtime = time(NULL);                     \
    struct tm *local;                         \
    local = localtime(&nowtime);              \
    strftime(buf, len, "%H:%M:%S", local);    \
}

#endif
```
In the above code, half of the threads (those with tid < 2 when there are 4 threads) sleep for 3 seconds before continuing. First, look at the output without the barrier, i.e. with #pragma omp barrier removed:
```
Thread 3 before barrier at 16:55:44
Thread 2 before barrier at 16:55:44
Thread 3 after barrier at 16:55:44
Thread 2 after barrier at 16:55:44
Thread 1 before barrier at 16:55:47
Thread 0 before barrier at 16:55:47
Thread 0 after barrier at 16:55:47
Thread 1 after barrier at 16:55:47
```
The output with the barrier added is shown below:
```
Thread 3 before barrier at 17:05:29
Thread 2 before barrier at 17:05:29
Thread 0 before barrier at 17:05:32
Thread 1 before barrier at 17:05:32
Thread 0 after barrier at 17:05:32
Thread 1 after barrier at 17:05:32
Thread 2 after barrier at 17:05:32
Thread 3 after barrier at 17:05:32
```
The comparison shows that with the barrier in place, all threads synchronize at the barrier point before continuing execution.
ordered
The ordered construct allows a piece of code inside a parallel loop to be executed in sequential order. If we want different threads to print their computed data in order, we can use this construct. The syntax is as follows:
```c
#pragma omp ordered
    structured block
```
There are two points to pay attention to when using ordered:
- ordered only works on loop constructs
- when using the ordered construct, the ordered clause must also be added to the loop directive, as shown in the example below
The following is a usage example:
```c
void test_order() {
    int i, tid, n = 5;
    int a[n];
    for(i = 0; i < n; i++) {
        a[i] = 0;
    }

    #pragma omp parallel for default(none) ordered schedule(dynamic) \
        private(i, tid) shared(n, a)
    for(i = 0; i < n; i++) {
        tid = omp_get_thread_num();
        printf("Thread %d updates a[%d]\n", tid, i);

        a[i] += i;

        #pragma omp ordered
        {
            printf("Thread %d printf value of a[%d] = %d\n", tid, i, a[i]);
        }
    }
}
```
The following is the result of the program:
```
Thread 0 updates a[0]
Thread 2 updates a[2]
Thread 1 updates a[3]
Thread 0 printf value of a[0] = 0
Thread 0 updates a[4]
Thread 3 updates a[1]
Thread 3 printf value of a[1] = 1
Thread 2 printf value of a[2] = 2
Thread 1 printf value of a[3] = 3
Thread 0 printf value of a[4] = 4
```
From the output we can see that the updates happen out of order, but the prints occur in sequential order.
critical
Critical section: a critical section ensures that at most one thread executes the enclosed code at any given time. To enter the critical section, a thread must wait until no other thread is inside it. The syntax is as follows:
```c
#pragma omp critical [(name)]
    structured block
```
where name is an optional name for the critical section. The following is a summation example; note that it is only meant to illustrate critical sections — for a summation we would normally use the reduction clause.
```c
void test_critical() {
    int n = 100, sum = 0, sumLocal, i, tid;
    int a[n];
    for(i = 0; i < n; i++) {
        a[i] = i;
    }

    #pragma omp parallel shared(n, a, sum) private(tid, sumLocal)
    {
        tid = omp_get_thread_num();
        sumLocal = 0;

        #pragma omp for
        for(i = 0; i < n; i++) {
            sumLocal += a[i];
        }

        #pragma omp critical(update_sum)
        {
            sum += sumLocal;
            printf("Thread %d: sumLocal = %d sum =%d\n", tid, sumLocal, sum);
        }
    }
    printf("Value of sum after parallel region: %d\n", sum);
}
```
In this code, sum is the global result and sumLocal is the partial sum computed by each thread; adding each thread's sumLocal to sum produces the final result. When executing sum += sumLocal, we must ensure that only one thread performs the update at a time, so a critical section is used. The result of running the program:
```
Thread 2: sumLocal = 1550 sum =1550
Thread 3: sumLocal = 2175 sum =3725
Thread 1: sumLocal = 925 sum =4650
Thread 0: sumLocal = 300 sum =4950
Value of sum after parallel region: 4950
```
The following is the result with the critical section removed (the result is not fixed; this is just one possibility):
```
Thread 2: sumLocal = 1550 sum =1550
Thread 3: sumLocal = 2175 sum =2475
Thread 1: sumLocal = 925 sum =925
Thread 0: sumLocal = 300 sum =300
Value of sum after parallel region: 2475
```
The comparison shows that the critical section ensures the correctness of the program. For reference, the reduction version mentioned above is sketched below.
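The text above notes that a sum like this would normally use the reduction clause instead of a critical section; here is a hedged sketch of that variant (my own rewrite of the example, not from the original). Each thread accumulates into a private copy of sum, and the copies are combined with + at the end of the loop.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: the same summation written with the reduction clause. */
void test_reduction() {
    int n = 100, sum = 0, i;
    int a[n];
    for (i = 0; i < n; i++) {
        a[i] = i;
    }

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];   /* each thread updates its own private copy of sum */
    }
    printf("Value of sum after parallel region: %d\n", sum);
}
```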
atomic
An atomic operation locks a specific storage location (a scalar variable or an array element) so that it can only be updated atomically; multiple threads are not allowed to write it at the same time. atomic can only be applied to a single assignment statement, not to a code block. The syntax is:
```c
#pragma omp atomic
    statement
```
In C/C++, statement must have one of the following forms:
- x++, x--, ++x, --x
- x binop= expr, where binop is one of the binary operators: +, -, *, /, &, ^, |, <<, >>
atomic can exploit the hardware's atomic instructions to control concurrent writes to a shared variable, which makes it more efficient. The following is an example:
```c
void test_atomic() {
    int counter = 0, n = 1000000, i;
    #pragma omp parallel for shared(counter, n)
    for(i = 0; i < n; i++) {
        #pragma omp atomic
        counter += 1;
    }
    printf("counter is %d\n", counter);
}
```
For the following case:
```c
#pragma omp atomic
ic += func();
```
atomic only guarantees that the update of ic is atomic, i.e. it is not updated by several threads at the same time; it does not make the execution of func atomic, so several threads may execute func simultaneously. If the execution of func must also be serialized, a critical section can be used, as sketched below.
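A hedged sketch of the difference (func here is a hypothetical placeholder, not from the original): with atomic only the read-modify-write of the shared variable is protected, while wrapping the whole statement in a critical section also serializes the calls to func.

```c
/* Placeholder for some per-iteration computation (hypothetical). */
static int func(void) { return 1; }

void atomic_vs_critical(int n, int *ic) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* Only the update of *ic is atomic; func() may still run
           concurrently on several threads. */
        #pragma omp atomic
        *ic += func();

        /* Here both the call to func() and the update are serialized. */
        #pragma omp critical
        *ic += func();
    }
}
```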
locks
Mutex locks provide a lower-level synchronization mechanism. They are more flexible than critical and atomic, but also more complex to use. OpenMP provides two kinds of locks: simple locks and nested locks. A simple lock may not be acquired again while it is already locked; a nested lock may be acquired multiple times by the thread that already holds it. Here are the functions for simple locks:
```c
void omp_init_lock(omp_lock_t *lck);    // Initialize the lock
void omp_destroy_lock(omp_lock_t *lck); // Destroy the lock
void omp_set_lock(omp_lock_t *lck);     // Acquire the lock (blocks until available)
void omp_unset_lock(omp_lock_t *lck);   // Release the lock
int  omp_test_lock(omp_lock_t *lck);    // Try to acquire the lock; nonzero on success, 0 otherwise
```
The functions for nested locks are slightly different from those for simple locks, as shown below:
```c
void omp_init_nest_lock(omp_nest_lock_t *lck);
void omp_destroy_nest_lock(omp_nest_lock_t *lck);
void omp_set_nest_lock(omp_nest_lock_t *lck);
void omp_unset_nest_lock(omp_nest_lock_t *lck);
int  omp_test_nest_lock(omp_nest_lock_t *lck);  // returns the new nesting count on success, 0 otherwise
```
The following is a usage example:
```c
void test_lock() {
    omp_lock_t lock;
    int i, n = 4;
    omp_init_lock(&lock);

    #pragma omp parallel for
    for(i = 0; i < n; i++) {
        omp_set_lock(&lock);
        printf("Thread %d: +\n", omp_get_thread_num());
        system("sleep 0.1");
        printf("Thread %d: -\n", omp_get_thread_num());
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
}
```
system("sleep 0.1") is used to provide an interval between two outputs, so as to compare it with the case without lock The following is the output of the program:
```
Thread 1: +
Thread 1: -
Thread 2: +
Thread 2: -
Thread 3: +
Thread 3: -
Thread 0: +
Thread 0: -
```
The following is the output with the lock removed:
```
Thread 3: +
Thread 2: +
Thread 0: +
Thread 1: +
Thread 2: -
Thread 3: -
Thread 0: -
Thread 1: -
```
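The example above only exercises simple locks. As a hedged sketch (my own illustration, not from the original), the following shows what a nested lock allows: a thread can re-acquire a lock it already holds, for example when one locked function calls another locked function.

```c
#include <stdio.h>
#include <omp.h>

omp_nest_lock_t nlock;

/* Both helpers take the same nested lock; because it is a nested lock,
   work_outer() can call work_inner() while already holding it. */
void work_inner(void) {
    omp_set_nest_lock(&nlock);
    printf("Thread %d in work_inner\n", omp_get_thread_num());
    omp_unset_nest_lock(&nlock);
}

void work_outer(void) {
    omp_set_nest_lock(&nlock);
    printf("Thread %d in work_outer\n", omp_get_thread_num());
    work_inner();                 /* re-acquires the lock it already holds */
    omp_unset_nest_lock(&nlock);
}

void test_nest_lock() {
    omp_init_nest_lock(&nlock);
    #pragma omp parallel num_threads(4)
    {
        work_outer();
    }
    omp_destroy_nest_lock(&nlock);
}
```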
master
The master directive specifies that a block of code is executed only by the master thread. The differences between master and single are:
- The code block in a master construct is executed only by the master thread, whereas the block in a single construct may be executed by any one thread
- The master construct has no implicit barrier at the end, and the nowait clause cannot be used with it
Here is an example:
```c
void test_master() {
    int a, i, n = 5;
    int b[n];
    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp master
        {
            a = 10;
            printf("Master construct is executed by thread %d\n", omp_get_thread_num());
        }

        #pragma omp barrier

        #pragma omp for
        for(i = 0; i < n; i++)
            b[i] = a;
    }

    printf("After the parallel region:\n");
    for(i = 0; i < n; i++)
        printf("b[%d] = %d\n", i, b[i]);
}
```
Here is the output:
```
Master construct is executed by thread 0
After the parallel region:
b[0] = 10
b[1] = 10
b[2] = 10
b[3] = 10
b[4] = 10
```