Deprecated: See the LONI Moodle Courses



By Brett Estrade

Note: This tutorial is meant to complement the presentation.

Intended Audience

This tutorial is aimed at those familiar with the shared memory model of parallel programming as well as the basic features of OpenMP. It is a logical continuation of the LONI tutorial, an Introduction to OpenMP.

Objectives

The objectives of this tutorial include:

  • a detailed look at work-sharing constructs
  • a look at parallelizing an existing code

Work Sharing

The term work sharing refers to dividing work among the available threads. OpenMP provides several ways to control how the work is distributed among threads, including:

  • the parallelization of loops
  • the ability to define parallel sections
  • the ability to define critical sections
  • the ability to provide for exclusive writing of shared variables (i.e., the shared memory)

The following discusses these controls and concludes with an example of how to parallelize an existing chunk of code. The design of a shared memory program from scratch will be covered in another tutorial.

Un-Parallelizable Loops

OpenMP is not magic. A loop must be obviously parallelizable in order for OpenMP to divide its iterations among threads. If any iteration depends on data produced by a previous iteration, OpenMP cannot parallelize the loop.

An example of an un-parallelizable loop is shown below; each iteration reads the value written by the previous one:

A[0] = 1;
for (i = 1; i < MAX; i++)
    A[i] = 2 * A[i-1];
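
For contrast, here is a minimal sketch of a loop OpenMP can parallelize: no iteration reads a value written by another iteration (B is assumed to be a second, already initialized array).

for (i = 0; i < MAX; i++)
    A[i] = 2 * B[i];   /* each iteration touches only its own elements */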

Parallelizing Loops

In Fortran, the do directive is used to indicate a loop. In C/C++, the for directive is used.

Parallelizing loops with OpenMP is straightforward. One simply denotes the loop to be parallelized and a few parameters, and OpenMP takes care of the rest.

Example: parallel do

Fortran – DO Directive Example: (source)

       PROGRAM VECTOR_ADD
       USE OMP_LIB
       PARAMETER (N=100)
       INTEGER N, I
       REAL A(N), B(N), C(N)
       CALL OMP_SET_DYNAMIC (.FALSE.)  ! ensures use of all available threads
       CALL OMP_SET_NUM_THREADS (20)   ! sets number of available threads to 20
 ! Initialize arrays A and B.
       DO I = 1, N
         A(I) = I * 1.0
         B(I) = I * 2.0
       ENDDO
 ! Compute values of array C in parallel.
 !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I) 
 !$OMP DO 
       DO I = 1, N
         C(I) = A(I) + B(I)
       ENDDO
 !$OMP END PARALLEL
       PRINT *, C(10)
       END

Example: parallel for

C/C++ – For Directive Example: (source)

#include <stdio.h>
#include <omp.h>
#define N 10000000
int main(void) {
  /* Declared static so that three ~40 MB arrays do not overflow the stack. */
  static float a[N], b[N], c[N];
  int i;

  /* Initialize arrays a and b. */
  for (i = 0; i < N; i++) {
    a[i] = i * 1.0;
    b[i] = i * 2.0;
  }

  /* Compute values of array c in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
  { 
    #pragma omp for             // note: no curly braces on the for directive itself
    for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
    }
  }
  printf ("%f\n", c[10]);
}
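
The parallel region and the for directive can also be combined into a single pragma. As a sketch (reusing a, b, c, N, and i from the program above), the parallel block could be written as:

  /* Combined form: creates the team and divides the loop in one directive. */
  #pragma omp parallel for shared(a, b, c) private(i)
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
  }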

Additional Limitations

In addition to data dependencies between iterations, certain other constructs prevent OpenMP from determining that a loop is parallelizable; in those cases the loop is not divided up and assigned to the participating threads.

Some guidelines to note when parallelizing a loop:

  • The for loop must be a structured block and must not be terminated by a break statement (a restructured version is sketched after this list). For example:
// BAD - this is not something to parallelize with OpenMP
for (int i=0;i < 100; i++) {
  if (i > 50)
      break; // breaking when i greater than 50
}
  • Values of the loop control expressions must be the same for all iterations of the loop. For example:
// BAD - something else not to parallelize with OpenMP
for (int i=0;i < 100; i++) {
  if (i == 50)
      i = 0; // resetting i when eq to 50
}
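
As noted above, the break-terminated loop can usually be restructured so that the trip count is fixed before the loop starts; a minimal sketch of the equivalent loop:

// OK - the trip count (iterations 0 through 50) is known up front
for (int i = 0; i <= 50; i++) {
  /* loop body */
}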

Loop Scheduling and Ordering

OpenMP allows some explicit control over how loops are parallelized. In the examples above, OpenMP falls back on its default behavior to allocate the iterations of the loop to the participating threads.

Scheduling

This defines how loop iterations are assigned to each participating thread [1].

Scheduling types include:

  • static
    • Each thread is assigned a chunk of iterations in fixed fashion (round robin).
  • dynamic
    • Each thread is initialized with a chunk of iterations; as a thread completes its chunk, it is assigned the next available chunk of iterations.
  • guided
    • Iterations are divided into pieces that successively decrease exponentially in size, with chunk being the smallest allowed size. This provides a form of load balancing with less scheduling overhead than handing out many small chunks dynamically.
  • runtime
    • Scheduling is deferred until run time. The schedule type and chunk size can be chosen by setting the environment variable OMP_SCHEDULE.
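
The schedule is requested with the schedule clause on the loop directive. The following is a small, self-contained sketch showing each form in C (the chunk size of 3 and the printed messages are only for illustration):

#include <stdio.h>
#include <omp.h>
#define N 100

int main(void) {
  int i;

  /* static: chunks of 3 handed out to the threads round-robin. */
  #pragma omp parallel for schedule(static, 3)
  for (i = 0; i < N; i++)
    printf("static : iteration %3d on thread %d\n", i, omp_get_thread_num());

  /* dynamic: each idle thread grabs the next chunk of 3 as it finishes one. */
  #pragma omp parallel for schedule(dynamic, 3)
  for (i = 0; i < N; i++)
    printf("dynamic: iteration %3d on thread %d\n", i, omp_get_thread_num());

  /* guided: chunk sizes shrink as the loop progresses, but never below 3. */
  #pragma omp parallel for schedule(guided, 3)
  for (i = 0; i < N; i++)
    printf("guided : iteration %3d on thread %d\n", i, omp_get_thread_num());

  /* runtime: schedule taken from OMP_SCHEDULE, e.g. OMP_SCHEDULE="dynamic,3". */
  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < N; i++)
    printf("runtime: iteration %3d on thread %d\n", i, omp_get_thread_num());

  return 0;
}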

Comparison of Scheduling Methods

Static and Dynamic

Iterations: 100
Threads   : 5
Chunk Size: 3
                     Iterations Assigned
         Thread:   0     1     2     3     4 
 Static            3     3     3     3     3 (* repeated in round robin fashion *)      
 Dynamic           3     3     3     3     3 (* each thread gets 3 more iterations as it completes a chunk *)

Guided

Iterations: 100
Threads   : 5
Chunk Size: 3

 Round 1                       Iterations Assigned
   Thread 0:    (100)/5        = 20
   Thread 1: (100-20)/5 = 80/5 = 16
   Thread 2:  (80-16)/5 = 64/5 = 12
   Thread 3:  (64-12)/5 = 52/5 = 10
   Thread 4:  (52-10)/5 = 42/5 =  8

 Round 2                       Iterations Assigned
   Thread 4:   (42-8)/5 = 34/5 =  6   (* effectively reverses the list of threads seeking work            *)
   Thread 3:   (34-6)/5 = 28/5 =  5   (*   because threads with fewer iterations should finish sooner     *)
   Thread 2:   (28-5)/5 = 23/5 =  4
   Thread 1:   (23-4)/5 = 19/5 =  3   (* chunk size limit reached! *)
   Thread 0:  (16 left)        =  3

 Round 3                       Iterations Assigned
   Thread 0:  (13 left)        =  3
   Thread 1:  (10 left)        =  3
   Thread 2:  ( 7 left)        =  3
   Thread 3:  ( 4 left)        =  3
   Thread 4:  ( 1 left)        =  1
 *done (all 100 iterations assigned)
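
The table above can be reproduced with a short loop that applies the same rule used in the example (next chunk = remaining iterations divided by the number of threads, never smaller than the chunk size). This is only a sketch of the illustration's arithmetic; a real OpenMP runtime may divide the remaining work somewhat differently.

#include <stdio.h>

int main(void) {
  int remaining = 100;   /* total iterations         */
  int threads   = 5;     /* participating threads    */
  int min_chunk = 3;     /* chunk size (lower limit) */

  while (remaining > 0) {
    int chunk = remaining / threads;          /* remaining work split across the team */
    if (chunk < min_chunk) chunk = min_chunk; /* never below the chunk size           */
    if (chunk > remaining) chunk = remaining; /* the final piece may be smaller       */
    printf("assign %2d iterations, %3d remain\n", chunk, remaining - chunk);
    remaining -= chunk;
  }
  return 0;
}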

Limitations

  • An omp for directive can accept only one schedule clause.
  • The value of n (chunk size) must be the same for all threads of a parallel region.

Ordering

Ordering refers to ensuring that the iterations of a loop are executed in order. Up to this point, the tutorial has discussed how to assign iterations to threads, but the execution order of the threads with respect to one another has been left to OpenMP.

To make this more concrete: within an ORDERED section, code is executed in the order it would be in the equivalent sequential code. To ensure proper ordering, the following must be set:

1. the loop directive must contain the ORDERED keyword. For example (Fortran):

 !$OMP PARALLEL DO SCHEDULE(DYNAMIC,4) ORDERED

2. the section for which ordering must be enforced needs to be in an ORDERED block:

 ! fortran
 !$OMP ORDERED
   IF(I<21) THEN
     PRINT *, 'Z(',I,') =',Z(I)
   END IF
 !$OMP END ORDERED

The full example from which the above was borrowed may be viewed at http://www.nersc.gov/nusers/help/tutorials/openmp/ordered.php.

The following is an example adapted from some code above:

#include <stdio.h>
#include <omp.h>
#define N 100
int main(void) {
  float a[N], b[N], c[N];
  int i;
   omp_set_dynamic(0);             // ensures use of all available threads
   omp_set_num_threads(20);        // sets the number of threads to 20
  /* Initialize arrays a and b. */
  for (i = 0; i < N; i++) {
     a[i] = i * 1.0;
     b[i] = i * 2.0;
   }
  /* Compute values of array c in parallel. */
 #pragma omp parallel shared(a, b, c) private(i)
  {

  //not ordered
   #pragma omp for
   for (i = 0; i < N; i++) {
     c[i] = a[i] + b[i];
     printf("%d...\n",i);
   } 
 
  //ordered
   #pragma omp for ordered
   for (i = 0; i < N; i++) {
      /* ....
        In general, one wants to have as much independent work before the ORDERED
        statement as possible - this minimizes the overhead associated with the 
        implied serialization inside of each iteration of a work-shared loop.
      .... */
      #pragma omp ordered
      { 
         c[i] = a[i] + b[i];  // the ordered work should be *quick* and lightweight
      }
    }
  }
  printf ("%f\n", c[10]);
}

Defining Parallel Sections

Sections are another way to divide work among threads. One explicitly defines a set of section blocks; each block is executed once, by a single thread.

Defining a section is very straightforward; the only gotcha is the implicit barrier at the end of the SECTIONS construct, which requires all threads to finish their section blocks before any thread proceeds past it (unless a NOWAIT clause is specified).

Example: Fortran

Fortran – SECTIONS Directive Example: [2]

       PROGRAM SECTIONS
       USE OMP_LIB
       INTEGER SQUARE
       INTEGER X, Y, Z, XS, YS, ZS
       CALL OMP_SET_DYNAMIC (.FALSE.)
       CALL OMP_SET_NUM_THREADS (3)
       X = 2
       Y = 3
       Z = 5
 !$OMP PARALLEL 
 !$OMP SECTIONS
 !$OMP SECTION
       XS = SQUARE(X)
       PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "XS =", XS 
 !$OMP SECTION
       YS = SQUARE(Y)
       PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "YS =", YS
 !$OMP SECTION
       ZS = SQUARE(Z)
       PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "ZS =", ZS 
 !$OMP END SECTIONS
 !$OMP END PARALLEL
       END
       INTEGER FUNCTION SQUARE(N)
       INTEGER N
       SQUARE = N*N
       END

Example: C/C++

C/C++ – SECTIONS Directive Example: [3]

#include <stdio.h>
#include <omp.h>
int square(int n);
int main(void)
{
  int x, y, z, xs, ys, zs;
  x = 2;
  y = 3;
  z = 5;
  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      {
        xs = square(x);
        printf ("id = %d, xs = %d\n", omp_get_thread_num(), xs);
      }
      #pragma omp section
      {
        ys = square(y);
        printf ("id = %d, ys = %d\n", omp_get_thread_num(), ys);
      }
      #pragma omp section
      {
        zs = square(z);
        printf ("id = %d, zs = %d\n", omp_get_thread_num(), zs);
      }
    }
  }
}
int square(int n)
{
  return n*n;
}

Singling Out Threads

Sometimes only a single thread is needed for a piece of work. One way to use a single thread, without having to rejoin into the single master thread, is to single out a thread with one of the following constructs.

Note that a SINGLE block ends with an implied barrier for all threads (unless a NOWAIT clause is specified), while a MASTER block has no implied barrier.

Master Thread

This simply specifies that the enclosed block of code is to be executed by the master thread only.

Fortran:

 !$OMP MASTER                 
    ... block, which is executed by only the master thread                  
 !$OMP END MASTER  

C/C++

 #pragma omp master
 {
   ... block, which is executed by only the master thread
 }

Any Single Thread

This block is executed by the first thread that encounters it. All other threads wait for its completion before proceeding, unless specified otherwise.

Fortran:

 !$OMP SINGLE                 
   ... block, which is executed only by the first thread to reach it                  
 !$OMP END SINGLE  

C/C++

 #pragma omp single
 {
   ... block, which is executed only by the first thread to reach it
 }
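
A minimal, self-contained sketch showing both constructs inside one parallel region (the messages are only illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    /* Executed by the master thread (thread 0) only; no implied barrier. */
    #pragma omp master
    printf("master block run by thread %d\n", omp_get_thread_num());

    /* Executed once, by whichever thread reaches it first; the other
       threads wait at the implied barrier at the end of the block. */
    #pragma omp single
    printf("single block run by thread %d\n", omp_get_thread_num());
  }
  return 0;
}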

Mutual Exclusion

Mutual exclusion refers to restricting the execution of a block of code, or the updating of a shared variable, to only one thread at a time. Judicious use of mutual exclusion, through critical sections and atomic updates of variables, is extremely important for ensuring code correctness and avoiding run-time problems such as race conditions.

Critical Sections

A block of code defined inside a critical section will be executed by every thread, but only one thread will execute it at a time.

C/C++ Example:

#include <stdio.h>
#include <omp.h>
#define N 100

int main(void) {
   float a[N], b[N], c[N];
   int i;
  /* Initialize arrays a and b. */
   for (i = 0; i < N; i++)
     { 
       a[i] = i * 1.0;
       b[i] = i * 2.0;
     }
  /* Compute values of array c in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
  {
    #pragma omp critical
    {
      /* Each thread executes this block in turn; because a and b are
         incremented inside it, each thread prints a different c[10]. */
      for (i = 0; i < N; i++)
          c[i] = a[i]++ + b[i]++;
      printf("%f\n",c[10]);
    }
  }
} 


Because an unnamed critical section makes every other thread wait, OpenMP also provides named critical sections. Critical sections that share a name are mutually exclusive with one another, while critical sections with different names may execute at the same time, so a thread blocked from one section can still enter another.

C/C++ Example:

#include <stdio.h>
#include <omp.h>
#define N 100
int main(void) {
  float a[N], b[N], c[N];
  int i;
  /* Initialize arrays a and b. */
   for (i = 0; i < N; i++)
     { a[i] = i * 1.0;
       b[i] = i * 2.0;
     }
  /* Compute values of array c in parallel. */
  #pragma omp parallel shared(a, b, c) private(i)
  {
    #pragma omp critical(a)
    { for (i = 0; i < 33; i++)
          c[i] = a[i]++ + b[i]++;
      printf("%f\n",c[32]);
    }
    #pragma omp critical(b)
    { for (i = 33; i < 66; i++)
          c[i] = a[i]++ + b[i]++;
      printf("%f\n",c[65]);
    }
    #pragma omp critical(c)
    { for (i = 66; i < 100; i++)
         c[i] = a[i]++ + b[i]++;
      printf("%f\n",c[99]);
    }
  }
}

In the code above, a thread blocked from one named critical section can still enter a different one, so the three blocks of work can proceed concurrently across threads instead of forcing every thread to queue on a single lock.

The names of the critical sections simply tell the OpenMP compiler that the sections are independent and may run simultaneously; they are not used as any sort of reference or parameter.

Caveats: there is some overhead associated with a thread entering a critical section, so care must be taken when offering up multiple critical sections from which a set of threads may choose.

Atomic Sections

Whereas critical sections force threads to take turns executing a block of code, the atomic directive forces threads to take turns when updating a single shared memory location (i.e., a shared variable).

#include <stdio.h>
#include <omp.h>
int main(void) {
  int count = 0;
  #pragma omp parallel shared(count)
  {
     #pragma omp atomic
     count++; // count is updated by only a single thread at a time
  }
  printf("Number of threads: %d\n", count);
}

Performance Considerations

Any use of mutual exclusion blocks threads from moving forward by serializing access to shared memory or the execution of a common block of code, and to that extent it removes the very concurrency the threads were created to provide. It is very easy to counteract the benefit of multiple threads through naive use of blocking constructs such as critical sections and atomic updates; performance can easily end up worse than that of the serial program.
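
As a concrete sketch of that point (the array contents and sizes here are arbitrary): in the first loop below, the entire body sits inside a critical section, so the threads execute it strictly one at a time and the parallel region can easily run slower than the serial version; in the second loop, the independent arithmetic runs concurrently and only the update of the shared sum is protected.

#include <stdio.h>
#include <omp.h>
#define N 100000

static float a[N];

int main(void) {
  float total = 0.0f;
  int i;

  for (i = 0; i < N; i++)
    a[i] = i * 0.5f;

  /* Anti-pattern: the whole body is a critical section, so only one
     thread at a time does any work at all. */
  #pragma omp parallel for shared(a) private(i)
  for (i = 0; i < N; i++) {
    #pragma omp critical
    {
      total += a[i] * a[i];
    }
  }
  printf("fully serialized total: %f\n", total);

  total = 0.0f;

  /* Better: the multiply is done in parallel; only the update of the
     shared variable is serialized. */
  #pragma omp parallel for shared(a) private(i)
  for (i = 0; i < N; i++) {
    float sq = a[i] * a[i];   /* independent work, runs concurrently */
    #pragma omp atomic
    total += sq;
  }
  printf("atomic-update total   : %f\n", total);

  return 0;
}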

Designing a Shared Memory Program from Scratch

Starting from scratch allows one to properly consider all of the issues involved in order to maximize program efficiency and scaling. The design of a shared memory program from scratch will be covered in another tutorial.

Links

Exercises

References and Notes


Users may direct questions to sys-help@loni.org.
