Deprecated: See the LONI Moodle Courses



By Brett Estrade and Shangli Ou

Note: This tutorial is meant to complement the accompanying presentation.

Objectives

  • get an introduction to the shared memory programming model
  • get familiar with the basics of OpenMP's capabilities
  • write and compile (on IBM and Linux x86 clusters) a very simple OpenMP program
  • become aware of additional resources and tutorials

This is a very gentle introduction, and a user going through it will be able to write a very basic, yet not very useful, OpenMP program for both IBM and Linux x86 LONI Clusters. It is the first in a series that will take a user from this gentle introduction all the way through very advanced and detailed shared memory programming, including the creation of hybrid OpenMP (shared memory) and MPI (message passing) applications.

Execution Model

  • OpenMP is built around a shared memory space and related, concurrent threads – this is how the parallelism is handled.
  • There is a single master thread, which is said to be active when only a single thread is alive; the concurrent threads are forked from, and rejoined to, this master thread.
  • CPUs that share memory are called “symmetric multi-processors”, or SMPs.
  • Each thread is typically run on its own processor, though it is becoming common for each CPU or “core” to handle more than one thread “concurrently”; this is called “hyper-threading”.
  • Thread (process) communication is implicit and uses variables pointing to shared memory locations; this is in contrast with MPI, which uses explicit messages passed among the processes.
  • OpenMP simply makes it easier to manage the traditional fork/join paradigm using special “hints”, or directives, given to an OpenMP-enabled compiler; a minimal fork/join sketch appears below.
  • These days, most major compilers support OpenMP directives on most platforms.
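
The fork/join structure described above can be illustrated with a minimal sketch; this is only an illustration of the execution model, with the directives themselves explained in the sections that follow:

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  printf("serial region: only the master thread is alive\n");
  #pragma omp parallel            // <-- fork: a team of concurrent threads starts here
  {
    printf("parallel region: thread %d is running\n", omp_get_thread_num());
  }                               // <-- join: threads synchronize; only the master continues
  printf("serial region again: back to a single thread\n");
  return 0;
}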

Shared Memory versus Message Passing

Shared memory programming differs from message passing because the two rely on very different means of communication. Shared memory relies on implicit communication, which is facilitated by the shared memory space. This makes shared memory programming much more intuitive for many programmers wishing to parallelize their programs. Unfortunately, because it operates under the assumption of a shared memory space, it is susceptible to the inherent scaling limitations of the mechanisms that create this shared view of the memory.

Message passing assumes that there is no shared memory and that all communication must be accomplished through the sending of explicit messages. It is less intuitive because it requires the programmer to manage the sending, receiving, and coordination of messages among the parallel processes.

Shared memory applications are ideal for use on single SMP systems, where all processors and memory are tightly integrated at the hardware level. As CPUs increase the number of cores per chip, shared memory programming will get a lot of attention. It is no silver bullet, however, because of the inherent limitations on scaling up the size of shared memory systems. At some point, multiple SMP machines must be connected in a network, and coordinating a set of these systems requires explicit message passing because there is no physically shared memory. Even if shared memory is emulated among individual SMP machines, it must still be implemented at its lowest level using message passing.

Hybrid Shared Memory/MPI Programs

There do exist situations that are most efficiently solved using a hybrid application of both shared memory and message passing. Often, these situations consist of tightly coupled computations on multiple SMP machines that communicate (as little as possible) with one another. This is an advanced aspect of parallel programming that will be covered in a more in-depth tutorial.
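
As a rough sketch of what such a hybrid program can look like (assuming an MPI library and an OpenMP-enabled compiler are available; the details are left to the later tutorial), each MPI process forks its own team of OpenMP threads:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int rank;
  MPI_Init(&argc, &argv);                  // message passing between SMP nodes
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  #pragma omp parallel                     // shared memory threads within each node
  {
    printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
  }
  MPI_Finalize();
  return 0;
}

Such a program is typically built with an MPI compiler wrapper (e.g. mpicc) together with the compiler's OpenMP flag.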

Alternatives to OpenMP

Limitations of OpenMP

  • the more shared variables (address spaces) one has among the threads, the harder the SMP must work to ensure that each thread's view of those variables is current – this is called cache coherency
  • at some point, the overhead associated with ensuring cache coherency (consistency) will cause the parallelization to stop scaling as more processors are added
  • additional limitations are hardware based, and have to do with how many processors are physically able to share the same memory space
  • once the physical shared memory space is outgrown, one must emulate shared memory using message passing (e.g., Linda); hybrid OpenMP/MPI methodologies exist as well
  • memory bandwidth does not scale up as more processors are introduced, i.e., scaling is limited by the memory architecture
  • lacks fine-grained control over thread-to-processor mapping
  • synchronization of a subset of threads is not allowed


A Simple Example

C/C++

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
 int id, nthreads;
 #pragma omp parallel private(id)    // <-- variable “id” is private to each thread
 {
   id = omp_get_thread_num();
   printf("Hello World from thread %d\n", id);
   #pragma omp barrier
   if ( id == 0 ) {
     nthreads = omp_get_num_threads();
     printf("There are %d threads\n",nthreads);
   }
 }
 return 0;
}

Fortran

 program hello90
 use omp_lib
 integer:: id, nthreads
  !$omp parallel private(id)
  id = omp_get_thread_num()
  write (*,*) 'Hello World from thread', id
  !$omp barrier
  if ( id == 0 ) then
    nthreads = omp_get_num_threads()
    write (*,*) 'There are', nthreads, 'threads'
  end if
  !$omp end parallel
 end program


Variables

Declaring Access Amongst Threads

Private

Private variables belong to, and are known only by, each individual thread.

C/C++

int id, nthreads;
 #pragma omp parallel private(id)
 { id = omp_get_thread_num();

Fortran

 integer:: id, nthreads
  !$omp parallel private(id)
    id = omp_get_thread_num()

Shared

Shared variables are known to all threads. Care must be exercised when using these variables, since improper handling can cause hard-to-detect errors such as race conditions and deadlocks (a sketch of one way to avoid such a race appears after the examples below).

C/C++

int id, nthreads, A, B;
 A = getA();
 B = getB();
 #pragma omp parallel private(id,nthreads) shared(A,B)
 { 
   id = omp_get_thread_num();

Fortran

 integer:: id, nthreads, A, B
  call getA(A)
  call getB(B)
  !$omp parallel default(private) shared(A,B)
    id = omp_get_thread_num()

Note that Fortran supports default(private), but C/C++ does not (C/C++ allows only default(shared) and default(none)).
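
As a minimal sketch of the race-condition warning above, the critical directive is one common way to let only one thread at a time update a shared variable:

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int count = 0;
  #pragma omp parallel shared(count)
  {
    // without the critical directive, this read-modify-write could race
    #pragma omp critical
    count = count + 1;
  }
  printf("count=%d\n", count);    // equals the number of threads
  return 0;
}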

Initializing Variables

Firstprivate

The firstprivate clause allows a private variable to be initialized with the value it held in the master thread just before the parallel section is entered. Otherwise, private variables are considered uninitialized at the start of each thread.

C/C++

int id;
 double myPi;
 myPi = 3.14159;
 #pragma omp parallel private(id) firstprivate(myPi)
 { id = omp_get_thread_num();

Fortran

 integer:: id
 real:: myPi
  myPi = 3.14159
  !$omp parallel private(id) firstprivate(myPi)
    id = omp_get_thread_num()
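
Putting the snippets above together, a complete, minimal C version might look like the following; every thread starts with its own copy of myPi already holding the master thread's value:

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int id;
  double myPi = 3.14159;
  // firstprivate: each thread's private copy of myPi is initialized to 3.14159
  #pragma omp parallel private(id) firstprivate(myPi)
  {
    id = omp_get_thread_num();
    printf("thread %d sees myPi = %f\n", id, myPi);
  }
  return 0;
}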

Reduction Operations

Reductions allow a variable that is used privately by each thread to be aggregated into a single value at the end of the parallel region.

For example:

reduction(*:i)  # for N threads, get the product i1*i2*i3*...*iN
reduction(+:i)  # for N threads, get the sum i1+i2+i3+...+iN

Reduction operations (C/C++)

Arithmetic:    + - *         # add, subtract, multiply
Bitwise:       & | ^         # and, or, xor 
Logical:       && ||         # and, or

The example below assigns a value to the private variable i, then at the end of the parallel block multiplies (*) i from all threads together and makes the final product available to the master thread.

C/C++

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int i = 1;   // the initial value of i participates in the final product
  #pragma omp parallel reduction(*:i)
  { 
    i = omp_get_num_threads();
  }
  printf("ans=%d\n", i);
  return 0;
}

The net effect of the code above is the calculation of N^N, where N is the number of threads associated with this parallel block, i.e. the value returned by omp_get_num_threads().
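
Reductions are most often used to combine the partial results of a loop. As a brief sketch (the work-sharing for directive used here is covered in later tutorials), the following sums the integers 1 through 100 across all threads:

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int n, sum = 0;   // each thread works on a private copy of sum, initialized to 0
  #pragma omp parallel for reduction(+:sum)
  for (n = 1; n <= 100; n++) {
    sum = sum + n;
  }
  printf("sum=%d\n", sum);   // 5050, regardless of the number of threads
  return 0;
}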

Threads

Basic Thread Synchronization

When one wants all threads to reach a specific point in their execution before any of them proceed, one uses a barrier. A barrier basically tells each thread, "wait here until all other threads have reached this point...". This is useful in situations like the one demonstrated below, where each thread may call a function that takes a variable amount of time to complete.

C/C++

#include <omp.h>
#include <stdio.h>

// placeholder for any user routine whose run time varies from thread to thread
int do_something_that_may_take_a_while(void) { return 0; }

int main (int argc, char *argv[]) {
 int i;
 #pragma omp parallel private(i)
 {
   i = do_something_that_may_take_a_while();
   #pragma omp barrier
   // no thread passes this point until every thread has finished its work
 }
 return 0;
}

Some Run-Time Query and Command Routines

  • omp_get_num_threads
  • omp_get_num_procs
  • omp_set_num_threads
  • omp_get_max_threads
  • omp_in_parallel
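
A brief sketch of how a few of these routines can be used (all of the calls below are standard OpenMP run-time library routines):

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  printf("processors available: %d\n", omp_get_num_procs());
  printf("max threads         : %d\n", omp_get_max_threads());
  printf("in parallel region? : %d\n", omp_in_parallel());   // 0 here, in the serial region
  omp_set_num_threads(4);   // request 4 threads for subsequent parallel regions
  #pragma omp parallel
  {
    if ( omp_get_thread_num() == 0 )
      printf("threads in team     : %d\n", omp_get_num_threads());
  }
  return 0;
}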

Environment Variables

  1. OMP_NUM_THREADS
    • sets the number of threads to use for parallel regions; if it is not set, an implementation-defined default is used
  2. OMP_SCHEDULE
    • The OMP_SCHEDULE environment variable applies to PARALLEL DO and work-sharing DO directives that have a schedule type of RUNTIME.
  3. OMP_DYNAMIC
    • The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions
  4. OMP_NESTED
    • The OMP_NESTED environment variable enables or disables nested parallelism.
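
For instance, OMP_SCHEDULE only takes effect when a loop explicitly requests its schedule at run time. A brief sketch (again using the work-sharing for directive, covered in later tutorials):

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
  int n;
  // schedule(runtime): the iteration schedule is read from OMP_SCHEDULE at run time
  #pragma omp parallel for schedule(runtime)
  for (n = 0; n < 16; n++) {
    printf("iteration %d done by thread %d\n", n, omp_get_thread_num());
  }
  return 0;
}

It might then be run with, for example:

%OMP_SCHEDULE="dynamic,2" OMP_NUM_THREADS=4 ./a.out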

Compiling and Execution

IBM p5 575

  • Compilers include: xlf_r, xlf90_r, xlf95_r, xlc_r, cc_r, c89_r, c99_r, xlc128_r, cc128_r, c89_128_r, c99_128_r
  • Example:
%xlc_r -qsmp=omp test.c && OMP_NUM_THREADS=5 ./a.out

Simultaneous Multi-threading

See the Simultaneous Multi-threading section of Using_OpenMP.

Linux x86 Clusters

  • Compilers include: ifort, icc
  • Example:
%icc -openmp test.c && OMP_NUM_THREADS=5 ./a.out

GCC OpenMP Support

OpenMP has been supported since version 4.2.1 of GCC and is enabled with the -fopenmp flag. Similarly, gfortran, GCC's Fortran 95 front-end, also accepts the -fopenmp flag.

The following are examples of how to compile OpenMP programs using GCC:

%gcc -fopenmp test.c -o test.x
%gfortran -fopenmp test.f90 -o test.x

Various versions of the GCC compiler suite are available on the various platforms, and version 4.2.1 or greater should be among them. If not, let us know.

Up to date information on OpenMP support in GCC may be viewed at http://gcc.gnu.org/projects/gomp/.

Exercises

#1 Designing A Simple Multi-Threaded Application

Using the following description, design on paper (visually) how a multi-threaded version may look.

For an N-threaded application, return the number N^2

  • each thread may use only a single private (or firstprivate) variable
  • this variable must be reduced using addition

Solution

  • declare private variable, int i
  • for each thread, set i = N (the number of threads, as returned by omp_get_num_threads())
  • reduce i to the sum of all threads' i values

#2 Programming A Simple Multi-Threaded Application

Implement the solution to Exercise 1 in either C or Fortran. Compile, then run the code using the examples provided in the presentation.

A Solution

#include <omp.h>
#include <stdio.h>
int main (int argc, char *argv[]) {
 int c = 0;
 #pragma omp parallel reduction(+:c)
 { 
   c=omp_get_num_threads();
 }
 printf("ans=%d\n",c);
 return 0;
}

Compiling and Executing

IBM p5 575
%xlc_r -qsmp=omp test.c && OMP_NUM_THREADS=5 ./a.out
ans=25
Linux x86 Cluster
%icc -openmp test.c && OMP_NUM_THREADS=5 ./a.out
test.c(5) : (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
ans=25

Closing Thoughts

The Hard Part about Shared Memory Programming

The difficult aspect of creating a shared memory program is translating what you want to do into a multi-threaded version. It is even more difficult to make this multi-threaded version optimally efficient. The most difficult part is verifying that your multi-threaded version is correct and that there are no issues with shared variables or unintended situations such as race conditions. Program verification and detecting/debugging race conditions (and other run-time issues) are beyond the scope of this tutorial and will be covered in future talks on advanced OpenMP issues.

Credits

Additional Resources

  1. Intermediate OpenMP
  2. Using OpenMP
  3. A good next step for now

Users may direct questions to sys-help@loni.org.
