Note: This page has been marked as "Obsolete" by the site administrator.

How to compile your OpenMP code

To compile your OpenMP code, you need to use the thread-safe "_r" version of the compiler (e.g., xlf90_r or xlc_r) and add the option "-qsmp=omp".

Example:

xlf90_r omptest.f90 -o omptest -qsmp=omp
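
If you need a quick test case, the following minimal program (a hypothetical omptest.f90, not part of the original instructions; any similar OpenMP "hello world" will do) prints one line per thread when built with the command above:

program omptest
   use omp_lib
   implicit none
!$omp parallel
   print *, 'Hello from thread', omp_get_thread_num(), 'of', omp_get_num_threads()
!$omp end parallel
end program omptest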

How to run your OpenMP code

As with running MPI applications, you have two options: run interactively or submit through LoadLeveler.

To run an interactive OpenMP job with 8 threads:

(1) Set the environment variable: "export OMP_NUM_THREADS=8"
(2) Change to the directory containing the executable and launch it directly with something like "./a.out", as in the example below.
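
For example, assuming a hypothetical executable named omptest built as shown earlier:

export OMP_NUM_THREADS=8
cd /path/to/your/run/directory
./omptest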


To submit an OpenMP job through LoadLeveler, use the script below as a template for building your own.

#!/usr/bin/ksh
# @ job_type = parallel
# @ input = /dev/null
# @ output = /work/ou/planet_disk/output/out.std
# @ error = /work/ou/planet_disk/output/out.err
# @ initialdir = /work/ou/planet_disk/run
# @ notify_user = ou@baton.phys.lsu.edu
# @ class = checkpt
# @ wall_clock_limit = 120:00:00
# @ node_usage = not_shared
# @ node = 1,1
# @ tasks_per_node = 1
# @ requirements = ( Arch == "Power5" )
# @ environment = MP_SHARED_MEMORY=yes; COPY_ALL
# @ queue
# No "executable" keyword is specified, so LoadLeveler runs this script
# itself as the job; the export below therefore takes effect before the
# program starts. (If "# @ executable" were set, the script body would
# be ignored.)
export OMP_NUM_THREADS=8
/work/ou/planet_disk/run/hydro

Simultaneous Multi-Threading

Simultaneous Multi-Threading (SMT) is a feature that became available under AIX 5.3 and works on Power5-based systems. It is a lightweight hardware mechanism that makes use of idle time on processors to complete jobs more quickly. In an SMT-enabled node, there are two logical processors for each physical processor: the Power5 doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor enabled by the CPU architecture. The basic concept is that no single process uses all of a processor's execution units at the same time, so a second thread can utilize the unused cycles.

SMT is beneficial where the number of tasks or threads exceeds the number of physical processors available, and it is most beneficial to codes that are less optimized for memory access or that spend more time waiting for communication or I/O. No code changes are needed: with simple modifications to your job scripts, you may be able to boost performance by 20% or more on some applications.


SMT is enabled by default on our AIX systems, and shared-memory applications can take advantage of it by setting OMP_NUM_THREADS greater than the number of physical CPUs on the node.

#!/usr/bin/ksh
# @ job_type = parallel
# @ input = /dev/null
# @ output = /work/ou/planet_disk/output/out.std
# @ error = /work/ou/planet_disk/output/out.err
# @ initialdir = /work/ou/planet_disk/run
# @ notify_user = ou@baton.phys.lsu.edu
# @ class = checkpt
# @ wall_clock_limit = 120:00:00
# @ node_usage = not_shared
# @ node = 1,1
# @ tasks_per_node = 1
# @ requirements = ( Arch == "Power5" )
# @ environment = MP_SHARED_MEMORY=yes; COPY_ALL
# @ queue
# As above, the script itself is the job, so the export below takes effect.
# The node has 8 physical CPUs; with SMT it presents 16 logical CPUs.
export OMP_NUM_THREADS=16 # num cpus = 8, SMT allows 16
/work/ou/planet_disk/run/hydro

How to set up the environment

OMP environment variables

OMP_DYNAMIC

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions. Its possible values are "TRUE" and "FALSE".

If you set this environment variable to TRUE, the run-time environment can adjust the number of threads it uses for executing parallel regions so that it makes the most efficient use of system resources. If you set this environment variable to FALSE, dynamic adjustment is disabled.

The default value for OMP_DYNAMIC is TRUE. Therefore, if your code needs to use a specific number of threads to run correctly, you should disable dynamic thread adjustment.
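
For example, to guarantee that parallel regions run with exactly the number of threads requested (a minimal sketch):

export OMP_DYNAMIC=FALSE
export OMP_NUM_THREADS=8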

OMP_NESTED

The OMP_NESTED environment variable enables or disables nested parallelism. Its possible values are "TRUE" and "FALSE".

If you set this environment variable to TRUE, nested parallelism is enabled. This means that the run-time environment might deploy extra threads to form the team of threads for the nested parallel region. If you set this environment variable to FALSE, nested parallelism is disabled.

The default value for OMP_NESTED is FALSE.
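
As a sketch of what nesting changes (a hypothetical fragment, assuming the same xlf90_r -qsmp=omp compile as above): with OMP_NESTED=TRUE, each of the two outer threads below forms its own inner team of two threads; with FALSE, each inner region runs on a team of one.

program nested_test
   use omp_lib
   implicit none
!$omp parallel num_threads(2)
   print *, 'outer thread', omp_get_thread_num()
!$omp parallel num_threads(2)
   print *, '  inner thread', omp_get_thread_num()
!$omp end parallel
!$omp end parallel
end program nested_test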

OMP_NUM_THREADS

The OMP_NUM_THREADS environment variable sets the number of threads that a program will use when it runs.

Its value is the maximum number of threads that can be used if dynamic adjustment of the number of threads is enabled. If dynamic adjustment is not enabled, the value of OMP_NUM_THREADS is the exact number of threads that will be used. It must be a positive, scalar integer.

The default number of threads that a program uses when it runs is the number of online processors on the machine.
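
To verify what the runtime actually picked up, a minimal sketch that queries the OpenMP library:

program thread_check
   use omp_lib
   implicit none
   print *, 'the runtime will use up to', omp_get_max_threads(), 'threads'
end program thread_check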

OMP_SCHEDULE

The OMP_SCHEDULE environment variable applies to PARALLEL DO and work-sharing DO directives that have a schedule type of RUNTIME. The syntax is as follows:

export OMP_SCHEDULE="sched_type"

or

export OMP_SCHEDULE="sched_type,chunk_size"

where sched_type is either DYNAMIC, GUIDED, or STATIC, and chunk_size is a positive, scalar integer that represents the chunk size.

This environment variable is ignored for PARALLEL DO and work-sharing DO directives that have a schedule type other than RUNTIME.
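
For example, a loop compiled with a RUNTIME schedule picks up whatever this variable specifies at launch, so the schedule can be changed without recompiling. A sketch (the loop and array names are illustrative, not from this page):

export OMP_SCHEDULE="DYNAMIC,4"

with the corresponding directive in the Fortran source:

!$omp parallel do schedule(runtime)
do i = 1, n
   a(i) = b(i) + c(i)
end do
!$omp end parallel do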

XLSMPOPTS

You can also use the XLSMPOPTS environment variable to set options that affect OpenMP execution. Note that if both XLSMPOPTS and the corresponding OMP environment variables are set, the OMP environment variables take precedence and the XLSMPOPTS settings are disregarded.

Syntax

export XLSMPOPTS=" <option1>=<option1_setting> : <option2>=<option2_setting> "
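
For example, to set a default run-time schedule of dynamic with a chunk size of 10 (an illustrative setting):

export XLSMPOPTS="schedule=dynamic=10"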

Available options

Schedule

Selects the scheduling type and chunk size to be used as the default at run time. The scheduling type that you specify will only be used for loops that were not already marked with a scheduling type at compilation time. Work is assigned to threads in a different manner, depending on the scheduling type and chunk size used.

Note: If you have not specified schedule, the default is set to schedule=static, resulting in block scheduling.

A brief description of the scheduling types and their influence on how work is assigned follows:

dynamic or guided

The run-time library dynamically schedules parallel work for threads on a "first-come, first-do" basis. "Chunks" of the remaining work are assigned to available threads until all work has been assigned. Work is not assigned to threads that are asleep.

static

Chunks of work are assigned to the threads in a "round-robin" fashion. Work is assigned to all threads, both active and asleep. The system must activate sleeping threads in order for them to complete their assigned work.

affinity

The run-time library performs an initial division of the iterations into number_of_threads partitions. The number of iterations that these partitions contain is:

CEILING(number_of_iterations / number_of_threads) 

These partitions are then assigned to each of the threads. It is these partitions that are then subdivided into chunks of iterations. If a thread is asleep, the threads that are active will complete their assigned partition of work.

Choosing chunking granularity is a tradeoff between overhead and load balancing. The syntax for this option is schedule=suboption, where the suboptions are defined as follows:

affinity[=n]

As described previously, the iterations of a loop are initially divided into partitions, which are then preassigned to the threads. Each of these partitions is then further subdivided into chunks that contain n iterations. If you have not specified n, a chunk consists of CEILING(number_of_iterations_remaining_in_local_partition / 2) loop iterations. When a thread becomes available, it takes the next chunk from its preassigned partition. If there are no more chunks in that partition, the thread takes the next available chunk from a partition preassigned to another thread.

dynamic[=n]

The iterations of a loop are divided into chunks that contain n iterations each. If you have not specified n, a chunk consists of CEILING(number_of_iterations / number_of_threads) iterations.

guided[=n]

The iterations of a loop are divided into progressively smaller chunks until a minimum chunk size of n loop iterations is reached. If you have not specified n, the default value for n is 1 iteration. The first chunk contains CEILING(number_of_iterations / number_of_threads) iterations. Subsequent chunks consist of CEILING(number_of_iterations_remaining / number_of_threads) iterations.
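
As a worked illustration (not from the original page): with 100 iterations, 4 threads, and the default n=1, the first chunk is CEILING(100/4) = 25 iterations, leaving 75; the following chunks are CEILING(75/4) = 19, CEILING(56/4) = 14, CEILING(42/4) = 11, and so on, shrinking until single-iteration chunks remain.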

static[=n]

The iterations of a loop are divided into chunks that contain n iterations. Threads are assigned chunks in a "round-robin" fashion. This is known as block cyclic scheduling. If the value of n is 1, the scheduling type is specifically referred to as cyclic scheduling. If you have not specified n, the chunks will contain CEILING(number_of_iterations / number_of_threads) iterations. Each thread is assigned one of these chunks. This is known as block scheduling.
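
As a worked illustration (not from the original page): with 100 iterations, 4 threads, and static=10, the ten chunks are handed out round-robin, so thread 0 gets iterations 1-10, 41-50, and 81-90, thread 1 gets 11-20, 51-60, and 91-100, and so on. Without n, each thread instead receives one block of CEILING(100/4) = 25 consecutive iterations.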

Parallel execution options

The three parallel execution options, parthds, usrthds, and stack, are as follows:

parthds=num

Specifies the number of threads (num) to be used for parallel execution of code that you compiled with the -qsmp option. By default, this is equal to the number of online processors. There are some applications that cannot use more than some maximum number of processors. There are also some applications that can achieve performance gains if they use more threads than there are processors.

This option allows you full control over the number of execution threads. The default value for num is 1 if you did not specify -qsmp. Otherwise, it is the number of online processors on the machine.

usrthds=num

Specifies the maximum number of threads (num) that you expect your code will explicitly create if the code does explicit thread creation. The default value for num is 0.

stack=num

Specifies the largest amount of space in bytes (num) that a thread's stack will need. The default value for num is 4194304.
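
For example, to request eight execution threads and an 8 MB per-thread stack in one setting (illustrative values; 8388608 bytes is double the default):

export XLSMPOPTS="parthds=8:stack=8388608"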

Users may direct questions to sys-help@loni.org.
