Deprecated: please see new documentation site.



The following is a general guideline for submitting a job to LoadLeveler.

The Work Directory

This is typically located at /work/default/username. All jobs should be run from this directory unless otherwise directed.

The Submission Script for Parallel Jobs

LoadLeveler requires a submission script (also known as a queue script) that sets various parameters, including how many processors to use and which file is the executable for the program.

An example of a typical script is shown below:

#!/bin/sh
# .. put nothing before this header
#@ account_no = ALLOCATION_NAME
#@ environment = COPY_ALL
#@ job_type = parallel 
#@ output = /work/default/username/$(jobid).out
#@ error = /work/default/username/$(jobid).err
#@ notify_user = youremail@domain.tdl
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/username
#@ executable = /some/parallel/executable
#@ arguments = 
#@ network.MPI = sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue

Note: #@ executable and #@ arguments, if included, are passed on to poe.

You should specify wall_clock_limit whenever possible. This is the maximum estimated time your job will run; specifying it may shorten the time your job waits in the queue and improve the utilization of the cluster (see How LoadLeveler Schedules Jobs).
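The wall_clock_limit keyword takes an HH:MM:SS value, as in the script above. As a convenience, a small shell helper (hypothetical, not part of LoadLeveler) can convert a run-time estimate in minutes to that form:

```shell
# Hypothetical helper: convert a run-time estimate in minutes
# to the HH:MM:SS form that wall_clock_limit expects.
to_wallclock() {
  mins=$1
  printf '%02d:%02d:00\n' $((mins / 60)) $((mins % 60))
}

to_wallclock 10    # prints 00:10:00, the limit used in the script above
to_wallclock 150   # prints 02:30:00
```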

In the example above, there should be nothing after #@ queue. However, if one would like to issue a series of shell commands, a script such as the following can be used. The only difference is that the #@ executable and #@ arguments keywords are missing, so LoadLeveler looks after #@ queue for the commands to run. Also, in this mode one must explicitly invoke /usr/bin/poe to provide the environment for a parallel job.

#!/bin/sh
# .. put nothing before this header
#@ environment = COPY_ALL
#@ job_type = parallel
#@ output = /work/default/username/$(jobid).out
#@ error = /work/default/username/$(jobid).err
#@ notify_user = youremail@domain.tdl
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/username
#@ network.MPI = sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue
##
## valid shell commands may follow
date > time.out
/usr/bin/poe /some/parallel/executable
## when done, execute the following shell commands
for I in 1 2 3 4 5 6 7 8 9 10; do
  echo ${I}
  # ...do whatever
done
/usr/bin/poe /another/parallel/executable
# ...do some more stuff...
date >> time.out

The Submission Script for Sequential Jobs

Compared to parallel jobs, the submission script for sequential jobs is much simpler:

#!/bin/sh
# .. put nothing before this header
#@ environment = COPY_ALL
#@ output = /work/default/username/$(jobid).out
#@ error = /work/default/username/$(jobid).err
#@ notify_user = youremail@domain.tdl
#@ notification = error
#@ wall_clock_limit = 00:10:00
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/username
#@ executable = /some/serial/executable
#@ arguments = 
#@ resources = ConsumableMemory(1 gb)
#@ queue

Note: On LONI machines, serial jobs are restricted to a specific node (usually node 14), which is also open to parallel jobs. Therefore, when parallel jobs are running on that node, all serial jobs must wait in the queue even if other nodes are available.

Submission Script for Pandora

Pandora is the new IBM POWER7 cluster. Please note a few differences in the LoadLeveler submit script compared to the POWER5 clusters:

  • Arch in the requirements directive is "POWER7" (all caps), not "Power5".
  • LoadLeveler now requires you to specify consumable resources via the resources directive. You must specify both how much memory each task uses (ConsumableMemory) and how many CPUs each task uses (ConsumableCpus). In general, you will want ConsumableCpus(1), and should instead increase the number of tasks based on your code's scalability: 8 tasks for an 8-way job, 32 tasks for a 32-way job, and so on.
    • As an example, if you request 1 node with 32 tasks and 32 ConsumableCpus, then you are requesting 1024 total processors and 32 times the amount of RAM. Pandora will not be able to provide this.
  • The network directive can either be network.MPI_LAPI or network.MPI, except when you are running GAMESS. For GAMESS, it must be network.MPI_LAPI.
#!/bin/bash
# .. put nothing before this header
#@ job_type = parallel
#@ notification = never
#@ notify_user = youremail@domain.tdl
#@ output = /work/username/$(jobid).out
#@ error = /work/username/$(jobid).err
#@ class = workq 
#@ checkpoint = no
#@ wall_clock_limit = 2:00:00
#@ node_usage = shared
#@ node = 2
#@ tasks_per_node = 32
#@ requirements = (Arch == "POWER7")
#@ network.MPI_LAPI = sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(3500 mb) ConsumableCpus(1)
#@ queue

cd /working/directory
poe executable options
exit 0
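The arithmetic behind the ConsumableCpus warning above can be made explicit. Assuming the total CPU request is node count times tasks per node times ConsumableCpus per task, a small hypothetical helper shows why 32 tasks with 32 CPUs each oversubscribes the machine, while the script above makes a reasonable request:

```shell
# Hypothetical helper: total CPUs a job requests, assuming the total is
# nodes x tasks_per_node x ConsumableCpus per task.
total_cpus() {
  echo $(($1 * $2 * $3))
}

total_cpus 2 32 1    # prints 64: the request made by the script above
total_cpus 1 32 32   # prints 1024: the oversubscribed request warned about earlier
```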

Submission

Once the queue script is ready and everything is set, submit the job via the utility llsubmit:

$ llsubmit mysubmitscript.ll
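After submission, the job can be monitored with LoadLeveler's companion utilities. llq and llcancel are standard LoadLeveler commands, though the exact output columns vary by installation; substitute your own username and the job ID reported by llsubmit:

```shell
$ llq -u username          # list your queued and running jobs
$ llq -s jobid             # explain why a particular job is still waiting
$ llcancel jobid           # cancel a job you no longer need
```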

Allocations

Please see Requesting LONI Allocations.

Troubleshooting

If things go wrong with the submission, please see Troubleshooting LoadLeveler Submission Errors.

If things go wrong with the execution of your job or if it fails to begin after a long period of time, please see Troubleshooting LoadLeveler Job Execution Errors.


Users may direct questions to sys-help@loni.org.
