Note: This page has been marked as "Obsolete" by the site administrator.

Users of the LONI P5s often report that their submit scripts aren't being accepted by "llsubmit". In my experience, the following situations are responsible for this the majority of the time. This document is meant to be a running list of troubleshooting steps to perform. It is organized roughly from the most to the least common cause of error.

Disk Issues

See Common Disk Problems and Solutions.

Malformed LoadLeveler Script

In particular, users try to put shell commands above the LoadLeveler directives header. A properly formed script looks like the following:

 #!/bin/sh
 #
 # ....anything other than a comment or a LoadLeveler directive will cause llsubmit to fail
 #
 #@ environment = COPY_ALL
 #@ job_type = parallel
 #@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
 #@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
 #@ notify_user = estrabd@cct.lsu.edu
 #@ notification = error
 #@ class = checkpt
 #@ checkpoint = no
 #@ restart = yes
 #@ wall_clock_limit = 00:10:00
 #@ node_usage = not_shared
 #@ node = 2,2
 #@ total_tasks = 16
 #@ requirements = (Arch == "Power5")
 #@ initialdir = /work/default/estrabd/adcirc-systest
 #@ executable = /work/default/estrabd/adcirc-systest/padcirc.sh
 #@ network.MPI =sn_single,not_shared,US,HIGH
 #@ resources = ConsumableMemory(1 gb)
 #@ queue

Note: if the "#@ executable" directive is used, nothing past "#@ queue" will be run. If it is not used, then LoadLeveler will run the script itself in the shell environment.
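The rule that only comments and LoadLeveler directives may appear above the queue statement can be checked before calling llsubmit. The following is a minimal sketch and not part of LoadLeveler: the function name check_ll_header is made up for illustration, and it assumes that blank lines in the header are harmless, which has not been verified against llsubmit.

```shell
# check_ll_header FILE   (hypothetical helper, not a LoadLeveler tool)
# Prints every line above the first "#@ queue" that is neither blank,
# a comment, nor a "#@" directive. Returns 0 if none are found, 1 otherwise.
check_ll_header() {
    awk '
        /^[[:space:]]*#[[:space:]]*@[[:space:]]*queue/ { exit }  # stop at the queue directive
        /^[[:space:]]*$/ { next }   # blank line (assumed to be harmless)
        /^[[:space:]]*#/ { next }   # comment or #@ directive
        { printf "line %d: %s\n", NR, $0; bad = 1 }  # anything else is suspect
        END { exit bad + 0 }
    ' "$1"
}
```

For example, "check_ll_header myjob.ll && llsubmit myjob.ll" (with myjob.ll standing in for your own script) would only submit when no stray shell commands were found in the header.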

Check Head Node's /tmp Directory

If all of the above fails, make sure that the head node's /tmp directory is not full,

 df /tmp

and that the permissions for the directory are correct.

 ls -ld /tmp
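Both checks can be combined into one small function. This is only a sketch: the name check_tmp is made up for illustration, and it assumes that "correct" permissions means the conventional world-writable, sticky mode (1777, shown by ls as drwxrwxrwt), which this page does not state explicitly.

```shell
# check_tmp [DIR]   (hypothetical helper; DIR defaults to /tmp)
# Reports how full the file system holding DIR is and whether DIR has
# the conventional drwxrwxrwt permissions (an assumption, see above).
check_tmp() {
    dir=${1:-/tmp}
    # df -P gives one data line per file system; field 5 is the capacity used
    used=$(df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    echo "$dir is on a file system that is ${used}% full"
    # first field of ls -ld is the mode string
    perms=$(ls -ld "$dir" | awk '{ print $1 }')
    echo "$dir permissions: $perms"
    case $perms in
        drwxrwxrwt*) return 0 ;;
        *) echo "unexpected permissions on $dir" >&2; return 1 ;;
    esac
}
```

Running "check_tmp" with no argument inspects /tmp on the node you are logged in to.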

Check Ulimit

The ulimit discussed here is the limit on the number of open file handles a process may have at any point in time. A Unix file handle refers not just to files, but also to sockets, pipes, etc. A parallel code can easily open a lot of file handles without trying very hard.

Hitting this limit causes weird issues. For example:

  • a program dies saying that it can't write to directory /X, yet /X is definitely not full and does not fall under the auspices of the quota system.

Currently the ulimit is set to 4000 on all of the p575s. It may be increased in the future to accommodate codes that require more.
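To see the limit that applies to your own session, you can ask the shell directly. The sketch below uses a made-up function name (show_nofile_limits) and assumes a shell whose ulimit builtin accepts -S, -H and -n, as bash and ksh do; other shells may differ.

```shell
# show_nofile_limits   (hypothetical helper)
# Prints the soft and hard limits on open file handles for the current shell.
show_nofile_limits() {
    echo "soft: $(ulimit -S -n)"   # limit currently enforced
    echo "hard: $(ulimit -H -n)"   # ceiling the soft limit may be raised to
}
```

Jobs inherit their limits from the environment they are started in, so it may be worth running the same command from inside a job script as well.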

LoadLeveler's Scheduler Might Not Be Accepting Jobs

See Checking the Status of LoadLeveler.


Users may direct questions to sys-help@loni.org.