Note: This page has been marked as "Obsolete" by the site administrator.

Pre WS-GRAM Error Codes

Issues

Jobs fail to start

Peer authentication errors

As a consequence of its authentication methodology, the Globus Toolkit will refuse to connect to a system when the advertised name conflicts with the expected name. These errors are typically accompanied by an error message similar to the following:

globus_xio_gsi: The peer authenticated as 
/C=US/O=Louisiana Optical Network Initiative/OU=loni.org/CN=host/l1f1g03.sys.loni.org.
Expected the peer to authenticate as /CN=host/l1f1n03.sys.loni.org

In the previous example, the client refuses to connect to the server due to a naming mismatch. In this case, however, both names are associated with the same server so the conflict does not actually indicate a problem. Although the Globus Toolkit does provide methods to circumvent these issues on the server side, these methods do not work flawlessly in practice.

To work around such a name mismatch, clients can set the environment variable GLOBUS_HOSTNAME to the name that the server actually advertises. In the given example, the server advertises its name as l1f1g03.sys.loni.org, so setting GLOBUS_HOSTNAME to that name avoids the naming error:

# GLOBUS_HOSTNAME=l1f1g03.sys.loni.org globus-job-run-ws bluedawg.loni.org /bin/date

NOTE: This name mismatch is known to occur on both Bluedawg and Ducky when jobs are submitted from the same system, e.g., when running a job on Ducky from the Ducky interactive node. Jobs initiated from outside systems do not appear to suffer from these errors.
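The workaround above can be sketched in Python. This is only an illustration, assuming the error text has the same shape as the message quoted earlier; the helper names advertised_host and env_with_hostname are hypothetical and not part of the Globus Toolkit.

```python
import os
import re

# Sample error text, copied from the message shown above.
SAMPLE_ERROR = (
    "globus_xio_gsi: The peer authenticated as \n"
    "/C=US/O=Louisiana Optical Network Initiative/OU=loni.org/"
    "CN=host/l1f1g03.sys.loni.org.\n"
    "Expected the peer to authenticate as /CN=host/l1f1n03.sys.loni.org"
)

# Assumption: the subject the server presented contains "CN=host/<fqdn>".
_AUTHENTICATED = re.compile(
    r"peer authenticated as\s+.*?CN=host/([A-Za-z0-9.-]+)",
    re.DOTALL,
)


def advertised_host(message):
    """Return the host name the peer authenticated as, or None if not found."""
    match = _AUTHENTICATED.search(message)
    if match is None:
        return None
    # The message ends the subject with a period; drop it.
    return match.group(1).rstrip(".")


def env_with_hostname(message, base_env=None):
    """Copy an environment and set GLOBUS_HOSTNAME to the advertised name."""
    env = dict(os.environ if base_env is None else base_env)
    host = advertised_host(message)
    if host is not None:
        env["GLOBUS_HOSTNAME"] = host
    return env


if __name__ == "__main__":
    print(advertised_host(SAMPLE_ERROR))
```

The dictionary returned by env_with_hostname could then be passed as the env argument of subprocess.run when re-running the job submission command.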

Jobs on Each Cluster Run Independently

This may be caused by conflicting labels in the Globus RSL file.

Each subjob must be given a different label in the Globus RSL file. Otherwise, Globus does not treat the subjobs as one common job, but instead starts them as separate, independent jobs.

The following RSL file is labeled correctly. Note (label = Bench_WaveToy_PUGH_1001) for the first subjob on l2f1n01 and (label = Bench_WaveToy_PUGH_1002) for the second subjob on l3f1n01.

+
(& (resourceManagerContact = "l2f1n01.sys.loni.org/jobmanager-loadleveler")
  (queue = checkpt)
  (label = Bench_WaveToy_PUGH_1001)
  (job_type = multiple)
  (count = 8)
  (host_count = 1)
  (maxWallTime = 60)
  (directory = /work/default/ou/flower/run)
  (environment =
               (GLOBUS_DUROC_SUBJOB_INDEX 0)
               (GBLL_NODE_USAGE shared)
               (LD_LIBRARY_PATH /usr/local/globus/globus-4.0.4/lib:/usr/local/packages/mpich-g2-64/lib)
               (PATH /usr/local/globus/globus-4.0.4/bin:/usr/local/packages/mpich-g2-64/bin:.)
  )
  (executable = /work/default/ou/flower/run/hydro)
  (stderr=/work/default/ou/flower/run/std.err)
  (stdout=/work/default/ou/flower/run/std.out)
)
(& (resourceManagerContact = "l3f1n01.sys.loni.org/jobmanager-loadleveler")
  (label = Bench_WaveToy_PUGH_1002)
  (queue = checkpt)
  (job_type = multiple)
  (count = 8)
  (host_count = 1)
  (maxWallTime = 60)
  (directory = /work/default/ou/flower/run)
  (environment =
               (GLOBUS_DUROC_SUBJOB_INDEX 1)
               (GBLL_NODE_USAGE shared)
               (LD_LIBRARY_PATH /usr/local/globus/globus-4.0.4/lib:/usr/local/packages/mpich-g2-64/lib)
               (PATH /usr/local/globus/globus-4.0.4/bin:/usr/local/packages/mpich-g2-64/bin:.)
  )
  (executable = /work/default/ou/flower/run/hydro)
  (stderr=/work/default/ou/flower/run/std.err)
  (stdout=/work/default/ou/flower/run/std.out)
)
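
Before submitting a multi-subjob RSL file, the labels can be checked for duplicates. The following Python sketch does this with a simple regular expression; it is not a full RSL parser, and the function names subjob_labels and duplicate_labels are hypothetical. It assumes each label is written as (label = NAME), as in the file above.

```python
import re
from collections import Counter

# Assumption: labels appear as (label = NAME) or (label = "NAME").
_LABEL = re.compile(r'\(\s*label\s*=\s*"?([^\s")]+)"?\s*\)', re.IGNORECASE)


def subjob_labels(rsl_text):
    """Return every label value found in the RSL text, in order."""
    return _LABEL.findall(rsl_text)


def duplicate_labels(rsl_text):
    """Return the sorted list of labels used by more than one subjob."""
    counts = Counter(subjob_labels(rsl_text))
    return sorted(label for label, n in counts.items() if n > 1)


# Trimmed-down RSL text with distinct labels, modeled on the file above.
GOOD_RSL = """+
(& (resourceManagerContact = "l2f1n01.sys.loni.org/jobmanager-loadleveler")
  (label = Bench_WaveToy_PUGH_1001)
  (count = 8)
)
(& (resourceManagerContact = "l3f1n01.sys.loni.org/jobmanager-loadleveler")
  (label = Bench_WaveToy_PUGH_1002)
  (count = 8)
)
"""

# The same text with the second label changed to collide with the first.
BAD_RSL = GOOD_RSL.replace("PUGH_1002", "PUGH_1001")

if __name__ == "__main__":
    print(duplicate_labels(GOOD_RSL))
    print(duplicate_labels(BAD_RSL))
```

An empty result means every subjob label is distinct; any label listed in the result is shared by two or more subjobs and should be renamed.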