Deprecated: please see new documentation site.



Submit a Batch Job

# llsubmit mysubmitscript.ll
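
The page does not show what a submit script contains, so the following is only a minimal sketch of how one might be generated. It assumes the standard LoadLeveler "# @ keyword = value" directive syntax. The helper name render_job_script, the program name ./my_program, and the chosen limits are hypothetical and should be adapted to the local configuration.

```python
def render_job_script(job_class, nodes, tasks_per_node, wall_clock, command):
    """Return the text of a minimal LoadLeveler job command file.

    Every value passed in is an illustrative assumption; check the class
    names against the llclass output on your own cluster.
    """
    lines = [
        "#!/bin/sh",
        "# @ job_type = parallel",
        f"# @ class = {job_class}",
        f"# @ node = {nodes}",
        f"# @ tasks_per_node = {tasks_per_node}",
        f"# @ wall_clock_limit = {wall_clock}",
        "# @ output = $(jobid).out",
        "# @ error = $(jobid).err",
        "# @ queue",      # marks the end of the directives for this job step
        command,          # hypothetical program to run
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the script to disk, then submit it with: llsubmit mysubmitscript.ll
    with open("mysubmitscript.ll", "w") as handle:
        handle.write(render_job_script("workq", 1, 8, "00:30:00", "./my_program"))
```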

Queue Information

The llclass command provides information about each queue (job class).

# llclass

Example output:

Name        MaxJobCPU   MaxProcCPU  Free   Max Description
            d+hh:mm:ss  d+hh:mm:ss Slots Slots
----------- ----------  ---------- ----- ----- ---------------------
interactive undefined   undefined    4     8   Interactive Parallel jobs running on interactive node
workq       unlimited   unlimited    0    56   Default queue, up to 56 processors
preempt     unlimited   unlimited   16    48   queue resevered for on-demand jobs, up to 48 processors
checkpt     unlimited   unlimited   16   104   queue for checkpointing jobs, up to 104 processors, Job 
                                               running on this queue can be preempted for on-demand job
--------------------------------------------------------------------------------
"Free Slots" values of the classes "workq", "preempt", "checkpt" are constrained by the MAX_STARTERS limit(s).
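
To pick a queue programmatically, the table above can be parsed for its Free Slots and Max Slots columns. The sketch below assumes the column layout shown in the example output, with wrapped description lines starting with whitespace. The function names are hypothetical, and the embedded sample is an abbreviated copy of the output above.

```python
SAMPLE = """\
Name        MaxJobCPU   MaxProcCPU  Free   Max Description
            d+hh:mm:ss  d+hh:mm:ss Slots Slots
----------- ----------  ---------- ----- ----- ---------------------
interactive undefined   undefined    4     8   Interactive Parallel jobs running on interactive node
workq       unlimited   unlimited    0    56   Default queue, up to 56 processors
checkpt     unlimited   unlimited   16   104   queue for checkpointing jobs, up to 104 processors, Job
                                               running on this queue can be preempted for on-demand job
--------------------------------------------------------------------------------
"""


def parse_llclass(text):
    """Return {class name: (free slots, max slots)} from llclass-style text."""
    classes = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("---"):
            if in_table:
                break          # second dashed line closes the table
            in_table = True    # first dashed line opens the table
            continue
        # Skip header rows, blank lines, and wrapped description lines.
        if not in_table or not line or line[0].isspace():
            continue
        fields = line.split(None, 5)
        if len(fields) >= 5 and fields[3].isdigit() and fields[4].isdigit():
            classes[fields[0]] = (int(fields[3]), int(fields[4]))
    return classes


def classes_with_free_slots(text):
    """Return the names of classes that currently report free slots."""
    return [name for name, (free, _) in parse_llclass(text).items() if free > 0]


if __name__ == "__main__":
    print(classes_with_free_slots(SAMPLE))
```

In practice the text would come from running llclass itself rather than from an embedded sample.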

View Job Status

All Jobs in the Queue
# llq
All of One's Own Jobs
# llq -u username
Details About Why A Job Has Not Yet Started
# llq -s job-id

The key information is located at the end of the output, and will look similar to the following:

==================== EVALUATIONS FOR JOB STEP l1f1n01.4604.0 ====================
The class of this job step is "workq".
Total number of available initiators of this class on all machines in the cluster: 0
Minimum number of initiators of this class required by job step: 4
The number of available initiators of this class is not sufficient for this job step.
Not enough resources to start now.
Not enough resources for this step as backfill.

Alternatively, the output may give an estimated start time:

==================== EVALUATIONS FOR JOB STEP l1f1n01.8207.0 ====================
The class of this job step is "checkpt".
Total number of available initiators of this class on all machines in the cluster: 8
Minimum number of initiators of this class required by job step: 32
The number of available initiators of this class is not sufficient for this job step.
Not enough resources to start now.
This step is top-dog. 
Considered at: Fri Jul 13 12:12:04 2007
Will start by: Tue Jul 17 18:10:32 2007
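
The evaluation text above can be reduced to two numbers: how many initiators the job step is short, and how long it is expected to wait. The following is a rough sketch that assumes the line wording and timestamp format shown in the examples. The function name summarize_evaluation is hypothetical, and the sample is an excerpt of the second example.

```python
import re
from datetime import datetime

SAMPLE = """\
The class of this job step is "checkpt".
Total number of available initiators of this class on all machines in the cluster: 8
Minimum number of initiators of this class required by job step: 32
Not enough resources to start now.
Considered at: Fri Jul 13 12:12:04 2007
Will start by: Tue Jul 17 18:10:32 2007
"""

# Timestamp layout as printed in the examples; %a and %b assume an English locale.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def summarize_evaluation(text):
    """Extract the initiator shortfall and estimated wait from llq -s style text."""
    summary = {}
    available = re.search(r"Total number of available initiators.*: (\d+)", text)
    required = re.search(r"required by job step: (\d+)", text)
    if available and required:
        shortfall = int(required.group(1)) - int(available.group(1))
        summary["shortfall"] = max(0, shortfall)
    considered = re.search(r"Considered at: (.+)", text)
    start_by = re.search(r"Will start by: (.+)", text)
    if considered and start_by:
        begin = datetime.strptime(considered.group(1).strip(), TIME_FORMAT)
        end = datetime.strptime(start_by.group(1).strip(), TIME_FORMAT)
        summary["estimated_wait"] = end - begin
    return summary


if __name__ == "__main__":
    print(summarize_evaluation(SAMPLE))
```

When the output contains no "Will start by" line, as in the first example, the result simply omits the estimated wait.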

Generate a Long Listing Rather Than the Standard One
# llq -l job-id

This command will give you detailed job information.

Job Status States

State             Code  Description
----------------  ----  ---------------------
Canceled          CA    The job has been canceled, for example by the llcancel command.
Completed         C     The job has completed.
Complete Pending  CP    The job is in the process of completing; some tasks have finished.
Deferred          D     The job will not be assigned until a specified date. The start date may have been
                        specified by the user in the job command file, or it may have been set by
                        LoadLeveler because a parallel job could not obtain enough machines to run.
Idle              I     The job is being considered to run on a machine, though no machine has been
                        selected yet.
NotQueued         NQ    The job is not being considered to run. A job may enter this state because of an
                        error in the command file or because LoadLeveler cannot obtain information that
                        it needs to act on the request.
Not Run           NR    The job will never run because a stated dependency in the job command file
                        evaluated to false.
Pending           P     The job is in the process of starting on one or more machines. The request to
                        start the job has been sent but has not yet been acknowledged.
Rejected          X     The job did not start because of a mismatch between the requirements of your job
                        and the resources on the target machine, or because the user does not have a
                        valid ID on the target machine.
Reject Pending    XP    The job is in the process of being rejected.
Removed           RM    The job was canceled by either LoadLeveler or the owner of the job.
Remove Pending    RP    The job is in the process of being removed.
Running           R     The job is running.
Starting          ST    The job is starting.
Submission Error  SX    The job cannot start due to a submission error. Please notify the Bluedawg
                        administration team if you encounter this error.
System Hold       S     The job has been put on hold by a system administrator.
System User Hold  HS    Both the user and a system administrator have put the job on hold.
Terminated        TX    The job was terminated, presumably by means beyond LoadLeveler's control. Please
                        notify the Bluedawg administration team if you encounter this error.
User Hold         H     The job has been put on hold by the owner.
Vacated           V     The started job did not complete. The job will be scheduled again, provided that
                        it may be rescheduled.
Vacate Pending    VP    The job is in the process of vacating.
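
When post-processing job listings in a script, the abbreviations above are convenient as a lookup table. The sketch below simply transcribes the codes from this page. The names JOB_STATES, NOTIFY_ADMINS, and describe_state are hypothetical, and flagging SX and TX follows the advice given for those two states.

```python
# Abbreviation -> state name, transcribed from the list above.
JOB_STATES = {
    "CA": "Canceled",
    "C": "Completed",
    "CP": "Complete Pending",
    "D": "Deferred",
    "I": "Idle",
    "NQ": "NotQueued",
    "NR": "Not Run",
    "P": "Pending",
    "X": "Rejected",
    "XP": "Reject Pending",
    "RM": "Removed",
    "RP": "Remove Pending",
    "R": "Running",
    "ST": "Starting",
    "SX": "Submission Error",
    "S": "System Hold",
    "HS": "System User Hold",
    "TX": "Terminated",
    "H": "User Hold",
    "V": "Vacated",
    "VP": "Vacate Pending",
}

# States for which this page asks users to contact the administration team.
NOTIFY_ADMINS = {"SX", "TX"}


def describe_state(code):
    """Return a readable label for a state abbreviation."""
    name = JOB_STATES.get(code)
    if name is None:
        return f"{code}: unknown state"
    suffix = " (notify the administration team)" if code in NOTIFY_ADMINS else ""
    return f"{code}: {name}{suffix}"


if __name__ == "__main__":
    for code in ("R", "I", "TX"):
        print(describe_state(code))
```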

Cancel a Job

A Particular Job
# llcancel job-id
All of One's Jobs
# llcancel -u username

Job History and Usage Summaries

Each cluster keeps a history of all jobs run under LoadLeveler in the file /var/loadl/archive/history.archive, which can be queried using the llsummary command.

For example, to summarize the jobs of a single user:

# llsummary -u estrabd /var/loadl/archive/history.archive

The output looks something like:

       Name   Jobs   Steps        Job Cpu    Starter Cpu     Leverage
    estrabd    118     128       07:55:57       00:00:45        634.6
      TOTAL    118     128       07:55:57       00:00:45        634.6
      Class   Jobs   Steps        Job Cpu    Starter Cpu     Leverage
    checkpt     13      23       03:09:32       00:00:18        631.8
interactive    105     105       04:46:24       00:00:26        660.9
      TOTAL    118     128       07:55:57       00:00:45        634.6
      Group   Jobs   Steps        Job Cpu    Starter Cpu     Leverage
   No_Group    118     128       07:55:57       00:00:45        634.6
      TOTAL    118     128       07:55:57       00:00:45        634.6
    Account   Jobs   Steps        Job Cpu    Starter Cpu     Leverage
       NONE    118     128       07:55:57       00:00:45        634.6
      TOTAL    118     128       07:55:57       00:00:45        634.6

The llsummary tool has many options, which are described in its man page.
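
The Leverage figures in the sample output above are consistent with Job Cpu divided by Starter Cpu, which makes them easy to cross-check. The sketch below works under that reading and assumes both times are printed as hh:mm:ss with no day prefix. The function names are hypothetical.

```python
def hms_to_seconds(value):
    """Convert an hh:mm:ss string, as printed by llsummary, to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def leverage(job_cpu, starter_cpu):
    """Ratio of job CPU time to starter CPU time, rounded to one decimal."""
    return round(hms_to_seconds(job_cpu) / hms_to_seconds(starter_cpu), 1)


if __name__ == "__main__":
    # Values taken from the sample rows above.
    print(leverage("07:55:57", "00:00:45"))  # estrabd / TOTAL row: 634.6
    print(leverage("03:09:32", "00:00:18"))  # checkpt row: 631.8
    print(leverage("04:46:24", "00:00:26"))  # interactive row: 660.9
```

A high ratio means that almost all of the CPU time went to the job itself rather than to LoadLeveler's starter process.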

Check the Status of Each Node

# llstatus

The output looks something like:

ou@l3f1n01$ llstatus
Name                      Schedd  InQ Act Startd Run LdAvg Idle Arch      OpSys
l3f1n01                   Avail     4   2 Idle     0 1.01     0 Power5    AIX53
l3f1n02                   Down      0   0 Busy     8 8.31  9999 Power5    AIX53
l3f1n03                   Down      0   0 Idle     0 0.00  9999 Power5    AIX53
l3f1n04                   Down      0   0 Idle     0 0.01  9999 Power5    AIX53
l3f1n05                   Down      0   0 Busy     8 7.73  9999 Power5    AIX53
l3f1n06                   Down      0   0 Busy     8 9.03  9999 Power5    AIX53
l3f1n07                   Down      0   0 Busy     8 7.98  9999 Power5    AIX53
l3f1n08                   Down      0   0 Busy     8 9.01  9999 Power5    AIX53
l3f1n09                   Down      0   0 Busy     8 8.73  9999 Power5    AIX53
l3f1n10                   Down      0   0 Busy     8 8.00  9999 Power5    AIX53
l3f1n11                   Down      0   0 Idle     0 1.04  9999 Power5    AIX53
l3f1n12                   Down      0   0 Idle     0 0.00  9999 Power5    AIX53
l3f1n13                   Down      0   0 Idle     0 0.00  9999 Power5    AIX53
l3f1n14                   Down      0   0 Busy     8 8.07  9999 Power5    AIX53 
Power5/AIX53               14 machines      4  jobs     64  running
Total Machines             14 machines      4  jobs     64  running
The Central Manager is defined on l3f1n01
The BACKFILL scheduler is in use
All machines on the machine_list are present.
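
To summarize node usage from this listing, the machine rows can be picked out by their column count. The sketch below assumes the ten-column layout shown above, with Startd in the fifth column and Run in the sixth. The function name parse_llstatus is hypothetical, and the sample is an excerpt of the example output.

```python
SAMPLE = """\
Name                      Schedd  InQ Act Startd Run LdAvg Idle Arch      OpSys
l3f1n01                   Avail     4   2 Idle     0 1.01     0 Power5    AIX53
l3f1n02                   Down      0   0 Busy     8 8.31  9999 Power5    AIX53
l3f1n03                   Down      0   0 Idle     0 0.00  9999 Power5    AIX53
Power5/AIX53               14 machines      4  jobs     64  running
The Central Manager is defined on l3f1n01
"""


def parse_llstatus(text):
    """Return one dict per machine row found in llstatus-style text."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Machine rows have ten columns and a numeric Run column; the header,
        # the summary lines, and the trailing messages do not.
        if len(fields) == 10 and fields[5].isdigit():
            nodes.append(
                {"name": fields[0], "startd": fields[4], "run": int(fields[5])}
            )
    return nodes


if __name__ == "__main__":
    nodes = parse_llstatus(SAMPLE)
    idle = [node["name"] for node in nodes if node["startd"] == "Idle"]
    running = sum(node["run"] for node in nodes)
    print(f"idle nodes: {idle}, running tasks: {running}")
```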

Users may direct questions to sys-help@loni.org.
