Examining Checkjob -v

From CAC Documentation wiki
Revision as of 11:32, 22 September 2015 by Ad876 (talk | contribs) (Created page with "==Reading Checkjob Output== The Maui [http://www.adaptivecomputing.com/resources/docs/maui/commands/checkjob.php checkjob] command can show why a job isn't running. ===Examp...")

Reading Checkjob Output

The Maui checkjob command can show why a job isn't running.

Example Stalled Job

The output of checkjob -v is split into parts below, with a discussion after each part.

[user@v4linuxlogin1 ~]$ checkjob -v 1005159
job 1005159 (RM job '1005159.scheduler.v4linux')

Each job has two names: a plain number assigned by the scheduler and a longer name assigned by the resource manager (RM). You can refer to the job by either name when calling Maui commands. You may have noticed that submitting a job returns one of these two names.
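As a minimal sketch, assuming the RM name is always the scheduler number followed by a dotted suffix (as in the sample above), a script can reduce either form to the plain number. The helper name scheduler_job_id is hypothetical and not part of Maui.

```python
def scheduler_job_id(name: str) -> str:
    """Return the numeric scheduler ID from either form of a job name.

    Assumes the RM name looks like '<number>.<suffix>', as in the sample output.
    """
    job_id = name.strip().split(".", 1)[0]
    if not job_id.isdigit():
        raise ValueError(f"unrecognized job name: {name!r}")
    return job_id


print(scheduler_job_id("1005159.scheduler.v4linux"))  # 1005159
print(scheduler_job_id("1005159"))                    # 1005159
```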

AName: dmsample_sim_ERE_r3.p4_master_7
State: Idle

The state is Idle, but there are different kinds of Idle. A newly submitted job stays in a purely Idle state for up to 30 seconds before the scheduler notices it and decides how to run it. After that, the output below shows which particular kind of Idle the job is in.

Creds:  user:abd27  group:Domain Users  account:ems4_0001  class:v4
WallTime:   00:00:00 of 7:00:00:00
SubmitTime: Mon Jul 13 10:31:57
  (Time Queued  Total: 00:53:41  Eligible: 00:51:32)

The account here is the project account, which is associated with a University account number. For the scheduler, class means the name of the queue to which you submitted the job. The WallTime line shows time used against the requested limit: this job has used none of its seven-day limit.
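The durations above can be converted for arithmetic. The sketch below assumes the two layouts seen in this sample, HH:MM:SS and DD:HH:MM:SS, and uses a hypothetical helper named parse_duration; other layouts are not handled.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse HH:MM:SS or DD:HH:MM:SS, the two layouts in the sample output."""
    parts = [int(p) for p in text.strip().split(":")]
    if len(parts) == 3:
        parts.insert(0, 0)  # no day field present
    if len(parts) != 4:
        raise ValueError(f"unexpected duration: {text!r}")
    days, hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_duration("7:00:00:00")
print(limit.days)  # 7

# Time queued but not eligible, from the Time Queued line above.
queued = parse_duration("00:53:41")
eligible = parse_duration("00:51:32")
print(queued - eligible)  # 0:02:09
```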

Here we see the exact state of the job, which is Eligible. Checkjob lists a "state" and an "estate". Sometimes the state alone tells the whole story, but not for Idle jobs. The possible states of a job are:

  • Idle - Idle - Within the first 30s of submission, the scheduler has yet to evaluate this job.
  • Idle - Eligible - User credentials are correct. The job will run as soon as a Resource Manager accepts it on a partition and node.
  • Running - Currently running on nodes.
  • Deferred - The first attempt to run the job failed. The scheduler will try again in a few seconds.
  • Idle - Batch Hold - Attempts to run the job failed. This state sends email to system administrators.
  • Idle - User Hold - You can submit a job in a held state so that it queues but does not run until you release it. This can be useful when the queue is long and you aren't quite ready to run.
  • Idle - System Hold - The system administrator asked that the queue be halted temporarily.
  • Completed - The job finished.

More in-depth examination of the reasons for holding a job is at Maui Job Holds. To see the lifetime of a job, look at Programmatically Submitting Jobs and Checking Whether They are Done.

The sections that follow show why this job cannot yet run.

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1

Req[0]  TaskCount: 1  Partition: ALL
Memory >= 0  Disk >= 0  Swap >= 0
NodeCount:  1
IWD:            $HOME/dmsample/dmsimulate
UMask:          0022
Executable:     /opt/moab/spool/moab.job.bJAUKo
OutputFile:     v4linuxlogin1.cac.cornell.edu:/home/gfs05/agd27/dmsample/dmsimulate/
ErrorFile:      v4linuxlogin1.cac.cornell.edu:/home/gfs05/agd27/dmsample/dmsimulate/
BypassCount:    1
Partition List: ALL,scheduler
SrcRM:          internal  DstRM: scheduler  DstRMJID: 1005159.scheduler.v4linux
Submit Args:    -q v4@scheduler.v4linux 1
Flags:          GLOBALQUEUE
Attr:           checkpoint
StartPriority:  51
PE:             1.00
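For scripted checks, the single-value fields printed above can be collected into a dictionary. This is a rough sketch that assumes the "Name: value" layout of this sample; lines that carry several fields (such as SrcRM/DstRM) keep everything after the first colon as one value, and lines without a leading field name are skipped. SAMPLE and parse_fields are illustrative names.

```python
import re

# A few lines copied from the sample output above.
SAMPLE = """\
State: Idle
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1
BypassCount:    1
Partition List: ALL,scheduler
StartPriority:  51
PE:             1.00
"""

# A field name made of letters and spaces, a colon, whitespace, then the value.
FIELD = re.compile(r"^([A-Za-z][A-Za-z ]*):\s+(\S.*)$")


def parse_fields(output: str) -> dict:
    """Map each leading field name to the rest of its line."""
    fields = {}
    for line in output.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


fields = parse_fields(SAMPLE)
print(fields["State"], fields["StartPriority"])  # Idle 51
```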

Next comes a list of nodes, each with the reason it rejected the job. A node can reject a job for any of the following reasons:

  • Class - This node is in the wrong queue. For Maui, "class" means queue, as in v4 or v4dev.
  • CPU - This node is busy with another job.
  • State - Usually this means the node is marked Down for repairs.
  • Reserved - A reservation on this node prevents the job from running there. The name in parentheses identifies the reservation. A reservation such as "asr2004_0001.473" indicates that the project asr2004_0001 has special rights to this node. A reservation such as "system.470" is likely a scheduled down time for applying patches. A reservation such as "1009809/policy" indicates that job 1009809 is already running on those nodes.

Node Availability for Partition scheduler --------
lmcompute-4-1.v4linux    rejected: Class
lmcompute-4-2.v4linux    rejected: Class
lmcompute-4-3.v4linux    rejected: Class
lmcompute-4-4.v4linux    rejected: Class
compute-1-1.v4linux      rejected: State (Down)
compute-1-2.v4linux      rejected: State (Down)
compute-1-3.v4linux      rejected: State (Down)
compute-1-4.v4linux      rejected: State (Down)
compute-1-5.v4linux      rejected: State (Down)
compute-1-6.v4linux      rejected: State (Down)
compute-3-1.v4linux      rejected: Reserved (system.470/policy)
compute-3-2.v4linux      rejected: Reserved (system.470/policy)
compute-3-3.v4linux      rejected: Reserved (system.470/policy)
compute-3-9.v4linux      rejected: Reserved (asr2004_0001.473/policy)
compute-3-10.v4linux     rejected: Reserved (asr2004_0001.473/policy)
compute-3-11.v4linux     rejected: Reserved (asr2004_0001.473/policy)
compute-3-12.v4linux     rejected: Reserved (asr2004_0001.473/policy)
compute-3-13.v4linux     rejected: Reserved (asr2004_0001.473/policy) 
compute-3-18.v4linux     rejected: Reserved (jp86_0005.474/policy)
compute-3-19.v4linux     rejected: Reserved (jp86_0005.474/policy)
compute-3-20.v4linux     rejected: Reserved (jp86_0005.474/policy)
compute-3-21.v4linux     rejected: Reserved (jp86_0005.474/policy)
compute-3-22.v4linux     rejected: Reserved (jp86_0005.474/policy) 
compute-3-23.v4linux     rejected: Reserved (1009809/policy)
compute-3-24.v4linux     rejected: Reserved (1009809/policy)
compute-3-25.v4linux     rejected: Reserved (1009808/policy)
compute-3-26.v4linux     rejected: Reserved (1009808/policy)
compute-3-27.v4linux     rejected: Reserved (1009808/policy)
compute-3-28.v4linux     rejected: Reserved (system.470/policy)
compute-3-29.v4linux     rejected: Reserved (system.470/policy)
compute-3-30.v4linux     rejected: Reserved (system.470/policy)
compute-3-31.v4linux     rejected: Reserved (system.470/policy)  
NOTE:  job cannot run in partition scheduler (idle procs do not meet requirements : 0 of 1  procs found)
idle procs: 416  feasible procs:   0
Node Rejection Summary: [Class: 4][State: 32][Reserved: 48]
NOTE:  job violates constraints for partition CCS (partition CCS not in job partition mask)

Partition CCS is the Windows partition. It isn't surprising that a Linux job can't run in the Windows partition, but the output shows that the job currently cannot run anywhere. Because many nodes report system.470, it seems this job, with its seven-day wall time limit, would run long enough to overlap an upcoming down time, so it will start afterward.
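When the node list is long, tallying it makes the dominant cause easier to spot. The sketch below assumes the "node  rejected: Reason (detail)" layout shown above and classifies reservation names using only the three patterns described on this page (all digits for a running job, a "system." prefix for down time, anything else treated as a project). The names tally and classify are hypothetical, and the sample is a subset of the list above.

```python
import re
from collections import Counter

SAMPLE = """\
lmcompute-4-1.v4linux    rejected: Class
compute-1-1.v4linux      rejected: State (Down)
compute-3-1.v4linux      rejected: Reserved (system.470/policy)
compute-3-9.v4linux      rejected: Reserved (asr2004_0001.473/policy)
compute-3-23.v4linux     rejected: Reserved (1009809/policy)
compute-3-28.v4linux     rejected: Reserved (system.470/policy)
"""

# node name, the word after "rejected:", and an optional parenthesized detail
LINE = re.compile(r"^(\S+)\s+rejected:\s+(\w+)(?:\s+\(([^)]*)\))?\s*$")


def classify(reservation: str) -> str:
    """Guess the kind of reservation from its name (heuristic from this page)."""
    name = reservation.split("/", 1)[0]
    if name.isdigit():
        return "job"
    if name.startswith("system."):
        return "system"
    return "project"


def tally(output: str):
    """Count rejection reasons and, for Reserved nodes, reservation names."""
    reasons, reservations = Counter(), Counter()
    for line in output.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip headers, NOTE lines, and blanks
        _node, reason, detail = match.groups()
        reasons[reason] += 1
        if reason == "Reserved" and detail:
            reservations[detail.split("/", 1)[0]] += 1
    return reasons, reservations


reasons, reservations = tally(SAMPLE)
print(dict(reasons))                 # {'Class': 1, 'State': 1, 'Reserved': 4}
print(reservations.most_common(1))   # [('system.470', 2)]
print(classify("1009809/policy"))    # job
```

A large count for a system reservation, as here, points to a scheduled down time rather than to busy nodes.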