Examining Checkjob -v
Reading Checkjob Output
The Maui checkjob command can show why a job isn't running.
Example Stalled Job
Below we split the output of checkjob into parts to discuss them.
[user@v4linuxlogin1 ~]$ checkjob -v 1005159
job 1005159 (RM job '1005159.scheduler.v4linux')
Jobs have two names: a pure number from the scheduler and a longer name from the resource manager (RM). You can refer to the job by either name when calling Maui commands. You may have noticed that submitting a job returns one of these two names.
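If a script needs to treat the two names as the same job, one option is to reduce both to the scheduler number. The sketch below assumes the resource manager name always begins with that number followed by a dot, as in the sample above; the function name is hypothetical.

```python
def scheduler_job_number(job_name: str) -> int:
    """Return the numeric scheduler ID from either form of job name.

    Assumption: the RM name starts with the scheduler number and a dot,
    as in '1005159.scheduler.v4linux'.
    """
    head = job_name.strip().split(".", 1)[0]
    if not head.isdigit():
        raise ValueError(f"unrecognized job name: {job_name!r}")
    return int(head)


print(scheduler_job_number("1005159"))
print(scheduler_job_number("1005159.scheduler.v4linux"))
```

Both calls print the same number, so either name can serve as a lookup key.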
AName: dmsample_sim_ERE_r3.p4_master_7
State: Idle
The state is Idle, but there are different kinds of Idle. A newly submitted job stays in a purely Idle state for up to 30 seconds before the scheduler notices it and decides how to run it. After that, the rest of the output shows which particular kind of Idle the job is in.
Creds: user:abd27 group:Domain Users account:ems4_0001 class:v4
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Mon Jul 13 10:31:57
(Time Queued Total: 00:53:41 Eligible: 00:51:32)
The account here is the project account, associated with a University account number. For the scheduler, class means the name of the queue to which you submitted. This wall time is seven days.
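To check a limit like this in a script, the colon-separated fields can be converted to a duration. This is a minimal sketch that assumes the fields are right-aligned as days, hours, minutes, and seconds, which matches the two samples above; parse_walltime is a hypothetical helper, not part of Maui.

```python
from datetime import timedelta


def parse_walltime(text: str) -> timedelta:
    """Convert a [[[DD:]HH:]MM:]SS string into a timedelta.

    Assumption: fields are right-aligned, so '7:00:00:00' is seven days
    and '00:53:41' is 53 minutes 41 seconds.
    """
    parts = [int(p) for p in text.strip().split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unexpected wall time format: {text!r}")
    # Pad on the left so the list is always [days, hours, minutes, seconds].
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_walltime("7:00:00:00")
queued = parse_walltime("00:53:41")
print(limit.days)              # 7
print(queued.total_seconds())  # 3221.0
```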
Here we see the exact state of the job, which is Eligible. Checkjob lists a "state" and an "estate". Sometimes the first one tells the whole story, but not in the case of Idle jobs. The possible states of a job are:
- Idle - Idle - Within the first 30s of submission, the scheduler has yet to evaluate this job.
- Idle - Eligible - User credentials are correct. The job will run as soon as a Resource Manager accepts it on a partition and node.
- Running - Currently running on nodes.
- Deferred - The first attempt to run the job failed. The scheduler will try again in a few seconds.
- Idle - Batch Hold - Attempts to run the job failed. This state sends email to system administrators.
- Idle - User Hold - You can submit a job in a held state so that it queues but won't run until you release it. This can be useful when the queue is long and you aren't quite ready to run.
- Idle - System Hold - The system administrator asked that the queue be halted temporarily.
- Completed - The job finished.
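A script that watches a job usually only needs the State field. The sketch below pulls it out of checkjob output that has already been captured as text. It assumes the field appears as "State: <word>", as in the sample above; job_state and is_done are hypothetical helper names.

```python
import re


def job_state(checkjob_text: str) -> str:
    """Return the word that follows 'State:' in captured checkjob output.

    The leading word boundary keeps a field such as 'EState:' from
    matching, so only the plain State field is read.
    """
    match = re.search(r"\bState:\s*(\w+)", checkjob_text)
    if match is None:
        raise ValueError("no 'State:' field found")
    return match.group(1)


def is_done(checkjob_text: str) -> bool:
    """True when the job reports the Completed state listed above."""
    return job_state(checkjob_text) == "Completed"


sample = "AName: dmsample_sim_ERE_r3.p4_master_7 State: Idle"
print(job_state(sample))  # Idle
print(is_done(sample))    # False
```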
More in-depth examination of the reasons for holding a job is at Maui Job Holds. To see the lifetime of a job, look at Programmatically Submitting Jobs and Checking Whether They are Done.
Subsequent sections tell you why it cannot yet run.
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Total Requested Nodes: 1
Req TaskCount: 1 Partition: ALL
Memory >= 0 Disk >= 0 Swap >= 0
NodeCount: 1
IWD: $HOME/dmsample/dmsimulate
UMask: 0022
Executable: /opt/moab/spool/moab.job.bJAUKo
OutputFile: v4linuxlogin1.cac.cornell.edu:/home/gfs05/agd27/dmsample/dmsimulate/
ErrorFile: v4linuxlogin1.cac.cornell.edu:/home/gfs05/agd27/dmsample/dmsimulate/
BypassCount: 1
Partition List: ALL,scheduler
SrcRM: internal DstRM: scheduler DstRMJID: 1005159.scheduler.v4linux
Submit Args: -q firstname.lastname@example.org 1
Flags: GLOBALQUEUE
Attr: checkpoint
StartPriority: 51
PE: 1.00
Next follows a list of nodes and why each rejected the job. There are a few reasons a particular node will reject running a job.
- Class - This node is in the wrong queue. For Maui, "class" means queue, as in v4 or v4dev.
- CPU - This node is busy with another job.
- State - Usually this means the node is marked Down for repairs.
- Reserved - A reservation on the node blocks the job, and the name of that reservation appears in parentheses. A reservation such as "asr2004_0001.473" indicates that the project asr2004_0001 has special rights to this node. A reservation such as "system.470" is likely a scheduled down time for applying patches. A reservation such as "1009809/policy" indicates that job 1009809 is already running on those nodes.
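The three kinds of reservation name described above can be told apart by their shape. The sketch below is a rough classifier based only on the patterns seen on this page; the function name and the "unknown" fallback are assumptions, and other sites may name reservations differently.

```python
import re


def classify_reservation(name: str) -> str:
    """Guess the kind of reservation from its name.

    Assumed patterns, taken from the examples on this page:
      'system.<n>'   -> system reservation (such as planned down time)
      '<digits>'     -> a job holding the node
      '<project>.<n>' -> a project with special rights to the node
    """
    base = name.split("/", 1)[0]  # drop a trailing '/policy'
    if base.isdigit():
        return "job"
    if re.fullmatch(r"system\.\d+", base):
        return "system"
    if re.fullmatch(r"\w+\.\d+", base):
        return "project"
    return "unknown"


for example in ("asr2004_0001.473/policy", "system.470/policy", "1009809/policy"):
    print(example, "->", classify_reservation(example))
```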
Node Availability for Partition scheduler --------
lmcompute-4-1.v4linux rejected: Class
lmcompute-4-2.v4linux rejected: Class
lmcompute-4-3.v4linux rejected: Class
lmcompute-4-4.v4linux rejected: Class
compute-1-1.v4linux rejected: State (Down)
compute-1-2.v4linux rejected: State (Down)
compute-1-3.v4linux rejected: State (Down)
compute-1-4.v4linux rejected: State (Down)
compute-1-5.v4linux rejected: State (Down)
compute-1-6.v4linux rejected: State (Down)
compute-3-1.v4linux rejected: Reserved (system.470/policy)
compute-3-2.v4linux rejected: Reserved (system.470/policy)
compute-3-3.v4linux rejected: Reserved (system.470/policy)
compute-3-9.v4linux rejected: Reserved (asr2004_0001.473/policy)
compute-3-10.v4linux rejected: Reserved (asr2004_0001.473/policy)
compute-3-11.v4linux rejected: Reserved (asr2004_0001.473/policy)
compute-3-12.v4linux rejected: Reserved (asr2004_0001.473/policy)
compute-3-13.v4linux rejected: Reserved (asr2004_0001.473/policy)
compute-3-18.v4linux rejected: Reserved (jp86_0005.474/policy)
compute-3-19.v4linux rejected: Reserved (jp86_0005.474/policy)
compute-3-20.v4linux rejected: Reserved (jp86_0005.474/policy)
compute-3-21.v4linux rejected: Reserved (jp86_0005.474/policy)
compute-3-22.v4linux rejected: Reserved (jp86_0005.474/policy)
compute-3-23.v4linux rejected: Reserved (1009809/policy)
compute-3-24.v4linux rejected: Reserved (1009809/policy)
compute-3-25.v4linux rejected: Reserved (1009808/policy)
compute-3-26.v4linux rejected: Reserved (1009808/policy)
compute-3-27.v4linux rejected: Reserved (1009808/policy)
compute-3-28.v4linux rejected: Reserved (system.470/policy)
compute-3-29.v4linux rejected: Reserved (system.470/policy)
compute-3-30.v4linux rejected: Reserved (system.470/policy)
compute-3-31.v4linux rejected: Reserved (system.470/policy)
NOTE: job cannot run in partition scheduler (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 416 feasible procs: 0
Node Rejection Summary: [Class: 4][State: 32][Reserved: 48]
NOTE: job violates constraints for partition CCS (partition CCS not in job partition mask)
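With a long node list it can help to count which reason or reservation blocks the most nodes. The sketch below tallies the "rejected:" entries from captured checkjob text. It assumes each entry has the form "node rejected: Reason (detail)" as in the listing above; the function name and the short sample are illustrative only.

```python
import re
from collections import Counter

# One entry: node name, the word 'rejected', a reason, and optional detail
# in parentheses. Whitespace around the colon is tolerated.
ENTRY = re.compile(r"(\S+)\s+rejected\s*:\s*(\w+)(?:\s+\(([^)]*)\))?")


def tally_rejections(text: str):
    """Return (count per reason, count per parenthesized detail)."""
    reasons = Counter()
    details = Counter()
    for _node, reason, detail in ENTRY.findall(text):
        reasons[reason] += 1
        if detail:
            details[detail] += 1
    return reasons, details


# A few entries copied from the listing above, for illustration.
sample = """
lmcompute-4-1.v4linux rejected: Class
compute-1-1.v4linux rejected: State (Down)
compute-3-1.v4linux rejected: Reserved (system.470/policy)
compute-3-2.v4linux rejected: Reserved (system.470/policy)
compute-3-23.v4linux rejected: Reserved (1009809/policy)
"""
reasons, details = tally_rejections(sample)
print(reasons.most_common())
print(details.most_common(1))
```

Because the pattern does not depend on line breaks, it also works when the entries arrive run together on a single line.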
Partition CCS is the Windows partition. It isn't surprising that a Linux job can't run in the Windows partition, but we see that it currently cannot run anywhere. Because many nodes report system.470, it seems the job's seven-day wall time would overlap an upcoming down time, so the job will start afterward.