Programmatically Submitting Jobs and Checking Whether They are Done

From CAC Documentation wiki

Submitting Jobs and Checking Them With Scripts

Sometimes you need to write a script that submits many jobs to the cluster and then monitors them to see when they complete. One way to see if your job finishes is to check whether it has written its output files, but checking directly with the scheduler can tell you if there are error conditions when you submit and while the job runs.

The Maui commands showq and checkjob can print their results in XML if you add the --format=XML switch. This is suitable for parsing in a scripting language, and we'll show Python examples below.

There are some details to note if you want to use scripts to read scheduler information. For Maui schedulers that include Windows clusters, the nsub command sometimes, but not always, returns a job ID that is not just a number, for example Maui.3779. A regular expression such as ^((Maui\.)?\d+)$ will therefore find the job ID in nsub's output. (The dot is escaped so that it matches only a literal period.)
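
The job ID match described above can be sketched in Python as follows. This is a minimal sketch that assumes nsub prints the job ID on a line of its own, possibly with surrounding whitespace; the name parse_job_id is illustrative and is not taken from nsub.py.

```python
import re

# Matches a bare number, or a number prefixed with "Maui." (literal dot),
# on a line of its own. Assumption: nsub prints the ID on its own line.
JOB_ID_RE = re.compile(r'^\s*((?:Maui\.)?\d+)\s*$', re.MULTILINE)

def parse_job_id(nsub_output):
    """Return the job ID found in nsub's output, or None if there is none."""
    match = JOB_ID_RE.search(nsub_output)
    if match:
        return match.group(1)
    return None

print(parse_job_id("Maui.3779\n"))
print(parse_job_id("1004756\n"))
```

Returning None when nothing matches lets the caller treat a failed submission the same way whether nsub printed an error or nothing at all.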

Module Interface

Attached to this wiki page is a Python module, nsub.py, which has a few simple functions. (We use a wrapper to msub, called nsub.)

  • submit_job(batch_filename) : returns (jobid, stdout, stderr) - If you get a jobid, then the job has a chance to run. The jobid is None if nsub did not assign one. This will not catch a job that stays Idle because of privilege problems, because such a job is still in the queue and still has a chance to run.
  • get_job_state(jobid) : returns a dictionary where the key-value pairs are properties of the job from checkjob
  • get_all_jobs() : returns a dictionary where keys are jobids and values are dictionaries which are job properties from showq. This has a subset of the properties from checkjob, but it's faster to get.
  • is_job_done(jobid): returns True or False - If the job just finished, then it will appear as 'Completed' in the scheduler. If it finished more than about 10 minutes ago, the scheduler will have forgotten about it. This function therefore also returns True if the scheduler has no record of the job, including when the jobid never existed, so make sure the job is in the queue before asking whether it finished.
  • are_jobs_done([jobids]): returns True or False - This is True if all jobs in the list completed.
  • cancel_job(jobid) : no return
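
A typical script built on these functions submits jobs and then polls until they are done. The loop below is a minimal sketch of that polling step, not code from nsub.py: wait_for_jobs, its parameters, and the default intervals are assumptions for illustration. The checking function is passed in, so the loop can be tried without a scheduler.

```python
import time

def wait_for_jobs(jobids, are_jobs_done, poll_seconds=30,
                  timeout_seconds=3600, sleep=time.sleep, clock=time.time):
    """Poll are_jobs_done(jobids) until it returns True or time runs out.

    Returns True if the jobs finished, False if the timeout expired first.
    """
    deadline = clock() + timeout_seconds
    while True:
        if are_jobs_done(jobids):
            return True
        if clock() >= deadline:
            return False
        # Wait before asking the scheduler again.
        sleep(poll_seconds)
```

With the module described above, the call would presumably look like wait_for_jobs(jobids, nsub.are_jobs_done). Keep the polling interval well under the time the scheduler remembers finished jobs.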

How It Works

As an example of one of the functions, get_all_jobs reads output from showq. It uses xml.dom.minidom to process the XML (avoiding elementtree so that Python 2.4 will work if need be):

from subprocess import Popen, PIPE
from xml.dom.minidom import parseString

def get_all_jobs():
    '''Returns all jobs in queue as dictionary of jobs by job id
    containing dictionary for each job as key-value.'''
    # Ask the scheduler for the whole queue as XML.
    process = Popen('showq --format=XML', shell=True,
                    stdout=PIPE, stderr=PIPE)
    (out, err) = process.communicate()
    if not out:
        return None
    try:
        dom = parseString(out)
        jobs = dom.getElementsByTagName('job')
        all_job_props = {}
        for job in jobs:
            # Copy every XML attribute of this job into a plain dictionary.
            job_props = {}
            for prop in job.attributes.keys():
                job_props[prop] = job.attributes[prop].nodeValue
            all_job_props[job_props['JobID']] = job_props
    except Exception:
        # Malformed XML, or a job element without a JobID attribute.
        return None
    return all_job_props
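
The parsing half of this function can be tried without a scheduler by feeding it a string. In the sketch below, parse_jobs is an illustrative name, and the sample XML is hypothetical: it only mirrors the fact that the function reads attributes of job elements, and real showq output may wrap them differently and carries many more attributes.

```python
from xml.dom.minidom import parseString

def parse_jobs(xml_text):
    """Return {JobID: {attribute: value}} for every <job> element."""
    dom = parseString(xml_text)
    all_job_props = {}
    for job in dom.getElementsByTagName('job'):
        job_props = {}
        for name in job.attributes.keys():
            job_props[name] = job.attributes[name].nodeValue
        all_job_props[job_props['JobID']] = job_props
    return all_job_props

# Hypothetical sample, not captured from a real scheduler.
SAMPLE = ('<Data><queue option="active">'
          '<job JobID="2654" State="Running" User="ajd27"/>'
          '</queue><queue option="eligible">'
          '<job JobID="2655" State="Idle" User="ajd27"/>'
          '</queue></Data>')

jobs = parse_jobs(SAMPLE)
print(jobs['2654']['State'])  # prints Running
```

Because getElementsByTagName searches the whole document, the jobs are found regardless of which enclosing element they sit in.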

The Life of a Windows Job

The get_all_jobs function gives you a listing of every job on the cluster and, for each job, information on its scheduler status. Over the life of a job, the scheduler reports changes to its properties. Right after you submit a job that Maui sends to a Windows cluster, you will see the following properties.

  Account shm7_0003
  Class NORMAL
  CmdFile /opt/maui/spool/maui.job.w0cqqI
  DRMJID 2654
  EEDuration 0
  EState Idle
  EffPAL [CCS]
  Flags GLOBALQUEUE
  Group Domain Users
  IWD /home/gfs01/ajd27/dev/CAC/helmke/testingdir
  JobID 2654
  JobName clans2win1
  PAL [scheduler][CCS]
  QueueStatus blocked
  RM internal
  ReqAWDuration 00:10:00
  ReqNodes 1
  ReqProcs 1
  SRMJID Maui.9685
  StartPriority 0
  StatMSUtl 0.000
  State Idle
  SubmissionTime 1246387996
  SubmitHost v4linuxlogin1.cac.cornell.edu
  SuspendDuration 0
  UMask 18
  User ajd27

Then the job State moves from Idle to Running. Each line below gives the elapsed time, the jobid, and the property that changed or appeared.

0:00:05.886087 2654 EState Idle -> Running
0:00:05.886087 2654 QueueStatus blocked -> active
0:00:05.886087 2654 ReqProcs 1 -> 8
0:00:05.886087 2654 StartPriority 0 -> 1
0:00:05.886087 2654 State Idle -> Running
0:00:05.886087 2654 BlockReason State:non-idle state 'Running'
0:00:05.886087 2654 StartCount 1
0:00:05.886087 2654 StartTime 1246388005

A little later, the scheduler updates the submission time.

0:00:41.274613 2654 SubmissionTime 1246387996 -> 1246388005
0:00:41.274613 2654 Variable JOBUNITTYPE=NODE

While the job runs, checkjob shows the scheduler updating three variables: AWDuration, StatPSUtl, and StatPSDed. Finally, the job ends. checkjob no longer reports some of the variables, but it adds a CompletionTime.

0:02:12.290911 2654 EState Running -> Completed
0:02:12.290911 2654 QueueStatus active -> blocked
0:02:12.290911 2654 ReqProcs 8 -> 1
0:02:12.290911 2654 StartPriority 1 -> 0
0:02:12.290911 2654 State Running -> Completed
0:02:12.290911 2654 BlockReason no more value
0:02:12.290911 2654 Flags no more value
0:02:12.290911 2654 PAL no more value
0:02:12.290911 2654 StartCount no more value
0:02:12.290911 2654 SubmitHost no more value
0:02:12.290911 2654 CompletionTime 1246388126
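
Change listings in this style can be produced by comparing two snapshots of a job's property dictionary. The function below is a sketch of one way to do that and is not necessarily how the listings on this page were generated; diff_props is an illustrative name, and the timestamp and jobid columns are left out.

```python
def diff_props(old, new):
    """Return lines describing how a job's properties changed.

    Changed values come first, then properties that disappeared,
    then properties that are new. Each group is sorted by name.
    """
    lines = []
    for key in sorted(old):
        if key in new and old[key] != new[key]:
            lines.append('%s %s -> %s' % (key, old[key], new[key]))
    for key in sorted(old):
        if key not in new:
            lines.append('%s no more value' % key)
    for key in sorted(new):
        if key not in old:
            lines.append('%s %s' % (key, new[key]))
    return lines

before = {'State': 'Running', 'EState': 'Running', 'StartCount': '1'}
after = {'State': 'Completed', 'EState': 'Completed',
         'CompletionTime': '1246388126'}
for line in diff_props(before, after):
    print(line)
```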

Checking for State=Completed is enough to see whether the job finished. The scheduler will remember the job for about 10 minutes after it has completed.
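
Because the scheduler eventually forgets a finished job, a script has to decide what a missing record means. The sketch below treats a missing record as "done" only if an earlier poll saw the job. The function name and the seen_before flag are assumptions for illustration, not part of nsub.py.

```python
def is_done(props, seen_before):
    """Decide whether a job is done from one checkjob snapshot.

    props       -- property dictionary, or None/empty if the scheduler
                   has no record of the job
    seen_before -- True if an earlier poll found the job in the scheduler
    """
    if props:
        return props.get('State') == 'Completed'
    # No record: trust this as "done" only if the job was known earlier.
    return bool(seen_before)
```

A caller that never sees the job at all should treat that as a submission problem rather than wait indefinitely.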

A job simply waiting in the queue to run because others are ahead of it will show:

EState="Idle"
QueueStatus="eligible"
StartPriority="8"
State="Idle"

Check the queue status of the job to see if it is waiting.

What if the job doesn't run as planned? There are several reasons a job might go on hold. If a scheduler problem defers the job, checkjob will show the following:

BlockReason="EState:deferred for 130 seconds"
EState="Deferred"
QueueStatus="Blocked"
State="Idle"

Note that Hold is not defined until the job moves to batch hold:

BlockReason="Hold:job hold active - Batch"
EState="Idle"
Hold="Batch"
QueueStatus="Blocked"
State="Idle"

If you had applied a user hold, you would see:

BlockReason="Hold:job hold active - User"
EState="Idle"
Hold="User"
QueueStatus="Blocked"
State="Idle"

If you remove a job, it goes to:

EState="Removed"
State="Removed"
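
The states described in this section can be folded into one helper that turns a property dictionary into a short label. This is a sketch under the assumption that the property names and values are as shown above; the labels and the name describe_job are made up for illustration. QueueStatus is compared case-insensitively because the listings show it in both lower and upper case.

```python
def describe_job(props):
    """Map checkjob properties to a short status label."""
    state = props.get('State')
    if state in ('Running', 'Completed', 'Removed'):
        return state.lower()
    # From here on the job is Idle; find out why it is not running.
    if props.get('Hold'):
        return 'hold:' + props['Hold'].lower()
    if props.get('EState') == 'Deferred':
        return 'deferred'
    if props.get('QueueStatus', '').lower() == 'eligible':
        return 'waiting'
    return 'blocked'

print(describe_job({'EState': 'Idle', 'QueueStatus': 'eligible',
                    'StartPriority': '8', 'State': 'Idle'}))  # prints waiting
```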

The Life of a Linux Job

For a Linux job submitted from Maui to Torque, the variables are defined differently. At the start, the following are defined.

  Account shm7_0003
  Class v4
  CmdFile /opt/maui/spool/maui.job.dtUjot
  DRMJID 1004756.scheduler.v4linux
  EEDuration 0
  EState Idle
  EffPAL [scheduler]
  Flags GLOBALQUEUE
  Group Domain Users
  IWD $HOME/dev/CAC/helmke/testingdir
  JobID 1004756
  JobName clans2win1
  PAL [scheduler][CCS]
  QueueStatus blocked
  RM internal
  ReqAWDuration 00:10:00
  ReqNodes 1
  ReqProcs 1
  SRMJID Maui.9691
  StartPriority 0
  StatMSUtl 0.000
  State Idle
  SubmissionTime 1246391469
  SubmitHost v4linuxlogin1.cac.cornell.edu
  SuspendDuration 0
  UMask 18
  User ajd27

When the job first begins running, it goes to the 'Running' state.

0:00:05.324871 1004756 EState Idle -> Running
0:00:05.324871 1004756 QueueStatus blocked -> active
0:00:05.324871 1004756 ReqProcs 1 -> 8
0:00:05.324871 1004756 StartPriority 0 -> 1
0:00:05.324871 1004756 State Idle -> Running
0:00:05.324871 1004756 SubmissionTime 1246391469 -> 1246391471
0:00:05.324871 1004756 BlockReason State:non-idle state 'Running'
0:00:05.324871 1004756 EFile
   v4linuxlogin1.cac.cornell.edu:/home/gfs01/ajd27/dev/CAC/helmke/testingdir/
0:00:05.324871 1004756 OFile
   v4linuxlogin1.cac.cornell.edu:/home/gfs01/ajd27/dev/CAC/helmke/testingdir/
0:00:05.324871 1004756 RMStdErr
   v4linuxlogin1.cac.cornell.edu:/home/gfs01/ajd27/dev/CAC/helmke/testingdir/
0:00:05.324871 1004756 RMStdOut
   v4linuxlogin1.cac.cornell.edu:/home/gfs01/ajd27/dev/CAC/helmke/testingdir/
0:00:05.324871 1004756 StartCount 1
0:00:05.324871 1004756 StartTime 1246391475

Later, it completes. It gets a CompletionTime, and the State becomes 'Completed'.

0:01:06.122374 1004756 EState Running -> Completed
0:01:06.122374 1004756 QueueStatus active -> blocked
0:01:06.122374 1004756 ReqProcs 8 -> 1
0:01:06.122374 1004756 StartPriority 1 -> 0
0:01:06.122374 1004756 State Running -> Completed
0:01:06.122374 1004756 BlockReason no more value
0:01:06.122374 1004756 Flags no more value
0:01:06.122374 1004756 PAL no more value
0:01:06.122374 1004756 StartCount no more value
0:01:06.122374 1004756 SubmitHost no more value
0:01:06.122374 1004756 CompletionTime 1246391537

The scheduler will remember job information, and report it to you through checkjob, for about 10 minutes after the job completes.
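
While the scheduler still remembers the job, its timestamps can be used to work out how long the job waited and ran. The sketch below assumes that SubmissionTime, StartTime, and CompletionTime are Unix epoch seconds delivered as strings, which is what the values in the listings look like; job_timings is an illustrative name.

```python
def job_timings(props):
    """Return (wait_seconds, run_seconds) from a job's properties.

    Either value is None when a timestamp needed to compute it is missing.
    """
    def as_int(name):
        value = props.get(name)
        if value is None:
            return None
        return int(value)

    submitted = as_int('SubmissionTime')
    started = as_int('StartTime')
    completed = as_int('CompletionTime')
    wait = None
    run = None
    if submitted is not None and started is not None:
        wait = started - submitted
    if started is not None and completed is not None:
        run = completed - started
    return wait, run

# Values taken from the Linux job listings above.
print(job_timings({'SubmissionTime': '1246391471',
                   'StartTime': '1246391475',
                   'CompletionTime': '1246391537'}))  # prints (4, 62)
```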

What if a job is sitting in the queue? Its state remains Idle, but its queue status changes:

QueueStatus Blocked -> Eligible
StartPriority 0 -> 15

and the scheduler counts the time it waits:

Bypass 6 -> 9
EEDuration 156 -> 188

until it can finally run.

If the scheduler were to reject a job for any restartable reason, the job would be deferred a few times and then go into batch hold, which is a signal to the operators to take a look at it:

BlockReason="Hold:job hold active - Batch"
EState="Idle"
Hold="Batch"
QueueStatus="Blocked"
State="Idle"

In this case, checkjob will also show messages about why it is blocked.

Were you to cancel a job from the waiting queue, you would see:

0:09:54.221141 1004867 EState Idle -> Removed
0:09:54.221141 1004867 StartPriority 23 -> 0
0:09:54.221141 1004867 State Idle -> Removed
0:09:54.221141 1004867 Bypass no more value
0:09:54.221141 1004867 Flags no more value
0:09:54.221141 1004867 PAL no more value
0:09:54.221141 1004867 SubmitHost no more value
0:09:54.221141 1004867 CompletionCode CNCLD
0:09:54.221141 1004867 CompletionTime 1246648224