Difference between revisions of "Running Jobs on the ASTRA cluster"

From CAC Documentation wiki

Revision as of 11:26, 30 September 2015

After you have familiarized yourself with the 'Getting Started':

Maui Scheduler and Job submission/monitoring commands

Jobs are scheduled by the Maui scheduler with the Torque resource manager. We suggest you use a job submission batch file utilizing PBS Directives ('Options' section).

Common Maui Commands

If you have any experience with PBS/Torque or SGE, the Maui commands may look familiar. The most used are:

qsub - Job submission (jobid will be displayed for the job submitted)

  • $ qsub jobscript.sh

showq - Display queue information.

  • $ showq (dump everything)
  • $ showq -r (show running jobs)
  • $ showq -u foo42 (shows foo42's jobs)

checkjob - Display job information. (You can only checkjob your own jobs.)

  • $ checkjob -A jobid (get dense key-value pair information on the job)
  • $ checkjob -v jobid (get verbose information on the job)

canceljob - Cancel Job. (You can only cancel your own jobs.)

  • $ canceljob jobid
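
These commands are often chained: submit a job, then pass the returned job ID to checkjob or canceljob. The following is a minimal sketch, assuming qsub prints the ID in Torque's usual number.server form. The helper name job_number and the server name astra-head are hypothetical, used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: strip the server suffix from a Torque job ID,
# e.g. "12345.astra-head" -> "12345". The server name is made up.
job_number() {
    local full_id="$1"
    printf '%s\n' "${full_id%%.*}"
}

# Typical use on the cluster (shown as comments, not run here):
#   full_id=$(qsub jobscript.sh)
#   checkjob -v "$(job_number "$full_id")"
```

Keeping the ID in a shell variable avoids retyping it when you follow a submission with checkjob or canceljob.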

Setting Up your Job Submission Batch File

Jobs can be submitted directly on the command line with the qsub command. However, we suggest running your jobs from a batch script. PBS Directives are qsub command-line arguments placed at the top of the batch script, each prefixed with '#PBS' (no space between '#' and 'PBS'). Reference: PBS Directives

The following example script requests 1 node and a maximum run time of 5 minutes, 30 seconds. It joins output and error into the same file, gives the job the descriptive name 'defaulttest', and uses the 'default' queue:

#!/bin/bash
#PBS -l walltime=00:05:30,nodes=1
#PBS -j oe
#PBS -N defaulttest
#PBS -q default

# Turn on echo of shell commands
set -x
# Jobs normally start in the HOME directory; cd to where you submitted.
cd "$PBS_O_WORKDIR"
# Copy the binary and data files to a local directory on the node the job is executing on
cp "$HOME/binaryfile" "$TMPDIR/binaryfile"
cp "$HOME/values" "$TMPDIR/values"
cd "$TMPDIR"
# Run your executable from the local disk on the node the job was placed on,
# sending stdout and stderr to the same file
./binaryfile >&binary.stdout
# Copy output files to your output folder
cp -f "$TMPDIR/binary.stdout" "$HOME/outputdir"
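
The stage-in, run, stage-out pattern in the script above can be wrapped in a function so the scratch copy is always cleaned up. This is only a sketch under stated assumptions: run_staged is a hypothetical helper, not part of Torque or Maui, and it falls back to /tmp when TMPDIR is not set (for example, when testing outside a job).

```shell
#!/bin/bash
# Sketch of the stage-in / run / stage-out pattern.
# run_staged is a hypothetical helper; it uses $TMPDIR when set, else /tmp.
run_staged() {
    local src_dir="$1" program="$2" out_dir="$3"
    local work_dir status
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1

    # Stage in: copy the program to node-local disk.
    cp "$src_dir/$program" "$work_dir/" || { rm -rf "$work_dir"; return 1; }

    # Run from local disk, capturing stdout and stderr in one file.
    ( cd "$work_dir" && "./$program" > "$program.stdout" 2>&1 )
    status=$?

    # Stage out: copy results back, then remove the scratch directory.
    mkdir -p "$out_dir" && cp -f "$work_dir/$program.stdout" "$out_dir/"
    rm -rf "$work_dir"
    return "$status"
}
```

Inside a job script this might be called as run_staged "$HOME" binaryfile "$HOME/outputdir"; only the two copies touch the NFS-mounted home directory.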

A job with heavy I/O (use /tmp as noted in the above example)

  • Use /tmp to avoid heavy I/O over NFS to your home directory!
  • Ignoring this message could bring down the ASTRA CLUSTER HEAD NODE!

Running Many Copies of a Serial Job

In order to run 30 separate instances of the same program, use the scheduler's task array feature, through the "-t" option. The "nodes" parameter here refers to a core.

#!/bin/sh
# Note: the option below is -l (lowercase L)
#PBS -l nodes=1,walltime=10:00
#PBS -t 1-30
#PBS -N test
#PBS -j oe
#PBS -S /bin/bash

set -x
cd "$PBS_O_WORKDIR"
echo Run my job.

When you start jobs this way, separate jobs will pile one-per-core onto nodes like a box of hamsters.
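
Each instance usually needs to know which copy it is. Torque exposes the array index to the job in the PBS_ARRAYID environment variable. A minimal sketch follows; the input.N.dat naming scheme and the helper name input_for_task are hypothetical.

```shell
#!/bin/bash
# Hypothetical naming scheme: task N reads input.N.dat.
input_for_task() {
    printf 'input.%s.dat\n' "$1"
}

# Torque sets PBS_ARRAYID inside array jobs; default to 1 when the
# script is run by hand outside the scheduler.
task_id="${PBS_ARRAYID:-1}"
echo "Task $task_id would process $(input_for_task "$task_id")"
```

Deriving file names from the index lets one script serve every instance without editing.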

Request job to run on 5 nodes, with 4 processors per node

#PBS -l walltime=00:05:30,nodes=5:ppn=4
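
Inside the job, Torque lists the allocated processors in the file named by PBS_NODEFILE, one line per processor, so the request above should yield 20 lines spread over 5 distinct hosts. A sketch for checking this from within a job script; the helper names are hypothetical.

```shell
#!/bin/bash
# Count processor slots (lines) in a node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Count distinct hosts in a node file.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job (shown as comments, not run here):
#   echo "slots: $(count_slots "$PBS_NODEFILE")"
#   echo "hosts: $(count_hosts "$PBS_NODEFILE")"
```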

Request Memory resources

To dedicate 2GB of memory for your job, add the 'mem=xx' to the '-l' option you should already have in your batch script file:

#PBS -l walltime=00:05:30,nodes=1,mem=2gb


Use 'checkjob -v [jobid]' to display your resources:
Total Tasks: 1
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M

To run 2 tasks with 2GB dedicated to each processor, the memory request must cover both tasks (2 x 2GB = 4GB), so your PBS directive would be:

#PBS -l walltime=00:05:30,nodes=1:ppn=2,mem=4gb

checkjob -v [jobid]:
Total Tasks: 2
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M
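
The arithmetic behind the two examples above (the 'mem=' value covers all tasks combined) can be sketched as a small helper. The function name mem_request is hypothetical.

```shell
#!/bin/bash
# Total 'mem=' value for a job, given the task count and GB per task.
# Per the examples above, the request covers all tasks combined.
mem_request() {
    local tasks="$1" gb_per_task="$2"
    printf 'mem=%dgb\n' "$(( tasks * gb_per_task ))"
}

mem_request 2 2   # 2 tasks x 2GB each -> mem=4gb
```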

Running a job on specific nodes

#!/bin/bash

#PBS -l walltime=00:03:00,host=compute-1-1+compute-1-2+compute-1-3,nodes=3
#PBS -N testnodes
#PBS -j oe
#PBS -q default

set -x
cd "$PBS_O_WORKDIR"
echo "Run my job."
job.sh
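
For longer node lists, the 'host=' string in the directive above can be generated rather than typed. A sketch, assuming the same '+'-separated syntax shown in the example; host_spec is a hypothetical helper.

```shell
#!/bin/bash
# Build the 'host=...,nodes=N' resource string from a list of node names.
host_spec() {
    local count=$#
    local IFS='+'          # "$*" joins arguments with the first IFS character
    printf 'host=%s,nodes=%d\n' "$*" "$count"
}

host_spec compute-1-1 compute-1-2 compute-1-3
```

The result can be pasted after 'walltime=' in the '-l' directive, which keeps the node count in step with the list.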

Running an interactive job

  • Be sure to have logged in with X11 forwarding enabled (if using ssh, use ssh -X or ssh -Y; if using PuTTY, be sure to check the X11 forwarding box)
  • You do not have to specify a specific node as below

Running an interactive job on a specific node

To run on a specific node, use the '-l' option with 'host='; to run interactively, use the '-I' (capital I) option. The example below requests the node for 5 days (120 hours and 10 minutes):

#!/bin/bash
#PBS -l host=compute-1-5,walltime=120:10:00
#PBS -I
#PBS -N test
#PBS -j oe
#PBS -q default

set -x
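
Walltime is written as hours:minutes:seconds, so multi-day requests have to be converted to hours. A small sketch of that conversion; walltime_for is a hypothetical helper.

```shell
#!/bin/bash
# Convert days (plus optional hours and minutes) to an HH:MM:SS walltime.
walltime_for() {
    local days="$1" hours="${2:-0}" minutes="${3:-0}"
    printf '%d:%02d:00\n' "$(( days * 24 + hours ))" "$minutes"
}

walltime_for 5 0 10   # 5 days and 10 minutes -> 120:10:00
```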

Running multiple copies of your job

  • This method is recommended over submitting jobs from a for/while loop within a bash script. We have seen such loops "confuse the scheduler".

The following example creates two jobs at once.

#!/bin/bash
#PBS -q default
#PBS -l walltime=00:05:00,nodes=1
#PBS -t 1-2
#PBS -j oe
#PBS -N intro
#PBS -V
set -x
cd "$PBS_O_WORKDIR"
echo "Run my job."
jobscript.sh

Further examples:

The scheduler on ASTRA is similar to the CAC v4 scheduler, with a few distinct differences:

  • There is no need to specify an account number in batch scripts.
  • The command to submit jobs is qsub, not nsub.
  • The v4 scheduler allows only one job per node.

Examples from the CAC v4 scheduler


Astra Documentation Main Page