Running Jobs on the KINGLAB cluster

From CAC Documentation wiki
Revision as of 11:36, 4 January 2016 by Rda1

After you have familiarized yourself with the 'Getting Started':

Maui Scheduler and Job submission/monitoring commands

Jobs are scheduled by the Maui scheduler with the Torque resource manager. We suggest you use a job submission batch file utilizing PBS Directives ('Options' section).

Common Maui Commands

If you have any experience with PBS/Torque or SGE, the Maui commands may be recognizable. Most used:

qsub - Job submission (the jobid will be displayed for the job submitted)

  • $ qsub

showq - Display queue information.

  • $ showq (dump everything)
  • $ showq -r (show running jobs)
  • $ showq -u foo42 (shows foo42's jobs)

checkjob - Display job information. (You can only checkjob your own jobs.)

  • $ checkjob -A jobid (get dense key-value pair information on the job)
  • $ checkjob -v jobid (get verbose information on the job)

canceljob - Cancel Job. (You can only cancel your own jobs.)

  • $ canceljob jobid
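
These commands are often chained by capturing the jobid that qsub prints. The following is a minimal sketch, assuming the identifier has the form <number>.<server> (typical for Torque, but check what your qsub actually prints). The helper name short_jobid, the script name myjob.sh and the sample identifier are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: strip the server suffix from a job identifier.
short_jobid() {
    # ${1%%.*} removes everything from the first dot onwards.
    printf '%s\n' "${1%%.*}"
}

# On the cluster you would capture the real identifier:
#   full_id=$(qsub myjob.sh)      # myjob.sh is a placeholder script name
full_id="4242.headnode.example"   # sample value so the sketch runs anywhere

jobid=$(short_jobid "$full_id")
echo "Submitted job $jobid"

# Follow-up commands from the list above:
#   checkjob -v "$jobid"
#   canceljob "$jobid"
```
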
Setting Up your Job Submission Batch File

Commands can be run on the command line with the qsub command. However, we suggest running your jobs from a batch script. PBS Directives are command line arguments inserted at the top of the batch script, each directive prefixed with '#PBS' (no space between '#' and 'PBS'). Reference: PBS Directives

The following script example requests: 1 node; a maximum job time of 5 minutes, 30 seconds; output and error joined into the same file; the descriptive job name 'batchtest'; and the 'default' queue:

#PBS -l walltime=00:05:30,nodes=1
#PBS -j oe
#PBS -N batchtest
#PBS -q default

# Turn on echo of shell commands
set -x
# Jobs normally start in your HOME directory.
# Copy the binary and data files to a local directory on the node the job is executing on
cp $HOME/binaryfile $TMPDIR/binaryfile
cp $HOME/values $TMPDIR/values
# Run your executable from the local disk on the node the job was placed on
cd $TMPDIR
./binaryfile >&binary.stdout
# Copy output files to your output folder
cp -f $TMPDIR/binary.stdout $HOME/outputdir
A job with heavy I/O (use /tmp, as in the above example)
  • Use /tmp to avoid heavy I/O over NFS to your home directory!
  • Ignoring this message could bring down the KINGLAB CLUSTER HEAD NODE!
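
The stage-in, compute, stage-out pattern can be tried off the cluster with the sketch below. It is illustrative only: sort stands in for your executable, the printf line stands in for copying an input file from your home directory, and both directories are temporary placeholders created with mktemp.

```shell
#!/bin/bash
# Sketch of the stage-in / compute / stage-out pattern.
# Outside a batch job TMPDIR may be unset, so fall back to /tmp.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")

# Placeholder for your results folder (a temp dir so the sketch runs anywhere).
outdir=$(mktemp -d "${TMPDIR:-/tmp}/out.XXXXXX")

# Stage in: one copy over NFS, then all further I/O is local.
printf '3\n1\n2\n' > "$scratch/values"   # stands in for: cp "$HOME/values" "$scratch/"

# Compute on local disk; sort stands in for your executable.
( cd "$scratch" && sort -n values > binary.stdout )

# Stage out: one copy back at the end, then clean up the local disk.
cp -f "$scratch/binary.stdout" "$outdir/"
rm -rf "$scratch"

echo "Results are in $outdir"
```

The point of the design is that the home directory is touched twice (once in, once out) no matter how much the program reads and writes while it runs.
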
Select your default MPI

There are several versions of MPI on the KINGLAB cluster. Use the following commands to modify your default mpirun.

  • which mpirun -> shows the current mpirun in your working environment
  • module avail -> shows all software packages available as modules on your system
  • module load <software-name> -> loads (adds) the <software-name> package into your working environment
  • module list -> shows all software loaded into your current working environment
  • module unload <software-name> -> unloads (removes) the <software-name> package from your working environment
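
Before submitting an MPI job it can help to confirm which mpirun the environment resolves to. A small sketch using only the shell built-in command -v; the function name report_mpirun is made up for illustration.

```shell
#!/bin/bash
# Report which mpirun (if any) the current environment would use.
report_mpirun() {
    local found
    if found=$(command -v mpirun); then
        echo "mpirun found: $found"
    else
        echo "no mpirun on PATH; try 'module avail' and 'module load <software-name>'"
    fi
}

report_mpirun
```

Running it before and after a module load shows whether the default mpirun actually changed.
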
Running an MPI Job using 12 Tasks Per Node

Because the nodes compute-1-30 & compute-1-[36-39] have 12 physical cores, you may want to limit jobs to 12 tasks per node. The node file lists each node 1 time, so make a copy with each node listed 12 times, and hand that version to MPI.

#PBS -l walltime=00:03:00,host=compute-1-30.kinglab+compute-1-36.kinglab+compute-1-37.kinglab+compute-1-38.kinglab+compute-1-39.kinglab,nodes=5

#PBS -N test
#PBS -j oe
#PBS -S /bin/bash

set -x

# Construct a copy of the hostfile with only 12 entries per node.
# MPI can use this to run 12 tasks on each node.
uniq "$PBS_NODEFILE"|awk '{for(i=0;i<12;i+=1) print}'>nodefile.12way

# Run 12-way on the 5 requested nodes (60 MPI tasks in total)
mpiexec --hostfile nodefile.12way ring -v
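
The node file expansion can be checked off the cluster before it is trusted in a job. The sketch below feeds the same uniq and awk pipeline a stand-in node file; the host names compute-a and compute-b are placeholders, and the temporary files take the place of $PBS_NODEFILE and nodefile.12way.

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE so the pipeline can be tried off the cluster.
# The host names are placeholders.
fake_nodefile=$(mktemp)
printf '%s\n' compute-a compute-b > "$fake_nodefile"

tasks_per_node=12
expanded=$(mktemp)

# Same idea as the job script: one line per node in, N lines per node out.
uniq "$fake_nodefile" \
    | awk -v n="$tasks_per_node" '{ for (i = 0; i < n; i++) print }' \
    > "$expanded"

# Sanity checks: total slots, then slots per host.
wc -l < "$expanded"
sort "$expanded" | uniq -c
```

Passing the count in with awk -v makes it easy to reuse the same line for nodes with a different number of cores.
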
Running Many Copies of a Serial Job

In order to run 30 separate instances of the same program, use the scheduler's task array feature, through the "-t" option. The "nodes" parameter here refers to a core.

# Note: the option below is -l (lowercase L)
#PBS -l nodes=1,walltime=10:00
#PBS -t 1-30
#PBS -N test
#PBS -j oe
#PBS -S /bin/bash

set -x
echo Run my job.

When you start jobs this way, separate jobs will pile one-per-core onto nodes like a box of hamsters.
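
Each instance of a task array usually needs to work on something different. Torque exposes the instance's index in the PBS_ARRAYID environment variable. A minimal sketch of using it to pick an input file follows; the input.N.dat naming scheme and the helper name input_for_task are assumptions for illustration, not part of the cluster setup.

```shell
#!/bin/bash
# Hypothetical helper: map a task index to an input file name.
input_for_task() {
    printf 'input.%s.dat\n' "$1"
}

# Torque sets PBS_ARRAYID inside each array instance; default to 1 so the
# sketch also runs outside the scheduler.
task_id="${PBS_ARRAYID:-1}"
infile=$(input_for_task "$task_id")
echo "Instance $task_id would process $infile"
```
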

Request a job to run on 2 processors on each of 5 nodes
#PBS -l walltime=00:05:30,nodes=5:ppn=2
Request Memory resources

To dedicate 2GB of memory for your job, add the 'mem=xx' to the '-l' option you should already have in your batch script file:

#PBS -l walltime=00:05:30,nodes=1,mem=2gb

Use 'checkjob -v [jobid]' to display your resources:
Total Tasks: 2
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M

To run 2 tasks with 2GB dedicated to each processor, request 4GB in total with this PBS directive:

#PBS -l walltime=00:05:30,nodes=1:ppn=2,mem=4gb

checkjob -v [jobid]:
Total Tasks: 2
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M
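
The arithmetic behind the second request, where the total is the number of tasks times the memory per task, can be sketched as a small helper. The function name resource_request is made up for illustration, and it assumes, as the checkjob output above suggests, that 'mem' is divided evenly across the tasks.

```shell
#!/bin/bash
# Hypothetical helper: build a '-l' resource string for one node.
# $1 = number of tasks (ppn), $2 = memory per task in GB.
resource_request() {
    local tasks=$1 per_task_gb=$2
    local total_gb=$(( tasks * per_task_gb ))
    printf 'nodes=1:ppn=%d,mem=%dgb\n' "$tasks" "$total_gb"
}

resource_request 2 2   # prints nodes=1:ppn=2,mem=4gb
```
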
Running a job on specific nodes

#PBS -l walltime=00:03:00,host=compute-1-36.kinglab+compute-1-37.kinglab+compute-1-38.kinglab,nodes=3
#PBS -N testnodes
#PBS -j oe
#PBS -q default

set -x
echo "Run my job."
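
The 'host=' value is a list of node names joined with '+', and 'nodes=' must match the number of names. A small sketch that builds both from a list of names; the function name host_request is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: build the host= and nodes= part of a '-l' request.
host_request() {
    local IFS='+'
    # "$*" joins the arguments using the first character of IFS;
    # $# is the number of node names given.
    printf 'host=%s,nodes=%d\n' "$*" "$#"
}

host_request compute-1-36.kinglab compute-1-37.kinglab
# prints host=compute-1-36.kinglab+compute-1-37.kinglab,nodes=2
```
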
Running an interactive job
  • Be sure to log in with X11 forwarding enabled (if using ssh, use ssh -X or ssh -Y; if using PuTTY, be sure to check the X11 forwarding box)
  • You do not have to specify a specific node as below
    • Asking for -q inter will always put you on the Fast/Interactive nodes
Running an interactive job on a specific node

To run on a specific node, use the '-l' option with 'host='; to run interactively, use the '-I' (capital I) option. The example below requests the node for 5 days (walltime 120:10:00, i.e. 120 hours and 10 minutes):

#PBS -l host=compute-1-30.kinglab,walltime=120:10:00
#PBS -N test
#PBS -j oe
#PBS -q default

set -x
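
Walltime is written as hours:minutes:seconds, so a request expressed in days has to be converted to hours first. A minimal sketch of that conversion; the function name walltime_for_days is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert whole days to an HH:MM:SS walltime string.
walltime_for_days() {
    printf '%d:00:00\n' "$(( $1 * 24 ))"
}

walltime_for_days 5   # prints 120:00:00
```
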
Running multiple copies of your job
  • This method is recommended over using a for/while loop within a bash script. Using a for/while loop in a batch script will "confuse the scheduler".

The following example creates two jobs at once.

#PBS -q protected
#PBS -l walltime=00:05:00,nodes=1
#PBS -t 1-2
#PBS -j oe
#PBS -N intro
set -x
echo "Run my job."

Kinglab Documentation Main Page