Batch Scheduler Maui


Maui Commands

Maui Scheduler and Job submission/monitoring commands

The v4 cluster and several private clusters are scheduled by the Maui scheduler with the Torque resource manager.

How the v4 Cluster is Scheduled

The v4 cluster is configured with three queues you can submit to: v4, v4-64g, and v4dev. The showqlease command shows how many nodes are in each. The scheduler implements backfill without fairshare or limits on the number of jobs that can be submitted. The result is roughly first-come, first-served, except that backfill can let smaller jobs run ahead of multi-node jobs when doing so will not delay the larger jobs' start times.

The v4 cluster also schedules whole nodes, so each job has all of a node's cores available to it. This is not true of several private clusters.

In general, no project id has priority over any other on the queues. Within each queue, some groups have purchased exclusive access to certain nodes; we call these leased nodes. If a job's project id is covered by a lease, that job will preferentially run on the leased nodes but can spill over into the other, non-leased nodes (called "standard" by the showqlease command). Such jobs still do not have higher priority; they simply have access to leased nodes that other jobs do not.

Commands

nsub - Job submission. This command submits a job to the scheduler; if the submission succeeds, a unique id identifying that job is returned. nsub is a CAC custom wrapper around the Torque qsub command that you must use to submit jobs. nsub understands all qsub parameters but adds checks to ensure that you have specified a valid account, node count, and walltime. Job properties are best specified with #PBS directives in the batch script, although they may also be given on the command line. Windows and Linux handle nsub directives differently.

$ nsub jobscript.sh
 (submits jobscript.sh, which contains #PBS directives; see the examples below)


See the Linux Batch Examples for information on submitting jobs.
showq - Display queue information. Allows users to see what jobs are in the queue and to filter them using the options passed on the command line.

$ showq (display everything)
$ showq -r (show running jobs)
$ showq -u foo42 (shows foo42's jobs)


checkjob - Display job information. Users are only able to execute this command on their own jobs. 'checkjob' provides information about the job id and its state.

$ checkjob -A 42  (get dense key-value pair information on job 42)
$ checkjob -v 42 (get verbose information on job 42 including information as to why a job may not be running)


canceljob - Cancel Job. This command can be used (amongst other things) to cancel a running job. You can only cancel your own jobs.

$ canceljob 42 (cancel job 42, if you own it)


showbf - Show available resources. This command can be used to help understand what resources are available on the system for immediate use. This can help you to understand when your jobs will run. You simply specify the desired resources for your job and a list of available resources will be returned to you. You must use -A.

$ showbf -u foo42 -A  (displays all available resources on all available resource managers and queues)


showqlease - CAC-provided command to show summary information on the availability of queues and leases.
showbalance - CAC-provided command to show remaining compute time balance.
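Both are normally run without arguments, for example:

$ showqlease (show node availability for each queue and lease)
$ showbalance (show your remaining compute-time balance)
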
Full listing of Maui Commands

Linux Batch Examples

Scheduler Examples

This page explains how to create a batch script so that the scheduler can run your jobs on the v4 system, and it covers the basics of submitting various types of jobs. Jobs are submitted using the nsub command; for the full listing of options that can be passed to nsub, consult the command reference page.

HPC Cluster Hardware List

Things to do before you run any linux jobs on the nodes

  1. Connect to the v4 linux login node (linuxlogin.cac.cornell.edu or linuxlogin2.cac.cornell.edu) using ssh.
  2. Before you submit a job, you must have valid ssh keys. ROCKS creates the key files for you automatically when you first log in. You will be prompted for a passphrase; if you choose not to enter one, just hit <enter> twice. If you suspect a problem with your keys and want to recreate them, rename your ~/.ssh directory (in case you have other keys you wish to save), log off, and log back in. Your key files will be recreated automatically.
  3. You must have a valid account number that you are allowed to charge against. All jobs must be submitted with the account number you wish to charge against explicitly provided, so have that number handy.
  4. Get your batch script ready. This is a script that will run on the compute node; an example that executes a fictional binary in serial is shown below.
     #!/bin/sh 
     # copy the binary and data files to a local directory
     cp $HOME/test/karpc $TMPDIR/karpc
     cp $HOME/test/values $TMPDIR/values
     
     cd $TMPDIR
     
     # run the executable from local disk
     ./karpc >& karpc.stdout
     
     # Copy output files to your output folder
     cp -f $TMPDIR/karpc.stdout $HOME/test
  5. At this point, you should be ready to submit your job. You'll need to provide an account number, walltime estimate, and the number of nodes you're requesting as you submit the job; you can also provide any other nsub arguments that you like. A sketch of the full command line appears below.
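     As a sketch of the full submission command (the account number, queue, and script name are placeholders; the same options can instead be given as #PBS directives, as shown in the examples that follow):

     $ nsub -A abc42_00001 -l nodes=1,walltime=10:00 -q v4 batch_test.sh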

Using a temporary directory

  • CLUSTER POLICY: A program is run at the beginning of every job that clears /tmp. This ensures that each user has all of /tmp for the duration of their job.
  1. Most jobs will want to create a temporary directory on the compute node's local disk to read and write data from. It is faster to do local file I/O and copy complete data files to and from $HOME at the beginning and the end of the job, rather than do I/O over the network ($HOME is network mounted on the compute nodes). But please see the section on /v4scratch for another alternative. In the past, this meant that you did something like this in the beginning of the batch script:
    $ mkdir /tmp/$USER
    Now Torque creates a uniquely named directory (/tmp/$PBS_JOBID) when a job starts and stores the path of this directory in the $TMPDIR environment variable. This directory is cleaned up when the job exits.
  2. To use this feature, simply change references to /tmp/$USER (or whatever scheme you're using to create a temporary directory) to $TMPDIR. The directory is created and cleaned up for you, so you can remove all mkdir /tmp/$USER and rmdir /tmp/$USER lines from your script as well (see the sketch below).
  3. All of the examples on this page use $TMPDIR, so please check the specific serial or parallel sections if you're unsure what to do.
  4. You may also find the -wdir option to mpiexec helpful; it sets the working directory in which the launched processes start.
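
A minimal before-and-after sketch of this change (the file and program names are placeholders):

# Old pattern: manage your own scratch directory by hand
#   mkdir /tmp/$USER
#   cp $HOME/test/values /tmp/$USER/
#   ...
#   rm -rf /tmp/$USER

# New pattern: Torque creates and removes $TMPDIR for you
cp $HOME/test/values $TMPDIR/
cd $TMPDIR
# Or have mpiexec start its processes there directly:
# mpiexec -wdir $TMPDIR -np $np ./myapp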


Using /v4scratch

If you submit multiple batch jobs that all copy data from $HOME to $TMPDIR, the combined load can slow your jobs tremendously and even crash the $HOME filesystem, which holds everyone's home directories. Having multiple jobs read or write simultaneously can flood the NFS controller on $HOME.

# This is bad, mkay?
for f in *hugefiles.dat
do
  cp $f $TMPDIR
done

We worked out a solution, specifically for people doing genomics, but it helps many others. There is a second, parallel filesystem called /v4scratch, available both from login nodes and from cluster nodes. It handles load better, is a little faster, is large (16 TB raw, roughly 13 TB usable), and merely slows down under heavy load rather than crashing. The big drawback is that there is no backup. Here is how to use it:

  1. From the login node, interactively use the cp command to copy your files from $HOME to /v4scratch, or alternatively transfer files from the get-go directly to /v4scratch with scp or sftp. Note that while this cp command may be slow for large data, it is not charged time.
  2. Once the files are transferred, start your batch jobs, telling them to read initial data from /v4scratch, work in $TMPDIR locally on the node, and then write back to /v4scratch. They may run faster if you stagger when you start the batch jobs.
  3. Use a single cp command from the login node again to copy back to home.
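
A minimal sketch of this workflow (the directory layout under /v4scratch, the file names, and the program name are all placeholders):

# 1. On the login node, stage the input data:
mkdir -p /v4scratch/$USER
cp $HOME/project/big_input.dat /v4scratch/$USER/

# 2. In each batch script, read from /v4scratch and work in $TMPDIR:
cp /v4scratch/$USER/big_input.dat $TMPDIR/
cd $TMPDIR
./analyze big_input.dat > results_$PBS_JOBID.out
cp results_$PBS_JOBID.out /v4scratch/$USER/

# 3. Back on the login node, once the jobs finish, copy everything home:
cp /v4scratch/$USER/results_*.out $HOME/project/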

That's it on how to use it. Some more details on the /v4scratch filesystem:

  • 13 TB of space are available via two fileservers (Dell PE2950s) running the gluster parallel file system.
  • A fiber channel storage unit (16x1TB) has been divided into two volumes. Each file server is connected to one volume via one of the two fiber channel ports on the storage unit.
  • Data rates are intermediate between home directories (/home/....) and the local scratch space on the v4 linux nodes (/tmp).

This parallel gluster filesystem is a small example of a nice solution for fast parallel file storage. Given interest we may expand its capabilities.

Running a serial job

  1. Submitting a serial job is easy: you just need to provide the account that the node hours should be charged against, the number of nodes you are requesting (1 for serial jobs), and the walltime estimate of your job. If your job exceeds the walltime, it will be terminated. You can supply these arguments on the command line using nsub as shown in Maui Commands, but a simpler and generally safer way is to add them to the batch script as PBS directives. PBS directives are simply command line arguments inserted at the top of the batch script and prefixed with #PBS (no spaces). The account you wish to use is specified with -A, and the nodes and walltime are specified with the -l flag. More information on these flags (as well as the other flags that may be supplied) can be found in the job submission documentation. We'll use the example batch script from the prerequisites section. By putting these parameters in the batch script, you keep a record of what you did and save the settings for the next time you run the job. Thus, we recommend the batch-script approach over command line parameters in nearly all cases.
     #!/bin/sh  
     
     #NOTE: The -l, -A and -q flags are required, others are optional
     
     #1 node for 10 minutes
     #PBS -l nodes=1,walltime=10:00
     #your account number
     #PBS -A abc42_00001
     #Join stdout and stderr
     #PBS -j oe
     #Name the job something useful
     #PBS -N wickedCoolJob
     #PBS -q v4dev
    
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
    
     # copy the binary and data files to a local directory
     cp $HOME/test/karpc $TMPDIR/karpc
     cp $HOME/test/values $TMPDIR/values
     
     cd $TMPDIR
     
     # run the executable from local disk
     ./karpc >& karpc.stdout
     
     # Copy output files to your output folder	
     cp -f $TMPDIR/karpc.stdout $HOME/test
  2. Once the script is ready, save it to something reasonable (like batch_test.sh) and submit the job using nsub:
    $ nsub batch_test.sh

Running a parallel MPI job

The default version of MPI is Intel MPI.

  1. Submitting a parallel job is not significantly different than submitting a serial job. First, you'll need to compile your MPI code. In this example, we'll use a simple 'HelloWorld' application. You can follow along from the MPI Hello World source code. Compile this (or any MPI program). We'll call ours hello.out:
    $ mpicc -o hello.out HelloWorld.c
  2. Write a batch script that calls mpiexec to execute the compiled MPI code.
     #!/bin/sh
     #PBS -A AcctNumber
     #PBS -l walltime=02:00,nodes=4
     #PBS -N mpiTest
     #PBS -j oe
     #PBS -q v4
     
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     
     #Count the number of nodes
     np=$(wc -l < $PBS_NODEFILE)
     
     #boot mpi on the nodes
     mpdboot -n $np --verbose -r /usr/bin/ssh -f $PBS_NODEFILE
     
     
     #now execute exactly one MPI process per node
     mpiexec -ppn 1 -np $np $HOME/v4Test/hello.out
     mpdallexit
    There are two things worth noting in the above script. First, the "-j oe" directive joins the output and error streams into the same output file. Second, the "ppn" part of a "-l nodes=4:ppn=1" directive usually determines how many times each machine name appears in $PBS_NODEFILE... HOWEVER, be aware that at CAC this specification is ignored! In effect, it is always reset to 1, so that each machine name appears exactly once in the machine file. This is necessary to guarantee that you are granted exclusive access to the nodes in your batch job, and it helps the CAC accounting system to function properly.
    The "ppn" batch directive should be distinguished from a second use of "ppn", which is as an option to mpiexec (actually -ppn is an alias for -perhost). The -ppn or -perhost flag determines how many MPI processes are started up on each machine listed in the $PBS_HOSTFILE. At CAC, the standard $PATH takes you to the Intel mpiexec, and by default the Intel mpiexec assumes the value "-perhost 8" on v4. However, since this default value is implementation-dependent, we recommend that you always set your -ppn or -perhost flag explicitly every time you call mpiexec, as shown above. That way you ensure that you'll get exactly what you expect in case you ever use a different mpiexec. See below for more details on PPN.
  3. Once the script is ready, save it to something reasonable (e.g., mpi.sh) and submit the job using nsub:
    $ nsub mpi.sh

Moving data to nodes in a parallel job

  1. This is essentially the same as the example above; we just add extra mpiexec commands to move data to and from the nodes. Below is a modified script that copies the same input file to every node in the job.
     #!/bin/sh
     #PBS -A AcctNumber
     #PBS -l walltime=02:00,nodes=4:ppn=1
     #PBS -q v4
     
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     
     #Count the number of nodes
     np=$(wc -l < $PBS_NODEFILE)
     
     #boot mpi on the nodes
     mpdboot -n $np --verbose -r /usr/bin/ssh -f $PBS_NODEFILE
     
     
     mpiexec -ppn 1 -np $np cp $HOME/somedir/file.dat $TMPDIR/file.dat
     mpiexec -ppn 1 -np $np ls $TMPDIR
     
     #now do something that wants that file
     mpiexec -np $np $HOME/app.bin $TMPDIR/file.dat
     
     
     #clean up the mpds
     mpdallexit
  2. mpiexec is basically a distributed ssh, so we can use it to perform basic shell commands on all of the hosts in $PBS_NODEFILE. The -ppn 1 flag to mpiexec ensures that only a single process is started per node, so the copy is performed exactly once on each node in your group. If you have several files to move around, or a specific directory structure to create in the scratch space, you may want to break those commands out into a separate shell script. To do this, put all of your mkdir/cp commands in a separate script (like copy.sh; a sketch appears after this list) and simply mpiexec that script from your batch script.
    $ mpiexec -ppn 1 -np $np $HOME/jobstuff/copy.sh
    Remember that you'll still need to clean up /tmp once you're done.
  3. Finally, once your script is ready, save it to something reasonable (like mpi_data.sh) and submit the job using nsub:
    $ nsub mpi_data.sh
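
As mentioned in step 2, a minimal sketch of what a copy.sh helper might contain (the file names are placeholders); remember to make it executable with chmod +x copy.sh:

#!/bin/sh
# copy.sh - run once per node via "mpiexec -ppn 1 -np $np copy.sh"
mkdir -p $TMPDIR/inputs
cp $HOME/jobstuff/file1.dat $TMPDIR/inputs/
cp $HOME/jobstuff/file2.dat $TMPDIR/inputs/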

Running an interactive job

You can turn any job into an interactive job by adding the PBS directive "-I" (for Interactive). You may run an interactive job on any queue. If the batch file includes more commands, the scheduler will ignore them: it reads the directives, then gives you the node without running the rest of the batch script.

For instance, the following script is exactly the same one that ran as a regular batch job, but the "#PBS -I" line is added:

#!/bin/bash
#PBS -l walltime=00:30:00,nodes=1
#PBS -A your_account
#PBS -j oe
#PBS -N batchtest
#PBS -q v4dev
#PBS -I

# Turn on echo of shell commands
set -x

# Did you not put these in your ~/.profile? Then include them here.
#export PATH=$PATH:/opt/nextgen/bin
#export PERL5LIB=/opt/nextgen/lib/perl5

# The data is located in datadir.
DATADIR=/home/gfs08/jp86/ngw2010/session1/lecture2/
DATA=s_8_1_sequence.txt

echo starting job `date`
# Copy data to a temporary directory.
cp ${DATADIR}/${DATA} ${TMPDIR}
# Move to that directory.
cd ${TMPDIR}
fastx_quality_stats -i ${DATA} -o stat.xls
# Copy the data back.
cp stat.xls ${PBS_O_WORKDIR}/
echo finishing job `date`

For interactive jobs, the scheduler waits until a node is available and opens an ssh shell as soon as one is ready, which may take some time. Use 'showqlease' to check free nodes in each queue:

$ showqlease
[user@linuxlogin]$ nsub batch.sh
Looking for directives in batch.sh
Executing interactive
qsub: waiting for job 1001525.scheduler.v4linux to start
qsub: job 1001525.scheduler.v4linux ready
[user@compute-3-48 farm]$

That last prompt shows that you are now typing in a shell on the main node of your interactive job. Commands, such as mpiexec, should work here just as they would in a batch script. From here, you could run the rest of your script.

[user@compute-3-48 farm]$ cd ~/nextgen
[user@compute-3-48 farm]$ source batch.sh

The source command runs the script in the current shell. Another way to run the script is to tell the operating system that the file is executable. Then the script can be run like any other executable.

$ chmod a+x batch.sh
$ ./batch.sh

Variables Defined During Batch Jobs

When a batch job runs on a node, the scheduler defines some variables that may be useful in job scripts, such as $PBS_JOBID:

ENVIRONMENT=BATCH
HOME=/home/gfs01/ajd27
HOSTNAME=compute-3-48.v4linux
PBS_ENVIRONMENT=PBS_BATCH
PBS_JOBCOOKIE=E71020A373DAB09637C527E120F71C55
PBS_JOBID=1004757.scheduler.v4linux
PBS_JOBNAME=clans2win1
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//1004757.scheduler.v4linux
PBS_NODENUM=0
PBS_O_HOME=/home/gfs01/ajd27
PBS_O_HOST=scheduler.cac.cornell.edu
PBS_O_INITDIR=/home/gfs01/ajd27/dev/CAC/helmke/testingdir
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=ajd27
PBS_O_PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/tools/
PBS_O_QUEUE=v4
PBS_O_SHELL=/bin/tcsh
PBS_O_WORKDIR=/home/gfs01/ajd27/dev/CAC/helmke/testingdir
PBS_QUEUE=v4
PBS_SERVER=scheduler.cac.cornell.edu
PBS_TASKNUM=1
PBS_VNODENUM=0
TMPDIR=/tmp/1004758.scheduler.v4linux

You can find these variables by adding a line to your job script, such as

env | sort > variables.txt

When you run a job under mpiexec, the PBS_O_* variables are still defined, but there are a few more.

HOME=/home/gfs01/ajd27
HOST=compute-3-48.v4linux
HOSTNAME=compute-3-48.v4linux
HOSTTYPE=x86_64-linux
I_MPI_DEVICE=ssm
I_MPI_EXT_MPI_CALLS_2921=0x2aaaaaba33f8
I_MPI_INFO_CACHE1=0,4,2,6,1,5,3,7
I_MPI_INFO_CACHE2=0,2,1,3,0,2,1,3
I_MPI_INFO_CACHE3=0,1,2,3,4,5,6,7
I_MPI_INFO_CACHES=2
I_MPI_INFO_CACHE_SHARE=1,2
I_MPI_INFO_CACHE_SIZE=32768,6291456
I_MPI_INFO_CORE=0,0,2,2,1,1,3,3
I_MPI_INFO_LCPU=8
I_MPI_INFO_PACK=0,1,0,1,0,1,0,1
I_MPI_INFO_SIGN=67190
I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0
I_MPI_PERHOST=allcores
I_MPI_PIN_INFO=7
I_MPI_PIN_MAP=0 0
I_MPI_PIN_MAP_SIZE=8
I_MPI_PIN_UNIT=7
I_MPI_ROOT=/opt/intel/impi/3.1
MPD_JID=1  compute-3-48.v4linux_52334  
MPD_JRANK=7
MPD_JSIZE=8
MPD_NAME=compute-3-48.v4linux
MPD_TVDEBUG=0
PMI_DEBUG=0
PMI_PORT=192.168.49.128:47582
PMI_RANK=7
PMI_SIZE=8
PMI_SPAWNED=0
PMI_TOTALVIEW=0

The most commonly used of these is PMI_RANK, which gives the process's rank within the MPI communicator.
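
For example, here is a hypothetical wrapper script that uses PMI_RANK to give each MPI process its own output file (the script name, binary path, and log file name are placeholders); you would pass the wrapper to mpiexec instead of the binary itself, e.g. mpiexec -ppn 1 -np $np $HOME/jobstuff/rankwrap.sh:

#!/bin/sh
# rankwrap.sh - each MPI process redirects its output to a per-rank log file
# in its working directory before exec'ing the real program
exec $HOME/v4Test/hello.out > output.rank${PMI_RANK}.log 2>&1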

Debugging a Batch Script

Debugging scripts that run in batch can be difficult. The standard methodology is to add echo statements that print progress to log files. By checking the variables listed above, you can also make a script that works either interactively or in batch mode.
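
For example, a line like the following (the message text and log file name are arbitrary) timestamps progress into a log file in the submission directory:

echo "$(date): finished preprocessing on $(hostname)" >> $PBS_O_WORKDIR/progress.log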

What tends to go wrong when working with batch scripts?

  • Tested script in one directory but moved it to another to run real job. - If you have directories hard-coded to your user directory, you may be able to replace them with directories relative to ${PBS_O_WORKDIR}, a variable that remembers what directory you were in when you submitted the batch job.
  • Ran script on v4dev for 30 minutes. Forgot to increase time for v4. - The time a job is allowed to run is typically given near the top of the batch file as a resource listing, e.g. -l walltime=4:00:00,nodes=10:ppn=1. You could leave the walltime long for production jobs and make a short script to submit development jobs. The script below strips the scheduler specifications from a batch file and replaces them with settings appropriate for testing.
grep -v '#PBS' $1>devscript.sh
nsub -l walltime=10:00,nodes=2:ppn=1 -A account_0001 -j oe -q v4dev -N testing devscript.sh
rm devscript.sh

Some techniques for development:

  • Run the script interactively. - Submit an interactive job with the -I option to nsub and the v4dev queue. Execute your script on the command line to see what it does.
  • Use variables for directories. - $HOME is your home directory, and $PBS_O_WORKDIR is, by default, the directory where you submit the batch file.
  • Echo commands in the batch file. - set -x tells the shell to echo each command before it runs. You can also set a prompt for the batch script with the command export PS4='${BASH_SOURCE}:${LINENO}: '. This will tell you which line of the script is executing.
  • Log in to the batch node while the job is running. - When your job is running, even on v4, you can log in to the node where it is running. Find the name of the head node with checkjob -v jobid, then ssh to that node to see any output or data files.
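
A sketch of that last technique (the job id, node name, and temporary directory are placeholders; take the actual node name from the checkjob output):

$ checkjob -v 1004757                   # the verbose output lists the node(s) assigned to the job
$ ssh compute-3-48                      # log in to the job's head node
$ ls /tmp/1004757.scheduler.v4linux     # inspect files in the job's $TMPDIR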

In the following example, the script checks to see whether PBS_NODEFILE exists to determine whether it is running in batch or interactive mode. If it is interactive, it sets necessary variables to test MPI.

#!/bin/sh
#PBS -l walltime=00:10:00,nodes=2:ppn=1
#PBS -A	account_0001
#PBS -j oe
#PBS -N batchtest
#PBS -q v4dev

# Turn on echo of shell commands
set -x
# Set the prompt to show which line is running and from which file.
export PS4='${BASH_SOURCE}:${LINENO}: '

# Because jobs start in the HOME directory, move to where we submitted.
cd "$PBS_O_WORKDIR" 

# Pull standard stuff from the environment variables
if [ -n "$PBS_NODEFILE" ]
then
  NODECNT=$(wc -l < "$PBS_NODEFILE")
  TASKCNT=`expr 8 '*' $NODECNT`
  RUNDIR=$PBS_O_WORKDIR
  # The job id is something like 613.scheduler.v4linux.
  # This deletes everything after the first dot.
  JOBNUMBER=${PBS_JOBID%%.*}
  echo '============================'
  echo $0
  echo '============================'
else
  # For interactive testing, create your own node file with "localhost"
  NODECNT=1
  TASKCNT=4
  RUNDIR=$HOME/dev/CAC/batch
  PBS_NODEFILE=$RUNDIR/nodefile
  echo localhost>$PBS_NODEFILE
  JOBNUMBER=01
fi

# Set up our job
EXT=$JOBNUMBER

cd $RUNDIR

cat $PBS_NODEFILE
if mpdboot -n $NODECNT -r /usr/bin/ssh -f $PBS_NODEFILE
then
  mpiexec -ppn 8 -np $TASKCNT $RUNDIR/subscript.sh
  mpdallexit
fi

Running a Python script as a batch job

  1. Create and test your python script.
  2. Create a shell script that executes the Python script.
     #!/bin/sh
     
     #PBS -A dal16_0001
     #PBS -l walltime=02:00,nodes=1:ppn=1
     #PBS -N PythonTest
     #PBS -j oe
     #PBS -q v4-64g
     
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
    
     #Execute the script with the default Python (python 2.4 on this system)
     #Usage: python script.py scriptargs
     python GA.py pop=128 maxgen=100
     #To use Python 2.5, specify the python2.5 binary instead:
     #python2.5 GA.py pop=128 maxgen=100
    Note that it is possible to submit a python script directly, but this isn't recommended. Any application that can be run (python script, compiled binary, executable jar, perl script, etc) can be started with a single line inside of a shell script. This keeps the format of the batch scripts the same and makes for more reproducible results.
  3. Once the script is ready, save it to something reasonable (like python.sh) and submit the job using nsub:
    $ nsub python.sh

Controlling the number of MPI processes per node

The CAC clusters all have multiple CPUs per node, which allows you to run more than one MPI process per node if desired. It is very easy to control this process by using the "ppn" flag to mpiexec. PPN stands for processors per node, and it effectively allows you to specify the number of times you would like each hostname to appear in the machinefile list that you will pass to mpiexec.

To run a single MPI process per node (e.g., you are using OpenMP for intra-node parallelism, and/or you have a memory-hungry application):

 $ mpiexec -ppn 1 -np $np ./multithreaded_app

To run multiple MPI processes per node (specifically, to run separate processes on all 8 CPUs on a v4 node):

 $ mpiexec -ppn 8 -np $processcnt ./mpi_process

or

 $ mpiexec -np $np ./mpi_process

Note that in the second example, the default is to use all the available CPUs (cores) on a node, which equals 8 on v4. In general, the -ppn X flag implies X CPUs on a single node will be assigned MPI processes before any processes are assigned to the next node in your parallel job. Whatever you require for your job, we recommend that you set the -ppn flag explicitly to be sure you're getting what you expect. You should also note that nodes are scheduled exclusively to you while your job is running, so specifying a single CPU on a node (-ppn 1) does not imply or allow the simultaneous use of the remaining CPUs by another user.

Given this, how would you start a multi-node, multi-core job using BASH?

# Count cores per node and nodes in the job, then compute the total process count
CPUPERNODE=`grep -c processor /proc/cpuinfo`
NODECNT=`wc -l < "$PBS_NODEFILE"`
PROCESSCNT=$((CPUPERNODE * NODECNT))

mpdboot -n $NODECNT -r /usr/bin/ssh -f "$PBS_NODEFILE"
mpiexec -ppn $CPUPERNODE -np $PROCESSCNT /your/program
mpdallexit

Advanced Linux Batch

Linux Batch FAQ