Linux Batch Examples

From CAC Documentation wiki

Scheduler Examples

This page provides instructions on how to create a batch script for the scheduler to actually run jobs on the V4 system. This page covers the basics of submitting various types of jobs to the scheduler. Jobs are submitted using the nsub command, so for the full listing of options that can be passed to nsub, consult the command reference page.

HPC Cluster Hardware List

Things to do before you run any linux jobs on the nodes

  1. Connect to the v4 Linux login node using ssh.
  2. Before you submit a job, you must have valid ssh keys. ROCKS creates the key files for you automatically when you first log in. You will be prompted for a passphrase; if you choose not to enter one, just hit <enter> twice. If you suspect a problem with your keys and want to recreate them, rename your ~/.ssh directory (in case you have other keys you wish to save), log off, and log back in. Your key files will be recreated automatically.
  3. You must have a valid account number that you are allowed to charge against. All jobs must be submitted with the account number you wish to charge against explicitly provided, so have that number handy.
  4. Get your batch script ready. This is a script that will run on the compute node; an example that executes a fictional binary in serial is shown below.
     # copy the binary and data files to a local directory
     cp $HOME/test/karpc $TMPDIR/karpc
     cp $HOME/test/values $TMPDIR/values
     cd $TMPDIR
     # run the executable from local disk
     ./karpc >& karpc.stdout
     # Copy output files to your output folder
     cp -f $TMPDIR/karpc.stdout $HOME/test
  5. At this point, you should be ready to submit your job. You'll need to provide an account number, a walltime estimate, and the number of nodes you're requesting when you submit the job. You can also provide any other nsub arguments that you like.
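For example, a minimal submission might look like the following (the account number and script name here are placeholders, not real values):

```shell
# Hypothetical account and script name; substitute your own.
ACCOUNT=abc42_00001
SCRIPT=batch.sh

# On the login node, you would submit with:
#   nsub -A $ACCOUNT -l nodes=1,walltime=10:00 $SCRIPT
# Shown here with echo so the assembled command line is visible:
echo "nsub -A $ACCOUNT -l nodes=1,walltime=10:00 $SCRIPT"
```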

Using a temporary directory

  • CLUSTER POLICY: A program is run at the beginning of every job that clears /tmp. This ensures that each user has all of /tmp for the duration of their job.
  1. Most jobs will want to create a temporary directory on the compute node's local disk to read and write data. It is faster to do local file I/O and copy complete data files to and from $HOME at the beginning and end of the job than to do I/O over the network ($HOME is network-mounted on the compute nodes). But please see the section on /v4scratch for another alternative. In the past, this meant doing something like this at the beginning of the batch script:
    $ mkdir /tmp/$USER
    Now Torque creates a uniquely named directory (/tmp/$PBS_JOBID) when a job starts and stores the path of this directory in the $TMPDIR environment variable. This directory is cleaned up when the job exits.
  2. To use this feature, simply change references to /tmp/$USER (or whatever scheme you're using to create a temporary directory) to $TMPDIR. The directory is created and cleaned up for you, so you can get rid of all mkdir /tmp/$USER and rmdir /tmp/$USER lines in your script as well.
  3. All of the examples on this page use $TMPDIR, so please check the specific serial or parallel sections if you're unsure what to do.
  4. You may also find the -wdir option to mpiexec helpful. It sets the working directory in which the processes started by mpiexec will run.
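The conversion is a one-line change. Here is a minimal sketch (the file names are hypothetical; the mktemp fallback is only so the snippet can be tried outside a batch job, where Torque has not set $TMPDIR):

```shell
# Outside a batch job $TMPDIR may be unset; fall back to mktemp for testing.
TMPDIR=${TMPDIR:-$(mktemp -d)}

# Old style, no longer needed:
#   mkdir /tmp/$USER && cd /tmp/$USER ... rm -rf /tmp/$USER

# New style: just use $TMPDIR. Torque creates it at job start
# and cleans it up when the job exits.
cd "$TMPDIR"
echo "working in $TMPDIR"
```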

Using /v4scratch

If you submit multiple batch files which all copy data from $HOME to $TMPDIR, the combined load can slow your jobs tremendously and/or crash our $HOME filesystem, which is bad, because that's where everyone keeps their stuff. Having multiple jobs read or write simultaneously can flood the NFS controller on $HOME.

# This is bad, mkay?
for f in *hugefiles.dat; do
  cp $f $TMPDIR
done

We worked out a solution, originally for people doing genomics, but it helps many others. There is a second, parallel filesystem called /v4scratch, available both from login nodes and from cluster nodes. It handles load better, is a little faster, is really big (16 TB raw, about 13 TB usable), and under heavy load it just gets slow instead of crashing. The big drawback is that there is no backup. Here is how you use it:

  1. From the login node, interactively use the cp command to copy your files from $HOME to /v4scratch, or alternatively transfer files from the get-go directly to /v4scratch with scp or sftp. Note that while this cp command may be slow for large data, it is not charged time.
  2. Once the files are transferred, start your batch jobs, telling them to read initial data from /v4scratch, work in $TMPDIR locally on the node, and then write back to /v4scratch. They may run faster if you stagger when you start the batch jobs.
  3. Use a single cp command from the login node again to copy back to home.
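The three steps above can be sketched in shell. Everything below is illustrative: the file names are invented, and mktemp directories stand in for /v4scratch and the node-local $TMPDIR so the sketch can be tried anywhere:

```shell
# Stand-ins for the real filesystems so the sketch runs anywhere
# (on the cluster these would be /v4scratch/$USER and the Torque $TMPDIR).
SCRATCH=$(mktemp -d)
TMPDIR=${TMPDIR:-$(mktemp -d)}

# Step 1: from the login node, copy the input to scratch once.
echo "input data" > input.dat
cp input.dat "$SCRATCH/"

# Step 2: inside each batch job, read from scratch, work locally...
cp "$SCRATCH/input.dat" "$TMPDIR/"
tr a-z A-Z < "$TMPDIR/input.dat" > "$TMPDIR/output.dat"   # stand-in "work"
# ...and write results back to scratch.
cp "$TMPDIR/output.dat" "$SCRATCH/"

# Step 3: from the login node, copy the results back toward home.
cp "$SCRATCH/output.dat" .
```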

That's it on how to use it. Some more details on the /v4scratch filesystem:

  • 13 TB of space are available via two fileservers (Dell PE2950s) running the gluster parallel file system.
  • A fiber channel storage unit (16x1TB) has been divided into two volumes. Each file server is connected to one volume via one of the two fiber channel ports on the storage unit.
  • Data rates are intermediate between home directories (/home/....) and the local scratch space on the v4 linux nodes (/tmp).

This parallel gluster filesystem is a small example of a nice solution for fast parallel file storage. Given interest we may expand its capabilities.

Running a serial job

  1. Submitting a serial job is easy: you just need to provide the account that the node hours should be charged against, the number of nodes you are requesting (1 for serial jobs), and the walltime estimate of your job. If your job exceeds the walltime, it will be terminated. You can supply these arguments on the command line using nsub as shown in Maui Commands, but a simpler and generally safer way is to add them to the batch script as PBS Directives. PBS Directives are simply command line arguments inserted at the top of the batch script and prepended with #PBS (no spaces). The account you wish to use is specified with -A, and the nodes and walltime are specified with the -l flag. More information on these flags (as well as the other flags that may be supplied) can be found in the job submission documentation. We'll use the example batch script from the prerequisites section. By putting these parameters in the batch script, you'll have a record of what you did, and the file will retain that information for the next time you run the job. Thus, we recommend that you avoid using the command line parameters in pretty much all cases.
     #NOTE: The -l, -A and -q flags are required, others are optional
     #1 node for 10 minutes
     #PBS -l nodes=1,walltime=10:00
     #your account number
     #PBS -A abc42_00001
     #Join stdout and stderr
     #PBS -j oe
     #Name the job something useful
     #PBS -N wickedCoolJob
     #PBS -q v4dev
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     # copy the binary and data files to a local directory
     cp $HOME/test/karpc $TMPDIR/karpc
     cp $HOME/test/values $TMPDIR/values
     cd $TMPDIR
     # run the executable from local disk
     ./karpc >& karpc.stdout
     # Copy output files to your output folder	
     cp -f $TMPDIR/karpc.stdout $HOME/test
  2. Once the script is ready, save it under a reasonable file name and submit the job using nsub:
    $ nsub

Running a parallel MPI job

The default version of MPI is Intel MPI.

  1. Submitting a parallel job is not significantly different than submitting a serial job. First, you'll need to compile your MPI code. In this example, we'll use a simple 'HelloWorld' application. You can follow along from the MPI Hello World source code. Compile this (or any MPI program). We'll call ours hello.out:
    $ mpicc -o hello.out HelloWorld.c
  2. Write a batch script that calls mpiexec to execute the compiled MPI code.
     #PBS -A AcctNumber
     #PBS -l walltime=02:00,nodes=4
     #PBS -N mpiTest
     #PBS -j oe
     #PBS -q v4
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     #Count the number of nodes
     np=$(wc -l < $PBS_NODEFILE)
     #boot mpi on the nodes
     mpdboot -n $np --verbose -r /usr/bin/ssh -f $PBS_NODEFILE
     #now execute one process per node
     mpiexec -ppn 1 -np $np $HOME/v4Test/hello.out
     There are two things worth noting in the above script. First, the "-j oe" directive joins the output and error streams into the same output file. Second, the "ppn" setting in a "-l nodes=4:ppn=1" directive usually determines how many times each machine name appears in $PBS_NODEFILE... HOWEVER, be aware that at CAC, this specification is ignored! In effect, it is always reset to 1, so that each machine name appears exactly once in the machine file. This is necessary to guarantee that you are granted exclusive access to the nodes in your batch job, and it helps the CAC accounting system to function properly.
     The "ppn" batch directive should be distinguished from a second use of "ppn" as an option to mpiexec (-ppn is actually an alias for -perhost). The -ppn or -perhost flag determines how many MPI processes are started on each machine listed in $PBS_NODEFILE. At CAC, the standard $PATH takes you to the Intel mpiexec, and by default the Intel mpiexec assumes "-perhost 8" on v4. However, since this default value is implementation-dependent, we recommend that you set the -ppn or -perhost flag explicitly every time you call mpiexec, as shown above. That way you ensure that you'll get exactly what you expect in case you ever use a different mpiexec. See below for more details on PPN.
  3. Once the script is ready, save it under a reasonable file name and submit the job using nsub:
    $ nsub

Moving data to nodes in a parallel job

  1. This is essentially the same as the example above; we'll just add mpiexec commands to move data to and from the nodes. Below is a modified script that copies the same input file to all nodes in the job.
     #PBS -A AcctNumber
     #PBS -l walltime=02:00,nodes=4:ppn=1
     #PBS -q v4
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     #Count the number of nodes
     np=$(wc -l < $PBS_NODEFILE)
     #boot mpi on the nodes
     mpdboot -n $np --verbose -r /usr/bin/ssh -f $PBS_NODEFILE
     mpiexec -ppn 1 -np $np cp $HOME/somedir/file.dat $TMPDIR/file.dat
     mpiexec -ppn 1 -np $np ls $TMPDIR
     #now do something that wants that file
     mpiexec -np $np $HOME/app.bin $TMPDIR/file.dat
     #clean up the mpds
     mpdallexit
  2. mpiexec is basically just a distributed ssh, and so we can use it as such to perform basic shell commands on all of the hosts in the PBS_NODEFILE. The -ppn 1 flag to mpiexec ensures that only a single process is fired per node, so that the copy is performed only once on each node in your group. If you have several files to move around, or a specific directory structure to create in the scratch space, you may want to break those commands out into a separate shell script: put all of your mkdir/cp commands in that script, and mpiexec the script from your batch script.
    $ mpiexec -ppn 1 -np $np $HOME/jobstuff/
    Remember that you'll still need to clean up /tmp once you're done.
  3. Finally, once your script is ready, save it under a reasonable file name and submit the job using nsub:
    $ nsub
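For instance, such a stage-in script might look like the sketch below (all file and directory names are hypothetical, and the fake input file is created just so the sketch runs anywhere). Launched with mpiexec -ppn 1, each node executes it exactly once:

```shell
# stage_in.sh -- build the scratch layout and copy inputs (names hypothetical)

DATADIR=$(mktemp -d)              # stands in for $HOME/somedir
echo "sample data" > "$DATADIR/file.dat"

TMPDIR=${TMPDIR:-$(mktemp -d)}    # set by Torque inside a real job

# Create the directory structure the application expects.
mkdir -p "$TMPDIR/input" "$TMPDIR/output"

# Copy the input files into node-local scratch.
cp "$DATADIR/file.dat" "$TMPDIR/input/"

# From the batch script, you would run this once per node with something like:
#   mpiexec -ppn 1 -np $np $HOME/jobstuff/stage_in.sh
```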

You can turn any job into an interactive job by adding the PBS directive "-I" (for Interactive). You may run an interactive job on any queue. If the batch file includes more commands, the scheduler will ignore them: it reads the directives, then gives you the node without running the rest of the batch script.

For instance, the following script is exactly the same one that ran as a regular batch job, but the "#PBS -I" line is added:

#PBS -I
#PBS -l walltime=00:30:00,nodes=1
#PBS -A your_account
#PBS -j oe
#PBS -N batchtest
#PBS -q v4dev

# Turn on echo of shell commands
set -x

# Did you not put these in your ~/.profile? Then include them here.
#export PATH=$PATH:/opt/nextgen/bin
#export PERL5LIB=/opt/nextgen/lib/perl5

# The data is located in datadir.

echo starting job `date`
# Copy data to a temporary directory.
# Move to that directory.
cd ${TMPDIR}
fastx_quality_stats -i ${DATA} -o stat.xls
# Copy the data back.
cp stat.xls ${PBS_O_WORKDIR}/
echo finishing job `date`

For interactive jobs, the scheduler will wait until a node is available and open an ssh shell as soon as one is ready, which may take some time. Use 'showqlease' to check on free nodes by queue:

$ showqlease
[user@linuxlogin]$ nsub
Looking for directives in
Executing interactive
qsub: waiting for job 1001525.scheduler.v4linux to start
qsub: job 1001525.scheduler.v4linux ready
[user@compute-3-48 farm]$

That last prompt shows that you are now typing in a shell on the main node of your interactive job. Commands, such as mpiexec, should work here just as they would in a batch script. From here, you could run the rest of your script.

[user@compute-3-48 farm]$ cd ~/nextgen
[user@compute-3-48 farm]$ source

The source command runs the script in the current shell. Another way to run the script is to tell the operating system that the file is executable. Then the script can be run like any other executable.

$ chmod a+x
$ ./

Variables Defined During Batch Jobs

When a batch job runs on a node, the scheduler defines some variables that may be useful in job scripts, such as $PBS_JOBID.
You can find these variables by adding a line to your job script, such as

env | sort > variables.txt

When you run a job under mpiexec, the PBS_O_* variables are still defined, but there are a few more.

MPD_JID=1_compute-3-48.v4linux_52334

The most common one is PMI_RANK, which is the rank within the MPI communicator.
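For example, PMI_RANK can be used to give each MPI process its own log file. A sketch (the application is replaced by an echo, and PMI_RANK defaults to 0 so the snippet also runs outside mpiexec):

```shell
# Under mpiexec, each process sees its own PMI_RANK; default to 0 otherwise.
RANK=${PMI_RANK:-0}
OUTDIR=${TMPDIR:-/tmp}

# Run the (hypothetical) application, writing one log file per rank.
echo "hello from rank $RANK" > "$OUTDIR/app.log.$RANK"
```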

Debugging a Batch Script

Debugging scripts run in batch can be difficult. The standard methodology is to put lots of echo statements to print to log files. By checking the variables above, you can create a script that works either interactively or in batch mode.

What tends to go wrong when working with batch scripts?

  • Tested script in one directory but moved it to another to run real job. - If you have directories hard-coded to your user directory, you may be able to replace them with directories relative to ${PBS_O_WORKDIR}, a variable that remembers what directory you were in when you submitted the batch job.
  • Ran script on v4dev for 30 minutes. Forgot to increase time for v4. - The length of time a job runs is typically in the first line of the batch file as a resource listing, -l walltime=4:00:00,nodes=10:ppn=1. You could leave the walltime long for production jobs and make a short script to submit development jobs. This script strips scheduler specifications from the script and replaces them with those appropriate to testing.
grep -v '#PBS' $1>
nsub -l walltime=10:00,nodes=2:ppn=1 -A account_0001 -j oe -q v4dev -N testing

Some techniques for development:

  • Run the script interactively. - Submit an interactive job with the -I option to nsub and the v4dev queue. Execute your script on the command line to see what it does.
  • Use variables for directories. - $HOME is your home directory, and $PBS_O_WORKDIR is, by default, the directory where you submit the batch file.
  • Echo commands in the batch file. - set -x tells the batch file to echo each command before it runs. You can also set a prompt for the batch script with the command export PS4='${BASH_SOURCE}:${LINENO}: '. This will tell you what line of the script is executing.
  • Login to the batch node while the job is running. - When your job is running, even on v4, you can login to the node where it is running. Find the name of the head node with checkjob -v jobid. Then ssh to that node to see any output or data files.
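To see the effect of the PS4 trick, here is a self-contained sketch that writes a tiny throwaway script and captures its trace (the file names are arbitrary):

```shell
# Write a throwaway script that turns on tracing with a file:line prompt.
cat > /tmp/ps4demo.sh <<'EOF'
export PS4='${BASH_SOURCE}:${LINENO}: '
set -x
echo hello
EOF

# Run it, sending the trace (stderr) to a file.
bash /tmp/ps4demo.sh 2> /tmp/ps4demo.trace

# Each traced command is prefixed with the script name and line number.
grep 'ps4demo.sh:3:' /tmp/ps4demo.trace
```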

In the following example, the script checks to see whether PBS_NODEFILE exists to determine whether it is running in batch or interactive mode. If it is interactive, it sets necessary variables to test MPI.

#PBS -l walltime=00:10:00,nodes=2:ppn=1
#PBS -A	account_0001
#PBS -j oe
#PBS -N batchtest
#PBS -q v4dev

# Turn on echo of shell commands
set -x
# Set the prompt to show which line is running and from which file.
export PS4='${BASH_SOURCE}:${LINENO}: '

# Because jobs start in the HOME directory, move to where we submitted.
cd "$PBS_O_WORKDIR"

# Pull standard stuff from the environment variables
if [ -n "$PBS_NODEFILE" ]; then
  NODECNT=$(wc -l < "$PBS_NODEFILE")
  TASKCNT=`expr 8 '*' $NODECNT`
  # The job id is something like 613.scheduler.v4linux.
  # This deletes everything after the first dot.
  JOBID=${PBS_JOBID%%.*}
  echo '============================'
  echo $0
  echo '============================'
else
  # For interactive testing, create your own node file with "localhost"
  # (the path is arbitrary; any writable location will do).
  PBS_NODEFILE=/tmp/$USER.nodefile
  echo localhost > "$PBS_NODEFILE"
  NODECNT=1
  TASKCNT=8
fi

# Set up our job

if mpdboot -n $NODECNT -r /usr/bin/ssh -f "$PBS_NODEFILE"; then
  mpiexec -ppn 8 -np $TASKCNT $RUNDIR/
fi

Running a Python script as a batch job

  1. Create and test your python script.
  2. Create a shell script that executes the Python script.
     #PBS -A dal16_0001
     #PBS -l walltime=02:00,nodes=1:ppn=1
     #PBS -N PythonTest
     #PBS -j oe
     #PBS -q v4-64g
     # Because jobs start in the HOME directory, move to where we submitted.
     cd "$PBS_O_WORKDIR" 
     #Execute python2.4 script
     #python scriptargs
     python pop=128 maxgen=100
     #To use python 2.5 specify the python2.5 binary
     #python2.5 pop=128 maxgen=100
    Note that it is possible to submit a python script directly, but this isn't recommended. Any application that can be run (python script, compiled binary, executable jar, perl script, etc) can be started with a single line inside of a shell script. This keeps the format of the batch scripts the same and makes for more reproducible results.
  3. Once the script is ready, save it under a reasonable file name and submit the job using nsub:
    $ nsub

Controlling the number of MPI processes per node

The CAC clusters all have multiple CPUs per node, which allows you to run more than one MPI process per node if desired. It is very easy to control this process by using the "ppn" flag to mpiexec. PPN stands for processors per node, and it effectively allows you to specify the number of times you would like each hostname to appear in the machinefile list that you will pass to mpiexec.

To run a single MPI process per node (e.g., you are using OpenMP for intra-node parallelism, and/or you have a memory-hungry application):

 $ mpiexec -ppn 1 -np $np ./multithreaded_app

To run multiple MPI processes per node (specifically, to run separate processes on all 8 CPUs on a v4 node):

 $ mpiexec -ppn 8 -np $processcnt ./mpi_process


Or, leaving the flag unset and relying on the default:

 $ mpiexec -np $np ./mpi_process

Note that in the last example, the default is to use all the available CPUs (cores) on a node, which equals 8 on v4. In general, the -ppn X flag means X CPUs on a single node will be assigned MPI processes before any processes are assigned to the next node in your parallel job. Whatever you require for your job, we recommend that you set the -ppn flag explicitly to be sure you're getting what you expect. You should also note that nodes are scheduled exclusively to you while your job is running, so specifying a single CPU on a node (-ppn 1) does not imply or allow the simultaneous use of the remaining CPUs by another user.

Given this, how would you start a multi-node, multi-core job using BASH?

# Number of CPUs (cores) per node; 8 on v4
CPUPERNODE=`grep -c processor /proc/cpuinfo`
# One line per node in the machine file
NODECNT=$(wc -l < "$PBS_NODEFILE")
# Total process count = cores per node * number of nodes
PROCESSCNT=`expr $CPUPERNODE '*' $NODECNT`

mpdboot -n $NODECNT -r /usr/bin/ssh -f "$PBS_NODEFILE"
mpiexec -ppn $CPUPERNODE -np $PROCESSCNT /your/program