Batch Script to Run Multiple Copies

From CAC Documentation wiki
  • Linux cluster
  • Many copies of the same program with different arguments

Each separate computer in the cluster is called a node. A node can have more than one processor, and each processor usually has multiple processing cores. On the V4 cluster each node has 8 cores in total, so you can run 8 tasks per node, that is, 8 copies of your program.

While tasks are running, they should write their output files to the local temporary directory on each node, which is /tmp. At the end of the job, copy the files back to your home directory on the shared filesystem.
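The pattern of working in local scratch space and copying back at the end can be tried out on any machine. The following is a minimal sketch, not part of the cluster scripts: scratch_dir and shared_dir are hypothetical stand-ins created with mktemp, where on the cluster they would be /tmp/$USER and a directory under your home directory.

```shell
#!/bin/sh
# Sketch of the "work locally, copy back at the end" pattern.
# scratch_dir and shared_dir are hypothetical stand-ins so that the
# example can run anywhere without touching real cluster paths.
scratch_dir=$(mktemp -d)
shared_dir=$(mktemp -d)

# Do the work in the local scratch directory.
cd "$scratch_dir" || exit 1
echo "result of task 0" > out0.txt
echo "result of task 1" > out1.txt

# At the end of the job, copy the output back to shared storage.
cp out*.txt "$shared_dir"/

# Count what arrived on the shared side.
copied=$(ls "$shared_dir" | wc -l | tr -d ' ')
echo "copied $copied files"
```

Keeping the writes local avoids many tasks hitting the shared filesystem at the same time; only the final copy touches it.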

The batch script sets up the computing environment, runs the tasks, and tears the environment down. It uses four files:

  • batch.sh: The user submits this script to the batch system. It runs the next three.
  • to_node.sh: Run once for each node, this sets up the /tmp directory.
  • task.sh: Run once for each task (8 per node), this runs the main program.
  • from_node.sh: Run once for each node, this copies files back.

You run "nsub batch.sh" to submit the job, and the scheduler later runs batch.sh on one of the nodes. Batch.sh, in turn, calls the other three files, to_node, task, and from_node. Here are a couple of friendly tips:

  • If you want to copy-and-paste these scripts, first switch to the "view source" tab at the top of this page.
  • Remember to set the execute bit on the scripts by typing "chmod +x *.sh" at the command line.

batch.sh:

#!/bin/sh
#PBS -l walltime=1:00:00,nodes=2
#PBS -A dal16_0001
#PBS -j oe
#PBS -N insects
#PBS -q v4dev

# Turn on echo of shell commands
set -x

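# Count the number of cores on this host.
# Match only lines that start with "processor", so that other
# lines mentioning the word (such as "model name") are not counted.
# Assume this doesn't change across the cluster.
CORESPERNODE=$(grep -c '^processor' /proc/cpuinfo)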

# Pull standard stuff from the environment variables
# If running under batch, as defined by the presence of
# a PBS_NODEFILE, then pull info from environment variables.
if [ -n "$PBS_NODEFILE" ]
then
  NODECNT=$(wc -l < "$PBS_NODEFILE")
  TASKCNT=`expr $CORESPERNODE \* $NODECNT`
  RUNDIR=$PBS_O_WORKDIR
  # The job id is something like 613.scheduler.v4linux.
  # This deletes everything after the first dot.
  JOBNUMBER=${PBS_JOBID%%.*}
  echo '============================'
  echo $0
  echo '============================'
else
  # These variables are used running an interactive debugging job
  # on v4dev.
  NODECNT=1
  TASKCNT=4
  RUNDIR=/home/gfs01/ajd27/dev/working
  PBS_NODEFILE=$RUNDIR/nodefile
  echo localhost>$PBS_NODEFILE
  JOBNUMBER=01
fi

# Set up our job
EXT=$JOBNUMBER
cd $RUNDIR

cat $PBS_NODEFILE
if mpdboot -n $NODECNT -r /usr/bin/ssh -f $PBS_NODEFILE
then
  mpiexec -ppn 1 -np $NODECNT $RUNDIR/to_node.sh $EXT $RUNDIR
  mpiexec -ppn $CORESPERNODE -np $TASKCNT $RUNDIR/task.sh $EXT $RUNDIR
  mpiexec -ppn 1 -np $NODECNT $RUNDIR/from_node.sh $EXT $RUNDIR

  mpdallexit
fi
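Two small shell idioms in batch.sh are worth trying in isolation: stripping the job ID down to its number, and computing the task count. The sketch below uses hard-coded example values (the job ID is the one quoted in the script's comment; the core and node counts are assumed) instead of values supplied by PBS.

```shell
#!/bin/sh
# Example values; on the cluster PBS sets PBS_JOBID, and batch.sh
# derives the core and node counts itself.
PBS_JOBID=613.scheduler.v4linux
CORESPERNODE=8
NODECNT=2

# %%.* removes the longest suffix that starts with a dot,
# leaving only the text before the first dot.
JOBNUMBER=${PBS_JOBID%%.*}

# One task per core on every node.
TASKCNT=$((CORESPERNODE * NODECNT))

echo "job $JOBNUMBER runs $TASKCNT tasks"
# prints: job 613 runs 16 tasks
```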

to_node.sh:

#!/bin/bash

# The %%.* strips the domain from the hostname, as in task.sh.
EXT=$1-${HOSTNAME%%.*}
RUNDIR=$2

SCRATCH=/tmp/$USER
# -p tells mkdir not to worry if the directory already exists.
# If it matters, you could delete everything in the directory before starting.
mkdir -p $SCRATCH

cp $RUNDIR/*.R $SCRATCH/

When the task.sh script runs, MPI defines a variable called PMI_RANK, which holds the zero-based index of this script among the tasks. For instance, if you started four tasks, then PMI_RANK would be 0, 1, 2, or 3 in each of the scripts when they run. We use this variable to parameterize our program so that it does different work in each script and saves the results to a different output file.

task.sh:

#!/bin/bash

# Create a file extension unique to each hostname (if you want)
# The %%.* turns v4linuxlogin1.cac.cornell.edu into v4linuxlogin1.
EXT=$1-${HOSTNAME%%.*}
RUNDIR=$2

SCRATCH=/tmp/$USER

cd $SCRATCH

R --no-save --args ${PMI_RANK} < main.R > out${PMI_RANK}.txt
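Before submitting a job, it can help to check the per-rank file naming without MPI by setting PMI_RANK by hand. The following is a minimal sketch under that assumption; fake_task is a hypothetical stand-in for the R command, and the work happens in a throwaway directory created with mktemp.

```shell
#!/bin/sh
# Dry run of the per-rank naming scheme without MPI.
# fake_task is a hypothetical stand-in for the R command in task.sh.
fake_task() {
  echo "processing parameter $1"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# mpiexec would set PMI_RANK for each task; here we set it by hand.
for PMI_RANK in 0 1 2 3
do
  fake_task "$PMI_RANK" > "out${PMI_RANK}.txt"
done

# Each rank has written its own file: out0.txt through out3.txt.
ls out*.txt
```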

Finally, copy the results back to the shared drive.

from_node.sh:

#!/bin/bash

RUNDIR=$2

SCRATCH=/tmp/$USER
cd $SCRATCH

cp out* $RUNDIR

The R code to get the command-line argument is as follows.

takeOnlyArgumentsAfterDashArgs=TRUE
args=commandArgs(takeOnlyArgumentsAfterDashArgs)
if (length(args)<1) {
  print("R --no-save --args arg1 arg2... < script > out.dat");
  stop();
} 

whichParameter = as.integer(args[1])
cat("which parameter", whichParameter, "\n");

This enables you to use the PMI_RANK, now accessible as the integer variable whichParameter, to decide what this particular R process should do.
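The same idea works when the program is driven from the shell instead of R: the rank indexes into a fixed list of parameter values. The sketch below is one way to do this in plain sh; pick_size and the list of sizes are hypothetical, and PMI_RANK is set by hand here because the example runs outside MPI.

```shell
#!/bin/sh
# pick_size is a hypothetical helper: it maps a zero-based rank
# to one value from a fixed list of parameters.
pick_size() {
  rank=$1
  set -- 100 200 300 400
  # Positional parameters are 1-based, so after shifting past
  # the first $rank entries the wanted value is in $1.
  shift "$rank"
  echo "$1"
}

# Under mpiexec this would already be set; assumed here for the demo.
PMI_RANK=2
size=$(pick_size "$PMI_RANK")
echo "rank $PMI_RANK uses size $size"
# prints: rank 2 uses size 300
```

The helper does no bounds checking, so the list needs at least as many entries as there are tasks.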

Running Lines From a Script File

If it feels easier, you could make a script file, todo.txt:

 myprogram -size 100
 myprogram -size 200
 myprogram -size 300
 myotherprogram -size 100
 myotherprogram -size 100

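and then have each task run one line from this file. Because task.sh changes to the scratch directory before reading todo.txt, to_node.sh must also copy todo.txt there. Change task.sh: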

#!/bin/bash

# Create a file extension unique to each hostname (if you want)
# The %%.* turns v4linuxlogin1.cac.cornell.edu into v4linuxlogin1.
EXT=$1-${HOSTNAME%%.*}
RUNDIR=$2
SCRATCH=/tmp/$USER
cd $SCRATCH

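# PMI_RANK is zero-based, but line numbers start at 1.
linenumber=$((PMI_RANK + 1))
# Run only that line. sed prints nothing when todo.txt has fewer lines
# than there are tasks, so surplus tasks do nothing instead of
# repeating the last line.
eval "$(sed -n "${linenumber}p" todo.txt)"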
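The line selection can be checked locally before submitting a job. The sketch below is a dry run under stated assumptions: the todo.txt it creates holds harmless echo commands, run_line is a hypothetical helper, and the line is picked with sed -n, which is one possible way to select a single line and prints nothing for a rank past the end of the file.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hypothetical todo.txt whose commands are harmless echo calls.
cat > todo.txt <<'EOF'
echo size 100
echo size 200
echo size 300
EOF

# run_line is a hypothetical helper that runs the line for one rank.
run_line() {
  linenumber=$(($1 + 1))
  eval "$(sed -n "${linenumber}p" todo.txt)"
}

# Four ranks but only three lines: rank 3 has nothing to run.
for rank in 0 1 2 3
do
  echo "rank $rank: $(run_line "$rank")"
done
```

With more tasks than lines in todo.txt, the extra ranks simply produce no output.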