Gnu Parallel

From CAC Documentation wiki

Gnu Parallel in Batch

Gnu Parallel is a small program that runs command lines in parallel, either on one computer or across several. Its man page has many worked examples. On the cluster, you can use Gnu Parallel within a batch job to run multiple copies of the same program.

For example, let's say you have a program which you run as "./process_one input.dat". If you make a file that lists the input file names, one per line, the first few lines of that file might be:

input01.dat
input02.dat
input03.dat
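One way to build such a list is to let the shell expand a glob and write one name per line. This is only a sketch: the scratch directory and the empty stand-in data files are assumptions made so the example runs on its own, and the input*.dat pattern should be adjusted to match your real files.

```shell
#!/bin/sh
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Stand-in data files (hypothetical); in practice these already exist.
touch input01.dat input02.dat input03.dat

# The glob expands in sorted order; printf writes one file name per line.
printf '%s\n' input*.dat > input_file.txt

cat input_file.txt
```

Each line of input_file.txt then becomes one task.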

Then submit a batch file which uses Gnu Parallel to process all of those datasets. This example starts a job using two nodes of the v4dev queue. This makes sense when there are sixteen or more input files, at least one for each of the eight cores on each of the two nodes. $PBS_NODEFILE names a file that lists the nodes available to the job when it runs.

#!/bin/sh  
#PBS -l nodes=2,walltime=30:00
#PBS -A your_account_0001
#PBS -j oe
#PBS -N gnu-parallel
#PBS -q v4dev

set -x
cd "$PBS_O_WORKDIR"

parallel --sshloginfile "$PBS_NODEFILE" ./process_one {} < input_file.txt

Gnu Parallel counts the number of cores on each node and starts one job per core. As each job completes, it starts another on that core, until there are no more lines in the file. It doesn't need MPI, and you could run the program the same way on your own multicore Linux computer in the office.
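As a rough sketch of the office-computer case, the same command can be run without --sshloginfile so that all jobs stay on the local machine. The process_one script below is a hypothetical stand-in that only echoes its argument, and the scratch directory is an assumption made to keep the example self-contained. The --keep-order option is used so that the output appears in input order.

```shell
#!/bin/sh
# Work in a scratch directory so the example does not touch real data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical stand-in for the real program: report which input was handled.
cat > process_one <<'EOF'
#!/bin/sh
echo "processed $1"
EOF
chmod +x process_one

printf '%s\n' input01.dat input02.dat input03.dat > input_file.txt

# Run only if the installed "parallel" really is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # No --sshloginfile, so every job runs on this machine.
    parallel --keep-order ./process_one {} < input_file.txt > results.txt
    cat results.txt
else
    echo "GNU Parallel is not installed on this machine" >&2
fi
```

Testing a small input list this way before submitting the batch job is a cheap check that the program accepts its argument as expected.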

If you don't have an input file, you can use the seq command to generate an input list of the right length.

NODECNT=$(wc -l < "$PBS_NODEFILE")
TASKCNT=$((8 * NODECNT))
seq 1 "$TASKCNT" | parallel --sshloginfile "$PBS_NODEFILE" ./process_one.sh "$TMPDIR" "$PBS_O_WORKDIR" {}

Here process_one.sh could be a script that uses either its command-line argument or the PARALLEL_SEQ variable, which Gnu Parallel defines, to determine which task it is. When process_one.sh starts on the remote node, very few environment variables are defined, so pass in whatever the script needs as arguments. Only these are defined:

HOME=/home/fs01/ajd27
LOGNAME=ajd27
MAIL=/var/mail/ajd27
PARALLEL_PID=30932
PARALLEL_SEQ=2
PATH=/usr/local/bin:/bin:/usr/bin
PWD=/home/fs01/ajd27
SHELL=/bin/bash
SHLVL=2
SSH_CLIENT=192.168.49.128 52366 22
SSH_CONNECTION=192.168.49.128 52366 192.168.49.128 22
USER=ajd27

Two variables that are typically worth passing in are TMPDIR and PBS_O_WORKDIR. Note also that the process on the remote host starts in your home directory.
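The choice between the command-line argument and PARALLEL_SEQ can be sketched as a small helper. The task_id function is hypothetical, not part of Gnu Parallel: it prefers an explicit argument and falls back to PARALLEL_SEQ when none is given.

```shell
#!/bin/sh
# task_id (hypothetical helper): print the task number for this invocation.
task_id() {
    if [ -n "${1:-}" ]; then
        # An explicit argument wins, e.g. the {} value passed on the command line.
        echo "$1"
    elif [ -n "${PARALLEL_SEQ:-}" ]; then
        # Otherwise use the sequence number that Gnu Parallel exports to each job.
        echo "$PARALLEL_SEQ"
    else
        echo "no task id available" >&2
        return 1
    fi
}
```

Inside a job script this might be called as TASKID=$(task_id "$3"), so the same script also works when run by hand with an explicit number.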

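#!/bin/bash
# process_one.sh
# The batch environment is not available on the remote node, so values arrive as arguments.
TMPDIR=$1
PBS_O_WORKDIR=$2
TASKID=$3
# The TASKID and the PARALLEL_SEQ variable will be the same.

# The remote process starts in the home directory, so this path is relative to $HOME.
cd working/project0 || exit 1

OUTFILE="$TMPDIR/data${PARALLEL_SEQ}.txt"
# TRACE_FILE must be set before it is used; this per-task location is an assumption.
TRACE_FILE="$PBS_O_WORKDIR/trace${PARALLEL_SEQ}.txt"
(/usr/bin/time my_program > "$OUTFILE") 2>> "$TRACE_FILE"
cp "$OUTFILE" "$PBS_O_WORKDIR"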

The /usr/bin/time program writes its timing report to stderr rather than stdout, so an ordinary output redirect does not capture it. We work around this by running the program within a subshell (parentheses create a subshell) and then redirecting the subshell's stderr to the trace file to get timing on the command. Any error messages from my_program itself land in the trace file as well.
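The stream separation can be seen in a small sketch. To keep it runnable anywhere, the timed_program function is a hypothetical stand-in for "/usr/bin/time my_program": it writes a result line to stdout and a made-up timing line to stderr. The file names are assumptions as well.

```shell
#!/bin/sh
scratch=$(mktemp -d)
OUTFILE=$scratch/data.txt
TRACE_FILE=$scratch/trace.txt

# Hypothetical stand-in: results on stdout, timing report on stderr.
timed_program() {
    echo "result line"
    echo "real 0.01" >&2
}

# The inner redirect captures stdout; the outer one appends the subshell's stderr.
(timed_program > "$OUTFILE") 2>> "$TRACE_FILE"

echo "output file:"; cat "$OUTFILE"
echo "trace file:";  cat "$TRACE_FILE"
```

The result line ends up only in the output file and the timing line only in the trace file, which is the separation the real script relies on.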