Difference between revisions of "Slurm"

From CAC Documentation wiki
Latest revision as of 20:37, 23 September 2020

Some of the CAC's Private Clusters are managed with OpenHPC, which includes the Slurm Workload Manager (Slurm for short). Slurm (originally the Simple Linux Utility for Resource Management) is a group of utilities used for managing workloads on compute clusters.

This page is intended to give users an overview of Slurm. Some of the information on this page has been adapted from the Cornell Virtual Workshop topics on the Stampede2 Environment and Advanced Slurm. For a more in-depth tutorial, please review these topics directly.

One important distinction of Slurm configurations on CAC private clusters is that scheduling is typically done by CPU, not by node. This means that by default, a node may be shared among multiple users.

Overview

Some clusters use Slurm as the batch queuing system and the scheduling mechanism. This means that jobs are submitted to Slurm from a login node, and Slurm handles scheduling these jobs on nodes as resources become available. Users submit jobs to the batch component, which is responsible for maintaining one or more queues (also known as "partitions"). These jobs include information about themselves as well as a set of resource requests. Resource requests include anything from the number of CPUs or nodes to specific node requirements (e.g., only use nodes with > 2GB RAM). A separate component, called the scheduler, is responsible for figuring out when and where these jobs can be run on the cluster. The scheduler needs to take into account the priority of the job, any reservations that may exist, when currently running jobs are likely to end, etc. Once informed of scheduling information, the batch system will handle starting your job at the appropriate time and place. Slurm handles both of these components, so you don't have to think of them as separate processes; you just need to know how to submit jobs to the batch queue(s).

Note: Refer to the documentation for your cluster to determine what queues/partitions are available.

Running Jobs

This section covers general job submission and job script composition; for more specific details on how to run jobs or job scripts and use queues on your particular system, see the documentation for the private cluster you are working on. Also note, many of the following commands have several options. For full details, see the man page for the command, or the Slurm Docs.

Cluster Info

Here are the common commands used to display information about a cluster:

  • sinfo displays information about nodes and partitions/queues. Use -l for more detailed information.
  • scontrol show nodes <list> views the state of the nodes in the list. If <list> is not supplied, all nodes are shown.
  • scontrol show partition <name> views the state of the named partition/queue. If <name> is not supplied, all queues are shown.
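As a rough sketch of how these commands can be combined, the snippet below prints a one-line summary per partition followed by a per-node listing. The choice of columns in the format string is only an illustration, and the guard lets the snippet exit quietly on a machine where Slurm is not installed.

```shell
#!/bin/bash
# Columns: partition name, node count, CPUs per node, memory per node (MB), time limit.
# This selection is an illustrative choice, not a site convention.
SINFO_FORMAT="%P %D %c %m %l"

if command -v sinfo >/dev/null 2>&1; then
    # One summary line per partition, using the custom format
    sinfo -o "$SINFO_FORMAT"
    # Node-oriented long listing, one line per node
    sinfo -N -l
else
    echo "sinfo not found; run this on a cluster login node"
fi
```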

Job Control

These are the most common commands for controlling jobs:

  • sbatch testjob.sh submits a job where testjob.sh is the script you want to run. Also see the Job Scripts section and the sbatch documentation.
    • As your job runs, the stdout and stderr streams are recorded in a pair of files, generally in the directory from which you submitted the job (be sure it is writable). Typically the names of the files will contain the job id. Examine these files after your job completes to verify that your job ran successfully.
  • srun -p <partition> --pty /bin/bash -l starts an interactive job for you and opens a login shell where you can enter commands. Also see the srun documentation.
    • If you intend to use multiple CPUs, or occupy a whole node, specify -n 1 -c <m> or --exclusive immediately after srun (details below).
    • Note: remember to exit the session once you are done to free resources for other users.
  • scancel <job id> cancels a job. Also see the scancel documentation.
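When scripting these commands, it helps to capture the job id at submission time. The sketch below relies on the --parsable option of sbatch, which prints only the job id (optionally followed by a semicolon and the cluster name). The helper name job_id_from_parsable is hypothetical, and testjob.sh is the example script named in the list above.

```shell
#!/bin/bash
# Extract the numeric job id from "sbatch --parsable" output,
# which has the form "jobid" or "jobid;cluster".
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# Submit only when Slurm and the example script are both present.
if command -v sbatch >/dev/null 2>&1 && [ -f testjob.sh ]; then
    if out=$(sbatch --parsable testjob.sh); then
        jobid=$(job_id_from_parsable "$out")
        echo "submitted job $jobid"
        # scancel "$jobid"    # uncomment to cancel the job again
    fi
fi
```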

Job Status

Here are some commands you can use to learn about the status of your jobs:

  • squeue -u my_userid shows the state of jobs for user my_userid. Also see the squeue documentation.
  • scontrol show job <job id> views the state of a job. Also see the scontrol documentation.
  • squeue with no arguments retrieves summary information on all jobs currently queued.
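A script sometimes needs to wait until a job has left the queue. The following is a minimal sketch, assuming that squeue prints nothing for a job that is no longer queued or running; the function names and the 30-second default polling interval are arbitrary choices.

```shell
#!/bin/bash
# Print the state of one job (e.g. PENDING, RUNNING), or nothing
# once the job is no longer known to squeue.
job_state() {
    squeue -h -j "$1" -o "%T" 2>/dev/null
}

# Poll until the job leaves the queue.
# $1 = job id, $2 = polling interval in seconds (default 30)
wait_for_job() {
    while [ -n "$(job_state "$1")" ]; do
        sleep "${2:-30}"
    done
}
```

After submitting a job, calling wait_for_job with its job id blocks until the job has finished or been cancelled; scontrol show job can then be used to inspect how it ended.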

Options for Submitting Jobs

The scheduling and execution of jobs submitted using sbatch or srun can be affected by various options that you specify for these commands. Slurm offers many such options; in this section, we try to cover the important ones.

Key Options for Serial Applications

The following table shows certain key options for Slurm jobs. We recommend that you provide these options for every job you submit. If these directives are not supplied, then defaults go into effect which could influence how your job runs in some unexpected way. This table assumes (for simplicity) that you are running a serial application, i.e., a job with just one task. In a later section, we will look at parallel applications.

| Meaning | Flag | Long Flag | Allowed Value | Example | Default |
|---|---|---|---|---|---|
| Submission queue | -p | --partition= | Queue/partition name (valid names are cluster dependent) | -p normal | (cluster dependent) |
| Job walltime | -t | --time= | hh:mm:ss (not to exceed time limit of queue) | -t 00:05:00 | time limit of queue |
| Number of tasks | -n | --ntasks= | This table assumes -n 1 is present | -n 1 | (when absent, Slurm must compute a value based on other job options) |
| CPUs per task | -c | --cpus-per-task= | 1 ... number of CPUs on one node (job fails if no node has enough CPUs) | -c 16 | 1 (but 2 are assigned for -n 1; CPU total on each node must be even) |
| Number of nodes | -N | --nodes= | This table assumes -n 1 is present, so Slurm enforces -N 1 as well | | enough to satisfy -n and -c (therefore, 1 for -n 1) |

The queue (partition) name must be chosen from among those that are defined for your cluster. Consult your cluster's documentation for valid queue names, as well as the resources and time limits that are associated with each queue.

Specifying a time limit is helpful not only to you, but also to the scheduler and to your fellow cluster users. If you know in advance that your job will not take the maximum running time allowed in the queue, you should use the -t option to give an accurate expectation. This encourages the scheduler to "backfill" your job so it might run sooner than it otherwise would; and as a nice bonus, your job might get out of the way of other waiting jobs.
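One way to arrive at a value for -t is to time a representative run and add a margin. The sketch below converts a number of seconds into hh:mm:ss form; the helper name to_walltime and the 50% margin are assumptions made for illustration only.

```shell
#!/bin/bash
# Convert a run time in seconds into the hh:mm:ss form accepted by -t.
to_walltime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Example: a run measured at 200 s, padded by 50% to 300 s
measured=200
to_walltime $(( measured * 3 / 2 ))    # prints 00:05:00
```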

The default CPUs per task is 1, but when you specifically request 1 task (on 1 node), 2 CPUs are actually assigned to your task by default. Why is this? It is important to know that Slurm counts each physical core of a multi-core processor as two CPUs (in CAC's typical configurations). This is due to Intel's hyperthreading technology, which makes each physical core appear to be two hardware threads to the OS. For this reason, an even number of CPUs is allocated on each node that is assigned to your job.

For the purpose of scheduling jobs, Slurm calculates the total number of CPUs per node as follows:

CPUs/node = (boards/node) * (sockets/board) * (cores/socket) * (hardware threads/core)
          =       1       *        2        * (cores/socket) *            2

Most of the above factors are fixed, because nearly all CAC clusters consist of dual-socket nodes, i.e., each of the 2 sockets in the node holds one Intel multi-core processor. The number of cores per processor is generally the lone variable, and it might vary quite a lot, even within a single queue. Check your cluster's documentation for details.
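The formula can be worked through for a few core counts. In this sketch the function name and the sample values of 6 and 14 cores per socket are hypothetical; the defaults of 1 board, 2 sockets, and 2 hardware threads per core follow the typical configuration described above.

```shell
#!/bin/bash
# CPUs per node as Slurm counts them:
# boards * sockets * cores per socket * hardware threads per core.
# $1 = cores per socket; $2, $3, $4 optionally override boards, sockets, threads.
cpus_per_node() {
    cores_per_socket=$1
    boards=${2:-1}
    sockets=${3:-2}
    threads=${4:-2}
    echo $(( boards * sockets * cores_per_socket * threads ))
}

cpus_per_node 6     # prints 24 (dual-socket node, 6 cores per socket)
cpus_per_node 14    # prints 56 (dual-socket node, 14 cores per socket)
```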

How many CPUs do you get when you specify -n 1 -c 2? For 1 task, requesting 2 CPUs per task vs. 1 (the default) makes no difference to Slurm, because either way it is going to schedule your job on 2 CPUs = 2 hardware threads = 1 physical core. This is Slurm's way of ensuring that you are getting the equivalent of 1 dedicated physical core on the node.

When Do I Need More Than 2 CPUs (1 Core)?

Maybe you have a parallel application that appears to be a single task or process, but it makes use of multiple cores through some internal mechanism. Examples are:

  1. Your application is multithreaded with OpenMP to use multiple cores on the same node
  2. Your MATLAB or Simulink application is parallelized with the Parallel Computing Toolbox (PCT) to use multiple cores

In such cases, along with -n 1, set -c to be twice the number of OpenMP threads or MATLAB workers that will be launched by your application. Then, prior to running your OpenMP application (for example), set the OMP_NUM_THREADS variable as follows:

export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK/2))
./my_omp_app

If, however, you intend for your application to use all the cores on a single node, don't try to set -c. Instead, set the --exclusive option along with -n 1, so that your job has an entire node to itself. Then, in order to fill up the node, you would have your application launch threads or workers like this:

export OMP_NUM_THREADS=$((SLURM_CPUS_ON_NODE/2))
./my_omp_app

If you think your OpenMP or MATLAB PCT application can take advantage of hyperthreading (i.e., 2 or more software threads/core), then the --exclusive option is again the right choice. In this case, for OpenMP applications, you wouldn't divide SLURM_CPUS_ON_NODE by 2. For MATLAB PCT applications, there is a NumThreads as well as a NumWorkers setting that you might try in experiments with hyperthreading.
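The two cases above (an explicit -c, or a whole node via --exclusive) can be folded into one helper. This is a sketch under the assumption that SLURM_CPUS_PER_TASK is set only when -c is given and that each physical core counts as 2 Slurm CPUs; the function name omp_threads is hypothetical, and the final fallback of one core applies only when the script runs outside Slurm.

```shell
#!/bin/bash
# One OpenMP thread per physical core (2 Slurm CPUs per core assumed).
# Prefer the -c value, then the whole node, then a single core.
omp_threads() {
    cpus=${SLURM_CPUS_PER_TASK:-${SLURM_CPUS_ON_NODE:-2}}
    threads=$(( cpus / 2 ))
    # Never return fewer than one thread
    [ "$threads" -ge 1 ] || threads=1
    echo "$threads"
}

export OMP_NUM_THREADS=$(omp_threads)
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

To experiment with hyperthreading instead, the division by 2 would be dropped, as described above.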

When Do I Need More Tasks and Nodes?

Certain kinds of parallel applications require more careful consideration:

  1. Your job script launches multiple processes (using srun, e.g.) and exits when all are done
  2. Your application is parallelized with MPI to run on multiple cores, perhaps across multiple nodes

In these situations, you should assign -n or --ntasks-per-node (along with other options such as -N) to make sure your job has the resources it will need to run successfully. Please refer to the Parallel Applications section for guidance.

Example Command-line Job Submission

All of the key options can be specified on the command-line with sbatch. For example, say you had the following script "simple_cmd.sh" to run:

#!/bin/bash
#Ensures that the node can sleep

#print date and hostname
date
hostname
#verify that sleep 5 works
time sleep 5

In order to run this on the command-line, you could issue (where short is an available queue on the system):

$  sbatch -p short -t 00:01:00 -n 1 simple_cmd.sh

After you submit the job, see if you can quickly detect it in the queue with squeue. When it completes, you will find the output and error files in the same directory from which you submitted the job, with the names "slurm-%j.out" and "slurm-%j.err", where %j is the Slurm job number.

To save yourself from typing the same options repeatedly, there is an easier way to submit options to sbatch, as demonstrated in Job Scripts.

Additional Arguments

| Meaning | Flag | Value | Example | Default |
|---|---|---|---|---|
| Exclusive node access (all CPUs) | --exclusive | (none) | --exclusive | (none) |
| Name of job | -J | any string that can be part of a filename | -J SimpleJob | filename of job script |
| Stdout | -o | filename (with or without path) | -o $HOME/project1/%j.out | slurm-%j.out |
| Stderr | -e | filename (with or without path) | -e $HOME/project1/%j.err | slurm-%j.err |
| Job dependency | -d | type:job_id | -d afterok:1234 | (none) |
| Email address | --mail-user | email@domain | --mail-user=genius@gmail.com | (none) |
| Email notification type | --mail-type | BEGIN, END, FAIL, REQUEUE, or ALL | --mail-type=ALL | (none) |
| Environment variable(s) to pass | --export | ALL,varname[=varvalue],... | --export=ALL,LOC=/tmp/x | ALL |

If you specify --export, make sure your list includes ALL (and excludes NONE); otherwise, you may get unexpected results.
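Several of these options are shown together in the sketch below, written as directives at the top of a job script (the #SBATCH prefix is explained in the Job Scripts section). The job name, file names, email address, and the LOC variable are placeholders rather than values taken from any particular cluster.

```shell
#!/bin/bash
#SBATCH -J SimpleJob
#SBATCH -o SimpleJob-%j.out
#SBATCH -e SimpleJob-%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com
#SBATCH --export=ALL,LOC=/tmp/x

# LOC arrives through --export; fall back to the same value when run outside Slurm
LOC=${LOC:-/tmp/x}
echo "job ${SLURM_JOB_NAME:-SimpleJob} using scratch location $LOC"
```

A follow-up job can be made to wait for this one by submitting it with --dependency=afterok: followed by the first job's id.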

Job Scripts

Note: for more specific examples on how to write job scripts and use queues on your particular system, see the documentation for the Private Cluster you are working on.

Options to sbatch can be put into the batch script itself. This makes it easy to copy and paste to new scripts, as well as be confident that a job is submitted with the same arguments over and over again.

All that is required is to place the command line options in the batch script and prepend them with #SBATCH. They appear as comments to the shell, but Slurm parses them for you and applies them to your job.

Simple Job Script

Here is an example, 'batch.sh', which also illustrates some of the useful environment variables that Slurm sets for you in your batch environment:

#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:01:00
#SBATCH -n 1
echo "starting at `date` on `hostname`"

# Print properties of job as submitted
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"

# Print properties of job as scheduled by Slurm
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_TASKS_PER_NODE = $SLURM_TASKS_PER_NODE"
echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"
echo "SLURM_CPUS_ON_NODE = $SLURM_CPUS_ON_NODE"

Submit the above script with sbatch and examine the output in 'slurm-NNNNN.out' (where 'NNNNN' is the job number) to see the values of the Slurm environment variables.

Parallel Applications

As mentioned in When Do I Need More Tasks and Nodes?, for applications that consist of multiple parallel tasks such as MPI codes, careful consideration must be given to the options that you specify to Slurm. In this section we take a closer look at -n, -c, and -N, as well as other options that can help you to get the right resources for your job, including:

  • --ntasks-per-node
  • --exclusive

Key Options for Parallel Applications

First, let's expand our table of key options to cover cases where -n and/or -N are permitted to have values greater than 1.

| Meaning | Flag | Allowed Value | Example | Default |
|---|---|---|---|---|
| Submission queue | -p | Queue/partition name (valid names are cluster dependent) | -p normal | (cluster dependent) |
| Job walltime | -t | hh:mm:ss (not to exceed time limit of queue) | -t 00:05:00 | time limit of queue |
| Number of tasks | -n | This table assumes -n 2 or greater (or --ntasks-per-node) is present | -n 4 | (when absent, Slurm must compute a value based on other job options) |
| CPUs per task | -c | 1 ... number of CPUs on one node (job fails if no node has enough CPUs) | -c 2 | 1, but CPU total may be +1 on some nodes (CPU total on each node must be even) |
| Number of nodes | -N | 1 ... NP = number of nodes in partition (if N > NP, job is queued but never runs) | -N 2 | enough to satisfy -n and -c |

The maximum number of tasks that is feasible for a job depends on the specified CPUs per task and the hardware that is available in the chosen queue. In the absence of any other information, Slurm allocates the minimum number of nodes that can accommodate -n tasks with each task occupying -c CPUs. (Be careful that this number of nodes does not exceed the total number in the partition!)
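The minimum node count can be estimated ahead of time. The sketch below assumes identical, otherwise empty nodes and that a single task never spans nodes; the function name min_nodes and the 24-CPU node size are hypothetical.

```shell
#!/bin/bash
# Minimum number of nodes for n tasks of c CPUs each.
# $1 = tasks (-n), $2 = CPUs per task (-c), $3 = CPUs per node
min_nodes() {
    n=$1; c=$2; cpn=$3
    per_node=$(( cpn / c ))                     # whole tasks that fit on one node
    if [ "$per_node" -lt 1 ]; then
        echo "no node has $c CPUs" >&2
        return 1
    fi
    echo $(( (n + per_node - 1) / per_node ))   # ceiling division
}

min_nodes 8 2 24     # prints 1: all 8 tasks fit in 16 of the 24 CPUs
min_nodes 32 2 24    # prints 3: at most 12 tasks fit per node
```

On a shared cluster the actual allocation may use more nodes than this estimate, because some CPUs may already be assigned to other jobs.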

For jobs consisting of multiple independent tasks, the -n option is the primary way to obtain the right amount of resource. Again, Slurm must assign you the right number of nodes and CPUs for the requested -n and -c, where the latter is 1 by default. The assigned CPUs and nodes will depend on resource availability and the values of other Slurm options.

The best value for CPUs per task is often -c 2, so that each task is granted dedicated access to a full physical core. Otherwise, by default, 2 tasks will run on 2 hardware threads (or Slurm CPUs) within the same physical core, and there will be contention for the resources of that core (cycles, registers, caches, etc.). If tasks are frequently stalled due to I/O limitations, then perhaps the default (1) might be a valid choice. Higher values than -c 2 might be appropriate for parallel tasks that are also multithreaded.

If -N is specified along with -n, Slurm will allocate the exact number of nodes specified by -N, then distribute the total number of tasks specified by -n among the nodes. For example, -n 8 -c 2 -N 2 specifies 8 tasks to be launched, each taking 2 CPUs (1 core), distributed among the 2 allocated nodes.

Simple Job Script for Parallel Tasks

With -n 8 -c 2 -N 2, are the tasks (and CPUs) distributed evenly among the nodes? Let's find out. The example batch script below, 'srun_batch.sh', also illustrates how srun can be used to launch multiple processes within an sbatch job.

#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:01:00
#SBATCH -n 8
#SBATCH -c 2
#SBATCH -N 2
#SBATCH -J srun_batch
#SBATCH -o sbatch-%j.out
#SBATCH -e sbatch-%j.err
echo "starting at `date` on `hostname`"

# Print properties of job as submitted
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"

# Print properties of job as scheduled by Slurm
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_TASKS_PER_NODE = $SLURM_TASKS_PER_NODE"
echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"
echo "SLURM_CPUS_ON_NODE = $SLURM_CPUS_ON_NODE"

srun srun_hello.sh

echo "ended at `date` on `hostname`"
exit 0

The commands in the above script will run on just the first node allocated to your job, and not in parallel. However, the srun command inside it will launch -n copies of the script 'srun_hello.sh' on the full set of CPUs allocated to your job. The 'srun_hello.sh' script, shown below, reports the Slurm variables that are defined in the environment of each process started by srun. Please create both of these scripts on your cluster, submit the first script to sbatch, then watch for the output in 'sbatch-NNNNN.out'. Here is 'srun_hello.sh':

#!/bin/bash
echo "Hello from node $SLURM_NODEID (`hostname`)," \
"I am rank $SLURM_PROCID of $SLURM_NTASKS," \
"local rank is $SLURM_LOCALID, $SLURM_CPUS_ON_NODE CPUs here"

You will find that in most runs the tasks are not distributed evenly among the nodes.
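The distribution can also be checked from the SLURM_TASKS_PER_NODE value that the batch script prints. Slurm writes it in a compact form such as "5,3" or "2(x3),1", where "(x3)" means the count repeats for three consecutive nodes. The helper names below are hypothetical, and the "5,3" fallback is used only when the snippet runs outside a job.

```shell
#!/bin/bash
# Expand the compact form, e.g. "2(x3),1" becomes "2 2 2 1".
expand_tasks_per_node() {
    result=""
    old_ifs=$IFS; IFS=,
    for item in $1; do
        case $item in
            *"(x"*")")
                count=${item%%"("*}       # tasks on each of these nodes
                reps=${item#*"(x"}        # text after "(x"
                reps=${reps%")"}          # number of nodes sharing that count
                ;;
            *)
                count=$item; reps=1
                ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do
            result="$result${result:+ }$count"
            i=$(( i + 1 ))
        done
    done
    IFS=$old_ifs
    echo "$result"
}

# Succeed only when every node received the same number of tasks.
is_even_distribution() {
    first=""
    for t in $(expand_tasks_per_node "$1"); do
        [ -z "$first" ] && first=$t
        [ "$t" = "$first" ] || return 1
    done
    return 0
}

tasks=${SLURM_TASKS_PER_NODE:-5,3}
echo "tasks per node: $(expand_tasks_per_node "$tasks")"
if is_even_distribution "$tasks"; then echo "even"; else echo "uneven"; fi
```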

Try running the script 'srun_batch.sh' again, but this time add the --exclusive option. Do this either by editing the script, or by adding it to the command line immediately after sbatch. This option prevents other users from gaining access to the nodes assigned to your job; therefore, your job is allocated all the CPUs and memory on those nodes. Note that the parallel tasks are now evenly distributed, and a greater number of CPUs (and accompanying memory) is now available on each node.

MPI and Hybrid MPI/OpenMP Applications

For MPI applications, it is usually desirable to distribute tasks evenly among the nodes. One option is to use --ntasks-per-node instead of -n to fix the number of MPI tasks (processes) running on each node. Thus, in the above example with -N 2, you would specify --ntasks-per-node=4 instead of -n 8 , because 8 tasks ÷ 2 nodes = 4 tasks per node.
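The same division can be done in a small helper that refuses task counts that do not split evenly. This is only a sketch; the function name tasks_per_node and the script name my_mpi_job.sh are hypothetical.

```shell
#!/bin/bash
# Derive a --ntasks-per-node value from a total task count and a node count.
# $1 = total tasks, $2 = nodes
tasks_per_node() {
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "$1 tasks cannot be split evenly over $2 nodes" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Print the resulting submission command for 8 tasks on 2 nodes
echo "sbatch -N 2 --ntasks-per-node=$(tasks_per_node 8 2) my_mpi_job.sh"
```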

Another good option for many MPI codes is the --exclusive flag, which prevents other users from sharing your nodes with you. A "side effect" of this option is that it causes your MPI tasks to be distributed evenly among the nodes. Thus, you can just specify -n and not worry about having to do the math. But even if you prefer to specify --ntasks-per-node, using --exclusive is probably a good idea anyway, particularly if your tasks make heavy use of resources other than CPUs. For example, by including --exclusive among your other options, you guarantee that the network bandwidth to and from your nodes will not be shared with other users' processes. Also, when you give a few MPI processes exclusive use of a node, each one can consume a greater percentage of the node's memory.

Hybrid MPI/OpenMP codes allow the MPI processes to fork multiple OpenMP threads, opening up the possibility of using all the CPUs on a node in a different way. For example, let's say that your batch nodes have 12 physical cores each; to keep all of the cores busy, you might choose to run 3 MPI tasks per node, with each task spawning 4 OpenMP threads. You would then specify the following in your batch script:

...
#SBATCH --ntasks-per-node=3
#SBATCH --exclusive
...
export OMP_NUM_THREADS=4
mpirun ./my_hybrid_code

The environment variables on the primary node, including OMP_NUM_THREADS, should all propagate to the MPI tasks on all the nodes; this means that the 3 MPI tasks on each node will start 4 OpenMP threads apiece. (Note that when OMP_NUM_THREADS is undefined, the default behavior of OpenMP is generally to spawn as many threads as there are CPUs. Thus, to avoid any nasty surprises, you'll want to define the number of threads per MPI task.)
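Rather than hard-coding the thread count, it can be derived inside the batch script from the variables Slurm provides. The sketch below assumes 2 Slurm CPUs per physical core and a job submitted with --ntasks-per-node and --exclusive; the function name hybrid_threads is hypothetical, and the fallback values of 24 CPUs and 3 tasks mirror the 12-core example and apply only outside Slurm.

```shell
#!/bin/bash
# Threads per MPI task such that tasks * threads = physical cores on the node.
# $1 = Slurm CPUs on the node, $2 = MPI tasks per node
hybrid_threads() {
    cpus_on_node=$1; tasks_on_node=$2
    cores=$(( cpus_on_node / 2 ))     # 2 hardware threads per physical core
    echo $(( cores / tasks_on_node ))
}

export OMP_NUM_THREADS=$(hybrid_threads "${SLURM_CPUS_ON_NODE:-24}" "${SLURM_NTASKS_PER_NODE:-3}")
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

If the core count is not a multiple of the task count, the integer division rounds down and leaves some cores idle, so it is worth choosing a task count that divides the cores evenly.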

MATLAB PCT Applications

As mentioned previously, MATLAB Parallel Computing Toolbox (PCT) is a special case. The best plan may be to specify -n 1 --exclusive when you submit the batch job. Then, have your MATLAB script fill up the node with MATLAB parallel workers, so that one worker runs on each physical core (SLURM_CPUS_ON_NODE/2, since the job has the whole node and -c is not set). Hyperthreading can also be exploited (if desired) by setting NumThreads=2 in the "local" cluster profile in the MATLAB client.

References